Python Crawler Example: Scraping a Novel with Multiple Threads

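I previously wrote a blog post about scraping a novel, but the single-threaded crawler was too slow. Scraping one novel took more than 700 seconds, and a rate of about two chapters per second was hard to accept.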

So I built a multithreaded crawler.

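The approach is different this time. Before, I crawled one chapter at a time and wrote each chapter to the file as soon as it was fetched. This time I added a dictionary that holds the content of every crawled chapter, and only after all threads have finished is everything written to the file.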

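I use a dictionary because the chapters have to be sorted after crawling: threads finish in no particular order, and sorting a dictionary by its keys (the chapter indices) is convenient.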
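
The idea of collecting results in a dictionary and sorting afterwards can be sketched as follows. This is a minimal illustration, not the crawler itself: `fake_fetch` and `store_chapter` are hypothetical names, and no network request is made.

```python
import threading


def fake_fetch(index):
    """Hypothetical stand-in for downloading chapter `index` (no network access)."""
    return f"text of chapter {index + 1}"


def store_chapter(index, results):
    # Each thread writes to its own key, so threads never overwrite each other.
    results[index] = fake_fetch(index)


results = {}  # chapter index -> chapter text

# Start the threads in reverse order so the dictionary is not filled in chapter order.
threads = [
    threading.Thread(target=store_chapter, args=(i, results))
    for i in reversed(range(5))
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread before reading the results

# Sorting the keys restores chapter order regardless of completion order.
ordered = [results[i] for i in sorted(results)]
print(ordered)
```

Because the keys are the chapter indices, `sorted(results)` is enough to put the chapters back in reading order before they are written out.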

To make the comparison fair, I chose the same novel as in the earlier post. If you have not read it, see that post:

Python Crawler Example: Novel Scraper

Here is the freshly written code:

import codecs
import threading
import time

import requests
from bs4 import BeautifulSoup

begin = time.perf_counter()  # time.clock() was removed in Python 3.8


# Thread class
class mytread(threading.Thread):
    def __init__(self, threadid, name, st):
        threading.Thread.__init__(self)
        self.threadid = threadid
        self.name = name
        self.st = st  # index of the first chapter this thread handles

    def run(self):
        print('start', str(self.name))
        threadget(self.st)
        print('end', str(self.name))


txtcontent = {}  # all chapter content, keyed by chapter index
novellist = {}   # novel list: title -> URL

# NOTE: getnovels(html), which fills novellist, is called below,
# but its definition is missing from the source.


# Fetch the HTML source of a page
def getpage(url):
    headers = {}  # the header values are omitted in the source
    page = requests.get(url, headers=headers).content.decode('utf-8')
    return page


chaptername = []     # chapter titles
chapteraddress = []  # chapter URLs


# Collect every chapter title and URL of the novel
def getchapter(html):
    soup = BeautifulSoup(html, 'lxml')
    try:
        alist = soup.find('div', id='list').find_all('a')
        for link in alist:
            chaptername.append(link.string)
            href = '' + link['href']  # the site prefix is empty in the source
            chapteraddress.append(href)
        return True
    except Exception:
        print('No chapters found')
        return False


# Extract the text of one chapter
def getdetail(html):
    soup = BeautifulSoup(html, 'lxml')
    try:
        content = ' '
        pstring = soup.find('div', id='content').find_all('p')
        for p in pstring:
            # p.string is None when a <p> contains nested tags, so use get_text()
            content += p.get_text()
            content += '\n '
        return content
    except Exception:
        print('Error')
        return 'Error'


# Work done by each thread: chapters st, st + thread_count, st + 2 * thread_count, ...
def threadget(st):
    total = len(chaptername)
    while st < total:
        url = str(chapteraddress[st])
        html = getpage(url)
        content = getdetail(html)
        txtcontent[st] = content
        print(chaptername[st])
        st += thread_count


url = '/xiaoshuodaquan/'  # novel index URL (the site address is omitted in the source)
html = getpage(url)
getnovels(html)  # load the list of novels
name = input()
if name in novellist:
    print()
    url = str(novellist[name])
    html = getpage(url)
    getchapter(html)
    thread_list = []
    thread_count = int(input('Enter the number of threads to start: '))
    for i in range(thread_count):
        thread1 = mytread(i, str(i), i)
        thread_list.append(thread1)  # without this, no thread would ever start
    for t in thread_list:
        t.daemon = False  # setDaemon() is deprecated
        t.start()
    for t in thread_list:
        t.join()
    print('\nAll worker threads have finished')
    txtcontent1 = sorted(txtcontent)  # chapter indices in ascending order
    # Local path where the novel is saved
    file = codecs.open('c:/users/lenovo/desktop/novellist/' + name + '.txt', 'w', 'utf-8')
    chaptercount = len(chaptername)
    # Write everything to the file
    for ch in range(chaptercount):
        title = '\n Chapter ' + str(ch + 1) + ' ' + str(chaptername[ch]) + ' \n\n'
        content = str(txtcontent[txtcontent1[ch]])
        file.write(title + content)
    file.close()
    end = time.perf_counter()
    print(end - begin, 'seconds')
else:
    print('Novel not found')
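
In `threadget`, each thread starts at its own index `st` and then advances by `thread_count`, so the chapters are dealt out to the threads like cards. The sketch below only illustrates that partitioning. The helper name `indices_for_thread` and the numbers used are assumptions for the example, not part of the crawler.

```python
def indices_for_thread(start, thread_count, total):
    """Chapter indices visited by the thread whose first chapter is `start`."""
    return list(range(start, total, thread_count))


total = 10        # hypothetical number of chapters
thread_count = 3  # hypothetical number of threads

# Map each thread's starting index to the chapters it would fetch.
plan = {
    start: indices_for_thread(start, thread_count, total)
    for start in range(thread_count)
}
for start, indices in plan.items():
    print(start, indices)
```

Every chapter index appears in exactly one thread's list, which is why the threads can write into the shared dictionary without stepping on each other's keys.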

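I ran a test with 100 threads, and it was much faster than the single-threaded version.

A single-threaded run during the same time period took more than 1,200 seconds, while the run with 100 threads was more than 20 times faster.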
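
A comparison of this kind can be reproduced without touching any website. The sketch below is a simulation under stated assumptions: `fake_download` replaces the real request with a short `time.sleep`, and the task count, thread count, and delay are made-up values, so the timings it prints say nothing about the real crawler.

```python
import threading
import time


def fake_download(index, results, delay=0.02):
    time.sleep(delay)  # simulated network wait (assumed value)
    results[index] = index


def run_sequential(total):
    results = {}
    begin = time.perf_counter()
    for i in range(total):
        fake_download(i, results)
    return results, time.perf_counter() - begin


def run_threaded(total, thread_count):
    results = {}

    def worker(start):
        # Same striding scheme as threadget: start, start + thread_count, ...
        for i in range(start, total, thread_count):
            fake_download(i, results)

    begin = time.perf_counter()
    threads = [threading.Thread(target=worker, args=(s,)) for s in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - begin


seq_results, seq_seconds = run_sequential(20)
thr_results, thr_seconds = run_threaded(20, 10)
print(f"sequential: {seq_seconds:.2f} s, threaded: {thr_seconds:.2f} s")
```

Threads help here because the work is dominated by waiting. With real requests the gain also depends on the network and on how the server responds to many simultaneous connections.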
