Multithreaded Crawlers

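I had already run into concurrency in Scrapy, through the yield statements in a spider, and I took each yield to mean "start a new thread". That came from working inside a high-level framework: I only knew that something ran concurrently, not how it was actually executed. (Scrapy in fact schedules yielded requests asynchronously on top of Twisted rather than starting one thread per yield.)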

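Python's standard library has a module for writing threads yourself:

import threading

Once it is imported, you can write your own multithreaded code.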
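As a rough illustration of what "writing your own threads" looks like, here is a minimal sketch. The fetch function and its fake page content are placeholders invented for this example, not part of the crawler below. Each thread writes only to its own slot of the results list, so no shared data is modified by two threads at once.

```python
import threading

def fetch(page, results):
    # Stand-in for real work such as downloading a page (hypothetical).
    results[page] = "content of page %d" % page

results = [None] * 3  # one slot per thread, so threads never touch the same item
threads = [threading.Thread(target=fetch, args=(page, results)) for page in range(3)]

for t in threads:
    t.start()   # begin running fetch() in a new thread
for t in threads:
    t.join()    # block until that thread has finished

print(results)
```

start() is what actually launches the thread; join() makes the main thread wait for it before reading the results.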

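When writing multithreaded code you have to watch data consistency. Take two threads that share one list.

Threads one and two start working on the list at the same time. Thread one deletes an element, but thread two does not know that and keeps operating on the same element, so everything thread two does with it is unnecessary, wasted work.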

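This is why multithreading has locks. A thread takes the lock before it modifies shared data. While it holds the lock, only that thread can work on the data; every other thread that wants the same lock has to wait until it is released before it can access the data.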

lock = threading.Lock()  # create a lock object before using it

lock.acquire()  # take the lock
lock.release()  # release the lock
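To make the locking idea concrete, here is a small sketch in which four threads drain one shared list. The item names and the upper-casing "work" are placeholders made up for the example. The with statement is shorthand for acquire() followed by release(), and it releases the lock even if the body raises.

```python
import threading

pending = ["url-%d" % i for i in range(100)]  # placeholder work items (hypothetical)
done = []
lock = threading.Lock()

def worker():
    while True:
        with lock:              # same as lock.acquire() ... lock.release()
            if not pending:     # the emptiness check and the pop share one lock,
                return          # so no thread pops from an already empty list
            item = pending.pop()
        processed = item.upper()  # the slow part runs outside the lock
        with lock:
            done.append(processed)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(done))  # 100: every item was handled exactly once
```

Keeping the lock only around the shared-list operations, and not around the work itself, lets the threads overlap where it is safe to do so.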

You can also use threads by defining a class:

class CrawlThread(threading.Thread):
    # inherit from the base thread class
    def __init__(self, name, q):
        threading.Thread.__init__(self)  # call the parent constructor first
        self.name = name
        self.q = q

    def run(self):
        # as the name says, this is what runs once the thread is started
        lock.acquire()
        try:
            url = self.q.pop_url()
            bas = basic(url, self.q)
            bas.get_urls()
        finally:
            lock.release()
        bas.get_body()
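One detail worth checking with a thread subclass is the difference between calling run() and calling start(). The sketch below uses a hypothetical Worker class that only records the name of the thread that executed it; it is not part of the crawler.

```python
import threading

class Worker(threading.Thread):  # hypothetical class for illustration
    def __init__(self, name):
        threading.Thread.__init__(self)
        self.name = name
        self.ran_in = None

    def run(self):
        # Record which thread actually executed this method.
        self.ran_in = threading.current_thread().name

direct = Worker("direct")
direct.run()        # plain method call: executes in the calling thread

started = Worker("started")
started.start()     # creates a new thread, which then calls run()
started.join()

print(direct.ran_in)   # the name of the calling thread, not "direct"
print(started.ran_in)  # "started"
```

Only start() gives you a second thread. Calling run() directly executes the same code, but sequentially in the caller.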

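# coding=utf-8
import codecs
import json
import re
import threading

import requests
from bs4 import BeautifulSoup

lock = threading.Lock()
header = {'User-Agent': 'Mozilla/5.0'}  # assumption: header is not defined in the original

wait_url = []  # URLs waiting to be crawled
com_url = []   # URLs already crawled


class basic():
    def __init__(self, url, q):
        self.url = url
        self.q = q

    def requests_get(self):
        r = requests.get(self.url, headers=header)
        return r

    def get_body(self):
        # save the page text as one JSON line
        try:
            page = self.requests_get().text
            soup = BeautifulSoup(page, 'html.parser')
            body = soup.get_text(strip=True)
            # assumption: the original dict literal is missing; store the URL and the text
            record = {'url': self.url, 'body': body}
            f = codecs.open('mg1_1.json', 'a+', 'utf-8')
            f_json = json.dumps(record, ensure_ascii=False)
            f.write(f_json + '\n')
            f.close()
        except Exception:
            pass

    def get_urls(self):
        # collect every link ending in html and queue it
        try:
            print('asdafdfasfsd')  # debug marker
            html = self.requests_get().text
            soup = BeautifulSoup(html, 'html.parser')
            urls = soup.find_all('a', href=re.compile('.*html'))
            for url in urls:
                eurl = '' + url['href']  # the site prefix is empty in the original
                self.q.new_url(eurl)
        except Exception:
            pass


class que():
    def __init__(self, url):
        self.url = url

    def new_url(self, url):
        if url in wait_url:
            return
        elif url in com_url:
            return
        else:
            wait_url.append(url)  # assumption: the else body is missing in the original

    def pop_url(self):
        url = wait_url[0]
        del wait_url[0]
        com_url.append(url)  # assumption: the original never fills com_url
        return url


class CrawlThread(threading.Thread):
    def __init__(self, name, q):
        threading.Thread.__init__(self)
        self.name = name
        self.q = q

    def run(self):
        lock.acquire()
        try:
            url = self.q.pop_url()
            bas = basic(url, self.q)
            bas.get_urls()
        finally:
            lock.release()
        bas.get_body()


def begain(url):
    q = que(url)
    bas = basic(url, q)
    bas.get_urls()
    bas.get_body()
    while wait_url:
        # start() runs each thread concurrently; run() would stay in the main thread
        # assumption: batches of at most 4 threads, never more than the waiting URLs
        threads = [CrawlThread('thread%d' % i, q) for i in range(min(4, len(wait_url)))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


begain('/')  # start URL; the site address is missing in the original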
