Using a multithreaded crawler in Python

1. getip method: fetches free proxy IPs from a proxy-list site (the URL is left blank in the code) and saves them to a CSV file.

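2. getproxy method: reads the IP and port from the file, assembles each pair into a proxy address, and stores the results in a collection.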

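3. test method: sends a request through one proxy with a 2-second timeout and returns the proxy if the request succeeds.

4. Main block: times the proxy check twice, once sequentially (printed as "singleprocess needs") and once with a multiprocessing pool (printed as "multiprocess needs"). Note that Pool runs worker processes rather than threads. This machine has 4 cores and Pool() starts one worker per core by default, so the pooled run takes roughly a quarter of the time of the sequential run.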
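The assembly in step 2 can be sketched in isolation. This is a minimal illustration, assuming each CSV row holds an IP in the first column and a port in the second; the function name build_proxies and the sample addresses are hypothetical, and the dictionary shape is the one requests accepts for its proxies argument.

```python
import csv
import io


def build_proxies(csv_text):
    """Turn 'ip,port' rows into requests-style proxy mappings."""
    proxies = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or incomplete rows
        ip, port = row[0].strip(), row[1].strip()
        proxies.append({'http': f'http://{ip}:{port}'})
    return proxies


if __name__ == '__main__':
    # Hypothetical sample data, including one blank row
    sample = "10.0.0.1,8080\n\n10.0.0.2,3128\n"
    print(build_proxies(sample))
```

Skipping short rows with an explicit length check plays the same role as the try/except around the row access in getproxy.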

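import csv
import time
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup


def getip(numpage):
    """Scrape numpage pages of the proxy list and write the rows to ips.csv."""
    time.sleep(10)
    url = ''  # proxy-list URL, blank in the source; the page number is appended
    user_agent = 'ip'  # as given in the source; a real browser User-Agent string belongs here
    headers = {'User-Agent': user_agent}
    # newline='' keeps the csv module from writing blank rows on Windows
    with open('ips.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, dialect='excel')
        for i in range(1, numpage + 1):
            real_url = url + str(i)
            response = requests.get(real_url, headers=headers)
            content = response.text
            bs = BeautifulSoup(content, 'lxml')
            trs = bs.find_all('tr')
            for items in trs:
                tds = items.find_all('td')
                try:
                    # The cell indexes are missing in the source; 1 and 2 are
                    # assumed positions of the IP and port columns.
                    temp = [tds[1].text, tds[2].text]
                    # print(temp)
                    writer.writerow(temp)
                except IndexError:
                    pass  # header rows contain no <td> cells


def getproxy():
    """Read ips.csv and build one proxy mapping per row."""
    proxy = []
    with open('ips.csv', 'r', newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            try:
                # The assembled value is missing in the source; this is the
                # mapping that requests accepts for its proxies argument.
                proxy.append({'http': 'http://' + row[0] + ':' + row[1]})
            except IndexError:
                continue  # skip blank or incomplete rows
    return proxy


def test(proxy):
    """Return proxy if a request through it succeeds within 2 seconds, else None."""
    try:
        # The test URL is blank in the source; fill it in before running.
        response = requests.get('', proxies=proxy, timeout=2)
        if response:
            return proxy
    except Exception:
        pass


if __name__ == '__main__':
    # Called under the guard so that worker processes do not repeat the scrape
    getip(1)
    proxy = getproxy()

    # Sequential run (loop body missing in the source, reconstructed)
    ippool1 = []
    time1 = time.time()
    for item in proxy:
        ippool1.append(test(item))
    time2 = time.time()
    print('singleprocess needs ' + str(time2 - time1) + ' s')

    # Pooled run; Pool() starts one worker process per CPU core by default
    pool = Pool()
    ippool2 = []
    temp = []
    time3 = time.time()
    for item in proxy:
        temp.append(pool.apply_async(test, args=(item,)))
    pool.close()
    pool.join()
    for item in temp:
        ippool2.append(item.get())
    time4 = time.time()
    print('multiprocess needs ' + str(time4 - time3) + ' s')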
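The listing above uses worker processes from multiprocessing. Since checking a proxy mostly means waiting on the network, a thread pool is a plausible alternative that matches the title more literally. The following is only a sketch under stated assumptions: fake_check is a hypothetical stand-in for test() that sleeps instead of making a request and accepts proxies with an even port number, and the helper names and addresses are invented for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_check(proxy, delay=0.05):
    """Stand-in for test(): wait like a slow request, accept even-numbered ports."""
    time.sleep(delay)
    port = int(proxy['http'].rsplit(':', 1)[1])
    return proxy if port % 2 == 0 else None


def check_sequential(proxies):
    """Check the proxies one after another."""
    return [fake_check(p) for p in proxies]


def check_threaded(proxies, workers=4):
    """Check the proxies with a pool of threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fake_check, proxies))


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


if __name__ == '__main__':
    # Hypothetical proxy list
    proxies = [{'http': f'http://10.0.0.{i}:{8000 + i}'} for i in range(8)]
    seq, t_seq = timed(check_sequential, proxies)
    par, t_par = timed(check_threaded, proxies)
    print(f'sequential: {t_seq:.2f} s, threaded: {t_par:.2f} s')
    print('working proxies:', [p for p in par if p is not None])
```

As in the original listing, rejected proxies come back as None, so the result list needs a final filter before it is used.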

5. Results:
