Scraping meme images with XPath parsing and multithreading

Key points

The request headers must include Referer and User-Agent.

Parse the HTML string of the response with XPath: .//img selects every img tag below the current node, and @data-backup reads the value of that attribute.
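To see what the two XPath pieces do without sending a request, here is a small sketch. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs with no extra packages; the sample markup and the name extract_backup_urls are made up for illustration. With lxml, html.xpath('.//img/@data-backup') returns the attribute values directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a downloaded page.
SAMPLE = """<div>
  <a href="/1"><img data-backup="http://example.com/imgs/a.jpg" alt="a"/></a>
  <p><img data-backup="http://example.com/imgs/b.gif" alt="b"/></p>
  <img alt="no backup attribute"/>
</div>"""


def extract_backup_urls(markup):
    """Return the data-backup value of every img that has one."""
    root = ET.fromstring(markup)
    urls = []
    # './/img' walks all img descendants of the current node.
    for img in root.findall(".//img"):
        # .get() plays the role of '@data-backup' in the XPath expression.
        value = img.get("data-backup")
        if value is not None:
            urls.append(value)
    return urls


urls = extract_backup_urls(SAMPLE)
# The file name is whatever follows the last slash in the URL.
filenames = [u.split("/")[-1] for u in urls]
```

The third img has no data-backup attribute, so it is skipped, which matches how the attribute step of the XPath expression behaves.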

Overall

import requests
from lxml import etree

url = ""  # page URL (left blank in the original)
headers = {"Referer": "", "User-Agent": ""}  # values were left out of the original; supply your own

resp = requests.get(url, headers=headers)
# print(resp.text)

# Start parsing
html = etree.HTML(resp.text)
srcs = html.xpath('.//img/@data-backup')
for src in srcs:
    filename = src.split('/')[-1]
    # img is a response object, so it cannot be parsed as a string
    img = requests.get(src, headers=headers)
    with open('imgs/' + filename, 'wb') as file:
        file.write(img.content)  # content is the raw bytes
    print(src, filename)
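The script writes to 'imgs/' + filename, which fails if the imgs folder does not exist yet. A minimal sketch of a save helper that creates the folder first is shown here; save_image and its parameters are hypothetical names, and the bytes would come from img.content in the script.

```python
import os


def save_image(content, src, dirname="imgs"):
    """Write raw image bytes under dirname, named after the last URL segment."""
    # Create the target folder if needed; no error when it already exists.
    os.makedirs(dirname, exist_ok=True)
    filename = src.split("/")[-1]
    path = os.path.join(dirname, filename)
    # 'wb' because the payload is bytes, not text.
    with open(path, "wb") as f:
        f.write(content)
    return path
```

In the loop this would be called as save_image(img.content, src).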

Automatically getting the URL of the next page

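def get_page(url):
    ...  # the request and the XPath call that builds img_url_list are missing in the original
    # parse the URLs found on the current page
    for img_url in img_url_list:
        dowmload_img(img_url)
    # the link to the next page, if there is one
    next_link = html.xpath('.//a[@rel="next"]/@href')
    return next_link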
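With lxml, xpath() returns a list, so next_link is an empty list on the last page and a loop such as "while next_link:" stops there. The href may also be relative. The sketch below shows both ideas under stated assumptions: it uses xml.etree.ElementTree instead of lxml so it runs with only the standard library, and the markup, the example.com host, and the name find_next_url are invented for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical pagination markup.
PAGE = """<div>
  <a rel="prev" href="/article/list/?page=1">prev</a>
  <a rel="next" href="/article/list/?page=3">next</a>
</div>"""

LAST_PAGE = """<div>
  <a rel="prev" href="/article/list/?page=2">prev</a>
</div>"""


def find_next_url(markup, page_url):
    """Return the absolute URL of the rel="next" link, or None if absent."""
    root = ET.fromstring(markup)
    # Same idea as './/a[@rel="next"]/@href', split into two steps.
    link = root.find(".//a[@rel='next']")
    if link is None or not link.get("href"):
        return None
    # Resolve a relative href against the URL of the page it came from.
    return urljoin(page_url, link.get("href"))
```

Returning None on the last page gives the caller the same falsy signal that an empty list gives in the lxml version.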

Scraping every page by combining the function with a loop

next_link_base = 'article/list/?page='
current_num = 0
next_link = 'article/list/?page=1'
while next_link:
    current_num += 1
    next_link = get_page(next_link_base + str(current_num))
    if current_num > 3:
        break
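Note that this loop checks current_num > 3 only after calling get_page, so it fetches pages 1 to 4 before breaking. A sketch of the same loop with the limit checked before each request follows; crawl, max_pages, and the stub passed as get_page are hypothetical and only illustrate the control flow.

```python
def crawl(get_page, base, max_pages=3):
    """Visit base+1, base+2, ... until there is no next link or the limit is hit."""
    visited = []
    current_num = 0
    has_next = True
    # Check the limit before requesting, so exactly max_pages pages are fetched at most.
    while has_next and current_num < max_pages:
        current_num += 1
        url = base + str(current_num)
        visited.append(url)
        # get_page returns the next-page link(s); an empty result ends the loop.
        has_next = bool(get_page(url))
    return visited


# Stub standing in for the real get_page: pretends only page 1 has a next link.
def fake_get_page(url):
    return ["next"] if url.endswith("=1") else []


pages = crawl(fake_get_page, "article/list/?page=")
```

Passing get_page as an argument keeps the loop testable without any network access.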

Speeding up scraping with multithreaded concurrency

from concurrent import futures

# max_workers is the maximum number of worker threads
ex = futures.ThreadPoolExecutor(max_workers=40)
for img_url in img_url_list:
    # submit a task: the function first, then its arguments
    ex.submit(dowmload_img, img_url, dirname)
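submit() returns immediately with a Future, so errors raised inside a download are silent unless the result is read. One possible way to wait for all tasks and collect failures is sketched here; download_all and the stub download function are hypothetical, and max_workers=8 is an arbitrary choice rather than a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(download, img_urls, dirname, max_workers=8):
    """Run download(url, dirname) for every URL; map each URL to its result or exception."""
    results = {}
    # The with-block waits for all submitted tasks before exiting.
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        future_to_url = {ex.submit(download, u, dirname): u for u in img_urls}
        for fut in as_completed(future_to_url):
            url = future_to_url[fut]
            try:
                results[url] = fut.result()
            except Exception as exc:
                # Keep the failure instead of letting it pass unnoticed.
                results[url] = exc
    return results


# Stub standing in for the real download function; it makes no request.
def fake_download(url, dirname):
    if url.endswith("bad.jpg"):
        raise ValueError("download failed")
    return dirname + "/" + url.split("/")[-1]
```

Threads suit this job because each download spends most of its time waiting on the network.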
