Web Scraper Series (4): Scraping a Novel with Multiple Threads

2021-10-03 21:29:20

Parsing the page to get each chapter's URL

This time we will scrape a novel called Yuan Zun (《元尊》).

url = '…'

Open the page and bring up the browser's developer tools.

With that, we have the URL of every chapter; a standalone sketch of this step is below.
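A minimal sketch of the URL-collection step. The index URL and the site's base URL are elided in the original post, so the empty strings below are placeholders, and the User-Agent header is an assumption; the XPath expression is the one used in the full script:

import requests
from lxml import etree

index_url = ''  # placeholder: the novel's index page URL is elided in the original post
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; the original headers dict is also elided

r = requests.get(index_url, headers=headers)
html = etree.HTML(r.text)
# every chapter link is an <a href> inside the <dd> entries of the chapter list
hrefs = html.xpath('/html/body/div[3]/div[3]/dl/dd/a/@href')
chapter_urls = ['' + href for href in hrefs]  # prepend the site's base URL (elided as well)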

Saving each chapter locally

Open any chapter, bring up the developer tools again, and you can easily locate the title and the body text.
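Here is a minimal sketch of that parsing step, using the selectors found with the developer tools: the title is the h1 inside a div with class "inner", and the body is the set of p tags inside a div with id "booktext". These are the same selectors the full script below relies on:

from bs4 import BeautifulSoup

def parse_chapter(html_text):
    """Return (title, body) for one chapter page."""
    soup = BeautifulSoup(html_text, 'lxml')
    # the chapter title is the <h1> inside <div class="inner">
    title = soup.find('div', class_='inner').h1.get_text()
    # the chapter body is the <p> tags inside <div id="booktext">
    paragraphs = soup.find('div', id='booktext').find_all('p')
    return title, '\n'.join(p.get_text() for p in paragraphs)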

With multi-threading added, we can scrape the novel at a much faster rate. The complete code is as follows:

# -*- coding: utf-8 -*-
# @ModuleName: novel
# @Function:
# @Author: shenfugui
# @Email: [email protected]
# @Time: 3/13/2020 3:36 PM

import requests
import os
import time
import threading
from lxml import etree
from queue import Queue
from bs4 import BeautifulSoup


# collect the URL of every chapter, then hand them to worker threads
def get_urls(headers, threads, q):
    url = ''  # the novel's index page URL is elided in the original post
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    urls = html.xpath('/html/body/div[3]/div[3]/dl/dd/a/@href')
    for url in urls:
        n_url = '' + url  # prepend the site's base URL (also elided in the original)
        q.put(n_url)
    for i in range(10):
        t = threading.Thread(target=download_novel, args=(headers, q))
        t.start()
        threads.append(t)  # keep track of the workers so we can join them later
    q.join()
    # one None sentinel per worker tells each thread to exit
    for i in range(10):
        q.put(None)
    for t in threads:
        t.join()
    print('finished')


def download_novel(headers, q):
    while True:
        # blocks until a URL can be taken from the queue
        url = q.get()
        if url is None:
            break
        try:
            r = requests.get(url, headers=headers, timeout=10)
            path = './novel/'
            if not os.path.exists(path):
                os.mkdir(path)
            soup = BeautifulSoup(r.text, 'lxml')
            title = soup.find('div', class_="inner").h1.get_text()
            contents = soup.find('div', id="booktext").find_all('p')
            n_path = path + title + '.txt'
            with open(n_path, 'a', encoding='utf-8') as f:
                for content in contents:
                    f.write(content.get_text())
            print('%s saved' % title)  # the original format string was garbled; message text assumed
        except requests.exceptions.ConnectionError:
            pass
        except requests.exceptions.Timeout:
            pass
        except requests.exceptions.ReadTimeout:
            pass
        q.task_done()


def main():
    start = time.time()
    q = Queue()
    threads = []
    headers = {}  # the request headers dict is elided in the original post
    get_urls(headers, threads, q)
    end = time.time()
    print('Total time: %s s' % (end - start))


if __name__ == '__main__':
    main()
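A note on the shutdown logic above: q.join() in get_urls blocks until every queued chapter URL has been matched by a q.task_done() call in a worker, which is why task_done() runs even when a request fails and the exception is swallowed. Only after that are the ten None sentinels queued, one per thread, so each worker's loop breaks cleanly before the final t.join() calls return.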
