Web Scraping: Downloading a Novel

Using 筆趣閣 as the example site, this post scrapes the novel 一念永恆.

The code is as follows:

from bs4 import BeautifulSoup
from urllib import request
import requests
import re
import sys

def down_this_chapter(chapter_url, name):
    # Time out after 30 s so a slow page cannot hang the scraper.
    r = requests.get(chapter_url, timeout=30)
    # Raise an HTTPError if the response status is 4xx or 5xx.
    r.raise_for_status()
    # Use the encoding detected from the content instead of the declared one (usually 'utf-8').
    r.encoding = r.apparent_encoding
    demo = r.text  # page text
    soup = BeautifulSoup(demo, 'lxml')  # parse the page
    text = soup.find_all(id='content', class_='showtxt')  # find the chapter body
    soup_text = BeautifulSoup(str(text), 'lxml')  # re-parse the matched fragment
    demo1 = soup_text.div.text.replace('\xa0', '')  # strip non-breaking spaces
    print(name)
    # Append the chapter to a file on drive D; the with block closes it.
    with open('d:一念永恆.txt', 'a', encoding='utf-8') as f:
        f.write('\t\t\t\t\t\t\t\t\t\t' + name + '\n')  # indent the chapter title
        f.write('' + demo1)
        f.write('\n\n')

def novel_url(url):
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    demo = r.text
    soup = BeautifulSoup(demo, 'lxml')
    text = soup.find_all('div', class_='listmain')
    soup_url = BeautifulSoup(str(text), 'lxml')
    flag = False
    numbers = len(soup_url.dl.contents) - 1
    index = 1
    for child in soup_url.dl.children:  # walk the chapter list
        if child != '\n':  # skip whitespace nodes
            if child.string == "《一念永恆》正文卷":  # heading that opens the main text
                flag = True  # marker: rows from here on are wanted
            if flag and child.a is not None:  # only rows that carry a link
                download_url = "" + child.a.get('href')  # site address + relative link
                name = child.string
                down_this_chapter(download_url, name)
                # Show progress as a fraction and return to the start of the line.
                sys.stdout.write("%.3f" % float(index / numbers) + '\r')
                sys.stdout.flush()
                index += 1

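def main():
    # Named url so the local variable does not shadow the novel_url() function.
    url = '/1_1094/'  # address of the novel on 筆趣閣
    novel_url(url)  # scrape every chapter linked from the index
    print("Scrape finished; check drive D for the output.")


main()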

"""Part of the scraped output:
外傳1 柯父。

外傳2 楚玉嫣。

外傳3 鸚鵡與皮凍。

第一章 他叫白小純

第二章 火灶房

第三章 六句真言

第四章 煉靈

第五章 萬一丟了小命咋辦

第六章 靈氣上頭

第七章 龜紋認主

第八章 我和你拼了！

第九章 延年益壽丹

第十章 師兄別走

第十一章 侯小妹

"""
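
The index loop above does two things at once: it ignores every row until the 《一念永恆》正文卷 heading, and it turns each later row's href into a download link. A minimal sketch of just that logic, separated from the network code, might look like the following. It assumes the rows have already been reduced to (text, href) pairs. `chapter_links`, `SAMPLE_ROWS`, `BASE_URL` (a placeholder, not the site's real address) and the sample hrefs are hypothetical, introduced only for illustration, and `urljoin` is swapped in here for the plain string concatenation used in the post.

```python
from urllib.parse import urljoin

BASE_URL = "https://example.com/"  # placeholder, NOT the real site address
HEADING = "《一念永恆》正文卷"


def chapter_links(entries, heading=HEADING, base=BASE_URL):
    """Yield (title, absolute_url) for rows listed after `heading`.

    `entries` is an iterable of (text, href) pairs in page order;
    href is None for rows that are headings rather than links.
    """
    started = False
    for text, href in entries:
        if text == heading:
            started = True  # same role as `flag` in the post
        if started and href is not None:
            yield text, urljoin(base, href)


# Hypothetical index rows; the hrefs are made up for illustration.
SAMPLE_ROWS = [
    ("row before the heading", "/1_1094/0.html"),
    (HEADING, None),
    ("第一章 他叫白小純", "/1_1094/1.html"),
    ("第二章 火灶房", "/1_1094/2.html"),
]

if __name__ == "__main__":
    for title, url in chapter_links(SAMPLE_ROWS):
        print(title, url)
```

Because the function works on plain tuples, the filtering rule can be checked without touching the network.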

Summary: analyze the page you are going to scrape thoroughly before writing the scraper; otherwise you may not get the results you want.
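
One way to act on this before writing any selectors is to list which tags on the page carry an id or a class. The sketch below is only an illustration. It uses the standard library's html.parser rather than BeautifulSoup, `ContainerSurvey` and `survey` are hypothetical helper names, and `SAMPLE_HTML` is a made-up fragment shaped like the selectors used in the post (a div with class listmain, a div with id content and class showtxt), not the real page.

```python
from collections import Counter
from html.parser import HTMLParser


class ContainerSurvey(HTMLParser):
    """Count (tag, id, class) combinations seen in a page."""

    def __init__(self):
        super().__init__()
        self.seen = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Only tags with an id or class are useful as selector targets.
        if "id" in attrs or "class" in attrs:
            self.seen[(tag, attrs.get("id"), attrs.get("class"))] += 1


def survey(html):
    """Return a Counter of (tag, id, class) keys found in `html`."""
    parser = ContainerSurvey()
    parser.feed(html)
    parser.close()
    return parser.seen


# Made-up fragment for illustration; the chapter href is hypothetical.
SAMPLE_HTML = """
<div class="listmain"><dl>
  <dt>《一念永恆》正文卷</dt>
  <dd><a href="/1_1094/1.html">第一章 他叫白小純</a></dd>
</dl></div>
<div id="content" class="showtxt">chapter text</div>
"""

if __name__ == "__main__":
    for (tag, id_, cls), count in survey(SAMPLE_HTML).items():
        print(tag, id_, cls, count)
```

Running the survey on a saved copy of the real page shows which containers exist, which makes it easier to tell whether a selector such as `class_='listmain'` still matches anything.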
