Scraping the novel 《完美世界》 with Python

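I'm a beginner who only recently started with Python, so please bear with anything that is badly written.

This post draws on another blogger's article: python每日一練(18)-抓取**目錄和全文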

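Opening the page gives the view shown in the screenshot:

Inspecting the page source, as in the screenshot above, shows that every chapter sits under div id="list". The code is as follows.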

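def get_info(url):
    # Fetch the table-of-contents page and decode it as UTF-8.
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    get_info_list = []
    html = etree.HTML(response.text)
    # Every chapter is a <dd> under the element with id="list".
    dd_list = html.xpath('//*[@id="list"]/dl/dd')
    for dd in dd_list:
        title = dd.xpath('a/text()')[0]
        # The site prefix in front of the relative link is missing from the source; '' is kept as-is.
        href = '' + dd.xpath('a/@href')[0]
        # Reconstructed: the dict body was lost in the source; the keys match their use in get_content.
        chapter = {'title': title, 'href': href}
        get_info_list.append(chapter)
    return get_info_list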
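To try the chapter-list step without hitting the site, here is a minimal offline sketch. It is not the lxml/XPath approach used above: it swaps in the standard library's `html.parser` so it runs with no third-party packages, and it joins links with `urljoin` instead of string concatenation. `BASE_URL`, the sample markup, and the names `ChapterLinkParser` and `parse_chapters` are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://example.com/book/"  # hypothetical site prefix

# Hypothetical markup that mimics the structure described in the post.
SAMPLE = """
<a href="/about.html">About</a>
<div id="list"><dl>
  <dd><a href="1.html">Chapter 1</a></dd>
  <dd><a href="2.html">Chapter 2</a></dd>
</dl></div>
"""


class ChapterLinkParser(HTMLParser):
    """Collect {'title', 'href'} dicts for links inside <div id="list">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_list = False      # True while inside <div id="list">
        self.div_depth = 0        # tracks nested <div> tags inside the list
        self.current_href = None  # absolute URL of the <a> being read
        self.chapters = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.in_list:
                self.div_depth += 1
            elif attrs.get("id") == "list":
                self.in_list = True
                self.div_depth = 1
        elif tag == "a" and self.in_list and "href" in attrs:
            # urljoin handles relative and absolute links alike.
            self.current_href = urljoin(self.base_url, attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.in_list:
            self.div_depth -= 1
            if self.div_depth == 0:
                self.in_list = False
        elif tag == "a":
            self.current_href = None

    def handle_data(self, data):
        if self.current_href and data.strip():
            self.chapters.append(
                {"title": data.strip(), "href": self.current_href}
            )


def parse_chapters(page, base_url):
    parser = ChapterLinkParser(base_url)
    parser.feed(page)
    return parser.chapters


for chapter in parse_chapters(SAMPLE, BASE_URL):
    print(chapter["title"], chapter["href"])
```

The "About" link is skipped because it sits outside the `id="list"` block, which is the same filtering the XPath expression performs.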

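At this point we have the title and link of every chapter in the book. Next comes analysing the content of each chapter page.

The goal here is to get the text of each chapter. I used a regular expression to extract it and then wrote it to a file. The code is as follows: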

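def get_content(get_info):
    for chapter_info in get_info:
        response = requests.get(url=chapter_info['href'], headers=headers)
        response.encoding = 'utf-8'
        # Create the output directory on first use.
        if not os.path.exists('完美世界'):
            os.makedirs('完美世界')
        # The HTML tags around the capture group were stripped from the source; only '(.*?)' survives.
        contents = re.findall('(.*?)', response.text)
        # One .txt file per chapter, named after the chapter title.
        with open('./完美世界/' + chapter_info['title'] + '.txt', 'w', encoding='utf-8') as f:
            for content in contents:
                # The second search string was also lost in the source and is left empty.
                f.write(content.replace('    ', '').replace('', '').strip())
        print()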
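Because the regular expression in the listing above lost its HTML tags, here is a small offline sketch of what regex-based extraction can look like. The `<div id="content">` wrapper, the `&nbsp;` indentation and the `<br />` line breaks are assumptions about the chapter page, not something the post confirms, and `extract_text` is a hypothetical helper name.

```python
import html
import re

# Hypothetical chapter page; the real markup may differ.
SAMPLE_PAGE = (
    '<html><body><div id="content">'
    '&nbsp;&nbsp;&nbsp;&nbsp;First paragraph.<br /><br />'
    '&nbsp;&nbsp;&nbsp;&nbsp;Second paragraph.<br />'
    '</div></body></html>'
)


def extract_text(page):
    """Return the chapter text with tags and indentation removed."""
    # re.S lets '.' match newlines, so the body may span several lines.
    match = re.search(r'<div id="content">(.*?)</div>', page, re.S)
    if match is None:
        return ""
    body = match.group(1)
    # Turn <br>, <br/> and <br /> into real line breaks.
    body = re.sub(r"<br\s*/?>", "\n", body)
    # Decode entities such as &nbsp; (which becomes '\xa0').
    body = html.unescape(body)
    lines = [line.replace("\xa0", " ").strip() for line in body.splitlines()]
    # Drop the empty lines left behind by doubled <br /> tags.
    return "\n".join(line for line in lines if line)


print(extract_text(SAMPLE_PAGE))
```

Decoding entities with `html.unescape` avoids having to list each one in a chain of `replace` calls.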

Finally, call everything from the main entry point. The code is as follows:

if __name__ == '__main__':
    get_content(get_info(url))

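Putting all of the code together:

import os
import re

import requests
from lxml import etree

# The header dict was stripped from the source; an empty dict is a placeholder.
headers = {}
# The table-of-contents URL is missing from the source and is left empty.
url = ''


def get_info(url):
    # Fetch the table-of-contents page and decode it as UTF-8.
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    get_info_list = []
    html = etree.HTML(response.text)
    # Every chapter is a <dd> under the element with id="list".
    dd_list = html.xpath('//*[@id="list"]/dl/dd')
    for dd in dd_list:
        title = dd.xpath('a/text()')[0]
        # The site prefix in front of the relative link is missing from the source; '' is kept as-is.
        href = '' + dd.xpath('a/@href')[0]
        # Reconstructed: the dict body was lost in the source; the keys match their use in get_content.
        chapter = {'title': title, 'href': href}
        get_info_list.append(chapter)
    return get_info_list


def get_content(get_info):
    for chapter_info in get_info:
        response = requests.get(url=chapter_info['href'], headers=headers)
        response.encoding = 'utf-8'
        # Create the output directory on first use.
        if not os.path.exists('完美世界'):
            os.makedirs('完美世界')
        # The HTML tags around the capture group were stripped from the source; only '(.*?)' survives.
        contents = re.findall('(.*?)', response.text)
        # One .txt file per chapter, named after the chapter title.
        with open('./完美世界/' + chapter_info['title'] + '.txt', 'w', encoding='utf-8') as f:
            for content in contents:
                # The second search string was also lost in the source and is left empty.
                f.write(content.replace('    ', '').replace('', '').strip())
        print()


if __name__ == '__main__':
    get_content(get_info(url))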
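The combined script has no error handling, so a single failed request ends the whole run. As a minimal sketch of one way to soften that, the helper below retries a fetch a few times before giving up. `fetch_with_retry` and its parameters are hypothetical names, and the fetch function is passed in so the example runs without any network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose for this sketch
            last_error = exc
            if attempt < attempts:
                # Wait a little longer after each failure.
                sleep(delay * attempt)
    raise last_error


# Offline demonstration with a fake fetch that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "page body"


print(fetch_with_retry(flaky_fetch, "https://example.com/1.html", delay=0))
```

With requests, the fetch argument could be something like `lambda u: requests.get(u, headers=headers, timeout=10)`, so that a stalled connection raises instead of hanging.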

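The final result, with one text file per chapter:

One problem showed up at the end:

The run failed after only about seven hundred chapters. My guess is that a chapter title contained a character that could not be handled when it was used as the file name, which caused the error. Any pointers on this would be very welcome.

It isn't well written; I'm only doing this for practice.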
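If the cause really is a character in a chapter title that is not allowed in a file name, one possible fix is to clean the title before passing it to `open`. This is only a sketch under that assumption: `safe_filename` is a hypothetical helper, and the character set below is the one Windows rejects in file names, which also covers the `/` that breaks paths on other systems.

```python
import re

# Characters Windows does not allow in file names, plus control characters.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, replacement="_"):
    """Replace characters that cannot appear in a file name."""
    cleaned = INVALID_CHARS.sub(replacement, title)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.strip().rstrip(".")
    return cleaned or "untitled"


print(safe_filename("Chapter 1: What?"))
print(safe_filename("完美世界"))
```

In `get_content`, the file path would then be built from `safe_filename(chapter_info['title'])` instead of the raw title.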
