Scraping web novels with a Python crawler

First, fetch the HTML page and parse it. To make the page's encoding easy to store and reuse, it is simply kept in a global variable.

(table-of-contents markup; only the link text survives)

chapter name, chapter name, chapter name ......

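As the structure shows, you can first grab the top-level tag of the table of contents (the tag with class="box") and then collect all the li tags inside it. Because the page contains other tags with class="box", this time I use soup.find(style="min-height:420px;"), which looks the tag up by its style attribute "min-height:420px;" (instead of soup('div','box')).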
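The lookup described above could be sketched roughly as follows. The helper name parse_chapter_list is hypothetical, and I am assuming each li holds an a tag whose href points at the chapter page. The dictionary keys chapter_name and chapter_address match the ones the main function reads.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_chapter_list(html, base_url):
    """Return a list of {'chapter_name', 'chapter_address'} dicts."""
    soup = BeautifulSoup(html, 'html.parser')
    # Find the container by its inline style, since class="box" is not unique.
    box = soup.find(style='min-height:420px;')
    if box is None:
        return []
    chapters = []
    for li in box.find_all('li'):
        link = li.find('a')  # assumed: one link per list item
        if link is None or not link.get('href'):
            continue
        chapters.append({
            'chapter_name': link.get_text(strip=True),
            # Resolve relative links against the table-of-contents URL.
            'chapter_address': urljoin(base_url, link['href']),
        })
    return chapters
```

Matching on the exact style string is brittle: if the site changes that inline style even slightly, the function returns an empty list, so it is worth checking for that case before crawling.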

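Fetching the chapter content works the same way: find the tag that displays the page content, strip the extra spaces and line breaks from it, and then join the scattered pieces together.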

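def get_chapter_content(url, chapter_name):
    """Return the chapter title followed by the chapter text."""
    print('Fetching the content of %s ...' % chapter_name)
    soup = get_html_soup(url)
    page = ''
    for p in soup.find(id='htmlcontent').strings:
        # Skip the bare newlines that sit between text nodes.
        if p == '\n':
            continue
        # str.replace returns a new string, so its result has to be used;
        # calling p.replace(...) on its own line changes nothing.
        page += p.replace('\xa0', ' ')
    return chapter_name + '\n' + page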
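To see the cleanup step on its own, without touching the network, here is a small sketch that runs on an inline HTML string. The helper name extract_text and the sample markup are made up for illustration. It is also a bit stricter than the function above, since it strips every fragment and joins them with newlines.

```python
from bs4 import BeautifulSoup


def extract_text(html, content_id='htmlcontent'):
    """Collect the text fragments under the content tag, one per line."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find(id=content_id)
    if container is None:
        return ''
    lines = []
    for fragment in container.strings:
        # Turn non-breaking spaces into plain spaces, then trim whitespace.
        text = fragment.replace('\xa0', ' ').strip()
        if text:  # drop fragments that were only whitespace
            lines.append(text)
    return '\n'.join(lines)


# Made-up sample resembling a chapter page that separates lines with <br/>.
sample = ('<div id="htmlcontent">\n\xa0\xa0First paragraph.<br/>\n<br/>\n'
          '\xa0\xa0Second paragraph.\n</div>')
print(extract_text(sample))
```

Returning an empty string when the tag is missing avoids the AttributeError that soup.find(...).strings would raise on a page with a different layout.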

The main function gets the chapter list, then fetches each chapter's content and saves it to a local file.

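def main():
    chapter_list = get_chapter_list(novel_url)
    # print(chapter_list)
    # Start at index 1709 (apparently the point where an interrupted run
    # was picked up by hand); use chapter_list to start from the beginning.
    for chapter_info in chapter_list[1709:]:
        page = get_chapter_content(chapter_info['chapter_address'], chapter_info['chapter_name'])
        # print(page)
        # Append mode keeps the chapters already saved; the with block
        # closes the file, so no explicit f.close() is needed.
        with open('惡魔總裁霸道寵：老婆，太腹黑.txt', 'a', encoding='utf-8') as f:
            f.write(page)


if __name__ == '__main__':
    main()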

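And that's it. To guard against the network acting up, you can add a resume-from-breakpoint feature (resumable crawling?): all it takes is recording how far the crawl has got. Last night's crawl broke off once, and it was only when I resumed it by hand this morning that I thought of this. Marking it here, to be improved shortly.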
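One way the progress recording could work is sketched below, using only the standard library. The function names, the progress file format (a single integer giving the next chapter index), and the fetch_chapter callback are all assumptions for illustration. This is not the improved version the post promises.

```python
import os


def load_progress(progress_path):
    """Return the index of the next chapter to fetch (0 if nothing is recorded)."""
    if not os.path.exists(progress_path):
        return 0
    with open(progress_path, encoding='utf-8') as f:
        text = f.read().strip()
    return int(text) if text.isdigit() else 0


def save_progress(progress_path, next_index):
    """Overwrite the progress file with the next chapter index."""
    with open(progress_path, 'w', encoding='utf-8') as f:
        f.write(str(next_index))


def crawl_with_resume(chapter_list, fetch_chapter, output_path, progress_path):
    """Fetch chapters in order, starting from the recorded position."""
    start = load_progress(progress_path)
    for index in range(start, len(chapter_list)):
        chapter = chapter_list[index]
        # If this call raises (for example on a dropped connection), the
        # progress file still points at this chapter, so a rerun retries it.
        page = fetch_chapter(chapter['chapter_address'], chapter['chapter_name'])
        with open(output_path, 'a', encoding='utf-8') as f:
            f.write(page + '\n')
        # Record progress only after the chapter is safely on disk.
        save_progress(progress_path, index + 1)
```

With this in place, main could call crawl_with_resume(chapter_list, get_chapter_content, ...) instead of hard-coding a slice offset, and rerunning the script after a failure would carry on from the chapter that failed.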

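The page content on Qidian (起點中文網) is laid out a little differently from these pirate sites: every paragraph is wrapped in its own tag. That is no big deal, though. As long as the page can be fetched, parsing it is not much of a problem.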
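For a page where each paragraph sits in its own tag, the extraction could be sketched like this. The post does not say which tag or container is used, so the p tag, the container_id parameter, and the helper name extract_paragraphs are assumptions for illustration only.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html, container_id, paragraph_tag='p'):
    """Join the text of every paragraph tag inside the given container."""
    soup = BeautifulSoup(html, 'html.parser')
    container = soup.find(id=container_id)
    if container is None:
        return ''
    paragraphs = []
    for tag in container.find_all(paragraph_tag):
        # Normalise non-breaking spaces and trim surrounding whitespace.
        text = tag.get_text().replace('\xa0', ' ').strip()
        if text:  # skip empty paragraphs
            paragraphs.append(text)
    return '\n'.join(paragraphs)
```

Since each paragraph is already its own element, there is no need to filter out stray newline nodes as in the .strings approach.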
