Scraping a Web Novel with a Python Crawler

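I have been bored out of my mind lately and wanted to read some easy, feel-good web fiction.

So the only option was to scrape a complete copy of a novel myself!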

Open the first chapter, view the page source, and you find that the novel's text sits inside ...

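Extracting the text takes two steps: first capture what is inside ..., then capture what is inside the element nested within it.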
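The two-step extraction can be sketched roughly as follows. The tag names used here (`<div id="content">` as the outer container and `<p>` as the inner element) and the sample HTML are assumptions made purely for illustration, since the real tags are not preserved in this post; substitute whatever the target page actually uses.

```python
import re

# Hypothetical page fragment; the real site's markup will differ.
sample_html = """
<div id="content">
<p>First paragraph.</p>
<p>Second <b>paragraph</b>.</p>
</div>
<div id="footer"><p>Not part of the story.</p></div>
"""

# Step 1 (coarse): capture everything inside the assumed outer container.
# re.S lets "." match newlines, so the block may span several lines.
outer_pattern = re.compile(r'<div id="content">(.*?)</div>', re.S)

# Step 2 (fine): capture each assumed inner paragraph element.
inner_pattern = re.compile(r'<p>(.*?)</p>', re.S)


def extract_paragraphs(html):
    """Return the inner paragraph bodies found inside the outer container."""
    paragraphs = []
    for outer in outer_pattern.finditer(html):
        for inner in inner_pattern.finditer(outer.group(1)):
            paragraphs.append(inner.group(1))
    return paragraphs


paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

Running the coarse pattern first keeps the fine pattern from picking up unrelated elements elsewhere on the page, such as the footer paragraph in the sample.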

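But one chapter is not enough: the crawl has to keep going, so it also needs the link to the next chapter, which is found in the 下一章 (next chapter) link.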

The chapter title has to be captured as well; it is inside ...
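A minimal sketch of pulling out the title and the next-chapter link might look like this. The `<h1>` title element, the `<a href="...">下一章</a>` anchor shape, and the sample paths are all assumptions for illustration; the actual markup is not shown in this post.

```python
import re

# Hypothetical chapter page; the real markup is not shown in this post.
sample_html = """
<h1>Chapter 1: The Beginning</h1>
<a href="/195/1001.html">上一章</a>
<a href="/195/1002.html">下一章</a>
"""

# Assumed: the title lives in an <h1> element.
title_pattern = re.compile(r'<h1>(.*?)</h1>', re.S)

# Assumed: the next-chapter link is an <a> whose text is 下一章.
# [^"]* keeps the match inside a single href attribute, so the
# pattern cannot start at the previous-chapter link and run on.
next_pattern = re.compile(r'<a href="([^"]*)">下一章</a>')


def parse_navigation(html):
    """Return (title, next_href); either is None when not found."""
    title_match = title_pattern.search(html)
    next_match = next_pattern.search(html)
    title = title_match.group(1).strip() if title_match else None
    next_href = next_match.group(1) if next_match else None
    return title, next_href


title, next_href = parse_navigation(sample_html)
print(title, next_href)
```

Returning `None` instead of indexing into an empty `findall` result makes a missing title or link easy to detect before the crawl moves on.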

So there are four regular expressions:

story_pattern1 = re.compile(r'...(.*?)...', re.S)
story_pattern2 = re.compile(r'...', re.S)
next_pattern = re.compile(r'...', re.S)
title_pattern = re.compile(r'...', re.S)

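The final flow is:

Open the link

Fetch the HTML content

Extract the novel text (coarse)

Extract the title

Extract the novel text (fine)

Strip the HTML tags from the text

Write the title

Write the novel text

Keep looping like this and you end up with a whole novel!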
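The flow above can be tried offline with a small sketch like the one below. Everything site-specific in it is hypothetical: the `PAGES` dictionary stands in for the real website, `fetch` stands in for the network request, and the tag names and paths are invented for the example. It only illustrates the shape of the loop, in which the last chapter links back to the index page and that ends the crawl.

```python
import re

# Hypothetical in-memory "site": maps a URL path to its HTML.
# The last chapter links back to the index page, which ends the loop.
PAGES = {
    "/book/1.html": '<h1>Chapter 1</h1>'
                    '<div id="content"><p>Hello <b>world</b></p></div>'
                    '<a href="/book/2.html">next</a>',
    "/book/2.html": '<h1>Chapter 2</h1>'
                    '<div id="content"><p>The end</p></div>'
                    '<a href="/book/">next</a>',
}
INDEX_PATH = "/book/"

TITLE = re.compile(r'<h1>(.*?)</h1>', re.S)
COARSE = re.compile(r'<div id="content">(.*?)</div>', re.S)
FINE = re.compile(r'<p>(.*?)</p>', re.S)
NEXT = re.compile(r'<a href="([^"]*)">next</a>')
TAG = re.compile(r'<.*?>')


def fetch(path):
    """Stand-in for opening the link and decoding the response."""
    return PAGES[path]


def crawl(start_path):
    """Follow next-chapter links from start_path; return collected lines."""
    lines = []
    path = start_path
    while path != INDEX_PATH:
        html = fetch(path)                               # open link, get HTML
        lines.append(TITLE.search(html).group(1))        # title
        for block in COARSE.finditer(html):              # text (coarse)
            for para in FINE.finditer(block.group(1)):   # text (fine)
                lines.append(TAG.sub("", para.group(1))) # strip tags
        path = NEXT.search(html).group(1)                # next chapter
    return lines


book = crawl("/book/1.html")
print("\n\n".join(book))
```

Collecting the lines in a list and writing them out at the end is a choice made for the sketch; writing to the file inside the loop, as the post does, works just as well and keeps partial results if the crawl is interrupted.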

Source code:

import urllib.request
import re

url = ''
url_head = ''
next_url = ['']
file = open(r"d:\route\***x.txt", "a+")

# The patterns never change, so compile them once before the loop.
story_pattern1 = re.compile(r'...(.*?)...', re.S)
story_pattern2 = re.compile(r'...', re.S)
next_pattern = re.compile(r'...', re.S)
title_pattern = re.compile(r'...', re.S)

# Stop once the "next chapter" link is '/195/'.
while next_url[0] != '/195/':
    # Open the link and fetch the HTML content.
    temp = urllib.request.urlopen(url)
    content = temp.read().decode('utf-8')
    # Coarse text match, next-chapter link, and title.
    story_content = re.finditer(story_pattern1, content)
    next_url = re.findall(next_pattern, content)
    url = url_head + next_url[0]
    title = re.findall(title_pattern, content)
    file.write(title[0] + '\n\n')
    print(title[0])
    for match in story_content:
        # Fine text match inside each coarse block.
        match_content = re.finditer(story_pattern2, match.group())
        for aa in match_content:
            # Strip the remaining HTML tags before writing.
            result = re.sub(r'<.*?>', "", aa.group())
            file.write(result + '\n\n')

file.close()
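The tag-stripping step in the loop above relies on `re.sub(r'<.*?>', "", ...)`. A small sketch of what that substitution does on its own is shown below. The `html.unescape` call is an addition that is not part of the original script; it is included on the assumption that the scraped text may contain entities such as `&nbsp;`, which the regex alone would leave untouched.

```python
import html
import re

# Same tag pattern as in the script: anything between "<" and the next ">".
TAG = re.compile(r'<.*?>')


def strip_tags(fragment):
    """Remove anything that looks like an HTML tag, then decode entities."""
    text = TAG.sub("", fragment)
    # Assumption: entity decoding is wanted; the original script skips it.
    return html.unescape(text)


cleaned = strip_tags('He said&nbsp;<b>"hi"</b><br/>&amp; left.')
print(repr(cleaned))
```

Note that `&nbsp;` decodes to a non-breaking space (`\xa0`), not a regular space, so a further replace may be wanted if plain spaces are preferred in the output file.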
