Python Web Scraping in Practice

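The target is the "free" copy of the novel 《三寸人間》 hosted on a pirate site (if you read the novel, please support the official edition).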

Here is the source code:

from urllib import request
from bs4 import BeautifulSoup
import re

# Fetch the HTML source of the chapter index page
response = request.urlopen("")  # index page URL omitted
html = response.read()
# Parse the HTML
soup = BeautifulSoup(html, "html.parser")
# Use a regular expression to find the chapter links
all_href = soup.find_all(href=re.compile("^/14_14055/"))
# Build a dict with each link as the key and its chapter title as the value,
# reading the href and the text from every matched tag
all_href_name_dict = {}
for each_href in all_href:
    key = each_href["href"]
    value = each_href.get_text()
    all_href_name_dict[key] = value
print(all_href_name_dict)
# del all_href_name_dict['/14_14055/']


# Fetch the HTML source of a page
def get_html(url):
    response = request.urlopen(url)
    html = response.read()
    return html


# Parse the source and return the element that holds the novel text
def get_content(url):
    content_html = get_html(url)
    soup = BeautifulSoup(content_html, "html.parser")
    # The attrs filter is omitted here; {} is a placeholder for the
    # id/class of the div that holds the chapter text
    txt_show = soup.find_all('div', attrs={})
    return txt_show[0]


# Append the novel text to a local txt file
def write_to_txt(context, title):
    with open('sancunrenjian', 'a', encoding='utf-8') as f:
        f.write('\n' + title + '\n' + context)


# Run the scrape
for k, v in all_href_name_dict.items():
    chapter_url = "" + k  # site prefix omitted
    print(chapter_url)
    chapter_txt = get_content(chapter_url)
    # Strip the unwanted content (the markup strings to remove are omitted;
    # only an empty string and line breaks remain)
    txt = str(chapter_txt).replace("", "").replace("\n", "").replace("\n", "")
    write_to_txt(str(txt), v)
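
The loop above builds each chapter URL by string concatenation and cleans the chapter markup with chained replace calls. Below is a minimal sketch of one possible alternative, not the method used in this post: urljoin resolves the relative link, and get_text drops the tags. The base URL https://example.com/, the div id "content", the sample HTML, and the helper names build_chapter_url and extract_chapter_text are all assumptions made for illustration; the real site URL and container attributes are not given here.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical values for illustration only: the real site URL and the
# attributes of the chapter container are not given in this post.
BASE_URL = "https://example.com/"
SAMPLE_CHAPTER_HTML = (
    '<html><body><div id="content">'
    "First paragraph.<br/>Second paragraph.<br/>"
    "</div></body></html>"
)


def build_chapter_url(base_url, href):
    """Resolve a relative href such as /14_14055/1.html against the site root."""
    return urljoin(base_url, href)


def extract_chapter_text(html):
    """Return the text of the assumed content div, one line per text node."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find("div", attrs={"id": "content"})
    if content is None:
        # No matching container on this page
        return ""
    # get_text drops every tag; the separator is placed between text nodes
    return content.get_text(separator="\n", strip=True)


chapter_url = build_chapter_url(BASE_URL, "/14_14055/1.html")
chapter_text = extract_chapter_text(SAMPLE_CHAPTER_HTML)
print(chapter_url)
print(chapter_text)
```

With this approach there is no need to list every tag that should be removed, and a page without the expected container yields an empty string instead of an IndexError.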

Partial results of the run:
