urllib example: scraping Baidu Tieba

So, when sending a GET request, setting a different kw value reaches a different Tieba forum.
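As a small illustration of how kw ends up in the request, the sketch below encodes a forum name into a query string with urllib.parse.urlencode. The helper name build_query and the sample forum names are hypothetical and used only for illustration; only the kw parameter itself comes from the text.

```python
from urllib import parse


def build_query(kw):
    """Return the URL-encoded query string for a forum name (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII characters as UTF-8 by default
    return parse.urlencode({'kw': kw})


# Sample forum names, chosen only for illustration
print(build_query('lol'))
print(build_query('火影忍者'))
```

Letting urlencode do the escaping avoids hand-building query strings, which matters here because forum names are often non-ASCII.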

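A forum on a single topic is split into pages. Clicking through the pages shows that the pn parameter in the URL changes in a regular pattern:

Page 1: pn = 0

Page 2: pn = 50

Page 3: pn = 100

Following this pattern, pages with different page numbers can be scraped.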
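The pattern above amounts to pn = (page - 1) * 50. A minimal sketch of that mapping follows; the function names page_to_pn and pn_values are hypothetical, and the step of 50 is taken from the three observed pages rather than from any documented API.

```python
def page_to_pn(page):
    """Map a 1-based page number to its pn offset, assuming 50 entries per page."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    return (page - 1) * 50


def pn_values(start_page, end_page):
    """Return the pn offsets for every page from start_page to end_page, inclusive."""
    return [page_to_pn(page) for page in range(start_page, end_page + 1)]


print(pn_values(1, 3))  # [0, 50, 100]
```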

from urllib import request, parse
import time
import random
import os

kw = input('Enter the forum name: ')
start = input('Enter the start page: ')
end = input('Enter the end page: ')

# Build the query string (kw is the only parameter named in the text)
qs = {'kw': kw}
qs = parse.urlencode(qs)

# Build the forum URL (the URL prefix is missing from the source text)
base_url = '' + qs

# Convert 1-based page numbers to pn offsets; +1 makes the end page inclusive
start = (int(start) - 1) * 50
end = (int(end) - 1) * 50 + 1

for pn in range(start, end, 50):
    # pn is the paging offset
    # File name: the page number plus .html
    fname = str(pn // 50 + 1) + '.html'
    fullurl = base_url + '&pn=' + str(pn)
    print(fullurl)
    response = request.urlopen(fullurl)
    data = response.read().decode('utf-8')

    # Create the output directory automatically
    path = './tieba/' + kw
    if not os.path.exists(path):
        os.makedirs(path)

    with open(os.path.join(path, fname), 'w', encoding='utf-8') as f:
        f.write(data)

    # Add a delay between requests
    time.sleep(random.random() * 2)

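The random delay at the end of each iteration keeps requests from arriving too frequently, which could get the IP address banned.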
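The delay can be isolated in a small helper, sketched below. The name polite_sleep and its parameters are hypothetical; the 2-second upper bound mirrors the random.random() * 2 used in the script, and the sleep argument exists only so the helper can be exercised without actually waiting.

```python
import random
import time


def polite_sleep(max_seconds=2.0, sleep=time.sleep):
    """Pause for a random duration in [0, max_seconds) and return the delay used."""
    # random.random() returns a float in [0.0, 1.0)
    delay = random.random() * max_seconds
    sleep(delay)
    return delay


# Short demonstration with a small upper bound
waited = polite_sleep(0.1)
print(f'waited {waited:.3f} s')
```

A randomized wait spaces requests out irregularly. It lowers the request rate but does not guarantee that a site will not block the client.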
