A simple crawler with the requests module

I have recently been learning about web crawling in Python. My plan is to write things up as I go, summarizing while I learn.

The requests module is enough to crawl websites that have little or no anti-crawler protection.

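This time the target is Maoyan's film ranking. By changing the offset parameter in the URL we can fetch the HTML of each of the first ten pages, then use regular expressions from the re module to extract the parts of the page we need, and finally write those parts to a file as dictionaries. That is all a simple crawler takes. The code is as follows: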

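'''
Created on 2023-04-12
@author: zeng
'''
import json
import re
import time

import requests


# Fetch one page and return its HTML, or None if the request did not succeed.
def get_one_page(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text
    return None


# Extract the wanted fields with a regular expression and yield one dict per film.
# The HTML tags inside the pattern were lost in this copy of the post; the <dd>,
# </i>, </a> and </p> anchors are restored on the assumption that each entry
# sits in its own <dd> block.
def parse_one_page(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
        r'.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
        r'.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)
    for item in re.findall(pattern, html):
        # The key names are illustrative; the original dict body was not preserved.
        yield {
            'index': item[0],
            'image': item[1],
            'name': item[2].strip(),
            'star': item[3].strip(),
            'releasetime': item[4].strip(),
            'score': item[5] + item[6],
        }


# Append one record to result.txt as a line of JSON, encoded as gb18030.
def write_to_file(content):
    with open('result.txt', 'a', encoding='gb18030') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


# Crawl the page at the given offset and save every record found on it.
def main(offset):
    # The base URL is blank in this copy of the post; put the ranking page URL
    # (ending in "offset=") inside the quotes before running.
    url = '' + str(offset)
    html = get_one_page(url)
    if html is None:
        return
    for item in parse_one_page(html):
        write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(offset=i * 10)
        time.sleep(1)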
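
The parsing and saving steps can be tried without touching the network. The sketch below is only an illustration under my own assumptions: SAMPLE_HTML is made-up markup rather than a copy of the real page, the closing tags in the pattern are guesses at how each entry is laid out, and the helper names page_offsets, parse_films, write_records and read_records are hypothetical. It shows the three ideas used above: stepping the offset by 10, pulling seven groups out of each entry with one regular expression, and storing one JSON dictionary per line.

```python
import json
import os
import re
import tempfile

# Hypothetical markup for two ranking entries; the real page may differ.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="https://example.com/a.jpg" class="board-img" />
  <p class="name"><a href="/films/1" title="Film A">Film A</a></p>
  <p class="star">
      Starring: Actor One, Actor Two
  </p>
  <p class="releasetime">Release date: 1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
<dd>
  <i class="board-index board-index-2">2</i>
  <img data-src="https://example.com/b.jpg" class="board-img" />
  <p class="name"><a href="/films/2" title="Film B">Film B</a></p>
  <p class="star">
      Starring: Actor Three
  </p>
  <p class="releasetime">Release date: 1994-09-10</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">4</i></p>
</dd>
'''

# Same shape as the crawler's pattern; the closing tags are assumed.
PATTERN = re.compile(
    r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?a.*?>(.*?)</a>'
    r'.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
    r'.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>.*?</dd>', re.S)


def page_offsets(pages=10, page_size=10):
    """Offsets for the first `pages` pages: 0, 10, 20, ..."""
    return [i * page_size for i in range(pages)]


def parse_films(html):
    """Return one dict per <dd> entry found in html."""
    films = []
    for index, image, name, star, released, integer, fraction in PATTERN.findall(html):
        films.append({
            'index': index,
            'image': image,
            'name': name.strip(),
            'star': star.strip(),
            'releasetime': released.strip(),
            'score': integer + fraction,  # "9." + "5" gives "9.5"
        })
    return films


def write_records(path, records):
    """Append each record as one line of JSON, encoded as gb18030."""
    with open(path, 'a', encoding='gb18030') as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_records(path):
    """Read the file back into a list of dicts, one per non-empty line."""
    with open(path, encoding='gb18030') as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == '__main__':
    out = os.path.join(tempfile.mkdtemp(), 'result.txt')
    write_records(out, parse_films(SAMPLE_HTML))
    for film in read_records(out):
        print(film['index'], film['name'], film['score'])
```

Writing one JSON object per line keeps the file appendable across the ten pages and lets it be read back line by line with json.loads.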
