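Web Scraping Study Notes: Day 2

Concept: requests is a module built on network requests; its job is to simulate a browser sending a request.

Coding workflow: specify the URL -> send the request -> get the response data (the scraped data) -> persist it to storage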

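import io
import requests

# Specify the URL (left empty in the original notes)
url = ''
# Send the request; the return value is a response object
response = requests.get(url=url)
# Get the response data; .text returns it as a string
page_text = response.text
# Save it to disk
with io.open('./sogou.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)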

Problem: the saved page contains garbled characters.
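One common cause of this kind of garbling, offered as an assumption rather than a diagnosis of this particular page, is that the body is UTF-8 but gets decoded with a different codec. requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset. The sketch below reproduces the effect on plain bytes, with no network involved; the sample text and the helper name decode_body are made up for illustration.

```python
def decode_body(raw_bytes, encoding):
    """Decode a response body with an explicitly chosen codec."""
    return raw_bytes.decode(encoding)

# Hypothetical page text, encoded the way a server would send it.
original = '搜狗搜索'
raw_bytes = original.encode('utf-8')

# Wrong codec: every byte becomes its own Latin-1 character (garbled).
garbled = decode_body(raw_bytes, 'iso-8859-1')

# Right codec: the original text comes back.
fixed = decode_body(raw_bytes, 'utf-8')

print(garbled == original)  # False
print(fixed == original)    # True
```

Setting response.encoding before reading response.text has the same effect as choosing the codec here.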

Anti-scraping mechanism: User-Agent (UA) detection

Counter-measure (anti-anti-scraping): UA spoofing
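UA detection works because a script announces itself in the User-Agent header unless told otherwise. The sketch below uses the standard library's urllib.request (swapped in here only because it can be inspected without sending anything) to show the default value and a replaced one. The browser-like string and the example.com URL are placeholders, not values from these notes.

```python
import urllib.request

# The default header of a plain script identifies it as a script.
default_ua = dict(urllib.request.build_opener().addheaders)['User-agent']
print(default_ua.startswith('Python-urllib/'))  # True

# Hypothetical browser-like value; copy a real one from your own browser.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0'}

# Building a Request object does not send it, so this stays offline.
req = urllib.request.Request('https://example.com/', headers=headers)

# urllib stores header names capitalized as 'User-agent'.
spoofed_ua = req.get_header('User-agent')
print(spoofed_ua)
```

With requests, the same headers dict is passed through the headers argument of requests.get.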

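wd = input('enter a key:')
# We want the parameters carried by the URL to change dynamically
url = '/web'
# Store the dynamic request parameters in a dict
# (the dict contents are missing from the original notes; the keyword wd belongs here)
params = {}
# Header information for the request about to be sent
# (contents missing from the original notes; it should carry a User-Agent)
headers = {}
# params must be applied to the request
# The params argument wraps the query parameters of the request URL
# headers is used to implement UA spoofing
response = requests.get(url=url, params=params, headers=headers)
# Manually set the encoding of the response data
response.encoding = 'utf-8'
page_text = response.text
filename = wd + '.html'
# Save it to disk
with io.open(filename, 'w', encoding='utf-8') as fp:
    fp.write(page_text)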
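To make "params wraps the URL's query parameters" concrete, the sketch below builds the query string by hand with the standard library. It approximates what happens to a params dict and is not a copy of requests' internals. The example.com endpoint, the parameter name query, and the helper build_url are assumptions for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_url(base_url, params):
    """Append params to base_url as a percent-encoded query string."""
    return base_url + '?' + urlencode(params)

# Hypothetical endpoint and parameter name.
full_url = build_url('https://example.com/web', {'query': '爬蟲'})
print(full_url)

# The keyword survives a round trip through the encoded URL.
decoded = parse_qs(urlsplit(full_url).query)
print(decoded['query'][0])
```

Keeping the keyword in a dict instead of pasting it into the URL means non-ASCII input is percent-encoded for you.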

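Analysis: when the page is scrolled to the bottom, an ajax request is sent, and that request returns a batch of movie data.

Dynamically loaded data: data obtained through a separate, additional request

– ajax requests

– data generated dynamically by JS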

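url = ''
start = input('enter a key')
limit = input('enter a limit')
# Build the request parameters
# (the dict contents are missing from the original notes; start and limit belong here)
params = {}
response = requests.get(url=url, params=params, headers=headers)
# .json() returns the deserialized object
data_list = response.json()
with open('douban.txt', 'w', encoding='utf-8') as fp:
    for dic in data_list:
        name = dic['title']
        score = dic['score']
        # str() guards against the score arriving as a number
        fp.write(name + ':' + str(score) + '\n')
        print(name, 'scraped successfully')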
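The parsing half of the script above can be tried without a network connection. The sketch below feeds a made-up JSON body through json.loads, which is the same kind of parsing that response.json() performs. The sample records and the helper format_movies are hypothetical; only the title and score keys come from the notes.

```python
import json

# Hypothetical response body; the real endpoint's records may carry more fields.
sample_body = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'

def format_movies(body):
    """Parse a JSON array of movie dicts into 'title:score' lines."""
    data_list = json.loads(body)
    return [f"{dic['title']}:{dic['score']}" for dic in data_list]

lines = format_movies(sample_body)
print(lines)  # ['Movie A:9.1', 'Movie B:8.7']
```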

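post_url = ''
city = input('enter a city name')
# Form data for the POST request
# (the dict contents are missing from the original notes; city belongs here)
data = {}
response = requests.post(url=post_url, data=data, headers=headers)
response.json()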

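Regarding **

Local search in the packet-capture tool: if the information displayed on the page cannot be found in the response for that URL, the page contains dynamically loaded data.

Once you have determined that a page has dynamically loaded data, how do you locate where it comes from? Use the packet-capture tool's global search.

Before scraping data from an unfamiliar **, always confirm whether the data you want is dynamically loaded.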
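The local-search check can also be scripted as a rough first pass: fetch the page source, then test whether text you can see in the browser appears in it. The sketch below shows only the comparison step on a made-up page; the function name and sample HTML are assumptions. A missing string suggests dynamic loading, but it does not prove it, since the text could also be escaped or encoded differently in the source.

```python
def is_dynamically_loaded(page_source, visible_text):
    """Return True if text seen in the browser is absent from the raw response."""
    return visible_text not in page_source

# Hypothetical page: the markup ships an empty container that a script fills later.
raw_html = (
    '<html><body><div id="list"></div>'
    '<script src="app.js"></script></body></html>'
)

print(is_dynamically_loaded(raw_html, 'Movie A'))    # True
print(is_dynamically_loaded(raw_html, 'id="list"'))  # False
```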

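Analysis: the data on both the ** home page and the company detail pages is dynamically loaded.

Analysis: how is one company's detail-page data obtained? Through an ajax (POST) request that carries one parameter, id. Only the value of id differs from company to company.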

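# Request the id of every company
url = ' '
# POST form data (contents missing from the original notes)
data = {}
fp = open('./company.txt', 'w', encoding='utf-8')
# The .json() return value contains each company's id
data_dic = requests.post(url=url, data=data, headers=headers).json()
# Parse out the ids
for dic in data_dic['list']:
    _id = dic['id']
    # print(_id)
    # Fetch the detail data of the company for each id (send a request)
    post_url = ''
    # Contents missing from the original notes; per the analysis it carries the id
    post_data = {}
    # The .json() return value is the detail data of one company
    detail_dic = requests.post(url=post_url, data=post_data, headers=headers).json()
    company_title = detail_dic['epsname']
    # Fixed: the original called delattr() here instead of indexing detail_dic
    address = detail_dic['epsproductaddress']
    fp.write(company_title + ':' + address + '\n')
    print(company_title, 'scraped successfully!')
fp.close()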
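The list-then-detail flow above can be separated from the network so the logic is checkable on its own. In this sketch the function takes any post(url, data) callable that returns parsed JSON. The fake responses, the 'list' and 'detail' URL labels, and the empty form data for the list request are hypothetical; the keys list, id, epsname and epsproductaddress come from the notes.

```python
def collect_companies(post, list_url, detail_url):
    """Fetch the id list, then one detail record per id."""
    results = []
    for dic in post(list_url, {})['list']:
        # One POST per company; only the id differs between requests.
        detail = post(detail_url, {'id': dic['id']})
        results.append((detail['epsname'], detail['epsproductaddress']))
    return results

def fake_post(url, data):
    """Stand-in for the real POST requests, returning made-up records."""
    if url == 'list':
        return {'list': [{'id': 'a1'}, {'id': 'b2'}]}
    return {
        'epsname': 'Company ' + data['id'],
        'epsproductaddress': 'Address ' + data['id'],
    }

companies = collect_companies(fake_post, 'list', 'detail')
print(companies)
```

In a real run, post could be a small wrapper around requests.post(url=url, data=data, headers=headers).json().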
