Python Web Crawler Notes 01

Notes on the book 《精通Python網路爬蟲》 (Mastering Python Web Crawlers)

Most of what follows comes from this book; these are only my personal notes.

Using urllib.request and saving the crawled content to an HTML file

Example:

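import urllib.request
url = ""  # target URL (left blank in these notes)
file = urllib.request.urlopen(url)
# The response is a stream that can only be consumed once, so use ONE of the three reads below
data = file.read()  # read everything; returns a bytes object
# dataline = file.readline()  # read a single line (bytes)
# datalines = file.readlines()  # read everything; returns a list of bytes lines
# Print the content of data
print(data)
# Save the fetched content to an HTML file (method 1)
# Steps: assign the fetched content to a variable -> open a local file named *.html in binary write mode -> write the variable to the file -> close the file
fhandle = open("f:/htmls/1.html","wb")
fhandle.write(data)
fhandle.close()
# Write the fetched content to a file (method 2): urllib.request.urlretrieve(url, filename=local path)
filename = urllib.request.urlretrieve(url,filename = "f:/htmls/2.html")
# Clear the cache left behind by urlretrieve()
urllib.request.urlcleanup()
# Return information about the current environment (the response headers)
print("Environment info: " + str(file.info()))
# Return the HTTP status code of the page; 200 means success
print("Status code: " + str(file.getcode()))
# Return the URL of the page
print("Page URL: " + str(file.geturl()))
# Encoding and decoding
# Chinese characters and some symbols such as & are not valid in a URL and must be encoded
print("Encoding and decoding:")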
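The notes stop right after announcing the encoding step. A minimal sketch of what it could look like, assuming the standard-library functions urllib.parse.quote and urllib.parse.unquote; the sample strings are illustrative and not taken from the book:

```python
import urllib.parse

# Percent-encode text that is not allowed in a URL as-is.
# Non-ASCII text is first converted to UTF-8 bytes, then each byte is escaped.
encoded = urllib.parse.quote("你好")
print(encoded)

# '&' is escaped as well, because it normally separates query parameters.
ampersand = urllib.parse.quote("a&b")
print(ampersand)

# unquote() reverses the encoding.
decoded = urllib.parse.unquote(encoded)
print(decoded)
```

The same two functions are also reachable as urllib.request.quote and urllib.request.unquote, but only urllib.parse documents them.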

Example:

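# Making the crawler look like a browser
# Some pages cannot be crawled and return a 403 error
# Method 1: use build_opener() to modify the request headers
# (needed because urlopen() does not support some advanced HTTP features)
# Process: define the header tuple headers -> create an opener with urllib.request.build_opener() -> set opener.addheaders = [headers] ->
# call opener.open(url).read() to read the page content
import urllib.request
url = ""  # target URL (left blank in these notes)
headers = ("User-Agent", "<User-Agent string copied from your browser>")  # placeholder value
opener = urllib.request.build_opener()
opener.addheaders = [headers]
data = opener.open(url).read()
fhandle = open("f:/htmls/3.html","wb")
fhandle.write(data)
fhandle.close()
# Method 2: use add_header() to add the header
req = urllib.request.Request(url)
# Note that it takes two arguments: object.add_header(field name, field value)
req.add_header(headers[0], headers[1])
data2 = urllib.request.urlopen(req).read()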
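As a variation on method 2, the header can also be passed when the Request is constructed. The sketch below assumes a placeholder URL and a made-up User-Agent string (both hypothetical), and shows how to check the header without sending anything:

```python
import urllib.error
import urllib.request

# Hypothetical values, for illustration only.
URL = "http://example.com/"
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"

def build_request(url, user_agent):
    """Return a Request that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch(url, user_agent, timeout=10):
    """Fetch url; return the body as bytes, or None on an HTTP error status."""
    req = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        # For example 403, if the server still rejects the request.
        print("HTTP error:", err.code)
        return None

req = build_request(URL, USER_AGENT)
# Request stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
```

Calling fetch(URL, USER_AGENT) would perform the actual request; it is left out here so that the sketch runs without network access.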

Timeout settings

# Timeout settings
import urllib.request
# Set the timeout value, in seconds
file = urllib.request.urlopen("",timeout=30)
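When the timeout expires, urlopen() raises an exception instead of returning. A minimal sketch of one way to handle that; the helper name fetch_with_timeout is hypothetical and not from the book:

```python
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url, timeout=30):
    """Return the response body as bytes, or None if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read()
    except (urllib.error.URLError, socket.timeout) as err:
        # URLError covers connection failures (including connect timeouts);
        # socket.timeout is raised when the server is too slow to respond.
        print("request failed:", err)
        return None
```

A short timeout makes a crawler give up quickly on slow hosts; a longer one tolerates slow pages at the cost of throughput.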

1. Build a Request object with the target URL as its argument.

2. Open the constructed Request with urlopen().

3. Process the fetched content as needed.

import urllib.request
keywd = "hello"
url = "/s?wd=" + keywd  # the scheme and host are missing from these notes; prepend them before running
req = urllib.request.Request(url)
data = urllib.request.urlopen(req).read()
fhandle = open("f:/htmls/4.html","wb")
fhandle.write(data)
fhandle.close()

Example:

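import urllib.request
import urllib.parse
url = "/s?wd="  # the scheme and host are missing from these notes; prepend them before running
# A Chinese keyword causes an encoding problem unless it is percent-encoded first
key = "你好"
key_code = urllib.parse.quote(key)
url_all = url + key_code
req = urllib.request.Request(url_all)
data = urllib.request.urlopen(req).read()
fh = open("f:/htmls/5.html","wb")
fh.write(data)
fh.close()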
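When a query has more than one parameter, urllib.parse.urlencode can build the whole query string and percent-encode every value in one step. A small sketch, assuming a placeholder endpoint and an extra parameter named pn (both hypothetical; only wd appears in these notes):

```python
import urllib.parse

BASE_URL = "http://example.com/s"      # hypothetical endpoint
params = {"wd": "你好", "pn": 10}       # "pn" is a made-up second parameter

# urlencode() escapes each key and value and joins the pairs with '&'.
query = urllib.parse.urlencode(params)
url_all = BASE_URL + "?" + query
print(url_all)
```

The resulting url_all can be passed to urllib.request.Request exactly as in the example above.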

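1. Set the URL.

2. Build the form data and encode it with urllib.parse.urlencode.

3. Create a Request object whose arguments are the URL and the data to send.

4. Use add_header() to add header information and imitate a browser.

5. Open the Request object with urllib.request.urlopen() to send the data.

6. Carry out any further processing.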

Example:

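# POST requests: logging in, registering, and similar operations
import urllib.request
import urllib.parse
url = ""  # target URL (left blank in these notes)
form_fields = {}  # the form fields are missing from these notes; fill in field name -> value pairs
# Encode the data with urlencode, then convert it to UTF-8 bytes with encode()
postdata = urllib.parse.urlencode(form_fields).encode('utf-8')
req = urllib.request.Request(url,postdata)
data = urllib.request.urlopen(req).read()
fh = open("f:/htmls/6.html","wb")
fh.write(data)
fh.close()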
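The POST steps can be checked offline by building the Request and inspecting it before anything is sent. This is only a sketch: the URL, the field names name and pass, and the User-Agent string are all hypothetical stand-ins for the values missing from these notes.

```python
import urllib.parse
import urllib.request

# Hypothetical values, for illustration only.
URL = "http://example.com/login"
FORM = {"name": "alice", "pass": "secret"}
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"

def build_post_request(url, form, user_agent):
    """Encode the form and return a Request that will be sent as POST."""
    postdata = urllib.parse.urlencode(form).encode("utf-8")
    req = urllib.request.Request(url, data=postdata)
    req.add_header("User-Agent", user_agent)
    return req

req = build_post_request(URL, FORM, USER_AGENT)
print(req.get_method())  # a Request with data defaults to POST
print(req.data)
# To actually send it: data = urllib.request.urlopen(req).read()
```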

Example:

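# Using a proxy server
# Proxy server address
proxy_addr = "118.212.137.135:31288"
# use_proxy(proxy_addr, url) is a helper whose definition is missing from these notes
data = use_proxy(proxy_addr,"")
print(len(data))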
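The helper called above is not defined anywhere in these notes. One plausible shape for it, assuming it routes HTTP traffic through urllib.request.ProxyHandler; the function names and the sample proxy address are hypothetical:

```python
import urllib.request

def build_proxy_opener(proxy_addr):
    """Return an opener that sends plain-HTTP requests through proxy_addr."""
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    return urllib.request.build_opener(proxy)

def use_proxy(proxy_addr, url, timeout=10):
    """Fetch url through the proxy and return the body as bytes."""
    opener = build_proxy_opener(proxy_addr)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()

# Hypothetical local proxy address, used only to show the construction.
opener = build_proxy_opener("127.0.0.1:8080")
```

This sketch calls opener.open() directly rather than urllib.request.install_opener(), so the proxy applies only to requests made through this opener and not to every later urlopen() call.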

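1. Create urllib.request.HTTPHandler() and urllib.request.HTTPSHandler() objects, each with debuglevel set to 1.

2. Use urllib.request.build_opener() to create a custom opener object, passing the values set in step 1 as arguments.

3. Use urllib.request.install_opener() to make it the global default opener, so that urlopen() uses the opener we installed.

4. Continue with the usual operations, such as urlopen().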

# Print a debug log while the program runs
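The code for this part is missing from the notes. A minimal sketch of the four steps listed above, assuming only standard-library calls:

```python
import urllib.request

# Step 1: handlers with debuglevel=1 print the HTTP exchange to stdout.
http_handler = urllib.request.HTTPHandler(debuglevel=1)
https_handler = urllib.request.HTTPSHandler(debuglevel=1)

# Step 2: build a custom opener from those handlers.
opener = urllib.request.build_opener(http_handler, https_handler)

# Step 3: install it as the global default used by urlopen().
urllib.request.install_opener(opener)

# Step 4: any later urllib.request.urlopen(...) call now goes through
# this opener and prints the request and response headers as it runs.
```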

To be continued~
