Python3 Web Crawler Development in Practice (Python3 網路爬蟲開發實戰, by Cui Qingcai), Chapter 3

3.1 urllib

urllib is Python's built-in HTTP request library.

urlopen()

The urllib.request.urlopen() function sends a request to the target URL and returns the response.

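import urllib.request

# The target URL is left empty in the source; fill it in before running.
response = urllib.request.urlopen('')  # response is an http.client.HTTPResponse object
# read() returns bytes, so decode() is needed to turn the page into a str
print(response.read().decode('utf-8'))
print(response.status)  # status code of the response
print(response.getheaders())  # all response headers as a list of (name, value) tuples
print(response.getheader('Server'))  # value of the Server header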
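
Hard-coding 'utf-8' works for most pages but not for all of them. As a small sketch of one alternative, the helper below reads the charset from the Content-Type header and falls back to UTF-8. The name decode_body, the fallback choice, and the sample header are assumptions made for illustration and are not part of the book's code; get_content_charset() is a standard method of the message object that response.headers returns.

```python
from email.message import Message


def decode_body(raw, headers, default='utf-8'):
    """Decode raw bytes using the charset declared in headers, if any."""
    charset = headers.get_content_charset() or default
    # errors='replace' keeps a mislabeled page from raising UnicodeDecodeError
    return raw.decode(charset, errors='replace')


# Offline stand-in for response.headers (hypothetical header value)
fake_headers = Message()
fake_headers['Content-Type'] = 'text/html; charset=gbk'
text = decode_body('中文'.encode('gbk'), fake_headers)
```

With a live response the call would be decode_body(response.read(), response.headers).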

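Function signature: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Note that cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; on current versions, pass context instead.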

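import urllib.parse
import urllib.request

# urlencode() turns the parameter dict into 'word=hello'; bytes() then encodes it
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
# The target URL is left empty in the source. Passing data makes this a POST request.
response = urllib.request.urlopen('', data=data)
print(response.read())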

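This passes one parameter, word, with the value hello. The data argument must be of type bytes. The conversion uses bytes(): its first argument has to be a str, so urllib.parse.urlencode() first turns the parameter dict into a query string, and its second argument sets the encoding, here utf8.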
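
The two conversion steps can be looked at separately without sending anything. The sketch below uses the word=hello parameter from the text; the second dict is an extra illustration with made-up values.

```python
import urllib.parse

params = {'word': 'hello'}

# Step 1: dict -> str in application/x-www-form-urlencoded form
query = urllib.parse.urlencode(params)
# Step 2: str -> bytes, the type urlopen() requires for data
data = bytes(query, encoding='utf8')
# query.encode('utf8') is an equivalent, more common spelling
same = query.encode('utf8')

# urlencode() also percent-escapes characters that are not URL-safe
escaped = urllib.parse.urlencode({'word': 'hello world', 'lang': '中文'})
```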

import urllib.request

# timeout is given in seconds; the target URL is left empty in the source
response = urllib.request.urlopen('', timeout=1)
print(response.read())

# Setting a timeout lets the crawler skip a page that takes too long to respond
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('time out')
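
One caveat with the pattern above: in CPython, as far as I can tell, a timeout while connecting is wrapped in URLError, but a timeout while waiting for or reading the response is raised directly as socket.timeout. The following is a minimal sketch of a helper that treats both cases as "skip this page"; the name fetch_or_skip and the choice to return None are assumptions for illustration.

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=1.0):
    """Return the response body as bytes, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        # Timeouts while connecting or sending arrive wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return None
        raise  # any other URL or HTTP error is left to the caller
    except socket.timeout:
        # Timeouts while waiting for or reading the response are raised directly
        return None
```

A crawler loop can then skip any URL for which fetch_or_skip() returns None and carry on with the next one.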

Request

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

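from urllib import request, parse

url = ''  # the target URL is left empty in the source
# The source sets User-Agent and Host here, but their values are missing
headers = {'User-Agent': '', 'Host': ''}
form = {}  # the form fields are missing in the source
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))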

The Request is built from four arguments: url is the request URL, headers sets User-Agent and Host, data is the form converted to bytes with urlencode() and bytes(), and method sets the request method to POST.
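
Because a Request is only a description of the request, it can be built and inspected without any network access. The sketch below does that; the URL, the User-Agent string, and the form field are placeholders chosen for illustration, not values from the book.

```python
from urllib import parse, request

# Placeholder URL and header/form values, chosen only for illustration
url = 'http://example.com/post'
headers = {'User-Agent': 'demo-client/0.1'}
data = parse.urlencode({'name': 'value'}).encode('utf8')

req = request.Request(url=url, data=data, headers=headers, method='POST')

# Nothing is sent until urlopen(req) is called, so the object can be inspected offline
method = req.get_method()
user_agent = req.get_header('User-agent')  # Request stores header names capitalized
body = req.data

# Without an explicit method, the presence of data decides between POST and GET
implicit_post = request.Request(url, data=data).get_method()
implicit_get = request.Request(url).get_method()
```

Passing method explicitly is therefore mostly useful for verbs other than GET and POST, or to make the intent obvious to a reader.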

Advanced usage

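urllib provides a range of handlers: some deal with login authentication, some with cookies, and some with proxy settings. With them, almost anything an HTTP request involves can be done.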

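1. Authentication

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

url = ''  # the target URL is not given in the source
p = HTTPPasswordMgrWithDefaultRealm()
p.add_password(None, url, '', '')  # the username and password are not given in the source
# HTTPBasicAuthHandler takes the password manager, which gives a handler that deals with authentication
auth_handler = HTTPBasicAuthHandler(p)
# build_opener() wraps the handler in an opener that answers authentication challenges when sending requests
opener = build_opener(auth_handler)

try:
    # open() fetches the URL; the result is the source of the page behind the login
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)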
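
To see what the password manager contributes, the lookup can be run by hand. This is a sketch under assumed values: the URL and the credentials alice/secret are placeholders, and find_user_password() is the method the handler consults when a server answers with 401.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

# Placeholder URL and credentials, chosen only for illustration
protected_url = 'http://example.com/private/'
password_mgr = HTTPPasswordMgrWithDefaultRealm()
# realm=None registers the credentials as the default for any realm under this URL
password_mgr.add_password(None, protected_url, 'alice', 'secret')

# The lookup performed when a server answers 401 for a page under that URL
found = password_mgr.find_user_password('Any Realm', protected_url + 'index.html')
# A URL on a different host has no registered credentials
missing = password_mgr.find_user_password(None, 'http://other.example/')

opener = build_opener(HTTPBasicAuthHandler(password_mgr))
# opener.open(protected_url) would retry with these credentials after a 401
```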

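2. Proxies

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# The proxy dict is missing in the source; this one is inferred from the
# description that follows (a local proxy listening on port 9743)
proxy_handler = ProxyHandler({'http': 'http://127.0.0.1:9743'})
opener = build_opener(proxy_handler)
try:
    response = opener.open('')  # the target URL is left empty in the source
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)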

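A proxy was set up locally, running on port 9743.

ProxyHandler takes a dict whose keys are protocol types, such as http or https, and whose values are proxy URLs; several proxies can be added.

build_opener() then builds an opener from this handler, and the request is sent through it.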
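
The following sketch builds two openers without sending any request: one with the local proxy described above, and one with an empty dict, which switches proxying off. The https entry and the helper has_proxy_handler are assumptions for illustration; opener.handlers is an attribute of CPython's OpenerDirector and is inspected here only to make the difference visible.

```python
from urllib.request import ProxyHandler, build_opener

# The address follows the text (a local proxy on port 9743); the https entry is an assumption
proxies = {
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743',
}
proxy_handler = ProxyHandler(proxies)
proxy_opener = build_opener(proxy_handler)

# An empty dict disables proxies, including any taken from environment variables
direct_opener = build_opener(ProxyHandler({}))


def has_proxy_handler(opener):
    """Report whether a ProxyHandler is installed in the opener's handler chain."""
    return any(isinstance(h, ProxyHandler) for h in opener.handlers)
```

Passing an explicit ProxyHandler also keeps build_opener() from adding its default one, so the opener uses exactly the mapping it was given.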

3. Cookies

Handling cookies needs the cookie-related handlers.

import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()  # create a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)  # build a handler with HTTPCookieProcessor
opener = urllib.request.build_opener(handler)  # build an opener with build_opener()
response = opener.open('')  # call open(); the target URL is left empty in the source
for item in cookie:
    print(item.name + "=" + item.value)
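
Since the target URL is not given, here is a self-contained sketch that exercises the same CookieJar, HTTPCookieProcessor, and build_opener chain against a throwaway local server. The server class, the cookie session=abc123, and the function names are all hypothetical and exist only to make the example runnable offline.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieSettingHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: answers every GET with one Set-Cookie header."""

    def do_GET(self):
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123; Path=/')
        self.send_header('Content-Length', '0')
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def collect_cookies(url):
    """Open url with a cookie-aware opener and return {name: value}."""
    jar = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(jar)
    # The empty ProxyHandler keeps the local request off any system proxy
    opener = urllib.request.build_opener(handler, urllib.request.ProxyHandler({}))
    with opener.open(url) as response:
        response.read()
    return {cookie.name: cookie.value for cookie in jar}


def demo():
    """Start the local server on a free port, fetch once, and return the cookies."""
    server = HTTPServer(('127.0.0.1', 0), CookieSettingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return collect_cookies('http://127.0.0.1:%d/' % server.server_port)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == '__main__':
    print(demo())
```

Against a real site only collect_cookies() is needed; the jar fills with whatever cookies that site sets.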
