Crawler Libraries: urllib

Official documentation:

urllib is Python's built-in HTTP request library.

It contains the following modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.

Changes compared with Python 2:

Python 2

import urllib2
response = urllib2.urlopen('')

Python 3

import urllib.request
response = urllib.request.urlopen('')
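
For code that has to run under both interpreters, one common pattern is to try the Python 3 import first and fall back to the Python 2 name. This is only a minimal sketch; the alias urlopen is a convenience chosen for this example.

```python
# Import urlopen under a single name on either Python version.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# From here on the rest of the script can call urlopen(url) directly.
print(urlopen.__module__)
```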

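The urllib.request module

Test URL:

urlopen

Sends a request to the server (GET).

With GET, the parameters can be written directly into the URL, so you simply build a URL that carries them.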

import urllib.request
response = urllib.request.urlopen('')
print(response.read().decode('utf-8'))
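
Building the parameterized URL by hand is error-prone once values contain spaces or non-ASCII characters. The sketch below shows one way to let urllib.parse.urlencode do the escaping. The helper name build_get_url, the host example.com, and the parameter names are all assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_get_url(base_url, params):
    """Return base_url with params appended as a percent-encoded query string."""
    query = urlencode(params)
    # Use '&' if the base URL already has a query string, otherwise '?'.
    separator = '&' if '?' in base_url else '?'
    return base_url + separator + query


# Hypothetical endpoint and parameters.
url = build_get_url('http://example.com/get', {'q': 'python urllib', 'page': 2})
print(url)  # http://example.com/get?q=python+urllib&page=2
```

The resulting string can be passed straight to urllib.request.urlopen.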

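urlopen() generally takes three parameters: urllib.request.urlopen(url, data, timeout)

urlopen returns a response object whose read() method returns bytes rather than str. This is because urlopen cannot automatically determine the encoding of the byte stream it receives from the HTTP server. The return value looks like this:

response.read() returns the response content, which you then decode yourself.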
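
Since the bytes have to be decoded by hand, it can help to take the charset from the Content-Type header instead of hard-coding 'utf-8'. The following is a minimal sketch under that assumption. read_text is a hypothetical helper name, and the UTF-8 fallback is a choice made for this example, not something the library prescribes.

```python
def read_text(response, fallback='utf-8'):
    """Decode a response body using the charset the server declared, if any."""
    # response.headers is an email.message.Message subclass, so
    # get_content_charset() returns e.g. 'gbk', or None when no charset is declared.
    charset = response.headers.get_content_charset() or fallback
    # errors='replace' keeps going on undecodable bytes instead of raising.
    return response.read().decode(charset, errors='replace')


# Usage (needs a reachable URL):
#   import urllib.request
#   text = read_text(urllib.request.urlopen(url))
```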

data

A POST request must pass a data argument. Unlike GET, POST does not expose the parameters in the URL; they travel in the request body.

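import urllib.parse
import urllib.request
# The form fields were lost from the original post; {'key': 'value'} is a placeholder.
form = {'key': 'value'}
data = bytes(urllib.parse.urlencode(form), encoding='utf8')
response = urllib.request.urlopen('/post', data=data)
print(response.read())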
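
To see what the data argument actually contains, and how its presence switches the request to POST, the pieces can be inspected without sending anything. A small sketch; the form fields and the example.com URL are made up for illustration.

```python
from urllib import parse, request

# Hypothetical form fields.
form = {'word': 'hello', 'lang': 'en'}

# urlencode() produces a str; urlopen() requires bytes, hence the encode().
data = parse.urlencode(form).encode('utf-8')
print(data)  # b'word=hello&lang=en'

# A Request that carries data defaults to POST; without data it defaults to GET.
post_req = request.Request('http://example.com/post', data=data)
get_req = request.Request('http://example.com/get')
print(post_req.get_method())  # POST
print(get_req.get_method())   # GET
```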

timeout

import urllib.request
response = urllib.request.urlopen('/get', timeout=1)
print(response.read())

If timeout is made smaller, the request fails with: socket.timeout: timed out

Catching exceptions with try...except

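import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('/get', timeout=0.1)
except urllib.error.URLError as e:
    # A timeout while connecting is usually wrapped in URLError.
    print(e.reason)
except socket.timeout as e:
    # A timeout while reading the response is raised directly.
    print(e)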
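
In a crawler, a timed-out request is often worth retrying a couple of times before giving up. The sketch below wraps the try...except pattern above in a helper. The name fetch, the retry count, and the injectable opener parameter (there only so the function can be exercised without a network) are assumptions of this example.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5.0, retries=2, opener=urllib.request.urlopen):
    """Return the response body as bytes, or None if every attempt fails."""
    for attempt in range(retries + 1):
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError as e:
            # The server answered with an error status; give up without retrying.
            # HTTPError is a subclass of URLError, so it must be caught first.
            print('HTTP error:', e.code)
            return None
        except (urllib.error.URLError, socket.timeout) as e:
            # Connection problems and timeouts: log and try again.
            print('attempt %d failed: %s' % (attempt + 1, e))
    return None
```

Calling fetch(url, timeout=0.1) makes at most three attempts and returns None if all of them time out.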

Response

import urllib.request
response = urllib.request.urlopen('')
print(type(response))

Result (for an HTTP URL):

<class 'http.client.HTTPResponse'>

Status code and response headers

import urllib.request
response = urllib.request.urlopen('')
print('status:', response.status)    # 200 means the request succeeded
print('headers:', response.getheaders())
print('server:', response.getheader('server'))  # pass a name to get one specific header
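
When many pages are fetched, it is convenient to collect the status and a few headers into one record. This is one possible sketch; summarize is a hypothetical helper, and the choice of fields is arbitrary. It reads response.headers, which on an HTTPResponse holds the same information that getheaders() returns.

```python
def summarize(response):
    """Collect the status code and a few common headers into a dict."""
    headers = response.headers  # header lookup by name is case-insensitive
    return {
        'status': response.status,
        'content_type': headers.get_content_type(),  # media type without parameters
        'server': headers.get('Server'),             # None if the header is absent
    }


# Usage (needs a reachable URL):
#   import urllib.request
#   print(summarize(urllib.request.urlopen(url)))
```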

response.read() returns the response body

import urllib.request
response = urllib.request.urlopen('')
print(response.read().decode('utf-8'))

A Request object can also be passed to urlopen; the result is the same as in the previous example.

import urllib.request
request = urllib.request.Request('')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

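Setting headers

Some sites can only be accessed when the request carries the right headers, so you need to add request headers that make the client look like a browser. A POST request: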

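from urllib import request, parse
url = '/post'
# The original values of headers and dic were lost; the ones below are placeholders.
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
dic = {'key': 'value'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf8'))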

A second way to add request headers: add_header()

import json
from urllib import request, parse
url = '/post'
# The original form fields were lost; {'key': 'value'} is a placeholder.
form = {'key': 'value'}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
content = response.read().decode('utf-8')
ret_dict = json.loads(content)  # assumes the server replies with a JSON object
for key in ret_dict:
    print('%s: %s' % (key, ret_dict[key]))
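
Both ways of setting headers can be checked before anything is sent, because a Request object can be inspected offline. A small sketch follows; the URL, the header values, and the form data are invented for this example. One detail worth knowing is that Request stores header names via str.capitalize(), so a lookup has to use that spelling.

```python
from urllib import request

# Hypothetical URL, body, and header values.
req = request.Request(
    'http://example.com/post',
    data=b'key=value',
    headers={'User-Agent': 'demo-client/1.0'},  # first way: the headers argument
    method='POST',
)
req.add_header('Accept-Language', 'en')          # second way: add_header()

# Names are stored as 'User-agent' and 'Accept-language'.
print(req.get_header('User-agent'))       # demo-client/1.0
print(req.get_header('Accept-language'))  # en
print(req.get_header('User-Agent'))       # None, because the lookup is case-sensitive
print(req.get_method())                   # POST
```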
