Python Crawler Basics 4

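Proxy settings

By default, urllib2 uses the http_proxy environment variable to set the HTTP proxy. If you want to control the proxy explicitly in your program, unaffected by the environment variable, you can use a ProxyHandler.

A simple example: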

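import urllib2

enable_proxy = True
# With no argument, ProxyHandler reads the proxy settings from the environment.
proxy_handler = urllib2.ProxyHandler()
# An empty dict means no proxy at all.
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
# Make this opener the global default used by urllib2.urlopen().
urllib2.install_opener(opener)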

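One detail to note here: urllib2.install_opener() sets urllib2's global opener. That makes later calls convenient, but it rules out finer control, such as using two different proxy settings in the same program. A better practice is to leave the global setting alone and call the opener's open() method directly instead of the global urlopen().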
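
As a minimal sketch of that advice, the code here builds two independent openers and never calls install_opener(). The proxy address http://proxy.example.com:8080 and the names proxied_opener, direct_opener and fetch are placeholders that do not come from the original text. The try/except import is only there so the sketch also runs on Python 3, where the same classes live in urllib.request.

```python
try:
    import urllib2  # Python 2, as used throughout this article
except ImportError:
    import urllib.request as urllib2  # Python 3 home of the same classes

# Hypothetical proxy address, for illustration only.
PROXY_URL = 'http://proxy.example.com:8080'

# One opener that sends HTTP traffic through the proxy...
proxied_opener = urllib2.build_opener(urllib2.ProxyHandler({'http': PROXY_URL}))
# ...and one that ignores every proxy, including the http_proxy variable.
direct_opener = urllib2.build_opener(urllib2.ProxyHandler({}))


def fetch(opener, url, timeout=10):
    """Fetch url through the given opener without touching the global one."""
    response = opener.open(url, timeout=timeout)
    try:
        return response.read()
    finally:
        response.close()
```

Calling fetch(proxied_opener, some_url) and fetch(direct_opener, some_url) in the same program then uses two different proxy settings side by side.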

Timeout setting

In Python 2.6 and later, the timeout (in seconds) can be set directly through the timeout parameter of urllib2.urlopen().

import urllib2
response = urllib2.urlopen('', timeout=10)

Adding specific headers to an HTTP request

import urllib2
request = urllib2.Request('')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()

Some headers need particular attention because servers check them, for example the User-Agent header set above.

Redirect

By default, urllib2 automatically follows redirects for HTTP 3xx status codes, with no manual configuration needed. To detect whether a redirect happened, just check whether the response URL matches the request URL.

import urllib2
my_url = ''
response = urllib2.urlopen(my_url)
# A redirect happened if the final URL differs from the requested one.
redirected = response.geturl() != my_url
print redirected
my_url = ''
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url
print redirected

If you do not want automatic redirects, you can define your own HTTPRedirectHandler subclass.
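
One possible shape for such a subclass is sketched here. It overrides redirect_request() to return None, which is one way to stop urllib2 from following a redirect. The class name NoRedirectHandler and the variable no_redirect_opener are made up for this illustration, and the try/except import only keeps the sketch runnable on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3


class NoRedirectHandler(urllib2.HTTPRedirectHandler):
    """Redirect handler that declines to follow any 3xx response."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None tells urllib2 not to build a follow-up request,
        # so the 3xx response ends up raised as an HTTPError.
        return None


# Passing the instance replaces the default HTTPRedirectHandler.
no_redirect_opener = urllib2.build_opener(NoRedirectHandler())
```

With this opener, no_redirect_opener.open(url) raises urllib2.HTTPError for a 301 or 302 response, and the exception's code attribute carries the status code.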

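urllib2 can also handle cookies automatically once an HTTPCookieProcessor backed by a cookielib.CookieJar is added to the opener. If you need the value of a particular cookie, you can read it from the jar like this: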
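
A minimal sketch of the setup this relies on, under the assumption that the jar is named cookie. The names cookie_opener and cookie_value are hypothetical helpers introduced for this example, and the try/except import only keeps it runnable on Python 3, where cookielib is called http.cookiejar.

```python
try:
    import urllib2  # Python 2
    import cookielib
except ImportError:
    import urllib.request as urllib2  # Python 3
    import http.cookiejar as cookielib

# The jar collects every cookie received through cookie_opener.
cookie = cookielib.CookieJar()
cookie_opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))


def cookie_value(jar, name):
    """Return the value of the first cookie called name, or None if absent."""
    for item in jar:
        if item.name == name:
            return item.value
    return None
```

After a call such as cookie_opener.open(url), cookie_value(cookie, 'some_cookie_name') looks up a single cookie by name.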

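for item in cookie:  # cookie is the cookielib.CookieJar attached to the opener
    print 'value = ' + item.value

Using the HTTP PUT and DELETE methods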

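urllib2 only supports the HTTP GET and POST methods. To use HTTP PUT or DELETE you would normally have to drop down to the lower-level httplib library. Even so, urllib2 can be made to send a PUT or DELETE request in the following way: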

import urllib2
# uri and data are the target address and request body, defined elsewhere.
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
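
The same trick can be wrapped in a small helper and checked without sending anything over the network. This is only a sketch: build_request, the example.com URL and the payload are assumptions made for the illustration, and the try/except import keeps it runnable on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3


def build_request(url, method, data=None):
    """Return a Request whose HTTP verb is forced to method."""
    request = urllib2.Request(url, data=data)
    # urllib2 asks get_method() for the verb when it sends the request.
    request.get_method = lambda: method
    return request


# Hypothetical resource URL and payload, for illustration only.
put_request = build_request('http://example.com/items/1', 'PUT', data=b'name=demo')
delete_request = build_request('http://example.com/items/1', 'DELETE')
```

Either request can then be passed to urllib2.urlopen() or to an opener's open() method as usual.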

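Debug log

When using urllib2, you can turn on the debug log by building the opener from HTTPHandler and HTTPSHandler instances created with debuglevel=1. The content sent and received is then printed to the screen, which makes debugging easier and sometimes saves you from capturing packets.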
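
A minimal sketch of such an opener follows. The function name build_debug_opener and the variable debug_opener are made up for this example, and the try/except import only keeps it runnable on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3


def build_debug_opener():
    """Return an opener whose HTTP(S) handlers print requests and reply headers."""
    handlers = [urllib2.HTTPHandler(debuglevel=1)]
    # HTTPSHandler is only defined when Python was built with SSL support.
    if hasattr(urllib2, 'HTTPSHandler'):
        handlers.append(urllib2.HTTPSHandler(debuglevel=1))
    return urllib2.build_opener(*handlers)


debug_opener = build_debug_opener()
```

Pass debug_opener to urllib2.install_opener() if plain urllib2.urlopen() calls should log as well. Otherwise call debug_opener.open() directly.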

response = urllib2.urlopen('')

Form handling

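Logging in requires filling in a form. First use a tool to capture the content of the submitted form, then find your own POST request and its form fields.

Taking VeryCD as an example, the fields username, password, continueuri, fk and login_submit need to be filled in: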

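# -*- coding: utf-8 -*-
import urllib
import urllib2
# The form fields listed above; fill in the values captured from your own login request.
postdata = urllib.urlencode({
    'username': '',
    'password': '',
    'continueuri': '',
    'fk': '',
    'login_submit': '',
})
req = urllib2.Request(
    url='',
    data=postdata
)
result = urllib2.urlopen(req)
print result.read()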
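
To see what the encoded form body looks like, here is a small sketch that builds the POST request without sending it. The field values, the example.com URL and the helper build_login_request are placeholders invented for the illustration, and the try/except import only keeps the code runnable on Python 3, where urlencode lives in urllib.parse.

```python
try:
    import urllib2  # Python 2
    from urllib import urlencode
except ImportError:
    import urllib.request as urllib2  # Python 3
    from urllib.parse import urlencode


def build_login_request(url, fields):
    """Encode fields as a form body and wrap them in a POST request."""
    # A list of (name, value) pairs keeps the field order predictable.
    body = urlencode(fields).encode('ascii')
    return urllib2.Request(url, data=body)


# Placeholder values; real ones come from the captured login request.
login_fields = [('username', 'alice'), ('password', 's3cret')]
login_request = build_login_request('http://example.com/signin', login_fields)
```

Because the request carries a data body, urllib2 sends it as a POST rather than a GET.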

Masquerading as a browser

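headers = {'User-Agent': ''}  # set this to a real browser's User-Agent string
req = urllib2.Request(url='', data=postdata, headers=headers)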

Dealing with anti-hotlinking

headers = {'Referer': ''}  # set this to a page on the target site

headers is a dict, so you can put any header you like into it to disguise the request.
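
Both disguises can be combined in one dict, as this sketch shows. The User-Agent string, the Referer and the example.com URLs are placeholders rather than values from the original text, and the try/except import only keeps the code runnable on Python 3.

```python
try:
    import urllib2  # Python 2
except ImportError:
    import urllib.request as urllib2  # Python 3

# Placeholder values: copy the User-Agent from a real browser and use a
# Referer that belongs to the site being fetched.
headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; example-client)',
    'Referer': 'http://example.com/articles',
}
req = urllib2.Request('http://example.com/image.jpg', headers=headers)
```

urllib2 normalizes header names with str.capitalize(), so when reading them back with req.get_header() the first one is looked up as 'User-agent'.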
