Python Crawler Basics: Notes (Part 1)

These notes are from a course on imooc (慕課網):

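Example:
import urllib2, cookielib
# Create a cookie container
cj = cookielib.CookieJar()
# Create an opener that uses the cookie container
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Install the opener into urllib2
urllib2.install_opener(opener)
# Access the page using urllib2 with cookie support
response = urllib2.urlopen("")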

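Some pages can only be accessed after the user logs in. These need cookie handling, which HTTPCookieProcessor provides.

Some pages can only be accessed through a proxy. For these, use ProxyHandler.

Some pages are served over the encrypted HTTPS protocol. For these, use HTTPSHandler.

Some URLs redirect to one another. For these, use HTTPRedirectHandler.

Pass the handler to the build_opener method to create an opener object: opener = urllib2.build_opener(handler)

Then pass the opener object to urllib2.install_opener(opener). After that, urllib2 is able to handle these scenarios.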
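The handler-to-opener flow described above can be sketched in Python 3, where urllib2 became urllib.request and cookielib became http.cookiejar. This is a minimal sketch, not the course's code: the helper name build_crawler_opener and its proxy_url parameter are hypothetical, and nothing is fetched over the network.

```python
import http.cookiejar
import urllib.request


def build_crawler_opener(proxy_url=None):
    """Hypothetical helper: return (opener, cookie_jar) with cookie support
    and, optionally, a proxy."""
    cookie_jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(cookie_jar)]
    if proxy_url is not None:
        # Route both http and https traffic through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    # build_opener adds its default handlers on top of the ones passed in.
    opener = urllib.request.build_opener(*handlers)
    return opener, cookie_jar


opener, cookie_jar = build_crawler_opener()
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

The printed list shows that build_opener already includes HTTPRedirectHandler by default, so redirects are followed without passing that handler explicitly. Calling urllib.request.install_opener(opener) afterwards would make urllib.request.urlopen use this opener globally.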

Example 2:
import urllib2
import cookielib

url = ""

print 'Method 1'
response1 = urllib2.urlopen(url)
print response1.getcode()  # status code
print len(response1.read())

print 'Method 2'
request = urllib2.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")  # pose as a Mozilla browser
response2 = urllib2.urlopen(request)
print response2.getcode()  # status code
print len(response2.read())

print 'Method 3'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
response3 = urllib2.urlopen(url)
print response3.getcode()  # status code
print cj  # print the cookie contents
print response3.read()
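For readers on Python 3, the second method (a Request object with a browser-like User-Agent header) might look like the sketch below. The URL is a placeholder assumption, and the request is only built and inspected, not sent.

```python
import urllib.request

# Hypothetical target URL, used only to build the request object.
url = "http://example.com/"

# Method 2 from the notes: wrap the URL in a Request and add a header.
request = urllib.request.Request(url)
request.add_header("User-Agent", "Mozilla/5.0")

# Request stores header names capitalized ("User-agent"), so look it up that way.
user_agent = request.get_header("User-agent")
print(request.full_url, user_agent)
```

Passing this request to urllib.request.urlopen(request) would send it. The returned response offers getcode() and read(), as in the Python 2 example above.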

Web page parser: Beautiful Soup syntax

1) Create a BeautifulSoup object

from bs4 import BeautifulSoup
# Create a BeautifulSoup object from an HTML document string
soup = BeautifulSoup(
    html_doc,               # HTML document string
    'html.parser',          # HTML parser
    from_encoding='utf8'    # encoding of the HTML document
)

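2) Search for nodes (find_all and find take the same arguments)
# Method: find_all(name, attrs, string)

# Find all nodes whose tag is a
soup.find_all('a')
# Find all a nodes whose link matches the form /view/123.html
soup.find_all('a', href=re.compile(r'/view/\d+\.html'))  # a regular expression can be passed in
# Find nodes whose tag is div, whose class is abc, and whose text is python
soup.find_all('div', class_='abc', string='python')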
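To see which links the pattern /view/\d+\.html would select, it can be tried on its own with the re module. This sketch assumes a few made-up href values and uses search(), which is how the Beautiful Soup documentation describes a compiled pattern being applied to an attribute value.

```python
import re

# The same pattern that is passed to find_all in the notes.
view_link = re.compile(r'/view/\d+\.html')

# Hypothetical href values, for illustration only.
hrefs = [
    "/view/123.html",
    "/view/abc.html",
    "/item/123.html",
    "http://example.com/view/42.html",
]

# search() looks for the pattern anywhere in the string.
matching = [href for href in hrefs if view_link.search(href)]
print(matching)
```

Only hrefs containing /view/ followed by digits and .html are kept, so /view/abc.html and /item/123.html are filtered out.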

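3) Access node information
# Given a node that was found (its text is Python)
# Get the tag name of the node
node.name
# Get the href attribute of an a node
node['href']
# Get the link text of an a node
node.get_text()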

Beautiful Soup example:

from bs4 import BeautifulSoup
import re  # regular expression module

# The "three sisters" sample HTML from the Beautiful Soup documentation
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')

print "Get all links"
links = soup.find_all('a')
for link in links:
    print link.name, link['href'], link.get_text()

print "Get the Lacie link"
link_node = soup.find('a', href='http://example.com/lacie')
print link_node.name, link_node['href'], link_node.get_text()

print 'Regex match'
link_node = soup.find('a', href=re.compile(r'ill'))
print link_node.name, link_node['href'], link_node.get_text()

print 'Get the text of the p paragraph'
p_node = soup.find('p', class_='title')
print p_node.name, p_node.get_text()
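As a rough illustration of what the link loop above collects, here is a sketch that gathers each link's href and text using only the standard library's html.parser instead of Beautiful Soup. The class name LinkCollector and the example.com URLs are assumptions made for this illustration. It is not how Beautiful Soup is implemented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect (href, text) pairs for every a tag."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._in_a = False   # True while inside an open a tag
        self._href = None    # href of the a tag currently open
        self._text = []      # text chunks seen inside that a tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_a:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_a:
            self.links.append((self._href, "".join(self._text)))
            self._in_a = False


# Assumed sample markup with placeholder URLs.
sample_html = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</p>'
)

collector = LinkCollector()
collector.feed(sample_html)
collector.close()
links = collector.links
for href, text in links:
    print("a", href, text)
```

Beautiful Soup wraps this kind of bookkeeping behind find_all, node['href'], and get_text(), which is why the notes use it for anything beyond trivial extraction.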
