Python Lightweight Crawler Development, Part 3

Using the Beautiful Soup library

Creating a BeautifulSoup object

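from bs4 import BeautifulSoup
# Create a BeautifulSoup object from an HTML document string
soup = BeautifulSoup(
    html_doc,              # the HTML document string
    'html.parser',         # the HTML parser to use
    from_encoding='utf8'   # encoding of the HTML document (only used when html_doc is bytes)
)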
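As a minimal sketch of the constructor call above, the following builds a soup from a small document. The sample HTML and the variable names are made up for illustration and are not from the original tutorial. It also shows that from_encoding only matters when the markup is passed as bytes; for a str the hint is not needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_doc = """
<html><head><title>Sample page</title></head>
<body>
<a href="/view/123.htm" class="abc">python</a>
<a href="/view/456.htm">crawler</a>
</body></html>
"""

# Case 1: the markup is already a str, so no encoding hint is passed.
soup_from_text = BeautifulSoup(html_doc, 'html.parser')

# Case 2: the markup is raw bytes (for example, an undecoded HTTP response
# body), so from_encoding tells the parser how to decode it.
html_bytes = html_doc.encode('utf-8')
soup_from_bytes = BeautifulSoup(html_bytes, 'html.parser', from_encoding='utf-8')

# Both objects expose the same parsed tree.
print(soup_from_text.title.string)
print(soup_from_bytes.title.string)
```

Either way the result is the same tree, so the choice depends only on whether the download step hands back text or bytes.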

Searching for nodes (find_all, find)

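import re
# Method: find_all(name, attrs, string)
# find takes the same arguments but returns only the first match
# Find all nodes whose tag is a
soup.find_all('a')
# Find all a nodes whose link has the form /view/123.htm
# (the href argument accepts a compiled regular expression)
soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
# Find all div nodes with class abc and text python
soup.find_all('div', class_='abc', string='python')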
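The three searches above can be tried against a small document. This is a sketch only: the sample HTML is invented for illustration, and the string argument assumes a reasonably recent Beautiful Soup 4 release (older releases called it text).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
html_doc = """
<div class="abc">python</div>
<div class="abc">java</div>
<a href="/view/123.htm">first</a>
<a href="/view/abc.htm">second</a>
<a href="/pic/456.html">third</a>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# Every a tag, regardless of attributes.
all_links = soup.find_all('a')

# Only a tags whose href contains /view/<digits>.htm.
view_links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))

# div tags with class abc whose text is exactly python.
# class_ has a trailing underscore because class is a Python keyword.
python_divs = soup.find_all('div', class_='abc', string='python')

# find returns the first matching node (or None) instead of a list.
first_link = soup.find('a')

print(len(all_links))
print([a['href'] for a in view_links])
print(len(python_divs))
print(first_link.get_text())
```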

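Accessing node information

# Given a found node whose link text is python:
# Get the tag name of the found node
node.name
# Get the href attribute of the found a node
node['href']
# Get the link text of the found a node
node.get_text()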
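To see these accessors on a concrete node, here is a short sketch. The sample markup is invented for illustration; node.get and node.attrs are shown as additional, standard ways to read attributes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html_doc = '<p>Read the <a href="/view/123.htm" title="docs">python guide</a></p>'
soup = BeautifulSoup(html_doc, 'html.parser')

# Take the first a node in the document.
node = soup.find('a')

tag_name = node.name          # the tag name
link_target = node['href']    # attribute lookup; raises KeyError if missing
link_text = node.get_text()   # the text inside the tag
missing_id = node.get('id')   # like dict.get: returns None if the attribute is absent
all_attrs = node.attrs        # every attribute as a dict

print(tag_name, link_target, link_text, missing_id, all_attrs)
```

Using node.get instead of node[...] is a reasonable choice in a crawler, where many links lack the attribute being read.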
