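Python Web Crawler (5): Web Page Parsers

A web page parser is a tool that extracts valuable data from a web page.

Python has four kinds of web page parsers:

1 Regular expressions: fuzzy-matching parsing

2 html.parser: structured parsing

3 Beautiful Soup: structured parsing

4 lxml: structured parsing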
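To make the difference between fuzzy matching and structured parsing concrete, here is a minimal sketch that pulls the same links out of a page in both ways, using only the standard library. The sample page and the LinkCollector class are invented for this illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, invented for this illustration.
PAGE = '<p>See <a href="/a.html">A</a> and <a class="x" href="/b.html">B</a>.</p>'

# 1) Regular expression: treats the whole page as one string (fuzzy matching).
regex_links = re.findall(r'href="([^"]+)"', PAGE)


# 2) html.parser: reacts to each tag as the document structure is parsed.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


collector = LinkCollector()
collector.feed(PAGE)
parser_links = collector.links

print(regex_links)
print(parser_links)
```

The regular expression is shorter, but it knows nothing about tags: it would also pick up an href inside a comment or a script string. The parser only records href values that belong to an actual a tag.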

Of these, Beautiful Soup is very powerful, and it can use either html.parser or lxml as its underlying parser.

Structured parsing: the DOM (Document Object Model) tree
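As a rough illustration of the tree idea, the sketch below parses a small document and prints its elements indented by depth. The document and the outline helper are invented for this example; only the Beautiful Soup calls themselves come from the library.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document, invented for this illustration.
html = "<html><head><title>Demo</title></head><body><p>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")


def outline(node, depth=0):
    """Return one indented line per element, mirroring the tree structure."""
    lines = []
    for child in node.children:
        # Skip text nodes; only element nodes (Tag) are listed
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            lines.extend(outline(child, depth + 1))
    return lines


tree = outline(soup)
print("\n".join(tree))
```

Each element becomes a node whose children are the elements nested inside it, which is what lets a structured parser answer questions such as "find every a tag inside this p".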

Beautiful Soup syntax:

The find_all method searches for all nodes that meet the criteria.

The find method returns only the first node that meets the criteria.
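A small sketch of the difference, assuming a made-up three-item list as input:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = '<ul><li id="a">one</li><li id="b">two</li><li id="c">three</li></ul>'
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")     # list-like result holding every match
first_item = soup.find("li")        # only the first match
missing = soup.find("table")        # None when nothing matches
no_tables = soup.find_all("table")  # empty result when nothing matches

print(len(all_items), first_item["id"], missing, len(no_tables))
```

Because find returns None when there is no match, it is worth checking the result before accessing attributes on it.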

About nodes:

I. Create a BeautifulSoup object
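A minimal sketch of building the object from either text or raw bytes. The markup is invented for this illustration; the from_encoding argument only matters when the input is bytes, since a Python str is already decoded.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
text_markup = "<p>café</p>"

# From a str: no encoding hint is needed.
soup_from_text = BeautifulSoup(text_markup, "html.parser")

# From bytes (for example, a raw HTTP response body): from_encoding tells
# Beautiful Soup how to decode them.
raw_bytes = text_markup.encode("utf-8")
soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")

print(soup_from_text.p.get_text())
print(soup_from_bytes.p.get_text())
```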

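II. Search for nodes

A powerful feature of Beautiful Soup is that you can pass in a regular expression to match content.

class_ is written with a trailing underscore to avoid a clash with the Python keyword class.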
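The two search features above can be sketched as follows. The markup and paths are invented for this illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration.
html = (
    '<p class="title">Heading</p>'
    '<a class="nav" href="/docs/intro.html">Intro</a>'
    '<a class="nav" href="/blog/post.html">Post</a>'
)
soup = BeautifulSoup(html, "html.parser")

# A compiled regular expression is searched against each href value.
doc_links = soup.find_all("a", href=re.compile(r"^/docs/"))

# class is a Python keyword, so the keyword argument is spelled class_.
titles = soup.find_all("p", class_="title")

# The same filter written with an attrs dict, where no underscore is needed.
same_titles = soup.find_all("p", attrs={"class": "title"})

print([a["href"] for a in doc_links])
print([p.get_text() for p in titles])
```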

III. Access node information
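Once a node has been found, its tag name, attributes, and text can be read as in this sketch. The link used here is invented for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical link, invented for this illustration.
soup = BeautifulSoup(
    '<a class="sister big" href="/lacie" id="link2">Lacie</a>', "html.parser"
)
node = soup.find("a")

tag_name = node.name          # name of the tag
href = node["href"]           # dictionary-style attribute access
title = node.get("title")     # .get() returns None for a missing attribute
classes = node["class"]       # class is multi-valued, so this is a list
text = node.get_text()        # text contained in the node

print(tag_name, href, title, classes, text)
```

Indexing with node["title"] would raise KeyError here because the attribute is absent, so node.get is the safer choice for optional attributes.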

Example test:

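from bs4 import BeautifulSoup
import re

# Sample document. The HTML tags and link URLs were stripped from this listing;
# they are restored here on the assumption that it is the "three sisters"
# sample from the Beautiful Soup documentation, which the surviving text matches.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# html_doc is already a str, so from_encoding (which only applies to bytes) is omitted
soup = BeautifulSoup(html_doc, 'html.parser')

print('Get all links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('Get the Lacie link')
linknode = soup.find_all('a', href='http://example.com/lacie')
for link in linknode:
    print(link.name, link['href'], link.get_text())

print('Regex match')
# Matches any href containing "ill", which here is only the Tillie link
linknode = soup.find_all('a', href=re.compile(r'ill'))
for link in linknode:
    print(link.name, link['href'], link.get_text())

print('Get p')
pnode = soup.find_all('p', class_='title')
for link in pnode:
    print(link.name, link.get_text())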

Learned from: imooc (慕課網).
