Parsing HTML with BeautifulSoup

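Tags can be looked up by their CSS attributes, for example span tags that are distinguished only by their class attribute.

Using the class attribute, you can grab all the red text on a page. The code is as follows: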

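from urllib.request import urlopen
from bs4 import BeautifulSoup

# The page URL is left empty in the source; supply the page you want to scrape.
html = urlopen("")
bsobj = BeautifulSoup(html, "html.parser")
# Assumed filter, taken from the text above: every <span> whose class is "red".
namelist = bsobj.find_all("span", {"class": "red"})
for name in namelist:
    print(name.get_text())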

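2. How get_text() works

.get_text() strips every tag from the HTML document you are working with and returns a string that contains only the text. For example, if you are working with a large block of source code full of hyperlinks, paragraphs, and other tags, .get_text() removes all of them and leaves a single string of tagless text.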
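
The behaviour described above can be checked on a tiny document. This is a minimal sketch: the HTML fragment is made up for illustration, and the built-in "html.parser" backend is an assumption rather than something the article specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration.
html = '<p>Read the <a href="/docs">docs</a> and the <b>FAQ</b>.</p>'

soup = BeautifulSoup(html, "html.parser")

# get_text() drops the <p>, <a>, and <b> tags and keeps only their text.
text = soup.get_text()
print(text)  # Read the docs and the FAQ.
```

Because the tag structure is gone once the text has been extracted, it usually makes sense to call .get_text() as the last step, after you have located the tags you need.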

3. find() and find_all()

find_all(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
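
The two signatures differ only in the limit argument. As a rough sketch (the list markup below is invented for illustration), find() behaves like a search capped at one result, except that it returns the tag itself rather than a one-element list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li")           # every match, in document order
first_two = soup.find_all("li", limit=2)  # stop after two matches
first_item = soup.find("li")              # a single tag, or None if nothing matches

print([li.get_text() for li in all_items])  # ['one', 'two', 'three']
print([li.get_text() for li in first_two])  # ['one', 'two']
print(first_item.get_text())                # one
```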

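For the tag argument you can pass a single tag name or a Python list of tag names. For example, the following code returns a list of all the heading tags in an HTML document: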

find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
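
To see a list of tag names in action, here is a small sketch on an invented fragment (the headings and paragraphs are placeholders, not content from any real page):

```python
from bs4 import BeautifulSoup

# Hypothetical document with three heading levels and two paragraphs.
html = "<h1>Title</h1><p>Intro</p><h2>Section</h2><p>Body</p><h3>Detail</h3>"
soup = BeautifulSoup(html, "html.parser")

# A list of names matches any tag whose name appears in the list.
headings = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
names = [h.name for h in headings]
print(names)  # ['h1', 'h2', 'h3']
```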

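The attributes argument takes a Python dictionary that maps a tag's attributes to the values to match. For example, the following call returns the span tags in an HTML document that are either red or green: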

find_all("span", {"class": ["red", "green"]})
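
A short sketch of attribute matching, assuming the colors are expressed as class values (the spans and their text are invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical spans; only the first two should match.
html = (
    '<span class="red">stop</span>'
    '<span class="green">go</span>'
    '<span class="blue">wait</span>'
)
soup = BeautifulSoup(html, "html.parser")

# The dictionary value may be a list: a tag matches if its class is any listed value.
colored = soup.find_all("span", {"class": ["red", "green"]})
words = [span.get_text() for span in colored]
print(words)  # ['stop', 'go']
```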

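If recursive is set to True, find_all looks through all the children of the tag you call it on, and their children in turn, for tags that match. If recursive is set to False, find_all only looks at the top-level tags, that is, the direct children. find_all searches recursively by default (recursive defaults to True); this argument usually does not need to be set unless you know exactly what information you need and scraping speed really matters.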
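
The difference is easiest to see on a tag that has both a direct and a nested match. A minimal sketch, using made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> directly under the <div>, one nested deeper.
html = "<div><p>direct</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div")

deep = outer.find_all("p")                      # default: recursive=True
shallow = outer.find_all("p", recursive=False)  # direct children only

print([p.get_text() for p in deep])     # ['direct', 'nested']
print([p.get_text() for p in shallow])  # ['direct']
```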

The text argument matches against the text content of tags rather than against their attributes.
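
A small sketch of matching on text content. It uses the keyword string, which is the name newer Beautiful Soup 4 releases give to the text argument; the markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the exact text "the prince" appears twice.
html = "<p>the prince</p><p>the princess</p><span>the prince</span>"
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the result is the matching text nodes themselves.
strings = soup.find_all(string="the prince")
parents = [s.parent.name for s in strings]
print(parents)  # ['p', 'span']

# With a tag name, the result is the tags whose text matches exactly.
paragraphs = soup.find_all("p", string="the prince")
print(len(paragraphs))  # 1
```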

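4. Navigating the tree

find_all locates tags by their name and attributes; tree navigation locates tags by their position, moving vertically or horizontally through the document tree.

If you only want the child tags, use the .children attribute: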

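from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("")  # URL left empty in the source
bsobj = BeautifulSoup(html, "html.parser")
# Iterate over the direct children of the first <table> on the page.
for child in bsobj.find("table").children:
    print(child)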
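
To make "children only" concrete, the sketch below compares .children with .descendants, which also walks into nested tags and text. The table is a made-up two-row example.

```python
from bs4 import BeautifulSoup

# Hypothetical table with a header row and one data row.
html = "<table><tr><th>Item</th></tr><tr><td>Vase</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# .children yields only the nodes directly under the table: the two rows.
child_names = [child.name for child in table.children]
print(child_names)  # ['tr', 'tr']

# .descendants also yields the cells and the text nodes inside them.
descendant_count = len(list(table.descendants))
print(descendant_count)  # 6
```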

BeautifulSoup's next_siblings attribute makes it simple to collect data from tables, especially tables with a header row:

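from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("")  # URL left empty in the source
bsobj = BeautifulSoup(html, "html.parser")
# Start at the first row of the first <table> and walk over every row after it,
# so the first (header) row itself is skipped.
for sibling in bsobj.find("table").tr.next_siblings:
    print(sibling)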
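
The same idea on a self-contained table, as a sketch: the item names and prices are invented, and the isinstance check is an added precaution because on real pages the siblings often include whitespace text nodes between rows.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table: one header row followed by two data rows.
html = (
    "<table>"
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vase</td><td>15</td></tr>"
    "<tr><td>Lamp</td><td>40</td></tr>"
    "</table>"
)
soup = BeautifulSoup(html, "html.parser")
header_row = soup.find("table").tr

rows = []
for sibling in header_row.next_siblings:
    # Skip anything that is not a tag, such as whitespace between rows.
    if isinstance(sibling, Tag):
        rows.append([cell.get_text() for cell in sibling.find_all("td")])

print(rows)  # [['Vase', '15'], ['Lamp', '40']]
```

A tag is never its own sibling, which is why starting from the header row leaves the header out of the result.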

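from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("")  # URL left empty in the source
bsobj = BeautifulSoup(html, "html.parser")
# From the first <img>, go up to its parent tag, step to the sibling just
# before that parent, and print that sibling's text.
print(bsobj.find("img").parent.previous_sibling.get_text())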
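
The chain of moves in the last example can be traced on a small fragment. This is a sketch under stated assumptions: the row layout, the image file name, and the price are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical row: the price sits in the cell just before the image's cell.
html = (
    "<table><tr>"
    "<td>Vase</td><td>$15.00</td><td><img src='vase.jpg'></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img", {"src": "vase.jpg"})
image_cell = img.parent                    # up: the <td> that wraps the image
price_cell = image_cell.previous_sibling   # sideways: the <td> before it
price = price_cell.get_text()
print(price)  # $15.00
```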
