Python Web Scraping Tutorial from Scratch, Day05

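Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work. (Quoted from the Chinese Beautiful Soup documentation.)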

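Like lxml, BeautifulSoup is a powerful parsing library that makes it easy to extract individual elements from a web page, which makes it one of the most useful tools for scraping.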

Install the BeautifulSoup library (published on PyPI as beautifulsoup4):

pip install beautifulsoup4

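BeautifulSoup needs a parser to parse a page. It supports the parsers listed below. lxml and html5lib are separate packages and must be installed before they can be used.

Parser: usage
Python standard library: BeautifulSoup(html, "html.parser")
lxml HTML parser: BeautifulSoup(html, "lxml")
lxml XML parser: BeautifulSoup(html, "xml")
html5lib: BeautifulSoup(html, "html5lib")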
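Since lxml is an optional dependency, a script may want to fall back to the built-in parser when it is missing. The following is a minimal sketch of that idea; the helper name make_soup and its preferred parameter are made up for illustration and are not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper, not a BeautifulSoup API.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)  # hello
```

Different parsers can build different trees from the same malformed markup, so it is worth naming the parser explicitly rather than relying on whichever one happens to be installed.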

Import the BeautifulSoup library:

from bs4 import BeautifulSoup
import re

Sample HTML page:

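# The tags of this sample page were lost in the source. They are reconstructed
# here from the selectors used in the examples (title, p, b, a, class "sister"),
# so attributes and whitespace may differ from the original page.
html = """
<html><head><title>the dormouse's story</title></head>
<body>
<p>
<b>the dormouse's story</b>
</p>
<p>once upon a time there were three little sisters; and their names were
<a class="sister"></a>,
<a class="sister">lacie</a> and
<a class="sister">tillie</a>;
and they lived at the bottom of a well.</p>
<p>...</p>
"""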

soup = BeautifulSoup(html, 'lxml')
print(soup.title.string.strip())  # .string returns the string inside a tag
print(soup.p.name)  # .name returns the tag name
print(soup.p.attrs)  # .attrs returns the tag's attributes as a dict

the dormouse's story

p
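To make the three accessors above more concrete, here is a small sketch on a separate one-line document. The markup, including the id value intro, is invented for this illustration and is not the sample page from this post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
doc = '<p class="title" id="intro"><b>hello</b></p>'
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

print(p.name)         # p
print(p.attrs)        # {'class': ['title'], 'id': 'intro'}
print(p["id"])        # intro  (a tag can be indexed like a dict)
print(p.get("href"))  # None   (.get avoids a KeyError for missing attributes)
print(p.string)       # hello  (p has one child, b, so it shares b's string)
```

Note that class comes back as a list, because an HTML element can carry several classes.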

print(soup.head.title.string.strip())  # tag1.tag2 selects the first tag2 nested inside tag1

the dormouse's story

print(soup.p.contents)  # .contents returns the direct children as a list

['\n', <b>the dormouse's story</b>, '\n']

print(soup.body.children)  # .children is an iterator over the direct children
for child in soup.body.children:
    print(child)
    print('---------')

---------
<p><b>the dormouse's story</b></p>
---------
---------
<p>once upon a time there were three little sisters; and their names were <a class="sister"></a>, <a class="sister">lacie</a> and <a class="sister">tillie</a>; and they lived at the bottom of a well.</p>
---------
---------
<p>...</p>
---------
---------
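The separators with nothing between them in the output above come from whitespace strings, which count as children too. The sketch below, on a small invented document, shows how .contents, .children and .descendants differ and how to keep only the tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup used only for this illustration.
doc = "<div>\n<p>one <b>two</b></p>\n<p>three</p>\n</div>"
soup = BeautifulSoup(doc, "html.parser")
div = soup.div

# .contents is a list of direct children, whitespace strings included.
print(len(div.contents))  # 5

# .children iterates over the same direct children; keep only Tag objects.
child_tags = [c.name for c in div.children if isinstance(c, Tag)]
print(child_tags)  # ['p', 'p']

# .descendants walks the whole subtree, so the nested b shows up as well.
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]
print(descendant_tags)  # ['p', 'b', 'p']
```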

(1)name

print(soup.find_all(name='b'))  # name is the tag name to match

[<b>the dormouse's story</b>]

(2)attrs

print(soup.find_all(attrs={'class': 'sister'}))  # select tags by their attributes

[<a class="sister"></a>, <a class="sister">lacie</a>, <a class="sister">tillie</a>]

print(soup.find_all(class_='sister'))  # class is a Python keyword, so the keyword argument is class_; the effect is the same

[<a class="sister"></a>, <a class="sister">lacie</a>, <a class="sister">tillie</a>]
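The two attribute filters above can be compared side by side. This is a sketch on an invented fragment; the id values link2 and link3 and the extra class big are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
doc = (
    '<a class="sister" id="link2">lacie</a>'
    '<a class="sister big" id="link3">tillie</a>'
    '<a class="other">x</a>'
)
soup = BeautifulSoup(doc, "html.parser")

by_attrs = soup.find_all(attrs={"class": "sister"})
by_keyword = soup.find_all(class_="sister")

# Both forms return the same tags. A class filter matches any one of a
# tag's classes, so the "sister big" link is included.
print(by_attrs == by_keyword)                 # True
print([a.get_text() for a in by_keyword])     # ['lacie', 'tillie']

# find() returns only the first match (or None) instead of a list.
print(soup.find("a", id="link3").get_text())  # tillie
```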

(3)text

print(soup.find_all('a', attrs={'class': 'sister'}))  # match on the tag name and its attributes

[<a class="sister"></a>, <a class="sister">lacie</a>, <a class="sister">tillie</a>]

print(soup.find_all(text=re.compile('dormouse')))  # match on the text inside tags

["\n   the dormouse's story\n  ", "\n    the dormouse's story\n   "]
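In Beautiful Soup 4.4.0 and later the same filter is spelled string=, and text= is the older name for it. The sketch below uses string= on an invented fragment and shows that the results are strings, from which the enclosing tag can be reached through .parent.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
doc = (
    "<title>the dormouse's story</title>"
    "<p>a <b>dormouse</b> sleeps</p>"
    "<p>nothing here</p>"
)
soup = BeautifulSoup(doc, "html.parser")

# With only string=, find_all returns the matching strings, not tags.
matches = soup.find_all(string=re.compile("dormouse"))
print([str(m) for m in matches])        # ["the dormouse's story", 'dormouse']
print([m.parent.name for m in matches])  # ['title', 'b']

# Combined with a tag name, it returns the tags whose .string matches.
bold = soup.find_all("b", string=re.compile("dormouse"))
print(bold)  # [<b>dormouse</b>]
```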

print(soup.select('.sister'))  # select every tag whose class is sister (CSS selector)

[<a class="sister"></a>, <a class="sister">lacie</a>, <a class="sister">tillie</a>]
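select() accepts ordinary CSS selectors, so conditions can be combined in one string. The following sketch runs on an invented fragment; the id names and the href values are assumptions for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
doc = (
    '<div id="names">'
    '<a class="sister" href="/lacie">lacie</a>'
    '<a class="sister" href="/tillie">tillie</a>'
    "</div>"
    '<a class="other" href="/x">x</a>'
)
soup = BeautifulSoup(doc, "html.parser")

# "#names a.sister": a tags with class sister anywhere inside the element with id names.
links = soup.select("#names a.sister")
print([a["href"] for a in links])  # ['/lacie', '/tillie']

# select_one returns the first match, or None when nothing matches.
first = soup.select_one("a.sister")
print(first.get_text())              # lacie
print(soup.select_one("a.missing"))  # None
```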
