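Installing and Using the Python BS4 Library: A Detailed Guide

The Beautiful Soup library, usually called bs4, supports Python 3 and is an excellent third-party library for writing web crawlers. Because it is so simple and smooth to use, it is also nicknamed "tasty soup". At the time of writing, the latest version of bs4 was 4.6.0. This article covers only the most basic usage of the library; for the full details, see the official Beautiful Soup documentation.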

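Installing bs4

Python's strength is that, as an open-source language, it has many developers writing third-party libraries for it. When we want to implement a feature, we can concentrate on that feature and leave the remaining details and groundwork to a library. bs4 is one such library, and a powerful helper when writing crawlers.

Installation is simple: run pip from the command line.

$ pip install beautifulsoup4

Next, check whether bs4 was installed successfully: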

$ pip list

If beautifulsoup4 appears in the list, bs4 has been installed successfully.
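Besides scanning the pip list output, the installation can also be checked from Python itself. The following is a minimal sketch, assuming the package was installed for the same interpreter that runs the script; the variable name installed_version is only illustrative.

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4

# bs4 exposes its version string as a module attribute.
installed_version = bs4.__version__
print("bs4 version:", installed_version)
```

If the import raises ModuleNotFoundError, pip most likely installed the package for a different Python interpreter.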

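Basic usage of bs4

Here we briefly explain how to use bs4, without worrying for now about how to fetch pages from the web.

Suppose the HTML we need to crawl is the fragment below. This HTML code will be used as the example several times. It is a passage from Alice in Wonderland (referred to below as the Alice document):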

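html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="..." class="sister" id="link1">Elsie</a>,
<a href="..." class="sister" id="link2">Lacie</a> and
<a href="..." class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Now let's parse this HTML code with bs4.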

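# Import the bs4 module
from bs4 import BeautifulSoup

# Make the soup
soup = BeautifulSoup(html, 'html.parser')

# Print the result
print(soup.prettify())

'''out:
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="..." id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="..." id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="..." id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
'''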

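As you can see, bs4 turns the web page into a soup object.

In fact, bs4 is a library for parsing, traversing, and maintaining the "tag tree".

Put simply, bs4 parses the HTML source code into a structured form, which makes it easy to work with its nodes, tags, and attributes.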
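To make the idea of a "tag tree" concrete, the sketch below walks the parsed document and records the path to every tag. It is only an illustration: the HTML is a shortened stand-in for the Alice document, and the helper name tag_paths is hypothetical, not part of bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the Alice document (assumption: trimmed for brevity).
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")


def tag_paths(node, prefix=""):
    """Return one 'parent/child' path per tag, in document order."""
    paths = []
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes, keep only tags
            path = f"{prefix}/{child.name}"
            paths.append(path)
            paths.extend(tag_paths(child, path))
    return paths


paths = tag_paths(soup)
for path in paths:
    print(path)
```

Every tag has a parent and may have children, which is why attribute chains such as soup.title.parent.name work.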

Here are a few simple ways to navigate the structured data.

Compare each result carefully with the HTML shown at the beginning.

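# Find the document's title
soup.title
# <title>The Dormouse's story</title>

# The name of the title tag
soup.title.name
# 'title'

# The string inside the title tag
soup.title.string
# "The Dormouse's story"

# The name of the title tag's parent node
soup.title.parent.name
# 'head'

# The first paragraph found in the document
soup.p
# <p class="title"><b>The Dormouse's story</b></p>

# The class attribute of that p tag (class can hold several values, so bs4 returns a list)
soup.p['class']
# ['title']

# Find the first a tag
soup.a
# <a class="sister" href="..." id="link1">Elsie</a>

# Find all a tags
soup.find_all('a')
# [<a class="sister" href="..." id="link1">Elsie</a>,
#  <a class="sister" href="..." id="link2">Lacie</a>,
#  <a class="sister" href="..." id="link3">Tillie</a>]

# Find the tag whose id is "link3"
soup.find(id="link3")
# <a class="sister" href="..." id="link3">Tillie</a>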

The examples above show how bs4 works with an HTML source file:

First, convert the HTML source into a soup object.

Then, extract content from it in a specific way.
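The two steps can be put together in a short, self-contained script. This is a sketch under stated assumptions: the href values are placeholders borrowed from the version of this example in the Beautiful Soup documentation, and filtering by class_ is just one of several ways to select tags.

```python
from bs4 import BeautifulSoup

# Stand-in for the Alice document. The href values are assumptions taken
# from the Beautiful Soup documentation's version of this example.
html = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""

# Step 1: convert the HTML source into a soup object.
soup = BeautifulSoup(html, "html.parser")

# Step 2: extract content in a specific way, here every <a> tag whose
# class is "sister". The keyword is class_ (with a trailing underscore)
# because class is a reserved word in Python.
sisters = {a["id"]: a.get_text() for a in soup.find_all("a", class_="sister")}
print(sisters)
# {'link1': 'Elsie', 'link2': 'Lacie', 'link3': 'Tillie'}
```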

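Slightly more advanced usage

Find the links of all <a> tags in the document:

# Note that find_all returns a list that can be iterated over
for link in soup.find_all('a'):
    print(link.get('href'))
# Prints the href value of each of the three <a> tags, one per line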
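The loop uses link.get('href') rather than link['href']. A short sketch of why that can matter, assuming some <a> tags on a real page have no href at all; the two-tag HTML below is made up for illustration, with a placeholder URL:

```python
from bs4 import BeautifulSoup

# Illustrative HTML (assumption): the second <a> tag has no href attribute.
html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="link2">Lacie</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag.get returns None for a missing attribute, whereas indexing with
# link['href'] would raise KeyError on the second tag.
hrefs = [link.get("href") for link in soup.find_all("a")]

# Keep only the tags that actually carry a link.
links = [href for href in hrefs if href is not None]
print(links)
# ['http://example.com/elsie']
```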

Get all the text content from the document:

# The get_text method quickly returns all the text in the source file
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
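The raw get_text() output keeps the blank lines of the source. When more compact text is wanted, get_text also accepts a separator and a strip flag. A minimal sketch on a shortened stand-in for the Alice document (the " | " separator is an arbitrary choice for illustration):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the Alice document (assumption: trimmed for brevity).
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">and they lived at the bottom of a well.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# strip=True trims every text fragment and drops the empty ones;
# separator is the string placed between the remaining fragments.
compact = soup.get_text(separator=" | ", strip=True)
print(compact)
# The Dormouse's story | and they lived at the bottom of a well.
```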

That concludes this introduction to bs4.
