Web Scraping: The BeautifulSoup Module

2021-07-25 11:40:20

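2. Using this DOM tree, you can search for nodes by name, attribute, and text. find_all() returns every node that matches, while find() returns only the first match; both methods take exactly the same parameters.

3. Once you have a node, you can access its name, attributes, and text.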
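The difference between the two search methods can be sketched as follows. This is a minimal illustration for Python 3, not the author's original session: the markup in `sample_html` and its example.com URLs are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
sample_html = """
<p class="story">
  <a class="sister" href="http://example.com/elsie">Elsie</a>
  <a class="sister" href="http://example.com/lacie">Lacie</a>
  <a class="other" href="http://example.com/tillie">Tillie</a>
</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all() returns every matching node.
all_links = soup.find_all("a")                       # search by tag name
sister_links = soup.find_all("a", class_="sister")   # search by name and attribute

# find() accepts the same arguments but returns only the first match,
# or None when nothing matches.
first_link = soup.find("a")
by_text = soup.find("a", string="Lacie")             # search by text
missing = soup.find("table")

print(len(all_links), len(sister_links))
print(first_link.get_text(), by_text["href"], missing)
```

Because find() returns None when there is no match, it is usually worth checking the result before accessing attributes on it.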

# a is the tag name (a hyperlink), href and class are attributes, and the text displayed on the page is python
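A tag of the kind described above can be inspected as in the sketch below. The original markup did not survive, so the `href` and `class` values here (`123.html`, `article_link`) are hypothetical placeholders; only the tag name `a` and the text `python` come from the description.

```python
from bs4 import BeautifulSoup

# Hypothetical tag: the href and class values are made up for illustration.
soup = BeautifulSoup('<a href="123.html" class="article_link">python</a>', "html.parser")
node = soup.find("a")

tag_name = node.name        # the tag name
href = node["href"]         # attributes are read like dictionary keys
classes = node["class"]     # class is multi-valued, so it comes back as a list
text = node.get_text()      # the text displayed on the page
missing = node.get("id")    # .get() returns None for an absent attribute

print(tag_name, href, classes, text, missing)
```

Using `node["id"]` on a missing attribute would raise a KeyError, which is why `.get()` is the safer choice for optional attributes.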

In [6]: soup.find_all('a')  # find all nodes whose tag is a
Out[6]:
[elsie,
lacie]

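# link is assumed to hold the result of soup.find_all('a')
In [20]: link[1]['href']  # read the href attribute of the a node
Out[20]: ''

In [21]: link[1].get_text()  # read the link text of the a node
Out[21]: 'lacie'

In [22]: link[0].get_text()
Out[22]: 'elsie'

In [24]: link[1].name  # read the tag name of the a node
Out[24]: 'a'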

soup = BeautifulSoup(html_doc, 'html.parser', from_encoding='utf-8')  # first argument: the document; second: the parser; third: the document's encoding (only used when the document is bytes)
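The three constructor arguments can be tried out with the short sketch below. It assumes Python 3 and uses a hypothetical one-line document encoded to bytes, since `from_encoding` only matters when the input has not been decoded yet.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, encoded to bytes to mimic a raw HTTP response body.
raw_bytes = "<p class='title'>caf\u00e9</p>".encode("utf-8")

# Argument 1: the document; argument 2: the parser; argument 3: the encoding
# used to decode the bytes.
soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
title_text = soup.find("p", class_="title").get_text()

print(title_text)
```

When the document is already a str, Beautiful Soup ignores `from_encoding` and emits a warning, so the argument can simply be dropped in that case.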

In [28]: links = soup.find_all('a')

In [30]: for i in links:
   ....:     print(i.name, i['href'], i.get_text())
   ....:
a elsie
a lacie
a tillie

In [34]: link_node = soup.find('a', href=re.compile(r'els'))  # match with a regular expression (requires import re)

In [35]: print(link_node.name, link_node['href'], link_node.get_text())
a elsie
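A self-contained version of the regular-expression search might look like the sketch below. The markup and its example.com URLs are hypothetical stand-ins for the original links, which are not shown in the output above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical links, used only for illustration.
sample_html = """
<a href="http://example.com/elsie">Elsie</a>
<a href="http://example.com/lacie">Lacie</a>
<a href="http://example.com/tillie">Tillie</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A compiled pattern matches anywhere in the attribute value,
# so 'els' matches the href that contains "elsie".
link_node = soup.find("a", href=re.compile(r"els"))

# find_all() accepts the same kind of filter.
other_links = soup.find_all("a", href=re.compile(r"/(la|ti)"))

print(link_node.name, link_node["href"], link_node.get_text())
print([a.get_text() for a in other_links])
```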

In [5]: from bs4 import BeautifulSoup

In [6]: html_doc = """
   ...:
   ...:
   ...: the dormouse's story
   ...:
   ...: once upon a time there were three little sisters; and their names were
   ...: elsie,
   ...: lacie and
   ...: tillie;
   ...: and they lived at the bottom of a well.
   ...:
   ...: ...
   ...: """

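In [8]: soup = BeautifulSoup(html_doc, 'html.parser')

In [9]: print(soup.prettify)  # no parentheses, so this prints the bound method's repr; call soup.prettify() to get the formatted markup
the dormouse's story
once upon a time there were three little sisters; and their names were
elsie,
lacie and
tillie;
and they lived at the bottom of a well.
...>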

In [12]: p_node = soup.find('p', class_='title')

In [13]: p_node.name
Out[13]: 'p'

In [15]: print(p_node['class'])
['title']

In [16]: print(p_node.get_text)  # no parentheses: prints the bound method, not the text
the dormouse's story
>

In [17]: print(p_node.get_text())
the dormouse's story
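The difference between `p_node.get_text` and `p_node.get_text()` in the session above can be reproduced with the sketch below. The exact markup is an assumption, because the original tags are not shown; only the `title` class and the paragraph text are taken from the session.

```python
from bs4 import BeautifulSoup

# Assumed markup: a single paragraph with class "title".
soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")
p_node = soup.find("p", class_="title")

method_object = p_node.get_text   # the method itself, not yet called
text = p_node.get_text()          # calling it returns the text as a str

print(callable(method_object), isinstance(text, str))
print(text)
```

Printing the uncalled method shows its repr, which embeds the whole tag, so the stray `>` at the end of that output is the closing bracket of the repr rather than part of the text.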
