Python web scraping with BeautifulSoup

BeautifulSoup: used to extract information from the content returned by a request

Install: pip install beautifulsoup4

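Sibling traversal takes place among the nodes that share the same parent node.

The result of sibling traversal from a tag is not necessarily a tag: text between tags is also a node, so next_sibling and previous_sibling can return a string instead.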
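To make the point about siblings concrete, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration and do not come from the original post.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical HTML snippet, invented for this illustration.
html = "<p><a id='first'>one</a>, <a id='second'>two</a></p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.a
sibling = first.next_sibling       # the text ", " between the two links
second = sibling.next_sibling      # the second <a> tag

print(isinstance(sibling, NavigableString), repr(str(sibling)))
print(isinstance(second, Tag), second["id"])

# To walk only the tag siblings, filter out everything that is not a Tag.
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in tag_siblings])
```

Checking isinstance(node, Tag) before reading node.name or node attributes is one simple way to avoid surprises when a sibling turns out to be text.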

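import re

import requests
from bs4 import BeautifulSoup

url = ''  # the target URL is left empty in the original
try:
    r = requests.get(url)
    demo = r.text
    soup = BeautifulSoup(demo, "html.parser")  # html.parser is Python's built-in HTML parser
    # print(soup.title)                        # print the <title> tag
    # print(soup.a)                            # print the first <a> tag
    # print(soup.find_all('a'))                # get all <a> tags
    # print(soup.find_all(True))               # get every tag
    # A regex filter is matched with search(), so '^' is needed to mean "starts with".
    print(soup.find_all(re.compile('^b')))     # get all tags whose name starts with "b"
    # print(soup.a.name)                       # the tag's name
    # print(soup.a.string)                     # the tag's text (None if it has several children)
    # print(soup.a.attrs)                      # the tag's attributes as a dict
    # print(soup.a.attrs['class'])             # value of one specific attribute
    # print(soup.a.parent.name)                # name of the parent of <a>
    # print(soup.a.parents)                    # ancestors of <a> (a generator; iterate to see them)
    # print(soup.a.parent.parent.name)         # name of the grandparent of <a>
    # print(soup.prettify())                   # print the whole page, indented
    # print(soup.head)                         # print <head>
    # print(soup.head.contents)                # children of <head> as a list
    # print(soup.body.contents)                # children of <body> as a list
    # print(soup.body.children)                # children of <body> as an iterator
    # for child in soup.body.children:
    #     print(child)
    # print(soup.a.next_sibling)               # next sibling node (may be text, not a tag)
    # print(soup.a.next_sibling.next_sibling)
    # print(soup.a.previous_sibling)           # previous sibling node
except Exception:
    print('get fail')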
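Because the script above needs a real URL, the same calls can be tried offline on an inline document. The following is a minimal sketch; the HTML string and the variable names are assumptions made up for illustration, not part of the original post.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page, invented for this illustration.
html = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro"><b>Bold text</b> and a <a class="link" href="/one">first link</a>.</p>
<table><tr><td>cell</td></tr></table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all filters: True matches every tag; a regex is applied to the tag name.
all_names = [t.name for t in soup.find_all(True)]
starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]
contains_b = [t.name for t in soup.find_all(re.compile("b"))]
print(all_names)
print(starts_with_b)   # names beginning with "b"
print(contains_b)      # names containing "b" anywhere, e.g. "table"

# Tag properties.
link = soup.a
link_classes = link.attrs["class"]   # class is multi-valued, so this is a list
link_text = link.string              # the single string inside <a>
para_string = soup.p.string          # None: <p> has more than one child
para_text = soup.p.get_text()        # all text under <p>, concatenated
print(link.name, link_classes, link["href"], link_text)
print(para_string, para_text)

# Upward traversal: parents is a generator that ends at the document itself.
ancestor_names = [p.name for p in link.parents]
print(link.parent.name, ancestor_names)
```

The difference between "^b" and "b" is easy to miss: without the anchor, the pattern also matches names such as table, which merely contain the letter.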
