Python Web Scraping Study Notes

from bs4 import BeautifulSoup
# Create the Beautiful Soup object, using lxml as the parser
soup = BeautifulSoup(html, "lxml")
print(soup.prettify())

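Output: the parsed document, re-indented so that each tag and each string sits on its own line.

A Tag object is simply one of the tags in the HTML.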
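The notes never show the `html` string being parsed, so here is a minimal self-contained sketch of the same step. The sample document is a stand-in (an assumption, not the page used in these notes), and it uses the `html.parser` backend that ships with Python so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document standing in for the `html` variable.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""

# "html.parser" is built in; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a str, it does not print anything by itself.
pretty = soup.prettify()
print(pretty)
```

Different parsers can repair broken markup in different ways, so the printed tree may differ slightly between `html.parser` and `lxml` on malformed pages.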

Building on the example above, add:

from bs4 import BeautifulSoup
# Create the Beautiful Soup object, using lxml as the parser
soup = BeautifulSoup(html, "lxml")
# print(soup.prettify())
print(soup.title)  # None, because this page has no <title> tag
print(soup.head)  # None, because this page has no <head> tag
print(soup.a)  # the first <a> tag; on this page, the "Edit your profile so more people can get to know you" link
print(type(soup.p))  # <class 'bs4.element.Tag'>
print(soup.p)

Here print(soup.p) outputs the first <p> tag on the page, markup included.
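As a small sketch of the behaviour described above: attribute-style access such as `soup.p` returns only the first matching tag, and returns `None` when the tag does not exist. The HTML fragment and the `title_text` fallback are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two <p> tags, one <a> tag and no <title>.
html = '<div><p class="intro">Hello</p><p>World</p><a href="/me">Profile</a></div>'
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p      # first <p> in document order
missing = soup.title  # no <title> here, so this is None

print(first_p)        # <p class="intro">Hello</p>
print(type(first_p))  # <class 'bs4.element.Tag'>
print(missing)        # None

# Guard against None before chaining further attribute access.
title_text = soup.title.string if soup.title is not None else "(no title)"
print(title_text)     # (no title)
```

Checking for `None` first avoids the `AttributeError` that a chained access like `soup.title.string` raises on a page without that tag.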

Similarly, building on the above, add:

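print(soup.name)  # [document]; the soup object itself is special, and its name is [document]
print(soup.head.name)  # head; for any other tag the value is the tag's own name (requires a <head> tag, since soup.head is None on a page without one)
print(soup.p.attrs)  # all attributes of the <p> tag, as a dict
print(soup.p['class'])  # the value of the <p> tag's class attribute
soup.p['class'] = "newclass"
print(soup.p)  # attributes and content can be modified this way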
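The following sketch puts `name`, `attrs`, and attribute assignment together on a one-line fragment. The fragment and the variable `attrs_before` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document.
soup = BeautifulSoup('<p class="title" id="t1"><b>Hi</b></p>', "html.parser")
p = soup.p

print(soup.name)      # [document]
print(p.name)         # p

attrs_before = dict(p.attrs)  # copy, so the later change does not affect it
print(attrs_before)   # {'class': ['title'], 'id': 't1'}
print(p["class"])     # ['title'] (class is multi-valued, so it comes back as a list)
print(p.get("href"))  # None; .get() avoids a KeyError for a missing attribute

# Attributes can be reassigned like dictionary entries.
p["class"] = "newclass"
print(p)              # <p class="newclass" id="t1"><b>Hi</b></p>
```

Using `p.get("href")` rather than `p["href"]` is the safer choice when an attribute may be absent.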

Building on the previous code, add:

print(soup.p.string)
# The Dormouse's story
print(type(soup.p.string))
# <class 'bs4.element.NavigableString'>
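A short sketch of how `.string` behaves, on an invented two-paragraph fragment: it yields a `NavigableString` when a tag has exactly one string inside it (directly or through a single nested tag), and `None` when the tag has several children.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment: the first <p> wraps a single <b>, the second has mixed children.
html = "<p><b>The Dormouse's story</b></p><p>a<i>b</i></p>"
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("p")

text = first.string
print(text)                               # The Dormouse's story
print(isinstance(text, NavigableString))  # True
print(isinstance(text, str))              # True, NavigableString subclasses str

# With more than one child, .string cannot pick a single string.
print(second.string)                      # None
```

When `.string` is `None`, `get_text()` still returns the concatenated text of the tag.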

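The BeautifulSoup object represents the document as a whole. For most purposes you can treat it as a Tag object: it supports most of the methods used for navigating and searching the document tree. Because the BeautifulSoup object does not correspond to a real HTML or XML tag, it has no name and no attributes. It is sometimes convenient to look at its .name anyway, so the BeautifulSoup object is given a special .name with the value "[document]".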

print(soup.name)
# prints [document]

The Comment object represents the contents of a comment in the document.

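markup = "<b><!-- illustrative comment text --></b>"  # illustrative markup: a <b> tag wrapping a comment
soup = BeautifulSoup(markup, "lxml")
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>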
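As a runnable sketch of the Comment case, with invented markup: a `Comment` is a special kind of `NavigableString`, so it prints without the `<!-- -->` delimiters, and an `isinstance` check is one way to tell comments apart from ordinary text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one comment and one ordinary text node.
markup = "<b><!--Hey, buddy--></b><i>plain text</i>"
soup = BeautifulSoup(markup, "html.parser")

comment = soup.b.string
print(type(comment))  # <class 'bs4.element.Comment'>
print(comment)        # Hey, buddy

# Ordinary text is a NavigableString but not a Comment.
print(isinstance(soup.i.string, Comment))  # False

# Collect every comment in the document.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
print(len(comments))  # 1
```

Filtering comments out this way keeps them from being mistaken for visible page text during scraping.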

Building on the above, add:

head_tag = soup.div
# .contents returns a list of all direct children
print(head_tag.contents)

Likewise:

head_tag = soup.div
# .children returns an iterator over the same direct children
for child in head_tag.children:
    print(child)
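To make the difference concrete, here is a sketch on an invented fragment: `.contents` is a list, `.children` is an iterator over the same direct children, and `.descendants` (not covered in the notes above, added here for contrast) walks the whole subtree.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: a <div> with two <p> children, one containing a <b>.
soup = BeautifulSoup("<div><p>one</p><p>two <b>bold</b></p></div>", "html.parser")
div = soup.div

print(div.contents)            # [<p>one</p>, <p>two <b>bold</b></p>]
children = list(div.children)  # same two nodes, produced lazily
print(len(children))           # 2

# .descendants also reaches nested tags and strings; keep only the tags here.
descendant_tags = [d.name for d in div.descendants if isinstance(d, Tag)]
print(descendant_tags)         # ['p', 'p', 'b']
```

On real pages the direct children often include whitespace-only strings between tags, so a length check on `.contents` can be larger than the number of child tags.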

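Use .strings to loop over every string in the document:

for string in soup.strings:
    print(repr(string))

Use .stripped_strings to get the same strings with surrounding whitespace removed and whitespace-only strings skipped:

for string in soup.stripped_strings:
    print(repr(string))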
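The two generators can be compared side by side on a small indented fragment (invented for this sketch); the names `raw` and `clean` are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation and padded text.
html = "<div>\n  <p> first </p>\n  <p>second</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # includes the newline/indent strings
clean = list(soup.stripped_strings)  # whitespace-only strings are dropped

for s in raw:
    print(repr(s))
print(clean)  # ['first', 'second']
```

For scraped text that will be stored or compared, `stripped_strings` is usually the more convenient of the two.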

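Find all matching tags:

print(soup.find_all("a", id="link2"))

find returns as soon as it reaches the first tag that satisfies the conditions, and returns that single element. find_all selects every tag that satisfies the conditions and returns them all.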
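A minimal sketch of that contrast, using two invented links: `find` gives back one `Tag` (or `None`), while `find_all` gives back a list-like collection that may be empty.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links.
html = ('<a id="link1" class="sister">Elsie</a>'
        '<a id="link2" class="sister">Lacie</a>')
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                           # a single Tag
every = soup.find_all("a")                       # all matching tags
by_id = soup.find_all("a", id="link2")           # filter on an attribute
by_class = soup.find_all("a", class_="sister")   # class_ because class is a keyword

print(first["id"])                     # link1
print(len(every))                      # 2
print([a.get_text() for a in by_id])   # ['Lacie']

# When nothing matches, find returns None and find_all returns an empty list.
missing = soup.find("table")
empty = soup.find_all("table")
print(missing, len(empty))             # None 0
```

Because `find` can return `None`, its result needs a check before any attribute access, whereas the result of `find_all` can always be looped over safely.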

# Select by tag name:
print(soup.select('a'))
# Select by class name: prefix the class name with '.'
print(soup.select('.sister'))
# Select by id: prefix the id with '#'
print(soup.select("#link1"))

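On the page parsed here, the a tag query returns results. The other queries return an empty list, because the page itself contains no matching elements.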

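Combined selectors

print(soup.select("p #link1"))  # elements with id link1 inside a <p> tag

Child selectors

print(soup.select("head > title"))

Attribute selectors

print(soup.select('a[href=""]'))  # the attribute test applies to the same node as the tag name, so there must be no space between them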
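The selector forms above can be tried together on one small document. This is a sketch under assumptions: the HTML and the example.com URLs are invented, and `select_one` is shown as an extra for the common case where only the first match is wanted.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a title and two links.
html = """
<html><head><title>Demo</title></head><body>
<p class="story"><a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")               # tag name
by_class = soup.select(".sister")       # class name
by_id = soup.select("#link1")           # id
descendant = soup.select("p #link1")    # id link1 anywhere inside a <p>
child = soup.select("head > title")     # <title> directly under <head>
by_attr = soup.select('a[href="http://example.com/elsie"]')  # exact attribute value
first_only = soup.select_one("a.sister")  # first match, or None
nothing = soup.select("#link9")           # no such id, so an empty list

print(len(by_tag), len(by_class), len(nothing))  # 2 2 0
print(by_id[0].get_text(), child[0].get_text())  # Elsie Demo
```

Note that `select` always returns a list-like result, even for an id selector that can match at most one element, so the single match is at index 0.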

First check the type:

print(type(soup.select('div')))
for title in soup.select('div'):
    print(title.get_text())

print(soup.select('div')[20].get_text())  # text of the div at index 20, i.e. the 21st div, since indexing starts at 0
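A brief sketch of extracting text from `select` results, on an invented fragment. Indexing directly with `[20]` raises `IndexError` when the page has fewer divs, so the example guards the lookup; `twenty_first` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with only two <div> tags.
soup = BeautifulSoup("<div>one</div><div>two <b>bold</b></div>", "html.parser")

divs = soup.select("div")       # list-like result
print(isinstance(divs, list))   # True

# A separator keeps text from nested tags apart; strip=True trims each piece.
texts = [d.get_text(" ", strip=True) for d in divs]
print(texts)                    # ['one', 'two bold']

# Guard positional access instead of assuming the page has enough divs.
twenty_first = divs[20].get_text() if len(divs) > 20 else None
print(twenty_first)             # None
```

Positional indexes like `[20]` are fragile on live pages whose layout changes, so selecting by class or id tends to hold up better where one is available.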

Article title: Python Web Scraping Study Notes: A Detailed Guide to Using the beautifulsoup4 Library
