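Comparing HTML Data Extraction Methods for Python Web Scraping

Three approaches are commonly used to extract data from HTML in Python: regular expressions, XPath, and BeautifulSoup. The most widely used XPath library is lxml. This article also introduces another library, SimplifiedDoc, and compares the strengths and weaknesses of all four.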

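1. Installation

Name           Installation                    Package size   Notes
Regex          None required (built in)        -              -
lxml           pip install lxml                4.5 MB         Depends on C libraries
BeautifulSoup  pip install beautifulsoup4      107 KB         No other installation needed unless a third-party parser is used
SimplifiedDoc  pip install simplified-scrapy   43 KB          No third-party dependencies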

2. Python version support

At the time of writing, all of these methods support both Python 2 and Python 3.

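3. Usage

The usage of regular expressions and XPath is not repeated here; this section only briefly compares BeautifulSoup and SimplifiedDoc. The code below shows how to instantiate each one and extract data.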

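# Sample HTML. The tags were lost from the source text; this markup is a
# reconstruction (an assumption) that gives the calls below something to match.
html = '''
<html><head><title>Test</title></head>
<body><div id="test">test text</div></body></html>
'''

# Example: BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, features='html.parser')  # built-in parser
soup = BeautifulSoup(html, features='lxml')         # or the lxml parser (requires lxml)
title = soup.title
# Get all matches
divs = soup.find_all(id='test')
divs = soup.select('div#test')
# Get the first match
div = soup.find(id='test')
div = soup.select_one('div#test')
print(div.text)

# Example: SimplifiedDoc
from simplified_scrapy import SimplifiedDoc
doc = SimplifiedDoc(html)
title = doc.title
# Get all matches
divs = doc.getElements('div', attr='id', value='test')
divs = doc.selects('div#test')
# Get the first match
div = doc.getElement('div', attr='id', value='test')
div = doc.select('div#test')
print(div.text)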

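The two libraries are used in similar but not identical ways, and both are quite simple. The getElement methods in SimplifiedDoc deserve a special mention: each of them accepts three optional parameters, start=None, end=None, and before=None. These parameters help locate the data to extract and, in the right situations, make extraction very convenient.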
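Since regex usage is not covered in this section, here is a minimal sketch of the regex approach applied to the same div#test task, using only the standard library. The sample HTML and the pattern are illustrative assumptions rather than part of the original article, and a pattern like this is only reliable when the markup is simple and predictable.

```python
import re

# Hypothetical sample markup, chosen to mirror the div#test example.
html = '<html><head><title>Test</title></head><body><div id="test">test text</div></body></html>'

# Match a <div> whose id attribute is "test" and capture its inner text.
# The non-greedy (.*?) stops at the first closing </div>, so nested divs
# inside the target element would break this pattern.
pattern = re.compile(r'<div[^>]*\bid="test"[^>]*>(.*?)</div>', re.S)

match = pattern.search(html)
div_text = match.group(1) if match else None
print(div_text)
```

The pattern has to encode details such as attribute position and nesting by hand, which is the main reason regex is usually considered harder to use than a parser.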

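4. Performance comparison

As for processing speed, regular expressions are fast and process only the data that is actually needed, so they are generally acknowledged to be the fastest approach, although they are relatively hard to use. The comparison below covers only lxml, BeautifulSoup, and SimplifiedDoc. The benchmark code is as follows: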

from lxml import etree
from bs4 import BeautifulSoup
from simplified_scrapy import SimplifiedDoc
import time

# The tags were lost from the source text; this markup is a reconstruction
# (an assumption) based on the surviving text and the //h1 query below.
html = '''
<html><body><div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.</p>
<p><a href="https://www.iana.org/domains/example">More information...</a></p>
</div></body></html>
'''

# lxml with XPath
start = time.time()
for i in range(0, 1000):
    root = etree.HTML(html)
    text = root.xpath('//h1/text()')[0]
print(time.time() - start, text)

# BeautifulSoup with the built-in html.parser
start = time.time()
for i in range(0, 1000):
    soup = BeautifulSoup(html, features='html.parser')
    text = soup.h1.text
print(time.time() - start, text)

# BeautifulSoup with the lxml parser
start = time.time()
for i in range(0, 1000):
    soup = BeautifulSoup(html, features='lxml')
    text = soup.h1.text
print(time.time() - start, text)

# SimplifiedDoc
start = time.time()
for i in range(0, 1000):
    doc = SimplifiedDoc(html)
    text = doc.h1.html
print(time.time() - start, text)

The results of running the test in VS Code are as follows:

Name                         Time in debug mode (seconds)
lxml                         0.10795402526855469
BeautifulSoup (html.parser)  2.5450849533081055
BeautifulSoup (lxml)         2.236968994140625
SimplifiedDoc                0.25988101959228516

Name                         Time in non-debug mode (seconds)
lxml                         0.12264490127563477
BeautifulSoup (html.parser)  0.799994945526123
BeautifulSoup (lxml)         0.7144896984100342
SimplifiedDoc                0.14832687377929688

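In both debug and non-debug mode, lxml is the fastest, SimplifiedDoc is second, and BeautifulSoup is third. One odd result that I cannot explain: every other method ran faster in non-debug mode than in debug mode, but lxml ran slightly slower.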
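One possible explanation for the lxml anomaly is plain measurement noise: the two lxml timings differ by only about 0.015 seconds over 1000 iterations, which other activity on the machine could easily cause. A minimal sketch of a more robust measurement follows, assuming the standard library's timeit module. So that the example runs without third-party packages, it uses the built-in html.parser as a stand-in for the libraries being compared; the H1Extractor class, the extract_h1 helper, and the sample HTML are hypothetical and not part of the original benchmark.

```python
import timeit
from html.parser import HTMLParser

# Hypothetical sample page; substitute the HTML used in the benchmark.
HTML = '<html><body><div><h1>Example Domain</h1><p>Some text.</p></div></body></html>'


class H1Extractor(HTMLParser):
    """Collects the text of the first <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        if tag == 'h1' and self.text is None:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # Keep only the first piece of text seen inside the first <h1>.
        if self.in_h1 and self.text is None:
            self.text = data


def extract_h1(html):
    parser = H1Extractor()
    parser.feed(html)
    parser.close()
    return parser.text


# Run the 1000-iteration loop five separate times instead of once.
timings = timeit.repeat(lambda: extract_h1(HTML), number=1000, repeat=5)

# The minimum is the least disturbed run; the spread shows how noisy the runs are.
print('best: %.4f s, worst: %.4f s' % (min(timings), max(timings)))
```

If the gap between the debug and non-debug timings for lxml is smaller than the best-to-worst spread measured this way, the difference is probably not meaningful.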

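5. Summary

lxml lives up to expectations as the fastest option apart from regular expressions. The lesser-known SimplifiedDoc is also quite fast and simple to use, and is worth trying.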

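Name           Installation difficulty   Ease of use   Speed         Package size
Regex          None (built in)           Difficult     Fastest       None
lxml           Moderate                  Moderate      Fast          Large
BeautifulSoup  Easy                      Easy          Slow          Fairly small
SimplifiedDoc  Easy                      Easy          Fairly fast   Small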
