Python Web Scraping Made Easy: Data Extraction at a Glance

# Match the country's area
context = re.findall(r'<td class="w2p_fw">(.*?)</td>', areas)[1]
print context

Output:

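Regular expressions offer a quick way to scrape data, but the approach is brittle and tends to break as soon as the page's markup is updated.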
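To make the brittleness concrete, here is a minimal sketch in Python 3 syntax. The HTML snippets and the `extract_area` helper are made up for illustration; only the `places_area__row` id, the `w2p_fw` class and the area value are taken from the code later in this article. A purely cosmetic change to the markup (single quotes instead of double quotes) is enough to make the pattern miss.

```python
import re

# Hypothetical markup, modelled on the row id and cell class used later in this article.
ORIGINAL_HTML = (
    '<tr id="places_area__row">'
    '<td class="w2p_fw">647,500 square kilometres</td></tr>'
)
# The same row after a cosmetic change: attribute values now use single quotes.
UPDATED_HTML = ORIGINAL_HTML.replace('"', "'")

AREA_PATTERN = re.compile(r'<td class="w2p_fw">(.*?)</td>')


def extract_area(html):
    """Return the first matching cell text, or None when the pattern misses."""
    match = AREA_PATTERN.search(html)
    return match.group(1) if match else None


print(extract_area(ORIGINAL_HTML))  # the cell text
print(extract_area(UPDATED_HTML))   # None: the pattern no longer matches
```

The page still renders identically in a browser after such a change, which is why this kind of failure is easy to overlook.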

Beautiful Soup

Beautiful Soup is a very popular Python module. It parses web pages and provides a convenient interface for locating content within them.

Installation

pip install beautifulsoup4 --user
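Once the package is installed, a typical use might look like the sketch below (Python 3 syntax). The HTML string is a made-up sample, not the real page; the `places_area__row` id and `w2p_fw` class are the ones used in the benchmark later in this article. The built-in `html.parser` backend is chosen here only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the real page.
SAMPLE_HTML = """
<table>
  <tr id="places_area__row">
    <td class="w2p_fw">647,500 square kilometres</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Locate the row by its id, then the value cell by its CSS class.
row = soup.find("tr", id="places_area__row")
cell = row.find("td", class_="w2p_fw")
area = cell.get_text(strip=True)
print(area)
```

Because the lookup goes through the parsed tree rather than the raw text, it does not depend on quoting style or whitespace in the markup.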

lxml

import lxml.html
broken_html = ''  # malformed HTML sample goes here
tree = lxml.html.fromstring(broken_html)
fixed_html = lxml.html.tostring(tree, pretty_print=True)
print fixed_html

Benchmarking the three approaches

import re
import urllib2
import urlparse
from bs4 import BeautifulSoup
import lxml.html
import time
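# Fetch the page content
def download(url, user_agent='wswp', proxy=None, num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        # Route requests for this URL scheme through the proxy
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        # Use the opener so that the proxy handler takes effect
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            # Retry only on 5xx server errors
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, user_agent, proxy, num_retries - 1)
    return html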
# Extract with a regular expression
def re_scraper(html):
    results = {}
    results['area'] = re.search('<tr id="places_area__row">.*?<td class="w2p_fw">(.*?)</td>', html).groups()[0]
    return results
# Extract with Beautiful Soup
def bs_scraper(html):
    # Naming the parser explicitly avoids bs4's "no parser specified" warning
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    results['area'] = soup.find('table').find('tr', id='places_area__row').find('td', class_='w2p_fw').string
    return results
# Extract with lxml and a CSS selector
def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    conf = tree.cssselect('table > tr#places_area__row > td.w2p_fw')[0].text_content()
    results['area'] = conf
    return results
# Time each scraper
# Number of times each scraper is run
num_iterations = 1000
html = download('')
for name, scraper in [('re', re_scraper), ('bs', bs_scraper), ('lxml', lxml_scraper)]:
    # Start time
    start = time.time()
    for i in range(num_iterations):
        if scraper == re_scraper:
            # The re module caches compiled patterns by default; re.purge()
            # clears that cache so the comparison is fairer
            re.purge()
        result = scraper(html)
        # Check the result
        assert result['area'] == '647,500 square kilometres'
    # End time
    end = time.time()
    print '%s: %.2f seconds' % (name, end - start)
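The effect of the `re.purge()` call can be looked at in isolation with a small standard-library sketch (Python 3 syntax). The sample HTML string and the two function names are hypothetical, and the absolute timings will differ from machine to machine, so no expected numbers are given.

```python
import re
import timeit

# Hypothetical one-row sample; the real page is much larger.
HTML = '<tr id="places_area__row"><td class="w2p_fw">647,500 square kilometres</td></tr>'
PATTERN = r'<td class="w2p_fw">(.*?)</td>'


def search_cached():
    # After the first call, re reuses the compiled pattern from its cache.
    return re.search(PATTERN, HTML).group(1)


def search_purged():
    # Clearing the cache forces the pattern to be compiled again on every call.
    re.purge()
    return re.search(PATTERN, HTML).group(1)


cached = timeit.timeit(search_cached, number=1000)
purged = timeit.timeit(search_purged, number=1000)
print('cached: %.4f s, purged each time: %.4f s' % (cached, purged))
```

Purging inside the loop charges the regular expression scraper for compilation on every iteration, much as the other two scrapers pay for parsing on every iteration.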

Analysis of the results:

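Because lxml and the regular expression module are both implemented in C, they perform better than Beautiful Soup, which is written in Python. lxml has to parse the input into its internal format before it can search, which adds overhead. That overhead matters less when several values are scraped from the same page, because the page only needs to be parsed once.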
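The point about parsing overhead can be sketched as follows (Python 3 syntax). The sample HTML, the `scrape_fields` helper and the `example` field are hypothetical; only the area row mirrors the benchmark. XPath is used here instead of the `cssselect` call from the benchmark so that the sketch does not need the separate cssselect package.

```python
import lxml.html

# Hypothetical sample markup; only the area row mirrors the benchmark.
SAMPLE_HTML = """
<table>
  <tr id="places_area__row"><td class="w2p_fw">647,500 square kilometres</td></tr>
  <tr id="places_example__row"><td class="w2p_fw">example value</td></tr>
</table>
"""


def scrape_fields(html, fields):
    """Parse the document once, then run one XPath query per field."""
    tree = lxml.html.fromstring(html)  # the parsing cost is paid here, once
    results = {}
    for field in fields:
        cells = tree.xpath(
            '//tr[@id="places_%s__row"]/td[@class="w2p_fw"]' % field
        )
        # Missing rows yield None instead of raising IndexError.
        results[field] = cells[0].text_content() if cells else None
    return results


results = scrape_fields(SAMPLE_HTML, ["area", "example"])
print(results)
```

Each additional field costs only one more query against the already-parsed tree, whereas calling a single-field scraper repeatedly would parse the page again each time.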

Summary of methods

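| Scraping method     | Performance | Difficulty of use | Difficulty of installation |
|---------------------|-------------|-------------------|----------------------------|
| Regular expressions | Fast        | Hard              | Easy (built-in)            |
| Beautiful Soup      | Slow        | Easy              | Easy (pure Python)         |
| lxml                | Fast        | Easy              | Relatively hard            |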

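If the bottleneck of your scraper is downloading pages rather than extracting data, a slower approach such as Beautiful Soup is fine. If you only need to scrape a small amount of data and want to avoid extra dependencies, regular expressions are a reasonable choice. In most cases, lxml is the best fit.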
