Three Ways to Scrape Web Pages with Python

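Abstract: This article covers three ways to extract data from a web page with Python: regular expressions (re), the BeautifulSoup module, and the lxml module. All code in this article was run under Python 3.5.

The page scraped here is the headline news on the home page of the ** Meteorological Observatory:

Its HTML hierarchy is:

We extract the href attribute, the title attribute, and the text content of each tag.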

1. Regular expressions

Copy the outerHTML of the element:

High temperature warning

Code:

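# coding=utf-8
import re
import urllib.request

url = ''  # URL of the page to scrape
html = urllib.request.urlopen(url).read()
html = html.decode('utf-8')  # required in Python 3, where read() returns bytes
# The pattern text for href and title is missing here; each pattern
# should capture exactly one group (href, title, and tag text respectively).
links = re.findall('', html)
titles = re.findall('', html)
tags = re.findall('(.+?)', html)
for link, title, tag in zip(links, titles, tags):
    print(tag, url + link, title)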
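
To make the regular-expression approach concrete, here is a minimal sketch that runs on an inline HTML string instead of a live page. The markup, the hrefs, and the titles are hypothetical stand-ins (only the `alarmtip` id and the `waring` class come from the selectors used in this article), and the single combined pattern is one possible way to write it, not necessarily the pattern the original code used.

```python
import re

# Hypothetical markup standing in for the page's warning list;
# the real page structure is not reproduced in this article.
html = """
<div id="alarmtip">
  <ul>
    <li class="waring"><a target="_blank" href="/alarm/1.html" title="Heat warning issued">High temperature warning</a></li>
    <li class="waring"><a target="_blank" href="/alarm/2.html" title="Storm warning issued">Rainstorm warning</a></li>
  </ul>
</div>
"""

# One pattern with three groups keeps href, title and text aligned per link,
# so the three values never drift apart the way separate findall calls can.
pattern = re.compile(r'<a[^>]*?href="(.+?)"[^>]*?title="(.+?)"[^>]*?>(.+?)</a>')

records = pattern.findall(html)  # list of (href, title, text) tuples
for link, title, tag in records:
    print(tag, link, title)
```

Using one pattern with several groups avoids the `zip` step, at the cost of a longer expression that depends on the attribute order in the markup.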

2. The BeautifulSoup module

Code:

from bs4 import BeautifulSoup
import urllib.request

url = ''  # URL of the page to scrape
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')  # parse with the lxml parser
# Select the links inside the warning list items
content = soup.select('#alarmtip > ul > li.waring > a')
for n in content:
    link = n.get('href')
    title = n.get('title')
    tag = n.text
    print(tag, url + link, title)

The output is the same as above.
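
The examples in this article build the full link with `url + link`, which only gives a valid address when the page URL and the href happen to fit together. As a hedged alternative (it is not part of the original code), the standard library's `urllib.parse.urljoin` resolves an href against the page URL. The base URL and hrefs below are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, used only for illustration.
base = "http://example.com/weather/index.html"
hrefs = [
    "/alarm/1.html",                    # root-relative
    "alarm/2.html",                     # relative to the page's directory
    "http://other.example.org/x.html",  # already absolute
]

# urljoin handles all three cases; plain string concatenation would not.
absolute = [urljoin(base, h) for h in hrefs]
for a in absolute:
    print(a)
```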

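3. The lxml module

lxml is a Python wrapper around libxml2, an XML parsing library. The module is written in C and parses faster than Beautiful Soup, although it is also more involved to install.

Code: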

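import urllib.request
import lxml.html  # cssselect() also needs the cssselect package installed

url = ''  # URL of the page to scrape
html = urllib.request.urlopen(url).read()
tree = lxml.html.fromstring(html)
# Select the links inside the warning list items
content = tree.cssselect('li.waring > a')
for n in content:
    link = n.get('href')
    title = n.get('title')
    tag = n.text
    print(tag, url + link, title)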

The output is the same as above.

4. Storing the scraped data in a list or a dictionary

Using the BeautifulSoup module as an example:

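from bs4 import BeautifulSoup
import urllib.request

url = ''  # URL of the page to scrape
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
content = soup.select('#alarmtip > ul > li.waring > a')

######### Add to lists
link = []
title = []
tag = []
for n in content:
    # Loop body assumed: collect the same three fields as the earlier examples
    link.append(url + n.get('href'))
    title.append(n.get('title'))
    tag.append(n.text)

######## Add to a dictionary
for n in content:
    # Key names are assumed; the original dictionary literal is not preserved
    data = {'tag': n.text, 'link': url + n.get('href'), 'title': n.get('title')}
    print(data)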
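
The difference between the two storage shapes can be sketched without any network access. The tuples below are hypothetical stand-ins for scraped results, and the key names `tag`, `link`, and `title` are assumptions chosen to match the variable names used in this article.

```python
import json

# Hypothetical (tag, link, title) tuples standing in for scraped results.
rows = [
    ("High temperature warning", "/alarm/1.html", "Heat warning issued"),
    ("Rainstorm warning", "/alarm/2.html", "Storm warning issued"),
]

# Shape 1: three parallel lists, one per field.
tags = [tag for tag, _, _ in rows]
links = [link for _, link, _ in rows]
titles = [title for _, _, title in rows]

# Shape 2: one dictionary per item, collected in a list.
records = [{"tag": t, "link": l, "title": ti} for t, l, ti in rows]

# A list of dictionaries serializes directly to JSON.
print(json.dumps(records, indent=2))
```

Parallel lists suit column-wise processing, while a list of dictionaries keeps each item's fields together and is easier to write out as JSON or CSV.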

5. Summary

Table 2.1 summarizes the advantages and disadvantages of each scraping method.
