Web Data Scraping: A Hands-On Tutorial

Will Baotu Spring Stop Flowing in 2023?

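Find the information we need with our eyes.

In fact, scraping data from the web is the same process as browsing a web page. It involves the same two steps (fetching the page, then finding the information we need in it); only the tools differ slightly.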

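Python 2 ships with two built-in modules, urllib and urllib2, that can serve as the "browser" for scraping. pycurl is another good choice and copes with more demanding requirements.

As we know, HTTP/1.1 defines eight methods, and a real browser supports at least two of them for requesting pages: GET and POST. Compared with urllib2, the urllib module accepts only a string argument, cannot specify the request method, and cannot set request headers. For that reason urllib2 is regarded as the preferred "browser" for scraping.

The simplest use of the urllib2 module looks like this: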

import urllib2

response = urllib2.urlopen('')
if response.code == 200:
    html = response.read()  # html is the source of the page we requested
    print html
else:
    print 'response error.'

Besides a string, urllib2.urlopen also accepts a urllib2.Request object. This means we can set the request headers flexibly.

The signature of the urllib2.urlopen function is:

urllib2.urlopen(url[, data[, timeout[, cafile[, capath[, cadefault[, context]]]]]])

The constructor of urllib2.Request has this signature:

urllib2.Request(url[, data][, headers][, origin_req_host][, unverifiable])
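
As a minimal sketch of what setting a header looks like, the snippet below builds two request objects without sending them. The URL `http://example.com/list.php?page=1` and the `User-Agent` value are placeholders, not taken from the original. The `try`/`except` import is only there so the same code also runs on Python 3, where `urllib2` was folded into `urllib.request`.

```python
try:
    import urllib2 as request_lib          # Python 2
except ImportError:
    import urllib.request as request_lib   # Python 3: same Request class

URL = 'http://example.com/list.php?page=1'   # placeholder URL (assumption)
HEADERS = {'User-Agent': 'Mozilla/5.0'}      # placeholder header value (assumption)

# No data argument: the request would be sent with GET.
get_req = request_lib.Request(URL, headers=HEADERS)

# Passing data switches the method to POST.
post_req = request_lib.Request(URL, b'catid=101', HEADERS)

print(get_req.get_method())
print(post_req.get_method())
# Header names are stored capitalized, so look the value up as 'User-agent'.
print(get_req.get_header('User-agent'))
```

Either object can then be passed to `urlopen` in place of a plain URL string.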

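Beautiful Soup is a third-party Python library that helps us find the data we need in a page's source. It can extract data from an HTML or XML document and offers simple ways to process, traverse, and search the document tree and to modify page elements. Installation is simple (if you have no parser installed, install one as well):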

pip install beautifulsoup4

pip install lxml

The following example demonstrates how to find the information we need in a web page.

import urllib2
from bs4 import BeautifulSoup

response = urllib2.urlopen('')
if response.code == 200:
    html = response.read()
    soup = BeautifulSoup(html, 'lxml')
    div = soup.find('div')  # find the first div node
    print div  # show the whole div tag
    print div.text  # print the text inside the div
    print div.attrs  # print all attributes of the div
    print div['style']  # print the style attribute of the div
    divs = soup.find_all('div')  # find all div nodes
    for div in divs:  # iterate over all div nodes
        print div
    for tr in divs[1].find_all('tr'):  # iterate over all tr rows in the second div
        for td in tr.find_all('td'):
            print td.text
else:
    print 'response error.'

# -*- coding: utf-8 -*-
import re

html = u"""
2023年10月15日
2023年10月15日
2023年5月9日
2023年11月23日
2023年7月8日
2023年12月5日
2023年3月22日
2023年6月17日
"""
# capture the digits before 年 (year), 月 (month) and 日 (day)
pattern = re.compile(r'(\d+)年(\d+)月(\d+)日')
data = pattern.findall(html.encode('utf-8'))
print data
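
The matched tuples are still strings. A minimal sketch of turning them into `datetime.date` objects, so that they can be sorted and compared later, might look like this. The function name `parse_dates` and the sample text are illustrative only, and the snippet is written for Python 3, where strings are Unicode by default and no `encode` call is needed.

```python
# -*- coding: utf-8 -*-
import re
from datetime import date

# One or more digits before each of 年 (year), 月 (month) and 日 (day).
DATE_PATTERN = re.compile(r'(\d+)年(\d+)月(\d+)日')

def parse_dates(text):
    """Return every 'YYYY年M月D日' date found in text as a datetime.date."""
    return [date(int(y), int(m), int(d))
            for y, m, d in DATE_PATTERN.findall(text)]

sample = '2023年10月15日 and 2023年5月9日'   # illustrative sample text
dates = parse_dates(sample)
print(dates)
```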

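The website of the Jinan Urban and Rural Water Authority (濟南市城鄉水務局) publishes the groundwater levels of Baotu Spring and Heihu Spring (Black Tiger Spring) every day, and earlier water-level data can also be queried there. The goals of this tutorial are:

Plot the water-level curve for a specified period.

Analyze how Jinan's groundwater level changes and predict whether the springs will stop flowing in 2023.

After a few simple operations on the site we find that, starting from 2 May 2023, the authority has published groundwater levels for Baotu Spring and Heihu Spring every day. All data is displayed in pages of 20 days each, page numbers start at 1, and the URL of page 1 is: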


.gov.cn/list.php?catid=101&page=1

import urllib2

def get_data_by_page(page):
    url = '.gov.cn/list.php?catid=101&page=%d' % page
    req = urllib2.Request(url, '', )
    response = urllib2.urlopen(req)
    if response.code == 200:
        html = response.read()
        return html
    else:
        return False
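
Since the pages are numbered from 1, the function can be called in a loop to collect every page. The sketch below is one hedged way to do that. `fetch_all_pages`, its `max_pages` limit, and the stand-in `fake_fetch` are hypothetical helpers introduced for illustration; the real `get_data_by_page` would be passed in instead of the stand-in.

```python
def fetch_all_pages(fetch_page, max_pages):
    """Call fetch_page(1), fetch_page(2), ... and collect the results.

    Stops early at the first page whose fetch fails (returns False)
    and never requests more than max_pages pages.
    """
    pages = []
    for page in range(1, max_pages + 1):
        html = fetch_page(page)
        if html is False:
            break
        pages.append(html)
    return pages

def fake_fetch(page):
    """Stand-in for get_data_by_page: pretends only 3 pages exist."""
    return '<html>page %d</html>' % page if page <= 3 else False

pages = fetch_all_pages(fake_fetch, max_pages=10)
print(len(pages))
```

Passing the fetch function as an argument keeps the loop testable without touching the network.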


<tr>
<td width="25%" align="center" bgcolor="#dae9f8" class="s14">日期</td>
<td width="15%" align="center" bgcolor="#dae9f8"><span class="s14">趵突泉水位</span></td>
<td width="14%" align="center" bgcolor="#dae9f8"><span class="s14">黑虎泉水位</span></td>
</tr>
<tr>
<td align="center" bgcolor="#daf8e8" class="s14">2023年10月2日</td>
<td align="center" bgcolor="#daf8e8" class="s14">28.20米</td>
<td align="center" bgcolor="#daf8e8" class="s14">28.15米</td>
</tr>
<tr>
<td align="center" bgcolor="#daf8e8" class="s14">2023年10月1日</td>
<td align="center" bgcolor="#daf8e8" class="s14">28.16米</td>
<td align="center" bgcolor="#daf8e8" class="s14">28.10米</td>
</tr>

import re
import urllib2
from bs4 import BeautifulSoup

def get_data_by_page(page):
    data = list()
    p_date = re.compile(r'(\d+)\D+(\d+)\D+(\d+)')  # year, month, day
    url = '.gov.cn/list.php?catid=101&page=%d' % page
    req = urllib2.Request(url, '', )
    response = urllib2.urlopen(req)
    if response.code == 200:
        html = response.read()
        soup = BeautifulSoup(html, 'lxml')
        for tr in soup.find_all('tr'):
            tds = tr.find_all('td')
            # data rows have three cells and the #daf8e8 background color
            if len(tds) == 3 and 'bgcolor' in tds[0].attrs and tds[0]['bgcolor'] == '#daf8e8':
                year, month, day = p_date.findall(tds[0].text.encode('utf8'))[0]
                baotu = float(tds[1].text[:-1])  # drop the trailing unit character
                heihu = float(tds[2].text[:-1])
                data.append((int(year), int(month), int(day), baotu, heihu))
    return data
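
The slice `text[:-1]` assumes the unit is exactly one character long. A slightly more defensive sketch extracts the number with a regular expression instead, so it works whatever the unit suffix is. The helper name `parse_row` and the sample strings are assumptions for illustration, and the code is written for Python 3.

```python
# -*- coding: utf-8 -*-
import re
from datetime import date

P_DATE = re.compile(r'(\d+)\D+(\d+)\D+(\d+)')   # year, month, day split by non-digits
P_LEVEL = re.compile(r'\d+(?:\.\d+)?')           # first decimal number in the cell

def parse_row(date_text, baotu_text, heihu_text):
    """Convert the three cell texts of one table row into (date, float, float)."""
    year, month, day = P_DATE.search(date_text).groups()
    baotu = float(P_LEVEL.search(baotu_text).group())
    heihu = float(P_LEVEL.search(heihu_text).group())
    return date(int(year), int(month), int(day)), baotu, heihu

row = parse_row('2023年10月2日', '28.20米', '28.15米')   # sample cell texts
print(row)
```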

The errors known so far are:

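The best approach is to start from the most recent date and step through the data one day at a time with the timedelta object from the datetime module: where a date is wrong, correct it; where data is missing, fill it in with the average of the day before and the day after.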
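
The gap-filling half of that idea can be sketched as follows. This is only an illustration under stated assumptions: the data is assumed to be a dict mapping `datetime.date` to a single water level, the name `fill_missing_days` and the sample values are made up, and for gaps longer than one day the sketch falls back to the nearest known day on each side. Correcting wrong dates is not covered here.

```python
from datetime import date, timedelta

def fill_missing_days(levels):
    """Return a copy of levels with every missing day filled in.

    levels maps datetime.date -> water level (float). A missing day gets
    the average of the nearest known day before it and after it.
    """
    one_day = timedelta(days=1)
    first, last = min(levels), max(levels)
    filled = dict(levels)
    day = last
    while day > first:                    # walk backwards from the latest date
        day -= one_day
        if day in levels:
            continue
        before = day - one_day
        while before not in levels:       # nearest known earlier day
            before -= one_day
        after = day + one_day
        while after not in levels:        # nearest known later day
            after += one_day
        filled[day] = (levels[before] + levels[after]) / 2.0
    return filled

sample = {                                # made-up values for illustration
    date(2023, 10, 1): 28.0,
    date(2023, 10, 3): 29.0,
    date(2023, 10, 4): 29.5,
}
filled = fill_missing_days(sample)
print(filled[date(2023, 10, 2)])
```

Because the first and last dates are always present in the input, both inner searches are guaranteed to terminate.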

To be continued.
