Python Crawler (6): Extracting Data with Regular Expressions

First, fetch the Douban Top 250 page.

Start with the main function:

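import re
import urllib.request

from bs4 import BeautifulSoup

# These three regular expressions make the extraction below easier.
findlink = re.compile(r'')              # hyperlink of each <a> tag
findimage = re.compile(r'', re.S)       # <img> tag; re.S lets . match newlines
findjudge = re.compile(r'(\d*人评价)')  # "N people rated" in a <span>; the page text is Simplified Chinese

begin_url = ""      # address to crawl
getdata(begin_url)  # fetch the data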
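Two of the patterns in the listing are empty, so here is a minimal sketch of how such patterns behave on a small HTML fragment. The patterns, the names `find_link`, `find_image` and `find_judge`, and the sample markup are all assumptions made for illustration; they are not taken from the article or from the live page.

```python
import re

# Hypothetical patterns, written for the made-up fragment below.
find_link = re.compile(r'<a href="(.*?)">')
find_image = re.compile(r'<img.*?src="(.*?)"', re.S)  # re.S: . also matches newlines
find_judge = re.compile(r'(\d*人评价)')

# Made-up fragment shaped like one entry of a list page.
sample_item = '''
<div class="item">
  <a href="https://example.com/subject/1/">
    <img alt="Example Film"
         src="https://example.com/poster1.jpg">
  </a>
  <span>12345人评价</span>
</div>
'''

# With one capture group, findall returns a list of the captured strings.
links = find_link.findall(sample_item)
images = find_image.findall(sample_item)
judges = find_judge.findall(sample_item)
print(links, images, judges)
```

Without `re.S`, the image pattern would fail here because the `src` attribute sits on a different line from the opening `<img`.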

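The function that collects the data:

def getdata(baseurl):
    for i in range(0, 1):
        # On the Douban Top 250 pages, the number at the end of the URL selects the page.
        url = baseurl + str(i * 25)
        html = askurl(url)  # askurl() is a custom function that returns a page's content
        soup = BeautifulSoup(html, 'html.parser')  # parse into a tree with bs4
        for item in soup.find_all('div', class_='item'):  # every <div class="item">
            # print(item)  # item holds all the information we want
            item = str(item)  # convert to str for the regular expressions below
            # findall(s1, s2): s1 is the pattern, s2 is the string to search
            link = re.findall(findjudge, item)
            print(link)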
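The loop above only visits the first page, since `range(0, 1)` yields a single value. As a rough sketch of the offset arithmetic for all ten pages (250 entries at 25 per page), the helper below builds the full list of URLs. The function name `page_urls` and the base URL are placeholders, not the address used in the article.

```python
def page_urls(base_url, pages=10, page_size=25):
    """Return one URL per list page by appending the start offset."""
    return [base_url + str(i * page_size) for i in range(pages)]

# Placeholder base URL for illustration only.
urls = page_urls("https://example.com/top250?start=")
print(urls[0])
print(urls[-1])
```

Replacing `range(0, 1)` with `range(0, 10)` in `getdata` would have the same effect, at the cost of ten requests per run.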

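The function that fetches a page:

# Assumed value: the original listing uses `head` without defining it.
head = {"User-Agent": "Mozilla/5.0"}

def askurl(url):
    response = urllib.request.Request(url=url, headers=head)  # wrap in a Request object
    content = urllib.request.urlopen(response)  # open the page
    html = content.read().decode('utf-8')  # decode the bytes
    return html  # return the page source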
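The function above raises an exception as soon as a request fails. One possible variant, sketched below, catches `urllib.error.URLError` and returns an empty string instead. The names `HEADERS`, `build_request` and `ask_url`, the User-Agent value, and the timeout are assumptions for illustration.

```python
import urllib.error
import urllib.request

# Assumed header value; the article's own header dict is not shown.
HEADERS = {"User-Agent": "Mozilla/5.0"}

def build_request(url, headers=HEADERS):
    """Wrap the URL and headers in a Request object."""
    return urllib.request.Request(url=url, headers=headers)

def ask_url(url, timeout=10):
    """Return the decoded page, or '' if the request raises URLError."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.URLError as err:
        # HTTPError is a subclass of URLError, so both are handled here.
        print("request failed:", err)
        return ""
```

Splitting out `build_request` keeps the header handling checkable without touching the network.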
