Python Web Crawlers: Regular Expressions

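Regular expressions are an efficient and elegant tool for matching strings, and they are well worth mastering. With a regular expression we can easily extract the content we want from a returned page.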

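1) Greedy and non-greedy modes

Python's quantifiers are greedy by default. A greedy quantifier always tries to match as many characters as possible; a non-greedy quantifier always tries to match as few as possible.

Non-greedy mode (for example .*? instead of .*) is generally the one to use when extracting content.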
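The difference is easiest to see on a small piece of markup. The following is a minimal sketch; the HTML snippet is made up for illustration and is not taken from any real page.

```python
import re

# Hypothetical snippet standing in for a downloaded page
html = '<b>first</b> and <b>second</b>'

# Greedy: .* runs on to the last </b> it can reach
greedy = re.findall(r'<b>(.*)</b>', html)

# Non-greedy: .*? stops at the first </b> after each <b>
lazy = re.findall(r'<b>(.*?)</b>', html)

print(greedy)  # ['first</b> and <b>second']
print(lazy)    # ['first', 'second']
```

The greedy version swallows everything between the first opening tag and the last closing tag, which is rarely what a crawler wants.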

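2) The backslash problem

Regular expressions use "\" as the escape character, and so do ordinary Python string literals, which causes trouble. To match a literal "\" in the text, a pattern written as an ordinary string literal needs four backslashes, "\\\\": Python's string parser reduces them to two, and the regex engine reads those two as one literal backslash. Python's raw strings solve this neatly: the pattern above can be written as r"\\", and "\\d", which matches a single digit, can be written as r"\d".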
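A short check of the claim above, as a sketch; the sample text is an assumption chosen only because it contains one backslash.

```python
import re

text = 'C:\\temp'  # the actual text is C:\temp, with a single backslash

# Ordinary string literal: four backslashes in the source code
plain = re.search('\\\\', text)
# Raw string: two backslashes are enough
raw = re.search(r'\\', text)

print(plain.group() == raw.group())  # True
print('\\d' == r'\d')                # True
digits = re.findall(r'\d', 'a1b22')
print(digits)                        # ['1', '2', '2']
```

Both patterns are the same two-character string once Python has parsed them; the raw string just saves the extra layer of escaping.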

The re module:

The main functions of the module:

# Returns a Pattern object
re.compile(pattern[, flags])
# The following functions perform the matching
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])

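pattern is a matching pattern:

pattern = re.compile(r'hello')

flags selects the matching mode; several flags can be combined with the bitwise OR operator '|' so that they take effect together: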

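• re.I (full name: re.IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
• re.M (full name: re.MULTILINE): multi-line mode, changes the behavior of '^' and '$' (see the figure above)
• re.S (full name: re.DOTALL): dot-matches-all mode, changes the behavior of '.'
• re.L (full name: re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale
• re.U (full name: re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode
• re.X (full name: re.VERBOSE): verbose mode. The pattern can span several lines, whitespace is ignored, and comments can be added.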
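To make the flags concrete, here is a small sketch that combines two of them with '|' and shows re.S and re.X on their own. The sample strings are invented for the example.

```python
import re

text = 'Hello\nhello\nHELLO world'

# re.I ignores case; re.M lets ^ and $ match at every line boundary
pattern = re.compile(r'^hello$', re.I | re.M)
lines = pattern.findall(text)
print(lines)  # ['Hello', 'hello']

# Without re.S, '.' does not match a newline
no_dotall = re.findall(r'Hello.hello', text)
with_dotall = re.findall(r'Hello.hello', text, re.S)
print(no_dotall)    # []
print(with_dotall)  # ['Hello\nhello']

# re.X allows whitespace and comments inside the pattern
number = re.compile(r'''
    \d+   # integer part
    \.    # decimal point
    \d+   # fractional part
''', re.X)
decimals = number.findall('pi is 3.14, e is 2.718')
print(decimals)  # ['3.14', '2.718']
```

The third line of the sample text is not returned by the first pattern because ' world' follows 'HELLO' before the end of the line.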

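re.match tries to match pattern from the beginning of the string and returns a Match object. A Match object represents the result of one match and carries a good deal of information about it:

# -*- coding: utf-8 -*-
# Import the re module
import re
# Compile the regular expression into a Pattern object; the r before 'hello' marks a raw string
pattern = re.compile(r'hello')
# Match the text with re.match; it returns None when the text does not match
result1 = re.match(pattern, 'hello')
if result1:
    # Use the Match object to get the group information
    print(result1.group())
else:
    print('Match 1 failed!')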

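Attributes of a Match object:

1. string: the text used for matching.

2. re: the Pattern object used for matching.

3. pos: the index in the text at which the regular expression started searching. It has the same value as the parameter of the same name in Pattern.match() and Pattern.search().

4. endpos: the index in the text at which the regular expression stopped searching. It has the same value as the parameter of the same name in Pattern.match() and Pattern.search().

5. lastindex: the number of the last captured group. None if no group was captured.

6. lastgroup: the name of the last captured group. None if that group has no name or no group was captured.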
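The attributes can be inspected directly. This is a sketch with an assumed pattern that has one unnamed and one named group (the name second is made up for the example).

```python
import re

pattern = re.compile(r'(\w+) (?P<second>\w+)')
m = pattern.match('hello world!')

print(m.string)         # hello world!
print(m.re is pattern)  # True
print(m.pos, m.endpos)  # 0 12
print(m.lastindex)      # 2
print(m.lastgroup)      # second

# pos reflects the start index passed to Pattern.search()
m2 = pattern.search('say hello world!', 4)
print(m2.pos, m2.group())  # 4 hello world
```

Note that endpos defaults to the length of the text, not to the end of the matched substring.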

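Methods:

1. group([group1, …]):

Returns the string captured by one or more groups; with several arguments the result is a tuple. group1 can be a group number or a group name; number 0 stands for the whole matched substring; with no arguments it returns group(0); a group that captured nothing returns None; a group that captured several times returns its last capture.

2. groups([default]):

Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, …last). default is the value substituted for groups that captured nothing; it defaults to None.

3. groupdict([default]):

Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; unnamed groups are not included. default has the same meaning as above.

4. start([group]):

Returns the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.

5. end([group]):

Returns the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.

6. span([group]):

Returns (start(group), end(group)).

7. expand(template):

Substitutes the captured groups into template and returns the result. template can refer to groups with \id, \g<id> or \g<name>; \0 cannot be used for the whole match (write \g<0> instead). \id and \g<id> are equivalent, but \10 is read as group 10, so to express \1 followed by the character '0' you have to write \g<1>0.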
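The methods above can be tried on a single match. The following is a sketch; the pattern and the group name sign are assumptions chosen for illustration.

```python
import re

m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello world!')

print(m.group())             # hello world!
print(m.group(1, 2))         # ('hello', 'world')
print(m.groups())            # ('hello', 'world', '!')
print(m.groupdict())         # {'sign': '!'}
print(m.start(2), m.end(2))  # 6 11
print(m.span(2))             # (6, 11)
print(m.expand(r'\2 \1\3'))  # world hello!
print(m.expand(r'\g<1>0'))   # hello0

# default replaces groups that did not take part in the match
m2 = re.match(r'(\d+)(\.\d+)?', '42')
print(m2.groups())           # ('42', None)
print(m2.groups(''))         # ('42', '')
```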

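Unlike match(), which only checks for a match at the start of string, re.search() scans the whole string looking for a match.

# Import the re module
import re
# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')
# Use search() to find a matching substring; it returns None when no substring matches
# match() would fail to match in this example
match = re.search(pattern, 'hello world!')
if match:
    # Use the Match object to get the group information
    print(match.group())
### Output ###
# world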

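re.split() splits string at every substring that matches and returns the pieces as a list. The maxsplit parameter sets the maximum number of splits; if it is omitted, the string is split at every match.

import re
pattern = re.compile(r'\d+')
print(re.split(pattern, 'one1two2three3four4'))
### Output ###
# ['one', 'two', 'three', 'four', '']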
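The example above does not use maxsplit. As a small sketch of its effect on the same input:

```python
import re

# Stop after two splits; the rest of the string is returned unsplit
parts = re.split(r'\d+', 'one1two2three3four4', maxsplit=2)
print(parts)  # ['one', 'two', 'three3four4']
```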

re.findall() returns all matching substrings as a list.

import re
pattern = re.compile(r'\d+')
print(re.findall(pattern, 'one1two2three3four4'))
### Output ###
# ['1', '2', '3', '4']
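For crawling, findall() is often combined with capture groups and non-greedy quantifiers. When the pattern contains several groups, each result is a tuple of the captured strings. A minimal sketch, assuming a made-up HTML fragment rather than a real page:

```python
import re

# Hypothetical fragment standing in for a downloaded page
html = '<a href="/page/1">First</a> <a href="/page/2">Second</a>'

# Two groups: the link target and the link text
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(links)  # [('/page/1', 'First'), ('/page/2', 'Second')]

for url, title in links:
    print(url, title)
```

A pattern like this only suits markup whose shape is known in advance; it is not a general HTML parser.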

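re.finditer() returns an iterator that yields each match result in order.

import re
pattern = re.compile(r'\d+')
for m in re.finditer(pattern, 'one1two2three3four4'):
    # Print the matches on one line, separated by spaces
    print(m.group(), end=' ')
### Output ###
# 1 2 3 4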

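re.sub(pattern, repl, string[, count]) replaces every match in string with repl and returns the resulting string. repl can be a string or a function that takes a Match object. count sets the maximum number of replacements; when it is omitted, all matches are replaced.

import re
pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'
# Swap the two words of each pair using group references
print(re.sub(pattern, r'\2 \1', s))
def func(m):
    # Capitalize the first letter of both captured words
    return m.group(1).title() + ' ' + m.group(2).title()
print(re.sub(pattern, func, s))
### output ###
# say i, world hello!
# I Say, Hello World!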

re.subn(pattern, repl, string[, count]) returns the tuple (sub(pattern, repl, string[, count]), number of replacements made).

import re
pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'
print(re.subn(pattern, r'\2 \1', s))
def func(m):
    # Capitalize the first letter of both captured words
    return m.group(1).title() + ' ' + m.group(2).title()
print(re.subn(pattern, func, s))
### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)
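Neither example above uses count. The following sketch shows how it limits the replacements, with subn() on the same input for comparison; the replacement string '-' is an arbitrary choice.

```python
import re

text = 'one1two2three3four4'

# Replace only the first two runs of digits
limited = re.sub(r'\d+', '-', text, count=2)
print(limited)  # one-two-three3four4

# subn() also reports how many replacements were made
result, n = re.subn(r'\d+', '-', text)
print(result, n)  # one-two-three-four- 4
```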
