Web Crawler Notes Series (1): Regular Expressions with the re Library

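Unit 1: Regular expressions (re)

Regular expression: also abbreviated as regex or RE.

A regular expression is a concise way to describe a set of strings. It is used for searching, replacing, and matching.

It can also express the "features" of a piece of text, such as the signature of a virus or an intrusion.

Usage and syntax: a regular expression is written by combining characters with operators.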

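Operator   Meaning
.          any single character
[ ]        character set: any one of the listed characters
[^ ]       negated character set: any one character that is not listed
*          previous character repeated 0 or more times
+          previous character repeated 1 or more times
?          previous character repeated 0 or 1 time
|          either the expression on the left or the one on the right
{m}        previous character repeated exactly m times
{m,n}      previous character repeated m to n times (n inclusive)
^          start of the string
$          end of the string
( )        grouping; | is the operator typically used inside, e.g. (abc|def)
\d         a digit
\w         a word character (letter, digit, or underscore)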
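A few of these operators can be tried directly in Python. The following is a minimal sketch; the sample strings and variable names are made up for illustration and do not come from the original notes.

```python
import re

# . matches any single character
dot = re.findall(r'p.t', 'pat pet p-t pt')
# [ae] matches one character from the set; [^ae] matches one character outside it
in_set = re.findall(r'p[ae]t', 'pat pet pit')
not_in_set = re.findall(r'p[^ae]t', 'pat pet pit')
# * allows zero or more repeats, + one or more, ? zero or one
star = re.findall(r'ab*c', 'ac abc abbc')
plus = re.findall(r'ab+c', 'ac abc abbc')
optional = re.findall(r'ab?c', 'ac abc abbc')
# {m} and {m,n} fix how many times the previous character repeats
exact = re.findall(r'ab{2}c', 'abc abbc abbbc')
ranged = re.findall(r'ab{1,2}c', 'abc abbc abbbc')

print(dot, in_set, not_in_set)
print(star, plus, optional)
print(exact, ranged)
```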

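Examples: classic expressions

^[A-Za-z]+$ a string made up only of the 26 letters

^[A-Za-z0-9]+$ a string made up only of the 26 letters and digits

^-?\d+$ an integer

[\u4e00-\u9fa5] matches a Chinese character (Unicode code points U+4E00 to U+9FA5)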
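The classic expressions above can be checked with a short script. This is only a sketch: the helper name is_match and the test strings are assumptions introduced here for illustration.

```python
import re

# Classic patterns from the list above
LETTERS = re.compile(r'^[A-Za-z]+$')
ALNUM = re.compile(r'^[A-Za-z0-9]+$')
INTEGER = re.compile(r'^-?\d+$')
CHINESE = re.compile(r'[\u4e00-\u9fa5]')

def is_match(pattern, text):
    """Return True if the compiled pattern matches somewhere in text."""
    return pattern.search(text) is not None

print(is_match(LETTERS, 'Python'))    # True
print(is_match(LETTERS, 'Python3'))   # False
print(is_match(ALNUM, 'Python3'))     # True
print(is_match(INTEGER, '-42'))       # True
print(is_match(INTEGER, '4.2'))       # False
print(CHINESE.findall('hello 世界'))   # ['世', '界']
```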

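Unit 2: The re library

Introduction to the re library: import re

The re library writes regular expressions as raw strings (strings in which a backslash is not interpreted as an escape character), in the form r'text'.

For example: r'[1-9]\d{5}'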
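The difference between a raw string and an ordinary string can be seen in a small experiment. This is a sketch only; the six-digit postal code pattern r'[1-9]\d{5}' and the variable names are used here for illustration.

```python
import re

# In an ordinary string the backslash must be doubled; in a raw string it is written once
normal = '[1-9]\\d{5}'
raw = r'[1-9]\d{5}'
same_pattern = (normal == raw)

# '\n' is a single newline character, while r'\n' is a backslash followed by the letter n
lengths = (len('\n'), len(r'\n'))

code = re.search(raw, 'bit 100086').group(0)

print(same_pattern)  # True
print(lengths)       # (1, 2)
print(code)          # 100086
```

Raw strings keep patterns readable because every backslash in the pattern appears exactly once.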

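I. Main functions

1. re.search(pattern, string, flags=0)

Searches the string for the first location where the regular expression matches and returns a match object.

import re
# Find the first six-digit postal code in the string
match = re.search(r'[1-9]\d{5}', 'bit 100086')
if match:
    print(match.group(0))
# Result: 100086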

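2. re.match(pattern, string, flags=0)

Matches the regular expression starting at the beginning of the string and returns a match object.

import re
# The string starts with letters, so nothing matches at position 0
match = re.match(r'[1-9]\d{5}', 'buaa 100086')
if match:
    print(match.group(0))
# No match: re.match returns None here. Without the "if match:" check,
# match.group(0) would raise AttributeError: 'NoneType' object has no attribute 'group'

# The string starts with the postal code, so the match succeeds
match = re.match(r'[1-9]\d{5}', '100086 buaa')
if match:
    print(match.group(0))
# Result: 100086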
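The contrast between re.search and re.match is easiest to see when both are applied to the same input. A minimal sketch, with variable names chosen here for illustration:

```python
import re

pattern = r'[1-9]\d{5}'
text = 'buaa 100086'

found_by_search = re.search(pattern, text)  # scans the whole string
found_by_match = re.match(pattern, text)    # only tries the start of the string

print(found_by_search.group(0))  # 100086
print(found_by_match)            # None
```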

3. re.findall(pattern, string, flags=0)

Searches the string and returns every matching substring as a list.

>>> ls = re.findall(r'[0-9]\d{5}', 'buaa 100086 bit 100081')
>>> ls
['100086', '100081']

4. re.split(pattern, string, maxsplit=0, flags=0)

Splits the string at every match of the regular expression and returns a list.

>>> re.split(r'[0-9]\d{5}', 'buaa 100086 bit 100081')
['buaa ', ' bit ', '']
>>> re.split(r'[0-9]\d{5}', 'buaa 100086 bit 100081', maxsplit=1)
['buaa ', ' bit 100081']

5. re.finditer(pattern, string, flags=0)

Searches the string and returns an iterator over the matches; each item it yields is a match object.

import re
for m in re.finditer(r'[0-9]\d{5}', 'buaa 100086 bit 100081'):
    if m:
        print(m.group(0))
# Result:
# 100086
# 100081

6. re.sub(pattern, repl, string, count=0, flags=0)

Replaces every substring that matches the regular expression and returns the resulting string.

>>> re.sub(r'[0-9]\d{5}', ':zipcode', 'buaa 100086 bit 100081')
'buaa :zipcode bit :zipcode'
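The count parameter in the signature limits how many matches are replaced. A short sketch, with variable names chosen here for illustration:

```python
import re

text = 'buaa 100086 bit 100081'

# count=1 replaces only the first match; the default count=0 replaces all of them
first_only = re.sub(r'[0-9]\d{5}', ':zipcode', text, count=1)
all_matches = re.sub(r'[0-9]\d{5}', ':zipcode', text)

print(first_only)   # buaa :zipcode bit 100081
print(all_matches)  # buaa :zipcode bit :zipcode
```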

Usage summary:

Functional usage: a one-off operation

rst = re.search(r'[1-9]\d{5}', 'buaa 100086')

Object-oriented usage: compile once, then reuse many times

pat = re.compile(r'[1-9]\d{5}')  # pattern object
rst = pat.search('buaa 100086')

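7. re.compile(pattern, flags=0)

Compiles the string form of a regular expression (which is not yet a regular expression object) into a regular expression object, called a pattern object.

Methods of the pattern object:

regex = re.compile(r'[1-9]\d{5}')

The regex object then provides six methods that correspond to the re functions described above.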
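The six corresponding methods can be called on one compiled pattern as follows. This is a sketch that reuses the sample string from the earlier sections; the result variable names are assumptions made here.

```python
import re

regex = re.compile(r'[1-9]\d{5}')
text = 'buaa 100086 bit 100081'

searched = regex.search(text).group(0)               # first match anywhere
matched = regex.match(text)                          # None: text does not start with digits
found = regex.findall(text)                          # all matches as a list
parts = regex.split(text)                            # pieces between the matches
iterated = [m.group(0) for m in regex.finditer(text)]
replaced = regex.sub(':zipcode', text)

print(searched, matched)
print(found, parts)
print(iterated, replaced)
```

Compiling once is mainly useful when the same pattern is applied to many strings, for example to every page fetched by a crawler.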

II. The match object of the re library

The match object:

Attributes of the match object:

Methods of the match object:
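As an illustration, here are some commonly used attributes and methods of match objects in the standard library. The selection is an assumption made for this sketch and is not taken from the original notes.

```python
import re

m = re.search(r'[1-9]\d{5}', 'bit 100086')

# Attributes
print(m.string)   # the text that was searched: bit 100086
print(m.re)       # the compiled pattern object that produced the match
print(m.pos)      # start of the search range: 0
print(m.endpos)   # end of the search range: 10

# Methods
print(m.group(0)) # the matched substring: 100086
print(m.start())  # index where the match starts: 4
print(m.end())    # index just past the end of the match: 10
print(m.span())   # (start, end) as a tuple: (4, 10)
```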

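III. Greedy and minimal matching in the re library

Example: greedy matching (longest match, the default)

match = re.search(r'py.*n', 'pyanbn***n')  # three substrings could match; re uses greedy matching by default
match.group(0)  # 'pyanbn***n'

Minimal matching: extended operators

*? previous character repeated 0 or more times, minimal match

+? previous character repeated 1 or more times, minimal match

?? previous character repeated 0 or 1 time, minimal match

{m,n}? previous character repeated m to n times (n inclusive), minimal match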
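The effect of the minimal-match operators can be compared with the greedy default on the same input. A minimal sketch; the string 'aaaa' and the variable names are made up for illustration.

```python
import re

text = 'pyanbn***n'

# Greedy: .* takes as much as possible, so the match runs to the last n
greedy = re.search(r'py.*n', text).group(0)
# Minimal: .*? takes as little as possible, so the match stops at the first n
minimal = re.search(r'py.*?n', text).group(0)

# The same idea applies to {m,n} and {m,n}?
greedy_range = re.search(r'a{2,4}', 'aaaa').group(0)
minimal_range = re.search(r'a{2,4}?', 'aaaa').group(0)

print(greedy, minimal)              # pyanbn***n pyan
print(greedy_range, minimal_range)  # aaaa aa
```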
