Web Crawlers: Parsing Web Page Data (Regular Expressions with re)

Regular expression test

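import re

title = '你好，hello，世界,天安門，願望'
# [\u4e00-\u9fa5] covers the common CJK ideograph range, so each match is a run
# of Chinese characters; Latin letters and punctuation act as separators.
pattern = re.compile('[\u4e00-\u9fa5]+')
result = pattern.findall(title)
print(result)  # ['你好', '世界', '天安門', '願望']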

Greedy and non-greedy matching

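import re

# The HTML tags in the sample string and in the pattern did not survive in the
# source. With only the bare lazy group left, search() matches the empty string
# at position 0.
text = 'aatest1\nbbtest2\ncc'
p = re.compile(r'(.*?)')
m = p.search(text)
print(m, m.group())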
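Because the markup in the snippet above was lost, the difference between the two modes is easier to see on a small stand-in. The following is a minimal sketch, assuming a made-up string with square-bracket delimiters (both the string and the delimiters are hypothetical, not taken from the original page). The greedy `.*` runs to the last closing bracket, while the lazy `.*?` stops at the first one.

```python
import re

# Hypothetical sample text; the square-bracket delimiters are an assumption
# chosen only to make the two modes easy to compare.
text = "[aa][bb][cc]"

# Greedy: .* consumes as much as possible, then backtracks to the LAST ']'.
greedy = re.findall(r"\[(.*)\]", text)

# Non-greedy: .*? consumes as little as possible, stopping at the FIRST ']'.
lazy = re.findall(r"\[(.*?)\]", text)

print(greedy)  # ['aa][bb][cc']
print(lazy)    # ['aa', 'bb', 'cc']
```

When scraping repeated elements such as table rows, the non-greedy form is usually the one wanted, since each match then covers a single element.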

Regex example

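import re
import requests
import time
import random
import threading

# NOTE: the target URL, the request headers, the proxies dict and the HTML tags
# inside the regex patterns are missing from the source. Restore them for the
# page being scraped before running; every stand-in below is marked as an assumption.
url = ''   # empty in the source; must contain one %d placeholder for the page number
url1 = ''  # assumption: URL used to verify a proxy; never defined in the source


def get_proxies(proxies):
    # Pick one known proxy at random for this crawl.
    host, port, protocol = random.choice(proxies)
    headers = {}  # header values are missing in the source
    # Assumption: the proxies dict is missing in the source; requests expects a
    # {scheme: 'host:port'} mapping.
    proxy = {protocol: '%s:%s' % (host, port)}
    fp = open('./proxies.txt', mode='a', encoding='utf-8')
    for i in range(10, 20):
        response = requests.get(url=url % (i), headers=headers, proxies=proxy)
        response.encoding = 'utf-8'
        html = response.text
        # with open('./xici.html', mode='w', encoding='utf-8') as f:
        #     f.write(html)
        # re.S lets '.' match newlines, so one pattern can span a whole table row.
        result = re.findall(r'(.*?)', html, flags=re.S)
        # Sample row (tags stripped):
        #   182.35.80.136 / 9999 / Tai'an, Shandong / high anonymity / http /
        #   1 minute / 19-10-29 13:20
        print('----------------', len(result))
        for item in result[1:]:  # skip the first match
            try:
                ip = re.findall(r'([\d\.]*)', item, re.S)
                proxy_type = re.findall(r'([a-z]+)', item, re.S)
                fp.write('%s,%s,%s\n' % (ip[0], ip[1], proxy_type[0]))
            except Exception as e:
                # Log rows that could not be parsed, together with the error.
                with open('./log.txt', mode='a', encoding='utf-8') as f:
                    f.write(item + '\n' + str(e) + '\n')
        print('Page %d scraped successfully!' % (i))
        time.sleep(random.randint(1, 3))
    fp.close()


num = 0
fp = open('./proxies.txt', 'r', encoding='utf-8')
fp2 = open('./verified_proxie.txt', 'a', encoding='utf-8')


def verify_proxy():
    global num
    while True:
        line = fp.readline().strip('\n')
        if line != '':
            try:
                ip, port, protocol = line.split(',')
            except ValueError:
                print('------------------------------', line)
                continue  # skip malformed lines
            proxy = {protocol: '%s:%s' % (ip, port)}  # assumption, as above
            # If the target URL is https, the proxy must be https too; on a
            # mismatch the request bypasses the proxy and goes out directly.
            # If the target URL is http, the proxy must be http as well.
            if protocol == 'https':  # assumption: the condition is missing in the source
                try:
                    requests.get(url1, proxies=proxy, timeout=3)
                    print('Proxy %s:%s passed verification' % (ip, port))
                    fp2.write('%s,%s,%s\n' % (ip, port, protocol))
                    num += 1
                except Exception as e:
                    print('Proxy %s:%s failed verification' % (ip, port))
            else:
                try:
                    requests.get(url1, proxies=proxy, timeout=3)
                    print('Proxy %s:%s passed verification' % (ip, port))
                    fp2.write('%s,%s,%s\n' % (ip, port, protocol))
                    num += 1
                except Exception as e:
                    print('Proxy %s:%s failed verification' % (ip, port))
        else:
            break
    return num


if __name__ == '__main__':
    with open('./verified_proxie.txt', mode='r', encoding='utf-8') as f:
        proxies = f.readlines()
        proxies = [proxy.strip('\n').split(',') for proxy in proxies]
    print(proxies)
    get_proxies(proxies)
    # threads = []
    # for i in range(1000):
    #     t = threading.Thread(target=verify_proxy)
    #     t.start()
    #     threads.append(t)
    # # join must run in its own loop, so that every thread is started first
    # for t in threads:
    #     t.join()
    # print('----------------- all worker threads finished; main thread continues')
    # fp.close()
    # fp2.close()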
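Since the patterns in the listing above lost their HTML tags, here is a minimal offline sketch of the same row-extraction idea. It assumes a hypothetical table layout (`SAMPLE_HTML`, with one `<td>` each for IP, port and type); the real page's markup may differ. The helper names `parse_rows` and `to_requests_proxies` are made up for illustration, and the second address in the sample is invented.

```python
import re

# Hypothetical markup standing in for one page of a proxy listing.
SAMPLE_HTML = """
<table>
  <tr><th>IP</th><th>Port</th><th>Type</th></tr>
  <tr>
    <td>182.35.80.136</td>
    <td>9999</td>
    <td>HTTP</td>
  </tr>
  <tr>
    <td>10.0.0.1</td>
    <td>8080</td>
    <td>HTTPS</td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return (ip, port, protocol) tuples found in the table rows."""
    # re.S makes '.' match newlines, so a row may span several lines;
    # the lazy .*? keeps each match inside a single <tr>...</tr> pair.
    rows = re.findall(r"<tr>(.*?)</tr>", html, flags=re.S)
    records = []
    for row in rows[1:]:  # skip the header row
        cells = re.findall(r"<td>(.*?)</td>", row, flags=re.S)
        if len(cells) >= 3:
            ip, port, protocol = (c.strip() for c in cells[:3])
            records.append((ip, port, protocol.lower()))
    return records


def to_requests_proxies(ip, port, protocol):
    """Build the scheme-keyed mapping that requests accepts as `proxies`."""
    return {protocol: "%s://%s:%s" % (protocol, ip, port)}


records = parse_rows(SAMPLE_HTML)
print(records)
print(to_requests_proxies(*records[0]))
```

Parsing a fixed string first makes it possible to check the patterns before any request is sent; the resulting dict can then be passed as `proxies=` to `requests.get`.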
