Python Crawler Notes

Names in the urllib package, as dir(urllib) lists them after urllib.request has been imported:

['__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'error', 'parse', 'request', 'response']

Using request

1. The simplest way to send a request is the urlopen method, as follows:

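import urllib.request

# The URL is empty in the original notes; put the address of the page to fetch here.
response = urllib.request.urlopen('')
# read() returns bytes, so decode them into text.
result = response.read().decode('utf-8')
print(result)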
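The urlopen call can fail when the network or the URL is bad, so a slightly more defensive version may be useful. The following is only a minimal sketch: the helper name fetch_text, the 10-second timeout, and the data: URL used for the demo are assumptions made for illustration, not part of the original notes.

```python
import urllib.error
import urllib.request


def fetch_text(url, encoding='utf-8', timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        # The with block closes the response once the body has been read.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding)
    except urllib.error.URLError as exc:
        # URLError also covers HTTPError, its subclass.
        print('request failed:', exc)
        return None


# A data: URL keeps the demo offline; use a real http(s) address in practice.
result = fetch_text('data:text/plain;charset=utf-8,hello%20crawler')
print(result)
```

Returning None on failure is one possible design; a crawler that needs to tell error types apart might prefer to let the exception propagate.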

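IV. Introduction to XPath

XPath is the XML Path Language, a language for locating parts of an XML document (XML being a subset of the Standard Generalized Markup Language). It works on the tree structure of XML, which has several node types, including element nodes, and provides the ability to find nodes within that tree.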
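To make the tree idea concrete, here is a small sketch that needs no third-party install. It uses the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, instead of lxml; the sample document and its tag names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the original notes.
xml_text = """
<library>
  <book id="b1"><title>First</title></book>
  <book id="b2"><title>Second</title></book>
</library>
"""

root = ET.fromstring(xml_text)

# Element nodes: every <title> anywhere below the root.
titles = [node.text for node in root.findall('.//title')]

# Attribute predicate: the <book> whose id attribute equals "b2".
second = root.find(".//book[@id='b2']")

print(titles)
print(second.find('title').text)
```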

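1. Getting the content of a tag: get the content of every a tag

Approach 1:

from lxml import etree

# wb_data is the HTML source to parse, as a string.
html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a')
print(html)
# Each match is an element; .text holds the text inside the tag.
for i in html_data:
    print(i.text)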

Result:

first item
second item
third item
fourth item
fifth item

Approach 2:

html = etree.HTML(wb_data)
# text() selects the text nodes directly, so each match is already a string.
html_data = html.xpath('/html/body/div/ul/li/a/text()')
print(html)
for i in html_data:
    print(i)

Result:

first item
second item
third item
fourth item
fifth item
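The two approaches can be compared side by side in one self-contained script. This is a sketch under assumptions: the notes never show what wb_data contained, so the markup below is a hypothetical stand-in, and the @href query is an extra illustration that is not in the original.

```python
from lxml import etree

# Hypothetical sample markup; the notes do not show the real wb_data.
wb_data = """
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0"><a href="link3.html">third item</a></li>
  </ul>
</div>
"""

# etree.HTML wraps the fragment in <html><body>, so the absolute path matches.
html = etree.HTML(wb_data)

# Approach 1: select the <a> elements, then read .text from each one.
via_elements = [a.text for a in html.xpath('/html/body/div/ul/li/a')]

# Approach 2: select the text nodes directly with text().
via_text_nodes = html.xpath('/html/body/div/ul/li/a/text()')

# Attributes are nodes too and can be selected with @name.
links = html.xpath('/html/body/div/ul/li/a/@href')

print(via_elements)
print(via_text_nodes)
print(links)
```

Both approaches return the same strings here; the first is handier when the element itself is needed later, for example to read its attributes.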

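2. Opening and reading an HTML file

# Use parse to open the HTML file. By default parse expects well-formed XML, so an HTMLParser is passed for ordinary HTML.
html = etree.parse('test.html', etree.HTMLParser())
html_data = html.xpath('//*')
# The result is a list, so iterate over it.
print(html_data)
for i in html_data:
    print(i.text)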
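Reading from a file can be tried without preparing test.html by hand. The sketch below writes a small page to a temporary directory first; the page content and the variable names are assumptions made for the demo.

```python
import os
import tempfile

from lxml import etree

# Hypothetical page content, written to a temporary file for the demo.
page = ('<html><body><ul>'
        '<li><a href="link1.html">first item</a></li>'
        '</ul></body></html>')

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.html')
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(page)

    # HTMLParser tolerates markup that is not well-formed XML.
    tree = etree.parse(path, etree.HTMLParser())

# '//*' matches every element in the document, in document order.
tags = [node.tag for node in tree.xpath('//*')]
print(tags)
```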
