Python Web Crawler Reading Notes (3)

1. Parsing robots.txt

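The robotparser module first loads the robots.txt file, and its can_fetch() function then determines whether the specified user agent is allowed to access a given web page.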
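As a minimal sketch of this check, assuming Python 3 (where the module is available as urllib.robotparser), the example below parses robots.txt rules from an in-memory string instead of fetching them over the network. The rules, the example URL, and the agent names BadCrawler and GoodCrawler are illustrative assumptions, not taken from a real site.

```python
from urllib import robotparser

# Illustrative robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
# parse() accepts an iterable of lines, so no network request is needed here;
# against a live site you would call rp.set_url(...) followed by rp.read().
rp.parse(ROBOTS_TXT.splitlines())

url = 'http://example.webscraping.com'
bad_allowed = rp.can_fetch('BadCrawler', url)              # agent is disallowed everywhere
good_allowed = rp.can_fetch('GoodCrawler', url)            # falls under the "*" rules
trap_allowed = rp.can_fetch('GoodCrawler', url + '/trap')  # path is disallowed for "*"
print(bad_allowed, good_allowed, trap_allowed)
```

Parsing from a string keeps the sketch self-contained; the decision logic in can_fetch() is the same as when the rules are downloaded.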

To integrate this into the crawler, we add the check inside the crawl loop.

while crawl_queue:
    url = crawl_queue.pop()
    # check whether the URL passes the robots.txt restrictions
    if rp.can_fetch(user_agent, url):
        ...
    else:
        print 'Blocked by robots.txt:', url

2. Supporting proxies

import urllib2
import urlparse

def download(url, user_agent='wswp', proxy=None, num_retries=2):
    print 'Downloading:', url
    headers = {'User-agent': user_agent}
    request = urllib2.Request(url, headers=headers)
    opener = urllib2.build_opener()
    if proxy:
        # route requests for this URL's scheme through the proxy
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
    except urllib2.URLError as e:
        print 'Download error:', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry only on 5xx server errors
                html = download(url, user_agent, proxy, num_retries - 1)
    return html
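For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the proxy setup from download() might look like the sketch below. The helper name build_proxy_opener and the proxy address are hypothetical, and the sketch only builds the opener without making any network request.

```python
import urllib.request
from urllib.parse import urlparse


def build_proxy_opener(url, proxy=None):
    """Return (opener, handler); handler is None when no proxy is given.

    build_proxy_opener is a hypothetical helper for this sketch.
    """
    opener = urllib.request.build_opener()
    handler = None
    if proxy:
        # map the URL's scheme (http or https) to the proxy address
        handler = urllib.request.ProxyHandler({urlparse(url).scheme: proxy})
        opener.add_handler(handler)
    return opener, handler


# The proxy address below is a placeholder, not a real server.
opener, handler = build_proxy_opener('http://example.webscraping.com',
                                     proxy='http://127.0.0.1:8080')
print(handler.proxies)
```

Calling opener.open(request) would then send the request through the configured proxy, in the same way as in the download() function.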

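3. Throttling downloads

import datetime
import time
import urlparse

class Throttle:
    """Add a delay between downloads to the same domain."""
    def __init__(self, delay):
        # minimum delay between downloads for each domain
        self.delay = delay
        # timestamp of when each domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urlparse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (datetime.datetime.now() - last_accessed).seconds
            if sleep_secs > 0:
                # the domain was accessed recently, so sleep
                time.sleep(sleep_secs)
        # update the last-accessed time
        self.domains[domain] = datetime.datetime.now()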

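The Throttle class records when each domain was last accessed. If the time since the last access is shorter than the specified delay, it sleeps for the remainder. To throttle the crawler, call wait() before every download: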

throttle = Throttle(delay)

throttle.wait(url)
result = download(url, headers, proxy=proxy, num_retries=num_retries)
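The per-domain delay described above can be sketched in Python 3 as follows. This is an illustration under stated assumptions rather than the book's code: it uses time.monotonic instead of datetime so that fractional seconds are respected, and the clock and sleep parameters are hypothetical hooks added only so that the behaviour can be checked without really waiting.

```python
import time
from urllib.parse import urlparse


class Throttle:
    """Delay successive requests to the same domain."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay   # minimum seconds between requests to one domain
        self.clock = clock   # hypothetical hook: returns the current time in seconds
        self.sleep = sleep   # hypothetical hook: pauses for the given seconds
        self.domains = {}    # domain -> time of the last access

    def wait(self, url):
        domain = urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            sleep_secs = self.delay - (self.clock() - last_accessed)
            if sleep_secs > 0:
                # the domain was accessed too recently, so pause
                self.sleep(sleep_secs)
        # record the time of this access
        self.domains[domain] = self.clock()


# Demonstration with a fake clock, so nothing actually sleeps.
now = [0.0]
slept = []

def fake_sleep(seconds):
    slept.append(seconds)
    now[0] += seconds

throttle = Throttle(2, clock=lambda: now[0], sleep=fake_sleep)
throttle.wait('http://example.webscraping.com/a')  # first visit: no delay
now[0] += 0.5                                      # half a second passes
throttle.wait('http://example.webscraping.com/b')  # same domain: waits the rest
throttle.wait('http://other.example.com/')         # different domain: no delay
print(slept)
```

In a real crawler the default arguments would be used, that is Throttle(delay) with no hooks, exactly as in the calls shown earlier.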

4. Avoiding crawler traps

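A simple way to avoid getting stuck in a crawler trap is to track how many links were followed to reach the current page, which we call its depth. Once the maximum depth is reached, the crawler stops adding links from that page to the queue. To implement this, we change the seen variable: it previously recorded only the links that had been visited, and it now becomes a dictionary that also records the depth of each page.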

def link_crawler(..., max_depth=2):
    seen = {}
    ...
    depth = seen[url]
    if depth != max_depth:
        for link in links:
            if link not in seen:
                seen[link] = depth + 1
                crawl_queue.append(link)

To disable this feature, set max_depth to a negative number; the current depth will then never equal it.
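The depth limit can be tried out on in-memory data with a sketch like the one below. The function crawl_with_depth, the get_links callback, and both sets of test links are assumptions made for illustration; a real crawler would download each page and extract its links instead.

```python
def crawl_with_depth(seed_url, get_links, max_depth=2):
    """Return a dict mapping each discovered URL to the depth it was found at."""
    crawl_queue = [seed_url]
    seen = {seed_url: 0}  # URL -> number of links followed to reach it
    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        # at the maximum depth, links on this page are not queued
        if depth != max_depth:
            for link in get_links(url):
                if link not in seen:
                    seen[link] = depth + 1
                    crawl_queue.append(link)
    return seen


def trap_links(url):
    # simulated crawler trap: every page links to one more new page, forever
    return [url + '/next']


# a small finite site, expressed as page -> links on that page
site = {'/': ['/a'], '/a': ['/b'], '/b': ['/c'], '/c': []}

limited = crawl_with_depth('/start', trap_links, max_depth=2)
unlimited = crawl_with_depth('/', site.get, max_depth=-1)
print(limited)
print(unlimited)
```

With max_depth=2 the crawl of the simulated trap stops after three pages, while the negative value lets the crawl of the finite site reach every page regardless of depth.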
