My First Python Crawler Program!

I have been following Udacity's CS101 course. Today I finished Unit 3 and wrote a crawler program:

import urllib2

def get_next_target(page):
    # Find the next anchor tag; return (None, 0) when there is none left.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    # The URL sits between the first pair of double quotes after the tag.
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            # Continue searching after the link just found.
            page = page[endpos:]
        else:
            break
    return links

def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    while len(tocrawl) > 0:
        link = tocrawl.pop()
        if link not in crawled:
            page = urllib2.urlopen(link).read()
            tocrawl = tocrawl + get_all_links(page)
            crawled.append(link)
    return crawled

link = ''
## page = urllib2.urlopen('').read()
## links = get_all_links(page)
print crawl_web(link)
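The link-extraction step can be tried on its own, without any network access. The following is a minimal sketch in Python 3 syntax, assuming that links appear in the page as `<a href="...">`; the sample page and its example.com URLs are made up for illustration.

```python
def get_next_target(page):
    """Return (url, end_quote) for the next link in page, or (None, 0)."""
    start_link = page.find('<a href=')  # assumed link marker
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every URL found by repeatedly calling get_next_target."""
    links = []
    while True:
        url, endpos = get_next_target(page)
        if not url:
            break
        links.append(url)
        page = page[endpos:]  # keep searching after the link just found
    return links


# Hypothetical page content, used only to exercise the functions.
sample_page = (
    '<html><body>'
    '<a href="http://example.com/a.html">A</a>'
    '<p>no link here</p>'
    '<a href="http://example.com/b.html">B</a>'
    '</body></html>'
)
print(get_all_links(sample_page))
```

Each call to get_next_target returns the position of the closing quote, so slicing the page at that position guarantees that the same link is never found twice.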

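A few points to note:

1. To fetch the HTML source behind a URL, use urllib2.urlopen(link).read() from the urllib2 package.

2. The code uses pop(), which removes the last element of the list, so the most recently discovered link is crawled first. The crawl is therefore depth-first.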
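Point 2 can be checked on a small in-memory link graph instead of real pages. The sketch below is in Python 3 syntax; the function crawl_order, its depth_first flag, and the site dictionary are all hypothetical names introduced for the illustration, and the dictionary lookup stands in for fetching a page and extracting its links.

```python
def crawl_order(graph, seed, depth_first=True):
    """Return pages reachable from seed in the order they are visited.

    graph maps a page name to the list of pages it links to.
    """
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        # pop() takes the newest entry (stack, depth-first);
        # pop(0) takes the oldest entry (queue, breadth-first).
        link = tocrawl.pop() if depth_first else tocrawl.pop(0)
        if link not in crawled:
            tocrawl = tocrawl + graph.get(link, [])
            crawled.append(link)
    return crawled


# Hypothetical site: index links to a and b, a to a1 and a2, b to b1.
site = {
    'index': ['a', 'b'],
    'a': ['a1', 'a2'],
    'b': ['b1'],
}
print(crawl_order(site, 'index', depth_first=True))
# ['index', 'b', 'b1', 'a', 'a2', 'a1']
print(crawl_order(site, 'index', depth_first=False))
# ['index', 'a', 'b', 'a1', 'a2', 'b1']
```

With pop() the crawler follows b all the way down to b1 before it ever looks at a. With pop(0) it finishes every page one click from index before going deeper. On a long list pop(0) has to shift every remaining element, so collections.deque with popleft() is the usual choice for a queue.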

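Below is a breadth-first version of the crawler. Only crawl_web() needs to change: add a next_tocrawl variable that records the links found on the current level, which become the pages of the next level: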

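def crawl_web(seed):
    tocrawl = [seed]
    crawled = []
    next_tocrawl = []
    while len(tocrawl) > 0:
        link = tocrawl.pop()
        if link not in crawled:
            page = urllib2.urlopen(link).read()
            # Links found now belong to the next level, not the current one.
            next_tocrawl = next_tocrawl + get_all_links(page)
            crawled.append(link)
        if len(tocrawl) == 0:
            # Current level finished: move on to the next level.
            tocrawl, next_tocrawl = next_tocrawl, []
    return crawled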
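The level-swapping idea can be exercised without a network by passing in the function that returns a page's links. The following Python 3 sketch is one possible variation, not the course's code: crawl_by_level, the get_links parameter, the max_depth limit, and the site dictionary are all assumptions added here. It records the depth at which each page is first reached, which makes the level-by-level order visible.

```python
def crawl_by_level(seed, get_links, max_depth=None):
    """Breadth-first crawl; return {page: depth at which it was reached}.

    get_links(page) must return the list of links on that page.
    max_depth (hypothetical option) stops the crawl below that level.
    """
    tocrawl = [seed]
    next_tocrawl = []
    depths = {}
    depth = 0
    while tocrawl and (max_depth is None or depth <= max_depth):
        link = tocrawl.pop()
        if link not in depths:
            depths[link] = depth
            next_tocrawl = next_tocrawl + get_links(link)
        if not tocrawl:
            # Current level exhausted: continue with the links it produced.
            tocrawl, next_tocrawl = next_tocrawl, []
            depth += 1
    return depths


# Hypothetical site; b links back to index to show that repeats are skipped.
site = {'index': ['a', 'b'], 'a': ['a1', 'a2'], 'b': ['b1', 'index']}


def links_of(page):
    """Stand-in for fetching a page and extracting its links."""
    return site.get(page, [])


print(crawl_by_level('index', links_of))
print(crawl_by_level('index', links_of, max_depth=1))
```

Within one level the pages are still taken with pop(), so their order inside the level is reversed, but no page of depth 2 is visited before every page of depth 1 has been handled.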

