Scraping NetEase Comments with Python

The comments scraped here belong to a news story about the recent severe air pollution in North China.

1. First, get the JSON file. I used the 360 browser (Chrome is apparently a better choice, but mine was having problems).

2. The hot posts and the latest posts (最新跟帖) are served from different URLs, so both kinds of URL have to be scraped.

3. Process the string. The format matters a great deal when json.loads() decodes a string into Python objects; here the result is a dict.

Strip the beginning and the end so that only one large dict remains, and remove the superfluous information inside it.

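Beginning: the prefix 'var replydata=' (hot posts) or 'var newpostlist=' (latest posts), which the code in step 5 removes with re.sub().

End: the trailing characters after the JSON text, one for the hot posts and two for the latest posts, which the code in step 5 slices off.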

At this point you can run print post.keys() to list the top-level keys of the dict.
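
The unwrapping described in this step can be sketched as follows. This is a minimal illustration written for Python 3 rather than the Python 2 used in step 5; the helper name unwrap_js_json and the sample payload are hypothetical, made up to mimic a `var name={...};` response, and were not taken from the NetEase site.

```python
import json
import re


def unwrap_js_json(text, prefix_pattern, tail_len):
    """Strip a leading `var x=` assignment and trailing characters, then decode."""
    body = re.sub(prefix_pattern, '', text)
    # Guard against tail_len == 0, because body[:-0] would be an empty string
    if tail_len:
        body = body[:-tail_len]
    return json.loads(body)


# Made-up payload shaped like a JavaScript assignment wrapping a JSON object
sample = 'var replydata={"hotposts": [], "newposts": []};'
post = unwrap_js_json(sample, r'^var replydata=', 1)
print(type(post).__name__)   # dict
print(sorted(post.keys()))   # ['hotposts', 'newposts']
```

If json.loads() raises ValueError, the usual cause is that some of the wrapper is still attached, so printing the first and last few characters of the cleaned string is a quick check.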

Next, read the values stored under these two keys to get the information for each post.
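
As a rough illustration of walking those values, the sketch below formats one record per post. The nesting (a list of dicts whose '1' entry holds 'f', 'ip' and 'b') is what the script in step 5 reads; treating 'f' as the poster label and 'b' as the comment body is my reading of how that script writes them out, and the function name format_posts and the sample data are hypothetical.

```python
def format_posts(post, key):
    """Return one text record per post stored under post[key]."""
    records = []
    for allvalue in post[key]:
        # The script in step 5 reads only the entry keyed '1' of each post
        first = allvalue['1']
        records.append('%s(ip:%s)\n---%s\n' % (first['f'], first['ip'], first['b']))
    return records


# Made-up data with the same nesting the script expects
post = {
    'hotposts': [
        {'1': {'f': 'user A', 'ip': '10.0.*.*', 'b': 'first comment'}},
        {'1': {'f': 'user B', 'ip': '10.1.*.*', 'b': 'second comment'}},
    ]
}
for record in format_posts(post, 'hotposts'):
    print(record)
```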

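4. Error: socket.error: [Errno 10054]. Looking it up, this means the remote host reset the connection, which happens when too many requests arrive in a short time. Adding a time.sleep() call fixed it.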
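
The fix used in this post is a fixed pause between requests. The sketch below goes one step further and retries after a pause when the connection fails, which is my own addition and not something the post does. It is written for Python 3, where socket.error is an alias of OSError; the names fetch_with_retry and flaky_fetch and the example URL are hypothetical.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url); on a connection error, wait and try again."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except OSError:
            # ConnectionResetError (errno 10054 on Windows) is a subclass of OSError
            if attempt == retries:
                raise
            sleep(delay)


# Stand-in for a real download function: fails twice, then succeeds
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionResetError(10054, 'connection reset by peer')
    return 'ok'


# sleep is replaced with a no-op so the demonstration finishes immediately
result = fetch_with_retry(flaky_fetch, 'http://example.invalid/', sleep=lambda s: None)
print(result, len(calls))   # ok 3
```

Passing the sleep function as a parameter keeps the helper easy to exercise without waiting; in a real run the default time.sleep applies.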

5. The code is as follows:

#coding=utf-8
# Python 2 script: urllib2 and the print statement do not exist in Python 3.
import urllib2
import json
import re
import time


class wepl:
    def __init__(self):
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Send a browser-like User-Agent with every request
        self.headers = {'User-Agent': self.user_agent}
        # URL of the hot-posts data file (not given here)
        self.url1 = ''

    def getpageindex(self, pageindex):
        # URL of one page of latest posts (the base URL is not given here)
        url2 = '' + str(pageindex) + '.html'
        return url2

    def gethtml(self, url):
        try:
            request = urllib2.Request(url, headers=self.headers)
            response = urllib2.urlopen(request)
            html = response.read()
            return html
        except urllib2.URLError as e:
            if hasattr(e, 'reason'):
                print u'loading error', e.reason
            return None

    def getpost(self):
        for i in range(1, 10):
            # The two kinds of URL are handled separately.
            # Each prefix must match the response text exactly, including letter case.
            if i == 1:
                html = self.gethtml(self.url1)
                prefix, tail, key = '^var replydata=', 1, 'hotposts'
            else:
                html = self.gethtml(self.getpageindex(i))
                prefix, tail, key = '^var newpostlist=', 2, 'newposts'
            if html is None:
                # The request failed, so there is nothing to parse for this page
                continue
            # Remove the JavaScript assignment at the start and the trailing characters
            data1 = re.sub(prefix, '', html)
            data2 = data1[:-tail]
            # Strip leftover HTML tags from the comment text.
            # Assumed pattern: the three re.sub() expressions originally used
            # at this point are not recoverable from this post.
            data3 = re.sub(r'<[^>]*>', '', data2)
            # Decode the JSON text into Python objects
            post = json.loads(data3)
            with open('pl3.txt', 'a+') as fd:
                for allvalue in post[key]:
                    fd.write(allvalue['1']['f'].encode('utf-8') + '(' + 'ip:')
                    fd.write(allvalue['1']['ip'].encode('utf-8') + ')' + '\n' + '---')
                    fd.write(allvalue['1']['b'].encode('utf-8') + '\n')
            # Pause so that the connection is not reset
            time.sleep(2)


spider = wepl()
spider.getpost()
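
The script above targets Python 2. For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, a download helper along the same lines might look like the sketch below. It is an assumption-laden port, not the author's code: the timeout parameter is my addition, and the data: URL in the demonstration only stands in for a real page so that the example runs without network access.

```python
import urllib.error
import urllib.request

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def gethtml(url, timeout=10):
    """Fetch url and return the response body as bytes, or None on failure."""
    request = urllib.request.Request(url, headers={'User-Agent': USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.URLError as e:
        print('loading error', e.reason)
        return None


# A data: URL is handled locally by urllib, so no request leaves the machine
print(gethtml('data:text/plain,hello'))
```

In Python 3, read() returns bytes, so the body has to be decoded to text before the re.sub() and json.loads() steps. The encoding of the NetEase response is not stated in this post, so it would need to be checked against the response headers.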
