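Web Crawler Final Project

1. Pick a topic that interests you.

2. Write a crawler in Python and scrape data on that topic from the web.

3. Run text analysis on the scraped data and generate a word cloud.

4. Explain and interpret the results of the text analysis.

5. Write a complete blog post describing the implementation, the problems encountered and how they were solved, and the reasoning and conclusions of the data analysis.

6. Finally, submit all of the scraped data together with the crawler and data analysis source code.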

For this assignment I decided to crawl the IT section of the NetEase News technology channel. Here is the channel's home page.

First, open the browser's developer tools (F12 or Ctrl+Shift+I) and locate the structure of the news list we want to crawl.

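From this we can see that the news list is stored in the element with class .newslist, and each news link is stored in the <a> tag inside an element with class .titlebar.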
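As a rough illustration of that structure, the sketch below collects the links found inside .titlebar elements. It uses only the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; the sample markup, the example.com URLs, and the TitleLinkParser class are all made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser

# Made-up markup that mimics the structure described above
SAMPLE_HTML = """
<div class="newslist">
  <div class="titlebar"><h3><a href="https://example.com/news/1.html">First</a></h3></div>
  <div class="titlebar"><h3><a href="https://example.com/news/2.html">Second</a></h3></div>
  <a href="https://example.com/other.html">Not a headline</a>
</div>
"""


class TitleLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside elements with class "titlebar"."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the current .titlebar element (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if self.depth:
            # Already inside a .titlebar element: go one level deeper
            self.depth += 1
            if tag == 'a' and attrs.get('href'):
                self.links.append(attrs['href'])
        elif 'titlebar' in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1


parser = TitleLinkParser()
parser.feed(SAMPLE_HTML)
for link in parser.links:
    print(link)
```

The depth counter is a simplification that only works for well-nested markup without void tags such as <br> inside the headline; a CSS selector such as '.titlebar a' in BeautifulSoup expresses the same idea more robustly.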

Next, open one of the news pages and analyze its structure, again using the developer tools.

News detail page

News page structure

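So the news content is stored in the element with class .post_content_main; within it, the news metadata is stored in the element with class .post_time_source, and the title is stored in the <h1> tag.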
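The metadata line in .post_time_source carries both the publication time and the source, so it has to be split with regular expressions. Here is a minimal sketch of that step, assuming an info line shaped like the sample string below; the sample text, the "来源" label, and the parse_info helper are assumptions for illustration, not values copied from the real page.

```python
import re
from datetime import datetime

# Hypothetical info line, shaped like the text of a .post_time_source element
info = '2018-04-25 10:30:00 来源: 网易科技报道'


def parse_info(text):
    """Return (published_time, source) parsed from an info line, or None for missing parts."""
    time_match = re.search(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})', text)
    # Assumption: the source name follows a "来源" label and a colon
    source_match = re.search(r'来源[:：]\s*(\S+)', text)
    published = (datetime.strptime(time_match.group(1), '%Y-%m-%d %H:%M:%S')
                 if time_match else None)
    source = source_match.group(1) if source_match else None
    return published, source


published, source = parse_info(info)
print(published, source)
```

Checking each match before calling group() avoids an AttributeError on pages where the info line is missing or formatted differently.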

The full code is as follows:

import requests, re, jieba, pandas
from bs4 import BeautifulSoup
from datetime import datetime
from wordcloud import WordCloud
import matplotlib.pyplot as plt

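# Fetch the details of a single news article
def getnewsdetail(newsurl):
    res = requests.get(newsurl)
    res.encoding = 'gb2312'
    soupd = BeautifulSoup(res.text, 'html.parser')
    info = soupd.select('.post_time_source')[0].text  # line holding the time and the source
    detail = {
        # The title sits in the <h1> inside .post_content_main
        'title': soupd.select('.post_content_main h1')[0].text,
        'time': datetime.strptime(
            re.search(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})', info).group(1),
            '%Y-%m-%d %H:%M:%S'),
        # Assumption: the source name follows a "来源:" label in the info line
        'source': re.search('来源:(.*)', info).group(1),
        'content': soupd.select('#endtext')[0].text,
    }
    return detail
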
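# Segment the news text with jieba and extract the keywords for the word cloud
def getkeywords():
    content = open('news.txt', 'r', encoding='utf-8').read()
    # Keep only the Chinese characters (dropping punctuation), join them back
    # into one string, segment it, and convert the words to a set
    wordset = set(jieba.lcut(''.join(re.findall('[\u4e00-\u9fa5]', content))))
    worddict = {}
    deletelist, keywords = [], []
    for i in wordset:
        worddict[i] = content.count(i)  # build the word-frequency dictionary
    for i in worddict.keys():
        if len(i) < 2:
            deletelist.append(i)  # collect meaningless single-character words
    for i in deletelist:
        del worddict[i]  # remove the meaningless words from the dictionary
    dictlist = list(worddict.items())
    dictlist.sort(key=lambda item: item[1], reverse=True)  # most frequent first
    for word, count in dictlist:
        keywords.append(word)
    writekeyword(keywords)
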
# Append the news content to a file
def writenews(pagedetail):
    with open('news.txt', 'a', encoding='utf-8') as f:
        for detail in pagedetail:
            f.write(detail['content'])

# Append the keywords to a file
def writekeyword(keywords):
    with open('keywords.txt', 'a', encoding='utf-8') as f:
        for word in keywords:
            f.write(' ' + word)  # separate words with a space so WordCloud can split them

# Fetch all the news on one list page
def getlistpage(listurl):
    res = requests.get(listurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    pagedetail = []  # details of every news article on this page
    for news in soup.select('#news-flow-content')[0].select('li'):
        newsdetail = getnewsdetail(news.select('a')[0]['href'])  # call getnewsdetail() for the article details
        pagedetail.append(newsdetail)
    return pagedetail

# Render the word cloud from the keyword file
def getwordcloud():
    keywords = open('keywords.txt', 'r', encoding='utf-8').read()  # read the keyword file
    # Generate the word cloud; the font must be one that can render Chinese characters
    wc = WordCloud(font_path=r'c:\windows\fonts\simfang.ttf',
                   background_color='white',
                   max_words=100).generate(keywords)
    wc.to_file('kwords.png')
    plt.imshow(wc)
    plt.axis('off')
    plt.show()

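pagedetail = getlistpage('')  # fetch the news on the first page
writenews(pagedetail)
for i in range(2, 21):  # the NetEase news channel only serves 20 pages of news, so the limit is hard-coded
    listurl = '' % i  # fill in the page number; the URL format uses a two-digit page number
    pagedetail = getlistpage(listurl)
    writenews(pagedetail)
getkeywords()  # extract the keywords and write them to a file
getwordcloud()  # read the keywords from the file and render the word cloud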
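
The paging loop relies on formatting the page number as a two-digit string. A minimal sketch of that step follows; the PAGE_TEMPLATE address is a hypothetical placeholder on example.com, since the real list-page URL is not shown in this post, and page_urls is a helper introduced only for this example.

```python
# Hypothetical URL template; the real list-page address is not given in the post.
# %02d pads the page number with a leading zero, e.g. 2 -> "02".
PAGE_TEMPLATE = 'https://example.com/tech/list_%02d.html'


def page_urls(first, last):
    """Build the list-page URLs for pages first..last, inclusive."""
    # range() excludes its end value, hence last + 1
    return [PAGE_TEMPLATE % n for n in range(first, last + 1)]


urls = page_urls(2, 20)
print(urls[0])
print(urls[-1])
print(len(urls))
```

Because range() stops before its second argument, covering pages 2 through 20 needs range(2, 21) rather than range(2, 20).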

The generated word cloud

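From this image we can see that the recent hot topics in technology news include China and the US, technology companies, and so on. It suggests that, in both China and the US, the most important factor for a company in the IT industry is its technology, followed by its products and industries such as chips; these are what allow it to capture a share of the IT market.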
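
Reading hot topics off the word cloud amounts to looking at the most frequent words. A small sketch of that ranking with collections.Counter is shown below; it counts already-segmented tokens rather than substring occurrences as the script's content.count() does, and the token list and the top_keywords helper are made up for illustration, not real crawl output.

```python
from collections import Counter

# Made-up tokens standing in for the output of jieba segmentation
tokens = ['技术', '公司', '的', '技术', '中美', '芯片', '公司', '技术', '的']


def top_keywords(words, n=3, min_len=2):
    """Return the n most frequent words, skipping words shorter than min_len."""
    counts = Counter(w for w in words if len(w) >= min_len)
    return counts.most_common(n)


for word, count in top_keywords(tokens):
    print(word, count)
```

Filtering by length drops single-character words such as particles, which mirrors the single-character cleanup step in the script.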

