Scraping a Movie's Popular Reviews with Python and Generating a High-Frequency Word Cloud

2021-09-27 11:34:44 · 4,154 characters · 2,885 reads

Goal: for a given movie, scrape the high-frequency words from its popular reviews and generate a word cloud.

Breaking the goal down:

1. Scrape the review content, keeping only the text.

2. Save the review text to a local txt file for later segmentation.

3. Segment the text into words.

4. Generate the word cloud.

Take a movie: this is its popular-review list, reviews.

First, get the detail-page URL of each popular review:

import requests
from bs4 import BeautifulSoup

for i in range(5):
    # 'reviews?start=' is the relative URL as given in the original; prepend
    # the movie's review-list base URL before running
    allurl = 'reviews?start=' + str(i * 20)
    res = requests.get(allurl)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find('div', class_="article").find('div', class_="review-list").find_all(class_='main-bd')
    for item in items:
        comment_url = item.find('a')['href']
        print(comment_url)

This yields the detail-page URLs of the first 100 popular reviews (5 pages × 20 reviews).
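The selector chain above can be checked offline against a tiny HTML snippet that mimics the review-list markup. The snippet below is invented for illustration; only the class names match the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the review-list page (made-up content, real class names)
html = '''
<div class="article">
  <div class="review-list">
    <div class="main-bd"><a href="https://example.com/review/1/">review one</a></div>
    <div class="main-bd"><a href="https://example.com/review/2/">review two</a></div>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Same navigation as in the scraper: article div -> review-list div -> main-bd items
items = soup.find('div', class_="article").find('div', class_="review-list").find_all(class_='main-bd')
urls = [item.find('a')['href'] for item in items]
print(urls)  # ['https://example.com/review/1/', 'https://example.com/review/2/']
```

Testing the chain on a fixed snippet like this makes it easy to tell a selector bug from a network or page-layout problem.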

For each review, scrape the text content:

import requests
from bs4 import BeautifulSoup

for i in range(5):
    allurl = 'reviews?start=' + str(i * 20)
    res = requests.get(allurl)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find('div', class_="article").find('div', class_="review-list").find_all(class_='main-bd')
    for item in items:
        comment_url = item.find('a')['href']
        # print(comment_url)
        res2 = requests.get(comment_url)
        html2 = res2.text
        soup2 = BeautifulSoup(html2, 'html.parser')
        items2 = soup2.find('div', class_="article").find('div', id="link-report").find_all('p')
        for item2 in items2:
            print(item2.text)
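The detail-page half of the chain can be checked offline the same way: the review body sits in a div with id `link-report`, and its paragraphs come out as plain text. Again the markup below is invented; only the class name and id are real:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a review detail page
html = '''
<div class="article">
  <div id="link-report">
    <p>first paragraph</p>
    <p>second paragraph</p>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
paras = soup.find('div', class_="article").find('div', id="link-report").find_all('p')
text = '\n'.join(p.text for p in paras)
print(text)  # first paragraph / second paragraph, one per line
```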

Write the text scraped above into a txt file:

import requests
from bs4 import BeautifulSoup

comments = open('comments.txt', 'w+', encoding='utf-8')
for i in range(5):
    allurl = 'reviews?start=' + str(i * 20)
    res = requests.get(allurl)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find('div', class_="article").find('div', class_="review-list").find_all(class_='main-bd')
    for item in items:
        comment_url = item.find('a')['href']
        # print(comment_url)
        res2 = requests.get(comment_url)
        html2 = res2.text
        soup2 = BeautifulSoup(html2, 'html.parser')
        items2 = soup2.find('div', class_="article").find('div', id="link-report").find_all('p')
        for item2 in items2:
            # print(item2.text)
            comments.writelines(item2.text)
comments.close()

Remember to close the file after writing.
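A `with` block closes the file automatically even if the scrape raises an exception midway, so the close cannot be forgotten. A minimal sketch of the same write-then-read round trip (the two strings are made-up stand-ins for review paragraphs):

```python
# Writing inside a with-block: the file is closed automatically on exit
with open('comments.txt', 'w+', encoding='utf-8') as comments:
    comments.writelines(['第一段熱評', '第二段熱評'])

# Reading it back the same way
with open('comments.txt', 'r', encoding='utf-8') as f:
    content = f.read()
print(content)  # 第一段熱評第二段熱評
```

Note that `writelines` does not add newlines between items, which is why the paragraphs run together; the scraper above has the same behavior.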

Use the jieba library to segment the text:

import jieba

f = open('comments.txt', 'r', encoding='utf-8')
t = f.read()
f.close()
ls = jieba.lcut(t)    # segment the text into a list of words
txt = ' '.join(ls)    # join with spaces so wordcloud can split on them
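jieba's raw output usually contains whitespace tokens and single-character function words that clutter the cloud. An optional cleanup pass in pure Python (the token list below is a hand-written stand-in for `ls = jieba.lcut(t)`):

```python
# Hypothetical segmentation result standing in for jieba.lcut(t)
ls = ['這', '部', '電影', '非常', '好看', ',', ' ', '劇情', '緊湊', '。']

# Keep only tokens at least two characters long that are not just whitespace
clean = [w for w in ls if len(w) >= 2 and w.strip()]
txt = ' '.join(clean)
print(txt)  # 電影 非常 好看 劇情 緊湊
```

Dropping one-character tokens is a blunt but effective filter for Chinese text, since most single characters left over after segmentation are particles or punctuation.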

Then use the wordcloud library to generate the word cloud:

import jieba, wordcloud

f = open('comments.txt', 'r', encoding='utf-8')
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = ' '.join(ls)
# font_path must point to a font with CJK glyphs (msyh.ttc is Microsoft YaHei),
# otherwise Chinese words render as empty boxes
w = wordcloud.WordCloud(width=800, height=600, background_color='white',
                        font_path='msyh.ttc', max_words=100)
w.generate(txt)
w.to_file('豆瓣某電影熱評.png')
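The title promises high-frequency words; WordCloud computes the frequencies internally, but they can also be inspected directly with `collections.Counter` (the token list is a made-up stand-in for jieba's output):

```python
from collections import Counter

# Hypothetical segmented tokens standing in for jieba.lcut(...)
words = ['劇情', '演技', '劇情', '配樂', '劇情', '演技']

freq = Counter(words)
print(freq.most_common(2))  # [('劇情', 3), ('演技', 2)]
```

Printing `most_common` before generating the image is a quick sanity check that the segmentation produced sensible words.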

Finally, put everything together:

import requests, jieba, wordcloud
from bs4 import BeautifulSoup

comments = open('comments.txt', 'w+', encoding='utf-8')
for i in range(5):
    allurl = 'reviews?start=' + str(i * 20)
    res = requests.get(allurl)
    html = res.text
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find('div', class_="article").find('div', class_="review-list").find_all(class_='main-bd')
    for item in items:
        comment_url = item.find('a')['href']
        res2 = requests.get(comment_url)
        html2 = res2.text
        soup2 = BeautifulSoup(html2, 'html.parser')
        items2 = soup2.find('div', class_="article").find('div', id="link-report").find_all('p')
        for item2 in items2:
            comments.writelines(item2.text)
comments.close()

f = open('comments.txt', 'r', encoding='utf-8')
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = ' '.join(ls)
w = wordcloud.WordCloud(width=800, height=600, background_color='white',
                        font_path='msyh.ttc', max_words=100)
w.generate(txt)
w.to_file('豆瓣某電影熱評.png')

The final word cloud:

That's all!
