python123: Word Frequency Analysis of Romance of the Three Kingdoms (三國演義)

2022-09-24 05:48:12

Getting the text: threekingdoms.txt (三國演義.txt):

Because the text was copied and pasted into a .txt file, the first run failed with an encoding error: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte.

Opening the file in a text editor and saving it again with UTF-8 encoding fixes the problem.
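The same re-encoding can also be done in Python instead of a text editor. The following is a minimal sketch, assuming the pasted file was saved in a legacy Chinese encoding such as GBK (the 0xc8 byte is consistent with that, but it is only a guess). The function name convert_to_utf8 and the file names in the usage comment are hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src, dst, source_encoding="gb18030"):
    """Re-encode a text file as UTF-8 and return the number of characters.

    source_encoding is an assumption: gb18030 is a superset of GBK and
    GB2312, so it covers most Chinese text saved by Windows editors.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)


# Hypothetical usage (file names are placeholders):
# convert_to_utf8("threekingdoms_gbk.txt", "threekingdoms.txt")
```

If the source encoding is something else, read_text raises UnicodeDecodeError or produces garbled characters, so it is worth opening the result once to check it.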

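import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()  # read the whole file
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}  # empty dict: word -> number of occurrences
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    else:
        # add 1 if the word is already in the dict, otherwise start at 1
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())  # convert to a list of (word, count) pairs
# sort() is ascending by default; reverse=True gives descending order
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))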
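The count-then-sort step can also be written with collections.Counter from the standard library. This is an alternative sketch, not the approach used in the script above. The helper name top_words and the short sample list are made up for illustration; in the real script the tokens would come from jieba.lcut(txt).

```python
from collections import Counter


def top_words(tokens, n=15, min_len=2):
    """Return the n most frequent tokens with at least min_len characters."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)  # list of (word, count), highest count first


# Stand-in token list (illustrative only, not real output from jieba).
sample = ["曹操", "曰", "孔明", "曹操", "之", "孔明", "曹操"]
for word, count in top_words(sample, n=2):
    print("{0:<10}{1:>5}".format(word, count))
```

most_common also copes with inputs that contain fewer than n distinct words, whereas indexing items[i] in a fixed range would raise IndexError.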

Further optimization based on the results:

# calthreekingdomsv1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# words to drop from the ranking; empty here, add frequent non-name words as needed
excludes = set()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    # merge different names for the same character into one key
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # no error if the word never appeared
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # highest count first
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
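As the list of aliases grows, the elif chain gets long. One possible restructuring, shown here as a sketch rather than as part of the original script, keeps the aliases in a dictionary. The names ALIASES and count_characters are hypothetical; the alias pairs are the ones from the script above.

```python
# Alias -> canonical name, taken from the elif chain in the script above.
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}


def count_characters(tokens, aliases=ALIASES, excludes=()):
    """Count tokens after merging aliases, then drop the excluded words."""
    counts = {}
    for word in tokens:
        if len(word) == 1:  # skip single-character tokens
            continue
        name = aliases.get(word, word)  # fall back to the word itself
        counts[name] = counts.get(name, 0) + 1
    for word in excludes:
        counts.pop(word, None)  # ignore words that never appeared
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)


# Stand-in tokens (illustrative only); real input would be jieba.lcut(txt).
for word, count in count_characters(["玄德", "玄德曰", "丞相", "曰"]):
    print("{0:<10}{1:>5}".format(word, count))
```

Adding a new alias then means adding one dictionary entry instead of another branch.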
