python123: Word-Frequency Analysis of Romance of the Three Kingdoms

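Source text: threekingdoms.txt (三國演義.txt)

Because the text was copied and pasted into a .txt file, the first run failed with an encoding error: 'utf-8' codec can't decode byte 0xc8 in position 0: invalid continuation byte.

Opening the file and saving it again with UTF-8 encoding fixes the problem.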
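If re-saving the file by hand is inconvenient, the same fix can be sketched in code. The byte 0xc8 at position 0 is consistent with a legacy Chinese encoding such as GBK, but that is an assumption, not something the error message confirms. The helper names read_text and convert_to_utf8 and the fallback list of encodings are illustrative only.

```python
def read_text(path, encodings=("utf-8", "gb18030")):
    """Try each encoding in turn and return the decoded text."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError(f"none of {encodings} could decode {path}")


def convert_to_utf8(src, dst):
    """Re-save src as UTF-8 so that open(dst, encoding='utf-8') works later."""
    text = read_text(src)
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)
```

gb18030 is tried second because it is a superset of GBK and GB2312, so one fallback covers the common simplified-Chinese cases.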

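import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()  # read the whole file
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}  # empty dict: word -> number of occurrences
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    else:
        counts[word] = counts.get(word, 0) + 1  # start from 0 if the word is new, then add 1
items = list(counts.items())  # convert to a list of (word, count) pairs
items.sort(key=lambda x: x[1], reverse=True)  # sort() is ascending by default; reverse=True gives descending order
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))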
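The counting loop above can also be written with collections.Counter from the standard library. This is a minimal sketch of an alternative, not the reference solution; the function name top_words and the sample token list are made up for illustration, and in real use the list would come from jieba.lcut(txt).

```python
from collections import Counter


def top_words(words, n=15):
    """Return the n most frequent tokens, ignoring single-character tokens."""
    counts = Counter(w for w in words if len(w) > 1)
    return counts.most_common(n)


# Hypothetical token list standing in for jieba.lcut(txt)
sample = ["曹操", "曰", "孔明", "曹操", "玄德", "孔明", "曹操"]
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

most_common(n) also avoids an IndexError when the text contains fewer than n distinct words, which the range(15) loop does not.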

Further optimization based on the results:

# calthreekingdomsv1.py
import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# Non-name words to drop from the ranking. The entries are missing in the
# source; fill them in from the output of the first run.
excludes = set()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # Merge aliases of the same character into one name
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    counts.pop(word, None)  # unlike del, no KeyError if the word never occurred
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
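As more aliases are merged, the elif chain grows quickly. One possible restructuring, sketched here on the assumption that every alias maps to exactly one canonical name, keeps the mapping in a dictionary. The names ALIASES and count_names are illustrative and not part of the original script; the alias pairs are the ones used above.

```python
ALIASES = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}


def count_names(words, aliases=ALIASES, excludes=()):
    """Count tokens of length >= 2, merging aliases and dropping excluded words."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        name = aliases.get(word, word)  # fall back to the word itself
        counts[name] = counts.get(name, 0) + 1
    for word in excludes:
        counts.pop(word, None)  # no KeyError if the word never occurred
    # Highest count first
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```

A call such as count_names(jieba.lcut(txt), excludes={"將軍"})[:10] would then give the top ten entries; the excluded word here is only an example.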
