Python Example: Text Word-Frequency Counting

2021-09-01 12:54:10

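I have recently been learning Python on MOOC from Professor Song Tian of Beijing Institute of Technology. I have benefited a great deal: his explanations are clear and easy to follow, and I recommend the course to everyone.

I am writing down some notes and annotations here for future reference.

Today's note covers text word-frequency counting, using the text of Hamlet.

Straight to the source code: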

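#calhamletv1.py

def gettext():
    # Read the Hamlet text file and return its normalized contents
    with open(r"e:\hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()  # convert every word in the file to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")  # replace special characters with spaces
    return txt

hamlettxt = gettext()
words = hamlettxt.split()  # split the text on whitespace
counts = {}
for word in words:  # count how often each word occurs and store the result in counts
    counts[word] = counts.get(word, 0) + 1  # get returns 0 when word is not yet a key; see the function notes below
items = list(counts.items())  # convert the dictionary to a list so that it can be sorted
items.sort(key=lambda x: x[1], reverse=True)  # see the function notes below
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))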
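The cleaning step can be tried without hamlet.txt. The following is a minimal sketch on an inline sentence, not the course's code: the function name normalize is hypothetical, and it uses string.punctuation from the standard library in place of the hand-written character list. Note that string.punctuation also contains the apostrophe, which the list in the script does not.

```python
import string

def normalize(text):
    """Lowercase the text and replace ASCII punctuation with spaces."""
    text = text.lower()
    # string.punctuation is an assumed stand-in for the hand-written list
    for ch in string.punctuation:
        text = text.replace(ch, " ")
    return text

sample = "To be, or not to be: that is the question."
words = normalize(sample).split()
print(words)
# ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']
```

Replacing punctuation with a space rather than deleting it keeps words that were joined only by punctuation from being merged.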

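Functions used:

① dict.get(key, default=None): returns the value for the given key; if the key is not in the dictionary, it returns the default value.

② list.sort(key=None, reverse=False): sorts the list in place. key is a function that extracts the value to compare from each element, and reverse=True sorts in descending order. (The cmp parameter exists only in Python 2.)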
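A small sketch of how these two functions work together, using a hard-coded word list chosen only for illustration:

```python
counts = {}
for word in ["to", "be", "or", "not", "to", "be", "to"]:
    # get returns 0 for a word that has not been seen yet,
    # so the first occurrence stores 0 + 1
    counts[word] = counts.get(word, 0) + 1

# get never inserts the key; it only reads
missing = counts.get("hamlet", 0)

items = list(counts.items())  # list of (word, count) tuples
# sort by the count (second tuple element), largest first
items.sort(key=lambda pair: pair[1], reverse=True)
print(items)
# [('to', 3), ('be', 2), ('or', 1), ('not', 1)]
```

Because the sort is stable, words with equal counts keep the order in which they first appeared.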

Romance of the Three Kingdoms text:

#calthreekingdomsv2.py

import jieba

excludes = set()  # fill in the words to exclude from the ranking
with open("threekingdoms.txt", "r", encoding='utf-8') as f:
    txt = f.read()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    elif word == "諸葛亮" or word == "孔明曰":  # merge different names for the same person
        rword = "孔明"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Function notes:

jieba.lcut(s): precise segmentation mode. It returns the segmentation result as a list, with no redundant words.
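Since jieba.lcut returns a plain list of strings, the name-merging step in the script can be tested on any list. The sketch below is one possible variation, not the course's code: it replaces the elif chain with a lookup table. The names aliases and count_names are hypothetical, and the hard-coded words list is only an assumed stand-in for segmentation output; the tokens jieba really produces depend on its dictionary.

```python
# Hypothetical alias table holding the same pairs as the elif chain
aliases = {
    "諸葛亮": "孔明", "孔明曰": "孔明",
    "關公": "關羽", "雲長": "關羽",
    "玄德": "劉備", "玄德曰": "劉備",
    "孟德": "曹操", "丞相": "曹操",
}

def count_names(words):
    """Count words, skipping single characters and merging aliases."""
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        # fall back to the word itself when it has no alias
        name = aliases.get(word, word)
        counts[name] = counts.get(name, 0) + 1
    return counts

# Assumed stand-in for the list that jieba.lcut(txt) would return
words = ["玄德", "曰", "孔明", "諸葛亮", "丞相", "曹操", "玄德曰"]
result = count_names(words)
print(result)
# {'劉備': 2, '孔明': 2, '曹操': 2}
```

With a table, adding another alias means adding one dictionary entry instead of another elif branch.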
