Python Word Frequency Statistics

import re  # regular expression library
import collections  # provides Counter for word frequency counting

with open("text_word_frequency_statistics.txt") as f:
    article = f.read().lower()  # normalize everything to lowercase

pattern = re.compile(r'\t|!|,|\n|\.|:|;|\)|\(|\?|"')
article = re.sub(pattern, ' ', article)  # replace every character matching the regex with ' '
done = article.split(' ')  # split on spaces to get the words
remove = ['the', 'and', 'of', 'a', 'i', 'in', 'you', 'my', 'he', 'his', ',', 's', '']  # words to drop
over = []
for i in done:
    if i not in remove and i != " ":
        over.append(i)
counts = collections.Counter(over)  # count word frequencies; this returns a Counter object
freq = dict(counts)
# b = list(zip(freq.keys(), freq.values()))  # the zip-based approach (needs `import operator`)
# freq = list(sorted(b, key=operator.itemgetter(1), reverse=True))
# The order inside the lambda's tuple is the sort priority: each later key
# only breaks ties in the earlier ones, so item[0] is used only when item[1] is equal.
freq = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
x = 0
for i in freq:
    print('{}'.format(i[0]), '{}'.format(i[1]))
    x += 1
    if x == 10:  # print the ten most frequent words
        break

Everything I wanted to say is in the comments.
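To make the tie-breaking rule from the comments easier to try out, here is a minimal sketch that wraps the same steps in a function and runs it on an in-memory string instead of a file. The helper name `top_words`, its parameters, and the sample sentence are hypothetical and not part of the original script. It also assumes a simpler tokenization rule (runs of lowercase letters and apostrophes) than the punctuation regex above, so its counts may differ slightly on real text.

```python
import re
from collections import Counter


# Hypothetical helper for illustration; not part of the original script.
def top_words(text, n=10, stopwords=frozenset()):
    """Return the n most frequent words as (word, count) pairs.

    Ties in count are broken alphabetically, using the same
    (-count, word) sort key as the script above.
    """
    # Assumption: a "word" is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    # Negating the count sorts by descending frequency; the word itself
    # is only compared when two counts are equal.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


sample = "The cat and the dog. The dog barks; a cat naps!"
print(top_words(sample, n=3, stopwords={"the", "and", "a"}))
# → [('cat', 2), ('dog', 2), ('barks', 1)]
```

Here "cat" and "dog" both occur twice, so the second element of the key decides their order. `Counter.most_common()` would also sort by count, but it leaves tied words in first-seen order rather than alphabetical order, which is why the explicit key is used.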

