Counting Word Frequency in an Article (Python Implementation)

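Counting the words that repeat in an article is an important step in text analysis: word frequencies give a rough picture of what the article is about.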

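2. Create an empty dictionary for the word counts

3. Count word frequencies for each line of the text

4. Get the (word, count) pairs from the dictionary and put them in a list

5. Swap the two elements of each pair in the list, then sort

6. Print the result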

2. English articles downloaded from the internet may not be UTF-8 encoded, and some contain formatting characters that can cause a decoding error (UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence).
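One way to work around this kind of decoding error is to read the file as raw bytes and try a few candidate encodings in turn. The sketch below is only an illustration: the helper name read_text and the list of encodings are assumptions, not part of the original script, and the right candidates depend on where the file came from.

```python
def read_text(path, encodings=("utf-8", "gbk", "utf-16")):
    """Return the file's text, trying each candidate encoding in turn.

    read_text and the default encoding list are hypothetical choices
    made for this example.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one.
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes.
    return raw.decode("utf-8", errors="replace")
```

A file that begins with byte 0xff is often UTF-16 with a byte order mark, which is why utf-16 is among the candidates here.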

Implementation:

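from string import punctuation


# Count word frequencies for one line of text
def processline(line, wordcounts):
    # Replace punctuation marks with spaces
    line = replacepunctuations(line)
    words = line.split()
    for word in words:
        if word in wordcounts:
            wordcounts[word] += 1
        else:
            wordcounts[word] = 1


def replacepunctuations(line):
    for ch in line:
        # Use the punctuation set from the string module;
        # replace each punctuation mark with a space
        if ch in punctuation:
            line = line.replace(ch, " ")
    return line


def main():
    # Assumption: the file is UTF-8; errors="ignore" skips bytes that cannot be decoded
    infile = open("englishi.txt", 'r', encoding="utf-8", errors="ignore")
    count = 10
    # Create an empty dictionary for the word counts
    wordcounts = {}
    for line in infile:
        # line.lower() converts uppercase to lowercase so that
        # "The" and "the" are counted as the same word
        processline(line.lower(), wordcounts)
    # Get the (word, count) pairs from the dictionary
    pairs = list(wordcounts.items())
    # Swap each pair to [count, word], then sort by count
    items = [[x, y] for (y, x) in pairs]
    items.sort()
    # Guard against files with fewer than `count` distinct words
    count = min(count, len(items))
    # sort() orders from smallest to largest, so walk backwards from the last item
    for i in range(len(items) - 1, len(items) - count - 1, -1):
        print(items[i][1] + "\t" + str(items[i][0]))
    infile.close()


if __name__ == '__main__':
    main()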
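For comparison, the same steps (lowercase, replace punctuation with spaces, split, count, take the most frequent words) can be sketched more compactly with collections.Counter from the standard library. This is an alternative, not the approach used in the script above; the function name top_words and the sample sentence are made up for the example.

```python
from collections import Counter
from string import punctuation

# Translation table that maps every punctuation mark to a space,
# mirroring the replace-with-space step of the script above.
_PUNCT_TO_SPACE = str.maketrans(punctuation, " " * len(punctuation))


def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs.

    top_words is a hypothetical helper written for this example.
    """
    words = text.lower().translate(_PUNCT_TO_SPACE).split()
    return Counter(words).most_common(n)


sample = "The cat sat. The cat ran! The end."
for word, freq in top_words(sample, 2):
    print(word + "\t" + str(freq))
```

Counter.most_common returns the pairs already ordered from most to least frequent, so the swap-and-sort step is not needed in this version.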
