Word frequency statistics for Chinese and English text in Python

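English and Chinese text differ in two respects when counting: English has upper- and lower-case letters, and the unit being counted is different (in Chinese the unit is a word, not a single character).

Counting English words: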
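A small sketch of that difference, using made-up sample strings rather than the texts analysed in this post: once case is normalised, splitting on whitespace recovers English words, but a Chinese sentence has no spaces, so the same call returns the whole sentence as one item. That is why a segmenter such as jieba is needed for Chinese.

```python
# Sample strings are illustrative only.
english = "To be or not to be"
chinese = "我喜歡學習程式設計"

# English: lower-case first so "To" and "to" count as the same word,
# then split on whitespace.
english_words = english.lower().split()
print(english_words)

# Chinese: there are no spaces between words, so split() cannot find
# word boundaries and returns the sentence as a single item.
chinese_words = chinese.split()
print(len(chinese_words))
```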

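def gettext():
    # Read the whole file and normalise it for counting.
    txt = open('halmet.txt', 'r').read()
    txt = txt.lower()  # convert all English letters to lower case
    for ch in '!@#$%^&*()<>?":{}|':
        txt = txt.replace(ch, ' ')  # replace special characters with spaces
    return txt

halmettxt = gettext()
words = halmettxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    # word left-aligned, count right-aligned
    print('{0:<10}{1:>5}'.format(word, count))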
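The same count can be written more compactly with collections.Counter from the standard library. The sketch below is one possible variant, not the code above: the function name top_words and the sample sentence are made up for illustration, and it takes a string instead of reading a file so it can be tried without halmet.txt.

```python
from collections import Counter

# Same special characters as in the loop above.
PUNCTUATION = '!@#$%^&*()<>?":{}|'

def top_words(text, n=10):
    """Return the n most frequent words in text as (word, count) pairs."""
    text = text.lower()
    for ch in PUNCTUATION:
        text = text.replace(ch, " ")
    # Counter does the dictionary bookkeeping; most_common() does the sort.
    return Counter(text.split()).most_common(n)

# Hypothetical sample text.
sample = "To be or not to be: that is the question!"
for word, count in top_words(sample, 2):
    print("{0:<10}{1:>5}".format(word, count))
```

Characters that are not in the list, such as commas and full stops, stay attached to the neighbouring word, so the list may need extending for a real text.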

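The jieba library: Chinese word segmentation

jieba.lcut(s): precise mode, returns a list of words with no overlap

jieba.lcut(s, cut_all=True): full mode, returns every possible word, with overlap

jieba.lcut_for_search(s): search engine mode, precise mode with long words split further

Counting Chinese words (using the jieba library):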

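import jieba

# encoding='utf-8' assumes the file is saved as UTF-8
f = open('紅樓夢.txt', 'r', encoding='utf-8')
txt = f.read()
f.close()
words = jieba.lcut(txt)  # segment the text into a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    # word left-aligned, count right-aligned
    print('{0:<10}{1:>5}'.format(word, count))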

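If the goal is to find how often the characters' names occur, the top of the ranking will contain many unrelated words. An excludes set can be used to refine the code further.

The added parts are marked in the code below.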

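import jieba

# added: words to drop from the ranking; fill in as needed
excludes = set()
# encoding='utf-8' assumes the file is saved as UTF-8
f = open('紅樓夢.txt', 'r', encoding='utf-8')
txt = f.read()
f.close()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
# added: remove the excluded words before sorting
for word in excludes:
    counts.pop(word, None)  # unlike del, does not fail if the word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # most frequent first
for i in range(10):
    word, count = items[i]
    # word left-aligned, count right-aligned
    print('{0:<10}{1:>5}'.format(word, count))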
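To see the effect of the exclusion step without jieba or the novel, the sketch below runs the same logic on a short hand-written token list standing in for the output of jieba.lcut. The tokens, the contents of excludes and the helper name count_tokens are all invented for illustration.

```python
def count_tokens(tokens, excludes=()):
    """Count multi-character tokens, then drop the excluded ones."""
    counts = {}
    for token in tokens:
        if len(token) == 1:      # skip single characters and punctuation
            continue
        counts[token] = counts.get(token, 0) + 1
    for word in excludes:
        counts.pop(word, None)   # pop() does not fail if the word is absent
    # Most frequent first.
    return sorted(counts.items(), key=lambda x: x[1], reverse=True)

# Hypothetical tokens, as a segmenter might return them.
tokens = ["寶玉", "說道", "黛玉", "說道", "寶玉", "的", "說道", "，", "什麼"]
excludes = {"說道", "什麼"}

for word, count in count_tokens(tokens, excludes):
    print("{0:<10}{1:>5}".format(word, count))
```

Without the excludes argument, the common verb tops the list; with it, only the two names remain. In practice the set is usually built up step by step: run the count, look at the top results, add the unwanted words, and run it again.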

The results can then be visualised with the wordcloud library.

