Using the jieba library

1. Overview of jieba

jieba is an excellent third-party library for Chinese word segmentation.

2. Installing jieba

(cmd command line) pip install jieba

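3. How jieba segments text

4. Using jieba

4.1 The three segmentation modes of jieba

4.2 Commonly used jieba functions

Word-frequency counting examples:

English text: Hamlet (English edition)

Key points: denoise and normalize the text, then store the word frequencies in a dictionary.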

def gettext():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()  # normalize case
    # replace punctuation with spaces so that split() sees clean words
    for ch in '!"#$%&()*+,-./:;<=>?@{}[\\]^_|~·':
        txt = txt.replace(ch, " ")
    return txt

hamlettxt = gettext()
words = hamlettxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())  # the key-value pairs in the list are tuples
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]  # unpack the tuple that holds each key-value pair
    print("{0} {1}".format(word, count))

Output:

the 1137
and 936
to 728
of 665
a 527
i 515
my 513
in 423
hamlet 407
you 406
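The two key points above (normalize the text, then count with a dictionary) can also be sketched with collections.Counter from the standard library. This is a swapped-in alternative to the hand-written counts.get loop, not the method used in this post. The helper names normalize and top_words and the sample sentence are hypothetical, and string.punctuation is assumed as the punctuation set, which differs slightly from the character list used above.

```python
from collections import Counter
import string


def normalize(text):
    """Lowercase the text and turn ASCII punctuation into spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)


def top_words(text, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    return Counter(normalize(text).split()).most_common(n)


# Hypothetical sample text; a real run would pass the contents of hamlet.txt.
sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0} {1}".format(word, count))
```

str.translate replaces every punctuation character in a single pass, where the loop above calls replace once per character; for a text the size of a play either approach is fast enough.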

Chinese text: Romance of the Three Kingdoms (《三國演義》)

import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
words = jieba.lcut(txt)  # precise mode, returns a list of words
counts = {}
for word in words:
    if len(word) == 1:  # skip single characters and punctuation
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0}. {1} {2}".format(i + 1, word, count))

Output:

1. 曹操 953
2. 孔明 836
3. 將軍 772
4. 卻說 656
5. 玄德 585
6. 關公 510
7. 丞相 491
8. 二人 469
9. 不可 440
10. 荊州 425
11. 玄德曰 390
12. 孔明曰 390
13. 不能 384
14. 如此 378
15. 張飛 358

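A problem that came up along the way:

ValueError: cannot switch from automatic field numbering to manual field specification

In other words, the computer is not clever enough to guess: the format string passed to print must number its fields consistently. The error appears when one format string mixes automatic fields ({}) with numbered fields ({0}, {1}).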
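A minimal sketch of how str.format raises this error, and two ways to avoid it. The sample values are taken from the output above; the variable names are only illustrative.

```python
rank, word, count = 1, "曹操", 953

# Mixing "{}" (automatic numbering) with "{1}" (manual numbering) fails.
try:
    "{}. {1} {2}".format(rank, word, count)
    mixed_failed = False
except ValueError as err:
    mixed_failed = True
    print("ValueError:", err)

# Either number every field...
manual = "{0}. {1} {2}".format(rank, word, count)
# ...or number none of them.
automatic = "{}. {} {}".format(rank, word, count)

print(manual)
print(automatic)
```

Both working forms produce the same string; the numbered form is only required when a field is reused or the arguments appear out of order.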

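The result is not ideal: entries such as 「將軍」, 「卻說」, 「玄德」 and 「孔明曰」 still need handling. Refine the program step by step while debugging, guided by each run's output.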

Optimized version

import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
# high-frequency words from the first run that are not character names
excludes = {"將軍", "卻說", "二人", "不可", "荊州", "不能", "如此"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    # merge the different names of one character into a single key
    elif word == "諸葛亮" or word == "孔明曰":
        rword = "諸葛亮"
    elif word == "關公" or word == "雲長":
        rword = "關羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "劉備"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in excludes:
    del counts[word]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0}. {1} {2}".format(i + 1, word, count))

Output:

1. 曹操 1451
2. 劉備 1252
3. 孔明 836
4. 關羽 784
5. 諸葛亮 547
6. 張飛 358
7. 呂布 300
8. 趙雲 278
9. 孫權 264
10. 司馬懿 221
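The elif chain in the optimized version grows with every new alias. One possible restructuring, given here as a sketch rather than as this post's own code, keeps the aliases in a dictionary. The names ALIASES, EXCLUDES and count_names are hypothetical, and the function takes an already segmented word list (for example the result of jieba.lcut) so that it runs without jieba or the novel's text; the short word list is made-up sample data.

```python
# Hypothetical alias table: each alternative name maps to one canonical name.
ALIASES = {
    "孔明曰": "諸葛亮",
    "關公": "關羽",
    "雲長": "關羽",
    "玄德": "劉備",
    "玄德曰": "劉備",
    "孟德": "曹操",
    "丞相": "曹操",
}
# Words to ignore because they are not character names.
EXCLUDES = {"將軍", "卻說"}


def count_names(words):
    """Count words, merging aliases and skipping excluded or 1-char words."""
    counts = {}
    for word in words:
        if len(word) == 1 or word in EXCLUDES:
            continue
        name = ALIASES.get(word, word)  # fall back to the word itself
        counts[name] = counts.get(name, 0) + 1
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Made-up sample standing in for the output of a segmenter.
sample_words = ["曹操", "丞相", "曰", "玄德", "將軍", "孟德", "玄德曰", "關公"]
for i, (name, count) in enumerate(count_names(sample_words), start=1):
    print("{0}. {1} {2}".format(i, name, count))
```

Skipping excluded words while counting, instead of deleting them afterwards, also avoids a KeyError when an excluded word never occurs in the text.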
