Chinese word segmentation with the Python jieba library

jieba

"Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation module.

Features and parameters:

Installation: pip install jieba

Example:
# encoding=utf-8
import jieba

seg_list = jieba.cut("我來到北京清華大學", cut_all=True)
print("full mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我來到北京清華大學", cut_all=False)
print("default mode: " + "/ ".join(seg_list))  # accurate mode

seg_list = jieba.cut("他來到了網易杭研大廈")  # accurate mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明碩士畢業於中國科學院計算所，後在日本京都大學深造")  # search engine mode
print(", ".join(seg_list))

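Output:

[Full mode]: 我/ 來到/ 北京/ 清華/ 清華大學/ 華大/ 大學

[Accurate mode]: 我/ 來到/ 北京/ 清華大學

[New word recognition]: 他, 來到, 了, 網易, 杭研, 大廈 (here "杭研" is not in the dictionary, but the Viterbi algorithm still recognizes it as a word)

[Search engine mode]: 小明, 碩士, 畢業, 於, 中國, 科學, 學院, 科學院, 中國科學院, 計算, 計算所, 後, 在, 日本, 京都, 大學, 日本京都大學, 深造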
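To make the difference between full mode and accurate mode concrete, here is a toy sketch in pure Python. It is not jieba's implementation: TOY_DICT is a hypothetical seven-word dictionary, and the non-overlapping segmentation uses greedy forward maximum matching as a simple stand-in, whereas jieba chooses its accurate-mode segmentation with word frequencies and dynamic programming. The point is only that full mode lists every dictionary word it can find, overlaps included, while accurate mode yields one non-overlapping cut.

```python
# Toy illustration only; TOY_DICT is a hypothetical dictionary, not jieba's.
TOY_DICT = {"我", "來到", "北京", "清華", "清華大學", "華大", "大學"}


def full_mode(sentence, dictionary):
    """Return every dictionary word in the sentence, ordered by start position, then length."""
    words = []
    for start in range(len(sentence)):
        for end in range(start + 1, len(sentence) + 1):
            piece = sentence[start:end]
            if piece in dictionary:
                words.append(piece)
    return words


def longest_match(sentence, dictionary):
    """Greedy forward maximum matching (a stand-in, not jieba's algorithm)."""
    words = []
    start = 0
    while start < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(sentence), start, -1):
            piece = sentence[start:end]
            if piece in dictionary or end == start + 1:
                words.append(piece)
                start = end
                break
    return words


sentence = "我來到北京清華大學"
print("/ ".join(full_mode(sentence, TOY_DICT)))      # overlapping words
print("/ ".join(longest_match(sentence, TOY_DICT)))  # one non-overlapping cut
```

With this toy dictionary the first line reproduces the overlapping full-mode word list shown above, and the second gives the four-word cut.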

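Keyword extraction example. Create the script with vi extract_tags.py:
import sys
from optparse import OptionParser

import jieba
import jieba.analyse

usage = "usage: python extract_tags.py [file name] -k [top k]"

parser = OptionParser(usage)
parser.add_option("-k", dest="topk")
opt, args = parser.parse_args()

# A file name is required as the first positional argument.
if len(args) < 1:
    print(usage)
    sys.exit(1)

file_name = args[0]

# Default to the 10 highest-ranked keywords when -k is not given.
if opt.topk is None:
    topk = 10
else:
    topk = int(opt.topk)

# Read the file as bytes, as the original script does.
with open(file_name, "rb") as f:
    content = f.read()

# Note the keyword argument is spelled topK (capital K).
tags = jieba.analyse.extract_tags(content, topK=topk)
print(",".join(tags))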

Run (test.txt is the text to be segmented):

python extract_tags.py test.txt -k 20
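jieba.analyse.extract_tags ranks keywords by TF-IDF. The sketch below is a toy, pure-Python illustration of that ranking idea, not jieba's code. The names top_k_tags, TOY_IDF, and default_idf are hypothetical, and the IDF values are invented for the example. It assumes the text has already been segmented into tokens.

```python
from collections import Counter

# Hypothetical IDF values, invented for this illustration.
TOY_IDF = {"分詞": 6.0, "中文": 4.0, "我們": 0.5}


def top_k_tags(tokens, idf, top_k=10, default_idf=3.0):
    """Rank tokens by term frequency times IDF and return the top_k words."""
    # Skip single-character tokens, which rarely make useful keywords.
    counts = Counter(t for t in tokens if len(t.strip()) >= 2)
    total = sum(counts.values())
    if total == 0:
        return []
    scores = {
        word: (count / total) * idf.get(word, default_idf)
        for word, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return ranked[:top_k]


tokens = ["我們", "使用", "中文", "分詞", "分詞", "的", "我們", "我們"]
print(",".join(top_k_tags(tokens, TOY_IDF, top_k=2)))
```

A frequent word with a low IDF, such as "我們" here, ranks below a rarer and more distinctive word, which is why the -k option returns topical terms rather than the most common ones.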

jieba open-source home page:

