Chinese Word Segmentation with jieba (Python): Study Notes on Keywords

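jieba.cut takes the string to segment plus two flags: cut_all, which selects full mode, and HMM, which controls whether the HMM model is used. jieba.cut_for_search takes two arguments: the string to segment and whether to use the HMM model. It segments at a finer granularity and is suited to building inverted indexes for search engines.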

Note: the string to segment may be a GBK string, a UTF-8 string, or unicode.
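Under Python 3 it is usually simplest to decode bytes to str yourself before segmenting, so the encoding is never guessed; the jieba README also advises against passing GBK byte strings directly because they may be mis-decoded as UTF-8. A minimal sketch, assuming the input arrives as GBK-encoded bytes. The helper name `to_text` is hypothetical and not part of jieba:

```python
def to_text(data, encoding="utf-8"):
    """Return `data` as str, decoding bytes with the given encoding."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Stand-in for bytes read from a GBK-encoded file or socket.
raw_gbk = "我來到北京清華大學".encode("gbk")

text = to_text(raw_gbk, encoding="gbk")
print(text)  # a plain str, ready to hand to a segmenter
```

Passing an explicit encoding keeps the failure mode loud: a wrong guess raises UnicodeDecodeError instead of silently producing garbled tokens.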

Both jieba.cut and jieba.cut_for_search return an iterable generator. Iterate over it with a for loop to get each segmented word (unicode), or convert it to a list with list(jieba.cut(...)).
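Because the return value is a generator, it can be consumed only once; a second pass yields nothing. The sketch below illustrates this with a stand-in generator function, `fake_cut` (hypothetical, used so the example runs without jieba installed). The same behaviour applies to any Python generator:

```python
def fake_cut(tokens):
    """Stand-in for a segmenter: yields tokens lazily, like a generator."""
    for token in tokens:
        yield token


seg_gen = fake_cut(["我", "來到", "北京", "清華大學"])
first_pass = "/ ".join(seg_gen)    # consumes the generator
second_pass = "/ ".join(seg_gen)   # generator is already exhausted
print(repr(first_pass))
print(repr(second_pass))

# Materialise into a list when the tokens are needed more than once.
seg_list = list(fake_cut(["我", "來到", "北京", "清華大學"]))
print(len(seg_list), seg_list[-1])
```

jieba also offers jieba.lcut and jieba.lcut_for_search, which return a list directly.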

import jieba

seg_list = jieba.cut("我來到北京清華大學", cut_all=True)
print("full mode:", "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("我來到北京清華大學", cut_all=False)
print("default mode:", "/ ".join(seg_list))  # precise mode

seg_list = jieba.cut("他來到了網易杭研大廈")  # precise mode is the default
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明碩士畢業於中國科學院計算所，後在日本京都大學深造")  # search engine mode
print(", ".join(seg_list))

[Full mode]: 我/ 來到/ 北京/ 清華/ 清華大學/ 華大/ 大學

[Precise mode]: 我/ 來到/ 北京/ 清華大學

[New word recognition]: 他, 來到, 了, 網易, 杭研, 大廈 (here "杭研" is not in the dictionary, yet the Viterbi algorithm still recognized it)

[Search engine mode]: 小明, 碩士, 畢業, 於, 中國, 科學, 學院, 科學院, 中國科學院, 計算, 計算所, 後, 在, 日本, 京都, 大學, 日本京都大學, 深造
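To see why the finer granularity helps search, the token lists above can be fed into a small inverted index that maps each token to the documents containing it. This is only a sketch under stated assumptions: the token lists are copied from the search-engine-mode and full-mode outputs above rather than produced by calling jieba, and `build_inverted_index` and the document ids are hypothetical names chosen for illustration:

```python
from collections import defaultdict

# Token lists copied from the outputs shown above (no jieba call here).
docs = {
    1: ["小明", "碩士", "畢業", "於", "中國", "科學", "學院", "科學院",
        "中國科學院", "計算", "計算所", "後", "在", "日本", "京都",
        "大學", "日本京都大學", "深造"],
    2: ["我", "來到", "北京", "清華", "清華大學", "華大", "大學"],
}


def build_inverted_index(tokenized_docs):
    """Map each token to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, tokens in tokenized_docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


index = build_inverted_index(docs)
print(sorted(index.get("大學", set())))    # documents containing 大學
print(sorted(index.get("科學院", set())))  # documents containing 科學院
print(sorted(index.get("上海", set())))    # token absent from both documents
```

Because both the long word (中國科學院) and its shorter parts (科學院, 科學) are indexed, a query for any of them can find document 1.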

import jieba

jieba.load_userdict("userdict.txt")
# To use only your own dictionary instead, call jieba.set_dictionary("d:\\python27\\lib\\site-packages\\jieba\\custom.txt"); here 麗江古城 was added to custom.txt

>>> import jieba.posseg as pseg
>>> words = pseg.cut("我愛北京天安門")
>>> for w in words:
...     print(w.word, w.flag)
...

# Dynamically adding and removing dictionary entries:

print(','.join(cut))

#encoding=utf-8
import jieba
import jieba.analyse

jieba.analyse.set_stop_words("c:\\python36\\lib\\site-packages\\jieba\\stop_words.txt")

seg_list = jieba.cut("我來到北京清華大學", cut_all=False)
print("default mode:", "/ ".join(seg_list))  # precise mode
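jieba.analyse.set_stop_words applies to keyword extraction; the tokens returned by jieba.cut are not filtered by it. If stop words should also be dropped from segmentation output, one option is to filter the tokens yourself. A minimal sketch, assuming a stop-word file with one word per line. The helper names `load_stop_words` and `remove_stop_words` and the tiny in-memory word list are illustrative assumptions, not part of jieba:

```python
import io


def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are neither stop words nor pure whitespace."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# io.StringIO stands in for open("stop_words.txt", encoding="utf-8").
stop_words = load_stop_words(io.StringIO("我\n了\n的\n"))

# Tokens copied from the precise-mode output shown earlier.
tokens = ["他", "來到", "了", "網易", "杭研", "大廈"]
filtered = remove_stop_words(tokens, stop_words)
print("/ ".join(filtered))
```

Using a set keeps each membership check constant-time, which matters once the stop-word list grows to a few thousand entries.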

#encoding=utf-8
import jieba.analyse
import jieba.posseg as pseg
import time

jieba.analyse.set_stop_words("c:\\python36\\lib\\site-packages\\jieba\\stop_words.txt")

# Text file to analyze; mind its encoding (UTF-8 is assumed below)
filename = '1.txt'

def file_jieba_wordcout(filename):
    # Read the whole file, then count how often each segmented word occurs
    with open(filename, 'r', encoding='utf-8') as f:
        text = f.read()
    counts = {}
    for word in jieba.cut(text):
        counts[word] = counts.get(word, 0) + 1
    return counts

def print_top100(filename):
    words = file_jieba_wordcout(filename)
    ranked = sorted(words.items(), key=lambda item: item[1], reverse=True)
    for item in ranked[:100]:
        print(item[0], item[1])

# Top 100 words by count
# print_top100(filename)

with open(filename, 'r', encoding='utf-8') as f:
    text = f.read()

# Top 100 keywords ranked by TF-IDF
tfidf_result = jieba.analyse.extract_tags(text, topK=100, withWeight=False, allowPOS=())
print(tfidf_result)

# Top 100 keywords ranked by TextRank, restricted to the listed parts of speech
# textrank_result = jieba.analyse.textrank(text, topK=100, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'))
# print(textrank_result)
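extract_tags ranks words by TF-IDF: a word scores high when it is frequent in the text but rare elsewhere. The sketch below shows the idea on a toy corpus and is not jieba's implementation. jieba relies on a prebuilt IDF table, whereas this example computes a smoothed IDF from three short token lists taken from the sentences used earlier; the function name `tf_idf` and the smoothing formula are assumptions made for illustration:

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Score each word in doc_tokens by term frequency times smoothed IDF."""
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        # Document frequency: how many documents contain the word.
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / total) * idf
    return scores


# Toy corpus of pre-segmented sentences (assumed, for illustration only).
corpus = [
    ["我", "來到", "北京", "清華大學"],
    ["他", "來到", "了", "網易", "杭研", "大廈"],
    ["我", "愛", "北京", "天安門"],
]

scores = tf_idf(corpus[0], corpus)
top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for word, score in top[:2]:
    print(word, round(score, 3))
```

In the first sentence every word occurs once, so the ranking is decided by IDF alone: 清華大學 appears in no other document and therefore ranks first.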

# Part-of-speech tagging
