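Python Chinese NLP (1): The NLTK Library

NLTK is a Python toolkit for natural language processing tasks, including tokenization, part-of-speech (POS) tagging, and text classification, and it is a fairly convenient ready-made tool. However, its tokenization module currently supports only English and does not segment Chinese text.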
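
To illustrate why Chinese text has to be segmented beforehand, here is a minimal standard-library sketch. It assumes the regular expression `\w+|[^\w\s]+`, which to my understanding is the pattern behind NLTK's `WordPunctTokenizer`; the helper name `regex_tokenize` and the sample sentences are made up for illustration.

```python
import re

# Assumption: this pattern mirrors the one used by nltk.tokenize.WordPunctTokenizer.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def regex_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    return WORD_PUNCT.findall(text)

# English words are separated by spaces, so the pattern finds word boundaries.
english = regex_tokenize("The delivery was fast, very fast.")

# Unsegmented Chinese has no spaces, so the whole sentence comes back as one token.
chinese = regex_tokenize("京東的配送速度真快")

# After segmentation (for example with jieba) the tokens are space-separated
# and the same pattern recovers them.
segmented = regex_tokenize("京東 的 配送 速度 真快")

print(english)
print(chinese)
print(segmented)
```

This is why the text files used later in this post are segmented first and only then loaded into NLTK.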

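Enter one of the following at the command line:

conda install nltk  # Anaconda environment
pip install nltk  # plain Python environment

Then, inside the corresponding environment, enter the following: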

import nltk
nltk.download()

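After running this, the NLTK ********** window pops up, where you can customize what to install (the author chose "all", i.e. install everything, roughly 3.2 GB). A successful installation looks like the figure below:

Getting started with the Natural Language Toolkit

[Note] The txt files have already been segmented with a Chinese word segmentation package such as jieba.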

comment4.txt contains the following: 從下單到手只用了 3 個多小時，真快啊，贊一下京東的配送速度，機子收到是原封的，深圳產，沒有陰陽屏和跑馬燈，還不錯，三星的 u ，但不糾結，也沒有感覺有多費電，啟用後買了 ac + ，可以隨意裸機體驗了，整體來說很滿意

comment5.txt contains the following: 使用了一周多才來評價優化過後開機 10 秒左右執行不卡頓螢幕清晰無漏光巧克力鍵盤觸感非常不錯音質也很好外觀漂亮質量輕巧尤其值得稱讚的是其散熱系統我玩 lol 三四個小時完全沒有發燙暫時沒有發現什麼缺點如果有光碟機就更好了值得入手值得入手值得入手～不枉費我浪費了 12 期免息券加首單減免 * 的優惠最後換了這台適合辦公的之前是買的惠普的暗夜精靈玩遊戲超棒的

① Method 1: PlaintextCorpusReader

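from nltk.corpus import PlaintextCorpusReader
# corpus_root = 'd://data//nltk_data//corpusdata'  # forward slashes are not treated as escape characters [corpus path]
corpus_root = r"d:\data\nltk_data\corpusdata"  # the r"" raw string prevents backslash escapes [corpus path]
file_pattern = ['comment4.txt', 'comment5.txt']  # [txt file names]
pcrtext = PlaintextCorpusReader(corpus_root, file_pattern)  # NLTK's loader for a local plain-text corpus
pcrtext.fileids()  # list the file names that were loaded
pcrtext.words('comment4.txt')  # the tokens of comment4.txt

The result is as follows: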
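
To try the reader without the original files, the following sketch builds a tiny corpus in a temporary directory and loads it the same way. The two sample strings are shortened, hand-segmented stand-ins for the real comment files, and the temporary directory replaces the `d:\data\...` path; both are assumptions made for illustration.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical, shortened stand-ins for the segmented comment files.
samples = {
    "comment4.txt": "真快 啊 ， 贊 一下 京東 的 配送 速度",
    "comment5.txt": "螢幕 清晰 無 漏光 ， 值得 入手",
}

# Write the samples as UTF-8 text into a throwaway corpus directory.
corpus_root = tempfile.mkdtemp()
for name, text in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf8") as f:
        f.write(text)

# A regular expression can select the files instead of an explicit list.
reader = PlaintextCorpusReader(corpus_root, r"comment.*\.txt")

fileids = reader.fileids()
words = list(reader.words("comment4.txt"))

print(fileids)
print(words)
```

Because the text is already space-separated, the reader's default word tokenizer returns one token per segmented word.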

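② Method 2: BracketParseCorpusReader [suited to corpora that have already been parsed]

from nltk.corpus import BracketParseCorpusReader
# corpus_root = 'd://data//nltk_data//corpusdata'  # forward slashes are not treated as escape characters [corpus path]
corpus_root = r"d:\data\nltk_data\corpusdata"  # the r"" raw string prevents backslash escapes [corpus path]
file_pattern = r"comment.*txt"  # regular expression matching the comment*txt files under corpus_root
bcrtext = BracketParseCorpusReader(corpus_root, file_pattern)  # initialize the reader with the corpus directory and the file pattern; the encoding defaults to utf8
bcrtext.fileids()  # list the file names that were loaded
bcrtext.raw('comment4.txt')  # the raw text of comment4.txt

The result is as follows: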
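
For comparison, here is a small sketch of what "already parsed" input might look like for this reader. The bracketed sentence, its labels (`S`, `NP`, `VP`, `V`, `ADV`) and the file name `parsed_demo.txt` are invented for illustration and do not come from the comment files above.

```python
import os
import tempfile

from nltk.corpus import BracketParseCorpusReader

# Hypothetical bracketed parse of a short sentence; the labels are illustrative only.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "parsed_demo.txt"), "w", encoding="utf8") as f:
    f.write("(S (NP 京東) (VP (V 配送) (ADV 真快)))\n")

reader = BracketParseCorpusReader(corpus_root, r"parsed_.*\.txt")

# parsed_sents() yields nltk.Tree objects, one per bracketed block.
tree = reader.parsed_sents("parsed_demo.txt")[0]
label = tree.label()
leaves = tree.leaves()

print(label)
print(leaves)
```

On plain segmented text such as comment4.txt there are no brackets to parse, so `raw()` is the method that applies there.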

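[Note] Recommended encoding flow: save the txt files as UTF-8 (input) -> Unicode (processing) -> UTF-8 (output).

In Python 3.x, strings are Unicode by default; in Python 2.x, the input has to be decoded to Unicode first.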
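
The flow in the note can be sketched with the standard library alone. The file name `comment_demo.txt` and the short sample sentence are placeholders chosen for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "comment_demo.txt")

# Input: the file on disk holds UTF-8 encoded bytes.
with open(path, "wb") as f:
    f.write("京東 的 配送 速度 真快".encode("utf8"))

# Processing: opening with encoding="utf8" decodes the bytes into a Unicode str.
with open(path, encoding="utf8") as f:
    text = f.read()
tokens = text.split()
result = "/".join(tokens)

# Output: encode back to UTF-8 when writing the result.
with open(path, "w", encoding="utf8") as f:
    f.write(result)

# Read the bytes back to confirm the round trip.
with open(path, "rb") as f:
    round_trip = f.read().decode("utf8")

print(tokens)
print(round_trip)
```

Passing `encoding="utf8"` explicitly avoids depending on the platform default, which on Windows is often not UTF-8.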

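import nltk
from nltk.text import TextCollection
# sinica_text = nltk.Text(pcrtext.words())  # pcrtext.words() returns every token in all loaded documents
sinica_text = nltk.Text(pcrtext.words('comment4.txt'))  # pcrtext.words('comment4.txt') returns every token in comment4.txt
mytexts = TextCollection(pcrtext)  # TextCollection() wraps the corpus as a collection of documents
len(mytexts._texts)  # number of documents in the collection
len(mytexts)  # number of tokens in the collection
the_set = set(sinica_text)  # drop repeated tokens to form the vocabulary
len(the_set)
for tmp in the_set:
    print(tmp, "【tf】", mytexts.tf(tmp, pcrtext.raw(['comment4.txt'])), "【idf】", mytexts.idf(tmp), "【tf_idf】", mytexts.tf_idf(tmp, pcrtext.raw(['comment4.txt'])))
# pcrtext.raw(['comment4.txt']) returns the full content of the given document, which is used here to compute tf and tf_idf.
# The tf, idf and tf_idf functions compute each token's value in the corpus and in the given document.
# Note: tf() divides text.count(term) by len(text), so with a raw string it is a substring count over the character count; pass sinica_text instead for token-level frequencies.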

Text of comment4.txt

Partial tf_idf results for comment4.txt:
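
To make the three quantities concrete, here is a standard-library sketch of the formulas as I understand `TextCollection` to apply them: tf is the term count divided by the document length, idf is the natural log of the number of documents over the number of documents containing the term, and tf_idf is their product. The two token lists are invented examples, not the real comment files.

```python
import math

# Hypothetical segmented documents (lists of tokens).
doc_a = ["配送", "速度", "真快", "配送"]
doc_b = ["螢幕", "清晰", "速度"]
documents = [doc_a, doc_b]

def tf(term, tokens):
    """Term frequency: occurrences of term divided by the number of tokens."""
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term), 0.0 if none."""
    matches = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / matches) if matches else 0.0

def tf_idf(term, tokens, docs):
    """Product of tf and idf."""
    return tf(term, tokens) * idf(term, docs)

for term in sorted(set(doc_a)):
    print(term, tf(term, doc_a), idf(term, documents), tf_idf(term, doc_a, documents))
```

A word such as 速度 that appears in every document gets an idf of 0, so its tf_idf is 0 no matter how often it occurs; 配送 appears only in the first document and scores highest there.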

Commonly used methods include:

④ len: can be used to gauge the density of repeated words
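
One way to read "density of repeated words" is the ratio of distinct tokens to total tokens, both obtained with `len`. The sketch below assumes that interpretation; the token list is a short invented sample echoing the repeated 值得 入手 in comment5.txt.

```python
# Hypothetical segmented tokens with deliberate repetition.
tokens = ["值得", "入手", "值得", "入手", "值得", "入手", "螢幕", "清晰"]

total = len(tokens)          # number of tokens, repeats included
distinct = len(set(tokens))  # size of the vocabulary

# Share of distinct tokens: values near 1.0 mean little repetition,
# values near 0.0 mean the same few words are repeated many times.
diversity = distinct / total

# Average number of times each distinct word is used.
average_repeats = total / distinct

print(total, distinct, diversity, average_repeats)
```

The same two calls work on an `nltk.Text` object, for example `len(set(sinica_text)) / len(sinica_text)`.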
