(2) NLTK Study Notes

2021-10-23 06:08:29

1. Tokenization

NLTK's built-in tokenizers:

from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer
from nltk import word_tokenize

LineTokenizer splits a string into lines:

ltokenizer = LineTokenizer()
print("output:", ltokenizer.tokenize(""))

SpaceTokenizer splits on the space character:

rawtext = "line…"
stokenizer = SpaceTokenizer()
print("output:", stokenizer.tokenize(rawtext))

TweetTokenizer handles special characters such as hashtags and emoticons:

ttokenizer = TweetTokenizer()
print("output:", ttokenizer.tokenize(""))
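The sample strings are left empty or elided in the notes above, so here is a minimal sketch that runs the three tokenizers on made-up input. The input strings are assumptions chosen for illustration; only the tokenizer classes come from the notes. None of these three tokenizers needs a downloaded NLTK data package.

```python
from nltk.tokenize import LineTokenizer, SpaceTokenizer, TweetTokenizer

# Hypothetical sample inputs (not from the original notes).
multi_line = "First line of text.\nSecond line, with more words."
sentence = "By 11 o'clock on Sunday"
tweet = "This is a cooool #dummysmiley: :-) :-P <3"

# LineTokenizer: one token per line (blank lines are discarded by default).
lines = LineTokenizer().tokenize(multi_line)
print("lines:", lines)

# SpaceTokenizer: splits on the single space character only,
# so punctuation stays attached to the neighbouring word.
space_tokens = SpaceTokenizer().tokenize(sentence)
print("space:", space_tokens)

# TweetTokenizer: keeps hashtags and emoticons together as single tokens.
tweet_tokens = TweetTokenizer().tokenize(tweet)
print("tweet:", tweet_tokens)
```

Because SpaceTokenizer splits on every single space, consecutive spaces in the input produce empty-string tokens; word_tokenize is usually the better choice for ordinary prose.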

2. Stemming

from nltk import PorterStemmer, LancasterStemmer, word_tokenize

raw = "line…"
# Tokenize
tokens = word_tokenize(raw)

# Porter stemmer: strips suffixes
porter = PorterStemmer()
pstems = [porter.stem(t) for t in tokens]
print(pstems)

# Lancaster stemmer: strips more suffixes (more aggressive)
lancaster = LancasterStemmer()
lstems = [lancaster.stem(t) for t in tokens]
print(lstems)
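To see how the two stemmers differ, it can help to run them side by side on a handful of words. The sketch below assumes a small hand-picked word list (not from the notes) and skips word_tokenize so that no NLTK data download is required.

```python
from nltk.stem import PorterStemmer, LancasterStemmer

# Hypothetical word list chosen for illustration.
words = ["running", "cats", "happily", "maximum", "organization"]

porter = PorterStemmer()
lancaster = LancasterStemmer()

porter_stems = [porter.stem(w) for w in words]
lancaster_stems = [lancaster.stem(w) for w in words]

# Print the original word next to both stems for comparison.
for word, p_stem, l_stem in zip(words, porter_stems, lancaster_stems):
    print(f"{word:15} porter={p_stem:12} lancaster={l_stem}")
```

Stems are not guaranteed to be dictionary words; both algorithms apply suffix rules without consulting a vocabulary.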

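3. Lemmatization (common words have their suffixes removed or replaced; proper nouns are left unchanged)

from nltk import WordNetLemmatizer

# tokens is the token list produced in section 2
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t) for t in tokens]
print(lemmas)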

4. Stop words

import nltk
# Load the corpus
from nltk.corpus import gutenberg
# Check that it loaded successfully
print(gutenberg.fileids())

# Get the list of all words in the text file
gd_words = gutenberg.words('bible-kjv.txt')
# Iterate and drop words shorter than 3 characters
words_filtered = [e for e in gd_words if len(e) >= 3]

Load the English stop words into the stopwords variable, then filter out all stop words:

stopwords = nltk.corpus.stopwords.words('english')
words = [w for w in words_filtered if w.lower() not in stopwords]
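The two filtering steps above can be sketched on a small in-memory example that runs without downloading any corpus. The token list and the tiny stop-word set here are assumptions made up for illustration; in real use the stop words would come from nltk.corpus.stopwords.words('english') as shown above.

```python
# Hypothetical tokens standing in for gutenberg.words(...).
tokens = ["In", "the", "beginning", "God", "created",
          "the", "heaven", "and", "the", "earth", "."]

# Hypothetical, deliberately tiny stop-word set (NOT NLTK's real list).
# A set gives constant-time membership tests, which matters on a large corpus.
stop_words = {"in", "the", "and", "of", "to", "a"}

# Step 1: drop tokens shorter than 3 characters (also removes punctuation).
long_enough = [t for t in tokens if len(t) >= 3]

# Step 2: drop stop words, comparing in lower case.
content_words = [t for t in long_enough if t.lower() not in stop_words]

print(content_words)
```

Wrapping NLTK's stop-word list in set(...) is a cheap change that can noticeably speed up the second comprehension on a text the size of the King James Bible.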

5. Edit distance

from nltk.metrics.distance import edit_distance

def my_edit_distance(str1, str2):
    # Get the string lengths and size the table at m x n
    m = len(str1) + 1
    n = len(str2) + 1
    # Create the table and initialize the first column and first row
    table = {}
    for i in range(m):
        table[i, 0] = i
    for j in range(n):
        table[0, j] = j
    # Fill in the matrix
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if str1[i-1] == str2[j-1] else 1
            table[i, j] = min(table[i, j-1] + 1, table[i-1, j] + 1, table[i-1, j-1] + cost)
    # The final edit distance is the bottom-right cell
    return table[m-1, n-1]
Call our function and the edit_distance() function from the nltk package to compute the edit distance between the two strings:

print("our algorithm :", my_edit_distance("hand", "and"))
print("nltk algorithm :", edit_distance("hand", "and"))
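To make the table-filling step concrete, here is a self-contained sketch of the same dynamic-programming recurrence using a list of lists instead of tuple keys, so the whole table can be printed. The function name levenshtein_table is a hypothetical helper introduced here, not part of NLTK.

```python
def levenshtein_table(source, target):
    """Return the full edit-distance table for source -> target."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]

    # First column: deleting i characters from source.
    for i in range(rows):
        table[i][0] = i
    # First row: inserting j characters of target.
    for j in range(cols):
        table[0][j] = j

    for i in range(1, rows):
        for j in range(1, cols):
            # Substitution is free when the characters match.
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j] + 1,         # deletion
                table[i - 1][j - 1] + cost,  # substitution or match
            )
    return table


table = levenshtein_table("hand", "and")
for row in table:
    print(row)

# The distance is the bottom-right cell.
distance = table[-1][-1]
print("distance:", distance)
```

For "hand" and "and" the distance is 1, since deleting the leading "h" is enough. Reading the bottom-right cell also handles empty strings, where the inner loops never run.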
