Python text analysis and preprocessing


This post walks through the basic building blocks of text analysis with NLTK: sentence segmentation, word tokenization, case conversion, stopword removal, stemming, and lemmatization.

Basic functions
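All of the snippets below rely on NLTK. Before running them, the Punkt tokenizer models, the stopword list, and the WordNet data need to be downloaded once; a minimal setup sketch, assuming NLTK itself is already installed:

import nltk

# One-time downloads of the NLTK resources used in this post:
# 'punkt' for sent_tokenize/word_tokenize, 'stopwords' for the English
# stopword list, and 'wordnet' for the lemmatizer.
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')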

# Sentence segmentation and word tokenization
import nltk

# Placeholder sample text for the demo
a = "NLTK makes text processing easy. It ships with many corpora and tokenizers."
s = nltk.sent_tokenize(a)          # split the text into sentences
print(s)

w = []                             # collect the tokens of every sentence
for i in s:
    for j in nltk.word_tokenize(i):
        w.append(j)
print(w)

# Case conversion
w = 'china'
print(w.lower())
print(w.upper())
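In practice, lower() is applied to every token before stopword filtering, since NLTK's stopword list is all lowercase; a tiny sketch with a hypothetical example sentence:

import nltk

tokens = nltk.word_tokenize("China joined the WTO in 2001.")
print([t.lower() for t in tokens])   # lowercase all tokens so they match the stopword list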

# Stopword removal
# nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')   # NLTK's built-in English stopword list
# print(stopwords)

# Placeholder sample text containing some common (stop) words
a2 = "This is a simple example. It shows how common words are removed from a text."
w2 = [nltk.word_tokenize(i) for i in nltk.sent_tokenize(a2.lower())]   # lowercase, then tokenize per sentence
print(w2)

nw = []                            # tokens that survive the stopword filter
for j in w2:
    for i in j:
        if i not in stopwords:
            nw.append(i)
print(nw)
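To see what is actually being filtered out, the stopword list itself can be inspected; a quick sketch (the sample entries in the comment reflect a typical NLTK release and may differ by version):

import nltk

sw = nltk.corpus.stopwords.words('english')
print(len(sw))    # size of the English stopword list
print(sw[:10])    # first few entries, e.g. 'i', 'me', 'my'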

# Stemming
from nltk.stem import PorterStemmer
print(PorterStemmer().stem('liking'))      # Porter stemmer

from nltk.stem import LancasterStemmer
print(LancasterStemmer().stem('knowing'))  # Lancaster stemmer (more aggressive)
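NLTK also ships a Snowball stemmer, which the file-processing example further down uses; a quick sketch (the output noted in the comment is an expectation, not taken from the original post):

from nltk.stem import SnowballStemmer
print(SnowballStemmer('english').stem('liking'))   # expected: 'like'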

# Lemmatization
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
print(WordNetLemmatizer().lemmatize('cars', 'n'))      # noun
print(WordNetLemmatizer().lemmatize('running', 'v'))   # verb
print(WordNetLemmatizer().lemmatize('fancier', 'a'))   # adjective
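The difference between stemming and lemmatization shows up on words like 'studies': a stemmer only chops suffixes and can return a string that is not a real word, while the WordNet lemmatizer maps to a dictionary form. A small sketch (the outputs in the comments are what NLTK typically produces):

from nltk.stem import PorterStemmer, WordNetLemmatizer

word = 'studies'
print(PorterStemmer().stem(word))                 # typically 'studi' - a truncated stem
print(WordNetLemmatizer().lemmatize(word, 'n'))   # typically 'study' - a dictionary form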

import nltk

stop = []            # raw lines from the stopword file
standard_stop = []   # final stopword list
text = []
after_text = []      # processed text (tokens kept after filtering)

file_stop = r'd:\stopwords.txt'   # stopword list
file_text = r'd:\before.txt'      # English text to process

# Read the stopword file, one stopword (or phrase) per line
with open(file_stop, 'r', encoding='utf-8-sig') as f:
    lines = f.readlines()
    for line in lines:
        lline = line.strip()
        stop.append(lline)

# Split each stopword line into individual words
for i in range(0, len(stop)):
    for word in stop[i].split():
        standard_stop.append(word)

# Some unwanted symbols showed up in the results, so add them as stopwords too
add_stop = ['/', '-', '@', '(', ')', ',', '&', '.', ':', ';']
standard_stop.extend(add_stop)

# Tokenize the text and drop everything that is in the stopword list
with open(file_text, 'r', encoding='utf-8-sig') as f:
    lines = f.readlines()
    for line in lines:
        s = nltk.sent_tokenize(line)
        print(s)
        for i in s:
            for j in nltk.word_tokenize(i):
                if j not in standard_stop:
                    after_text.append(j)
print(after_text)

# Stem the remaining tokens with the Snowball stemmer
s = nltk.stem.SnowballStemmer('english')
cleaned_text = [s.stem(ws) for ws in after_text]
print(cleaned_text)

# Write the cleaned tokens back out
with open(r'd:\after.txt', 'w+') as f:
    for i in cleaned_text:
        f.write(i + ' ')   # separate tokens with a space
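The same pipeline can also be wrapped into a small helper so other files can be processed the same way; a minimal sketch with a hypothetical function name (preprocess_file), following the same steps as above:

import nltk
from nltk.stem import SnowballStemmer

def preprocess_file(path, stop_words):
    """Tokenize a text file, drop stopwords, and stem the remaining tokens."""
    stemmer = SnowballStemmer('english')
    kept = []
    with open(path, 'r', encoding='utf-8-sig') as f:
        for line in f:
            for sentence in nltk.sent_tokenize(line):
                for token in nltk.word_tokenize(sentence):
                    if token not in stop_words:
                        kept.append(stemmer.stem(token))
    return kept

# Example call with the same files as above:
# print(preprocess_file(r'd:\before.txt', standard_stop))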

