Chinese Text Analysis (1): Word Segmentation

import jieba

import re

Data format:

["晚上想吃五花肉土豆蓋澆飯",
 "今晚吃雞嘿咻嘿",
 "綠皮環保小火車進站",
 "一首《夢醒時分》送給大家"]

The detailed workflow is as follows:

Goal: remove special symbols from the text

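sentence = [
    "晚上想吃五花肉土豆蓋澆飯",
    "今晚吃雞嘿咻嘿",
    "綠皮環保小火車進站",
    "一首《夢醒時分》送給大家",
]

def subreplace(lines):
    # Remove special symbols from the text.
    # re.compile compiles the regular expression into a pattern object.
    # Note: inside the character class, "%-." is a range from "%" to ".",
    # so it also matches & ' ( ) * + , and -.
    regex = re.compile(r"[0-9__~（）《》___()、/，...,！。：:;%-. 【】]")
    result = []
    for line in lines:
        line = regex.sub('', str(line))  # replace every match with an empty string
        result.append(line)
    return result

print(subreplace(sentence))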
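The blacklist pattern above has to list every unwanted symbol. An alternative is a whitelist that keeps only Chinese characters and drops everything else. The following is a minimal sketch of that alternative, not the author's method: the name `keep_cjk` and the choice of the range U+4E00 to U+9FFF (the CJK Unified Ideographs block) are assumptions made for illustration, and the range would need to be widened if Latin letters, digits, or rarer characters should survive.

```python
import re

# Assumption: only characters in the CJK Unified Ideographs block are kept.
CJK_ONLY = re.compile(r"[^\u4e00-\u9fff]")

def keep_cjk(lines):
    """Return a new list with every non-CJK character removed from each line."""
    return [CJK_ONLY.sub("", str(line)) for line in lines]

# The second sentence is a made-up example containing a digit and punctuation.
sample = ["一首《夢醒時分》送給大家", "今晚8點，吃雞!"]
cleaned = keep_cjk(sample)
print(cleaned)
```

A whitelist is usually easier to maintain for noisy social-media text, at the cost of also discarding anything outside the chosen range.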

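Note: the custom stop-word list and the custom segmentation dictionary can be edited in Notepad++. Make sure they are saved as UTF-8. The files can be placed in the following location: d:/python/python/lib/site-packages/jieba/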
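A jieba user dictionary is a UTF-8 text file with one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The sketch below writes such a file and shows how it would be loaded. The helper names `write_user_dict` and `load_custom_dict`, the temporary directory, the frequency 10, and the tag `n` are all hypothetical; only the two words come from this article.

```python
import os
import tempfile

def write_user_dict(path, entries):
    """Write (word, freq, tag) tuples in jieba's user-dictionary line format.

    freq and tag may be None; both fields are optional in the file format.
    """
    with open(path, "w", encoding="utf-8") as f:
        for word, freq, tag in entries:
            parts = [word]
            if freq is not None:
                parts.append(str(freq))
            if tag is not None:
                parts.append(tag)
            f.write(" ".join(parts) + "\n")

def load_custom_dict(path):
    """Load the dictionary into jieba (imported lazily; jieba must be installed)."""
    import jieba
    jieba.load_userdict(path)

# Hypothetical entries: one bare word, one with an illustrative frequency and tag.
entries = [("夢醒時分", None, None), ("吃雞", 10, "n")]
dict_path = os.path.join(tempfile.mkdtemp(), "my_dict_1.txt")
write_user_dict(dict_path, entries)

with open(dict_path, encoding="utf-8") as f:
    dict_lines = f.read().splitlines()
print(dict_lines)
# load_custom_dict(dict_path)  # uncomment when jieba is available
```

Because `jieba.load_userdict` accepts a file path, passing an explicit path like this avoids depending on the current working directory.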

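def cut_word(sentences_list):
    all_result_list = []
    # Load the custom dictionary (e.g. 夢醒時分, 吃雞) before segmenting.
    jieba.load_userdict("my_dict_1.txt")
    for sentence in sentences_list:
        # upper() normalizes any Latin letters; Chinese characters are unaffected.
        result_list = [word.upper() for word in jieba.cut(sentence)]
        all_result_list.append(result_list)
    return all_result_list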

Output:

[['晚上', '想', '吃', '五花肉', '土豆', '蓋澆飯'],
 ['今晚', '吃雞', '嘿咻嘿'],
 ['綠皮', '環保', '小', '火車', '進站'],
 ['一首', '夢醒時分', '送給', '大家']]

def stop_words_list():
    # Load the stop words, one per line.
    stop_words = []
    with open("my_stopword.txt", encoding="utf-8") as file_obj:
        for word in file_obj:
            stop_words.append(str(word.strip()))
    return stop_words

def del_stop_words(word_list):
    stop_words = stop_words_list()  # load the stop words
    result = []
    all_result = []
    for sentences in word_list:
        for word in sentences:
            if word.isspace():  # skip whitespace-only tokens
                pass
            elif word not in stop_words:
                result.append(word)
            else:
                pass
        all_result.append(result)
        result = []  # start a fresh list for the next sentence
    return all_result

Output:

[['晚上', '吃', '五花肉', '土豆', '蓋澆飯'],
 ['今晚', '吃雞'],
 ['綠皮', '環保', '火車', '進站'],
 ['一首', '夢醒時分', '送給', '大家']]
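Checking membership in a list scans it on every lookup, which becomes slow with a long stop-word list. A minimal sketch of the same filtering step using a set and a list comprehension follows. The function name `remove_stop_words` is hypothetical, and the three stop words are an assumption: they are simply the tokens that disappear between the segmentation output and the filtered output shown above, not the contents of the author's `my_stopword.txt`.

```python
def remove_stop_words(word_lists, stop_words):
    """Drop whitespace-only tokens and stop words from each token list."""
    stop_set = set(stop_words)  # set membership is O(1) on average
    return [
        [w for w in words if not w.isspace() and w not in stop_set]
        for words in word_lists
    ]

# Segmented sentences copied from the segmentation output earlier in the article.
segmented = [
    ["晚上", "想", "吃", "五花肉", "土豆", "蓋澆飯"],
    ["今晚", "吃雞", "嘿咻嘿"],
    ["綠皮", "環保", "小", "火車", "進站"],
    ["一首", "夢醒時分", "送給", "大家"],
]
# Assumption: a stop-word list that would explain the filtered output.
assumed_stop_words = ["想", "小", "嘿咻嘿"]

filtered = remove_stop_words(segmented, assumed_stop_words)
for words in filtered:
    print(words)
```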

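Note: the synonym dictionary can be edited in Notepad++. Make sure it is saved as UTF-8. Each line holds one group of synonyms separated by tab characters, and the first word on the line is the replacement word. The file can be placed in the following location: d:/python/python/lib/site-packages/jieba/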

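def replace_syn(word_list):
    # 1. Read the synonym table and build a dictionary
    #    that maps each synonym to its replacement word.
    synonym_dict = {}
    with open("my_synonym.txt", encoding="utf-8") as file_obj:
        for line in file_obj:
            seperate_word = line.strip().split("\t")
            num = len(seperate_word)
            for i in range(1, num):
                synonym_dict[seperate_word[i]] = seperate_word[0]
    # 2. Replace every word that has an entry in the dictionary.
    sen = []
    result = []
    for sentences in word_list:
        for word in sentences:
            if word in synonym_dict:
                word = synonym_dict[word]
            sen.append(word)
        result.append(sen)
        sen = []  # start a fresh list for the next sentence
    return result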

Output:

[['今晚', '吃', '五花肉', '土豆', '蓋澆飯'],
 ['今晚', '吃雞'],
 ['綠皮', '環保', '火車', '進站'],
 ['一首', '夢醒時分', '送給', '大家']]
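The synonym step can be checked without touching the file system by feeding the parser an in-memory file. The sketch below is a compact variant under stated assumptions, not the author's code: the names `build_synonym_map` and `apply_synonyms` are hypothetical, and the single synonym line (今晚 as the replacement for 晚上) is inferred from the output above rather than taken from the author's `my_synonym.txt`.

```python
import io

def build_synonym_map(lines):
    """Map every synonym on a tab-separated line to the first word on that line."""
    mapping = {}
    for line in lines:
        words = line.strip().split("\t")
        for synonym in words[1:]:
            mapping[synonym] = words[0]
    return mapping

def apply_synonyms(word_lists, mapping):
    """Replace each token by its replacement word; unknown tokens are kept."""
    return [[mapping.get(w, w) for w in words] for words in word_lists]

# Assumption: a one-line synonym file, replacement word first, then its synonyms.
fake_file = io.StringIO("今晚\t晚上\n")
synonym_map = build_synonym_map(fake_file)

replaced = apply_synonyms([["晚上", "吃", "五花肉"], ["今晚", "吃雞"]], synonym_map)
print(replaced)
```

Using `dict.get(w, w)` folds the "replace if present, otherwise keep" branch into one expression.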
