Chinese Text Sentence Segmentation

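Sentence segmentation can be as simple or as complicated as you make it. Most natural language processing tasks are not strict about it, and splitting on end-of-sentence punctuation is usually enough. Fields that work specifically on text-related projects may set a higher bar, and getting segmentation 100% right means accounting for much of the language's own grammar. What follows is a middle-of-the-road version. Take a passage from 《背影》 as an example: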

我心裡暗笑他的迂；他們只認得錢，託他們只是白託！而且我這樣大年紀的人，難道還不能料理自己麼？唉，我現在想想，那時真是太聰明了！

我說道：「爸爸，你走吧。」他往車外看了看說：「我買幾個橘子去。你就在此地，不要走動。」我看那邊月台的柵欄外有幾個賣東西的等著顧客。走到那邊月台，須穿過鐵道，須跳下去又爬上去。
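Before the quote-aware version, it may help to see what plain splitting on end-of-sentence punctuation does to the dialogue above. The following is only a minimal sketch of that baseline. The names `END_MARKS` and `naive_split` are hypothetical and do not come from the original code.

```python
import re

# Hypothetical baseline, for illustration only: split on end marks and nothing else.
END_MARKS = r"(？|。|！|……)"

def naive_split(paragraph):
    """Split after each end-of-sentence mark and keep the mark with its sentence."""
    parts = re.split(END_MARKS, paragraph)
    # With one capturing group, text sits at even indexes and marks at odd indexes.
    sentences = [text + mark for text, mark in zip(parts[0::2], parts[1::2])]
    # The last element is whatever follows the final mark; keep it if it is not empty.
    if parts[-1].strip():
        sentences.append(parts[-1])
    return [s.strip() for s in sentences if s.strip()]

dialogue = "我說道：「爸爸，你走吧。」他往車外看了看說：「我買幾個橘子去。你就在此地，不要走動。」"
for sentence in naive_split(dialogue):
    print(sentence)
```

On this dialogue the baseline cuts inside the quotations: each closing 」 ends up at the start of the following piece, and the father's two-sentence reply is split in half. Those are the cases a quote-aware splitter has to repair.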

Python implementation:

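import re

def __merge_symmetry(sentences, symmetry=('「', '」')):
    '''Merge sentences that fall inside a pair of symmetric marks, such as quotation marks.'''
    effective_ = []
    merged = True  # False while an opening mark is still waiting for its closing mark
    for index in range(len(sentences)):
        if symmetry[0] in sentences[index] and symmetry[1] not in sentences[index]:
            # Opening mark without its closing mark: start a new, still-open sentence.
            merged = False
            effective_.append(sentences[index])
        elif symmetry[1] in sentences[index] and not merged:
            # The closing mark arrives: attach this piece and close the quotation.
            merged = True
            effective_[-1] += sentences[index]
        elif symmetry[0] not in sentences[index] and symmetry[1] not in sentences[index] and not merged:
            # Still inside an open quotation: keep attaching.
            effective_[-1] += sentences[index]
        else:
            # Outside any open quotation: keep the sentence as it is.
            effective_.append(sentences[index])
    return [i.strip() for i in effective_ if len(i.strip()) > 0]

def to_sentences(paragraph):
    """Split a paragraph into sentences."""
    # The capturing group keeps each end-of-sentence mark in the split result.
    sentences = re.split(r"(？|。|！|……)", paragraph)
    # Re-attach each mark to the text that precedes it.
    sentences = ["".join(i) for i in zip(sentences[0::2], sentences[1::2])]
    sentences = [i.strip() for i in sentences if len(i.strip()) > 0]
    # A closing quote right after an end mark belongs to the previous sentence.
    for j in range(1, len(sentences)):
        if sentences[j][0] == '」':
            sentences[j-1] = sentences[j-1] + '」'
            sentences[j] = sentences[j][1:]
    return __merge_symmetry(sentences)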

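The two main considerations are that each sentence keeps its end-of-sentence punctuation after splitting, and that quoted speech stays in one piece when characters are talking. Segmentation result: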

我心裡暗笑他的迂；他們只認得錢，託他們只是白託！
而且我這樣大年紀的人，難道還不能料理自己麼？
唉，我現在想想，那時真是太聰明了！
我說道：「爸爸，你走吧。」
他往車外看了看說：「我買幾個橘子去。你就在此地，不要走動。」
我看那邊月台的柵欄外有幾個賣東西的等著顧客。
走到那邊月台，須穿過鐵道，須跳下去又爬上去。
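The same two considerations (keep the end mark, keep quoted speech whole) can also be handled in a single pass over the characters by tracking whether a quotation is currently open. The sketch below is an alternative technique, not the method used in the code above. The names `split_sentences`, `END_MARKS`, `OPEN_QUOTE` and `CLOSE_QUOTE` are hypothetical, and treating half-width `!` and `?` as end marks is an added assumption.

```python
END_MARKS = "。？！…!?"  # assumption: half-width ! and ? also end a sentence
OPEN_QUOTE, CLOSE_QUOTE = "「", "」"

def split_sentences(text):
    """Cut after end marks, but never inside an open 「...」 quotation."""
    sentences, start, depth = [], 0, 0
    for i, ch in enumerate(text):
        nxt = text[i + 1] if i + 1 < len(text) else ""
        cut = False
        if ch == OPEN_QUOTE:
            depth += 1
        elif ch == CLOSE_QUOTE:
            depth = max(depth - 1, 0)
            # A quotation that ends with an end mark also ends the sentence.
            cut = depth == 0 and i > 0 and text[i - 1] in END_MARKS
        elif ch in END_MARKS and depth == 0:
            # Wait for the last mark of a run such as "……", or for a closing quote.
            cut = nxt == "" or nxt not in END_MARKS + CLOSE_QUOTE
        if cut:
            sentences.append(text[start:i + 1].strip())
            start = i + 1
    # Keep any trailing text that has no end mark.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]

sample = "我說道：「爸爸，你走吧。」他往車外看了看說：「我買幾個橘子去。你就在此地，不要走動。」走到那邊月台，須穿過鐵道。"
for sentence in split_sentences(sample):
    print(sentence)
```

Compared with splitting first and merging afterwards, this variant keeps a closing quote at the very end of the text and any trailing text without an end mark. It assumes quotation marks are balanced and does not try to handle other paired marks.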
