Text Preprocessing (5): A Simple Text Correction Example

The previous section left one small question open: how to perform spelling correction given an English text corpus.

First, we take a corpus file, "bayes_train_text.txt", and count how often each word occurs in it.

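import re
import collections

# Extract every word in the corpus, converted to lowercase
def words(text):
    return re.findall("[a-z]+", text.lower())

# A word missing from the corpus gets a default count of 1,
# which avoids a prior probability of 0
def train(features):
    model = collections.defaultdict(lambda: 1)  # missing keys default to 1
    for f in features:
        model[f] += 1  # count occurrences
    return model

with open("bayes_train_text.txt") as corpus_file:
    words_n = train(words(corpus_file.read()))
print(words_n)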

This prints the word-frequency dictionary built from the corpus.
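
Because the corpus file itself is not shown here, the counting step can be tried on a short inline string instead. The sketch below is only an illustration: `sample_text` is made up, and the code repeats the same `words` and `train` logic to show the effect of the default value of 1.

```python
import collections
import re

def words(text):
    # Lowercase the text and keep runs of the letters a-z only
    return re.findall("[a-z]+", text.lower())

def train(features):
    # Missing keys start at 1, so every stored count is (occurrences + 1)
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# Hypothetical sample text, standing in for bayes_train_text.txt
sample_text = "The cat and the dog. The END!"
counts = train(words(sample_text))

print(words(sample_text))
print(counts["the"])      # seen three times, stored as 4
print("bird" in counts)   # membership test does not create the key
print(counts["bird"])     # lookup of an unseen word yields the default 1
```

Note that indexing with `[]` inserts the missing key into a `defaultdict`, whereas `in` does not. This is why a dictionary-membership check should use `in` rather than a lookup.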

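Next, generate the candidate corrections: every string within edit distance 1 or 2 of the input word.

# English alphabet
alphabet = "abcdefghijklmnopqrstuvwxyz"

# All strings at edit distance 1 from the word
def edits1(word):
    n = len(word)
    # Deletion: drop one letter
    s1 = [word[0:i] + word[i+1:] for i in range(n)]
    # Transposition: swap two adjacent letters
    s2 = [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)]
    # Substitution: replace one letter
    s3 = [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet]
    # Insertion: add one letter
    s4 = [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet]
    edits1_words = set(s1 + s2 + s3 + s4)
    return edits1_words

# All strings at edit distance 2 from the word
def edits2(word):
    edits2_words = set(e2 for e1 in edits1(word) for e2 in edits1(e1))
    return edits2_words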
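
To get a feel for how many candidates these functions generate, the sketch below repeats `edits1` and `edits2` in condensed form and checks them on the two misspellings used later. For a word of length n there are n deletions, n-1 transpositions, 26n substitutions and 26(n+1) insertions, that is 54n+25 strings before duplicates are removed. The helper `raw_edit_count` is a name introduced here for illustration.

```python
import string

alphabet = string.ascii_lowercase

def edits1(word):
    n = len(word)
    deletes = [word[:i] + word[i+1:] for i in range(n)]
    transposes = [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(n - 1)]
    replaces = [word[:i] + c + word[i+1:] for i in range(n) for c in alphabet]
    inserts = [word[:i] + c + word[i:] for i in range(n + 1) for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def raw_edit_count(n):
    # Number of generated strings before de-duplication
    return n + (n - 1) + 26 * n + 26 * (n + 1)

print(raw_edit_count(3))             # 187 raw strings for a 3-letter word
print(len(edits1("het")) < 187)      # the set is smaller because duplicates collapse
print("and" in edits1("annd"))       # one deletion away
print("the" in edits1("het"))        # not reachable in a single edit as defined here
print("the" in edits2("het"))        # reachable in two edits: het -> hte -> the
```

The distance-2 set grows quickly, so filtering it against the dictionary before ranking keeps the candidate list small.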

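With the candidates in hand, some algorithm has to pick the most likely correction. Since we have no historical corpus that pairs errors with their corrections, we rank each candidate purely by how often it occurs in the corpus: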

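# Keep only the words that appear in the dictionary
def known(words):
    return set(w for w in words if w in words_n)

def correct(word):
    if word not in words_n:
        candidates = known(edits1(word)) | known(edits2(word))
        # Pick the most frequent candidate; fall back to the word itself if there is none
        return max(candidates, key=lambda w: words_n[w], default=word)
    else:
        # The word is already in the dictionary, so there is nothing to correct
        return None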
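
The selection step can also be looked at in isolation. The following is a minimal sketch, not the author's code: `pick_best`, `toy_counts` and the candidate sets are hypothetical, and returning the original word when no known candidate exists is an assumption made here.

```python
def pick_best(word, candidates, counts):
    # Keep only candidates that actually occur in the corpus counts
    known = [c for c in candidates if c in counts]
    if not known:
        # Assumed fallback: leave the word unchanged rather than fail
        return word
    # The most frequent known candidate wins
    return max(known, key=lambda c: counts[c])

# Hypothetical corpus counts
toy_counts = {"the": 80, "he": 20, "hat": 5}

print(pick_best("het", {"the", "he", "hat", "hte"}, toy_counts))  # the
print(pick_best("qzxv", {"qzx", "zxv"}, toy_counts))              # qzxv
```

The guard matters because calling `max` on an empty sequence without a `default` raises `ValueError`.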

A few simple experiments:

print(correct("het"))
print(correct("annd"))
# Output:
# the
# and

As these examples show, for ordinary spelling errors this Bayesian-style correction works reasonably well.
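
The Bayesian principle referred to here can be spelled out as follows. This is a sketch of the usual noisy-channel reading, under the assumption (implied by ranking on frequency alone) that every known candidate within edit distance 2 is treated as equally likely to have produced the typo. The symbols are introduced here for illustration: $w$ is the misspelled word, $c$ a candidate, $C(w)$ the set of known candidates, and $N(c)$ the smoothed corpus count of $c$.

```latex
\begin{aligned}
\hat{c} &= \arg\max_{c \in C(w)} P(c \mid w) \\
        &= \arg\max_{c \in C(w)} \frac{P(w \mid c)\, P(c)}{P(w)} \\
        &= \arg\max_{c \in C(w)} P(w \mid c)\, P(c) \\
        &\approx \arg\max_{c \in C(w)} P(c) \qquad \text{(uniform } P(w \mid c) \text{ assumed)} \\
        &= \arg\max_{c \in C(w)} N(c)
\end{aligned}
```

The last line is what `max(candidates, key=lambda w: words_n[w])` computes, since dividing every count by the same corpus total does not change which candidate is largest.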

In addition, a more recent and stronger correction model is Soft-Masked BERT, presented by researchers from Fudan University in a paper at ACL 2020:

"Spelling Error Correction with Soft-Masked BERT"
