Aligning Same-Language Parallel Corpora for Text Paraphrasing

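For machine translation, aligned parallel corpora are plentiful, and most of them come preprocessed and ready to use. Even when the data is not aligned, there are tools that can help, such as tmxmall. However, tmxmall is built for aligning translated sentence pairs and does not support aligning two texts in the same language.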

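If you are working on text paraphrasing (text rewriting) or on text style transfer, you may need a parallel corpus in which both sides are in the same language. In that case you will inevitably run into the alignment problem. Specifically: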

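Problem: you have two long passages a and b, in the same language and parallel to each other, that are too long and must be split into sentences before they can be fed to a network. How do you split them so that the two sides still correspond? If you simply split on "。", the sentence counts on the two sides usually do not match.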
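To make the mismatch concrete, here is a minimal sketch of a splitter that cuts after a chosen set of punctuation marks and keeps each mark attached to its sentence. The function name split_keep_delims and the two example passages are made up for illustration; they do not come from the original code.

```python
import re


def split_keep_delims(text, delims="。！？"):
    """Split text after runs of characters in delims, keeping the punctuation.

    Hypothetical helper for illustration; pass delims="，。！？" for
    comma-level splitting.
    """
    d = re.escape(delims)
    # Either some text followed by one or more delimiters,
    # or trailing text that has no closing delimiter.
    pattern = "[^{0}]*[{0}]+|[^{0}]+".format(d)
    return [piece for piece in re.findall(pattern, text) if piece.strip()]


# Two made-up parallel passages that say roughly the same thing.
a = "今天下雨，我沒出門。我在家看書。晚上才停。"
b = "今天下了雨。所以我沒有出門，而是在家看書，直到晚上雨才停。"

sents_a = split_keep_delims(a)
sents_b = split_keep_delims(b)
print(len(sents_a), len(sents_b))  # the two counts differ
```

Because the two sides disagree on where sentences end, some pieces have to be merged before the lists can be paired up one to one.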

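A natural idea is to first split both texts on sentence-level punctuation (or directly on comma-level punctuation), and then walk through the split versions of a and b at the same time.

The traversal indices are indexa and indexb.

At the same time, build an initially empty array of sentence pairs c, whose current index is index.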

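To simplify, the indexa-th sentence of a and the indexb-th sentence of b can go one of the following ways:

Sentence indexa and sentence indexb form a new sentence pair, which is appended to the end of c. Then index += 1, indexa += 1, indexb += 1.

Sentence indexa is merged into sentence indexa - 1, which modifies the last pair in c. Then indexa += 1.

Sentence indexa is merged with sentence indexa + 1, and the result is paired with sentence indexb as a new pair appended to the end of c. Then index += 1, indexa += 2, indexb += 1.

Sentence indexb is merged into sentence indexb - 1, which modifies the last pair in c. Then indexb += 1.

Sentence indexb is merged with sentence indexb + 1, and the result is paired with sentence indexa as a new pair appended to the end of c. Then index += 1, indexa += 1, indexb += 2.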

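Which case to choose depends on the score of each case. (Note that this does not cover every possible case, and the scoring does not take the length after merging into account.)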

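How to compute the score depends on the task. For text paraphrasing, the two sides of the corpus are very similar sentences, so something like Jaccard similarity is enough.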
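For reference, the standard Jaccard similarity of two token sets A and B, together with a variant that counts repeated tokens (which is what the jaccardrepeated function further down computes), can be written as follows. The notation is introduced here for illustration: c_a(t) and c_b(t) are the number of times token t occurs in sentences a and b, and |a|, |b| are the token counts including repeats.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|},
\qquad
J_{\mathrm{rep}}(a, b) =
  \frac{\sum_t \min\bigl(c_a(t),\, c_b(t)\bigr)}
       {|a| + |b| - \sum_t \min\bigl(c_a(t),\, c_b(t)\bigr)}
```

Subtracting the overlap in the denominator is what lets two identical sentences score exactly 1.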

Full code:

1. A simple Jaccard implementation:

import re
import jieba
from collections import defaultdict


def jaccard(x, y):
    # Set-based Jaccard similarity: |x & y| / |x | y|.
    x = set(x)
    y = set(y)
    if not x and not y:
        return 1
    return len(x & y) / len(x | y)


def countdic(x):
    # Count how many times each token occurs.
    d = defaultdict(int)
    for i in x:
        d[i] = d[i] + 1
    return d


def jaccardrepeated(a, b):
    # Jaccard similarity that takes repeated tokens into account.
    longdic = a if len(a) >= len(b) else b
    shortdic = a if len(a) < len(b) else b
    totallen = len(a) + len(b)
    longlen, shortlen = len(longdic), len(shortdic)
    if totallen == 0:
        return 1
    longdic = countdic(longdic)
    shortdic = countdic(shortdic)
    num = 0
    for key in shortdic.keys():
        num = num + min(shortdic[key], longdic[key])
    # Using the total length as the denominator would be a problem:
    # the similarity could never reach 1.
    return num / (longlen + shortlen - num)


def jacseten(x, y, repeat=True, tokenmode='jieba'):
    # Sentence similarity: tokenize with jieba, otherwise fall back to characters.
    if tokenmode == 'jieba':
        x = list(jieba.cut(x))
        y = list(jieba.cut(y))
    else:
        x = list(x)
        y = list(y)
    if repeat:
        return jaccardrepeated(x, y)
    return jaccard(x, y)
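As a cross-check on jaccardrepeated, the same multiset Jaccard can be written with collections.Counter, whose & and | operators take the element-wise minimum and maximum of the counts. This is only an alternative sketch, not the author's code: the name multiset_jaccard is made up here, and it works on characters so that it runs without jieba.

```python
from collections import Counter


def multiset_jaccard(x, y):
    """Jaccard similarity over multisets of tokens (hypothetical helper)."""
    cx, cy = Counter(x), Counter(y)
    union = sum((cx | cy).values())  # element-wise max of the counts
    if union == 0:
        return 1  # two empty inputs: same convention as jaccardrepeated
    inter = sum((cx & cy).values())  # element-wise min of the counts
    return inter / union


same = multiset_jaccard("今天下雨了", "今天下雨了")
close = multiset_jaccard("今天下雨了", "今天下了大雨")
far = multiset_jaccard("今天下雨了", "我在家看書")
repeated = multiset_jaccard("哈哈哈", "哈")
print(same, close, far, repeated)
```

The last call shows why the repeated-token variant matters: a plain set-based Jaccard would treat "哈哈哈" and "哈" as identical, while the multiset version does not.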

2. An implementation of the approach described above (it still needs optimizing, but it works):

## Merge neighbouring sentences by similarity so that the two sides of the
## parallel corpus line up. Still needs to be improved.
def merge(sen1, sen2, tokenmode=None, maxlen=256):
    # Both lists must be non-empty; the first sentences seed the first pair.
    res1 = [sen1[0]]
    res2 = [sen2[0]]
    index1 = 1
    index2 = 1
    while index1 < len(sen1) and index2 < len(sen2):
        # sim[0]: current sentences form a new pair
        # sim[1]: current sentence of sen1 joins the previous pair
        # sim[2]: current sentence of sen2 joins the previous pair
        # sim[3]: current + next sentence of sen1 against current sentence of sen2
        # sim[4]: current sentence of sen1 against current + next sentence of sen2
        sim = [0, 0, 0, 0, 0]
        sim[0] = jacseten(sen1[index1], sen2[index2], tokenmode=tokenmode)
        sim[1] = jacseten(res1[-1] + sen1[index1], res2[-1], tokenmode=tokenmode)
        sim[2] = jacseten(res1[-1], res2[-1] + sen2[index2], tokenmode=tokenmode)
        if index1 + 1 < len(sen1):
            sim[3] = jacseten(sen1[index1] + sen1[index1 + 1], sen2[index2], tokenmode=tokenmode)
        if index2 + 1 < len(sen2):
            sim[4] = jacseten(sen1[index1], sen2[index2] + sen2[index2 + 1], tokenmode=tokenmode)
        maxindex = sim.index(max(sim))
        # Each merge is only taken if the merged text stays within maxlen.
        if maxindex == 1 and len(res1[-1]) + len(sen1[index1]) <= maxlen:
            res1[-1] = res1[-1] + sen1[index1]
            index1 += 1
        elif maxindex == 2 and len(res2[-1]) + len(sen2[index2]) <= maxlen:
            res2[-1] = res2[-1] + sen2[index2]
            index2 += 1
        elif maxindex == 3 and len(sen1[index1]) + len(sen1[index1 + 1]) <= maxlen:
            res1.append(sen1[index1] + sen1[index1 + 1])
            res2.append(sen2[index2])
            index1 += 2
            index2 += 1
        elif maxindex == 4 and len(sen2[index2]) + len(sen2[index2 + 1]) <= maxlen:
            res1.append(sen1[index1])
            res2.append(sen2[index2] + sen2[index2 + 1])
            index1 += 1
            index2 += 2
        else:
            # Default: pair the two current sentences as they are.
            res1.append(sen1[index1])
            res2.append(sen2[index2])
            index1 += 1
            index2 += 1
    # Whatever is left on either side is appended to the last pair.
    if index1 < len(sen1):
        res1[-1] = res1[-1] + ''.join(sen1[index1:])
    if index2 < len(sen2):
        res2[-1] = res2[-1] + ''.join(sen2[index2:])
    assert len(res1) == len(res2)
    return res1, res2

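In practice, if you want sentence pairs made of complete sentences, you would normally split on "。" first. But if the alignment then merges sentences freely without considering length, you end up with extremely long sentences. In that case, you can split the overlong pairs again at the "，" level and run the alignment above on them once more. (A few fairly long pairs will still remain at the end; just discard those.)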

This process is already implemented in the code linked above, so I won't go into detail here.
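For readers who do not have that code at hand, the second pass can be sketched roughly as below. This is not the linked implementation: refine_long_pairs, split_on_commas and the align parameter are hypothetical names, and align stands for any function with the same call shape as merge above (two sentence lists in, two equally long lists out).

```python
import re


def split_on_commas(text):
    """Split after each run of "，" or "。", keeping the punctuation."""
    return re.findall(r"[^，。]*[，。]+|[^，。]+", text)


def refine_long_pairs(pairs, align, maxlen=256):
    """Re-split and re-align pairs where either side is longer than maxlen.

    pairs: list of (sentence_a, sentence_b) from a first, sentence-level pass.
    align: callable(list_a, list_b) -> (list_a2, list_b2) of equal length.
    Pairs that are still too long after the second pass are dropped.
    """
    result = []
    for sa, sb in pairs:
        if len(sa) <= maxlen and len(sb) <= maxlen:
            result.append((sa, sb))  # short enough, keep as is
            continue
        parts_a, parts_b = split_on_commas(sa), split_on_commas(sb)
        if not parts_a or not parts_b:
            continue  # one side is empty, nothing to align
        new_a, new_b = align(parts_a, parts_b)
        for xa, xb in zip(new_a, new_b):
            if len(xa) <= maxlen and len(xb) <= maxlen:
                result.append((xa, xb))
    return result
```

Passing the aligner in as a parameter keeps the sketch independent of any particular similarity function; in the setting of this post one would call it as refine_long_pairs(pairs, merge).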

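Why not limit the maximum length during the alignment itself? I tried that, and the results were mediocre. It depends on your situation.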

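I came up with this method on the spot a few days ago while rushing an experiment, and I have only just moved into NLP, so I am still a beginner. If you have a better method or ideas for improvement, you are welcome to discuss them; in fact, I would very much like to hear from you.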
