NLP Data Augmentation: Randomly Replacing Named Entities

This post is mainly based on another blogger's article (reference link).

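Take an input sentence such as 「我不是張加」 ("I am not Zhang Jia"). I work with annotated entities because I have written related posts before. The person name in the sentence is randomly replaced with entries from a person-name entity library (per.txt), which expands the corpus.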
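The replacement step itself is small enough to sketch on its own. The snippet below is a minimal illustration, not the script's exact code: it assumes the sentence has already been segmented into tokens, and `NAME_POOL` and `replace_names` are names made up for this example. Each token found in the name pool is swapped for a random pool entry with probability `change_rate`.

```python
import random

# Hypothetical name pool standing in for the name entity library (per.txt).
NAME_POOL = ["張加", "李明", "王芳"]


def replace_names(tokens, name_pool, change_rate, rng):
    """Join tokens into a sentence, swapping each known name with probability change_rate."""
    out = []
    for tok in tokens:
        if tok in name_pool and rng.random() < change_rate:
            out.append(rng.choice(name_pool))  # may pick the original name again
        else:
            out.append(tok)
    return "".join(out)


rng = random.Random(1)  # fixed seed so runs are repeatable
tokens = ["我", "不是", "張加"]  # assumed output of a word segmenter
variants = {replace_names(tokens, NAME_POOL, 0.5, rng) for _ in range(50)}
print(sorted(variants))
```

Because a draw can return the original name or a sentence that was already produced, collecting results in a set (or checking membership in a list) is what keeps the augmented corpus free of duplicates.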

The code is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import codecs
import os
import random

import jieba as t_jieba

root_path = os.path.abspath(os.path.dirname(__file__))
data_path = os.path.join(root_path, 'data')
# Both paths point to the same person-name file: it serves as the jieba
# user dictionary and as the pool of replacement names.
company_path = os.path.join(data_path, 'per.txt')
random_path = os.path.join(data_path, 'per.txt')


class basetool:
    def __init__(self, base_file: str, create_num: int = 5,
                 change_rate: float = 0.1, seed: int = 1):
        # Note: this seeds the module-level random generator.
        self.random = random
        self.random.seed(seed)
        self.base_file = base_file
        self.create_num = create_num
        self.change_rate = change_rate
        self.jieba = t_jieba
        self.set_userdict(company_path)
        self.loop_t = 2
        self.base_file_mapobj = self.load_paser_base_file()

    def set_userdict(self, txt_path: str):
        '''
        Load your own user dictionary.
        :param txt_path:
        :return:
        '''
        self.jieba.load_userdict(txt_path)

    def add_word(self, word: str):
        '''
        Add a word to the user dictionary for better segmentation.
        :param word:
        :return:
        '''
        self.jieba.add_word(word)

    def add_words(self, word_list: list):
        for w in word_list:
            self.add_word(w)

    def load_paser_base_file(self):
        return None

    def replace(self, replace_str):
        return None


class randomword(basetool):
    '''
    Random word replacement (word level) for data augmentation.
    base_file: file containing a set of words of the same type
    '''

    def __init__(self, base_file=random_path, create_num=5,
                 change_rate=0.05, seed=1):
        super(randomword, self).__init__(base_file, create_num, change_rate, seed)

    def load_paser_base_file(self):
        # One entity per line.
        company_a = []
        with open(self.base_file, "r", encoding='utf-8') as f:
            for line in f:
                company_a.append(line.replace('\n', ''))
        print('load :%s done' % (self.base_file))
        return company_a

    def replace(self, replace_str: str):
        replace_str = replace_str.replace('\n', '').strip()
        seg_list = self.jieba.cut(replace_str, cut_all=False)
        words = list(seg_list)
        # The original sentence is always the first result.
        sentences = [replace_str]
        if len(words) <= 3:
            return sentences
        t = 0
        while len(sentences) < self.create_num:
            t += 1
            a_sentence = ''
            for word in words:
                a_sentence += self.s1(word)
            if a_sentence not in sentences:
                sentences.append(a_sentence)
            # Give up after a bounded number of attempts.
            if t > self.create_num * self.loop_t / self.change_rate:
                break
        return sentences

    def s1(self, word: str):
        # Replace any word found in the entity list (combine_dict).
        if len(word) == 1:
            return word
        if word in self.base_file_mapobj and self.random.random() < self.change_rate:
            wi = self.random.randint(0, len(self.base_file_mapobj) - 1)
            place = self.base_file_mapobj[wi]
            return place
        else:
            return word


def test(test_str, create_num=150, change_rate=0.3):
    smw = randomword(create_num=create_num, change_rate=change_rate)
    return smw.replace(test_str)


if __name__ == '__main__':
    # A person name such as 【程晉培】 is a name entity; it is randomly
    # replaced with names from per.txt.
    ts = '''我叫張加，積極參與相關活動'''
    print('Example sentence:', ts)
    rs = test(ts)
    print('---------replacement start--------')
    with codecs.open('name_ner_test.txt', 'w+', 'utf-8') as output_data:
        for s in rs:
            output_data.write(s + '\n')
            print(s)
    print('--------replacement end--------')
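The script writes plain sentences only. If the augmented sentences are meant to train an NER model, the tags have to follow the replaced name, whose length may differ from the original. The sketch below shows one possible way to do that under assumptions the post does not state: character-level BIO tags with a `PER` label, and a hypothetical helper named `swap_entity`.

```python
def swap_entity(chars, tags, new_name, label="PER"):
    """Replace the first `label` entity in a character-level BIO sequence.

    chars and tags are parallel lists; new (chars, tags) lists are returned.
    If no entity with that label is found, copies of the inputs are returned.
    """
    if not new_name:
        raise ValueError("new_name must not be empty")
    begin = "B-" + label
    inside = "I-" + label
    try:
        start = tags.index(begin)
    except ValueError:
        return list(chars), list(tags)
    # Extend the span over the following I- tags of the same label.
    end = start + 1
    while end < len(tags) and tags[end] == inside:
        end += 1
    new_chars = list(chars[:start]) + list(new_name) + list(chars[end:])
    new_tags = (list(tags[:start]) + [begin]
                + [inside] * (len(new_name) - 1) + list(tags[end:]))
    return new_chars, new_tags


chars = list("我叫張加")
tags = ["O", "O", "B-PER", "I-PER"]
new_chars, new_tags = swap_entity(chars, tags, "歐陽明日")
for c, t in zip(new_chars, new_tags):
    print(c, t)
```

Rebuilding the tag span from the new name's length keeps characters and tags aligned even when a two-character name is replaced by a longer one.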
