Text Summarization Project 1: Data Preprocessing and Word Vector Training

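This first post is fairly simple. It covers setting up the environment, what was actually done, and a few tricks picked up along the way. There are two main parts: text preprocessing and text representation. Text preprocessing includes data cleaning, word (and sentence) segmentation, filtering, and stop-word removal; for text representation, word vectors are trained with gensim. The goal here is a simple first implementation; later posts will go deeper and optimize it, for example by handling OOV (out-of-vocabulary) words.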

Text Preprocessing

1. Processing the raw data

Processing the raw data includes removing duplicates and NaN values; pandas is used throughout. For example:

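import pandas as pd

# 0. Read the data
dataframe = pd.read_csv(csv_file_path)
# 1. Handle missing values and duplicates
# Drop rows that have no 'report' value
dataframe.dropna(subset=['report'], inplace=True)
# Fill the remaining NaN cells with empty strings
dataframe.fillna('', inplace=True)
dataframe.drop_duplicates(keep='first', inplace=True)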

2. Preprocessing each record

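Preprocessing each record uses a small trick: processing the DataFrame in parallel. Because the original DataFrame is large, it is split into several smaller chunks (one per CPU core), each chunk is preprocessed in parallel, and the per-core results are concatenated to give the final result.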

# 2. Sentence processing
dataframe = multi_process_csv(dataframe, func=sentences_proc)

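from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd

cpu_cores = cpu_count()

def multi_process_csv(dataframe, func):
    # Split the data into one chunk per core
    data_split = np.array_split(dataframe, cpu_cores)
    # Process the chunks in parallel, then concatenate the results;
    # the with-block shuts the pool down on exit
    with Pool(processes=cpu_cores) as pool:
        dataframe = pd.concat(pool.map(func, data_split))
    return dataframe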
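
The split, map, and concatenate pattern above can be sketched without pandas. This is a minimal illustration on plain lists, not the project's code: split_evenly, chunked_apply, and upper_all are hypothetical names, and the chunk sizes follow the same rule as np.array_split (sizes differ by at most one, larger chunks first).

```python
from itertools import chain


def split_evenly(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def chunked_apply(items, func, n_chunks, map_fn=map):
    """Apply func to each chunk via map_fn, then concatenate in the original order."""
    chunks = split_evenly(items, n_chunks)
    return list(chain.from_iterable(map_fn(func, chunks)))


def upper_all(chunk):
    # Stand-in for a per-chunk preprocessing function
    return [s.upper() for s in chunk]


words = ["a", "b", "c", "d", "e"]
print(split_evenly(words, 3))
print(chunked_apply(words, upper_all, 3))
```

Passing the map method of a multiprocessing.Pool as map_fn would give a parallel version, since Pool.map returns results in input order; the built-in map is used here only to keep the sketch easy to run.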

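A neat point here is the data split: each chunk is still a DataFrame, so inside func the usual pandas column operations (such as apply) keep working, while the chunks themselves run on separate cores. Here func is the text preprocessing function. Next, consider what func does.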

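def sentences_proc(dataframe):
    col_list = ['brand', 'model', 'question', 'dialogue', 'report']
    for col in col_list:
        if col in dataframe.columns:
            # The body of this branch is missing in the source; applying
            # sentence_proc to every cell is assumed from context
            dataframe[col] = dataframe[col].apply(sentence_proc)
    return dataframe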

def sentence_proc(sentence):
    # Split the original dialogue into sentences
    sent_generator = sentence.split('|')
    # Process each sentence separately
    # 1. Remove non-Chinese characters
    sent_generator = (clean_sent(sent) for sent in sent_generator)
    # 2. Segment into words
    sent_generator = (seg_words(sent) for sent in sent_generator)
    # Join the processed sentences back together
    return ' '.join(sent_generator)
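
sentence_proc chains two generator expressions before joining. The sketch below, using hypothetical stand-in functions step_one and step_two, shows how such a chain behaves: building it does no work, and consuming it pulls one item at a time through every step.

```python
calls = []


def step_one(x):
    calls.append(("one", x))
    return x + 1


def step_two(x):
    calls.append(("two", x))
    return x * 10


data = [1, 2, 3]

# Building the chain does not call either function yet
stage_one = (step_one(x) for x in data)
stage_two = (step_two(x) for x in stage_one)
assert calls == []

# Consuming the chain pulls one item at a time through both steps
result = list(stage_two)
print(result)
print(calls[:2])
```

The recorded calls alternate between the two steps, which shows that no intermediate list of step_one results is ever built.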

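Note one detail here, the generator part: a list comprehension [ ] would be the obvious choice, but using ( ) yields a generator instead. A generator is evaluated lazily, so the chained steps avoid building an intermediate list each time, which saves memory and can save time. Two helper functions are applied to each sentence; examples follow: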

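import re

def clean_sent(sent):
    """
    :param sent: string
    :return: the string with all non-Chinese characters removed
    """
    # Keep only characters in the range U+4E00..U+9FA5
    sent = re.sub(r'[^\u4e00-\u9fa5]', '', sent)
    return sent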
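
To make the effect of that pattern concrete, here is a small sketch on a made-up sentence. It shows that the pattern also drops Latin letters, digits, and punctuation. CHINESE_ALNUM is a hypothetical variant, not part of the original project, for cases where letters and digits should survive.

```python
import re

# Same character class as clean_sent: drop everything outside U+4E00..U+9FA5
CHINESE_ONLY = re.compile(r'[^\u4e00-\u9fa5]')
# Hypothetical variant that also keeps ASCII letters and digits
CHINESE_ALNUM = re.compile(r'[^\u4e00-\u9fa5A-Za-z0-9]')

text = "奧迪A6，行駛了3萬公里！"
chinese_only = CHINESE_ONLY.sub('', text)
chinese_alnum = CHINESE_ALNUM.sub('', text)
print(chinese_only)
print(chinese_alnum)
```

Whether to keep letters and digits depends on the data; the strict pattern is the simpler starting point, at the cost of losing tokens such as model names.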

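import jieba

# remove_words and stop_words are collections defined elsewhere in the project
def seg_words(sent):
    # Segment into words
    word_generator = jieba.cut(sent)
    # Filter 1: drop empty tokens and words in remove_words
    word_generator = (word for word in word_generator if word and word not in remove_words)
    # Filter 2: drop stop words
    word_generator = (word for word in word_generator if word and word not in stop_words)
    return ' '.join(word_generator)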
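
The two filters in seg_words each test membership in a word collection. As a hedged sketch, assuming those collections could be plain lists, merging them into one set makes each lookup constant-time and needs only a single pass. filter_tokens and the sample words below are hypothetical; the project's real lists are not shown in the post.

```python
def filter_tokens(tokens, *blocklists):
    """Drop empty tokens and any token found in one of the blocklists."""
    blocked = set().union(*blocklists)
    return [tok for tok in tokens if tok and tok not in blocked]


# Hypothetical example data
remove_words = ["語音", "圖片"]
stop_words = ["的", "了"]

tokens = ["我", "的", "車", "", "壞", "了", "圖片"]
filtered = filter_tokens(tokens, remove_words, stop_words)
print(filtered)
```

The input here is already tokenized so the sketch runs without jieba; in seg_words the tokens would come from jieba.cut.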

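At this point data preprocessing is essentially complete. The processed text is saved and arranged in the format required for word vector training.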
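
gensim's LineSentence reads a text file with one sentence per line and tokens separated by whitespace. The sketch below writes processed sentences in that layout. save_corpus, the file name, and the sample sentences are illustrative assumptions, not the project's actual saving code.

```python
import os
import tempfile


def save_corpus(sentences, path):
    """Write one space-separated sentence per line, skipping empty ones."""
    with open(path, "w", encoding="utf-8") as f:
        for sent in sentences:
            sent = " ".join(sent.split())  # collapse repeated whitespace
            if sent:
                f.write(sent + "\n")


# Hypothetical processed sentences, already segmented into words
processed = ["車 無法 啟動", "", "更換  電瓶"]
path = os.path.join(tempfile.mkdtemp(), "merged_seg.txt")
save_corpus(processed, path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```

Skipping empty rows matters because cells filled with '' during cleaning would otherwise become blank lines in the corpus.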

Word Vector Training

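from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def train_word2vec(file_path=config.merged_seg_path):
    # Train the word vectors (sg=1 selects skip-gram)
    model = Word2Vec(
        LineSentence(source=file_path),
        vector_size=config.embedding_dim,
        sg=1,
        workers=cpu_cores,
        window=5,
        min_count=5,
        epochs=config.word2vec_train_epochs,
    )
    return model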

One point to note: LineSentence accepts two kinds of source argument, a file object or a str (the path to a file).
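
To show what accepting both kinds of source means in practice, here is a simplified stand-in written for illustration. SimpleLineSentence is a hypothetical class, not gensim's implementation. It yields one token list per line and can be iterated repeatedly, which matters because training makes more than one pass over the corpus.

```python
import io


class SimpleLineSentence:
    """Simplified stand-in for illustration; not gensim's implementation."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        if isinstance(self.source, str):
            # A str is treated as a path and reopened on every pass
            with open(self.source, encoding="utf-8") as f:
                for line in f:
                    yield line.split()
        else:
            # A file object is rewound so it can be read more than once
            self.source.seek(0)
            for line in self.source:
                yield line.split()


corpus = io.StringIO("車 無法 啟動\n更換 電瓶\n")
sentences = SimpleLineSentence(corpus)
first_pass = list(sentences)
second_pass = list(sentences)
print(first_pass)
print(first_pass == second_pass)
```

A file object passed this way has to support seek(0); a path avoids that requirement because the file is simply reopened.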
