Learning Python 102: Preprocessing Text Data with Word Segmentation

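In natural language processing you sometimes need to build your own corpus and train a model on it. This post takes data that has already been collected and cleans it up: it segments the text into words and removes noise characters. Segmentation is done with the jieba tokenizer, and a custom stop-word list is loaded (stop-word list = the Chinese Academy of Sciences list + my own additions).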
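The stop-word list described above is a combination of two sources. A minimal sketch of how such a combined file could be produced is shown here, assuming both sources are plain UTF-8 files with one word per line. The function name `merge_stopword_files` and the two input file names in the comment are hypothetical; only `data/stopword.txt` appears in the original code.

```python
from pathlib import Path


def merge_stopword_files(source_paths, out_path):
    """Merge several one-word-per-line stop-word files into a single file.

    Words keep their order of first appearance; blank lines and
    duplicates are dropped. Returns the merged list of words.
    """
    seen = set()
    merged = []
    for path in source_paths:
        for line in Path(path).read_text(encoding='utf-8').splitlines():
            word = line.strip()
            if word and word not in seen:
                seen.add(word)
                merged.append(word)
    Path(out_path).write_text('\n'.join(merged) + '\n', encoding='utf-8')
    return merged


# Example call (hypothetical input file names, adjust to your own lists):
# merge_stopword_files(
#     ['data/cas_stopwords.txt', 'data/custom_stopwords.txt'],
#     'data/stopword.txt',
# )
```

Because every line is stripped, a stop word made only of whitespace would be dropped by this sketch; such tokens would need separate handling.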

If it's not to your taste, please go easy on me ^-^

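The data is stored in a txt file:

[Screenshot: contents of the input txt file]

After segmentation:

[Screenshot: the segmented output]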

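import jieba


def load_stopwords(path='data/stopword.txt'):
    # Build the stop-word collection once; a set gives fast membership tests
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f}


# 1. Read a file, segment each line, and write the result to another file
def readcutremovewrite(readfile_path, writefile_path):
    stopwords = load_stopwords()
    with open(readfile_path, 'r', encoding='utf-8') as inputs, \
            open(writefile_path, 'w', encoding='utf-8') as outputs:
        for line in inputs:
            line_seg = seg_sentence(line, stopwords)  # the return value is a string
            outputs.write(line_seg + '\n')


# 2. Segment a sentence and drop stop words
def seg_sentence(sentence, stopwords):
    sentence_seged = jieba.cut(sentence.strip())
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and word != '\t':
            outstr += word
            outstr += ' '
    return outstr


if __name__ == '__main__':
    readfile_path = r'f:\data\test1.txt'
    # Assumed output path: the original post never defined writefile_path
    writefile_path = r'f:\data\test1_seg.txt'
    # Utility function: read, segment, write
    readcutremovewrite(readfile_path, writefile_path)
    print('Data preprocessing finished')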
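The code above filters out stop words and tab characters, while the introduction also mentions removing noise characters. One possible way to do that before segmentation is a regular-expression pass, sketched below. This is an assumption about what counts as noise, not the post's own method: the helper `strip_noise` is hypothetical, and it keeps only CJK ideographs in the range U+4E00 to U+9FFF, ASCII letters, and digits.

```python
import re

# Anything that is not a common CJK ideograph, an ASCII letter, or a digit
# is treated as noise (an assumption; widen the class if you need more).
_NOISE = re.compile(r'[^\u4e00-\u9fffA-Za-z0-9]+')


def strip_noise(text):
    """Replace each run of noise characters with one space and trim the ends."""
    return _NOISE.sub(' ', text).strip()


print(strip_noise('今天天氣不錯！！@#$ hello, 2024'))
# prints: 今天天氣不錯 hello 2024
```

Calling `strip_noise(line)` on each line before passing it to the segmenter would remove punctuation and symbols up front, so the stop-word list would not have to enumerate them.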
