TF-IDF Text Vectorization and Naive Bayes Text Classification

Word segmentation and stop-word removal with jieba:

import jieba

# Build the stop-word list
def stopwordslist():
    # Pass encoding=... here if the file is not in the system default encoding
    with open('c:/users/yin/desktop/chinesestopwords.txt', 'r') as f:
        return [line.strip() for line in f]

# Load the stop words once, as a set for fast lookups
stopwords = set(stopwordslist())

# Segment a sentence into words and remove the stop words
def seg_depart(sentence):
    # Segment the line into words
    sentence_depart = jieba.cut(sentence.strip())
    # Collect the result in outstr
    outstr = ''
    # Remove stop words
    for word in sentence_depart:
        if word not in stopwords:
            if word != '\t':
                outstr += word
                outstr += " "
    # Convert the string to bytes (disabled)
    # outstr = outstr.encode()
    return outstr

# File paths
filename = "c:/users/yin/desktop/data16.txt"  # input file path
outfilename = "c:/users/yin/desktop/data17.txt"  # output file path
inputs = open(filename, 'r', encoding='utf-8')  # mind the file encoding
outputs = open(outfilename, 'w+', encoding='utf-8')
# Write the results to the output file
for line in inputs:
    line_seg = seg_depart(line)
    # Write the segmented, stop-word-free line, followed by a newline
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()
print("Segmentation and stop-word removal succeeded!")
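The filtering step can be looked at on its own, without jieba. The following is a minimal sketch, assuming the tokens have already been produced by a segmenter; the helper name `remove_stopwords`, the tiny stop-word set, and the token list are all made up for illustration.

```python
def remove_stopwords(tokens, stopwords):
    """Return the tokens that are neither stop words nor blank, joined by spaces."""
    kept = [tok for tok in tokens if tok.strip() and tok not in stopwords]
    return " ".join(kept)


# Hypothetical stop-word set; a real one would be loaded from a file.
stopwords = {"的", "了", "是"}
# Tokens as a segmenter might produce them (written by hand here).
tokens = ["今天", "的", "天氣", "是", "晴天", "\t"]
print(remove_stopwords(tokens, stopwords))
```

Keeping the stop words in a set makes each membership test constant time on average, which matters when the stop-word file has thousands of entries and every token of every line is checked.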

Once segmentation and stop-word removal are done, extract the features:

import jieba.analyse

# Note: the output path is data16.txt, so this overwrites the input file of the previous step
with open('c:/users/yin/desktop/data17.txt', 'r', encoding='utf-8') as fr, \
        open('c:/users/yin/desktop/data16.txt', 'w', encoding='utf-8') as fd:
    for text in fr:
        if text.strip():  # skip blank lines; not needed if the text has none
            # Top 10 keywords by TF-IDF weight; each item is a plain string
            keywords = jieba.analyse.extract_tags(text, topK=10)
            for item in keywords:
                fd.write(item)
                fd.write(' ')
            fd.write('\n')
print('Output written....')
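`jieba.analyse.extract_tags` ranks the words of a text by TF-IDF weight and returns the highest-scoring ones. To make that ranking concrete, here is a minimal pure-Python sketch. It assumes the documents are already segmented and computes the IDF values from the documents themselves, which is a simplification: jieba uses its own prebuilt IDF table instead. The function name `top_k_keywords` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def top_k_keywords(docs, k=10):
    """Return the k highest-scoring TF-IDF terms for each pre-segmented document.

    docs is a list of token lists. IDF is computed from docs themselves,
    which is an assumption of this sketch, not how jieba obtains its IDF values.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by score (descending), then alphabetically for a stable order.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


# Toy corpus, already segmented into tokens.
docs = [
    ["apple", "banana", "apple", "fruit"],
    ["banana", "fruit", "market"],
    ["car", "engine", "market"],
]
print(top_k_keywords(docs, k=2))
```

A term scores high when it is frequent in one document but rare across the corpus, which is why "apple" outranks "banana" in the first toy document.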

The result looks roughly as shown in the figure (text feature words and labels):

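Then load the two files above into the code below to perform the vectorization and train the classifier (the data is randomly split into training and test sets at a 7:3 ratio):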

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report  # evaluation of the results
from sklearn.model_selection import train_test_split  # split the data into train + test

with open("c:/users/yin/desktop/data16.txt", "r", encoding="utf-8-sig") as corpus:
    cv = TfidfVectorizer(binary=False, decode_error='ignore', stop_words='english')
    vec = cv.fit_transform(corpus.readlines())
arr = vec.toarray()  # feature matrix of the texts
# print(arr)

# Mapping from each label string to an integer class id.
# The entries are not given in the source; add one entry per label before running.
dicts = {}
with open("c:/users/yin/desktop/data14.txt", "r", encoding="utf-8-sig") as corpus_tags:
    a = np.array([dicts[x.strip()] for x in corpus_tags])  # label vector
# Split the feature matrix and the label vector into training and test sets
x_train, x_test, y_train, y_test = train_test_split(arr, a, test_size=0.3, random_state=0)

def test_gaussian_nb():
    gnb = GaussianNB()
    gnb.fit(x_train, y_train)
    result = gnb.predict(x_test)
    print(classification_report(y_test, result))

if __name__ == '__main__':
    test_gaussian_nb()
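The code above turns each label string into an integer through the `dicts` mapping and then splits the samples 7:3. The following is a minimal sketch of both ideas in plain Python, assuming one label per document. The helper names `build_label_map` and `split_indices` and the sample labels are hypothetical, and the shuffle here will generally not select the same samples as scikit-learn's `train_test_split`.

```python
import random


def build_label_map(labels):
    """Assign an integer id to each distinct label, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the sample indices and split them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Hypothetical labels, one per document.
labels = ["sports", "finance", "sports", "tech", "finance",
          "tech", "sports", "finance", "tech", "sports"]
label_map = build_label_map(labels)
y = [label_map[label] for label in labels]
train_idx, test_idx = split_indices(len(labels))
print(label_map)
print(len(train_idx), len(test_idx))
```

Deriving the mapping from the label file avoids a `KeyError` when a label appears in the data but not in a hand-written dictionary. A hand-written mapping may still be preferable when the class ids need to stay fixed across runs.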

This gives a simple classification evaluation report:
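`classification_report` lists precision, recall, F1 score, and support for each class. As a rough illustration of how those per-class numbers are derived, here is a small pure-Python sketch. The function name `per_class_metrics` and the label values are made up, and the sketch does not reproduce the averaged rows of the scikit-learn report.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and support for every class in y_true."""
    report = {}
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
        fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
        fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[cls] = {"precision": precision, "recall": recall,
                       "f1": f1, "support": tp + fn}
    return report


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
for cls, metrics in per_class_metrics(y_true, y_pred).items():
    print(cls, metrics)
```

Precision answers "of the samples predicted as this class, how many were right", while recall answers "of the samples that truly belong to this class, how many were found". F1 is their harmonic mean.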

That covers the basic process. Comments and discussion are welcome. Thanks!
