Python text clustering: clustering a column of text

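1. Requirement

I have a txt file, as shown in the figure below. It holds tens of thousands of lines of data, none of them categorized. Sorting them by hand would take a lot of time and effort, so I plan to use text clustering to classify each line of the txt file automatically.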

2. References

Main reference

3. Code

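# -*- coding: utf-8 -*-
import jieba
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# Open the file and segment each line with jieba
with open("c:/users/administrator/desktop/test/case.txt", "r", encoding='utf-8', errors='ignore') as f1, \
     open("c:/users/administrator/desktop/test/case_fenci.txt", 'w', encoding='utf-8', errors='ignore') as f2:
    for line in f1:
        seg_list = jieba.cut(line, cut_all=False)
        # Join the tokens with spaces so the vectorizer can split on whitespace
        f2.write((" ".join(seg_list)).replace("\t\t\t", "\t"))

# Read the segmented text back, one document per line
with open("c:/users/administrator/desktop/test/case_fenci.txt", encoding='utf-8', errors='ignore') as f:
    titles = f.read().splitlines()

# Load the custom stop words
def get_custom_stopwords(stop_words_file):
    with open(stop_words_file, encoding='utf-8') as f:
        stopwords = f.read()
    stopwords_list = stopwords.split('\n')
    # Skip blank lines so empty strings do not end up in the stop-word list
    custom_stopwords_list = [i for i in stopwords_list if i.strip()]
    return custom_stopwords_list

stop_words_file = "c:/users/administrator/desktop/test/stopwords.txt"
stopwords = get_custom_stopwords(stop_words_file)

# Clustering: build a term-count matrix from the segmented lines.
# Note: CountVectorizer's default token_pattern keeps only tokens of two or
# more characters, so single-character words are dropped.
count_vec = CountVectorizer(stop_words=stopwords)
km_matrix = count_vec.fit_transform(titles)

# Cluster with k-means. Assumption: k-means is used here (the name km_matrix
# suggests it), and the number of clusters is not given in the post; 10 is a
# placeholder to adjust for your data.
num_clusters = 10
km = KMeans(n_clusters=num_clusters)
km.fit(km_matrix)
clusters = km.labels_.tolist()

# Save the cluster label of each line
with open("c:/users/administrator/desktop/test/title_clusters.txt", 'w', encoding='utf-8', errors='ignore') as f3:
    for i in clusters:
        f3.write(str(i))
        f3.write("\n")
print(km_matrix.shape)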
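
The script writes only the cluster labels, one per line, to title_clusters.txt. To review the result it helps to put each original line next to its label. Here is a minimal sketch, assuming the label file has as many lines as case.txt and in the same order. The function name `group_by_cluster` and the toy data are hypothetical and do not come from the original script:

```python
from collections import defaultdict


def group_by_cluster(lines, labels):
    """Return {label: [lines...]}, pairing each input line with its cluster label."""
    if len(lines) != len(labels):
        raise ValueError("lines and labels must have the same length")
    groups = defaultdict(list)
    for text, label in zip(lines, labels):
        groups[label].append(text)
    return dict(groups)


# Toy data standing in for the lines of case.txt and title_clusters.txt
sample_lines = ["登入失敗", "無法登入", "付款逾時"]
sample_labels = [0, 0, 1]

grouped = group_by_cluster(sample_lines, sample_labels)
for label in sorted(grouped):
    print(label, grouped[label])
```

With real data, `lines` would come from reading case.txt and `labels` from reading title_clusters.txt and converting each line to `int`.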

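4. Notes

1) The text is Chinese, so save the txt files in UTF-8 encoding; otherwise the content may come out garbled.

2) You have to create the stop-word file (stopwords.txt) yourself. It lists the words that should be left out of the segmented text, one per line, chosen to suit your needs.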
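
Both notes can also be handled defensively in code. The sketch below is an illustration under stated assumptions, not part of the original script. `read_text` tries UTF-8 first and falls back to GBK; the fallback list is an assumption, based on GBK being a common encoding for Chinese text files on Windows. `load_stopwords` reads one stop word per line and drops blank lines and duplicates. Both function names are hypothetical.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8-sig", "gbk")):
    """Decode a file with the first encoding that works (candidate list is an assumption)."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path} with any of {encodings}")


def load_stopwords(path):
    """Read one stop word per line, skipping blanks and duplicates, keeping first-seen order."""
    seen = set()
    words = []
    for line in read_text(path).splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


# Demo: a temporary GBK-encoded file standing in for stopwords.txt
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopwords.txt")
    with open(demo_path, "w", encoding="gbk") as f:
        f.write("的\n了\n\n的\n")
    demo_words = load_stopwords(demo_path)

print(demo_words)
```

Unlike opening the file with `errors='ignore'`, this approach raises an error instead of silently dropping bytes when a file is in an unexpected encoding.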
