Keyword extraction

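import jieba.analyse

index = 2400

# Print the original article text (df_news and content_s are assumed to be defined earlier)
print(df_news['content'][index])

# str.join(sequence) joins the items of sequence, using str as the separator
content_s_str = ''.join(content_s[index])

# Top 5 keywords by TF-IDF weight; withWeight=False returns the words without their weights
print(' '.join(jieba.analyse.extract_tags(content_s_str, topK=5, withWeight=False)))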
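jieba's extract_tags ranks words by TF-IDF. To make that ranking concrete, here is a minimal standard-library sketch of the idea. It is not jieba's implementation: jieba uses a prebuilt IDF table, whereas this sketch derives IDF from a small toy corpus. The names tfidf_top_k and toy_docs are hypothetical and do not come from the original code.

```python
import math
from collections import Counter

def tfidf_top_k(docs, index, top_k=5):
    """Rank the tokens of docs[index] by TF-IDF and return the top_k tokens."""
    n_docs = len(docs)
    # Document frequency: the number of documents each token appears in.
    df = Counter(token for doc in docs for token in set(doc))
    tf = Counter(docs[index])
    total = len(docs[index])
    # score = (term frequency in this document) * log(N / document frequency)
    scores = {
        token: (count / total) * math.log(n_docs / df[token])
        for token, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k]

# Hypothetical, already-tokenized toy corpus (assumption, not the news data).
toy_docs = [
    ["basketball", "team", "wins", "game", "basketball"],
    ["stock", "market", "falls", "game"],
    ["team", "signs", "new", "player"],
]
print(" ".join(tfidf_top_k(toy_docs, 0, top_k=2)))
```

Words that are frequent in one document but rare across the corpus score highest, which is why common words shared by many articles tend to drop out of the keyword list.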

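# LDA topic model
from gensim import corpora, models, similarities
import gensim

# Build the mapping used for the bag-of-words step; the input is a list of lists of tokens.
# It maps each word to an integer id and can be thought of as a Python dict
# (see dictionary.token2id): the key is a word in the vocabulary,
# the value is that word's unique numeric id.
dictionary = corpora.Dictionary(contents_clean)
# Convert each document to bag-of-words (BoW) format: a list of (token_id, token_count) tuples.
# doc2bow(document, allow_update=False, return_missing=False)
# The input is a list of str (one tokenized document).
corpus = [dictionary.doc2bow(sentence) for sentence in contents_clean]
# As with k in k-means, the number of topics is chosen by hand
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)
# Topic with id 1 (ids start at 0): show its 5 highest-weighted words
print(lda.print_topic(1, topn=5))
# The 5 highest-weighted words for each of the 20 topics
for topic in lda.print_topics(num_topics=20, num_words=5):
    print(topic[1])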
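The dictionary and doc2bow steps above can be sketched in plain Python to show what the bag-of-words format looks like. This is only an illustration under simplifying assumptions: the helper names build_token_ids and to_bow are hypothetical, and gensim may assign ids in a different order than this sketch does.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign each distinct token an integer id, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(token2id, doc):
    """Return sorted (token_id, token_count) pairs; unknown tokens are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hypothetical, already-tokenized toy documents (assumption, not the news data).
toy_docs = [["team", "wins", "game"], ["game", "market", "game"]]
toy_token2id = build_token_ids(toy_docs)
toy_corpus = [to_bow(toy_token2id, doc) for doc in toy_docs]
print(toy_token2id)
print(toy_corpus)
```

Each document becomes a sparse list of (id, count) pairs, so word order is discarded and only the counts reach the topic model.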
