Extracting keywords with Python

The part worth noting down is the set of member functions on NLTK's FreqDist:

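plot(n), plots the n most frequent items.

tabulate(n), takes a number n as its argument and prints the n most frequent items as a table.

most_common(n), takes a number n as its argument and returns a list of the n most frequent items as (item, count) pairs.

hapaxes(), returns a list of the low-frequency items, that is, the items that occur exactly once.

max(), returns the single most frequent item.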
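
To make these methods concrete, here is a minimal sketch on a toy token list. The tokens are hypothetical and only stand in for real segmented text. plot(n) is left out because it also needs matplotlib and opens a window.

```python
from nltk import FreqDist

# Hypothetical toy token list standing in for real segmented text.
tokens = ["python", "keyword", "python", "jieba", "keyword", "python", "nltk"]
fd = FreqDist(tokens)

top_two = fd.most_common(2)   # list of (item, count) pairs
once_only = fd.hapaxes()      # items that occur exactly once
most_frequent = fd.max()      # the single most frequent item
share = fd.freq("python")     # relative frequency: count / total number of tokens

print(top_two)
print(once_only)
print(most_frequent, share)
fd.tabulate(2)                # prints the top 2 items as a small text table
```

Note that freq() returns a proportion rather than a raw count, which is why the script below prints values between 0 and 1.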

# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import jieba
from nltk import FreqDist


def stop_words():
    # Read the stop-word file (one word per line) into a list
    stop_word_list = []
    with open('stopwords.txt', 'r', encoding='utf-8') as f:
        for word in f:
            stop_word_list.append(word.strip())
    return stop_word_list


# The page URL is left empty in the original post
r = requests.get('')
soup = BeautifulSoup(r.text, 'lxml')
# Get the main content
context = soup.find('article').get_text()
# Segment the Chinese text with jieba to get a list of strings
jieba.load_userdict('user_dict.txt')
word_list = jieba.lcut(context)
# Drop single-character words and stop words
stop_word_list = stop_words()
word_list = [w for w in word_list if (len(w) > 1 and (w not in stop_word_list))]
word_freq_list = FreqDist(word_list)
# Get the top 20 items by word frequency
word_commons = word_freq_list.most_common(20)
for word in word_commons:
    # Print each word with its relative frequency
    print(word[0], word_freq_list.freq(word[0]))

P.S. 1. If jieba's segmentation turns out to be inaccurate, you can use load_userdict to supply a custom dictionary, but the dictionary file must be UTF-8 encoded.
