Python3 Keyword Extraction from Article Titles

2021-08-31 16:10:15 · 3559 words · 4107 reads

For jieba word segmentation, see: the jieba GitHub repository

For sklearn, see: Text feature extraction — 4.2.3.4 TF-IDF term weighting

import os

import jieba

from sklearn.feature_extraction.text import TfidfVectorizer

jieba.load_userdict('userdicttest.txt')

stop_words = set((
    "基於", "面向", "研究", "系統", "設計", "綜述", "應用", "進展", "技術", "框架", "txt"
))

def getfilelist(path):
    filelist = []
    files = os.listdir(path)
    for f in files:
        if f[0] == '.':
            pass  # skip hidden files such as .DS_Store
        else:
            filelist.append(f)
    return filelist, path

def fenci(filename, path, segpath):
    # folder for the segmentation results
    if not os.path.exists(segpath):
        os.mkdir(segpath)
    seg_list = jieba.cut(filename)
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())  # drop any internal whitespace
        if len(seg.strip()) >= 2 and seg.lower() not in stop_words:
            result.append(seg)
    # join the segmented words with spaces and save locally
    f = open(segpath + "/" + filename + "-seg.txt", "w+")
    f.write(' '.join(result))
    f.close()
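The filtering inside fenci (strip whitespace, keep only tokens of length ≥ 2, drop stop words case-insensitively) can be exercised on its own. In the sketch below the token list is pre-split by hand to mimic what jieba.cut might emit for one of the titles in the run further down, so it needs no jieba installation; the token list itself is illustrative.

```python
# Stop words copied from the script above; the pre-split token list stands
# in for jieba.cut output so this sketch runs without jieba installed.
stop_words = {"基於", "面向", "研究", "系統", "設計", "綜述", "應用",
              "進展", "技術", "框架", "txt"}

tokens = ["基於", "hadoop", "的", "分布式", "並行增量", "爬蟲",
          "技術", "研究", ".", "txt"]

result = []
for seg in tokens:
    seg = ''.join(seg.split())          # drop any internal whitespace
    if len(seg.strip()) >= 2 and seg.lower() not in stop_words:
        result.append(seg)

print(' '.join(result))  # → hadoop 分布式 並行增量 爬蟲
```

Note how single-character tokens ("的", ".") and the title boilerplate words all fall away, leaving only the content-bearing terms.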

def tfidf(filelist, sfilepath, path, tfidfw):
    corpus = []
    for ff in filelist:
        fname = path + ff
        f = open(fname + "-seg.txt", 'r+')
        content = f.read()
        f.close()
        corpus.append(content)
    vectorizer = TfidfVectorizer()  # implements word vectorization and TF-IDF weighting
    tfidf = vectorizer.fit_transform(corpus)
    word = vectorizer.get_feature_names()  # get_feature_names_out() in sklearn >= 1.0
    weight = tfidf.toarray()
    if not os.path.exists(sfilepath):
        os.mkdir(sfilepath)
    for i in range(len(weight)):
        print('----------writing all the tf-idf in the ', i, 'file into ', sfilepath + '/', i, ".txt----------")
        f = open(sfilepath + "/" + str(i) + ".txt", 'w+')
        result = {}
        for j in range(len(word)):
            if weight[i][j] >= tfidfw:
                result[word[j]] = weight[i][j]
        resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
        for z in range(len(resultsort)):
            f.write(resultsort[z][0] + " " + str(resultsort[z][1]) + '\r\n')
            print(resultsort[z][0] + " " + str(resultsort[z][1]))
        f.close()
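The keyword ranking at the end of tfidf() is just a threshold filter followed by a descending sort on weight. A self-contained sketch of that step, with made-up weights (the dictionary and the threshold value tfidfw = 0.3 below are illustrative, not taken from the run that follows):

```python
# Hypothetical per-document weights; in the real script these come from
# TfidfVectorizer. tfidfw is the threshold passed as the last parameter.
weights_one_doc = {"hadoop": 0.5, "的": 0.1, "分布式": 0.5, "爬蟲": 0.4}
tfidfw = 0.3

result = {w: v for w, v in weights_one_doc.items() if v >= tfidfw}
resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
for word, weight in resultsort:
    print(word, weight)   # low-weight tokens such as "的" are dropped
```

Sorting the dict items by item[1] (the weight) with reverse=True is what puts the strongest keywords at the top of each per-file output.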

The TfidfVectorizer() class implements both word vectorization and the TF-IDF weight computation.

using jieba on 農業大資料研究與應用進展綜述.txt

using jieba on 基於hadoop的分布式並行增量爬蟲技術研究.txt

using jieba on 基於rpa的財務共享服務中心賬表核對流程優化.txt

using jieba on 基於大資料的特徵趨勢統計系統設計.txt

using jieba on 網路大資料平台異常風險監測系統設計.txt

using jieba on 面向資料中心的多源異構資料統一訪問框架.txt

----------writing all the tf-idf in the  0 file into  ./keywords/ 0 .txt----------

農業 0.773262366783

大資料 0.634086202434

----------writing all the tf-idf in the  1 file into  ./keywords/ 1 .txt----------

hadoop 0.5

分布式 0.5

並行增量 0.5

爬蟲 0.5

----------writing all the tf-idf in the  2 file into  ./keywords/ 2 .txt----------

rpa 0.408248290464

優化 0.408248290464

服務中心 0.408248290464

流程 0.408248290464

財務共享 0.408248290464

賬表核對 0.408248290464

----------writing all the tf-idf in the  3 file into  ./keywords/ 3 .txt----------

特徵 0.521823488025

統計 0.521823488025

趨勢 0.521823488025

大資料 0.427902724969

----------writing all the tf-idf in the  4 file into  ./keywords/ 4 .txt----------

大資料平台 0.4472135955

異常 0.4472135955

監測 0.4472135955

網路 0.4472135955

風險 0.4472135955

----------writing all the tf-idf in the  5 file into  ./keywords/ 5 .txt----------

多源異構資料 0.57735026919

資料中心 0.57735026919

統一訪問 0.57735026919
