Python Text Mining Practice (1): News Summarization

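1. Learn how to read a file's contents, split an article into sentences, and segment the text into words.

2. Learn how to vectorize text and remove stop words.

3. Learn how to compute document similarity with the cosine metric and extract a summary based on it.

4. Wrap the whole process in a function so that it is easy to call.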
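The sentence-splitting step in objective 1 can be sketched on its own. The example below is a minimal illustration that uses only the standard library: it splits on 。？！ and newlines and drops very short fragments. The helper name `split_sentences`, the sample text, and the `min_len` threshold are all made up for this example; word segmentation with jieba is left out so the block runs without third-party packages.

```python
import re


def split_sentences(text, min_len=2):
    """Split text on Chinese sentence-ending punctuation and newlines."""
    parts = re.split(r'[。？！\n]', text)
    # Drop empty strings and very short fragments left behind by the split.
    return [p.strip() for p in parts if len(p.strip()) >= min_len]


# Made-up sample text; the trailing one-character line is filtered out.
sample = "今天天氣很好。我們去公園散步！你要一起來嗎？\n好"
sentences = split_sentences(sample)
print(sentences)
```

Splitting leaves an empty string wherever two delimiters are adjacent (here `？` followed by a newline), which is why a length filter is useful before vectorizing.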

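import os  # locate the working directory
import re  # split the document into sentences

import jieba  # word segmentation
import numpy
from sklearn.feature_extraction.text import CountVectorizer  # text -> count vectors
from sklearn.metrics import pairwise_distances  # pairwise text distance


# NOTE: the function name was lost in the source; news_summary is a placeholder.
def news_summary(path, num_summary):
    '''
    Purpose: produce a text summary.

    Parameters:
    path: path of the document; it is appended directly to the current
          working directory, so it should start with a path separator
    num_summary: length of the summary (number of sentences)

    Returns: result, the summary
    '''
    # Load the text
    cwd = os.getcwd()
    contents = ''
    with open(cwd + path, 'r') as file:
        contents = file.read().strip()

    # Split into sentences; element 0 is the whole document
    subcorpus = [contents] + re.split('[。？！\n]', contents)

    # Load the stop words
    stop_words_path = cwd + '/stop_words.txt'
    stop_words = set()
    with open(stop_words_path, 'r', encoding='utf-8') as sw:
        for line in sw:
            stop_words.add(line.strip())

    # Word segmentation
    segments = []
    clean_subcorpus = []
    for content in subcorpus:
        segs = jieba.cut(content)  # yields the tokens (a generator, not a list)
        segment = ' '.join(segs)  # join the tokens into a single string
        if len(segment.strip()) >= 5:  # drop sentences whose joined string is shorter than 5
            # The two appends are reconstructed; the source lost this loop body.
            segments.append(segment)
            clean_subcorpus.append(content)

    # Text vectors
    # stop_words is the key parameter; newer scikit-learn releases expect a list, not a set
    vectorizer = CountVectorizer(stop_words=list(stop_words))
    textvector = vectorizer.fit_transform(segments)  # shape=(10, 89) on the author's sample

    # Text similarity
    distance_matrix = pairwise_distances(textvector, metric='cosine')  # smaller means more similar

    # Build the summary
    sort_index = numpy.argsort(distance_matrix[0])  # ascending order: closest to the whole document first
    num_summary = min(len(clean_subcorpus), num_summary + 1)
    summarys = []  # summary sentences
    sorts = []  # their indices in the document
    for i in range(1, num_summary):  # index 0 is the whole document itself
        # Loop body reconstructed; the source lost it.
        summarys.append(clean_subcorpus[sort_index[i]])
        sorts.append(sort_index[i])
    sorts_ix = numpy.argsort(sorts)
    # Reconstructed: restore the sentences to their order in the document.
    summarys = [summarys[ix] for ix in sorts_ix]
    result = '。'.join(summarys)
    return result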
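To make the ranking step easier to follow, here is a minimal, standard-library-only sketch of the same idea: build count vectors without stop words, measure the cosine distance from the whole document (row 0) to every sentence, take the closest sentences, and put them back in document order. It is not the function above; the helper `cosine_distance`, the pre-tokenized sentences, and the stop-word set are all made up for illustration, and the tokens stand in for what a segmenter such as jieba would produce.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical pre-tokenized sentences and stop words.
sentences = [
    ["股市", "今天", "大漲"],
    ["投資人", "看好", "股市", "後市"],
    ["天氣", "今天", "晴朗"],
]
stop_words = {"今天"}

# Row 0 is the whole document; the remaining rows are the sentences.
document = [tok for sent in sentences for tok in sent]
rows = [document] + sentences
vectors = [Counter(t for t in row if t not in stop_words) for row in rows]

# Distance from the whole document to every row, sorted ascending.
distances = [cosine_distance(vectors[0], v) for v in vectors]
order = sorted(range(len(rows)), key=lambda i: distances[i])

num_summary = 2
picked = sorted(order[1:num_summary + 1])  # skip row 0, then restore document order
summary = ["".join(rows[i]) for i in picked]
print(summary)
```

Because the whole document is at distance (almost exactly) zero from itself, it always sorts first, which is why the selection starts from position 1 rather than 0.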

Appendix: calling the function
