Basic gensim usage: text similarity analysis

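gensim is a tool that mines the semantic structure of documents by examining patterns of words (or higher-level structures such as whole sentences or documents).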

Three core concepts: corpus -> vector -> model

from gensim import corpora
import jieba

documents = ['工業網際網路平台的核心技術是什麼',
             '工業現場生產過程優化場景有哪些']

def word_cut(docs):
    # Segment each document into a list of tokens with jieba
    return [jieba.lcut(doc) for doc in docs]

texts = word_cut(documents)
# Assign a unique integer id to every distinct word that appears in the corpus
dictionary = corpora.Dictionary(texts)
dictionary.token2id

# doc2bow() counts the occurrences of each distinct word, converts each word to its integer id, and returns the result as a sparse vector
bow_corpus = [dictionary.doc2bow(text) for text in texts]
bow_corpus

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1)],
[(2, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1)]]

In each tuple, the first item is the id of the token in the dictionary, and the second item is the number of times that token occurs.
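To make the (id, count) format concrete, here is a minimal standard-library sketch of what a bag-of-words conversion does. The helper names `build_token2id` and `doc2bow_sketch` and the pre-tokenized English lists are hypothetical stand-ins for illustration; the order in which ids are assigned here is only illustrative and need not match the ids gensim assigns.

```python
from collections import Counter

def build_token2id(texts):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for tokens in texts:
        for tok in tokens:
            token2id.setdefault(tok, len(token2id))
    return token2id

def doc2bow_sketch(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical pre-tokenized documents (stand-ins for jieba output)
toy_texts = [["industrial", "internet", "platform", "core", "technology"],
             ["industrial", "site", "production", "process", "optimization"]]
toy_token2id = build_token2id(toy_texts)
toy_bow = [doc2bow_sketch(t, toy_token2id) for t in toy_texts]
print(toy_bow)
```

The token shared by both documents gets the same id in both vectors, which is what later makes the two vectors comparable.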

from gensim import models
# train the model
tfidf = models.TfidfModel(bow_corpus)
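As a rough illustration of what the TF-IDF step computes, the sketch below reweights a bag-of-words corpus using only the standard library. It assumes the weighting that gensim documents as its default (raw term count times log2(N / document frequency), followed by normalization to unit length); the function name `tfidf_sketch` and the toy corpus are hypothetical, and this is not gensim's actual implementation.

```python
import math

def tfidf_sketch(bow_corpus):
    """Weight each (id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)
    df = {}  # document frequency of each term id
    for doc in bow_corpus:
        for term_id, _ in doc:
            df[term_id] = df.get(term_id, 0) + 1
    weighted_corpus = []
    for doc in bow_corpus:
        weights = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in doc]
        # A term that occurs in every document gets weight 0 and is dropped
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_corpus.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return weighted_corpus

# Toy corpus: term 0 occurs in both documents, terms 1 and 2 in one each
toy_counts = [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
toy_tfidf = tfidf_sketch(toy_counts)
print(toy_tfidf)
```

In the toy corpus, term 0 appears in every document, so it carries no discriminating information and disappears from both weighted vectors.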

Word segmentation tools

1. Python: trying out six Chinese word segmentation modules: jieba, THULAC, SnowNLP, PyNLPIR, CoreNLP, pyltp

2. HanLP

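The sentences first need some preliminary processing. This post applies three steps to the text, in order: removing empty and duplicate lines, word segmentation, and stop-word filtering.

Raw data often contains empty or duplicated sentences; these add no value and hurt efficiency, so they must be filtered out. Chinese word segmentation is the process of having a computer automatically split Chinese text into words, that is, marking word boundaries in a Chinese sentence the way spaces do in English. To analyze a sentence, it must first be split into a sequence of words, and the analysis then proceeds word by word, so word segmentation is one of the most basic steps in Chinese natural language processing.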
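The cleaning step described above (dropping empty and duplicated sentences) can be sketched with the standard library alone. The function name `drop_empty_and_duplicates` and the sample lines are hypothetical; the sketch assumes that two sentences count as duplicates when they are identical after stripping surrounding whitespace.

```python
def drop_empty_and_duplicates(lines):
    """Strip whitespace, drop empty lines, and keep the first copy of each sentence."""
    seen = set()
    cleaned = []
    for line in lines:
        sentence = line.strip()
        if sentence and sentence not in seen:
            seen.add(sentence)
            cleaned.append(sentence)
    return cleaned

# Hypothetical raw input, as it might come from readlines()
raw_lines = ["工業網際網路平台的核心技術是什麼\n",
             "\n",
             "工業網際網路平台的核心技術是什麼\n",
             "   ",
             "工業現場生產過程優化場景有哪些\n"]
cleaned_lines = drop_empty_and_duplicates(raw_lines)
print(cleaned_lines)
```

Using a set for the membership check keeps the pass linear in the number of lines while preserving the original order.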

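Generating the token lists

1. First, stop-word filtering: return a stop-word list

You can use the "ICT Chinese part-of-speech tag set" from the Chinese Academy of Sciences together with the Harbin Institute of Technology (HIT) stop-word list.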

def stopwordslist(filepath):
    # Read the stop-word file, one stop word per line
    with open(filepath, 'r', encoding='utf8') as f:
        return [w.strip() for w in f]

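2. Part-of-speech tags to filter out after jieba segmentation [punctuation, conjunctions, particles, adverbs, prepositions, time morphemes, '的', numerals, locatives, pronouns]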

stop_flag = ['x', 'c', 'u','d', 'p', 't', 'uj', 'm', 'f', 'r']

Segment each text in the collection and return its token list

def seg_sentence(sentence, stop_words):
    # Segment the sentence and keep tokens that are neither stop words nor whitespace
    words = []
    for word in jieba.cut(sentence.strip()):
        if word not in stop_words and word.strip():
            words.append(word)
    return words

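from gensim import similarities

# 1. Turn the text collection into token lists
with open(tpath, 'r', encoding='utf8') as f:
    texts = [seg_sentence(seg, stop_words) for seg in f]
# I. Build the bag-of-words model
# 2. Build the dictionary from the document collection and get the number of dictionary features
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
# 3. Use the dictionary to convert the token lists into sparse vectors, i.e. the corpus
corpus = [dictionary.doc2bow(text) for text in texts]
# II. Build the TF-IDF model
# 4. Process the corpus with the TF-IDF model
tfidf = models.TfidfModel(corpus)
# III. Build a query text and map it into the vector space with the bag-of-words dictionary
# 5. Likewise, use the dictionary to convert the search keyword into a sparse vector
kw_vector = dictionary.doc2bow(seg_sentence(keyword, stop_words))
# 6. Build an index over the sparse vectors
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
# 7. Compute the similarities
sim = index[tfidf[kw_vector]]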
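The index lookup in step 7 returns one cosine similarity per indexed document. As a hedged illustration of that measure, the sketch below computes cosine similarity directly on sparse (id, weight) vectors; `sparse_cosine` and the toy vectors are hypothetical names and data, and the code does not reproduce how gensim stores or queries its index.

```python
import math

def sparse_cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Toy query and documents in (term id, weight) form
toy_query = [(0, 1.0), (2, 1.0)]
toy_docs = [[(0, 1.0), (2, 1.0)], [(1, 3.0)], [(0, 2.0)]]
toy_sims = [sparse_cosine(toy_query, d) for d in toy_docs]
print(toy_sims)
```

A document with the same terms as the query scores close to 1, a document with no shared terms scores 0, and a partial overlap lands in between.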

Full code:

import jieba
import jieba.posseg as pseg
from gensim import corpora, models, similarities


def stopwordslist(filepath):
    # Read the stop-word file, one stop word per line
    with open(filepath, 'r', encoding='utf8') as f:
        return [w.strip() for w in f]


def seg_sentence(sentence, stop_words):
    # stop_flag = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'm', 'f', 'r']  # this variant also filters numerals (m)
    stop_flag = ['x', 'c', 'u', 'd', 'p', 't', 'uj', 'f', 'r']
    sentence_seged = pseg.cut(sentence)
    outstr = []
    for word, flag in sentence_seged:
        # Keep tokens that are neither stop words nor tagged with a filtered part of speech
        if word not in stop_words and flag not in stop_flag:
            outstr.append(word)
    return outstr


if __name__ == '__main__':
    sppath = 'stopwords.txt'
    tpath = 'test.txt'
    stop_words = stopwordslist(sppath)
    keyword = '吃雞'
    # 1. Turn the text collection into token lists
    with open(tpath, 'r', encoding='utf8') as f:
        orig_txt = f.readlines()
    texts = [seg_sentence(seg, stop_words) for seg in orig_txt]
    # I. Build the bag-of-words model
    # 2. Build the dictionary from the document collection and get the number of dictionary features
    dictionary = corpora.Dictionary(texts)
    feature_cnt = len(dictionary.token2id)
    # 3. Use the dictionary to convert the token lists into sparse vectors, i.e. the corpus
    corpus = [dictionary.doc2bow(text) for text in texts]
    # II. Build the TF-IDF model
    # 4. Process the corpus with the TF-IDF model
    tfidf = models.TfidfModel(corpus)
    # III. Build a query text and map it into the vector space with the bag-of-words dictionary
    # 5. Likewise, use the dictionary to convert the search keyword into a sparse vector
    kw_vector = dictionary.doc2bow(seg_sentence(keyword, stop_words))
    # 6. Build an index over the sparse vectors
    index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=feature_cnt)
    # 7. Compute the similarities
    sim = index[tfidf[kw_vector]]
    result_list = []
    for i in range(len(sim)):
        print('Similarity between keyword and text%d: %.2f' % (i + 1, sim[i]))
        if sim[i] > 0.4:
            # Collect the original sentences whose similarity exceeds the threshold
            result_list.append(orig_txt[i])
    print('Original sentences:', result_list)
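When there are many candidate sentences, it can help to sort the matches instead of printing them in file order. The following is a small sketch under that assumption; `rank_matches`, its `threshold` parameter, and the toy scores are hypothetical and only mirror the 0.4 cutoff used in the script.

```python
def rank_matches(sim_scores, sentences, threshold=0.4):
    """Return (score, sentence) pairs above the threshold, highest score first."""
    pairs = [(float(s), text.strip()) for s, text in zip(sim_scores, sentences)]
    kept = [p for p in pairs if p[0] > threshold]
    return sorted(kept, key=lambda p: p[0], reverse=True)

# Toy scores standing in for the similarity array, with matching sentences
toy_scores = [0.12, 0.83, 0.41, 0.40]
toy_sentences = ["sentence a\n", "sentence b\n", "sentence c\n", "sentence d\n"]
ranked = rank_matches(toy_scores, toy_sentences)
for score, text in ranked:
    print('%.2f  %s' % (score, text))
```

Converting each score with `float()` lets the same helper accept either plain Python numbers or the NumPy values that a similarity query returns.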
