Building a Simple FAQ System

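Intelligent question answering systems are now widely used: customer service, reception robots, guide robots, and many other scenarios may rely on an FAQ system. FAQ stands for frequently asked questions, that is, the questions that come up most often in a given scenario.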

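Let's first look at the overall FAQ flow. The input question is preprocessed, for example by removing stop words and segmenting it into words. The preprocessed text is then vectorized; there are many ways to do this and no single required choice, with common options including term-frequency vectors, word2vec, and tf-idf. Once the text is vectorized, we compute its similarity to each stored question and output the answer of the question with the highest similarity.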
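
The flow above can be sketched end to end in a few lines. This is a minimal illustration, not the code used later in this article: it assumes whitespace splitting in place of real word segmentation, uses raw term counts as the vectors, and the names tokenize, cosine, best_match and the two sample entries are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Stand-in for real segmentation (e.g. jieba): lowercase and split on whitespace.
    return text.lower().split()


def cosine(counts_a, counts_b):
    # Cosine similarity between two sparse term-frequency vectors.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, questions):
    # Return (index, score) of the stored question most similar to the query.
    query_vec = Counter(tokenize(query))
    scores = [cosine(query_vec, Counter(tokenize(q))) for q in questions]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, scores[index]


# Hypothetical two-entry FAQ, for illustration only.
questions = ["how do i reset my password", "what are your opening hours"]
answers = ["Use the reset link on the login page.", "We are open 9am to 6pm."]

index, score = best_match("reset password", questions)
print(answers[index], round(score, 3))
```

Keeping the vectors as sparse counters avoids building a shared vocabulary up front, which is enough for a bank of a dozen questions.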

The overall processing flow is shown in the flowchart below:

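Now that the overall flow is clear, we can start building the question answering system. The first step is to create the question-and-answer bank. Here we build a question bank and an answer bank of about ten entries, kept in the same order so that each question lines up with its answer: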

Question bank:

Answer bank:
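
Because the lookup depends on line i of the question bank matching line i of the answer bank, it can help to check the alignment when loading. A minimal sketch, assuming one entry per line; the sample entries and the load_bank helper are hypothetical and are not the contents of the original banks:

```python
import io


def load_bank(handle):
    # One entry per line; strip the trailing newline and skip blank lines.
    return [line.strip() for line in handle if line.strip()]


# Hypothetical contents standing in for question.txt and answer.txt.
question_file = io.StringIO(
    "How do I reset my password?\nWhat are your opening hours?\n"
)
answer_file = io.StringIO(
    "Use the reset link on the login page.\nWe are open 9am to 6pm.\n"
)

questions = load_bank(question_file)
answers = load_bank(answer_file)

# The lookup relies on entry i of one bank answering entry i of the other.
if len(questions) != len(answers):
    raise ValueError("question bank and answer bank must have the same number of entries")

faq = list(zip(questions, answers))
print(faq[1])
```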

With the question-and-answer bank in place, we preprocess it, which here mainly means word segmentation. The code is shown below:

import jieba


def stopword_list():
    # Load stop words, one per line, from stopword.txt.
    with open('stopword.txt', encoding='utf-8') as f:
        return [line.strip() for line in f]


def seg_with_stop(sentence):
    # Segment a sentence with jieba, dropping stop words and tab characters.
    sentence_seg = jieba.cut(sentence.strip())
    stopwords = stopword_list()
    out_string = ''
    for word in sentence_seg:
        if word not in stopwords:
            if word != '\t':
                out_string += word
                out_string += " "
    return out_string


def segmentation(sentence):
    # Segment a sentence with jieba and join the tokens with spaces.
    sentence_seg = jieba.cut(sentence.strip())
    out_string = ''
    for word in sentence_seg:
        out_string += word
        out_string += " "
    return out_string


# Segment the question bank line by line.
# Note: this runs at module level, so it also runs when the module is imported.
with open('question.txt', 'r', encoding='gbk') as inputq, \
        open('questionseg.txt', 'w', encoding='gbk') as outputq:
    for line in inputq:
        line_seg = segmentation(line)
        outputq.write(line_seg + '\n')

# Segment the answer bank line by line.
with open('answer.txt', 'r', encoding='gbk') as inputa, \
        open('answerseg.txt', 'w', encoding='gbk') as outputa:
    for line in inputa:
        line_seg = segmentation(line)
        outputa.write(line_seg + '\n')
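
Note that the loops above call segmentation, so seg_with_stop is defined but not used. As a rough illustration of what stop-word removal does to an already segmented sentence, here is a small sketch. The stop words, the token list, and the remove_stopwords helper are hypothetical; a real run would take tokens from jieba and stop words from stopword.txt.

```python
def remove_stopwords(tokens, stopwords):
    # Keep tokens that are neither stop words nor pure whitespace.
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]


# Hypothetical stop words; a real list would be loaded from stopword.txt.
stopwords = {"的", "了", "吗", "？"}

# Tokens as a segmenter might return them for "退货的流程是什么？" (assumed segmentation).
tokens = ["退货", "的", "流程", "是", "什么", "？"]

filtered = remove_stopwords(tokens, stopwords)
print(" ".join(filtered))
```

Holding the stop words in a set keeps each membership check constant time, which matters once the list grows to thousands of entries.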

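We segmented the question bank line by line and wrote out the result. Next, we vectorize the input question (the query) and compute its similarity to each question vector in the question bank, using cosine similarity here, then output the answer corresponding to the most similar question. The flow is fairly simple. The code is shown below: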

import math

from sklearn.feature_extraction.text import CountVectorizer

from segmentation import segmentation

count_vec = CountVectorizer()


def count_cos_similarity(vec_1, vec_2):
    # Cosine similarity of two equal-length count vectors.
    if len(vec_1) != len(vec_2):
        return 0
    s = sum(vec_1[i] * vec_2[i] for i in range(len(vec_2)))
    den1 = math.sqrt(sum([pow(number, 2) for number in vec_1]))
    den2 = math.sqrt(sum([pow(number, 2) for number in vec_2]))
    if den1 == 0 or den2 == 0:
        # An all-zero vector has no terms in common with anything.
        return 0
    return s / (den1 * den2)


def cos_sim(sentence1, sentence2):
    sentences = [sentence1, sentence2]
    # Fit on both sentences at once so they share one vocabulary.
    vectors = count_vec.fit_transform(sentences).toarray()
    # print(vectors)  # the vectorized representation
    # print(count_vec.get_feature_names_out())  # the term behind each vector dimension
    return count_cos_similarity(vectors[0], vectors[1])


def get_answer(sentence1):
    sentence1 = segmentation(sentence1)
    score = []
    with open('questionseg.txt', 'r', encoding='gbk') as questions:
        for idx, sentence2 in enumerate(questions):
            # print('idx: {}, cos_sim: {}'.format(idx, cos_sim(sentence1, sentence2)))
            score.append(cos_sim(sentence1, sentence2))
    if len(set(score)) == 1:
        # Every question scored the same, so there is no clear best match.
        print('Sorry, no answer could be found for your question.')
    else:
        index = score.index(max(score))
        with open('answer.txt', 'r', encoding='gbk') as answers:
            print(answers.readlines()[index])


while True:
    sentence1 = input('Enter your question (type q to quit):\n')
    if sentence1 == 'q':
        break
    else:
        get_answer(sentence1)
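
One detail worth knowing when feeding segmented Chinese text to CountVectorizer: its documented default token_pattern is (?u)\b\w\w+\b, which keeps only tokens of two or more characters, so single-character words are dropped before counting. The sketch below applies that same regular expression with Python's re module to show the effect; the sample sentence and its segmentation are assumed for illustration.

```python
import re

# Default token_pattern documented for scikit-learn's CountVectorizer.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# A pattern that also keeps single-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"

# A segmented question, tokens joined with spaces (assumed segmentation).
segmented = "我 想 退货"

kept = re.findall(DEFAULT_TOKEN_PATTERN, segmented)
kept_all = re.findall(SINGLE_CHAR_PATTERN, segmented)

print(kept)      # only the two-character token survives
print(kept_all)  # all three tokens survive
```

If single-character words matter for matching, one option is to construct the vectorizer as CountVectorizer(token_pattern=r"(?u)\b\w+\b").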

Now let's see how well it works:

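With cosine similarity matching alone we already get decent results. That is all it takes to build a simple question answering system. I hope this gives you a first understanding of QA systems. If you spot any mistakes, please let me know. The full code is available on GitHub. Thank you.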
