Question Answering Systems in Practice (2): Building the Chatbot Xiaotian (小天) 1.0

A Chatty Introduction

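This article shows how to build a chatbot based on fuzzy retrieval and deep learning. An earlier article in this column already covered the basic workflow for building an FAQ customer-service bot, so I won't repeat it here. For details, see: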

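Customer-service bots and chatbots alike depend on text matching, so if you study text matching, quickly putting what you have learned to use in everyday life is probably about as satisfying as it gets. The chatbots covered in this column are all single-turn, retrieval-based bots. As for multi-turn dialogue, you can treat it as a sequence of single turns if you like; with a large enough corpus, even a single-turn retrieval bot can give the impression that it follows context. The hard part of multi-turn dialogue is using the session state to drive the next turn. In practice, it is mostly the bot steering the conversation, backed by an extensive set of rules; the bot is merely how those rules are expressed. Sometimes the "artificial intelligence" you see is really the accumulated work of countless humans. But I digress.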

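The chatbot described here will help you get familiar with a simple workflow for building a bot; the techniques are fairly basic. In keeping with Occam's razor, a single codebase implements both a chit-chat bot and a task-specific FAQ customer-service bot, so you get double the fun for minimal effort! Details follow: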

Project Overview

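The project consists of two parts: a recall model based on tf-idf retrieval and a fine-ranking (rerank) model based on a CNN. It combines the two into a recall + rerank customer-service chatbot. The system supports a chit-chat mode and an FAQ mode, which use the Xiaohuangji (小黃雞) chit-chat dataset and a vertical-domain FAQ dataset, respectively. This version of the chatbot is Xiaotian 1.0; a faster Xiaotian 2.0 will be uploaded later.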
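To make the recall + rerank idea concrete, here is a minimal sketch of how the two stages could be glued together. It is not the project's code: `recall`, `rerank`, `answer`, and `toy_score` are hypothetical names, token overlap stands in for tf-idf, and `toy_score` stands in for the CNN scorer.

```python
from typing import Callable, List, Tuple


def recall(query: str, questions: List[str], k: int) -> List[int]:
    """Stage 1: cheap lexical recall. Jaccard token overlap stands in for tf-idf."""
    q_tokens = set(query.lower().split())
    scored = []
    for idx, question in enumerate(questions):
        tokens = set(question.lower().split())
        union = q_tokens | tokens
        score = len(q_tokens & tokens) / len(union) if union else 0.0
        scored.append((score, idx))
    # Highest overlap first; ties keep corpus order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for score, idx in scored[:k] if score > 0]


def rerank(query: str, candidates: List[int], questions: List[str],
           score_fn: Callable[[str, str], float]) -> List[Tuple[int, float]]:
    """Stage 2: rescore only the recalled candidates with a more expensive model."""
    rescored = [(idx, score_fn(query, questions[idx])) for idx in candidates]
    return sorted(rescored, key=lambda pair: -pair[1])


def answer(query: str, questions: List[str], answers: List[str],
           score_fn: Callable[[str, str], float], k: int = 5,
           fallback: str = "Sorry, I don't know.") -> str:
    candidates = recall(query, questions, k)
    if not candidates:
        return fallback
    best_idx, _ = rerank(query, candidates, questions, score_fn)[0]
    return answers[best_idx]


def toy_score(query: str, candidate: str) -> float:
    """Placeholder for the CNN scorer: share of candidate tokens found in the query."""
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / len(c) if c else 0.0


questions = ["how do I reset my password", "how do I change my email", "what is your name"]
answers = ["Use the reset link.", "Go to settings.", "I am Xiaotian."]
reply = answer("how to reset password", questions, answers, toy_score)
```

Because the two stages only exchange candidate indices, either one can be swapped out without touching the other. Switching between chit-chat and FAQ mode would then amount to passing a different `questions`/`answers` pair.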

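The system currently has the following advantages:

1. The recall and rerank modules are independent of each other, which makes them easy to customize and maintain.

2. The system optimizes its ranking rules, which speeds up retrieval.

3. A simple inverted index is added to streamline the retrieval process.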
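As an illustration of the inverted index mentioned in the last point, the sketch below maps each token to the ids of the questions that contain it, so that only questions sharing at least one token with the query need to be scored. The function names and the pre-tokenized toy corpus are assumptions for illustration; the project's own index may be organized differently.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(tokenized_questions: List[List[str]]) -> Dict[str, Set[int]]:
    """Map every token to the set of question ids in which it occurs."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for qid, tokens in enumerate(tokenized_questions):
        for token in tokens:
            index[token].add(qid)
    return index


def candidate_ids(index: Dict[str, Set[int]], query_tokens: List[str]) -> List[int]:
    """Return ids of questions that share at least one token with the query."""
    hits: Set[int] = set()
    for token in query_tokens:
        # .get avoids creating empty entries for unseen tokens.
        hits |= index.get(token, set())
    return sorted(hits)


corpus = [["reset", "password"], ["change", "email"], ["reset", "email"]]
index = build_inverted_index(corpus)
```

With a large corpus, similarity is then computed only for the ids returned by `candidate_ids` rather than for every stored question.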

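Based on feedback so far, the hard part of the system is building a rerank model that is both accurate and fast. The CNN model used in this project needs to be tuned on your own corpus. A CNN is the simplest model and a fairly effective one; when I have time, I will upload some other, better-performing models.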

Project Structure and Code

The basic structure of the project is as follows:

stopwordList and userdict folders:

The word2vec folder contains the Chinese word vectors:

Recall folder:

recall_model.py

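This is the main question-answering module of the fuzzy-matching model. It needs no training, but you must supply a valid corpus. It loads a different corpus depending on the task and can be tested on its own.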

# You can test this module on its own with the code below
import time  # used for the timing below, if not already imported at the top of the file

if __name__ == '__main__':
    # Load the user dictionary
    seg = Seg()
    seg.load_userdict('./userdict/userdict.txt')
    # Read the corpus
    List_kw, questionList, answerList = read_corpus1()
    # Initialize the model
    ss = SentenceSimilarity(seg)
    ss.set_sentences(questionList)
    ss.TfidfModel()         # tf-idf model
    # ss.LsiModel()         # LSI model
    # ss.LdaModel()         # LDA model

    while True:
        question = input("Enter a question (q to quit): ")
        if question == 'q':
            break
        time1 = time.time()
        # Indices and scores of the 5 most similar stored questions
        question_k = ss.similarity_k(question, 5)
        print("Here is the answer we found for you: {}".format(answerList[question_k[0][0]]))
        for idx, score in zip(*question_k):
            print("similar question: {},                score: {}".format(questionList[idx], score))
        time2 = time.time()
        cost = time2 - time1
        print('Time cost: {} s'.format(cost))
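For readers who want to see what a tf-idf top-k lookup does under the hood, here is a self-contained sketch in plain Python. It is not the project's `SentenceSimilarity` class: the function names, the smoothed idf formula, and the pre-tokenized toy corpus are assumptions. The only thing borrowed from the listing above is the return shape, a pair of (indices, scores), inferred from how `question_k` is used there.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def to_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Turn tokens into an L2-normalized sparse tf-idf vector; unknown tokens are dropped."""
    counts = Counter(tok for tok in tokens if tok in idf)
    vec = {tok: cnt * idf[tok] for tok, cnt in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def build_tfidf(docs: List[List[str]]) -> Tuple[Dict[str, float], List[Dict[str, float]]]:
    """Compute smoothed idf weights and one tf-idf vector per stored question."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + cnt)) + 1.0 for tok, cnt in df.items()}
    return idf, [to_vector(doc, idf) for doc in docs]


def top_k_similar(query_tokens: List[str], idf: Dict[str, float],
                  vectors: List[Dict[str, float]], k: int) -> Tuple[List[int], List[float]]:
    """Return (indices, scores) of the k stored questions closest to the query by cosine."""
    q = to_vector(query_tokens, idf)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scores = [sum(w * doc.get(tok, 0.0) for tok, w in q.items()) for doc in vectors]
    top = sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
    return top, [scores[i] for i in top]


docs = [["reset", "password"], ["change", "email"], ["reset", "email", "address"]]
idf, vectors = build_tfidf(docs)
indices, scores = top_k_similar(["reset", "password"], idf, vectors, 2)
```

In the real system the tokens would come from the segmenter (`Seg`) rather than from hand-written lists.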

Rerank folder:

How to use rerank:

Step 1: qacnn.py

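First train the deep learning model to obtain the checkpoint and related files. This script supports both training and testing; this step ensures that you end up with an effective rerank model.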

def main():

    embedding = load_embedding(embeding, embeding_size, vocab_file)
    preprocess_data1 = preprocess(train_file)
    preprocess_data2 = preprocess(test_file)

    train_data = read_train(preprocess_data1, stopword_file, vocab_file)
    test_data = read_train(preprocess_data2, stopword_file, vocab_file)
    train_corpus = load_train_data(train_data, max_q_length, max_a_length)
    test_corpus = load_train_data(test_data, max_q_length, max_a_length)

    config = NNConfig(embedding)
    config.ques_length = max_q_length
    config.ans_length = max_a_length
    # config.embeddings = embedding
    train(deepcopy(train_corpus), test_corpus, config)


if __name__ == '__main__':
    save_path = "./model/checkpoint"
    best_path = "./model/bestval"
    train_file = '../data/corpus1/raw/train.txt'
    test_file = '../data/corpus1/raw/test.txt'
    stopword_file = '../stopwordList/stopword.txt'
    embeding = '../word2vec/70000-small.txt'
    vocab_file = '../data/corpus1/project-data/word_vocab.txt'
    max_q_length = 15
    max_a_length = 15
    embeding_size = 200
    main()
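The settings `max_q_length = 15` and `max_a_length = 15` imply that every question and answer is brought to a fixed length before it reaches the CNN. The sketch below shows one common way to do that, with truncation, padding, and a mask. It is an assumption about what helpers such as `load_train_data` do, not a copy of them; `encode`, `PAD_ID`, and `UNK_ID` are hypothetical names.

```python
from typing import Dict, List, Tuple

PAD_ID = 0  # assumed id reserved for padding
UNK_ID = 1  # assumed id for out-of-vocabulary tokens


def encode(tokens: List[str], vocab: Dict[str, int],
           max_length: int) -> Tuple[List[int], List[int]]:
    """Map tokens to ids, truncate to max_length, then pad; the mask marks real tokens."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokens][:max_length]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [PAD_ID] * pad, mask + [0] * pad


vocab = {"reset": 2, "password": 3}
padded = encode(["reset", "my", "password"], vocab, 5)
truncated = encode(["reset", "my", "password"], vocab, 2)
```

A fixed length lets whole batches be stacked into a single array, which is what the convolution layers expect.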

Step 2: rerank_model.py

This file cannot be run on its own; it is called by the system. You can modify its settings to suit your needs.

# Get the similarity scores computed by the deep model
def test(corpus, config):
    process_data = read_test(corpus, stopword_file, vocab_file)
    test_corpus = load_test_data(process_data, max_q_length, max_a_length)
    # tf.reset_default_graph() prevents errors when the model is reloaded
    tf.reset_default_graph()
    with tf.Session() as sess:
        model = SiameseQACNN(config)
        saver = tf.train.Saver()
        saver.restore(sess, tf.train.latest_checkpoint(best_path))

        iterator = Iterator(test_corpus)
        res = []
        for batch_x in iterator.next(config.batch_size, shuffle=False):
            batch_q, batch_a, batch_qmask, batch_amask = zip(*batch_x)
            batch_q = np.asarray(batch_q)
            batch_a = np.asarray(batch_a)
            predictions = sess.run([model.res], feed_dict={model._ques: batch_q,
                                                model._ans: batch_a,
                                                model.dropout_keep_prob: 1.0})
            res.append([i for i in predictions])
        return res
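The value returned by `test` is nested: one entry per batch, each holding the list of fetched tensors. A caller therefore has to flatten it before picking the best candidate. The sketch below shows one way to do that, assuming `model.res` yields one score per question-answer pair and that candidate order is preserved (the listing above uses `shuffle=False`). `flatten_scores` and `best_candidate` are hypothetical helpers, and plain lists stand in for the NumPy arrays.

```python
from typing import List, Sequence, Tuple


def flatten_scores(batched: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Flatten batches -> fetched tensors -> per-pair scores into one flat list."""
    return [float(s) for batch in batched for fetched in batch for s in fetched]


def best_candidate(candidate_ids: List[int],
                   batched_scores: Sequence[Sequence[Sequence[float]]]) -> Tuple[int, float]:
    """Return the recalled candidate id with the highest rerank score."""
    scores = flatten_scores(batched_scores)
    if len(scores) != len(candidate_ids):
        raise ValueError("expected exactly one score per candidate")
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidate_ids[best], scores[best]


# Two batches (sizes 2 and 1), each wrapping a single fetched score vector.
res = [[[0.1, 0.7]], [[0.4]]]
winner = best_candidate([12, 5, 9], res)
```

The length check guards against silently misaligning scores and candidates if the batching logic ever changes.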

Finally, there is the system's central control file, qa-control.py