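Chatbot: Training on QQ Chat Logs

Personal blog: http://www.chenjianqu.com/

Original post: http://www.chenjianqu.com/show-39.html

This post shows how to train a chatbot, another 'you', on your own QQ chat logs, using a seq2seq model built with the Keras framework.

NLP makes me happy! Lately, alongside writing Unity code, I have been reading up on NLP. Having just learned seq2seq, I wanted to build a simple chatbot from my QQ chat logs of the past year. While working on it I found that most material online covers character-level seq2seq, and that word-level implementations in Keras are almost nonexistent, so I had to feel my way through. I now have something roughly working, so here it is.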

Seq2seq

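Seq2Seq (Sequence to Sequence) is a method that, given one sequence, generates another sequence. In a human-machine dialogue, for example, you feed "Are you free tomorrow?" into a seq2seq model and it generates "Yes, what's up?".

The main applications of Seq2Seq are: ① machine translation (Google Translate, the best-known example today, was developed on Seq2Seq plus an attention mechanism); ② chatbots (XiaoAI, Microsoft XiaoIce and others use Seq2Seq techniques, though not exclusively); ③ automatic text summarization (used by Toutiao, among others); ④ automatic image captioning; ⑤ machine-written poetry, code completion, commit-message generation, story style rewriting, and so on.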

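How it works

The core idea of Seq2Seq is to use a deep neural network to map the input sequence to an intermediate semantic vector, and then decode that vector into the output sequence. The process consists of two stages: encoding the input (encoder) and decoding the output (decoder). The encoder and decoder are usually RNNs, typically LSTM or GRU; for details, see the post on the principles and implementation of RNN, LSTM, and GRU.

Suppose the source sentence is X=(a,b,c,d,e,f) and the target output is Y=(P,Q,R,S,T); a basic seq2seq model then looks like the figure below.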

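On the left is the encoder for the input. It encodes the input (possibly of variable length) into a fixed-size vector. There are many model choices here: RNN structures such as GRU or LSTM, CNN+Pooling, Google's pure-attention architecture, and so on. In theory, this fixed-size vector contains all the information in the input sentence.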
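
To make the "variable length in, fixed size out" idea concrete, here is a minimal sketch using a plain tanh RNN written in NumPy. It is not the LSTM used in this post; the function name `encode`, the toy sizes, and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

def encode(token_ids, embed, w_in, w_rec):
    """Run a plain tanh RNN over the tokens and return the final hidden state."""
    h = np.zeros(w_rec.shape[0])
    for t in token_ids:
        # Each step mixes the current token's embedding with the previous state.
        h = np.tanh(embed[t] @ w_in + h @ w_rec)
    return h

rng = np.random.default_rng(0)
vocab_size, embed_dim, hidden_dim = 10, 4, 6   # toy sizes, chosen arbitrarily
embed = rng.normal(size=(vocab_size, embed_dim))
w_in = rng.normal(size=(embed_dim, hidden_dim))
w_rec = rng.normal(size=(hidden_dim, hidden_dim))

short_vec = encode([1, 2, 3], embed, w_in, w_rec)
long_vec = encode([1, 2, 3, 4, 5, 6, 7], embed, w_in, w_rec)
print(short_vec.shape, long_vec.shape)  # both are (6,)
```

Whatever the input length, the encoder returns a vector of size `hidden_dim`, which is what gets handed to the decoder.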

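The decoder is responsible for decoding the vector we just encoded into the desired output. Unlike the encoder, the figure stresses that the decoder is "unidirectional and recursive", because decoding proceeds recursively. The procedure is:

1. Every output sequence begins with a common <start> marker and ends with an <end> marker; both markers are treated as words/characters too.
2. Feed <start> into the decoder to obtain a hidden-layer vector, mix this vector with the encoder's output, and pass the result to a classifier; the classifier should output P.
3. Feed P into the decoder to obtain a new hidden-layer vector, mix it with the encoder's output again, and pass it to the classifier; the classifier should output Q.
4. Continue recursively until the classifier outputs <end>.

That is the decoding process of a basic seq2seq model: at each step, the result of the previous decoding step is fed into the next one, until <end> is output.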
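
The recursive decoding loop described above can be sketched in a few lines. This is only an illustration of the control flow: `NEXT_TOKEN` is a hypothetical lookup table standing in for "decoder plus classifier", and `greedy_decode` is a name made up for this example.

```python
START, END = "<start>", "<end>"

# Hypothetical stand-in for "decoder + classifier": given the previous token,
# return the next one. A real model would also use the encoder output and
# the hidden state; a lookup table keeps the control flow visible.
NEXT_TOKEN = {START: "P", "P": "Q", "Q": "R", "R": "S", "S": "T", "T": END}

def greedy_decode(step, max_len=20):
    """Feed each prediction back in until <end> appears or max_len is hit."""
    token, output = START, []
    for _ in range(max_len):
        token = step(token)
        if token == END:
            break
        output.append(token)
    return output

print(greedy_decode(NEXT_TOKEN.get))  # ['P', 'Q', 'R', 'S', 'T']
```

The `max_len` cap matters in practice: a trained model is not guaranteed to ever emit <end>, so the loop needs a second way to stop.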

Now let's build the chatbot

Neural network model

The network structure used for training is as follows:

Detailed structure


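Because this is a word-level generative model, feeding it one-hot input would make the LSTM's computation and memory requirements enormous. The one-hot vectors therefore need to be reduced in dimension, so a word-embedding layer is placed in front of the LSTM layer. The word vectors used here are the ones mentioned in my earlier post "Text CNN: Binary Classification of Chinese Hotel Reviews".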
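
As a small illustration of why the embedding layer is the cheaper route, the sketch below checks that multiplying a one-hot vector by an embedding matrix gives the same result as simply indexing a row. The sizes match the MAX_WORDS and EMBEDDING_DIM values used in this post; the random matrix and the chosen word index are assumptions for the demo.

```python
import numpy as np

vocab_size, embed_dim = 15000, 300    # MAX_WORDS and EMBEDDING_DIM in this post
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim)).astype("float32")

word_id = 42                          # arbitrary word index for the demo
one_hot = np.zeros(vocab_size, dtype="float32")
one_hot[word_id] = 1.0

# Multiplying a one-hot vector by the matrix just selects one row ...
via_matmul = one_hot @ embedding
# ... so an embedding layer skips the multiplication and indexes directly.
via_lookup = embedding[word_id]

print(np.allclose(via_matmul, via_lookup))
print(one_hot.nbytes, "bytes per one-hot word vs", via_lookup.nbytes, "bytes per embedded word")
```

Each word shrinks from a 15000-element vector to a 300-element one before it reaches the LSTM, which is where the savings in computation and memory come from.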

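Training uses teacher forcing (the decoder is fed the ground-truth previous word at every step), which differs from the prediction process, where the decoder has to consume its own previous output. The prediction model therefore has to be separated from the training model: the two share the same layers but have different model structures.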
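
A minimal sketch of what teacher forcing means for the data: the decoder input is the reference reply shifted right by one step behind a start marker, and the target is the same reply followed by an end marker. The helper name `teacher_forcing_pair` is made up for this example; the '\t' and '\n' markers are the ones this post adds to the vocabulary.

```python
START, END = "\t", "\n"   # the start/end markers this post adds to the vocabulary

def teacher_forcing_pair(reply_tokens):
    """Return (decoder_input, decoder_target) for one reference reply."""
    decoder_input = [START] + reply_tokens    # what the decoder is fed
    decoder_target = reply_tokens + [END]     # what it should predict at each step
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair(["P", "Q", "R"])
for fed, expected in zip(dec_in, dec_out):
    print(repr(fed), "->", repr(expected))
```

At prediction time no reference reply exists, so the "fed" column has to come from the model's own previous output, which is why a separate prediction model is needed.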

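The first sub-model of the prediction model is the encoder:

The second sub-model is decoder_embedding:

The third sub-model is the decoder:

In theory the second and third sub-models should be merged into one, but I ran into problems when I tried to combine them.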

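Obtaining and denoising the dataset

This project uses QQ chat logs as the training set. QQ itself can export chat logs: open the Message Manager, click the button in the upper-right corner to export all message records, and save them as a .txt file.

That gives us the data we want.

Next comes data cleaning. The code is as follows: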

import re
data=[]
#read the dataset
file = open("D:/NLP/dataset/對話數據集/聊天記錄.txt",encoding='utf-8')
last_name=''
name=''
multiLine=''
flag=0
count=0
#substrings that mark system notices (red packets, group invitations, recalled messages, ...)
keyNoise=['的口令紅包','邀請加入','申請加入','點擊查看','撤回了','(無)','對方已成功接收']
#line prefixes that mark export headers, separators, links and admin notices
prefixNoise=('消息記錄','消息分組','消息對象','====','http','[QQ紅包]','管理員')
for line in file:
    #strip whitespace and skip blank lines
    line=line.strip()
    if(len(line)==0):
        continue
    #denoise
    if(len(line)>4 and line.startswith(prefixNoise)):
        continue
    if(any(s in line for s in keyNoise)):
        continue
    #a header line (date, time, nickname) precedes each message;
    #consecutive messages from the same sender are joined together
    if(line[:4]=='2018' or line[:4]=='2019'):
        name=line.split(' ')[-1]
        if(name==last_name):
            flag=1
        else:
            flag=0
            last_name=name
        continue
    if(flag==1):
        multiLine+=(' '+line)
        continue
    else:
        #emit the buffered previous message and start buffering the current one
        temp=line
        line=multiLine.replace('\n','')
        multiLine=temp

    if(name=='軌跡'):#add the "me" tag ('軌跡' is the name my own messages appear under)
        multiLine='CJQ'+temp
    else:#add the "friend" tag
        multiLine='FRI'+temp
    #remove @mentions
    obj=re.findall( r'(@\S*\s)',line)
    for s in obj:
        line=line.replace(s,'')
    #remove image and emoticon placeholders
    line=line.replace('[圖片]','')
    line=line.replace('[表情]','')
    line=line.strip()
    #skip messages that are empty after cleaning (only the 3-character tag, or the empty initial buffer)
    if(len(line)<=3):
        continue
    data.append(line)
    count+=1
    if(count==30678):#I only extract the first 30678 messages here
        break
file.close()
print(count)
#write the cleaned data
with open('data.txt','w',encoding='utf-8') as f:
    f.write('\n'.join(data))

After processing, the raw chat data looks like the following:

FRI你在寢室嗎
CJQ不在 等會回去
FRI你回到了告訴我吧。我來拿
CJQok 我回來了
FRI太晚了。我都要睡了。明天再來找你拿吧
CJQ臥槽我給你吧 我現在拿給你？
FRI別了吧。太麻煩你了
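
The mention-stripping and placeholder-stripping steps of the cleaning code can be tried in isolation. The sketch below wraps them in a helper; the name `clean_message` and the nickname in the sample message are made up for this example, and `re.sub` is used in place of the `findall` plus `replace` loop.

```python
import re

def clean_message(text):
    """Strip @mentions and QQ media placeholders from one message."""
    text = re.sub(r'@\S*\s', '', text)           # "@name " mentions
    for placeholder in ('[圖片]', '[表情]'):       # image / emoticon placeholders
        text = text.replace(placeholder, '')
    return text.strip()

print(clean_message('@小明 你在寢室嗎[表情]'))  # 你在寢室嗎
```

A message that consists only of a placeholder comes back as an empty string, which is why the cleaning loop has to drop empty results afterwards.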

Next, split this dialogue set into two files. The code is as follows:

input_text=[]
target_text=[]
for i in range(len(data)-1):
    if(data[i][:3]=='FRI' and data[i+1][:3]=='CJQ'):
        input_text.append(data[i][3:].strip())
        target_text.append(data[i+1][3:].strip())
        
with open('cjq.txt','w',encoding='utf-8') as f:
    f.write('\n'.join(target_text))
with open('fri.txt','w',encoding='utf-8') as f:
    f.write('\n'.join(input_text))

This produces two files: fri.txt (what my friends said) and cjq.txt (my replies).
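
To see which lines survive the pairing rule, here is a small sketch run on a few of the sample lines shown earlier. The helper name `make_pairs` is made up for this example; it applies the same "FRI line directly followed by a CJQ line" rule.

```python
def make_pairs(lines, ask_tag='FRI', reply_tag='CJQ'):
    """Keep only (friend message, my reply) pairs that are directly adjacent."""
    pairs = []
    for prev, curr in zip(lines, lines[1:]):
        if prev.startswith(ask_tag) and curr.startswith(reply_tag):
            pairs.append((prev[len(ask_tag):].strip(), curr[len(reply_tag):].strip()))
    return pairs

sample = [
    'FRI你在寢室嗎',
    'CJQ不在 等會回去',
    'FRI你回到了告訴我吧。我來拿',
    'CJQok 我回來了',
    'FRI太晚了。我都要睡了。明天再來找你拿吧',
]
for question, answer in make_pairs(sample):
    print(question, '=>', answer)
```

Only two pairs come out of these five lines: a CJQ line followed by a FRI line is not a (question, answer) pair, and the last FRI line has no reply yet.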

Vectorizing the dataset

Use Keras's text preprocessing utilities to map the text to integer sequences.

import jieba
from keras.models import Model
from keras.layers import Input, LSTM, Dense,Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

MAX_WORDS=15000 #vocabulary size limit: word indices used by the model stay below 15000
SEN_LEN=32 #maximum length of each message, in words

#segment each sentence into words and join them with spaces
inputTextList=[' '.join(jieba.cut(text)) for text in input_text]
targetTextList=[' '.join(jieba.cut(text)) for text in target_text]
tokenizer=Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts=inputTextList+targetTextList)

#map the text to integer sequences
input_sequences=tokenizer.texts_to_sequences(texts=inputTextList)
target_sequence=tokenizer.texts_to_sequences(texts=targetTextList)
word_index=tokenizer.word_index

#add two escape characters to the dictionary as the start ('\t') and end ('\n') tokens
#note: both indices must stay below MAX_WORDS, so the corpus vocabulary must have fewer than MAX_WORDS-2 words
word_index['\t']=len(word_index)+1
word_index['\n']=len(word_index)+1
reverse_word_index = {i: t for t, i in word_index.items()}#reversed dictionary, used to map indices back to words
print(len(input_sequences))
print(len(target_sequence))
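
For readers who have not used the Keras Tokenizer, the sketch below mimics in plain Python what it does as I understand it: words are ranked by frequency, index 1 goes to the most frequent word, and words whose index is not below `num_words` are dropped when converting text to sequences. The helper names and the toy sentences are made up for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Rank space-separated words by frequency; index 1 is the most frequent."""
    counts = Counter(w for t in texts for w in t.split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping any word whose index is >= num_words."""
    return [[word_index[w] for w in t.split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

texts = ['你 在 寢室 嗎', '不 在', '在 嗎']
word_index = build_word_index(texts)
print(word_index)
print(to_sequences(texts, word_index, num_words=3))  # [[1, 2], [1], [1, 2]]
```

Index 0 is never assigned to a word, which is what lets the zero-filled arrays used later act as padding.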

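Because the vocabulary is large, the one-hot encoded targets for the whole dataset will not fit in memory at once, so the model cannot be trained on all the data in a single array. A generator that builds one batch at a time is needed instead.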

#data generator
def train_gen(input_seq,target_seq, m, batch_size=64):
    #the sequences have different lengths, so store them as object arrays
    input_seq=np.array(input_seq,dtype=object)
    target_seq=np.array(target_seq,dtype=object)
    #shuffle inputs and targets with the same permutation
    permutation = np.random.permutation(input_seq.shape[0])
    shuffled_inputs = input_seq[permutation]
    shuffled_targets = target_seq[permutation]
    num_batches = int(m/batch_size)
    
    while 1:
        for b in range(num_batches):
            input_seq_batch=shuffled_inputs[b*batch_size:(b+1)*batch_size]
            target_seq_batch=shuffled_targets[b*batch_size:(b+1)*batch_size]
            
            #zero-filled arrays: index 0 acts as padding
            encoder_x = np.zeros((batch_size, SEN_LEN),dtype='float32')
            decoder_x = np.zeros((batch_size, SEN_LEN),dtype='float32')
            decoder_y = np.zeros((batch_size, SEN_LEN, MAX_WORDS),dtype='float32')
            
            for i, (in_t, tar_t) in enumerate(zip(input_seq_batch, target_seq_batch)):
                lentext=len(tar_t[:SEN_LEN])
                lentext_in=len(in_t[:SEN_LEN])
                #skip pairs where either side is empty after tokenization
                if(lentext==0 or lentext_in==0):
                    continue
                for j, w_index in enumerate(in_t[:SEN_LEN]):
                    encoder_x[i,j]=w_index
                
                for j, w_index in enumerate(tar_t[:SEN_LEN]):
                    if(j==0):#start token
                        decoder_x[i,0]=word_index['\t']
                        decoder_y[i,0, w_index] = 1.
                    elif(j==lentext-1):#last position
                        decoder_x[i,j]=tar_t[j-1]
                        if(lentext>=SEN_LEN):#no room left, so the end token replaces the last word
                            decoder_y[i, j,word_index['\n']] = 1.
                        else:
                            decoder_y[i, j, w_index] = 1.
                    else:
                        decoder_x[i,j]=tar_t[j-1]
                        decoder_y[i, j, w_index] = 1.#the decoder target is one time step ahead of the decoder input
                if(lentext<SEN_LEN):#append the end token after the last word
                    decoder_x[i,lentext]=tar_t[lentext-1]
                    decoder_y[i,lentext,word_index['\n']] = 1.
            yield [encoder_x,decoder_x],decoder_y
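
The memory concern that motivates the generator is easy to put into numbers. The sketch below computes the size of the dense one-hot target tensor for a whole dataset versus a single batch; MAX_WORDS and SEN_LEN are the values used in this post, while the dataset size of 10,000 pairs is a hypothetical figure chosen only for illustration.

```python
MAX_WORDS, SEN_LEN = 15000, 32   # values used in this post
BYTES_PER_FLOAT32 = 4

def one_hot_target_bytes(num_samples):
    """Size of a dense float32 one-hot target tensor of shape (n, SEN_LEN, MAX_WORDS)."""
    return num_samples * SEN_LEN * MAX_WORDS * BYTES_PER_FLOAT32

num_pairs = 10_000               # hypothetical dataset size, for illustration only
print(f"whole dataset: {one_hot_target_bytes(num_pairs) / 1e9:.1f} GB")   # 19.2 GB
print(f"one batch of 16: {one_hot_target_bytes(16) / 1e6:.2f} MB")        # 30.72 MB
```

Building the targets batch by batch keeps the footprint in the tens of megabytes instead of tens of gigabytes.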

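Loading the word-vector matrix

I use 300-dimensional Chinese word vectors. The code below loads the word-vector file and builds the word-vector matrix.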

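#parse the word-vector file
embeddings_index={}
with open(r'D:\NLP\wordvector\sgns.zhihu.word\sgns.zhihu.word',encoding='utf-8') as f:
    for line in f:
        values=line.split()
        word=values[0]#the first field is the word
        coefs=np.asarray(values[1:],dtype='float32')#the remaining fields are the vector components
        embeddings_index[word]=coefs
#prepare the word-vector matrix
EMBEDDING_DIM=300#dimensionality of the word vectors
embedding_matrix=np.zeros((MAX_WORDS,EMBEDDING_DIM))
for word,i in word_index.items():
    if(i>=MAX_WORDS):#indices beyond the vocabulary limit have no row in the matrix
        continue
    word_vector=embeddings_index.get(word)
    #out-of-vocabulary words (and malformed entries such as a header line) keep the initial all-zero vector
    if(word_vector is not None and len(word_vector)==EMBEDDING_DIM):
        embedding_matrix[i]=word_vector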
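
The same parsing and matrix-filling logic can be checked on a tiny in-memory example. Everything here is made up for the demo: the two 3-dimensional vectors, the toy word index, and the variable names. It shows that a word missing from the vector file simply keeps an all-zero row.

```python
import io
import numpy as np

# Two made-up 3-dimensional vectors in the "word v1 v2 ..." text layout.
fake_file = io.StringIO("你 0.1 0.2 0.3\n在 0.4 0.5 0.6\n")
vectors = {}
for line in fake_file:
    parts = line.split()
    vectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

toy_word_index = {'在': 1, '你': 2, '寢室': 3}   # hypothetical tokenizer output
matrix = np.zeros((len(toy_word_index) + 1, 3), dtype='float32')  # row 0 is padding
for word, i in toy_word_index.items():
    vec = vectors.get(word)
    if vec is not None:
        matrix[i] = vec

print(matrix)
```

Rows 1 and 2 hold the vectors for '在' and '你'; rows 0 (padding) and 3 (a word with no pretrained vector) stay at zero.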

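Building the neural network model

The network structure is as described above. The code below builds the model, loads the parameters of the word-embedding layers, and compiles the model.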

from keras.models import Model
from keras.layers import Input, LSTM, Dense,Embedding
from keras.utils import plot_model
#encoder
encoder_inputs = Input(shape=(None,))
encoder_eb = Embedding(MAX_WORDS, EMBEDDING_DIM)
encoder_eb_outputs = encoder_eb(encoder_inputs)#embed the input text
encoder_lstm=LSTM(EMBEDDING_DIM,return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_eb_outputs)
encoder_states = [state_h, state_c]#only the final states are passed on to the decoder
#decoder
decoder_inputs = Input(shape=(None,))
decoder_eb = Embedding(MAX_WORDS, EMBEDDING_DIM)
decoder_eb_outputs=decoder_eb(decoder_inputs)
decoder_lstm = LSTM(EMBEDDING_DIM, return_sequences=True,return_state=True)
decoder_outputs,_,_=decoder_lstm(decoder_eb_outputs, initial_state=encoder_states)

decoder_dense_1 = Dense(int(MAX_WORDS/8), activation='relu')
decoder_outputs_1 = decoder_dense_1(decoder_outputs)
decoder_dense = Dense(MAX_WORDS, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs_1)

#chain the encoder and decoder together
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

model.summary()
plot_model(model,to_file='chatbot_qq_model.png',show_shapes=True)

#load the word-vector matrix into both embedding layers and freeze them
encoder_eb.set_weights([embedding_matrix])
encoder_eb.trainable=False
decoder_eb.set_weights([embedding_matrix])
decoder_eb.trainable=False
#compile the model; accuracy is reported under the name 'acc', which the training callbacks monitor
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

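Training the model

Train with the generator for 50 epochs. Limited by GPU memory, I could only set batch_size to 16; a larger value might give somewhat better results and would also train faster. I did not set up a validation set either, mostly out of laziness.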

import keras

callbacks_list=[
    keras.callbacks.EarlyStopping(
        monitor='acc',
        patience=10,
    ),
    keras.callbacks.ModelCheckpoint(
        filepath='chatbot1_model_checkpoint.h5',
        monitor='loss',
        save_best_only=True
    ),
    keras.callbacks.TensorBoard(
        log_dir='chatbot1_log'
    )
]

model.fit_generator(train_gen(input_sequences,target_sequence,
                    len(input_sequences), 
                    batch_size=16),
                    steps_per_epoch=1000,
                    callbacks=callbacks_list,
                    epochs=50
                   )

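The training results are as follows:

As you can see, 50 epochs of training bring the training accuracy to 0.25; with more epochs it could likely go higher.

Building the prediction model

Build the prediction model as described earlier.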

from keras.utils import plot_model

#sub-model 1, encoder: input sentence -> final LSTM states
encoder_model = Model(encoder_inputs, encoder_states)
encoder_model.summary()
plot_model(encoder_model,to_file='chatbot_qq_encoder_model.png',show_shapes=True)

#sub-model 2, decoder embedding: reuse the trained decoder_eb layer so that
#prediction sees the same word vectors as training (a new Embedding layer would be randomly initialized)
decoder_eb_inputs = Input(shape=(None,))
emb_model = Model(decoder_eb_inputs, decoder_eb(decoder_eb_inputs))
emb_model.summary()
plot_model(emb_model,to_file='chatbot_qq_emb_model.png',show_shapes=True)

#sub-model 3, decoder: embedded word + previous states -> word distribution + new states
decoder_eb_input = Input(shape=(None,EMBEDDING_DIM))

decoder_state_input_h = Input(shape=(None,EMBEDDING_DIM))
decoder_state_input_c = Input(shape=(None,EMBEDDING_DIM))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

#the trained decoder LSTM and dense layers are shared with the training model
decoder_outputs, state_h, state_c = decoder_lstm(decoder_eb_input, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]

decoder_outputs_1 = decoder_dense_1(decoder_outputs)
decoder_outputs = decoder_dense(decoder_outputs_1)

decoder_model = Model([decoder_eb_input] + decoder_states_inputs, [decoder_outputs]+decoder_states)
decoder_model.summary()
plot_model(decoder_model,to_file='chatbot_qq_decoder_model.png',show_shapes=True)

Prediction

Load the dictionary

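import ast

#the file holds the word-index dictionary as a Python literal
with open('word_index_chatbot.txt','r',encoding='utf-8') as f:
    dictStr = f.read()
tk = ast.literal_eval(dictStr)#safer than eval: only literals are accepted
rtk = {i: t for t, i in tk.items()}#reversed dictionary: index -> word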

Enter a chat message and generate a reply.

text='早點睡吧你'
#convert the input into a sequence of word indices
input_texts=[w for w in jieba.cut(text)]
input_sequences=[tk[w] for w in input_texts if w in tk]
x = np.zeros((1, SEN_LEN),dtype='float32')
y = np.zeros((1, 1),dtype='float32')
y[0,0]=tk['\t']#decoding starts from the start token
for j, w_index in enumerate(input_sequences[:SEN_LEN]):
    x[0,j]=w_index
result=''

#sequence prediction
states_value = encoder_model.predict(x)#encode the input sequence
#reshape the states to the 3-D shape declared for the decoder model's state inputs
h,c=states_value
h_3=np.zeros((1,1, EMBEDDING_DIM),dtype='float32')
c_3=np.zeros((1,1, EMBEDDING_DIM),dtype='float32')
h_3[0]=h
c_3[0]=c
states_value=[h_3,c_3]
i=0
stop_condition = False
while not stop_condition:
    embeded_vector=emb_model.predict(y)
    output_tokens, h, c = decoder_model.predict([embeded_vector] + states_value)#decode one step

    #convert the output into a word
    index = int(np.argmax(output_tokens[0, -1, :]))
    word = rtk.get(index,'')
    i+=1
    #stop at the end token, or once the reply reaches the maximum length
    if (word=='\n' or i>=SEN_LEN):
        stop_condition = True
    else:
        result += word
    #update the states and feed the predicted word back in as the next input
    states_value = [h, c]
    y[0,0]=index
print(result)

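Prediction results:

The results are honestly mediocre, mainly because the dataset is too small. I guess that's on me for not chatting enough.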

