[Natural Language Processing] Text Information Extractor - RNN

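What this article covers

A brief introduction to recurrent neural networks (RNN, Recurrent Neural Network), covering the single-layer RNN, multi-layer RNN, bidirectional RNN, and bidirectional RNN + Attention structures
A text classification task solved with RNNs, including the model definition code
Code for this article: [ https://github.com/540117253/Chinese-Text-Classification ]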

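I. RNN Overview

A recurrent neural network (RNN) is a class of models designed for sequence data. The two mainstream RNN units today are LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit). Compared with the original RNN unit, LSTM adds a memory cell to mitigate the vanishing-gradient problem that arises when training on long sequences, while GRU is a simplified variant of LSTM that uses fewer gates and parameters and therefore trains faster. The following subsections introduce the single-layer, multi-layer, and bidirectional RNN structures in turn; each of the three can use either unit (LSTM or GRU) as needed.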

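1.1 Single-layer RNN structure

Figure 1: How an RNN processes a sequence (LSTM as the example)

Suppose we are given a sentence $S=\{w_t\}^N_{t=1}$ of length $N$ words, where $w_t$ is the $t$-th word. Before the RNN runs, each word $w_t$ is mapped to a vector $x_t$, giving $S=\{x_t\}^N_{t=1}$. The RNN then processes the sentence from front to back, that is, from the first word $x_1$ to the last word $x_N$. Figure 1 uses LSTM as the basic unit (GRU works the same way) and shows the overall flow of an RNN processing a sentence $S=\{x_t\}^N_{t=1}$. The computation is: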

$f_t=\sigma(W_f[h_{t-1},x_t]+b_f)$
$i_t=\sigma(W_i[h_{t-1},x_t]+b_i)$
$\widetilde{C}_t=\tanh(W_c[h_{t-1},x_t]+b_c)$
$C_t=f_t*C_{t-1}+i_t*\widetilde{C}_t$
$o_t=\sigma(W_o[h_{t-1},x_t]+b_o)$
$h_t=o_t*\tanh(C_t)$

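These formulas describe the complete process by which the $t$-th word $x_t$ of the sentence $S=\{x_t\}^N_{t=1}$ (the input at time step $t$) produces the result $h_t$ (the output at time step $t$). Here $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, $C_t$ is the cell state, $[\cdot]$ denotes concatenation, $\sigma$ is the sigmoid function, and $*$ is the Hadamard product.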
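The six formulas translate almost line by line into NumPy. The following is a minimal sketch, not the code from the repository: the helper names (`lstm_step`, `init_lstm_params`, `run_lstm`), the parameter dictionary layout, and the small random initialization are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step, mirroring the six formulas above."""
    concat = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])       # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                                    # output of this step
    return h_t, c_t

def init_lstm_params(input_dim, hidden_dim, seed=0):
    """Hypothetical initializer: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "c", "o"):
        params["W_" + gate] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim + input_dim))
        params["b_" + gate] = np.zeros(hidden_dim)
    return params

def run_lstm(xs, params, hidden_dim):
    """Process a sentence (one row per word vector) from x_1 to x_N."""
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    hs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        hs.append(h)
    return np.stack(hs)  # shape (N, hidden_dim); the last row is h_N
```

Calling `run_lstm(xs, init_lstm_params(4, 3), 3)` on an `(N, 4)` array returns one hidden state per word, which is what a layer with `return_sequences=True` produces.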

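Figure 2: Structure of a GRU

With the same overall structure as Figure 1, the LSTM unit can be replaced by a GRU unit, whose internals are shown in Figure 2. For the $t$-th word $x_t$ of the sentence $S=\{x_t\}^N_{t=1}$, the GRU computes the output $h_t$ of time step $t$ as follows, where $z_t$ is the update gate and $r_t$ is the reset gate: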

$z_t=\sigma(W_z[h_{t-1},x_t])$
$r_t=\sigma(W_r[h_{t-1},x_t])$
$\widehat{h}_t=\tanh(W[r_t*h_{t-1},x_t])$
$h_t=(1-z_t)*h_{t-1}+z_t*\widehat{h}_t$
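As a rough illustration, the four GRU formulas can be written as a single NumPy function. This is only a sketch under the same bias-free form used above; the function name `gru_step` and the argument order are assumptions, not part of the original code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU time step; each weight has shape (hidden_dim, hidden_dim + input_dim)."""
    concat = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                               # update gate
    r_t = sigmoid(W_r @ concat)                               # reset gate
    h_hat = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    h_t = (1.0 - z_t) * h_prev + z_t * h_hat                  # blend old and candidate state
    return h_t
```

When $z_t$ is close to 0 the unit simply copies $h_{t-1}$ forward, which is how a GRU carries information across many time steps.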

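1.2 Multi-layer RNN structure

Figure 3: Multi-layer RNN structure (LSTM as the example)

To extract richer information about each word, several RNN layers can be stacked into a deeper network. Like a single-layer RNN, a multi-layer RNN processes a sentence $S=\{x_t\}^N_{t=1}$ from front to back. The difference is that the input $x_t$ of layer $n$ at time step $t$ is the output $h_t$ of layer $n-1$ at time step $t$. Figure 3 shows a 2-layer RNN built from LSTM units (GRU units work the same way).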
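The stacking rule (layer $n$ reads the per-step outputs of layer $n-1$) can be sketched in a few lines. To keep the example short it swaps the LSTM unit for a plain tanh recurrent unit, which is a simplification and not what the article's models use; `simple_rnn_layer`, `stacked_rnn`, and `make_layers` are hypothetical helper names.

```python
import numpy as np

def simple_rnn_layer(inputs, W, U, b):
    """Run one recurrent layer from front to back and return every h_t."""
    h = np.zeros(W.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(W @ h + U @ x_t + b)  # plain tanh unit standing in for LSTM/GRU
        outputs.append(h)
    return np.stack(outputs)

def stacked_rnn(inputs, layers):
    """Feed the output sequence of each layer into the next layer."""
    seq = inputs
    for W, U, b in layers:
        seq = simple_rnn_layer(seq, W, U, b)
    return seq  # output sequence of the top layer

def make_layers(input_dim, hidden_dim, num_layers, seed=0):
    """Hypothetical initializer; layers above the first take hidden_dim inputs."""
    rng = np.random.default_rng(seed)
    layers = []
    in_dim = input_dim
    for _ in range(num_layers):
        W = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        U = rng.normal(0.0, 0.1, (hidden_dim, in_dim))
        layers.append((W, U, np.zeros(hidden_dim)))
        in_dim = hidden_dim
    return layers
```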

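1.3 Bidirectional RNN structure

Figure 4 shows the bidirectional RNN structure (LSTM as the example). Given a sentence $S=\{w_t\}^N_{t=1}$ of length $N$ words, let $w_t$ be the vector of the $t$-th word. A bidirectional RNN reads the input sequence (here, a sentence) both from front to back and from back to front, and concatenates the forward and backward hidden states at each position to form the final hidden state.

Figure 4: Bidirectional RNN structure (LSTM as the example)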

$\overrightarrow{h}_t=RNN(\overrightarrow{h}_{t-1},w_t)$
$\overleftarrow{h}_t=RNN(\overleftarrow{h}_{t+1},w_t)$
$h_t=[\overrightarrow{h}_t,\overleftarrow{h}_t]$

Here $RNN()$ can be either an LSTM unit or a GRU unit, and $[\cdot]$ denotes concatenation.
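A small NumPy sketch makes the two passes and the concatenation concrete. It again uses a plain tanh unit in place of LSTM/GRU for brevity (an assumption for illustration), and the names `rnn_pass` and `bidirectional_rnn` are hypothetical.

```python
import numpy as np

def rnn_pass(inputs, W, U):
    """Run a tanh recurrent unit over `inputs` in the given order."""
    h = np.zeros(W.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W @ h + U @ x_t)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states position by position."""
    forward = rnn_pass(inputs, *fwd_params)
    # Read the sentence back to front, then flip so row t again matches word t.
    backward = rnn_pass(inputs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)  # shape (N, 2 * hidden_dim)
```

Note that in the last row the forward half has read the whole sentence, while the backward half has only read the final word.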

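1.4 Bidirectional RNN + Attention structure

The core idea of the attention mechanism is to let the model focus more on the parts of the input that matter most for the task and less on the parts that matter less. It can be viewed as a key-value lookup: given a Query, it yields a weight for every position in the sequence that reflects how relevant that position is to the task.

In text classification, RNN-based models use the last hidden state of the processed sequence as the basis for classification. Because an RNN captures the sequential information in its input, the last hidden state summarizes the whole sentence and handles text classification reasonably well.

However, this gives no intuitive explanation of why the model assigned a sentence to a particular class. To make the classification result more interpretable, attention is used here to compute a score (weight) for every position in the sentence; the higher the weight, the more that position contributed to the classification result.

Suppose a sentence $S=\{w_t\}^N_{t=1}$ is processed by the RNN into the sequence $H=\{h_t\}^N_{t=1}$, where the hidden state at the last position, $h_N$, contains information about the whole sequence. Here $h_N$ serves as the Query and is compared against every position in the sequence to obtain each position's relevance to $h_N$ (and hence to the classification task).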

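$query=W_qh_N+b_q$
$keys=W_kH+b_k$
$attention=softmax(sum(W_a*\tanh(keys+query)))$
$outputs=attention \times H$

Here $W_a$ is a learned vector applied element-wise ($*$) at every position, $sum$ adds over the attention dimension so that each position gets a single score, and $softmax$ normalizes the scores over the $N$ positions. $\times$ denotes the sum of $H$ weighted by the attention scores; the resulting $outputs$ is then fed into a fully connected layer to produce the classification result.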
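The attention step can be sketched for a single sentence in NumPy. This is an illustrative reading of the formulas and of the `Bi_RNN_Attention` code later in the article, not a copy of it; the function name `attention_pool` and the explicit weight arguments are assumptions.

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def attention_pool(H, W_q, b_q, W_k, b_k, w_a):
    """H: (N, hidden_dim) hidden states of one sentence.

    W_q, W_k: (attention_size, hidden_dim); b_q, b_k, w_a: (attention_size,).
    """
    query = W_q @ H[-1] + b_q             # query built from the last state h_N
    keys = H @ W_k.T + b_k                # one key per position, (N, attention_size)
    scores = np.tanh(keys + query) @ w_a  # multiply by w_a and sum over attention_size
    attention = softmax(scores)           # weights over the N positions
    outputs = attention @ H               # attention-weighted sum of hidden states
    return outputs, attention
```

The returned `attention` vector is what gets inspected for interpretability: positions with larger weights contributed more to `outputs`.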

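1.5 General structure of RNN-based text classification

In text classification, RNN-based models use the last hidden state $h_N$ of the processed sequence as the basis for classification. Because an RNN captures the sequential information in its input, the last hidden state summarizes the whole sentence and handles text classification reasonably well.

The general structure of RNN-based text classification is therefore: text => word embeddings => RNN => fully connected layer => softmax.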

II. RNN Text Classification Example

2.1 Dataset

1. Download:

【https://github.com/skdjfla/toutiao-text-classfication-dataset 】

2. Format:

6552431613437805063_!_102_!_news_entertainment_!_謝娜爲李浩菲澄清網絡謠言，之後她的兩個行爲給自己加分_!_佟麗婭,網絡謠言,快樂大本營,李浩菲,謝娜,觀衆們

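Each line is one record made up of 5 fields separated by _!_. From left to right they are the news ID, the category code (see below), the category name (see below), the news text (title only), and the news keywords.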
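Splitting a record into its five fields is straightforward. The sketch below assumes the `_!_` separator never appears inside a title and that keywords are separated by ASCII commas, as in the sample record; the `NewsRecord` type and `parse_line` function are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple

class NewsRecord(NamedTuple):
    news_id: str
    category_code: int
    category_name: str
    title: str
    keywords: List[str]

def parse_line(line: str) -> NewsRecord:
    """Split one dataset line into its five `_!_`-separated fields."""
    parts = line.rstrip("\n").split("_!_")
    if len(parts) != 5:
        raise ValueError("expected 5 fields, got %d" % len(parts))
    news_id, code, name, title, keywords = parts
    # The keyword field may be empty, so drop empty strings after splitting.
    return NewsRecord(news_id, int(code), name, title,
                      [k for k in keywords.split(",") if k])

sample = ("6552431613437805063_!_102_!_news_entertainment_!_"
          "謝娜爲李浩菲澄清網絡謠言，之後她的兩個行爲給自己加分_!_"
          "佟麗婭,網絡謠言,快樂大本營,李浩菲,謝娜,觀衆們")
record = parse_line(sample)
```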

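Category codes and names:

Code  Category       Label          Name
100   Livelihood     Stories        news_story
101   Culture        Culture        news_culture
102   Entertainment  Entertainment  news_entertainment
103   Sports         Sports         news_sports
104   Finance        Finance        news_finance
106   Real estate    Real estate    news_house
107   Automobiles    Automobiles    news_car
108   Education      Education      news_edu
109   Technology     Technology     news_tech
110   Military       Military       news_military
112   Travel         Travel         news_travel
113   International  International  news_world
114   Securities     Stocks         stock
115   Agriculture    Rural affairs  news_agriculture
116   E-sports       Games          news_game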

2.2 Pre-trained word vectors

The pre-trained word vectors are those trained on Baidu Baike (Baidu Encyclopedia) with the ACL-2018 model.

Download: [ https://github.com/Embedding/Chinese-Word-Vectors ]
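One possible way to turn such a file into an embedding matrix is sketched below. It assumes the vectors are stored in the word2vec text format (a header line with the vocabulary size and dimension, then one word and its space-separated values per line), that word IDs start at 1, and that row 0 is reserved for padding. The function names are hypothetical, and how the article's own training script loads the vectors is not shown in this post.

```python
import io
import numpy as np

def load_word2vec_text(stream, wanted_words):
    """Read vectors for `wanted_words` from an open word2vec-format text stream."""
    header = stream.readline().split()
    dim = int(header[1])  # header is assumed to be "<vocab_size> <dim>"
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in wanted_words and len(values) == dim:
            vectors[word] = np.array([float(v) for v in values], dtype=np.float32)
    return vectors, dim

def build_embedding_matrix(word_to_id, vectors, dim, seed=0):
    """Row i holds the vector of the word with ID i; row 0 is the padding row."""
    rng = np.random.default_rng(seed)
    # Words missing from the pre-trained file keep a random vector (an assumption).
    matrix = rng.uniform(-1.0, 1.0, (len(word_to_id) + 1, dim)).astype(np.float32)
    matrix[0] = 0.0
    for word, idx in word_to_id.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix
```

With a real file, `stream` would be something like `open(path, encoding="utf-8")`, where `path` points to the downloaded vectors.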

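2.3 Data preprocessing

1. Remove unwanted characters and segment the text into words
2. Build a dictionary over the whole dataset, with key = word and value = the word's ID
3. Truncate or zero-pad each sample so that every sample has length maxlen
4. Encode the sample labels as integers, e.g. "sports news" becomes class 1 and "entertainment news" becomes class 2
5. Convert the processed data to a DataFrame and save it to disk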
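The first four steps can be sketched in plain Python as follows. Several details here are assumptions rather than the repository's actual choices: the text is taken to be already segmented into space-separated words (a real pipeline would call a Chinese word segmenter), padding is appended at the end, unknown words share ID 0 with padding, and labels are numbered from 0 because `tf.nn.sparse_softmax_cross_entropy_with_logits` expects class IDs in the range [0, label_num). All function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

def clean_text(text: str) -> str:
    """Replace punctuation and other non-word characters with spaces."""
    return re.sub(r"[^\w\s]", " ", text)

def build_vocab(tokenized: List[List[str]]) -> Dict[str, int]:
    """Map each word to an integer ID, starting at 1 (0 is reserved for padding)."""
    vocab: Dict[str, int] = {}
    for tokens in tokenized:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab

def encode(tokens: List[str], vocab: Dict[str, int], maxlen: int) -> List[int]:
    """Convert words to IDs, then truncate or zero-pad to exactly maxlen."""
    ids = [vocab.get(tok, 0) for tok in tokens][:maxlen]
    return ids + [0] * (maxlen - len(ids))

def encode_labels(labels: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Assign consecutive integer IDs to labels in order of first appearance."""
    mapping: Dict[str, int] = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return [mapping[label] for label in labels], mapping
```

Where the padding goes matters for the models below, since they classify from the output at the last time step.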

2.4 Model definitions

This subsection gives the code for three models: multi-layer LSTM (Multi_LSTM), bidirectional RNN (Bi_RNN), and bidirectional RNN + Attention (Bi_RNN_Attention).

2.4.1 Multi_LSTM

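# The models below use the TensorFlow 1.x graph API (tf.placeholder, tf.train.AdamOptimizer).
import tensorflow as tf
from tensorflow import keras

'''
    Text => Multi_Layer_LSTM => Fully_Connected => Softmax
'''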
class Multi_LSTM:
    def __init__(self, rnn_output_dim, num_layers, embedded_size,
                 dict_size, maxlen, label_num, learning_rate):
        
        self.droput_rate = 0.5  # passed to tf.nn.dropout as keep_prob; dropout is applied unconditionally, also at inference

        # print('model_Name:', 'Multi_LSTM')
        
        self.X = tf.placeholder(tf.int32, [None, maxlen], name='input_x')
        self.Y = tf.placeholder(tf.int64, [None])
        
        self.encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1), trainable=False)
        encoder_embedded = tf.nn.embedding_lookup(self.encoder_embeddings, self.X)
        
        rnn_cells = [keras.layers.LSTMCell(units=rnn_output_dim) for _ in range(num_layers)]
        outputs =keras.layers.RNN(rnn_cells,return_sequences=True, return_state=False)(encoder_embedded)
        outputs = tf.nn.dropout(outputs, keep_prob = self.droput_rate)
        
        self.logits = keras.layers.Dense(label_num, use_bias=True)(outputs[:,-1]) # hidden output at the last time step of each text
        self.probability = tf.nn.softmax(self.logits, name='probability')
        
        self.cost = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=self.Y, logits=self.logits)
        self.cost = tf.reduce_mean(self.cost)
        self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)
        self.pre_y = tf.argmax(self.logits, 1, name='pre_y')
        correct_pred = tf.equal(self.pre_y, self.Y)
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

2.4.2 Bi_RNN

'''
    Text => Bidirectional GRU or LSTM => Fully_Connected => Softmax
'''
class Bi_RNN:
    def __init__(self, rnn_output_dim, embedded_size,
                 dict_size, maxlen, label_num, learning_rate, rnn_type):
        
        self.droput_rate = 0.5  # passed to tf.nn.dropout as keep_prob; dropout is applied unconditionally, also at inference

        # print('model_Name:', 'Bi_RNN')

        # Encode the embedded text with a bidirectional GRU or LSTM.
        def bi_rnn(rnn_type, inputs, rnn_output_dim):
            if rnn_type == 'gru':
                h = keras.layers.Bidirectional(
                    keras.layers.GRU(rnn_output_dim, return_sequences=True, unroll=True),
                    merge_mode='concat'
                )(inputs)
            elif rnn_type == 'lstm':
                h = keras.layers.Bidirectional(
                    keras.layers.LSTM(rnn_output_dim, return_sequences=True, unroll=True),
                    merge_mode='concat'
                )(inputs)
            else:
                # Fail early instead of returning an undefined `h`.
                raise ValueError("rnn_type must be 'gru' or 'lstm'")
            return h  # shape = (None, maxlen, 2*rnn_output_dim)
        
        self.X = tf.placeholder(tf.int32, [None, maxlen], name='input_x')
        self.Y = tf.placeholder(tf.int64, [None])
        
        self.encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1), trainable=False)
        encoder_embedded = tf.nn.embedding_lookup(self.encoder_embeddings, self.X)

        outputs = bi_rnn(rnn_type = rnn_type, inputs = encoder_embedded, rnn_output_dim = rnn_output_dim)
        
        outputs = tf.nn.dropout(outputs, keep_prob = self.droput_rate)
 
        # Output at the last time step of each text. Its backward half has only read the final token.
        self.logits = keras.layers.Dense(label_num, use_bias=True)(outputs[:,-1])
        self.probability = tf.nn.softmax(self.logits, name='probability')

        self.cost = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=self.Y, logits=self.logits)
        self.cost = tf.reduce_mean(self.cost)
        self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)
        self.pre_y = tf.argmax(self.logits, 1, name='pre_y')
        correct_pred = tf.equal(self.pre_y, self.Y)
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

2.4.3 Bi_RNN_Attention

'''
    Text => Bidirectional GRU or LSTM => Word_Attention => Fully_Connected => Softmax
'''
class Bi_RNN_Attention:
    def __init__(self, rnn_output_dim, embedded_size,
                 dict_size, maxlen, label_num, learning_rate, attention_size, rnn_type):
        
        self.droput_rate = 0.5  # passed to tf.nn.dropout as keep_prob; dropout is applied unconditionally, also at inference

        # print('model_Name:', 'Bi_RNN_Attention')

        # Encode the embedded text with a bidirectional GRU or LSTM.
        def bi_rnn(rnn_type, inputs, rnn_output_dim):
            if rnn_type == 'gru':
                h = keras.layers.Bidirectional(
                    keras.layers.GRU(rnn_output_dim, return_sequences=True, unroll=True),
                    merge_mode='concat'
                )(inputs)
            elif rnn_type == 'lstm':
                h = keras.layers.Bidirectional(
                    keras.layers.LSTM(rnn_output_dim, return_sequences=True, unroll=True),
                    merge_mode='concat'
                )(inputs)
            else:
                # Fail early instead of returning an undefined `h`.
                raise ValueError("rnn_type must be 'gru' or 'lstm'")
            return h  # shape = (None, maxlen, 2*rnn_output_dim)
        
        self.X = tf.placeholder(tf.int32, [None, maxlen], name='input_x')
        self.Y = tf.placeholder(tf.int64, [None])
        
        self.encoder_embeddings = tf.Variable(tf.random_uniform([dict_size, embedded_size], -1, 1), trainable=False)
        encoder_embedded = tf.nn.embedding_lookup(self.encoder_embeddings, self.X)

        outputs = bi_rnn(rnn_type = rnn_type, inputs = encoder_embedded, rnn_output_dim = rnn_output_dim) # shape = [None, maxlen, 2*rnn_output_dim]
        
        outputs = tf.nn.dropout(outputs, keep_prob = self.droput_rate)

        '''
            Word Attention Layer
        '''
        attention_w = tf.get_variable("attention_v", [attention_size], tf.float32)
        query = keras.layers.Dense(attention_size)(tf.expand_dims(outputs[:,-1], 1)) # shape = [None, 1, attention_size]
        keys = keras.layers.Dense(attention_size)(outputs) # shape = [None, maxlen, attention_size]
        self.attention = tf.reduce_sum(attention_w * tf.tanh(keys + query), 2) # unnormalized scores, shape = [None, maxlen]
        self.attention = tf.nn.softmax(self.attention, name='attention') # weights sum to 1 over maxlen
        # Attention-weighted sum of the hidden states: [None, 2*rnn_output_dim, maxlen] x [None, maxlen, 1]
        outputs = tf.squeeze(
            tf.matmul(tf.transpose(outputs, [0, 2, 1]), tf.expand_dims(self.attention, 2)), # shape = [None, 2*rnn_output_dim, 1]
            2) # shape = [None, 2*rnn_output_dim]

        self.logits = keras.layers.Dense(label_num, use_bias=True)(outputs)
        self.probability = tf.nn.softmax(self.logits, name='probability')

        self.cost = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=self.Y, logits=self.logits)
        self.cost = tf.reduce_mean(self.cost)
        self.optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(self.cost)
        self.pre_y = tf.argmax(self.logits, 1, name='pre_y')
        correct_pred = tf.equal(self.pre_y, self.Y)
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

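2.5 Training the model

1. Split the preprocessed dataset into 80% training, 10% validation, and 10% test
2. Choose one of the models: Multi_LSTM, Bi_RNN, or Bi_RNN_Attention
3. After each pass over the training set, evaluate on the validation set
4. When the validation accuracy has dropped 5 times in a row, stop step 3 and report the test-set result as the model's final performance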
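The split and the stopping rule can be sketched independently of any particular model. In the sketch below, `run_epoch` and `evaluate_val` are hypothetical callbacks standing in for one training pass and one validation run, and `max_epochs` is an added safety cap that the procedure above does not mention. "Dropped" is read here as being lower than the previous epoch's validation accuracy.

```python
import random

def split_indices(n, seed=0):
    """Shuffle sample indices and split them 80% / 10% / 10%."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def train_with_early_stopping(run_epoch, evaluate_val, patience=5, max_epochs=100):
    """Train until validation accuracy has dropped `patience` times in a row."""
    previous = None
    drops = 0
    history = []
    for _ in range(max_epochs):
        run_epoch()           # one pass over the training set
        acc = evaluate_val()  # accuracy on the validation set
        history.append(acc)
        if previous is not None and acc < previous:
            drops += 1
        else:
            drops = 0         # any non-decrease resets the streak
        previous = acc
        if drops >= patience:
            break
    return history
```

After the loop ends, the model would be evaluated once on the test indices to obtain the final reported performance.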
