Implementing an attention mechanism for text classification in Keras

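The Keras version used here does not provide an attention layer, so this post walks through the attention implementation from a Kaggle kernel. It is also a chance to learn how to write a custom layer in Keras.
I also wanted to get familiar with how attention looks in code. The attention layer in this post is meant for text classification, so it differs somewhat from the attention used in encoder-decoder models.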

1 Keras source code reference

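The Keras documentation is very terse on this topic: it gives a skeleton class and then tells you to look at how the other layers in the source are written. So let's look at one of the simplest, Dense, which is Keras's fully connected layer. As an aside, every framework names this layer differently: Caffe calls it InnerProduct (the IP layer), PyTorch calls it Linear, and in TensorFlow you used to write it by hand with matmul, though it is now called dense as well, which makes sense since Keras is part of the family.
The source I downloaded is version 2.2.4. The Dense class lives in keras/layers/core.py.
Here is Dense's call() method, which is where a layer's actual logic goes. The logic is very simple: multiply the inputs by the kernel with K.dot, add the bias if there is one, and apply the activation if there is one. self.kernel is created in build().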

    def call(self, inputs):
        output = K.dot(inputs, self.kernel)
        if self.use_bias:
            output = K.bias_add(output, self.bias, data_format='channels_last')
        if self.activation is not None:
            output = self.activation(output)
        return output
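
To make the three steps concrete, here is a minimal NumPy sketch of the same computation. It is not Keras code: dense_forward and the shapes below are names and values made up for illustration, and the kernel is assumed to have shape (input_dim, units).

```python
import numpy as np


def dense_forward(inputs, kernel, bias=None, activation=None):
    """NumPy stand-in for the three steps of Dense.call()."""
    output = inputs @ kernel        # plays the role of K.dot(inputs, self.kernel)
    if bias is not None:
        output = output + bias      # bias is broadcast over the batch dimension
    if activation is not None:
        output = activation(output)
    return output


rng = np.random.default_rng(0)
batch_size, input_dim, units = 4, 5, 3
x = rng.normal(size=(batch_size, input_dim))
kernel = rng.normal(size=(input_dim, units))  # note: no batch dimension
bias = np.zeros(units)

y = dense_forward(x, kernel, bias, activation=np.tanh)
print(y.shape)  # (4, 3)
```

The same kernel works for any batch size, because only the last dimension of the input has to match.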

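Here is Dense's build() method. It creates self.kernel with shape (input_shape[-1], self.units), where self.units is the argument passed in when the Dense object is constructed. Whether you write a neural network from scratch in NumPy or use a framework, you always have to remember that one dimension of every ndarray is the batch dimension, which is something I tend to forget.
When allocating the parameters, though, the batch dimension plays no role, because every batch uses the same parameters. That is why the shape is (input_dim, self.units).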

    def build(self, input_shape):
        assert len(input_shape) >= 2
        input_dim = input_shape[-1]

        self.kernel = self.add_weight(shape=(input_dim, self.units),
                                      initializer=self.kernel_initializer,
                                      name='kernel',
                                      regularizer=self.kernel_regularizer,
                                      constraint=self.kernel_constraint)
        if self.use_bias:
            self.bias = self.add_weight(shape=(self.units,),
                                        initializer=self.bias_initializer,
                                        name='bias',
                                        regularizer=self.bias_regularizer,
                                        constraint=self.bias_constraint)
        else:
            self.bias = None
        self.input_spec = InputSpec(min_ndim=2, axes={-1: input_dim})
        self.built = True

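In build(), everything except the last two lines is straightforward. self.built = True must be set; it marks the layer as built, presumably meaning that its weights have been allocated. Calling super(MyLayer, self).build(input_shape) instead works too, since build() in the base Layer class contains nothing but self.built = True.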
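
A rough sketch of why such a flag is useful, assuming the base class builds a layer lazily on its first call and checks built so the weights are not created twice. TinyLayer below is a hypothetical stand-in written for illustration, not the Keras Layer class, and it only tracks shapes.

```python
class TinyLayer:
    """Hypothetical stand-in for a layer base class; not the Keras API."""

    def __init__(self, units):
        self.units = units
        self.built = False
        self.kernel_shape = None

    def build(self, input_shape):
        # The weights depend only on the feature dimension, not on the batch size.
        self.kernel_shape = (input_shape[-1], self.units)
        self.built = True  # mark the layer as built so build() is not repeated

    def __call__(self, input_shape):
        if not self.built:
            self.build(input_shape)
        return input_shape[:-1] + (self.units,)


layer = TinyLayer(units=32)
print(layer.built)         # False
print(layer((None, 64)))   # (None, 32)
print(layer.built)         # True
print(layer.kernel_shape)  # (64, 32)
```
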
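For InputSpec, have a look at its source in engine/base_layer.py. It defines __repr__, a standard Python special method used for display: printing an object calls __repr__ when the class defines no __str__. So print(model.layers[2].input_spec) shows the dtype, shape and other constraints that layer places on its input. model.summary() is more useful, though: it shows the model structure in more detail, including how the layers are connected.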
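
The __repr__ behaviour is plain Python and easy to try in isolation. ShapeSpec below is a hypothetical, much simplified stand-in, not the real InputSpec; it only shows that print() ends up calling __repr__.

```python
class ShapeSpec:
    """Hypothetical, simplified stand-in for an input specification."""

    def __init__(self, dtype=None, shape=None, min_ndim=None):
        self.dtype = dtype
        self.shape = shape
        self.min_ndim = min_ndim

    def __repr__(self):
        return 'ShapeSpec(dtype={}, shape={}, min_ndim={})'.format(
            self.dtype, self.shape, self.min_ndim)


spec = ShapeSpec(dtype='float32', min_ndim=2)
print(spec)  # ShapeSpec(dtype=float32, shape=None, min_ndim=2)
```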

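Here is Dense's compute_output_shape(), which computes the shape of the layer's output. When an intermediate result in Keras is one vector per sample, its shape is (None, dim). As the class docstring notes, shape[0] is the batch dimension (the batch_size), and None means it is compatible with any value.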

        # now: model.output_shape == (None, 32)
        # note: `None` is the batch dimension

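So for a 2D input the output shape of a fully connected layer is (None, self.units); more generally, only the last dimension is replaced by self.units.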

    def compute_output_shape(self, input_shape):
        assert input_shape and len(input_shape) >= 2
        assert input_shape[-1]
        output_shape = list(input_shape)
        output_shape[-1] = self.units
        return tuple(output_shape)
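
The same logic can be tried without Keras. This is a sketch that copies the method body into a plain function; dense_output_shape is a name chosen here for illustration.

```python
def dense_output_shape(input_shape, units):
    """Same logic as Dense.compute_output_shape, as a plain function."""
    assert input_shape and len(input_shape) >= 2
    assert input_shape[-1]
    output_shape = list(input_shape)
    output_shape[-1] = units  # only the last dimension changes
    return tuple(output_shape)


print(dense_output_shape((None, 64), 32))      # (None, 32)
print(dense_output_shape((None, 10, 64), 32))  # (None, 10, 32)
```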

2 The attention mechanism

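The screenshots referred to in this section are from Neural Machine Translation by Jointly Learning to Align and Translate (posted to arXiv in 2014 and published at ICLR 2015), which is generally credited as the paper that introduced attention.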

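The red box marks where the attention weights come from: $e_{ij}$ is computed from $s_{i-1}$ and $h_j$. The paper also states that the mechanism is built on the encoder-decoder framework. It calls $h_j$ an annotation and $s_i$ a hidden state; in other words, $s_i$ is the i-th hidden state of the decoder and $h_j$ is the j-th hidden state of the encoder. Note that $s_i$ also depends on $y_{i-1}$, which the figure does not show; the paper's appendix gives the full derivation.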
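
For reference, the relations between these quantities can be written out as follows. This is a summary in the paper's notation rather than a quotation: $v_a$, $W_a$ and $U_a$ are the learned parameters of the alignment model $a$, and $T_x$ is the length of the source sentence.

```latex
\begin{aligned}
e_{ij} &= a(s_{i-1}, h_j) = v_a^{\top} \tanh\left(W_a s_{i-1} + U_a h_j\right) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
c_i &= \sum_{j=1}^{T_x} \alpha_{ij} h_j
\end{aligned}
```

The weights $\alpha_{ij}$ are a softmax over the scores $e_{ij}$ for a fixed decoder step $i$, and the context vector $c_i$ is the weighted sum of the encoder annotations.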

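$s_{i-1}$ and $h_j$ are both vectors. The dimension of $s_{i-1}$ is the number of hidden units of the decoder RNN, i.e. the size of the decoder's per-timestep output before it is projected onto the vocabulary to produce $y_i$. The dimension of $h_j$ is set by the number of hidden units chosen for the encoder RNN (twice that number for the bidirectional encoder used in the paper).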

3 Attention for text classification

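Encoder-decoder models mainly address the problem that the input and output sequences have different lengths, as in speech recognition (audio to text) or machine translation (one language to another). Text classification has no need for the full encoder-decoder model; it is enough to take part of the architecture from Neural Machine Translation by Jointly Learning to Align and Translate. The formula for $e_{ij}$ then becomes
$e_{j} = a(h_j)$
This is also what the NAACL 2016 paper Hierarchical Attention Networks for Document Classification does. Some papers seem to consider their datasets too well known to describe, but this one summarizes them in a table. Document Classification here simply means text classification, and the texts in the datasets it uses are all fairly short.
The Hierarchical in the title refers to applying attention at two levels. I have not read the paper in detail yet, only looked at the model architecture. Structurally, a document consists of several sentences (separated by periods, question marks and similar punctuation) and each sentence consists of several words, so the model first runs GRU+Attention over the word vectors of each sentence to obtain a sentence representation, and then runs GRU+Attention over the sentence representations to classify the document.
The paper reports good results. Here is my take on it: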

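What is the point of the hierarchy, i.e. first building sentence vectors from word vectors and then running a GRU over the sentence vectors? My first feeling was that running the GRU directly over the word vectors and then applying attention should do about as well, and the paper does not spell out the motivation (or maybe I did not read it carefully enough). My guess at why the results are good: classifying directly from word vectors means a long input sequence, whereas classifying from sentence representations means a much shorter one, so the document representation can retain more information. The sentence representations themselves are built from word vectors, and each sentence has limited length, which also helps retain information. That is probably the real point of being hierarchical.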

Another very nice property of attention is that it tells you how much each sentence in a document, and each word in a sentence, contributed to the result.
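
As a small illustration of reading off those contributions, here is a NumPy sketch of attention pooling for a single sequence, using a simplified score $e_j = \tanh(h_j \cdot w + b_j)$. The function name attention_pool and the random h, w and b are stand-ins invented for this example; in a real model h would be the GRU outputs and w, b would be learned.

```python
import numpy as np


def attention_pool(h, w, b):
    """h: (timesteps, units) RNN outputs; w: (units,); b: (timesteps,)."""
    e = np.tanh(h @ w + b)            # one score per timestep, e_j = a(h_j)
    a = np.exp(e) / np.exp(e).sum()   # softmax over the timesteps
    c = (a[:, None] * h).sum(axis=0)  # weighted sum of the timestep vectors
    return c, a


rng = np.random.default_rng(0)
timesteps, units = 6, 4
h = rng.normal(size=(timesteps, units))  # stand-in for GRU outputs
w = rng.normal(size=units)               # stand-in for a learned weight vector
b = np.zeros(timesteps)

context, weights = attention_pool(h, w, b)
print(weights.round(3))       # one weight per timestep; they sum to 1
print(int(weights.argmax()))  # index of the most influential timestep
```

The weights vector is what one would visualize to see which words or sentences the model attended to.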

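(Wait, why does this remind me of InfoGAN, where you can see how each dimension of the random variable affects the complex samples it generates? Is there something in common with attention?)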

4 Implementation with comments

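First, be clear about the input and output of this Attention layer. The input is the output of an RNN, and the output shape of a Keras RNN is documented in its docstring: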

    # Output shape
        - if `return_state`: a list of tensors. The first tensor is
            the output. The remaining tensors are the last states,
            each with shape `(batch_size, units)`.
        - if `return_sequences`: 3D tensor with shape
            `(batch_size, timesteps, units)`.
        - else, 2D tensor with shape `(batch_size, units)`.

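The return_state argument can be ignored here: for a single-layer RNN it makes no difference (I read the source but did not test it), because the hidden state equals the output (meaning the output of the RNN as a component of the model, before any softmax or similar function). For an LSTM the returned states additionally include the cell state, but the attention layer only consumes the output tensor.
Attention needs the output at every timestep, so the RNN must be created with return_sequences=True.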
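
The difference between the two output shapes can be seen with a hand-written tanh RNN in NumPy. This is only a sketch: simple_rnn, W_x and W_h are hypothetical names, there is no bias, and it is not how Keras implements its RNN layers.

```python
import numpy as np


def simple_rnn(x, W_x, W_h, return_sequences=False):
    """x: (batch_size, timesteps, input_dim). Plain tanh RNN, zero initial state."""
    batch_size, timesteps, _ = x.shape
    units = W_h.shape[0]
    h = np.zeros((batch_size, units))
    outputs = []
    for t in range(timesteps):
        # The new hidden state is also this timestep's output.
        h = np.tanh(x[:, t, :] @ W_x + h @ W_h)
        outputs.append(h)
    sequence = np.stack(outputs, axis=1)        # (batch_size, timesteps, units)
    return sequence if return_sequences else h  # else (batch_size, units)


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 7, 5))
W_x = rng.normal(size=(5, 3))
W_h = rng.normal(size=(3, 3))

seq = simple_rnn(x, W_x, W_h, return_sequences=True)
last = simple_rnn(x, W_x, W_h)
print(seq.shape, last.shape)             # (2, 7, 3) (2, 3)
print(np.allclose(seq[:, -1, :], last))  # True
```

Without return_sequences only the last timestep survives, which leaves nothing for an attention layer to weight.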

Reference code:

from keras import backend as K
from keras import constraints, initializers, regularizers
from keras.layers import Layer


class Attention(Layer):
    '''
    Returns:
        Not the attention weights, but the vector obtained by multiplying
        each timestep by its weight and summing over the timesteps.
    Input:
        step_dim is the number of RNN timesteps, i.e. the length of the
        longest input sequence.
    '''
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):
        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0
        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        # shape is passed by keyword so the call does not rely on the
        # legacy positional form of add_weight
        self.W = self.add_weight(shape=(input_shape[-1],),
                                 initializer=self.init,
                                 name='{}_W'.format(self.name),
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(shape=(input_shape[1],),
                                     initializer='zeros',
                                     name='{}_b'.format(self.name),
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, inputs, mask=None):
        # Later layers no longer need the mask, so simply return None
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        # step_dim is the argument we passed in; it should equal
        # input_shape[1], i.e. the number of RNN timesteps
        step_dim = self.step_dim

        # After reshaping the input and the weight and taking the dot product,
        # the tensor has shape (batch_size * timesteps, 1). Each sample has to
        # be normalized separately afterwards, so the scores are reshaped back
        # to (batch_size, timesteps)
        eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                        K.reshape(self.W, (features_dim, 1))), (-1, step_dim))

        if self.bias:
            eij += self.b
        # tanh is the usual default RNN activation; for attention the choice
        # of activation matters little because a softmax follows
        eij = K.tanh(eij)

        a = K.exp(eij)

        if mask is not None:
            # If an earlier layer supplied a mask, the masked timesteps must
            # not contribute to the output, so their weights are set to 0
            a *= K.cast(mask, K.floatx())
        # cast converts the dtype; Keras checks dtypes during computation
        # (possibly because of the GPU). K.epsilon() guards against
        # division by zero
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())

        # Same as K.expand_dims(a, axis=-1): axis defaults to -1, which
        # appends a dimension at the end, e.g. shape (3,) becomes (3, 1)
        a = K.expand_dims(a)
        # Now a.shape = (batch_size, timesteps, 1) and
        # x.shape = (batch_size, timesteps, units)
        weighted_input = x * a

        # weighted_input has shape (batch_size, timesteps, units): each
        # timestep's output vector has been multiplied by its weight.
        # Summing over axis=1 gives shape (batch_size, units)
        return K.sum(weighted_input, axis=1)

    def compute_output_shape(self, input_shape):
        # The result is the context vector c, with shape (batch_size, units)
        return input_shape[0], self.features_dim
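
The masking and normalization step in call() is the part that is easiest to get wrong, so here is a NumPy sketch of just that step. The function name masked_attention_weights, the example scores and the eps value are assumptions made for illustration; eps plays the role of K.epsilon().

```python
import numpy as np


def masked_attention_weights(scores, mask, eps=1e-7):
    """scores: (batch_size, timesteps) after tanh; mask: same shape, 1 = real token."""
    a = np.exp(scores)
    a = a * mask.astype(a.dtype)                  # masked timesteps get weight 0
    a = a / (a.sum(axis=1, keepdims=True) + eps)  # renormalize each sample separately
    return a


scores = np.tanh(np.array([[0.5, -1.0, 2.0, 0.0],
                           [1.0, 1.0, 1.0, 1.0]]))
mask = np.array([[1, 1, 0, 0],   # the last two timesteps are padding
                 [1, 1, 1, 1]])

weights = masked_attention_weights(scores, mask)
print(weights.round(3))
print(weights.sum(axis=1))  # each row sums to (almost exactly) 1
```

Multiplying by the mask before dividing is what keeps the remaining weights a proper distribution over the unmasked timesteps.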

5 An analogy

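An LSTM is like a person with limited memory: some things are simply meant to be forgotten.
Attention is like looking back over a long life full of people and events: when you recall a particular person or a particular thing, probably only a handful of memories are relevant, and those are the ones that still shine.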
