Introduction to Recurrent Neural Networks

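In a recurrent neural network, the input data are temporally correlated: the data at one time step influence the data at the next. Let $\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$ be the mini-batch input at time step $t$ of the sequence, and let $\boldsymbol{H}_t \in \mathbb{R}^{n \times h}$ be the hidden variable at that time step. Unlike a multilayer perceptron, here we keep the hidden variable $\boldsymbol{H}_{t-1}$ from the previous time step and introduce a new weight parameter $\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}$, which describes how the previous time step's hidden variable is used at the current time step. Specifically, the hidden variable at time step $t$ is determined jointly by the current input and the previous hidden variable: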

$\boldsymbol{H}_t = \phi(\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh} + \boldsymbol{b}_h).$

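Compared with a multilayer perceptron, we have added the term $\boldsymbol{H}_{t-1} \boldsymbol{W}_{hh}$. From the relationship between the hidden variables $\boldsymbol{H}_t$ and $\boldsymbol{H}_{t-1}$ of adjacent time steps in the formula above, we can see that the hidden variable captures the history of the sequence up to the current time step, much like the state or memory of the network at that time step. For this reason the hidden variable is also called the hidden state. The computation of $\boldsymbol{X}_t \boldsymbol{W}_{xh} + \boldsymbol{H}_{t-1} \boldsymbol{W}_{hh}$ in the hidden state is equivalent to multiplying the matrix obtained by concatenating $\boldsymbol{X}_t$ and $\boldsymbol{H}_{t-1}$ (side by side, along columns) by the matrix obtained by concatenating $\boldsymbol{W}_{xh}$ and $\boldsymbol{W}_{hh}$ (stacked along rows). Because the definition of the hidden state at the current time step uses the hidden state of the previous time step, the computation is recurrent. A network that uses this recurrent computation is a recurrent neural network.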
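The equivalence between the two separate products and the single product of concatenated matrices can be checked numerically. The following is a minimal NumPy sketch; the sizes `n`, `d`, `h` and the random matrices are made-up values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 3, 4, 5  # hypothetical batch size, input size, hidden size

X_t = rng.normal(size=(n, d))
H_prev = rng.normal(size=(n, h))
W_xh = rng.normal(size=(d, h))
W_hh = rng.normal(size=(h, h))

# Two separate products, as written in the formula.
separate = X_t @ W_xh + H_prev @ W_hh

# One product of the concatenated matrices:
# [X_t, H_prev] has shape (n, d + h); [W_xh; W_hh] has shape (d + h, h).
XH = np.concatenate([X_t, H_prev], axis=1)
W = np.concatenate([W_xh, W_hh], axis=0)
combined = XH @ W

print(np.allclose(separate, combined))
```

Both routes give an `(n, h)` matrix, which is why the hidden layer can be viewed as a single fully connected layer applied to the concatenated input.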

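Recurrent neural networks can be constructed in many different ways. The one whose hidden state is defined by the formula above is a very common variant. At time step $t$, the output of the output layer is computed much as in a multilayer perceptron: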

$\boldsymbol{O}_t = \boldsymbol{H}_t \boldsymbol{W}_{hq} + \boldsymbol{b}_q.$

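The parameters of the recurrent neural network are the hidden-layer weights $\boldsymbol{W}_{xh} \in \mathbb{R}^{d \times h}$ and $\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}$ and bias $\boldsymbol{b}_h \in \mathbb{R}^{1 \times h}$, together with the output-layer weight $\boldsymbol{W}_{hq} \in \mathbb{R}^{h \times q}$ and bias $\boldsymbol{b}_q \in \mathbb{R}^{1 \times q}$. Note that the network uses these same parameters at every time step. As a result, the number of model parameters does not grow with the number of time steps.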
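To make the parameter count concrete, here is a small sketch that adds up the sizes of the five parameter arrays. The helper name `rnn_param_count` is hypothetical; the sizes plugged in (a vocabulary of 1027 characters and 256 hidden units) are the ones that appear later in this article.

```python
def rnn_param_count(d, h, q):
    """Number of scalars in W_xh, W_hh, b_h, W_hq and b_q."""
    return d * h + h * h + h + h * q + q

# d = q = vocabulary size, h = number of hidden units.
count = rnn_param_count(d=1027, h=256, q=1027)
print(count)  # 592643
```

The number of time steps is not an argument of the function, which mirrors the point made above: unrolling the network over a longer sequence reuses the same parameters.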

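The figure above shows the computation of a recurrent neural network over three adjacent time steps. At time step $t$, the computation of the hidden state can be viewed as concatenating the input $\boldsymbol{X}_t$ with the previous hidden state $\boldsymbol{H}_{t-1}$ and feeding the result into a fully connected layer with activation function $\phi$. The output of this fully connected layer is the current hidden state $\boldsymbol{H}_t$; its weight is the concatenation of $\boldsymbol{W}_{xh}$ and $\boldsymbol{W}_{hh}$, and its bias is $\boldsymbol{b}_h$. The hidden state $\boldsymbol{H}_t$ of the current time step $t$ takes part in computing the hidden state $\boldsymbol{H}_{t+1}$ of the next time step $t+1$, and is also fed into the fully connected output layer of the current time step.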

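Code Implementation

In this part, we implement a character-level recurrent neural network language model from scratch and train it on a dataset of Jay Chou album lyrics so that it can generate lyrics.

1. Import the required libraries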

import tensorflow as tf
from tensorflow import keras
import numpy as np
import random  # needed by data_iter_random below
import zipfile
import math

2. Load the Jay Chou lyrics dataset

def load_data_jay_lyrics():
    # Read the lyrics text from the zip archive
    with zipfile.ZipFile('./jaychou_lyrics.txt.zip') as zin:
        with zin.open('jaychou_lyrics.txt') as f:
            corpus_chars = f.read().decode('utf-8')
    # Replace line breaks with spaces and keep only the first 10000 characters
    corpus_chars = corpus_chars.replace('\n', ' ').replace('\r', ' ')
    corpus_chars = corpus_chars[0:10000]
    # Map every distinct character to an integer index and back
    idx_to_char = list(set(corpus_chars))
    char_to_idx = dict([(char, i) for i, char in enumerate(idx_to_char)])
    vocab_size = len(char_to_idx)
    # Represent the corpus as a list of character indices
    corpus_indices = [char_to_idx[char] for char in corpus_chars]
    return corpus_indices, char_to_idx, idx_to_char, vocab_size

(corpus_indices, char_to_idx, idx_to_char, vocab_size) = load_data_jay_lyrics()
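The index mapping built by `load_data_jay_lyrics` can be tried on a short string without the dataset file. This is a minimal sketch; the string `'hello world'` is a made-up stand-in for the lyrics text.

```python
corpus_chars = 'hello world'  # hypothetical stand-in for the lyrics text

# Same mapping steps as in load_data_jay_lyrics
idx_to_char = list(set(corpus_chars))
char_to_idx = {char: i for i, char in enumerate(idx_to_char)}
vocab_size = len(char_to_idx)
corpus_indices = [char_to_idx[char] for char in corpus_chars]

# Decoding the indices recovers the original text
decoded = ''.join(idx_to_char[i] for i in corpus_indices)
print(vocab_size)
print(decoded)
```

Because the vocabulary comes from a `set`, the particular integer assigned to each character may differ from run to run, but the round trip from characters to indices and back is always consistent within one run.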

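Readers who have questions about this part, or about the sampling in the next part, can refer to the article "TensorFlow 2.0: Preprocessing a Language Model Dataset (Jay Chou Album Lyrics)".

3. Define the sampling functions

3.1 Random sampling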

def data_iter_random(corpus_indices, batch_size, num_steps):
    # Subtract 1 because the label sequence is the input shifted by one position
    num_examples = (len(corpus_indices)-1) // num_steps
    epoch_size = num_examples // batch_size
    example_indices = list(range(num_examples))
    random.shuffle(example_indices)

    # Return a sequence of length num_steps starting at pos
    def _data(pos):
        return corpus_indices[pos: pos + num_steps]

    for i in range(epoch_size):
        # Read batch_size random samples each time
        i = i * batch_size
        batch_indices = example_indices[i: i + batch_size]
        X = [_data(j * num_steps) for j in batch_indices]
        Y = [_data(j * num_steps + 1) for j in batch_indices]
        yield np.array(X), np.array(Y)
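To see what random sampling produces, the sketch below runs the same logic on a toy sequence of the integers 0 to 29, so that each label is easy to recognise as the input shifted by one. The function is repeated in compact form only so that the snippet runs on its own; the toy sequence and sizes are made up.

```python
import random
import numpy as np

def data_iter_random(corpus_indices, batch_size, num_steps):
    # Same logic as the function above, repeated so the sketch is self-contained
    num_examples = (len(corpus_indices) - 1) // num_steps
    epoch_size = num_examples // batch_size
    example_indices = list(range(num_examples))
    random.shuffle(example_indices)
    for i in range(epoch_size):
        batch = example_indices[i * batch_size: (i + 1) * batch_size]
        X = [corpus_indices[j * num_steps: j * num_steps + num_steps] for j in batch]
        Y = [corpus_indices[j * num_steps + 1: j * num_steps + num_steps + 1] for j in batch]
        yield np.array(X), np.array(Y)

toy_sequence = list(range(30))  # hypothetical corpus
batches = list(data_iter_random(toy_sequence, batch_size=2, num_steps=6))
for X, Y in batches:
    print('X:', X)
    print('Y:', Y)
```

The order of the samples changes between runs, but every `Y` equals the corresponding `X` plus one, and samples in successive mini-batches are generally not adjacent in the original sequence.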

3.2 Consecutive sampling

def data_iter_consecutive(corpus_indices, batch_size, num_steps, ctx=None):
    corpus_indices = np.array(corpus_indices)
    data_len = len(corpus_indices)
    batch_len = data_len // batch_size
    indices = corpus_indices[0: batch_size*batch_len].reshape((
        batch_size, batch_len))
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        i = i * num_steps
        X = indices[:, i: i + num_steps]
        Y = indices[:, i + 1: i + num_steps + 1]
        yield X, Y
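With consecutive sampling, each row of one mini-batch continues exactly where the same row of the previous mini-batch ended. The sketch below checks this on a toy sequence of the integers 0 to 29; the function is repeated in compact form (without the unused `ctx` argument) only so that the snippet is self-contained, and the sizes are made up.

```python
import numpy as np

def data_iter_consecutive(corpus_indices, batch_size, num_steps):
    # Same logic as the function above, repeated so the sketch is self-contained
    corpus_indices = np.array(corpus_indices)
    batch_len = len(corpus_indices) // batch_size
    indices = corpus_indices[:batch_size * batch_len].reshape((batch_size, batch_len))
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):
        i = i * num_steps
        yield indices[:, i: i + num_steps], indices[:, i + 1: i + num_steps + 1]

toy_sequence = list(range(30))  # hypothetical corpus
batches = list(data_iter_consecutive(toy_sequence, batch_size=2, num_steps=6))
(X0, Y0), (X1, Y1) = batches
print(X0)
print(X1)
```

This adjacency is what makes it meaningful to carry the hidden state from one mini-batch to the next when consecutive sampling is used.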

4. One-hot vectors

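To feed characters into the neural network as vectors, a simple approach is to use one-hot vectors. Suppose the vocabulary contains $N$ distinct characters (the vocabulary size, vocab_size) and each character corresponds to a unique integer index from 0 to $N-1$; the dictionary holding this character-to-index mapping is the char_to_idx obtained in part 2. If a character has index $i$, we create an all-zero vector of length $N$ and set the element at position $i$ to 1. This vector is the one-hot vector of that character.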

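Each mini-batch we sample has shape (batch size, number of time steps). The function below converts such a mini-batch into several matrices of shape (batch size, vocabulary size) that can be fed into the network; the number of matrices equals the number of time steps. In other words, the input at time step $t$ is $\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$, where $n$ is the batch size and $d$ is the number of inputs, that is, the length of the one-hot vector (the vocabulary size).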

def to_onehot(X, size):
    # X shape: (batch, steps); returns a list of `steps` tensors,
    # each of shape (batch, size), one per time step
    return [tf.one_hot(x, size, dtype=tf.float32) for x in X.T]

For example:

X = np.arange(10).reshape((2, 5))
inputs = to_onehot(X, vocab_size)

Then X is:

[[0 1 2 3 4]
 [5 6 7 8 9]]

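This is a mini-batch of 2 samples with 5 time steps each. In other words, the inputs at the first time step are 0 and 5, the inputs at the second time step are 1 and 6, and so on.
After the one-hot transformation, inputs is: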

[<tf.Tensor: id=9, shape=(2, 1027), dtype=float32, numpy=
array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: id=14, shape=(2, 1027), dtype=float32, numpy=
array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: id=19, shape=(2, 1027), dtype=float32, numpy=
array([[0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: id=24, shape=(2, 1027), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Tensor: id=29, shape=(2, 1027), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>]

It contains the data for 5 time steps: the first entry holds the one-hot encodings of 0 and 5, and so on.
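In the printed tensors, most of the ones are hidden behind the `...` because the vocabulary is large. The following NumPy sketch reproduces the same layout with a small vocabulary so that every entry is visible. The helper `to_onehot_np` and the vocabulary size of 12 are assumptions made for illustration; it is a stand-in for `to_onehot`, not the function used in the rest of the article.

```python
import numpy as np

def to_onehot_np(X, size):
    # NumPy stand-in for to_onehot: one (batch, size) matrix per time step
    eye = np.eye(size, dtype=np.float32)
    return [eye[x] for x in X.T]

X = np.arange(10).reshape((2, 5))
inputs = to_onehot_np(X, size=12)  # hypothetical small vocabulary

print(len(inputs), inputs[0].shape)
print(inputs[0])  # rows are the one-hot encodings of 0 and 5
```

Iterating over `X.T` is what turns the (batch, steps) layout into one matrix per time step.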

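5. Initialize the model parameters

From the introduction to recurrent neural networks above, we obtain the shapes of all parameters (weights and biases):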

$\boldsymbol{W}_{xh} \in \mathbb{R}^{d \times h}$
$\boldsymbol{W}_{hh} \in \mathbb{R}^{h \times h}$
$\boldsymbol{b}_h \in \mathbb{R}^{1 \times h}$
$\boldsymbol{W}_{hq} \in \mathbb{R}^{h \times q}$
$\boldsymbol{b}_q \in \mathbb{R}^{1 \times q}$

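Here, $d$ is the dimension of each (one-hot encoded) input sample, that is, the vocabulary size; $h$ is the number of neurons in the hidden layer; $q$ is the dimension of the output vector. Because the output vector holds the scores from which the probability of each character at the next time step is derived, $q$ also equals the vocabulary size.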

num_epochs = 2500  # number of training epochs
num_steps = 35  # number of time steps per sample
num_inputs, num_hiddens, num_outputs = vocab_size, 256, vocab_size

def get_params():
    def _one(shape):
        # Weights are drawn from a normal distribution with mean 0 and std 0.01
        return tf.Variable(tf.random.normal(shape=shape,
                                             stddev=0.01,
                                             mean=0,
                                             dtype=tf.float32))

    # Hidden layer parameters
    W_xh = _one((num_inputs, num_hiddens))
    W_hh = _one((num_hiddens, num_hiddens))
    b_h = tf.Variable(tf.zeros(num_hiddens), dtype=tf.float32)
    # Output layer parameters
    W_hq = _one((num_hiddens, num_outputs))
    b_q = tf.Variable(tf.zeros(num_outputs), dtype=tf.float32)
    params = [W_xh, W_hh, b_h, W_hq, b_q]
    return params

6. Define the model

Before defining the model, let us summarize the shapes of the input data and the hidden state at a single time step:

$\boldsymbol{X}_t \in \mathbb{R}^{n \times d}$
$\boldsymbol{H}_t \in \mathbb{R}^{n \times h}$

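Here, $n$ is the number of samples in each batch; $d$ is the dimension of each (one-hot encoded) input sample, that is, the vocabulary size; $h$ is the number of neurons in the hidden layer.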

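6.1 Initialize the hidden state

We implement the model following the computation formulas of the recurrent neural network. First we define the init_rnn_state function, which returns the initial hidden state: a tuple containing a single all-zero tensor of shape (batch size, number of hidden units). Using a tuple makes it easier to handle cases in which the hidden state consists of several arrays.

# Return the initial hidden state
def init_rnn_state(batch_size):
    return (tf.zeros(shape=(batch_size, num_hiddens)), )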

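6.2 Compute the hidden state and output at each time step

The activation function used here is tanh. When its inputs are distributed symmetrically about zero (for example, uniformly on an interval $[-a, a]$), the mean of the tanh values is 0.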
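The zero-mean property follows from tanh being an odd function, $\tanh(-x) = -\tanh(x)$, so values at symmetric inputs cancel. A quick numerical check, using an arbitrarily chosen symmetric grid on $[-5, 5]$:

```python
import numpy as np

# A grid that is symmetric about zero (the interval [-5, 5] is an arbitrary choice)
x = np.linspace(-5.0, 5.0, 1001)

is_odd = np.allclose(np.tanh(-x), -np.tanh(x))
mean_tanh = np.tanh(x).mean()

print(is_odd)
print(mean_tanh)  # zero up to floating-point rounding
```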

def rnn(inputs, state, params):
    W_xh, W_hh, b_h, W_hq, b_q = params
    H, = state
    outputs = []
    for X in inputs:
        X=tf.reshape(X,[-1,W_xh.shape[0]])
        H = tf.tanh(tf.matmul(X, W_xh) + tf.matmul(H, W_hh) + b_h)
        Y = tf.matmul(H, W_hq) + b_q
        outputs.append(Y)
    return outputs, (H,)

In the function above, inputs and outputs are each a list of num_steps matrices of shape (batch_size, vocab_size); each loop iteration processes one of these matrices.
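The shapes can be traced with a small NumPy stand-in for the same loop. This is a sketch under assumed toy sizes; `rnn_np` and all the sizes below are made up for illustration and are not part of the TensorFlow implementation.

```python
import numpy as np

def rnn_np(inputs, H, params):
    # NumPy version of the loop in rnn: one hidden-state update per time step
    W_xh, W_hh, b_h, W_hq, b_q = params
    outputs = []
    for X in inputs:
        H = np.tanh(X @ W_xh + H @ W_hh + b_h)
        outputs.append(H @ W_hq + b_q)
    return outputs, H

rng = np.random.default_rng(0)
batch_size, num_steps, vocab_size, num_hiddens = 2, 5, 12, 8  # hypothetical sizes

params = (
    rng.normal(scale=0.01, size=(vocab_size, num_hiddens)),   # W_xh
    rng.normal(scale=0.01, size=(num_hiddens, num_hiddens)),  # W_hh
    np.zeros(num_hiddens),                                    # b_h
    rng.normal(scale=0.01, size=(num_hiddens, vocab_size)),   # W_hq
    np.zeros(vocab_size),                                     # b_q
)

# One random one-hot matrix of shape (batch_size, vocab_size) per time step
inputs = [np.eye(vocab_size)[rng.integers(0, vocab_size, size=batch_size)]
          for _ in range(num_steps)]
H0 = np.zeros((batch_size, num_hiddens))

outputs, H_last = rnn_np(inputs, H0, params)
print(len(outputs), outputs[0].shape, H_last.shape)
```

The list of outputs has one entry per time step, while only the final hidden state is returned, since that is all the next call needs.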

7. Define the prediction function

The following function predicts the next num_chars characters based on the prefix (a string of several characters).

def predict_rnn(prefix, num_chars, params):
    state = init_rnn_state(batch_size=1)
    output = [char_to_idx[prefix[0]]]
    for t in range(num_chars + len(prefix) - 1):
        # Use the output of the previous time step as the input of the current one
        X = tf.convert_to_tensor(to_onehot(np.array([output[-1]]), vocab_size), dtype=tf.float32)
        X = tf.reshape(X, [1, -1])
        # Compute the output and update the hidden state
        (Y, state) = rnn(X, state, params)
        # The next input is either the next prefix character or the current best prediction
        if t < len(prefix) - 1:
            output.append(char_to_idx[prefix[t + 1]])
        else:
            # Take the index of the highest score (greedy choice) as a Python int
            output.append(int(tf.argmax(Y[0], axis=1).numpy()[0]))
    return ''.join([idx_to_char[i] for i in output])

We can try it out first:

params = get_params()
print(predict_rnn('分開', 10, params))

This gives:

分開盯欠袋王中寫鷗油村叢

Because the model parameters are random values at this point, the predicted text is random as well.

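8. Gradient clipping

Recurrent neural networks are prone to vanishing or exploding gradients. To cope with exploding gradients, we can clip the gradient. Suppose we concatenate the elements of the gradients of all model parameters into a single vector $\boldsymbol{g}$ and set the clipping threshold to $\theta$. The clipped gradient

$\min\left(\frac{\theta}{\|\boldsymbol{g}\|}, 1\right)\boldsymbol{g}$

has an $L_2$ norm that does not exceed $\theta$.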
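As a worked example of the formula, take two gradient arrays whose elements are 3 and 4, so the overall norm is 5, and a threshold of $\theta = 1$. The scale factor is $\min(1/5, 1) = 0.2$, giving a clipped norm of about 1. The NumPy sketch below checks this; the helper name `clip_by_global_norm_np` and the numbers are made up for illustration.

```python
import numpy as np

def clip_by_global_norm_np(grads, theta):
    # L2 norm of all gradient elements taken together
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # min(theta / ||g||, 1); leave an all-zero gradient untouched
    scale = min(theta / norm, 1.0) if norm > 0 else 1.0
    return [g * scale for g in grads], norm

grads = [np.array([3.0]), np.array([[4.0]])]  # overall norm is 5
clipped, norm = clip_by_global_norm_np(grads, theta=1.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)

# A gradient whose norm is already below the threshold is unchanged
unchanged, _ = clip_by_global_norm_np(grads, theta=10.0)
print(unchanged)
```

Clipping rescales every array by the same factor, so the direction of the overall gradient is preserved and only its length is limited.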

# Compute the clipped gradients
def grad_clipping(grads, theta):
    # L2 norm of all gradient elements taken together
    norm = float(tf.sqrt(sum(tf.reduce_sum(grad ** 2) for grad in grads)))
    if norm > theta:
        # Scale every gradient by theta / norm so that the overall norm equals theta
        return [grad * theta / norm for grad in grads]
    # Gradients within the threshold are returned unchanged
    return list(grads)

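9. Define the model training function

Compared with earlier model training functions, the training function here differs in the following ways:

It uses perplexity to evaluate the model.
It clips the gradients before updating the model parameters.
Different sampling methods for sequential data lead to different ways of initializing the hidden state.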

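9.1 Perplexity

We usually use perplexity to evaluate how good a language model is. Perplexity is the value obtained by exponentiating the cross-entropy loss. In particular:

In the best case, the model always predicts the probability of the label class as 1, and the perplexity is 1;
In the worst case, the model always predicts the probability of the label class as 0, and the perplexity is positive infinity;
In the baseline case, the model always predicts the same probability for every class, and the perplexity equals the number of classes.

Clearly, the perplexity of any useful model must be smaller than the number of classes. In this example, the perplexity must be smaller than the vocabulary size vocab_size.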
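The best and baseline cases above can be reproduced numerically. The sketch below computes perplexity as the exponential of the mean cross-entropy over a handful of samples; the helper `perplexity`, the four classes and the labels are made-up values for illustration.

```python
import numpy as np

def perplexity(probs, labels):
    # probs: (num_samples, num_classes) predicted probabilities
    # Cross-entropy is the mean negative log-probability of the label class
    label_probs = probs[np.arange(len(labels)), labels]
    cross_entropy = -np.mean(np.log(label_probs))
    return float(np.exp(cross_entropy))

num_classes = 4
labels = np.array([0, 2, 3, 1])

perfect = np.eye(num_classes)[labels]                           # label class gets probability 1
uniform = np.full((len(labels), num_classes), 1 / num_classes)  # all classes equally likely

ppl_perfect = perplexity(perfect, labels)
ppl_uniform = perplexity(uniform, labels)
print(ppl_perfect)
print(ppl_uniform)
```

The perfect predictor reaches a perplexity of 1, and the uniform predictor reaches the number of classes (up to floating-point rounding), which matches the cases listed above.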

9.2 Initialize the optimizer

optimizer = tf.keras.optimizers.SGD(learning_rate=1e2)

9.3 Define the gradient descent step

def train_step(params, X, Y, state, clipping_theta):
    # tf.Variable parameters are watched by the tape automatically
    with tf.GradientTape() as tape:
        inputs = to_onehot(X, vocab_size)
        # outputs is a list of num_steps matrices of shape (batch_size, vocab_size)
        (outputs, state) = rnn(inputs, state, params)
        # After concatenation the shape is (num_steps * batch_size, vocab_size)
        outputs = tf.concat(outputs, 0)
        # Y has shape (batch_size, num_steps); transpose it and flatten it into a
        # vector of length batch_size * num_steps so it lines up with the rows of outputs
        y = Y.T.reshape((-1,))
        y = tf.convert_to_tensor(y, dtype=tf.int32)
        # Mean cross-entropy loss; rnn returns unnormalized scores (no softmax),
        # so they are passed as logits
        l = tf.reduce_mean(tf.losses.sparse_categorical_crossentropy(y, outputs, from_logits=True))

    grads = tape.gradient(l, params)
    grads = grad_clipping(grads, clipping_theta)  # clip the gradients
    optimizer.apply_gradients(zip(grads, params))
    # Also return the updated hidden state so that consecutive sampling
    # can carry it into the next mini-batch
    return l, y, state

9.4 Define the training function

is_random_iter: whether to use random sampling;
pred_period: how many epochs to wait between printing results;
pred_len: the number of characters to predict.

def train_and_predict_rnn(is_random_iter, batch_size, clipping_theta, pred_period, pred_len, prefixes):
    if is_random_iter:
        data_iter_fn = data_iter_random
    else:
        data_iter_fn = data_iter_consecutive
    params = get_params()

    for epoch in range(num_epochs):
        if not is_random_iter:  # with consecutive sampling, initialize the hidden state at the start of each epoch
            state = init_rnn_state(batch_size)
        l_sum, n = 0.0, 0
        data_iter = data_iter_fn(corpus_indices, batch_size, num_steps)
        for X, Y in data_iter:
            if is_random_iter:  # with random sampling, initialize the hidden state before each mini-batch
                state = init_rnn_state(batch_size)
            l, y, state = train_step(params, X, Y, state, clipping_theta)
            l_sum += np.array(l).item() * len(y)
            n += len(y)

        if (epoch + 1) % pred_period == 0:
            print('epoch %d, perplexity %f' % (epoch + 1, math.exp(l_sum / n)))
            for prefix in prefixes:
                print(prefix)
                print(' -', predict_rnn(prefix, pred_len, params))
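In train_step, the per-step outputs are concatenated along axis 0, so the rows are ordered by time step first and by sample second. The labels therefore have to be flattened in the same order, which is why `Y` is transposed before reshaping. The NumPy sketch below illustrates this on a tiny made-up label matrix, using the labels themselves as a stand-in for the per-step outputs.

```python
import numpy as np

batch_size, num_steps = 2, 3  # hypothetical sizes
Y = np.arange(batch_size * num_steps).reshape((batch_size, num_steps))
# Y = [[0, 1, 2],
#      [3, 4, 5]]

# Stand-in for the per-step outputs: entry t holds one row per sample,
# here simply the label of that sample at time step t
outputs = [Y[:, t].reshape(-1, 1) for t in range(num_steps)]
stacked = np.concatenate(outputs, axis=0)  # shape (num_steps * batch_size, 1)

y_aligned = Y.T.reshape((-1,))    # transpose first, as in train_step
y_misaligned = Y.reshape((-1,))   # flattening without the transpose

print(stacked.ravel())   # [0 3 1 4 2 5]
print(y_aligned)         # [0 3 1 4 2 5]
print(y_misaligned)      # [0 1 2 3 4 5]
```

Without the transpose, each label would be compared against the output of a different sample or time step.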

9.5 Training

pred_period, pred_len, prefixes = 50, 50, ['分開', '不分開']
clipping_theta = 0.01
batch_size = 32
train_and_predict_rnn(False, batch_size, clipping_theta, pred_period, pred_len, prefixes)

Implementing a Recurrent Neural Network from Scratch with TensorFlow 2.0
