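I have recently been analysing the implementation details of RNN/LSTM layers that can be converted to pb models. Along the way I found that the commonly used keras.layers.LSTM in Keras and tf.contrib.rnn.LSTMCell in Tensorflow differ in several implementation details. Working from the Keras and Tensorflow source code, this post builds two simple one-layer LSTM networks and verifies the order in which the weights are parsed and the correctness of the computation logic. Let's roll~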
0. Common LSTM layer choices
A quick survey shows that the commonly used LSTM layers are keras.layers.LSTM, tf.contrib.rnn.LSTMCell and tf.nn.rnn_cell.LSTMCell; the last two share the same implementation.
The computation for keras.layers.LSTM lives in the LSTMCell class in keras/layers/recurrent.py. The computation for tf.contrib.rnn.LSTMCell and tf.nn.rnn_cell.LSTMCell lives in the LSTMCell class in tensorflow/python/ops/rnn_cell_impl.py.
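1. Keras LSTM computation logic
In terms of code clarity and ease of building models, Keras really is convenient. To work out the implementation logic, I built a model that predicts the next letter from the previous three: ABC -> D, BCD -> E, ..., WXY -> Z. Each letter is represented by a number (A = 0, B = 1, ..., Z = 25), there are 3 time steps, and the input dimension at each time step is 1 (each letter is encoded as a single number, i.e. an array of length 1):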
# coding: UTF-8
"""
@author: samuel ko
@date: 2018/12/12
@link: https://blog.csdn.net/zwqjoy/article/details/80493341
"""
import numpy
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.utils import np_utils

numpy.random.seed(5)

# Define the dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(len(alphabet))
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

# Prepare the dataset
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)

# The features fed to the network form a 3D tensor [batch_size, time_step, input_dim]
# In plain terms: time_step is the number of time steps, input_dim is the size of the data fed in at each time step
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
X = X / float(len(alphabet))
# One-hot encode the labels
y = np_utils.to_categorical(dataY)
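As the code above shows, X is the input data and y holds the labels. Next we build and train the model (for simplicity, one LSTM layer followed by one fully connected layer; the Tensorflow version uses the same structure):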
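model = Sequential()
# input_shape = (time_step, input_dim per time step)
# The first argument of LSTM, 5, sets the number of LSTM units to 5; an LSTM can be thought of as a special fully connected layer that carries temporal information.
# The first argument of Dense is y.shape[1] = 26, the number of labels: any of the 26 letters may be predicted, so this is a 26-class classification task.
model.add(LSTM(5, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Keras 2 names this argument `epochs`; `nb_epoch` is the deprecated Keras 1 spelling.
model.fit(X, y, epochs=100, batch_size=1, verbose=2)
model.save("simplelstm.h5")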
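Before training, it is worth confirming what the sliding window actually produces. The sketch below rebuilds the dataset with NumPy only; the `one_hot` helper is a hypothetical stand-in for `np_utils.to_categorical`, written here just so the example runs without Keras.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
seq_length = 3
char_to_int = {c: i for i, c in enumerate(alphabet)}

# Sliding windows of 3 letters -> next letter, as in the script above.
n_samples = len(alphabet) - seq_length
data_x = [[char_to_int[ch] for ch in alphabet[i:i + seq_length]]
          for i in range(n_samples)]
data_y = [char_to_int[alphabet[i + seq_length]] for i in range(n_samples)]

X = np.reshape(data_x, (len(data_x), seq_length, 1)) / float(len(alphabet))


def one_hot(labels):
    """Hypothetical stand-in for np_utils.to_categorical (max label + 1 columns)."""
    labels = np.asarray(labels)
    out = np.zeros((labels.size, labels.max() + 1))
    out[np.arange(labels.size), labels] = 1.0
    return out


y = one_hot(data_y)
print(X.shape, y.shape)  # (23, 3, 1) (23, 26)
```

With a window of 3 there are 26 - 3 = 23 samples, which is also why the Tensorflow training loop later in this post iterates over 23 samples.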
The full script:
# coding: UTF-8
"""
@author: samuel ko
@date: 2018/12/12
@link: https://blog.csdn.net/zwqjoy/article/details/80493341
"""
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import np_utils

# fix random seed for reproducibility
numpy.random.seed(5)

# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
print(len(alphabet))
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)
# Running the code above prints the input/output pairs:
# ABC -> D
# BCD -> E
# ...
# WXY -> Z

# The features fed to the network form a 3D tensor [batch_size, time_step, input_dim]
# In plain terms: time_step is the number of time steps, input_dim is the size of the data fed in at each time step
X = numpy.reshape(dataX, (len(dataX), seq_length, 1))
# print(X)
# [[[ 0]
#   [ 1]
#   [ 2]]
#  [[ 1]
#   [ 2]
#   [ 3]]
#  ...
#  [[22]
#   [23]
#   [24]]]
# normalize the inputs; a classification layer follows at the end
X = X / float(len(alphabet))
print(X.shape)
# (23, 3, 1)
# one-hot encode the output labels
y = np_utils.to_categorical(dataY)
print(y.shape)
# (23, 26)

# Create, train and save the model
model = Sequential()
# input_shape = (time_step, input_dim per time step)
model.add(LSTM(5, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Keras 2 names this argument `epochs`; `nb_epoch` is the deprecated Keras 1 spelling.
model.fit(X, y, epochs=100, batch_size=1, verbose=2)
model.save("simplelstm.h5")
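Once the script finishes we get the simplelstm.h5 model, which can be opened in Netron [1] to pull out the weights. This needs a little LSTM background: an LSTM has 4 branches and therefore 4 sets of weights. In Keras naming these are i (the input gate), c (new_input, the candidate cell input), f (the forget gate) and o (the output gate); see [2] for details: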
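- ① The forget gate $f_t$
- ② The new_input (candidate) $\tilde{C}_t$ and the input gate $i_t$
- ③ Updating the cell state to get $C_t$, which is carried to the next time step
- ④ Computing the output gate $o_t$, then obtaining this time step's output $h_t$ from $o_t$ and $C_t$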
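The four steps above can be written out as equations. This is a sketch in the notation of [2], matched to the NumPy verification code later in this post; the symbols $W$ (for kernel) and $U$ (for recurrent_kernel) are my own labels, not names used by either library. In Keras, $\sigma$ is the recurrent_activation and $\tanh$ the activation.

```latex
\begin{aligned}
f_t &= \sigma\left(x_t W_f + h_{t-1} U_f + b_f\right) \\
i_t &= \sigma\left(x_t W_i + h_{t-1} U_i + b_i\right) \\
\tilde{C}_t &= \tanh\left(x_t W_c + h_{t-1} U_c + b_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(x_t W_o + h_{t-1} U_o + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```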
You may ask: four sets of weights is easy enough to understand, but why does the visualised structure of simplelstm.h5 show two things, kernel and recurrent_kernel?
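Take our 3-time-step structure as an example. Each time step has two inputs. One is $x_t$, the slice of the data X fed in at that time step, which in our example is 1 x 1. The other is $h_{t-1}$, the memory state/hidden state passed between time steps within the same layer.
The latter is tied directly to the 5 in LSTM(5, input_shape=(X.shape[1], X.shape[2])). Each of the 4 weights applied to $h_{t-1}$ has shape 5 (LSTM units) x 5 (LSTM units).
Each of the 4 weights applied to $x_t$ has shape 1 (input dimension) x 5 (LSTM units).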
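A quick way to check these shapes is to count parameters. The helper below is a hypothetical sketch (the function name is mine), assuming one bias vector of length units per gate in addition to the two weight matrices described above.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of one LSTM layer with 4 gates (i, f, c, o)."""
    input_weights = input_dim * units      # weights applied to x_t
    recurrent_weights = units * units      # weights applied to h_{t-1}
    bias = units                           # one bias vector per gate
    return 4 * (input_weights + recurrent_weights + bias)


print(lstm_param_count(input_dim=1, units=5))  # 140
```

For this post's model that is 4 x (5 + 25 + 5) = 140, which should agree with what Keras reports for the LSTM layer in `model.summary()`.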
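Going back to Netron and looking at kernel, recurrent_kernel and bias, their shapes are 1 x 20, 5 x 20 and 1 x 20 respectively.
As you have probably guessed, Keras concatenates the four 1 x 5 kernels and biases, and the four 5 x 5 recurrent kernels, for i, f, c, o. All that remains is to read the source and split them accordingly.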
class LSTMCell(Layer):
    ...
    def build(self, input_shape):
        input_dim = input_shape[-1]
        # self.kernel handles the input coming into this layer
        self.kernel = self.add_weight(shape=(input_dim, self.units * 4),
                                      name='kernel',
                                      initializer=self.kernel_initializer,
                                      regularizer=self.kernel_regularizer,
                                      constraint=self.kernel_constraint)
        # self.recurrent_kernel handles the input passed between time steps of this layer
        self.recurrent_kernel = self.add_weight(
            shape=(self.units, self.units * 4),
            name='recurrent_kernel',
            initializer=self.recurrent_initializer,
            regularizer=self.recurrent_regularizer,
            constraint=self.recurrent_constraint)
        if self.use_bias:
            if self.unit_forget_bias:
                def bias_initializer(_, *args, **kwargs):
                    return K.concatenate([
                        self.bias_initializer((self.units,), *args, **kwargs),
                        initializers.Ones()((self.units,), *args, **kwargs),
                        self.bias_initializer((self.units * 2,), *args, **kwargs),
                    ])
            else:
                bias_initializer = self.bias_initializer
            self.bias = self.add_weight(shape=(self.units * 4,),
                                        name='bias',
                                        initializer=bias_initializer,
                                        regularizer=self.bias_regularizer,
                                        constraint=self.bias_constraint)
        else:
            self.bias = None
        # Split order
        self.kernel_i = self.kernel[:, :self.units]
        self.kernel_f = self.kernel[:, self.units: self.units * 2]
        self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]
        self.kernel_o = self.kernel[:, self.units * 3:]
        self.recurrent_kernel_i = self.recurrent_kernel[:, :self.units]
        self.recurrent_kernel_f = (
            self.recurrent_kernel[:, self.units: self.units * 2])
        self.recurrent_kernel_c = (
            self.recurrent_kernel[:, self.units * 2: self.units * 3])
        self.recurrent_kernel_o = self.recurrent_kernel[:, self.units * 3:]
        if self.use_bias:
            self.bias_i = self.bias[:self.units]
            self.bias_f = self.bias[self.units: self.units * 2]
            self.bias_c = self.bias[self.units * 2: self.units * 3]
            self.bias_o = self.bias[self.units * 3:]
        ...
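So the 1 x 20 kernel and bias and the 5 x 20 recurrent kernel are all parsed in the order i, f, c, o. Taking kernel as an example, columns 0-4 belong to i, columns 5-9 to f, columns 10-14 to c and columns 15-19 to o.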
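The slicing in `build` can be reproduced outside Keras with `np.split`. The sketch below uses placeholder arrays whose values encode the column index, so the i, f, c, o blocks are easy to recognise; `split_keras_lstm_weights` is a hypothetical helper name. With a loaded model, the three arrays would typically come from `model.layers[0].get_weights()`.

```python
import numpy as np


def split_keras_lstm_weights(kernel, recurrent_kernel, bias):
    """Split fused Keras LSTM weights into per-gate blocks in i, f, c, o order."""
    names = ("i", "f", "c", "o")
    k = np.split(kernel, 4, axis=1)
    r = np.split(recurrent_kernel, 4, axis=1)
    b = np.split(bias, 4)
    return {n: {"kernel": k[j], "recurrent_kernel": r[j], "bias": b[j]}
            for j, n in enumerate(names)}


units, input_dim = 5, 1
# Placeholder values, not trained weights: each entry equals its column index.
kernel = np.arange(4 * units, dtype=float).reshape(input_dim, 4 * units)
recurrent_kernel = np.tile(np.arange(4 * units, dtype=float), (units, 1))
bias = np.arange(4 * units, dtype=float)

gates = split_keras_lstm_weights(kernel, recurrent_kernel, bias)
for name in ("i", "f", "c", "o"):
    w = gates[name]
    print(name, w["kernel"].shape, w["recurrent_kernel"].shape, w["bias"].shape)
```

Each gate ends up with a (1, 5) kernel, a (5, 5) recurrent kernel and a (5,) bias, matching the shapes discussed above.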
Next I extract the weights and biases and, following the computation logic in the source, reimplement it with NumPy, then check the result against native Keras:
- ① First, build an input of shape (1, 3, 1), feed it to the network and print the output of the LSTM layer:
"""
@author: samuel ko
@date: 2018/12/17
@target: inspect the model's intermediate outputs
@ref: author: 揮揮灑灑
      source: CSDN
      original: https://blog.csdn.net/u010420283/article/details/80303231
"""
from keras.models import load_model
from keras import backend as K
import numpy as np

model = load_model("simplelstm.h5")
# The first argument, model.layers[0].input, stays fixed: it is the network input.
# The second selects the tensor to fetch; change the layer index to the layer you want.
layer_1 = K.function([model.layers[0].input], [model.layers[0].output])
layer_11 = K.function([model.layers[0].input], [model.layers[1].input])
# Build an input of shape (1, 3, 1) and feed it to the network
inputs = np.array([[0], [0.03846154], [0.07692308]])
inputs = np.expand_dims(inputs, 0)
print(layer_1([inputs])[0]); print(layer_1([inputs])[0].shape)
print(layer_11([inputs])[0]); print(layer_11([inputs])[0].shape)
The output is (note that the LSTM layer's output is identical to the Dense layer's input):
[[-0.6918077 -0.5736012 -0.6106971 -0.23724467 -0.28232932]]
(1, 5)
[[-0.6918077 -0.5736012 -0.6106971 -0.23724467 -0.28232932]]
(1, 5)
- ② Next, using the network graph from Netron, split the weights and reimplement the computation logic of keras.layers.LSTM with NumPy:
"""
@author: samuel ko
@date: 2018/12/17
@target: inspect the model's intermediate outputs
@ref: author: 揮揮灑灑
      source: CSDN
      original: https://blog.csdn.net/u010420283/article/details/80303231
"""
from keras.models import load_model
from keras import backend as K
import numpy as np

h_tm_i, h_tm_o, h_tm_c, h_tm_f, c_tm = None, None, None, None, None

def hard_sigmoid(x):
    # Keras hard_sigmoid: 0.2 * x + 0.5 clipped to [0, 1], i.e. 0 for x < -2.5 and 1 for x > 2.5
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def lstm_keras_verify(inputs):
    global h_tm_c, h_tm_f, h_tm_i, h_tm_o, c_tm
    # Kernel values
    kernel_i = np.array([0.4309869408607483, 1.184934139251709, 1.1755656003952026, 0.29152509570121765, 0.9355264902114868])
    kernel_f = np.array([0.4721968472003937, 0.8939654231071472, 0.3940809667110443, 0.32647714018821716, 0.3925175964832306])
    kernel_c = np.array([0.43232300877571106, 0.9761391282081604, 0.4974423944950104, -0.5713692307472229, 0.6272905468940735])
    kernel_o = np.array([0.4851478338241577, 0.4159347116947174, 0.8334378600120544, 0.6494604349136353, 1.4963207244873047])
    recurrent_kernel_i = np.array([[-0.15266947448253632, -0.4967867434024811, -0.2602699398994446, -0.3376578092575073, 0.18315182626247406],
                                   [0.40668627619743347, 0.11702277511358261, 0.2870166599750519, -0.09417486935853958, 1.2248116731643677],
                                   [0.13948452472686768, -0.2935984432697296, -0.18430666625499725, 0.04545489326119423, 0.8304147720336914],
                                   [-0.9957871437072754, -1.2020113468170166, -1.1591960191726685, -0.2052622139453888, -1.3381662368774414],
                                   [1.1894947290420532, 0.675262451171875, 0.6069576144218445, 0.5705539584159851, 0.9218697547912598]])
    recurrent_kernel_f = np.array([[-0.548134982585907, -0.12552201747894287, -0.41158366203308105, 0.09746172279119492, 0.19226618111133575],
                                   [0.10524879395961761, 0.032132066786289215, 0.0605274997651577, 0.07235733419656754, 0.7413577437400818],
                                   [-0.17540045082569122, -0.40539026260375977, -0.18782351911067963, 0.20610281825065613, 0.8710744380950928],
                                   [-0.7760279178619385, -0.9006417393684387, -0.7003670334815979, -0.22393617033958435, -0.5202550888061523],
                                   [0.7772086262702942, 0.7663999199867249, 0.5117960572242737, 0.13461880385875702, 0.7836397290229797]])
    recurrent_kernel_c = np.array([[1.580788493156433, 1.0911318063735962, 0.6749269366264343, 0.30827417969703674, 0.7559695839881897],
                                   [0.7300652265548706, 0.9139286875724792, 1.1172183752059937, 0.043491244316101074, 0.8009109497070312],
                                   [1.49398934841156, 0.5944592356681824, 0.8874677419662476, -0.1583320051431656, 1.3592860698699951],
                                   [0.032015360891819, -0.5035645365715027, -0.3792402148246765, 0.42566269636154175, -0.6349631547927856],
                                   [0.12018230557441711, 0.33967509865760803, 0.5114297270774841, -0.062018051743507385, 0.5401539206504822]])
    recurrent_kernel_o = np.array([[-0.41055813431739807, -0.017661772668361664, 0.06882145255804062, 0.09856614470481873, 0.44098445773124695],
                                   [0.5692929625511169, 0.5409368872642517, 0.3319447338581085, 0.4997922480106354, 0.9462743401527405],
                                   [0.1794481724500656, 0.10621143877506256, -0.0016202644910663366, -0.010369917377829552, 0.4268817901611328],
                                   [-1.026210904121399, -0.6898611783981323, -0.9652346968650818, -0.07141508907079697, -0.6710768938064575],
                                   [0.5829002261161804, 0.6890853047370911, 0.5738061666488647, -0.16630153357982635, 1.2376824617385864]])
    bias_i = np.array([1.1197513341903687, 1.0861579179763794, 1.0329890251159668, 0.3536357581615448, 0.9598652124404907])
    bias_f = np.array([2.020589828491211, 1.940927267074585, 1.9546188116073608, 1.1743367910385132, 1.7189750671386719])
    bias_c = np.array([-0.41391095519065857, -0.21292796730995178, -0.30117690563201904, -0.24005982279777527, 0.053657304495573044])
    bias_o = np.array([1.222458004951477, 1.1024200916290283, 1.0836670398712158, 0.3483290672302246, 0.9281882643699646])
    # step 1: compute W * x
    x_i = inputs * kernel_i
    x_f = inputs * kernel_f
    x_c = inputs * kernel_c
    x_o = inputs * kernel_o
    # step 2: add the bias
    x_i += bias_i
    x_f += bias_f
    x_c += bias_c
    x_o += bias_o
    # step 3: compute the gates and states (states start at zero on the first call)
    if not isinstance(h_tm_i, np.ndarray):
        h_tm_i = np.zeros((1, 5))
        h_tm_o = np.zeros((1, 5))
        h_tm_f = np.zeros((1, 5))
        h_tm_c = np.zeros((1, 5))
        c_tm = np.zeros((1, 5))
    i = hard_sigmoid(x_i + np.dot(h_tm_i, recurrent_kernel_i))
    f = hard_sigmoid(x_f + np.dot(h_tm_f, recurrent_kernel_f))
    c = f * c_tm + i * np.tanh(x_c + np.dot(h_tm_c, recurrent_kernel_c))
    o = hard_sigmoid(x_o + np.dot(h_tm_o, recurrent_kernel_o))
    h = o * np.tanh(c)
    h_tm_c = h_tm_f = h_tm_o = h_tm_i = h
    c_tm = c
    print("Current hidden state", h)
    print("Current cell state", c)
    return h, c

if __name__ == "__main__":
    # Same driver loop as in the Tensorflow script later in this post
    inputs = np.expand_dims(np.array([[0], [0.03846154], [0.07692308]]), 0)
    for t in range(3):
        print("Input:", inputs[:, t])
        lstm_keras_verify(inputs[:, t])
Result:
[[-0.6918077 -0.5736012 -0.6106971 -0.23724467 -0.28232932]]
(1, 5)
[[-0.6918077 -0.5736012 -0.6106971 -0.23724467 -0.28232932]]
(1, 5)
Input: [[0.]]
Current hidden state [[-0.20567793 -0.10758754 -0.14600677 -0.07612558 0.02542126]]
Current cell state [[-0.2836353 -0.15045176 -0.20660162 -0.13443607 0.03709382]]
Input: [[0.03846154]]
Current hidden state [[-0.52542272 -0.34593632 -0.39644344 -0.1596688 -0.1078329 ]]
Current cell state [[-0.83987432 -0.52042347 -0.6076283 -0.29302937 -0.16417923]]
Input: [[0.07692308]]
Current hidden state [[-0.69180776 -0.57360109 -0.61069705 -0.23724468 -0.28232936]]
Current cell state [[-1.51751077 -1.19211365 -1.25843129 -0.46999835 -0.55761341]]
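As you can see, the output of the Keras LSTM layer matches the memory state/hidden state produced at the last time step. (There is a tiny discrepancy around the seventh decimal place, possibly from CUDA, though float32 rounding in Keras versus float64 in NumPy would by itself explain a difference of this size.)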
# Keras result
[[-0.6918077 -0.5736012 -0.6106971 -0.23724467 -0.28232932]]
# NumPy reimplementation result
[[-0.69180776 -0.57360109 -0.61069705 -0.23724468 -0.28232936]]
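The per-gate function above keeps every weight as a separate array. For comparison, here is a compact sketch that works directly on the fused arrays (kernel, recurrent_kernel, bias) and splits the pre-activations in Keras's i, f, c, o order. The function name and the random placeholder weights are my own; this is an illustration of the same logic, not Keras's actual code path.

```python
import numpy as np


def hard_sigmoid(x):
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)


def keras_lstm_forward(x_seq, kernel, recurrent_kernel, bias):
    """Run one sample through an LSTM using fused Keras-style weights.

    x_seq has shape (time_steps, input_dim); returns (h, c) after the last step.
    """
    units = recurrent_kernel.shape[0]
    h = np.zeros(units)
    c = np.zeros(units)
    for x_t in x_seq:
        z = x_t @ kernel + h @ recurrent_kernel + bias  # shape (4 * units,)
        z_i, z_f, z_c, z_o = np.split(z, 4)              # Keras order: i, f, c, o
        i = hard_sigmoid(z_i)
        f = hard_sigmoid(z_f)
        c = f * c + i * np.tanh(z_c)
        o = hard_sigmoid(z_o)
        h = o * np.tanh(c)
    return h, c


rng = np.random.RandomState(0)  # placeholder weights, not the trained ones
units, input_dim = 5, 1
kernel = rng.randn(input_dim, 4 * units)
recurrent_kernel = rng.randn(units, 4 * units)
bias = rng.randn(4 * units)
x_seq = np.array([[0.0], [0.03846154], [0.07692308]])

h, c = keras_lstm_forward(x_seq, kernel, recurrent_kernel, bias)
print(h.shape, c.shape)
```

Feeding in the fused arrays exported from simplelstm.h5 instead of the random placeholders should reproduce the hidden state printed above, up to float rounding.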
2. Tensorflow LSTM computation logic
As mentioned at the start, tf.contrib.rnn.LSTMCell and tf.nn.rnn_cell.LSTMCell share the same computation code, the LSTMCell class in tensorflow/python/ops/rnn_cell_impl.py. I use tf.contrib.rnn.LSTMCell here. The input data X and labels y are the same as in the Keras example (reuse them directly, so they are not repeated here), and the model definition is very similar, following the usual TF pattern:
"""
@author: samuel ko
@date: 2018/12/18
@target: train a TF model with a single LSTM layer
@ref: author: 謝小小XH
      source: CSDN
      original: https://blog.csdn.net/xierhacker/article/details/78772560
"""
import tensorflow as tf

# X and y are the arrays built in the Keras example above
inputs = tf.placeholder(shape=(None, 3, 1), dtype=tf.float32, name='Inputs')
labels = tf.placeholder(shape=(None, 26), dtype=tf.float32, name="Labels")
lstm_cell = tf.contrib.rnn.LSTMCell(num_units=5)
# initialize to zero
init_state = lstm_cell.zero_state(batch_size=1, dtype=tf.float32)
output, state = tf.nn.dynamic_rnn(
    cell=lstm_cell,
    inputs=inputs,
    dtype=tf.float32,
    initial_state=init_state,
)
print("output.shape:", output.shape)
print("len of state tuple", len(state))
print("state.h.shape:", state.h.shape)
print("state.c.shape:", state.c.shape)
# output = tf.layers.dense(output, 26)
output = tf.layers.dense(state.h, 26, name="Outputs")
loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=output)
optimizer = tf.train.AdamOptimizer(0.001).minimize(loss=loss)
init = tf.global_variables_initializer()
saver = tf.train.Saver(max_to_keep=5)
#-------------------------------------------Define Session---------------------------------------#
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(1, 100+1):
        train_losses = []
        print("epoch:", epoch)
        # 23 samples, fed one at a time (batch size 1)
        for j in range(23):
            _, train_loss = sess.run(
                fetches=(optimizer, loss),
                feed_dict={
                    inputs: X[j: j+1],
                    labels: y[j: j+1]
                }
            )
            train_losses.append(train_loss)
        print("average training loss:", sum(train_losses) / len(train_losses))
    saver.save(sess, "model/simple_lstm")
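Training produces the model in checkpoint form.
As with the Keras LSTM, we first work out from the source where the different kernels, biases and recurrent kernels are stored, then split them and reimplement the computation logic in NumPy: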
# coding: UTF-8
"""
@author: samuel ko
@date: 2018/12/18
@target: inspect the TF model's intermediate outputs
"""
import sys
import os
import numpy as np
import tensorflow as tf

h_tm_i, h_tm_o, h_tm_c, h_tm_f, c_tm = None, None, None, None, None

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_tf_verify(inputs):
    """
    2018/12/18
    TF's native split order is i, j, f, o (j is what Keras calls c)
    :param inputs:
    :return:
    """
    global h_tm_c, h_tm_f, h_tm_i, h_tm_o, c_tm
    bias_i = ...
    bias_j = ...
    bias_f = ...
    bias_o = ...
    kernel_i = ...
    kernel_j = ...
    kernel_f = ...
    kernel_o = ...
    recurrent_i = ...
    recurrent_j = ...
    recurrent_f = ...
    recurrent_o = ...
    # step 1: compute W * x
    x_i = inputs * kernel_i
    x_f = inputs * kernel_f
    x_j = inputs * kernel_j
    x_o = inputs * kernel_o
    # step 2: add the bias
    x_i += bias_i
    x_f += bias_f
    x_j += bias_j
    x_o += bias_o
    # step 3: compute the gates and states (states start at zero on the first call)
    if not isinstance(h_tm_i, np.ndarray):
        h_tm_i = np.zeros((1, 5))
        h_tm_o = np.zeros((1, 5))
        h_tm_f = np.zeros((1, 5))
        h_tm_c = np.zeros((1, 5))
        c_tm = np.zeros((1, 5))
    i = sigmoid(x_i + np.dot(h_tm_i, recurrent_i))
    # Tensorflow has a forget_bias, which defaults to 1.0
    f = sigmoid(x_f + np.dot(h_tm_f, recurrent_f) + 1.0)
    c = f * c_tm + i * np.tanh(x_j + np.dot(h_tm_c, recurrent_j))
    o = sigmoid(x_o + np.dot(h_tm_o, recurrent_o))
    h = o * np.tanh(c)
    h_tm_c = h_tm_f = h_tm_o = h_tm_i = h
    c_tm = c
    print("Current hidden state", h)
    print("Current cell state", c)
    return h, c
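Now compare this with the output of the LSTM layer in the Tensorflow model. Given the definition
output, state = tf.nn.dynamic_rnn(
    cell=lstm_cell,
    inputs=inputs,
    dtype=tf.float32,
    initial_state=init_state,
)
there are two outputs, output and state. output collects the hidden state $h_t$ from every time step. state has two parts, state.h and state.c: the former is the hidden state/memory state produced at the last time step of this layer, the latter is the cell state produced at the last time step.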
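The relationship between output and state.h can be checked with plain NumPy, using the values this post prints for the trained model further down. Those values come from a time-major tensor of shape (time_steps, batch, units); the transpose at the end shows the batch-major layout that tf.nn.dynamic_rnn returns by default (time_major=False). The variable names are my own.

```python
import numpy as np

# Per-step hidden states, time-major: (time_steps, batch, units)
out = np.array([
    [[-0.14857864, 0.17725913, -0.03559565, -0.05385567, -0.02496454]],
    [[-0.3793954, 0.45447606, -0.13174371, -0.17756298, -0.17771873]],
    [[-0.5253717, 0.55423415, -0.25274208, -0.25586015, -0.34587777]],
])
# state.h: hidden state after the last time step, shape (batch, units)
state_h = np.array([[-0.5253717, 0.55423415, -0.25274208, -0.25586015, -0.34587777]])

print(out.shape)                           # (3, 1, 5)
print(np.array_equal(out[-1], state_h))    # True
batch_major = np.transpose(out, (1, 0, 2))
print(batch_major.shape)                   # (1, 3, 5)
```

In other words, state.h is just the last slice of output along the time axis, which is why feeding state.h to the dense layer is equivalent to using the final time step of output.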
The full script:
# coding: UTF-8
"""
@author: samuel ko
@date: 2018/12/18
@target: inspect the TF model's intermediate outputs
"""
import sys
import os
import numpy as np
import tensorflow as tf

path_file = __file__
dir_name = os.path.dirname(path_file)

# 1. Prepare the input
inputs = np.array([[0], [0.03846154], [0.07692308]])
inputs = np.expand_dims(inputs, 0)
labels = np.array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                    0, 0, 0, 0, 0, 0]])

# 2. Load the model, print the intermediate and final results
with tf.Session() as sess:
    graph = tf.get_default_graph()
    new_saver = tf.train.import_meta_graph(os.path.join(dir_name, 'model/simple_lstm.meta'))
    # Note: tf.train.get_checkpoint_state did not accept a path containing Chinese characters, while tf.train.latest_checkpoint had no such problem...
    # new_saver.restore(sess, tf.train.get_checkpoint_state(os.path.join(dir_name, "model/")))
    new_saver.restore(sess, tf.train.latest_checkpoint(os.path.join(dir_name, "model/")))
    input_x = graph.get_tensor_by_name("Inputs:0")
    label_x = graph.get_tensor_by_name("Labels:0")
    # out collects the per-step outputs passed on to the next layer, 3 x 1 x 5
    out = graph.get_tensor_by_name('rnn/TensorArrayStack/TensorArrayGatherV3:0')
    # state_h is the LSTM layer's result at the last time step, 1 x 5
    # (memory state of the last time step; same as graph.get_tensor_by_name('rnn/while/Switch_4:0'))
    state_h = graph.get_tensor_by_name('rnn/while/Exit_4:0')
    # state_h = graph.get_tensor_by_name('rnn/while/Exit_3:0')  # cell state of the last time step
    print(sess.run(out, feed_dict={input_x: inputs,
                                   label_x: labels,
                                   }))
    print(sess.run(state_h, feed_dict={input_x: inputs,
                                       label_x: labels,
                                       }))

h_tm_i, h_tm_o, h_tm_c, h_tm_f, c_tm = None, None, None, None, None

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_tf_verify(inputs):
    """
    2018/12/18
    TF's native split order is i, j, f, o (j is what Keras calls c)
    :param inputs:
    :return:
    """
    global h_tm_c, h_tm_f, h_tm_i, h_tm_o, c_tm
    bias_i = np.array([0.9502341, 1.1212865, 0.5962041, 0.56686985, 0.65736747])
    bias_j = np.array([-0.28798968, 0.31724977, -0.08590735, -0.13165179, -0.05694159])
    bias_f = np.array([0.89209175, 1.0639387, 0.3089665, 0.42762548, 0.4232108])
    bias_o = np.array([1.0723785, 1.2605966, 0.5964751, 0.6030057, 0.6930808])
    kernel_i = np.array([0.96915483, 0.5620192, 0.5136176, 0.1521692, 0.96555483])
    kernel_j = np.array([0.6295774, -0.72134864, 0.64238673, 0.48595947, 0.570404])
    kernel_f = np.array([0.7884312, 0.56634164, 0.14510694, 0.19882877, 0.6444183])
    kernel_o = np.array([0.55998164, 0.5682311, 0.9390488, 0.8536483, 0.9704966])
    recurrent_i = np.array([[-0.30848396, -0.13132317, 0.6034289, 0.59028447, 0.09684605],
                            [0.28015903, -0.24312414, -0.42499176, -0.3367074, -0.06846467],
                            [0.7987564, 0.93413734, -0.15053841, 0.66372687, 0.06576955],
                            [0.24111897, 0.1684269, 0.5229809, 0.09525479, 0.28952646],
                            [0.70739645, 0.8474347, 0.19091478, 0.02707534, 0.52820826]])
    recurrent_j = np.array([[1.272224, -1.475185, 0.38326767, 0.64769256, 0.83099645],
                            [-0.5344824, 1.2404263, -0.88588023, -0.7727197, -1.167835],
                            [0.86383224, -0.8951096, 0.08373257, 0.89576524, 0.53091526],
                            [0.7915831, -0.93986595, -0.02958089, 0.82741463, 0.55338454],
                            [0.39262557, -0.86354613, 0.62125677, 0.82101977, 0.13056423]])
    recurrent_f = np.array([[0.17595771, 0.27790356, 0.6525466, 0.05647744, 0.06983535],
                            [0.26703873, 0.04883758, 0.0888641, -0.05813761, 0.0277635],
                            [0.6442748, 0.4176797, 0.5382307, 0.48299634, 0.7003999],
                            [0.19449034, 0.01752495, 0.13846086, 0.00932326, 0.4014144],
                            [0.6212245, 0.59203285, 0.05094814, 0.85539377, 0.6473349]])
    recurrent_o = np.array([[0.29326066, 0.50268304, 0.544091, 0.76660025, 0.29213676],
                            [-0.44291726, -0.338039, -0.17275955, -0.7254445, -0.7070001],
                            [0.13272414, 0.8238844, -0.09202695, 0.9273238, 0.15251717],
                            [0.06204496, 0.6531808, 0.00607, 0.33238858, 0.04696886],
                            [0.9217779, 0.6748385, 0.61127436, 0.5573597, 0.21182081]])
    # step 1: compute W * x
    x_i = inputs * kernel_i
    x_f = inputs * kernel_f
    x_j = inputs * kernel_j
    x_o = inputs * kernel_o
    # step 2: add the bias
    x_i += bias_i
    x_f += bias_f
    x_j += bias_j
    x_o += bias_o
    # step 3: compute the gates and states (states start at zero on the first call)
    if not isinstance(h_tm_i, np.ndarray):
        h_tm_i = np.zeros((1, 5))
        h_tm_o = np.zeros((1, 5))
        h_tm_f = np.zeros((1, 5))
        h_tm_c = np.zeros((1, 5))
        c_tm = np.zeros((1, 5))
    i = sigmoid(x_i + np.dot(h_tm_i, recurrent_i))
    # Tensorflow has a forget_bias, which defaults to 1.0
    f = sigmoid(x_f + np.dot(h_tm_f, recurrent_f) + 1.0)
    c = f * c_tm + i * np.tanh(x_j + np.dot(h_tm_c, recurrent_j))
    o = sigmoid(x_o + np.dot(h_tm_o, recurrent_o))
    h = o * np.tanh(c)
    h_tm_c = h_tm_f = h_tm_o = h_tm_i = h
    c_tm = c
    print("Current hidden state", h)
    print("Current cell state", c)
    return h, c

if __name__ == "__main__":
    for i in range(3):
        print("Input:", inputs[:, i])
        # lstm_keras_verify(inputs[:, i])
        lstm_tf_verify(inputs[:, i])
The output is:
# output, 3 x 1 x 5: the hidden states of this layer at every time step
[[[-0.14857864 0.17725913 -0.03559565 -0.05385567 -0.02496454]]
[[-0.3793954 0.45447606 -0.13174371 -0.17756298 -0.17771873]]
[[-0.5253717 0.55423415 -0.25274208 -0.25586015 -0.34587777]]]
# state.h: the hidden state at the last time step
[[-0.5253717 0.55423415 -0.25274208 -0.25586015 -0.34587777]]
Input: [[0.]]
Current hidden state [[-0.14857867 0.17725915 -0.03559565 -0.05385567 -0.02496454]]
Current cell state [[-0.20212986 0.23156138 -0.05525611 -0.08351723 -0.03746516]]
Input: [[0.03846154]]
Current hidden state [[-0.37939543 0.45447602 -0.13174374 -0.17756298 -0.17771877]]
Current cell state [[-0.58665553 0.71037671 -0.21416421 -0.31547094 -0.28813169]]
Input: [[0.07692308]]
Current hidden state [[-0.5253716 0.55423418 -0.25274209 -0.25586014 -0.34587777]]
Current cell state [[-1.12897442 1.26972863 -0.47543917 -0.66030582 -0.70899148]]
As you can see, our implementation matches TF almost exactly (as with Keras, there is a tiny precision difference).
# TF result
[[-0.5253717 0.55423415 -0.25274208 -0.25586015 -0.34587777]]
# NumPy reimplementation result
[[-0.5253716 0.55423418 -0.25274209 -0.25586014 -0.34587777]]
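To put a number on "a tiny precision difference", the sketch below compares the framework outputs with the NumPy results using the values printed in this post. The 1e-6 tolerance is my own choice for illustration, picked to sit a little above float32 machine epsilon.

```python
import numpy as np

keras_out = np.array([-0.6918077, -0.5736012, -0.6106971, -0.23724467, -0.28232932])
keras_np = np.array([-0.69180776, -0.57360109, -0.61069705, -0.23724468, -0.28232936])
tf_out = np.array([-0.5253717, 0.55423415, -0.25274208, -0.25586015, -0.34587777])
tf_np = np.array([-0.5253716, 0.55423418, -0.25274209, -0.25586014, -0.34587777])

print("float32 eps:", np.finfo(np.float32).eps)
results = {}
for name, ref, ours in (("keras", keras_out, keras_np), ("tf", tf_out, tf_np)):
    max_diff = float(np.max(np.abs(ref - ours)))
    results[name] = (max_diff, bool(np.allclose(ref, ours, rtol=0.0, atol=1e-6)))
    print(name, "max abs diff:", max_diff, "within 1e-6:", results[name][1])
```

Both maximum differences are on the order of 1e-7, the same order as float32 machine epsilon, so the two implementations agree to single-precision accuracy.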
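3. Similarities and differences between the Keras and TF LSTM layers
This part takes a close look at the computation logic of the Keras and Tensorflow LSTM layers. The source locations are given at the start of the post; reading them first will make this part easier to follow.
The code to compare is mainly the lstm_keras_verify function and the lstm_tf_verify function: as the names suggest, the former implements the Keras LSTM logic and the latter the Tensorflow LSTM logic. If the differences below are hard to follow in the library source, looking at how these two functions differ works just as well.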
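- ① TF's self._kernel covers both input_depth (1 in this example) and h_depth (num_units in this example, i.e. 5). In other words, it puts Keras's kernel and recurrent_kernel together in a single self._kernel.
So when I print the simple_lstm Tensorflow model, rnn/lstm_cell/kernel has size 6 x 20. What does the 6 mean? It simply contains a 1 x 20 block (input_w_kernel) followed by a 5 x 20 block (recurrent_w_kernel), and that is also the order in which they are parsed. (Keras, by contrast, stores the weights separately as kernel and recurrent_kernel.)
The Tensorflow LSTM code that creates self._kernel to store the weights: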
@tf_export("nn.rnn_cell.LSTMCell")
class LSTMCell(LayerRNNCell):
  ...
  @tf_utils.shape_type_conversion
  def build(self, inputs_shape):
    if inputs_shape[-1] is None:
      raise ValueError("Expected inputs.shape[-1] to be known, saw shape: %s"
                       % str(inputs_shape))
    input_depth = inputs_shape[-1]
    h_depth = self._num_units if self._num_proj is None else self._num_proj
    ...
    # self._kernel holds both Keras's kernel and recurrent_kernel: a 2-in-1 of the Keras LSTM weights.
    self._kernel = self.add_variable(
        _WEIGHTS_VARIABLE_NAME,
        shape=[input_depth + h_depth, 4 * self._num_units],
        initializer=self._initializer,
        partitioner=maybe_partitioner)
    ...
    self._bias = self.add_variable(
        _BIAS_VARIABLE_NAME,
        shape=[4 * self._num_units],
        initializer=initializer)
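Splitting the fused TF kernel therefore takes two cuts: by rows into the input part and the recurrent part, then by columns into the four gates in TF's i, j, f, o order (j being what Keras calls c). The sketch below uses a placeholder 6 x 20 array; `split_tf_lstm_kernel` is a hypothetical helper name, not a Tensorflow API.

```python
import numpy as np


def split_tf_lstm_kernel(kernel, input_depth):
    """Split a TF LSTMCell kernel of shape (input_depth + h_depth, 4 * units).

    Rows: input part first, recurrent part second.
    Columns: i, j, f, o.
    Returns {gate: (input_block, recurrent_block)}.
    """
    input_part = kernel[:input_depth]
    recurrent_part = kernel[input_depth:]
    names = ("i", "j", "f", "o")
    k = np.split(input_part, 4, axis=1)
    r = np.split(recurrent_part, 4, axis=1)
    return {n: (k[idx], r[idx]) for idx, n in enumerate(names)}


units, input_depth = 5, 1
rows, cols = input_depth + units, 4 * units
# Placeholder values, not trained weights: entry = 20 * row + column.
kernel = np.arange(rows * cols, dtype=float).reshape(rows, cols)

parts = split_tf_lstm_kernel(kernel, input_depth)
for name in ("i", "j", "f", "o"):
    k_block, r_block = parts[name]
    print(name, k_block.shape, r_block.shape)
```

Each gate gets a (1, 5) input block and a (5, 5) recurrent block, the same per-gate shapes as in Keras; only the storage layout differs.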
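- ② TF's i, j, f, o correspond to i, c, f, o in the Keras LSTM. In other words, the weight order in Keras (i, f, c, o) differs from the weight order in Tensorflow (i, j, f, o)!
3.2.1 Tensorflow LSTM weight split order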
@tf_export("nn.rnn_cell.LSTMCell")
class LSTMCell(LayerRNNCell):
  ...
  def call(self, inputs, state):
    ...
    # i, j, f, o: j is what Keras calls c (see below)
    i, j, f, o = array_ops.split(
        value=lstm_matrix, num_or_size_splits=4, axis=1)
    # Diagonal connections
    if self._use_peepholes:
      # Ignore the peephole LSTM variant for now.
      ...
    else:
      c = (sigmoid(f + self._forget_bias) * c_prev + sigmoid(i) *
           self._activation(j))
    ...
    m = sigmoid(o) * self._activation(c)
3.2.2 Keras LSTM weight split order
class LSTMCell(Layer):
    def build(self, input_shape):
        ...
        # Keras stores the 4 weights in the order i, f, c, o; Tensorflow stores them as i, j, f, o, with the middle two swapped.
        # That is, if the Keras order is a, b, c, d, the corresponding Tensorflow order is a, c, b, d.
        self.kernel_i = self.kernel[:, :self.units]
        self.kernel_f = self.kernel[:, self.units: self.units * 2]
        self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]
        self.kernel_o = self.kernel[:, self.units * 3:]
        # recurrent_kernel uses the same order as kernel.
        self.recurrent_kernel_i = self.recurrent_kernel[:, :self.units]
        self.recurrent_kernel_f = (
            self.recurrent_kernel[:, self.units: self.units * 2])
        self.recurrent_kernel_c = (
            self.recurrent_kernel[:, self.units * 2: self.units * 3])
        self.recurrent_kernel_o = self.recurrent_kernel[:, self.units * 3:]
        if self.use_bias:
            self.bias_i = self.bias[:self.units]
            self.bias_f = self.bias[self.units: self.units * 2]
            self.bias_c = self.bias[self.units * 2: self.units * 3]
            self.bias_o = self.bias[self.units * 3:]
        ...
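Putting the two layouts side by side, converting Keras weights to the TF layout comes down to swapping the middle two gate blocks and stacking kernel on top of recurrent_kernel. The sketch below is a hypothetical helper, not an API of either library. It also subtracts TF's compute-time forget_bias constant (default 1.0, seen in `call` above) from the forget-gate bias so the pre-activation stays the same. Outputs would still only match if both sides use the same gate activation.

```python
import numpy as np


def keras_to_tf_lstm(kernel, recurrent_kernel, bias, forget_bias=1.0):
    """Rearrange fused Keras LSTM weights (i, f, c, o) into TF layout (i, j, f, o)."""
    k_i, k_f, k_c, k_o = np.split(kernel, 4, axis=1)
    r_i, r_f, r_c, r_o = np.split(recurrent_kernel, 4, axis=1)
    b_i, b_f, b_c, b_o = np.split(bias, 4)
    tf_kernel = np.concatenate([
        np.concatenate([k_i, k_c, k_f, k_o], axis=1),   # input rows first
        np.concatenate([r_i, r_c, r_f, r_o], axis=1),   # recurrent rows second
    ], axis=0)
    # TF adds forget_bias when computing the gate, so remove it from the stored bias.
    tf_bias = np.concatenate([b_i, b_c, b_f - forget_bias, b_o])
    return tf_kernel, tf_bias


units, input_dim = 5, 1
# Placeholder values, not trained weights.
kernel = np.arange(4 * units, dtype=float).reshape(input_dim, 4 * units)
recurrent_kernel = 100.0 + np.arange(units * 4 * units, dtype=float).reshape(units, 4 * units)
bias = np.arange(4 * units, dtype=float)

tf_kernel, tf_bias = keras_to_tf_lstm(kernel, recurrent_kernel, bias)
print(tf_kernel.shape, tf_bias.shape)  # (6, 20) (20,)
```

Going the other way (TF to Keras) is the same swap in reverse, adding forget_bias back to the forget-gate bias.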
- ③ The recurrent_activation in the Keras LSTM (the $\sigma$ in the Keras computation logic of Part 1) is an implementation called hard_sigmoid, whereas both TF cells use the ordinary sigmoid. The activation is tanh in both Keras and Tensorflow, so that part is the same.
# recurrent_activation used by the Tensorflow LSTM.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# recurrent_activation used by the Keras LSTM:
# 0.2 * x + 0.5 clipped to [0, 1], i.e. 0 for x < -2.5 and 1 for x > 2.5.
def hard_sigmoid(x):
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)
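To see how far apart the two gate activations are, the sketch below evaluates both on a few sample points of my choosing. The hard_sigmoid here is written with `np.clip`, following the piecewise-linear definition 0.2 * x + 0.5 clipped to [0, 1].

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def hard_sigmoid(x):
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)


xs = np.array([-3.0, -2.5, -1.0, 0.0, 1.0, 2.5, 3.0])
s = sigmoid(xs)
hs = hard_sigmoid(xs)
for x, a, b in zip(xs, s, hs):
    print("x=%5.1f  sigmoid=%.4f  hard_sigmoid=%.4f" % (x, a, b))
```

The two agree at x = 0 and diverge elsewhere: hard_sigmoid saturates exactly at 0 and 1 once |x| reaches 2.5, while sigmoid only approaches those values. Weights trained with one activation will therefore not reproduce the same outputs under the other.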
- ④ Tensorflow also has a parameter called forget_bias, which defaults to 1.0. Its documentation reads:
Biases of the forget gate are initialized by default to 1 in order to reduce the scale of forgetting at the beginning of the training. Must set it manually to 0.0 when restoring from CudnnLSTM trained checkpoints.
It is applied in the forget gate (see the lstm_tf_verify function above), as follows:
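# Tensorflow adds forget_bias (default 1.0) at compute time
f = sigmoid(x_f + np.dot(h_tm_f, recurrent_f) + 1.0)
# Keras adds no such constant at compute time; with the default unit_forget_bias=True
# it instead initialises the forget-gate slice of the bias to ones (see build() above)
f = hard_sigmoid(x_f + np.dot(h_tm_f, recurrent_kernel_f))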
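One practical consequence: when TF weights are exported to an implementation that has no separate forget_bias term, the constant can be folded into the stored forget-gate bias. The sketch below checks that equivalence on random placeholder values; the variable names are my own.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.RandomState(1)
z = rng.randn(5)          # stand-in for x_t * W_f + h_{t-1} * U_f
b_stored = rng.randn(5)   # forget-gate bias as stored in the TF checkpoint
forget_bias = 1.0

# TF style: the constant is added when the gate is computed.
f_tf = sigmoid(z + b_stored + forget_bias)
# Folded style: the constant becomes part of the bias, nothing extra at compute time.
b_folded = b_stored + forget_bias
f_folded = sigmoid(z + b_folded)

print(np.allclose(f_tf, f_folded))  # True
```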
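- ⑤ The Keras LSTM implementation is clean, without a clutter of extra parameters, whereas Tensorflow lets you build LSTM variants directly on the cell, for example peephole connections [3], where the gate layers also receive the cell state as input.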
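4. Some thoughts
The TF and Keras LSTM implementations are inconsistent in several places, so handle them with care: identify the similarities and differences, then take the layer apart according to your own needs so the decoupling work goes smoothly.
That more or less wraps up the analysis of the Keras and Tensorflow LSTM layers. To understand the implementations more deeply, for example the backpropagation logic of layers that carry temporal information, I suggest digging further into the source; it is an area I do not know well myself. I look forward to exchanging ideas with you, thanks~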
5. References
[1] Netron: a viewer for neural network, deep learning and machine learning models.
[2] Understanding LSTM (Long Short-Term Memory) Networks
[3] Gers & Schmidhuber (2000): Recurrent Nets that Time and Count