Neural Networks and the Backpropagation Algorithm
The model is a feed-forward neural network, the optimization method is gradient descent, the partial derivatives are computed with the backpropagation algorithm, and the dataset is once again the handwritten digits.
This post focuses on the backpropagation algorithm (BP): it attaches some mathematical explanation and walks through the algorithm in detail.
For the forward-propagation algorithm (FP), see: 吳恩達機器學習CS229A_EX3_LR與NN手寫數字識別_Python3
Pay special attention to the difference between matrix multiplication (@) and element-wise multiplication (*) — mixing them up can cause errors that are very hard to debug.
Load and initialize the data; sklearn's OneHotEncoder is used here to turn the loaded label vector into its one-hot output form.
theta1 and theta2 are initialized as a single flat vector (so they can later be passed to the library routine that performs gradient descent), with every value randomized to the range -0.125 to +0.125. Without random initialization, the network's weight parameters would stay identical to one another and be largely redundant (the symmetry is never broken).
import numpy as np
from scipy.io import loadmat
from scipy.optimize import minimize
from sklearn.preprocessing import OneHotEncoder

def loadData(filename):
    return loadmat(filename)

def initData(data, input_size, hidden_size, output_size):
    # X
    X = data['X']
    # y
    y_load = data['y']
    encoder = OneHotEncoder(sparse=False)  # on sklearn >= 1.2, use sparse_output=False
    y = encoder.fit_transform(y_load)
    # randomize theta1 / theta2 as one flat vector
    params = (np.random.random(size=hidden_size * (input_size + 1) + output_size * (hidden_size + 1)) - 0.5) * 0.25
    return X, y_load, y, params
Two helper functions:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    return sigmoid(z) * (1 - sigmoid(z))
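As a quick sanity check (my own addition, not part of the original exercise), sigmoid_gradient can be compared against a centered finite difference of sigmoid itself:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    return sigmoid(z) * (1 - sigmoid(z))

# The analytic derivative should match a centered finite difference.
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
eps = 1e-5
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.max(np.abs(numeric - sigmoid_gradient(z))))  # tiny: the two agree
print(sigmoid_gradient(0.0))  # 0.25, the derivative's maximum
```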
The network structure is: input layer of 400 + 1 units, hidden layer of 25 + 1 units, output layer of 10 units (the +1 units are the biases).
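Given that layout, theta1 has shape (25, 401) and theta2 has shape (10, 26) (the extra column in each holds the bias weights), so the flat params vector must contain 25 · 401 + 10 · 26 values — a quick check of the sizes used throughout this post:

```python
input_size, hidden_size, output_size = 400, 25, 10
n_theta1 = hidden_size * (input_size + 1)   # 25 * 401 = 10025
n_theta2 = output_size * (hidden_size + 1)  # 10 * 26 = 260
print(n_theta1, n_theta2, n_theta1 + n_theta2)  # 10025 260 10285
```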
Forward propagation (already implemented in EX3):
def FP(X, theta1, theta2):
    m = X.shape[0]
    a1 = np.insert(X, 0, values=np.ones(m), axis=1)
    z2 = a1 @ theta1.T
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)
    z3 = a2 @ theta2.T
    h = sigmoid(z3)
    return a1, z2, a2, z3, h
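A small smoke test of FP (my own addition) with random stand-in data, just to confirm that the shapes come out as the 400/25/10 architecture predicts:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def FP(X, theta1, theta2):
    m = X.shape[0]
    a1 = np.insert(X, 0, values=np.ones(m), axis=1)
    z2 = a1 @ theta1.T
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)
    z3 = a2 @ theta2.T
    h = sigmoid(z3)
    return a1, z2, a2, z3, h

# Random stand-ins with the dimensions used in this post.
X = np.random.random((5, 400))
theta1 = np.random.random((25, 401)) - 0.5
theta2 = np.random.random((10, 26)) - 0.5
a1, z2, a2, z3, h = FP(X, theta1, theta2)
print(a1.shape, z2.shape, a2.shape, h.shape)  # (5, 401) (5, 25) (5, 26) (5, 10)
```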
Compute the cost according to the regularized cross-entropy formula (the bias weights are excluded from the regularization term):

J = -(1/m) · Σ_{i=1..m} Σ_{k=1..10} [ y_k(i) · log(h_k(i)) + (1 - y_k(i)) · log(1 - h_k(i)) ] + (λ/2m) · Σ θ²
def cost(X, y, theta1, theta2, lamda):
    m = len(y)
    a1, z2, a2, z3, h = FP(X, theta1, theta2)
    J = 0
    for i in range(m):
        first = - y[i, :] * np.log(h[i, :])
        second = - (1 - y[i, :]) * np.log(1 - h[i, :])
        J += np.sum(first + second)
    J = J / m
    # regularization term (bias columns excluded)
    J += (float(lamda) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))
    return J
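The per-sample loop is easy to follow, but since y and h are both (m, 10) matrices the same value can be computed in one shot. A sketch of my own (cost_vectorized is not a name from the exercise), checked against the loop on random data:

```python
import numpy as np

def cost_vectorized(y, h, theta1, theta2, lamda):
    # Cross-entropy over all samples and all 10 outputs in one expression.
    m = len(y)
    J = np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)) / m
    # Regularization skips the bias column of each theta.
    J += (lamda / (2 * m)) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
    return J

# Agreement check against the per-sample loop, on random values.
rng = np.random.default_rng(1)
m = 7
y = np.eye(10)[rng.integers(0, 10, m)]
h = rng.uniform(0.01, 0.99, (m, 10))
theta1 = rng.random((25, 401)) - 0.5
theta2 = rng.random((10, 26)) - 0.5
J_vec = cost_vectorized(y, h, theta1, theta2, 1.0)

J_loop = 0.0
for i in range(m):
    J_loop += np.sum(-y[i, :] * np.log(h[i, :]) - (1 - y[i, :]) * np.log(1 - h[i, :]))
J_loop = J_loop / m
J_loop += (1.0 / (2 * m)) * (np.sum(theta1[:, 1:] ** 2) + np.sum(theta2[:, 1:] ** 2))
print(abs(J_vec - J_loop))  # ~0
```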
The hardest part of the whole program is backpropagation, so first a brief outline of the principle (Andrew Ng's course covers this part only briefly — it is worth filling in with other material):
First, be clear about what BP actually does: it computes the partial derivative of the cost function J with respect to each parameter theta (also written as a weight). By the chain rule, this splits as:

∂J/∂θ(l)_ij = ( ∂J/∂z(l+1)_i ) · ( ∂z(l+1)_i / ∂θ(l)_ij )

The first factor is defined as the error term and written δ, i.e. δ(l)_i = ∂J/∂z(l)_i. Its intuitive meaning is the effect that a small change in the value z of a neuron in layer l has on the error of the NN's output — or put differently, how sensitive the network's final output error is to that neuron.
The second factor is easy to obtain: since z(l+1)_i = Σ_j θ(l)_ij · a(l)_j,

∂z(l+1)_i / ∂θ(l)_ij = a(l)_j

For the first factor, apply the chain rule once more to push the error back one layer (⊙ denotes element-wise multiplication):

δ(l) = ( (θ(l))ᵀ · δ(l+1) ) ⊙ σ′(z(l))

This gives us a way to compute the partial derivatives of J with respect to theta (weight), layer by layer from the output backwards; the concrete algorithm and the program below can be read side by side with it.
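To make the chain rule above concrete, here is a toy example of my own: a network with one input, one hidden unit, one output and no bias terms, where the backpropagated gradients are checked against finite differences of J:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy network: one input x, one hidden unit, one output, no bias terms.
x, y = 0.7, 1.0
w1, w2 = 0.3, -0.5

def J(w1, w2):
    h = sigmoid(w2 * sigmoid(w1 * x))
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

# Backpropagation via the chain rule.
z2 = w1 * x
a2 = sigmoid(z2)
h = sigmoid(w2 * a2)
d3 = h - y                      # output-layer error delta3
grad_w2 = d3 * a2               # dJ/dw2 = delta3 * a2
d2 = w2 * d3 * (a2 * (1 - a2))  # delta2 = w2 * delta3 * sigma'(z2)
grad_w1 = d2 * x                # dJ/dw1 = delta2 * x

# Check both against centered finite differences of J.
eps = 1e-6
num_w1 = (J(w1 + eps, w2) - J(w1 - eps, w2)) / (2 * eps)
num_w2 = (J(w1, w2 + eps) - J(w1, w2 - eps)) / (2 * eps)
print(abs(grad_w1 - num_w1), abs(grad_w2 - num_w2))  # both tiny
```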
First run forward propagation to compute a1, z2, a2, z3, h in turn; here h is simply a3.
Then process each sample t in turn:
Compute the output-layer error: d3 = h − y.
Compute the hidden-layer error: d2 = (θ2ᵀ · d3) ⊙ σ′(z2) (the bias entry is dropped afterwards).
Accumulate the resulting gradients into delta1 and delta2: delta1 += d2ᵀ · a1, delta2 += d3ᵀ · a2.
Divide the accumulated sums by the sample count m and add the regularization term (λ/m) · θ to the non-bias columns; this yields the final partial derivatives.
def BP(params, input_size, hidden_size, output_size, X, y, lamda):
    # unroll theta1 / theta2 from the flat params vector into np.array form
    theta1 = np.array(np.reshape(params[: hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.array(np.reshape(params[hidden_size * (input_size + 1):], (output_size, (hidden_size + 1))))
    # run FP to get every layer's values under the current parameters
    a1, z2, a2, z3, h = FP(X, theta1, theta2)
    # initialize delta1 / delta2
    delta1 = np.zeros(theta1.shape)  # (25, 401)
    delta2 = np.zeros(theta2.shape)  # (10, 26)
    # compute the cost
    J = cost(X, y, theta1, theta2, lamda)
    # number of samples
    m = X.shape[0]
    # BP: process one sample at a time
    for t in range(m):
        # pull out this sample's FP values
        a1t = a1[t, :].reshape(1, 401)  # (1, 401)
        z2t = z2[t, :].reshape(1, 25)   # (1, 25)
        a2t = a2[t, :].reshape(1, 26)   # (1, 26)
        ht = h[t, :].reshape(1, 10)     # (1, 10)
        yt = y[t, :].reshape(1, 10)     # (1, 10)
        # output-layer error
        d3t = ht - yt  # (1, 10)
        # hidden-layer error
        z2t = np.insert(z2t, 0, values=np.ones(1))          # (1, 26)
        d2t = (theta2.T @ d3t.T).T * sigmoid_gradient(z2t)  # (1, 26)
        # accumulate the errors into delta (dropping the bias entry of d2t)
        delta1 = delta1 + (d2t[:, 1:]).T.reshape(25, 1) @ a1t
        delta2 = delta2 + d3t.T.reshape(10, 1) @ a2t
    # final partial derivative values
    delta1 = delta1 / m
    delta2 = delta2 / m
    # add the regularization term
    delta1[:, 1:] = delta1[:, 1:] + (theta1[:, 1:] * lamda) / m
    delta2[:, 1:] = delta2[:, 1:] + (theta2[:, 1:] * lamda) / m
    # flatten delta1 / delta2 back into one vector for the optimizer
    grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))
    return J, grad
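For reference, the per-sample loop in BP can be replaced with matrix operations over all m samples at once — a sketch of my own (vectorized_grad is not part of the exercise code), which reproduces the loop's result on random data:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_gradient(z):
    return sigmoid(z) * (1 - sigmoid(z))

def vectorized_grad(theta1, theta2, a1, z2, a2, h, y, lamda):
    # Same gradient as BP's per-sample t-loop, computed for all m samples at once.
    m = a1.shape[0]
    d3 = h - y                                        # (m, 10)
    d2 = (d3 @ theta2)[:, 1:] * sigmoid_gradient(z2)  # (m, 25); bias column dropped
    delta1 = d2.T @ a1 / m                            # (25, 401)
    delta2 = d3.T @ a2 / m                            # (10, 26)
    delta1[:, 1:] += theta1[:, 1:] * lamda / m
    delta2[:, 1:] += theta2[:, 1:] * lamda / m
    return delta1, delta2

# Compare with the per-sample loop on small random data.
rng = np.random.default_rng(0)
m = 6
a1 = np.insert(rng.random((m, 400)), 0, 1, axis=1)
theta1 = rng.random((25, 401)) - 0.5
theta2 = rng.random((10, 26)) - 0.5
z2 = a1 @ theta1.T
a2 = np.insert(sigmoid(z2), 0, 1, axis=1)
h = sigmoid(a2 @ theta2.T)
y = np.eye(10)[rng.integers(0, 10, m)]
delta1, delta2 = vectorized_grad(theta1, theta2, a1, z2, a2, h, y, 1.0)

# The loop version, as in BP above.
ld1, ld2 = np.zeros_like(theta1), np.zeros_like(theta2)
for t in range(m):
    d3t = (h[t] - y[t]).reshape(1, 10)
    z2t = np.insert(z2[t], 0, 1)
    d2t = (theta2.T @ d3t.T).T * sigmoid_gradient(z2t)
    ld1 += d2t[:, 1:].T @ a1[t].reshape(1, 401)
    ld2 += d3t.T @ a2[t].reshape(1, 26)
ld1, ld2 = ld1 / m, ld2 / m
ld1[:, 1:] += theta1[:, 1:] * 1.0 / m
ld2[:, 1:] += theta2[:, 1:] * 1.0 / m
print(np.max(np.abs(delta1 - ld1)), np.max(np.abs(delta2 - ld2)))  # both ~0
```

The speedup matters here: the loop does 5000 small matrix products per gradient evaluation, while the vectorized form does two large ones.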
If you match the steps and formulas described above against the program one line at a time, it is fairly easy to follow; to understand deeply why it works, though, it is best to consult material on the mathematical derivation of BP.
The next step is the gradient check that Andrew Ng strongly recommends:
It uses numerically approximated derivatives to verify that the BP implementation is correct, because a buggy BP implementation can sometimes still run and simply produce wrong results.
The check is most valuable when working with very large datasets or very complex network architectures; in this exercise, training the NN takes far less time than the gradient check itself, so the step is skipped here.
The last step is to run and test the algorithm. Because the NN cost function is non-convex, results can differ from run to run; the output below is from a run with fairly high accuracy:
def predict(fmin_x, input_size, hidden_size, output_size, X, y_load):
    theta1 = np.array(np.reshape(fmin_x[: hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.array(np.reshape(fmin_x[hidden_size * (input_size + 1):], (output_size, (hidden_size + 1))))
    a1, z2, a2, z3, h = FP(X, theta1, theta2)
    y_pred = np.array(np.argmax(h, axis=1) + 1)
    correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y_load)]
    accuracy = (sum(map(int, correct)) / float(len(correct)))
    print('accuracy = {0}%'.format(accuracy * 100))

def main():
    input_size = 400
    hidden_size = 25
    output_size = 10
    lamda = 0.1
    data = loadData('ex4data1.mat')
    X, y_load, y, params = initData(data, input_size, hidden_size, output_size)
    # gradient_check(params, input_size, hidden_size, output_size, X, y, lamda)
    fmin = minimize(fun=BP, x0=params, args=(input_size, hidden_size, output_size, X, y, lamda),
                    method='TNC', jac=True, options={'maxiter': 250})
    print(fmin)
    predict(fmin.x, input_size, hidden_size, output_size, X, y_load)
accuracy = 99.92%
Process finished with exit code 0
Finally, theta1 is visualized — though it still doesn't seem to reveal much:
def showHidden(fmin_x, input_size, hidden_size):
    theta1 = np.array(np.reshape(fmin_x[: hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    hidden_layer = theta1[:, 1:]
    fig, ax_array = plt.subplots(nrows=5, ncols=5, sharey=True, sharex=True, figsize=(5, 5))
    for r in range(5):
        for c in range(5):
            ax_array[r, c].matshow(hidden_layer[5 * r + c].reshape((20, 20)), cmap=matplotlib.cm.binary)
            plt.xticks(np.array([]))
            plt.yticks(np.array([]))
    plt.show()
Finally, the complete program:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.io import loadmat
from scipy.optimize import minimize
from sklearn.preprocessing import OneHotEncoder

def loadData(filename):
    return loadmat(filename)

def initData(data, input_size, hidden_size, output_size):
    # X
    X = data['X']
    # y
    y_load = data['y']
    encoder = OneHotEncoder(sparse=False)  # on sklearn >= 1.2, use sparse_output=False
    y = encoder.fit_transform(y_load)
    # randomize theta1 / theta2 as one flat vector
    params = (np.random.random(size=hidden_size * (input_size + 1) + output_size * (hidden_size + 1)) - 0.5) * 0.25
    return X, y_load, y, params
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def FP(X, theta1, theta2):
    m = X.shape[0]
    a1 = np.insert(X, 0, values=np.ones(m), axis=1)
    z2 = a1 @ theta1.T
    a2 = np.insert(sigmoid(z2), 0, values=np.ones(m), axis=1)
    z3 = a2 @ theta2.T
    h = sigmoid(z3)
    return a1, z2, a2, z3, h
# compute the cost
def cost(X, y, theta1, theta2, lamda):
    m = len(y)
    a1, z2, a2, z3, h = FP(X, theta1, theta2)
    J = 0
    for i in range(m):
        first = - y[i, :] * np.log(h[i, :])
        second = - (1 - y[i, :]) * np.log(1 - h[i, :])
        J += np.sum(first + second)
    J = J / m
    # regularization term (bias columns excluded)
    J += (float(lamda) / (2 * m)) * (np.sum(np.power(theta1[:, 1:], 2)) + np.sum(np.power(theta2[:, 1:], 2)))
    return J
def sigmoid_gradient(z):
    return sigmoid(z) * (1 - sigmoid(z))
def BP(params, input_size, hidden_size, output_size, X, y, lamda):
    # unroll theta1 / theta2 from the flat params vector into np.array form
    theta1 = np.array(np.reshape(params[: hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.array(np.reshape(params[hidden_size * (input_size + 1):], (output_size, (hidden_size + 1))))
    # run FP to get every layer's values under the current parameters
    a1, z2, a2, z3, h = FP(X, theta1, theta2)
    # initialize delta1 / delta2
    delta1 = np.zeros(theta1.shape)  # (25, 401)
    delta2 = np.zeros(theta2.shape)  # (10, 26)
    # compute the cost
    J = cost(X, y, theta1, theta2, lamda)
    # number of samples
    m = X.shape[0]
    # BP: process one sample at a time
    for t in range(m):
        # pull out this sample's FP values
        a1t = a1[t, :].reshape(1, 401)  # (1, 401)
        z2t = z2[t, :].reshape(1, 25)   # (1, 25)
        a2t = a2[t, :].reshape(1, 26)   # (1, 26)
        ht = h[t, :].reshape(1, 10)     # (1, 10)
        yt = y[t, :].reshape(1, 10)     # (1, 10)
        # output-layer error
        d3t = ht - yt  # (1, 10)
        # hidden-layer error
        z2t = np.insert(z2t, 0, values=np.ones(1))          # (1, 26)
        d2t = (theta2.T @ d3t.T).T * sigmoid_gradient(z2t)  # (1, 26)
        # accumulate the errors into delta (dropping the bias entry of d2t)
        delta1 = delta1 + (d2t[:, 1:]).T.reshape(25, 1) @ a1t
        delta2 = delta2 + d3t.T.reshape(10, 1) @ a2t
    # final partial derivative values
    delta1 = delta1 / m
    delta2 = delta2 / m
    # add the regularization term
    delta1[:, 1:] = delta1[:, 1:] + (theta1[:, 1:] * lamda) / m
    delta2[:, 1:] = delta2[:, 1:] + (theta2[:, 1:] * lamda) / m
    # flatten delta1 / delta2 back into one vector for the optimizer
    grad = np.concatenate((np.ravel(delta1), np.ravel(delta2)))
    return J, grad
def gradient_check(params, input_size, hidden_size, output_size, X, y, lamda):
    # compare the analytic gradient from BP against a centered finite difference
    J, grad = BP(params, input_size, hidden_size, output_size, X, y, lamda)
    eps = 1e-3
    # checking every parameter is very slow; sample a random subset instead
    idx = np.random.choice(len(params), 100, replace=False)
    max_error = 0
    for i in idx:
        p_plus = params.copy()
        p_plus[i] += eps
        p_minus = params.copy()
        p_minus[i] -= eps
        J_plus, _ = BP(p_plus, input_size, hidden_size, output_size, X, y, lamda)
        J_minus, _ = BP(p_minus, input_size, hidden_size, output_size, X, y, lamda)
        numeric = (J_plus - J_minus) / (2 * eps)
        max_error = max(max_error, abs(numeric - grad[i]))
    print('max error = {0}'.format(max_error))
def predict(fmin_x, input_size, hidden_size, output_size, X, y_load):
    theta1 = np.array(np.reshape(fmin_x[: hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    theta2 = np.array(np.reshape(fmin_x[hidden_size * (input_size + 1):], (output_size, (hidden_size + 1))))
    a1, z2, a2, z3, h = FP(X, theta1, theta2)
    y_pred = np.array(np.argmax(h, axis=1) + 1)
    correct = [1 if a == b else 0 for (a, b) in zip(y_pred, y_load)]
    accuracy = (sum(map(int, correct)) / float(len(correct)))
    print('accuracy = {0}%'.format(accuracy * 100))

def showHidden(fmin_x, input_size, hidden_size):
    theta1 = np.array(np.reshape(fmin_x[: hidden_size * (input_size + 1)], (hidden_size, (input_size + 1))))
    hidden_layer = theta1[:, 1:]
    fig, ax_array = plt.subplots(nrows=5, ncols=5, sharey=True, sharex=True, figsize=(5, 5))
    for r in range(5):
        for c in range(5):
            ax_array[r, c].matshow(hidden_layer[5 * r + c].reshape((20, 20)), cmap=matplotlib.cm.binary)
            plt.xticks(np.array([]))
            plt.yticks(np.array([]))
    plt.show()
def main():
    input_size = 400
    hidden_size = 25
    output_size = 10
    lamda = 1
    data = loadData('ex4data1.mat')
    X, y_load, y, params = initData(data, input_size, hidden_size, output_size)
    # gradient_check(params, input_size, hidden_size, output_size, X, y, lamda)
    fmin = minimize(fun=BP, x0=params, args=(input_size, hidden_size, output_size, X, y, lamda),
                    method='TNC', jac=True, options={'maxiter': 250})
    print(fmin)
    predict(fmin.x, input_size, hidden_size, output_size, X, y_load)
    showHidden(fmin.x, input_size, hidden_size)

main()