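How to iteratively optimize a loss function is one of the central topics of machine learning. Once a machine learning problem has a concrete model and an evaluation criterion, it can be formalized as an optimization problem. This is why optimization theory and convex optimization are regarded as pillars of machine learning. From a purely mathematical point of view, mathematical models differ in form, yet almost all of them eventually reduce to optimization problems. So for anyone determined to work in machine learning and deep learning, learning optimization well is a duty that cannot be shirked.
When it comes to optimization algorithms for machine learning and deep learning, gradient descent is the core. Many optimizers have appeared as neural networks developed, but by and large they stay within the framework of gradient descent. In this note we review the optimization algorithms commonly used in deep learning. In the earlier hands-on posts, where we built neural networks by hand, we optimized the loss function with plain gradient descent, so this summary starts from gradient descent.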
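Gradient Descent
Gradient descent should be familiar by now: updating the parameters along the negative gradient direction is standard practice. Without further ado, here is how to run gradient descent on a multi-layer neural network: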
def update_parameters_with_gd(parameters, grads, learning_rate):
    """
    Update parameters using one step of gradient descent

    Arguments:
    parameters -- python dictionary containing your parameters to be updated:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients to update each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    learning_rate -- the learning rate, scalar.

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    L = len(parameters) // 2  # number of layers in the neural networks

    # Update rule for each parameter
    for l in range(L):
        parameters['W' + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * grads['dW' + str(l+1)]
        parameters['b' + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * grads['db' + str(l+1)]
    return parameters
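In the code above, we pass in the dictionary of weights and biases, the dictionary of gradients, and the learning rate, and apply the gradient descent update rule to every layer: $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$ and $b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$, where $\alpha$ is the learning rate. That is all a simple multi-layer gradient descent needs.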
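To make the update rule concrete, here is a minimal sketch of a single gradient descent step on a one-layer parameter dictionary. The helper name `gd_step` and all numbers are illustrative assumptions, not part of the original code; only the 'W1'/'dW1' key convention is taken from the function above.

```python
import numpy as np


def gd_step(parameters, grads, learning_rate):
    """One gradient descent step (hypothetical helper for illustration).

    Every key such as 'W1' is paired with its gradient key 'dW1'.
    """
    return {key: value - learning_rate * grads["d" + key]
            for key, value in parameters.items()}


# Made-up parameters and gradients for a single layer.
parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[0.1, -0.2]]), "db1": np.array([[1.0]])}

updated = gd_step(parameters, grads, learning_rate=0.1)
print(updated["W1"])  # W1 - 0.1 * dW1, approximately [[0.99, 2.02]]
print(updated["b1"])  # b1 - 0.1 * db1, approximately [[0.4]]
```

Each entry moves against its own gradient by a distance scaled by the learning rate, which is exactly what the loop over layers does one layer at a time.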
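Mini-batch Gradient Descent
In industrial settings, running gradient descent directly on a large data set is often slow, because every parameter update requires a pass over the entire training set. This is where splitting the training set into smaller subsets becomes important. Each such subset is called a mini-batch. Running a gradient descent update on one mini-batch at a time greatly improves training efficiency. In code, mini-batch gradient descent usually involves two steps: shuffling the data thoroughly (shuffle) and grouping it into batches (partition), as shown in the figures below.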
[Figure: shuffle]
[Figure: partition]
The code is as follows:
import math

import numpy as np


def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """
    Creates a list of random minibatches from (X, Y)

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (1, number of examples)
    mini_batch_size -- size of the mini-batches, integer
    seed -- seed for NumPy's random generator, so the shuffle is reproducible

    Returns:
    mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
    """
    np.random.seed(seed)
    m = X.shape[1]  # number of training examples
    mini_batches = []

    # Step 1: Shuffle (X, Y) with one shared permutation so each example keeps its label
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1, m))

    # Step 2: Partition (shuffled_X, shuffled_Y). Minus the end case.
    num_complete_minibatches = math.floor(m / mini_batch_size)
    for k in range(0, num_complete_minibatches):
        # Slice the k-th block of columns (not the first block every time)
        mini_batch_X = shuffled_X[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch_Y = shuffled_Y[:, k * mini_batch_size:(k + 1) * mini_batch_size]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)

    # Handling the end case (last mini-batch < mini_batch_size)
    if m % mini_batch_size != 0:
        # The remaining columns start right after the last complete mini-batch
        mini_batch_X = shuffled_X[:, num_complete_minibatches * mini_batch_size:]
        mini_batch_Y = shuffled_Y[:, num_complete_minibatches * mini_batch_size:]
        mini_batch = (mini_batch_X, mini_batch_Y)
        mini_batches.append(mini_batch)
    return mini_batches
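The idea behind mini-batch gradient descent is clear: shuffle the data first, then partition it. The detail to watch is the number of training examples in the last mini-batch: when the number of examples m is not a multiple of mini_batch_size, the last mini-batch holds fewer examples than the others.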
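The end case is easy to check on toy data. The sketch below is a compact, self-contained variant of the shuffle-then-partition idea; the helper `minibatch_slices`, the toy arrays, and the use of `np.random.default_rng` are assumptions made for illustration, not the original implementation.

```python
import numpy as np


def minibatch_slices(m, mini_batch_size):
    """Return (start, stop) column ranges that cover m examples (hypothetical helper)."""
    return [(start, min(start + mini_batch_size, m))
            for start in range(0, m, mini_batch_size)]


# Toy data: 148 examples; row 0 of X and the label both store the example index,
# which makes it easy to verify that examples and labels stay paired.
m = 148
X = np.vstack([np.arange(m), 10 * np.arange(m)])
Y = np.arange(m).reshape(1, m)

# Step 1: shuffle X and Y with the same permutation.
rng = np.random.default_rng(0)
permutation = rng.permutation(m)
shuffled_X, shuffled_Y = X[:, permutation], Y[:, permutation]

# Step 2: partition into mini-batches of (at most) 64 examples.
mini_batches = [(shuffled_X[:, a:b], shuffled_Y[:, a:b])
                for a, b in minibatch_slices(m, 64)]

sizes = [batch_X.shape[1] for batch_X, _ in mini_batches]
aligned = all(np.array_equal(batch_X[0], batch_Y[0])
              for batch_X, batch_Y in mini_batches)
print(sizes)    # [64, 64, 20]: two full mini-batches plus a smaller last one
print(aligned)  # True: every example still sits next to its own label
```

Since 148 = 2 × 64 + 20, the last mini-batch holds the 20 leftover examples.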
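Stochastic Gradient Descent
When each mini-batch contains a single training example, mini-batch gradient descent becomes stochastic gradient descent (SGD). Each SGD update is cheap because it uses only one example, but SGD gives up the speedup that vectorized computation provides, so it is not efficient on larger data sets.
Let us look at how gradient descent and stochastic gradient descent differ in implementation: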
# GD
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    # Forward propagation
    a, caches = forward_propagation(X, parameters)
    # Compute cost.
    cost = compute_cost(a, Y)
    # Backward propagation.
    grads = backward_propagation(a, caches, parameters)
    # Update parameters.
    parameters = update_parameters(parameters, grads)

# SGD
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
    for j in range(0, m):
        # Forward propagation
        a, caches = forward_propagation(X[:,j], parameters)
        # Compute cost
        cost = compute_cost(a, Y[:,j])
        # Backward propagation
        grads = backward_propagation(a, caches, parameters)
        # Update parameters.
        parameters = update_parameters(parameters, grads)
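So in essence, gradient descent, mini-batch gradient descent, and stochastic gradient descent are the same algorithm. The only difference is the number of training examples used for one parameter update: gradient descent uses the full training set, stochastic gradient descent uses a single example, and mini-batch gradient descent lies between the two.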
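One way to see this is a single training loop in which the batch size is the only knob. The sketch below fits a toy linear model; the function `train`, the noise-free data, and the hyperparameter values are illustrative assumptions rather than anything from the original posts.

```python
import numpy as np


def train(X, Y, batch_size, epochs=500, learning_rate=0.1, seed=0):
    """Fit y = w * x + b by gradient descent on batches of `batch_size` examples.

    Hypothetical helper: X and Y have shape (1, m), as in the text.
    Returns the fitted w, b and the number of parameter updates performed.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    w, b, num_updates = 0.0, 0.0, 0
    for _ in range(epochs):
        permutation = rng.permutation(m)  # reshuffle at every epoch
        for start in range(0, m, batch_size):
            idx = permutation[start:start + batch_size]
            x_batch, y_batch = X[:, idx], Y[:, idx]
            error = w * x_batch + b - y_batch   # forward pass
            dw = np.mean(error * x_batch)       # gradient of 0.5 * mean(error ** 2)
            db = np.mean(error)
            w -= learning_rate * dw
            b -= learning_rate * db
            num_updates += 1
    return w, b, num_updates


m = 32
X = np.linspace(-1.0, 1.0, m).reshape(1, m)
Y = 2.0 * X + 1.0  # noise-free toy data with true w = 2 and b = 1

results = {name: train(X, Y, batch_size)
           for name, batch_size in [("GD", m), ("mini-batch", 8), ("SGD", 1)]}
for name, (w, b, num_updates) in results.items():
    print(f"{name:10s} w={w:.4f} b={b:.4f} updates={num_updates}")
```

With 500 epochs over 32 examples, the same code performs 500 updates for full-batch gradient descent, 2000 for mini-batches of 8, and 16000 for SGD; only `batch_size` changes between the three runs.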
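Gradient Descent with Momentum
As in the figure above, suppose the horizontal axis is the descent direction of the parameter W and the vertical axis is the descent direction of the bias b. We would like smaller oscillations and slower learning along the vertical axis, and faster learning along the horizontal axis. Neither mini-batch gradient descent nor stochastic gradient descent seems able to avoid this problem. Gradient descent with momentum was introduced to address it: it uses an exponentially weighted average of past gradients as the velocity for the update. The formulas are:
$$v_{dW} = \beta\, v_{dW} + (1-\beta)\, dW, \qquad v_{db} = \beta\, v_{db} + (1-\beta)\, db$$
$$W = W - \alpha\, v_{dW}, \qquad b = b - \alpha\, v_{db}$$
Here $\alpha$ is the learning rate and $\beta$ is the momentum hyperparameter.
Implementing gradient descent with momentum according to the formulas above: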
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """
    Update parameters using Momentum

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- python dictionary containing the current velocity:
                    v['dW' + str(l)] = ...
                    v['db' + str(l)] = ...
    beta -- the momentum hyperparameter, scalar
    learning_rate -- the learning rate, scalar

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- python dictionary containing your updated velocities
    """
    L = len(parameters) // 2  # number of layers in the neural networks

    # Momentum update for each parameter
    for l in range(L):
        # compute velocities
        v['dW' + str(l+1)] = beta * v['dW' + str(l+1)] + (1-beta) * grads['dW' + str(l+1)]
        v['db' + str(l+1)] = beta * v['db' + str(l+1)] + (1-beta) * grads['db' + str(l+1)]
        # update parameters
        parameters['W' + str(l+1)] = parameters['W' + str(l+1)] - learning_rate * v['dW' + str(l+1)]
        parameters['b' + str(l+1)] = parameters['b' + str(l+1)] - learning_rate * v['db' + str(l+1)]
    return parameters, v
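Two points are key when implementing gradient descent with momentum. First, momentum takes the history of gradients into account when descending. Second, there are now two hyperparameters to specify: the learning rate learning_rate and the gradient weighting parameter beta.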
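The momentum update needs a velocity dictionary `v` before the first step, and it helps to see why averaging past gradients damps oscillation. The sketch below assumes a zero initialization through a hypothetical helper `initialize_velocity` (not shown in the text) and feeds the velocity a made-up gradient whose first component is steady and whose second component flips sign at every step.

```python
import numpy as np


def initialize_velocity(parameters):
    """Zero velocity for each parameter, keyed 'dW1', 'db1', ... (hypothetical helper)."""
    return {"d" + key: np.zeros_like(value) for key, value in parameters.items()}


beta = 0.9
parameters = {"W1": np.zeros((1, 2))}
v = initialize_velocity(parameters)

for step in range(50):
    sign = 1.0 if step % 2 == 0 else -1.0
    # Component 0: consistent direction. Component 1: pure oscillation.
    grads = {"dW1": np.array([[1.0, sign]])}
    v["dW1"] = beta * v["dW1"] + (1 - beta) * grads["dW1"]

steady, oscillating = v["dW1"][0]
print(steady)       # close to 1: the consistent direction is kept almost in full
print(oscillating)  # magnitude around 0.05: the alternating gradients mostly cancel
```

The steady component approaches the raw gradient while the oscillating one shrinks to a small fraction of it, which is the behavior wanted along the horizontal and vertical axes respectively.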
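The Adam Algorithm
Adam stands for Adaptive Moment Estimation. It combines gradient descent with momentum with an algorithm called RMSprop (Root Mean Square Propagation). Compared with gradient descent with momentum, the idea behind both RMSprop and Adam is to make learning faster along the horizontal axis and slower along the vertical axis. RMSprop introduces an exponentially weighted average of the squared gradients; Adam keeps both the momentum average and the squared-gradient average and applies bias correction to them. The formulas are:
$$v_{dW} = \beta_1 v_{dW} + (1-\beta_1)\, dW, \qquad v^{corrected}_{dW} = \frac{v_{dW}}{1-\beta_1^{t}}$$
$$s_{dW} = \beta_2 s_{dW} + (1-\beta_2)\, dW^{2}, \qquad s^{corrected}_{dW} = \frac{s_{dW}}{1-\beta_2^{t}}$$
$$W = W - \alpha\, \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \varepsilon}$$
The same update applies to b with db in place of dW. Here $dW^{2}$ is the element-wise square, $\alpha$ is the learning rate, and $t$ counts the updates performed so far, starting from 1.
The implementation is as follows: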
import numpy as np


def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients for each parameters:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- Adam step counter, integer >= 1 (t = 0 would divide by zero in the bias correction)
    learning_rate -- the learning rate, scalar.
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in Adam updates

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """
    L = len(parameters) // 2  # number of layers in the neural networks
    v_corrected = {}
    s_corrected = {}

    # Perform Adam update on all parameters
    for l in range(L):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        v["dW" + str(l+1)] = beta1 * v["dW" + str(l+1)] + (1 - beta1) * grads['dW' + str(l+1)]
        v["db" + str(l+1)] = beta1 * v["db" + str(l+1)] + (1 - beta1) * grads['db' + str(l+1)]

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)] / (1 - beta1**t)
        v_corrected["db" + str(l+1)] = v["db" + str(l+1)] / (1 - beta1**t)

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        s["dW" + str(l+1)] = beta2 * s["dW" + str(l+1)] + (1 - beta2) * (grads["dW" + str(l+1)])**2
        s["db" + str(l+1)] = beta2 * s["db" + str(l+1)] + (1 - beta2) * (grads["db" + str(l+1)])**2

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)] / (1 - beta2**t)
        s_corrected["db" + str(l+1)] = s["db" + str(l+1)] / (1 - beta2**t)

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * v_corrected["dW" + str(l+1)] / (np.sqrt(s_corrected["dW" + str(l+1)]) + epsilon)
        parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * v_corrected["db" + str(l+1)] / (np.sqrt(s_corrected["db" + str(l+1)]) + epsilon)
    return parameters, v, s
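The role of bias correction is easiest to see on the very first update. The sketch below assumes that `v` and `s` start at zero and that the step counter starts at `t = 1`; the function `adam_first_step` and the sample gradient are hypothetical and exist only for this illustration.

```python
import numpy as np


def adam_first_step(grad, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Run one Adam update from zero-initialized moments (hypothetical helper).

    Returns the raw first moment, its bias-corrected value, and the parameter step.
    """
    t = 1                      # first update; t = 0 would divide by zero below
    v = np.zeros_like(grad)    # moving average of gradients
    s = np.zeros_like(grad)    # moving average of squared gradients

    v = beta1 * v + (1 - beta1) * grad
    s = beta2 * s + (1 - beta2) * grad ** 2

    v_corrected = v / (1 - beta1 ** t)
    s_corrected = s / (1 - beta2 ** t)

    step = learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return v, v_corrected, step


grad = np.array([100.0, -0.001])   # two gradients that differ by five orders of magnitude
v, v_corrected, step = adam_first_step(grad)

print(v)            # raw average is only (1 - beta1) * grad, biased toward zero
print(v_corrected)  # bias correction recovers approximately the gradient itself
print(step)         # approximately learning_rate * sign(grad) for both components
```

Without the correction, the first few averages are shrunk toward their zero initialization. With it, the first step has a magnitude close to the learning rate for both components, regardless of how large or small the raw gradient is. In a training loop the caller would increase `t` by one after every update.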