TensorFlow supports 11 different optimizers, including:
tf.train.Optimizer
tf.train.GradientDescentOptimizer
tf.train.AdadeltaOptimizer
tf.train.AdagradOptimizer
tf.train.AdagradDAOptimizer
tf.train.MomentumOptimizer
tf.train.AdamOptimizer
tf.train.FtrlOptimizer
tf.train.RMSPropOptimizer
tf.train.ProximalAdagradOptimizer
tf.train.ProximalGradientDescentOptimizer
Three of them are the most commonly used:
(1) GradientDescent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
Uses the (stochastic) gradient descent algorithm: the parameters are moved in the direction opposite to the gradient, i.e., the direction in which the total loss decreases, which updates them as follows.
$$W^{[l]} = W^{[l]} - \alpha \, dW^{[l]}$$
$$b^{[l]} = b^{[l]} - \alpha \, db^{[l]}$$
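As a minimal usage sketch (the toy 1-D linear model, the arrays x_data and y_data, and the learning rate 0.1 are made-up illustrations, assuming the TensorFlow 1.x API used throughout this post), the minimize(loss) op above is simply run repeatedly inside a session:

import numpy as np
import tensorflow as tf

# Toy data for a 1-D linear fit (illustrative only)
x_data = np.random.rand(100).astype(np.float32)
y_data = 0.3 * x_data + 0.1

# Model parameters and prediction
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))
y = W * x_data + b

# Mean-squared-error loss and the SGD training op
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(200):
        sess.run(optimizer)  # each run applies W = W - alpha*dW, b = b - alpha*db
    print(sess.run([W, b]))  # should approach [0.3, 0.1]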
(2) Momentum
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss)
When updating the parameters, a momentum hyperparameter is used to accumulate an exponentially weighted average of past gradients:
$$\begin{cases}
v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta)\, dW^{[l]} \\
v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta)\, db^{[l]}
\end{cases}$$
$$\begin{cases}
W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}} \\
b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}}
\end{cases}$$
where
$\beta$: the momentum
$\alpha$: the learning rate
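The two steps above translate directly into a short NumPy sketch (the function, its arguments, and the defaults beta=0.9 and alpha=0.01 are illustrative assumptions, not part of the TensorFlow API):

import numpy as np

def momentum_update(W, b, dW, db, v_dW, v_db, beta=0.9, alpha=0.01):
    # Exponentially weighted average of past gradients
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Move the parameters along the smoothed gradient
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db

Because v_dW averages recent gradients, oscillations across the loss surface partially cancel out and the update accelerates along the consistent descent direction.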
(3) Adam
optimizer = tf.train.AdamOptimizer(learning_rate=0.001,
                                   beta1=0.9, beta2=0.999,
                                   epsilon=1e-08).minimize(loss)
Adam is an optimization algorithm with adaptive learning rates (here learning_rate is passed as a fixed value; the exponential-decay schedule is not used). It differs from stochastic gradient descent: SGD maintains a single learning rate for updating all parameters, and that rate does not change during training, whereas Adam computes first-moment and second-moment estimates of the gradients to derive an independent adaptive learning rate for each parameter.
$$\begin{cases}
v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J}}{\partial W^{[l]}} \\
v_{db^{[l]}} = \beta_1 v_{db^{[l]}} + (1 - \beta_1) \frac{\partial \mathcal{J}}{\partial b^{[l]}}
\end{cases} \quad (\text{moment: } \beta_1)$$
$$\begin{cases}
s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) \left(\frac{\partial \mathcal{J}}{\partial W^{[l]}}\right)^2 \\
s_{db^{[l]}} = \beta_2 s_{db^{[l]}} + (1 - \beta_2) \left(\frac{\partial \mathcal{J}}{\partial b^{[l]}}\right)^2
\end{cases} \quad (\text{RMSprop: } \beta_2)$$
$$\begin{cases}
v^{corrected}_{dW^{[l]}} = \frac{v_{dW^{[l]}}}{1 - (\beta_1)^t} \\
v^{corrected}_{db^{[l]}} = \frac{v_{db^{[l]}}}{1 - (\beta_1)^t} \\
s^{corrected}_{dW^{[l]}} = \frac{s_{dW^{[l]}}}{1 - (\beta_2)^t} \\
s^{corrected}_{db^{[l]}} = \frac{s_{db^{[l]}}}{1 - (\beta_2)^t}
\end{cases} \quad (\text{bias correction})$$
$$\begin{cases}
W^{[l]} = W^{[l]} - \alpha \frac{v^{corrected}_{dW^{[l]}}}{\sqrt{s^{corrected}_{dW^{[l]}}} + \varepsilon} \\
b^{[l]} = b^{[l]} - \alpha \frac{v^{corrected}_{db^{[l]}}}{\sqrt{s^{corrected}_{db^{[l]}}} + \varepsilon}
\end{cases}$$
where
$\beta_1$ and $\beta_2$ are hyperparameters that control the two exponentially weighted averages.
$\alpha$ is the learning rate.
$\varepsilon$ is a very small number to avoid division by zero.
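Putting the four Adam steps together, here is an illustrative NumPy sketch for a single parameter tensor (the function name is made up; the default values mirror the AdamOptimizer arguments shown above, and t is the 1-based iteration counter assumed for this sketch, not TensorFlow code):

import numpy as np

def adam_update(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First-moment (moment) and second-moment (RMSprop) estimates
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * (dW ** 2)
    # Bias correction, which matters mostly during the first few iterations
    v_corr = v / (1 - beta1 ** t)
    s_corr = s / (1 - beta2 ** t)
    # Per-parameter adaptive step
    W = W - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return W, v, s

The same update is applied to b with db. Because s_corr is computed element-wise, each parameter effectively receives its own step size, which is what "adaptive learning rate" means here.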