Adaptive Learning Rate Adjustment: AdaDelta

This article is reposted from: https://www.cnblogs.com/neopenx/p/4768388.html Author: neopenx. Please keep this notice when reposting.

Reference: ADADELTA: An Adaptive Learning Rate Method

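Hyperparameters

Hyperparameters are one of the persistent headaches of neural network training, because they cannot be learned by the usual training procedure.

The five classic hyperparameters of a neural network:

Learning rate, weight initialization, number of layers, units per layer, and the regularization penalty (Regularizer|Normalization)

These five hyperparameters make neural networks feel more like a lab course than a theory course.

Understanding a neural network may take an hour, but tuning one may take days.

This is why Vapnik, when he later developed the SVM (support vector machine), reformulated the objective function so that most of the hyperparameters of a traditional neural network disappear. In particular, adaptively chosen support vectors replace hand-set neurons, which makes the SVM far less prone to overfitting.

The traditional defence against these hyperparameters is rules of thumb.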

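In recent years, with the rise of deep learning, the number of neural network researchers worldwide has surged, and many research groups have taken up the hyperparameter optimization problem:

★ The RBM, a pioneer of deep learning, uses pre-training to adaptively find suitable initial weights.

★ The LSTM (long short-term memory) network from the end of the last century can be viewed as "a neural network nested inside a neural network" that adaptively and dynamically optimizes the number of layers.

★ In 2010, Duchi et al. introduced AdaGrad, which adapts the learning rate.

Adaptive learning-rate methods are currently a hot research topic. One classic is AdaDelta, proposed by Matthew D. Zeiler in 2012 while he was interning at Google.

Matthew D. Zeiler also studied under Hinton and is something of a business talent: as a sophomore he ran a company selling used textbooks.

After finishing his PhD he founded Clarifai, valued at five million dollars. See [Zhihu column]

Clarifai's standout achievement was winning ImageNet 2013. When the CNN architecture was later published, frameworks such as Caffe and Torch could not reproduce the results he obtained in the competition, so a fair amount of unpublished "black magic" was probably involved.

Since he proposed AdaDelta in 2012, it was presumably used in the 2013 competition, which would explain why standard training could not reproduce the results.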

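Gradient updates

2.1 [First-order method] Stochastic gradient

SGD (Stochastic Gradient Descent) arose as an alternative to BGD (Batch Gradient Descent).

BGD requires every forward/backward pass to compute the error over all examples, which is impractical for large datasets.

The original SGD processes a single example per forward/backward pass, which is too serial and makes poor use of the hardware.

SGD later evolved into Mini-Batch Gradient Descent, which advances by roughly 100 examples at a time and sits between BGD and SGD.

Today, SGD usually refers to the mini-batch method rather than the early single-example method.

A single gradient update can be written as:

$x_{t+1}=x_{t}+\Delta x_{t} \quad \text{where} \quad \Delta x_{t}=-\eta \cdot g_{t}$

$x$ is the parameter, $t$ the time step, $\Delta$ the update, $\eta$ the learning rate, and $g$ the gradient.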
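The update rule above can be sketched in a few lines. This is a minimal illustration, not code from the original post: the toy model $y = w \cdot x$, the dataset, the batch size of 4 and the function name `minibatch_sgd` are all assumptions chosen to keep the example small.

```python
import random

def minibatch_sgd(data, eta=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x by mini-batch SGD on the squared error; returns w."""
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # g_t: gradient of the batch mean of 0.5 * (w*x - y)^2 w.r.t. w
            g = sum((w * x - y) * x for x, y in batch) / len(batch)
            # x_{t+1} = x_t + delta_x_t  with  delta_x_t = -eta * g_t
            w += -eta * g
    return w

# Hypothetical noise-free data generated from y = 3x.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 11)]
w = minibatch_sgd(data)
```

Setting `batch_size=len(data)` turns this into BGD and `batch_size=1` into the original single-example SGD.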

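2.2 [Second-order method] Newton's method

Second-order Newton's method replaces the gradient update with:

$\Delta x_{t}=-H_{t}^{-1} \cdot g_{t}$

$H$ is the matrix of second derivatives with respect to the parameters, called the Hessian matrix.

Newton's method uses the Hessian in place of a hand-set learning rate: the gradient is rescaled by the local curvature, so both the direction and the length of the step are determined automatically. On a convex quadratic surface a single step lands exactly on the minimum, which makes it the ideal method in principle.

However, inverting the matrix costs roughly $O(n^{3})$ time, which is too expensive for large-scale problems.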
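As a small worked example of the Newton update, the sketch below takes one step on a two-dimensional convex quadratic $f(x)=\frac{1}{2}x^{T}Ax-b^{T}x$, whose Hessian is simply $A$. The matrix, the vector and the helper names are assumptions made for illustration; the explicit 2×2 inverse stands in for the $O(n^{3})$ inversion that makes the method expensive at scale.

```python
def grad(A, b, x):
    """Gradient of f(x) = 0.5 * x^T A x - b^T x, i.e. A x - b (2-D case)."""
    return [A[0][0] * x[0] + A[0][1] * x[1] - b[0],
            A[1][0] * x[0] + A[1][1] * x[1] - b[1]]

def inv2x2(A):
    """Explicit inverse of a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def newton_step(A, b, x):
    """One update x + delta_x with delta_x = -H^{-1} g, where H = A."""
    g = grad(A, b, x)
    h_inv = inv2x2(A)
    dx = [-(h_inv[0][0] * g[0] + h_inv[0][1] * g[1]),
          -(h_inv[1][0] * g[0] + h_inv[1][1] * g[1])]
    return [x[0] + dx[0], x[1] + dx[1]]

A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, 1.0]
x1 = newton_step(A, b, [5.0, -4.0])
# The minimiser solves A x = b, which is (0.2, 0.4); one step reaches it.
```

No learning rate appears anywhere: the curvature alone fixes the step length.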

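Conventional optimization methods

3.1 Heuristic simulated annealing

One of the most common early techniques is "simulated annealing". It has, of course, nothing whatsoever to do with the simulated annealing algorithm.

It introduces an annealing formula with one hyperparameter (a constant):

$\eta_{t}=\frac{\eta _{0}}{1+d\times t}$

$\eta _{0}$ is the initial learning rate and $d$ is the decay constant, typically $10^{-3}$.

This annealing rests on one observation about gradient-based optimization:

As optimization proceeds, the weights gradually grow, so the learning rate has to be reduced gradually to keep the updates stable.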
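To get a feel for how quickly this schedule decays, here is a direct transcription of the formula. The function name `annealed_lr` and the starting value $\eta_{0}=0.1$ are assumptions for illustration.

```python
def annealed_lr(eta0, d, t):
    """Learning rate at step t: eta_t = eta0 / (1 + d * t)."""
    return eta0 / (1.0 + d * t)

# With d = 1e-3 the rate halves after 1000 steps and is divided by 10 after 9000.
schedule = [annealed_lr(0.1, 1e-3, t) for t in (0, 1000, 9000)]
```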

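3.2 Momentum

The most widespread technique, from the middle period until today, is to introduce a momentum factor:

$\Delta x_{t}=\rho \Delta x_{t-1}-\eta \cdot g_{t}$

$\rho$ is the momentum factor, usually set to 0.9.

Introducing an unbalanced factor such as 0.9 into the update has the following effects:

★ Early in the descent, the previous update direction carries a large weight, which accelerates progress.

★ When crossing a valley of the function, a poorly suited learning rate makes two consecutive updates point in nearly opposite directions, so the iterate "oscillates" in place. Here the momentum factor reduces the update magnitude and helps the iterate get across the valley.

★ In the middle and late stages of the descent there are many basins of attraction around local minima. Once the iterate falls into a basin, $Gradient \rightarrow 0$, but consecutive updates point in nearly the same direction. Here the momentum factor increases the update magnitude and helps the iterate jump out of the basin.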
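The acceleration effect can be checked on a toy problem. The sketch below compares plain SGD ($\rho=0$) with momentum ($\rho=0.9$) on an elongated quadratic bowl; the function $f(x,y)=\frac{1}{2}(x^{2}+50y^{2})$, the learning rate 0.02, the start point and the helper names are all assumptions chosen for illustration.

```python
def loss(p):
    """Elongated bowl: shallow along x, steep along y."""
    return 0.5 * (p[0] ** 2 + 50.0 * p[1] ** 2)

def grad(p):
    return [p[0], 50.0 * p[1]]

def run(eta, rho, steps, start):
    """Iterate delta_x_t = rho * delta_x_{t-1} - eta * g_t and return the final loss."""
    x = list(start)
    dx = [0.0, 0.0]
    for _ in range(steps):
        g = grad(x)
        dx = [rho * d - eta * gi for d, gi in zip(dx, g)]
        x = [xi + di for xi, di in zip(x, dx)]
    return loss(x)

loss_sgd = run(eta=0.02, rho=0.0, steps=100, start=(10.0, 1.0))
loss_momentum = run(eta=0.02, rho=0.9, steps=100, start=(10.0, 1.0))
```

With the same learning rate and the same number of steps, the momentum run ends at a much lower loss because the slow, shallow direction is accelerated.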

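3.3 AdaGrad

The idea of AdaGrad is essentially borrowed from the L2 Regularizer, except that what is being regulated is not $W$ but the $Gradient$:

$\Delta x_{t}=-\frac{\eta }{\sqrt{\sum_{\tau=1}^{t}(g_{\tau})^{2}}}\cdot g_{t}$

AdaGrad is a cumulative process: at each step it sums from $\tau=1$ to $\tau=t$ and uses the square root of the sum of squared gradients along the way as the Regularizer.

Acting as a scaling factor $\frac{1}{\sqrt{\sum_{\tau=1}^{t}(g_{\tau})^{2}}}$ on the gradient, the Regularizer works as follows:

★ Early in training, the accumulated sum is small, so the Regularizer term is large and the gradient is amplified. [incentive phase]

★ Late in training, the accumulated sum is large, so the Regularizer term is small and the gradient is shrunk. [penalty phase]

In addition, because the Regularizer acts specifically on the gradient, it helps with the vanishing/exploding gradient problem, so it works quite well in deep neural networks.

AdaGrad does have a number of drawbacks, though:

★ The initial W affects the initial gradient. If W is initialized too large, the initial gradient is penalized down to a very small value. One can manually increase $\eta$, but an overly large $\eta$ makes the Regularizer too sensitive, with large swings in the adjustment.

★ In the middle and late stages of training, the accumulated sum of squared gradients keeps growing, which quickly drives the penalized $Gradient$ towards 0 and ends training prematurely.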
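The second drawback is easy to see numerically. The following sketch applies the AdaGrad formula to a scalar parameter whose gradient stays constant at 1; the constant gradient and the name `adagrad_steps` are assumptions for illustration. As in the formula above there is no $\epsilon$ term, so the first gradient must be non-zero.

```python
import math

def adagrad_steps(grads, eta=0.1):
    """Return the AdaGrad updates -eta / sqrt(sum of g^2 so far) * g_t."""
    acc = 0.0
    steps = []
    for g in grads:
        acc += g * g                      # running sum of squared gradients
        steps.append(-eta / math.sqrt(acc) * g)
    return steps

steps = adagrad_steps([1.0] * 100)
# The gradient never changes, yet the step shrinks like 1 / sqrt(t):
# the first step is -0.1 and the 100th is -0.01.
```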

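AdaDelta

The basic idea of AdaDelta is to use a first-order method to approximate the second-order Newton's method.

4.1 Approximating the inverse matrix by the diagonal

In 1988, [Becker&LeCun] proposed approximating the inverse matrix using only the diagonal elements of the matrix:

$\Delta x_{t}=-\frac{1}{\left | diag(H_{t}) \right |+\mu }\cdot g_{t}$

$diag$ denotes the diagonal matrix built from the Hessian, and $\mu$ is a constant that keeps the denominator from being 0.

In 2012, [Schaul&S. Zhang&LeCun] borrowed from AdaGrad and proposed a more accurate approximation:

$\Delta x_{t}=-\frac{1}{\left | diag(H_{t}) \right |}\frac{E[g_{t-w:t}]^{2}}{E[g_{t-w:t}^{2}]}\cdot g_{t}$

$E[g_{t-w:t}]$ is the expected value of the gradient over the last $w$ steps up to the current step $t$.

$E[g_{t-w:t}^{2}]$ is the expected value of the squared gradient over the same $w$ steps.

This is again a gradient-based Regularizer, but it only looks at the most recent $w$ states, so the gradient is not penalized all the way to 0.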

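4.2 Windows and approximating the expectation

Computing $E[g_{t-w:t}]$ requires storing the previous $w$ states, which is cumbersome.

AdaDelta instead uses an averaging scheme similar to the momentum factor:

$E[g^{2}]_{t}=\rho E[g^{2}]_{t-1}+(1-\rho )g_{t}^{2}$

This is an exponentially decaying average of the squared gradients. When $\rho=0.5$ it is simply the mean of the previous estimate and the current squared gradient.

Taking the square root turns it into an RMS (root mean square):

$RMS[g]_{t}=\sqrt{E[g^{2}]_{t}+\epsilon }$

This RMS is then used as the Regularizer on the gradient:

$\Delta x_{t}=-\frac{\eta}{RMS[g]_{t}}\cdot g_{t}$

Here $\epsilon$ is a constant that keeps the denominator from hitting 0.

The result is an improved version of AdaGrad.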
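To contrast this with the ever-growing sum in AdaGrad, the sketch below applies the RMS-scaled update to a scalar parameter with a constant gradient of 1. The values $\eta=0.01$, $\rho=0.9$, $\epsilon=10^{-6}$ and the name `rms_scaled_steps` are assumptions for illustration.

```python
import math

def rms_scaled_steps(grads, eta=0.01, rho=0.9, eps=1e-6):
    """Return the updates -eta / RMS[g]_t * g_t using a decaying average of g^2."""
    avg_g2 = 0.0
    steps = []
    for g in grads:
        avg_g2 = rho * avg_g2 + (1.0 - rho) * g * g   # E[g^2]_t
        rms = math.sqrt(avg_g2 + eps)                 # RMS[g]_t
        steps.append(-eta / rms * g)
    return steps

steps = rms_scaled_steps([1.0] * 200)
# E[g^2]_t tends to 1 rather than growing without bound, so the step
# settles near -eta instead of decaying towards 0.
```

The first few steps are larger than $\eta$ because the average starts from zero and underestimates $g^{2}$ during warm-up.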

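This method is exactly Tieleman & Hinton's RMSProp. Because RMSProp and AdaDelta appeared in the same year, Matthew D. Zeiler did not know that this improved AdaGrad had already been given a name by his old master.

RMSProp uses second-moment statistics of the gradient as a stand-in for second-order information. Since the arrival of BatchNorm, the need for it has diminished.

It does not, however, achieve a truly adaptive learning rate: the initial learning rate still has to be found by search and then lowered one order of magnitude at a time.

Moreover, RMSProp's learning-rate values differ considerably from those of momentum SGD, so the initial value has to be searched for again.

Note: the recommended value of $\epsilon$ is 1, as used in Inception V3. Do not copy V3's initial learning rate.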

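4.3 The Hessian method and correct units

Zeiler uses two chains of proportionalities to show exactly where first-order methods lose out to second-order methods.

First, consider SGD and momentum:

$\Delta x \propto g\propto \frac{\partial f}{\partial x} \propto \frac{1}{x}$

Every term here is read as a unit. The units of $\Delta x$ follow those of the gradient $g$, i.e. of the first derivative. Assuming the cost $f$ is unitless, the first derivative carries units of $\frac{1}{x}$.

Next, consider the second-order Hessian method.

For comparison, the approximation of [Becker&LeCun 1988] is used, so that inverting the matrix reduces to taking reciprocals of the diagonal:

$\Delta x \propto H^{-1}g\propto \frac{\frac{\partial f}{\partial x}}{\frac{\partial^{2}f}{\partial x^{2}}}\propto \frac{\frac{1}{x}}{\frac{1}{x}\cdot\frac{1}{x}}\propto x$

The units of $\Delta x$ follow those of $H^{-1}\cdot g$, i.e. the first derivative divided by the second derivative. The second derivative carries units of $\frac{1}{x}\cdot\frac{1}{x}$, so the ratio carries units of $x$.

So the update of a first-order method ends up with units of $\frac{1}{x}$, the inverse of the parameter's units: when the parameter is rescaled to be larger, the gradient, and with it the update, shrinks by the same factor instead of growing.

The update of a second-order method ends up with units of $x$, matching the parameter: rescaling the parameter rescales the update in exactly the same way.

This is why Zeiler says the Hessian method gets the Correct Units.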
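One way to make the "correct units" argument concrete is to change the unit in which a parameter is measured and see what happens to the update. The sketch below does this for $f(x)=\frac{1}{2}ax^{2}$ with the substitution $x=c\cdot u$; the values of $a$, $c$, $\eta$ and the helper names are assumptions chosen for illustration, not part of Zeiler's paper.

```python
def gd_step(first_derivative, eta):
    """First-order update: -eta * f'."""
    return -eta * first_derivative

def newton_step(first_derivative, second_derivative):
    """Second-order update: -f' / f''."""
    return -first_derivative / second_derivative

a, eta = 2.0, 0.1
x = 3.0            # parameter measured in the original unit
c = 100.0          # unit change: x = c * u
u = x / c

# Derivatives of f(x) = 0.5*a*x^2 in each parametrisation.
fx, fxx = a * x, a                      # with respect to x
fu, fuu = a * c * c * u, a * c * c      # with respect to u (chain rule)

# Compute each update in its own parametrisation, then express it in units of x.
gd_in_x = gd_step(fx, eta)
gd_in_u_as_x = c * gd_step(fu, eta)
newton_in_x = newton_step(fx, fxx)
newton_in_u_as_x = c * newton_step(fu, fuu)
```

The gradient-descent update changes by a factor of $c^{2}$ when the unit changes, so $\eta$ has to be retuned. The Newton update is the same move in both parametrisations.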

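4.4 Deriving a first-order approximation of the Hessian method

Based on the approximation of [Becker&LeCun 1988], and with the same sign convention as before ($x_{t+1}=x_{t}+\Delta x_{t}$), we have:

$\Delta x \approx -\frac{\frac{\partial f}{\partial x}}{\frac{\partial^{2}f}{\partial x^{2}}}$

which can also be written as:

$-\frac{\frac{\partial f}{\partial x}}{\frac{\partial^{2}f}{\partial x^{2}}}=-\frac{1}{\frac{\partial^{2}f}{\partial x^{2}}}\cdot \frac{\partial f}{\partial x}=-\frac{1}{\frac{\partial^{2}f}{\partial x^{2}}}\cdot g_{t}$

Rearranging, and then approximating with RMS:

$\frac{1}{\frac{\partial^{2}f}{\partial x^{2}}}=-\frac{\Delta x}{\frac{\partial f}{\partial x}}\approx \frac{RMS[\Delta x]_{t-1}}{RMS[g]_{t}}$

The RMS ratio is non-negative, so it stands in for the magnitude of the inverse curvature, and the minus sign is kept explicitly in the update.

Finally, the complete first-order approximation:

$\Delta x_{t}= -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_{t}}\cdot g_t$

Note that $RMS[\Delta x]_{t-1}$ is used rather than $RMS[\Delta x]_{t}$, because $\Delta x_{t}$ has not been computed yet at this point.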

4.5 Algorithm

ALGORITHM: ADADELTA

Require: decay rate $\rho$, constant $\epsilon$
Require: initial parameter $x_{1}$
1: Initialize accumulation variables $E[g^{2}]_{0}=0$, $E[\Delta x^{2}]_{0}=0$
2: For $t=1:T$ do (loop over all updates)
3:     Compute gradient: $g_{t}$
4:     Accumulate gradient: $E[g^{2}]_{t}=\rho E[g^{2}]_{t-1}+(1-\rho )g_{t}^{2}$
5:     Compute update: $\Delta x_{t}= -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_{t}}\cdot g_{t}$
6:     Accumulate updates: $E[\Delta x^{2}]_{t}=\rho E[\Delta x^{2}]_{t-1}+(1-\rho )\Delta x_{t}^{2}$
7:     Apply update: $x_{t+1}=x_{t}+\Delta x_{t}$
8: End For
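The eight steps above can be transcribed almost line by line. The following is a dependency-free sketch rather than a reference implementation: the function name `adadelta`, the list-based parameters and the toy objective $f(x,y)=\frac{1}{2}(x^{2}+50y^{2})$ are assumptions made for illustration.

```python
import math

def adadelta(grad_fn, x0, rho=0.95, eps=1e-6, steps=1000):
    """Run AdaDelta on a list of parameters; grad_fn(x) returns the gradient list."""
    x = list(x0)
    acc_g2 = [0.0] * len(x)    # E[g^2]_0 = 0
    acc_dx2 = [0.0] * len(x)   # E[dx^2]_0 = 0
    for _ in range(steps):
        g = grad_fn(x)                                                # step 3
        for i in range(len(x)):
            acc_g2[i] = rho * acc_g2[i] + (1.0 - rho) * g[i] ** 2     # step 4
            rms_dx = math.sqrt(acc_dx2[i] + eps)                      # RMS[dx]_{t-1}
            rms_g = math.sqrt(acc_g2[i] + eps)                        # RMS[g]_t
            dx = -rms_dx / rms_g * g[i]                               # step 5
            acc_dx2[i] = rho * acc_dx2[i] + (1.0 - rho) * dx ** 2     # step 6
            x[i] += dx                                                # step 7
    return x

def grad_fn(p):
    # Gradient of 0.5 * (x^2 + 50 * y^2): the y-gradient is 50 times larger.
    return [p[0], 50.0 * p[1]]

x_final = adadelta(grad_fn, [1.0, 1.0], steps=100)
```

Because each coordinate is divided by its own $RMS[g]$, the two parameters advance at almost the same pace here even though their gradients differ by a factor of 50. No learning rate is passed in, but the early steps are small, on the order of $\sqrt{\epsilon}$.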

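4.6 Theano implementation

The paper gives suitable experimental values for the two hyperparameters:

$\rho=0.95 \quad\quad \epsilon=10^{-6}$

Theano's implementation appears in its LSTM tutorial. Here is my own condensed version: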

import numpy as np
import theano
import theano.tensor as T

def AdaDelta(tparams, grads, inputs=()):
    # tparams: ordered dict of shared parameters; grads: symbolic gradients in the same order
    # inputs (added here): symbolic inputs the gradients depend on; leave empty if the data is shared
    rho = 0.95; eps = 1e-6
    params = list(tparams.values())
    # init: one zero accumulator per parameter for E[dx^2] and E[g^2]
    acc_dx2 = [theano.shared(np.zeros_like(p.get_value())) for p in params]
    acc_g2 = [theano.shared(np.zeros_like(p.get_value())) for p in params]
    # first to update E[g^2]
    update_g2 = [(a, rho * a + (1 - rho) * (g ** 2)) for a, g in zip(acc_g2, grads)]
    fn_update_1 = theano.function(inputs=list(inputs), updates=update_g2)
    # calc delta_x by RMS, using E[dx^2] from the last step and the fresh E[g^2]
    delta_x = [-T.sqrt(d + eps) / T.sqrt(a + eps) * g for g, d, a in zip(grads, acc_dx2, acc_g2)]
    # then to update E[dx^2] and the params
    update_dx2 = [(d, rho * d + (1 - rho) * (dx ** 2)) for d, dx in zip(acc_dx2, delta_x)]
    update_param = [(p, p + dx) for p, dx in zip(params, delta_x)]
    fn_update_2 = theano.function(inputs=list(inputs), updates=update_dx2 + update_param)
    # return the two update functions; call fn_update_1 before fn_update_2 at every step
    return fn_update_1, fn_update_2

4.7 Dragon (Caffe) implementation

The code below follows my own Dragon framework, which is a rewrite of the Caffe code.

//    hpp file
template <typename Dtype>
class AdaDeltaSolver :public SGDSolver < Dtype > {
public:
    AdaDeltaSolver(const SolverParameter& param) :SGDSolver<Dtype>(param)    { }
    AdaDeltaSolver(const string& param_file) :SGDSolver<Dtype>(param_file)    { }
protected:
    virtual void computeUpdateValue(int param_id, Dtype rate);
    virtual void applyUpdate();
};


//    cpp file
#include "gradient_solver.hpp"
template <typename Dtype>
void AdaDeltaSolver<Dtype>::computeUpdateValue(int param_id, Dtype rate){
    Blob<Dtype>* net_param = net->getLearnableParams()[param_id];
    const Dtype lr_mult = net->getLrMults()[param_id];
    Dtype eps = param.delta();
    Dtype momntum = param.momentum();
    // adadelta will ignore base_lr
    Dtype lr = lr_mult;
    const int count = net_param->count();
    switch (Dragon::get_mode()){
    case Dragon::CPU:
        //    history stores E[g^2]
        //    update stores E[delta^2]
        //    history=momentum*history + (1-momentum)*(diff^2)
        //    1. compute diff^2 in temp
        dragon_powx<Dtype>(count, net_param->cpu_diff(), Dtype(2), temp[param_id]->mutable_cpu_data());
        //    2. compute history
        dragon_cpu_axpby<Dtype>(count, Dtype(1) - momntum, temp[param_id]->cpu_data(),
                momntum, history[param_id]->mutable_cpu_data());
        //    3. compute RMS[history] as denominator in temp
        dragon_set<Dtype>(count, eps, temp[param_id]->mutable_cpu_data());
        dragon_axpy<Dtype>(count, Dtype(1), history[param_id]->cpu_data(),temp[param_id]->mutable_cpu_data());
        dragon_powx<Dtype>(count, temp[param_id]->cpu_data(), Dtype(0.5), temp[param_id]->mutable_cpu_data());
        //    4. compute diff/RMS[history] in diff
        dragon_div<Dtype>(count, net_param->cpu_diff(), temp[param_id]->cpu_data(), net_param->mutable_cpu_diff());
        //    5. compute RMS[update] as numerator in temp
        dragon_set<Dtype>(count, eps, temp[param_id]->mutable_cpu_data());
        dragon_axpy<Dtype>(count, Dtype(1), update[param_id]->cpu_data(), temp[param_id]->mutable_cpu_data());
        dragon_powx<Dtype>(count, temp[param_id]->cpu_data(), Dtype(0.5), temp[param_id]->mutable_cpu_data());
        //    6. compute diff*RMS[update] in diff
        dragon_mul<Dtype>(count, net_param->cpu_diff(), temp[param_id]->cpu_data(), net_param->mutable_cpu_diff());
        //    7. compute final diff^2 in temp
        dragon_powx<Dtype>(count, net_param->cpu_diff(), Dtype(2), temp[param_id]->mutable_cpu_data());
        //    8. compute update
        dragon_cpu_axpby<Dtype>(count, (1 - momntum), temp[param_id]->cpu_data(),
            momntum, update[param_id]->mutable_cpu_data());
        //    9. apply learning rate
        dragon_scal<Dtype>(count, lr, net_param->mutable_cpu_diff());
        break;
    case Dragon::GPU:
#ifndef CPU_ONLY
        //    1. compute diff^2 in temp
        dragon_gpu_powx<Dtype>(count, net_param->gpu_diff(), Dtype(2), temp[param_id]->mutable_gpu_data());
        //    2. compute history
        dragon_gpu_axpby<Dtype>(count, Dtype(1) - momntum, temp[param_id]->gpu_data(),
            momntum, history[param_id]->mutable_gpu_data());
        //    3. compute RMS[history] as denominator in temp
        dragon_gpu_set<Dtype>(count, eps, temp[param_id]->mutable_gpu_data());
        dragon_gpu_axpy<Dtype>(count, Dtype(1), history[param_id]->gpu_data(), temp[param_id]->mutable_gpu_data());
        dragon_gpu_powx<Dtype>(count, temp[param_id]->gpu_data(), Dtype(0.5), temp[param_id]->mutable_gpu_data());
        //    4. compute diff/RMS[history] in diff
        dragon_gpu_div<Dtype>(count, net_param->gpu_diff(), temp[param_id]->gpu_data(), net_param->mutable_gpu_diff());
        //    5. compute RMS[update] as numerator in temp
        dragon_gpu_set<Dtype>(count, eps, temp[param_id]->mutable_gpu_data());
        dragon_gpu_axpy<Dtype>(count, Dtype(1), update[param_id]->gpu_data(), temp[param_id]->mutable_gpu_data());
        dragon_gpu_powx<Dtype>(count, temp[param_id]->gpu_data(), Dtype(0.5), temp[param_id]->mutable_gpu_data());
        //    6. compute diff*RMS[update] in diff
        dragon_gpu_mul<Dtype>(count, net_param->gpu_diff(), temp[param_id]->gpu_data(), net_param->mutable_gpu_diff());
        //    7. compute final diff^2 in temp
        dragon_gpu_powx<Dtype>(count, net_param->gpu_diff(), Dtype(2), temp[param_id]->mutable_gpu_data());
        //    8. compute update
        dragon_gpu_axpby<Dtype>(count, Dtype(1) - momntum, temp[param_id]->gpu_data(),
            momntum, update[param_id]->mutable_gpu_data());
        //    9. apply learning rate
        dragon_gpu_scal<Dtype>(count, lr, net_param->mutable_gpu_diff());
#endif
        break;
    default:
        LOG(FATAL) << "Unknown mode: " << Dragon::get_mode();
    }
}

template <typename Dtype>
void AdaDeltaSolver<Dtype>::applyUpdate(){
    CHECK(Dragon::get_root_solver());
    Dtype rate = getLearningRate();
    //    AdaDelta does not need base lr
    if (param.display() && iter%param.display() == 0)
        LOG(INFO) << "Iteration " << iter << ", lr = AdaDelta";
    clipGradients();
    vector<Blob<Dtype>*> net_params = net->getLearnableParams();
    for (int i = 0; i < net_params.size(); i++){
        normalize(i);
        regularize(i);
        computeUpdateValue(i, rate);
        net_params[i]->update();
    }
}

INSTANTIATE_CLASS(AdaDeltaSolver);


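Drawbacks of AdaDelta

Local minima

Across several datasets, AdaDelta gives a very good speed-up in the early and middle stages of training.

But late in training, once it enters the minefield of local minima, AdaDelta keeps jittering around a local minimum.

This shows up mainly in the validation error: it cannot get out of the local minimum's basin of attraction.

At this point, switching to momentum SGD with the learning rate lowered by one order of magnitude improves validation accuracy by 2%~5%, which is the same gain seen when momentum SGD is used in the usual way.

Switching back to AdaDelta afterwards, the accuracy drops back again.

Switching to momentum SGD once more, the accuracy comes back.

---------------------------------------------------------------------

Note: with Batch Norm in use, switching from AdaDelta to SGD like this makes the numerics blow up. The cause is unknown.

---------------------------------------------------------------------

My personal guess is that lowering the hand-set learning rate by an order of magnitude gives training a big jolt, shaking it from one local minimum into another. AdaDelta's second-order approximation, or for that matter any second-order method, does not produce such a big jolt and therefore has a hard time shaking itself out of a local minimum.

This is a disaster for anyone chasing state-of-the-art results: as long as you stick with AdaDelta all the way, state of the art stays out of reach.

Essentially all state-of-the-art results are shaken out by SGD in its final death throes.

That is also why SGD has still not been retired from state-of-the-art papers: it is ugly, but it is dependable.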

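Precision

The value of eps is not fixed.

1e-6 turns out to be too coarse on Caffe Cifar10, where 1e-8 is a better fit.

This means that for different numerical scales, the precision needs manual attention.

In the paper, higher precision actually does worse than lower precision, which suggests that results are quite sensitive to this value as well.

So what precision is actually the best?

————————————————————————————————————

Update, 2016.5.19:

In FCNN-AlexNet, 1e-8 causes numerical problems after epoch 1.

The reason is that sqrt(1e-8)*grad is large. Here 1e-10 works better.

Also, dense prediction must apply normalize, otherwise the computation of AdaDelta's step size may run into numerical problems as well.

That problem starts to become noticeable around epoch 5 in FCNN-AlexNet.

Caffe's default of 1e-10 is in practice relatively more robust than the paper's 1e-6.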
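Part of the reason eps matters so much is that it sets the scale of the very first updates. With both accumulators starting at zero, the AdaDelta update rule gives $\Delta x_{1}=-\frac{\sqrt{\epsilon}}{\sqrt{(1-\rho)g_{1}^{2}+\epsilon}}\cdot g_{1}$, which is roughly $\sqrt{\epsilon/(1-\rho)}$ in magnitude whenever $g_{1}^{2}$ is much larger than $\epsilon$. The sketch below evaluates this for the three values discussed above; the function name `first_step` and the gradient value of 1 are assumptions for illustration.

```python
import math

def first_step(g, eps, rho=0.95):
    """First AdaDelta update when E[g^2]_0 = E[dx^2]_0 = 0."""
    rms_dx_prev = math.sqrt(eps)                      # RMS[dx]_0
    rms_g = math.sqrt((1.0 - rho) * g * g + eps)      # RMS[g]_1
    return -rms_dx_prev / rms_g * g

sizes = {eps: abs(first_step(1.0, eps)) for eps in (1e-6, 1e-8, 1e-10)}
# Each factor of 100 in eps changes the initial step by about a factor of 10.
```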
