Practical guide: understanding the parameter settings in solver.prototxt when training a model with caffe

   I have previously published posts dissecting caffe's layers, and that series on the commonly used layers is still being updated. This post is an interlude whose goal is to settle, once and for all, how the parameters are set when training a model with caffe. Why write it? Recently, while defining a network of my own, I needed to write my own solver.prototxt, and back when I was reusing other people's networks I had left most of the settings untouched. As an example, here is the configuration file from the official caffe example for training LeNet:

# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
   This probably looks very familiar, because it is where we customize a few parameters ourselves, such as the maximum number of iterations, the base learning rate and the number of test iterations. Some of the other parameters, however, newcomers may never have tried to change. So let me walk through, in detail, how to set our own parameters when writing solver.prototxt. First, as usual, open caffe.proto and look at the definition of the solver parameters; the source with comments follows:

message SolverParameter {
  //////////////////////////////////////////////////////////////////////////////
  // Specifying the train and test networks
  //
  // Exactly one train net must be specified using one of the following fields:
  //     train_net_param, train_net, net_param, net
  // One or more test nets may be specified using any of the following fields:
  //     test_net_param, test_net, net_param, net
  // If more than one test net field is specified (e.g., both net and
  // test_net are specified), they will be evaluated in the field order given
  // above: (1) test_net_param, (2) test_net, (3) net_param/net.
  // A test_iter must be specified for each test_net.
  // A test_level and/or a test_stage may also be specified for each test_net.
  //////////////////////////////////////////////////////////////////////////////

  // Proto filename for the train net, possibly combined with one or more
  // test nets.
  optional string net = 24;  // prototxt file that defines the net
  // Inline train net param, possibly combined with one or more test nets.
  optional NetParameter net_param = 25;  // inline definition of the net

  optional string train_net = 1; // Proto filename for the train net.
  repeated string test_net = 2; // Proto filenames for the test nets.
  optional NetParameter train_net_param = 21; // Inline train net params.
  repeated NetParameter test_net_param = 22; // Inline test net params.

  // The states for the train/test nets. Must be unspecified or
  // specified once per net.
  //
  // By default, all states will have solver = true;
  // train_state will have phase = TRAIN,
  // and all test_state's will have phase = TEST.
  // Other defaults are set according to the NetState defaults.
  optional NetState train_state = 26;  // state (phase etc.) of the net during training
  repeated NetState test_state = 27;   // state (phase etc.) of the net during testing

  // The number of iterations for each test net.
  repeated int32 test_iter = 3;  // number of iterations per test pass; test_iter * test batch size should cover the whole test set

  // The number of iterations between two testing phases.
  optional int32 test_interval = 4 [default = 0];  // run a test pass every test_interval training iterations
  optional bool test_compute_loss = 19 [default = false];  // whether to also compute the loss during testing
  // If true, run an initial test pass before the first iteration,
  // ensuring memory availability and printing the starting value of the loss.
  optional bool test_initialization = 32 [default = true];  // if true, test the model with its initial weights before training starts; usually left true
  optional float base_lr = 5; // The base learning rate
  // the number of iterations between displaying info. If display = 0, no info
  // will be displayed.
  optional int32 display = 6;  // print training info every display iterations
  // Display the loss averaged over the last average_loss iterations
  optional int32 average_loss = 33 [default = 1];  // number of iterations over which the displayed loss is averaged
  optional int32 max_iter = 7; // the maximum number of iterations  // training stops after this many iterations
  // accumulate gradients over `iter_size` x `batch_size` instances
  optional int32 iter_size = 36 [default = 1];  // accumulate the gradients of iter_size batches before each update; default 1

  // The learning rate decay policy. The currently implemented learning rate
  // policies are as follows:
  //    - fixed: always return base_lr.
  //    - step: return base_lr * gamma ^ (floor(iter / step))
  //    - exp: return base_lr * gamma ^ iter
  //    - inv: return base_lr * (1 + gamma * iter) ^ (- power)
  //    - multistep: similar to step but it allows non uniform steps defined by
  //      stepvalue
  //    - poly: the effective learning rate follows a polynomial decay, to be
  //      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
  //    - sigmoid: the effective learning rate follows a sigmod decay
  //      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
  //
  // where base_lr, max_iter, gamma, step, stepvalue and power are defined
  // in the solver parameter protocol buffer, and iter is the current iteration.
  optional string lr_policy = 8;  // learning rate policy: one of the schedules listed above ...(3)
  optional float gamma = 9; // The parameter to compute the learning rate.  // used by several policies; see the lr_policy discussion below
  optional float power = 10; // The parameter to compute the learning rate.  // used by the inv and poly policies; see below
  optional float momentum = 11; // The momentum value.  // weight given to the previous update direction
  optional float weight_decay = 12; // The weight decay.  // coefficient of the regularization term in the loss
  // regularization types supported: L1 and L2
  // controlled by weight_decay
  optional string regularization_type = 29 [default = "L2"];  // form of the regularization term
  // the stepsize for learning rate policy "step"
  optional int32 stepsize = 13;  // step length for the "step" policy
  // the stepsize for learning rate policy "multistep"
  repeated int32 stepvalue = 34;  // step points for the "multistep" policy

  // Set clip_gradients to >= 0 to clip parameter gradients to that L2 norm,
  // whenever their actual L2 norm is larger.
  optional float clip_gradients = 35 [default = -1];  // if >= 0, scale gradients down whenever their global L2 norm exceeds this value ...(2)
  optional int32 snapshot = 14 [default = 0]; // The snapshot interval  // save the model every snapshot iterations
  optional string snapshot_prefix = 15; // The prefix for the snapshot.  // filename prefix for the saved snapshots
  // whether to snapshot diff in the results or not. Snapshotting diff will help
  // debugging but the final protocol buffer size will be much larger.
  optional bool snapshot_diff = 16 [default = false];  // whether to also save the gradients in the snapshot
  enum SnapshotFormat {  // format used to save the model parameters
    HDF5 = 0;
    BINARYPROTO = 1;
  }
  optional SnapshotFormat snapshot_format = 37 [default = BINARYPROTO];  // snapshot file format
  // the mode solver will use: 0 for CPU and 1 for GPU. Use GPU in default.
  enum SolverMode {  // training mode: CPU or GPU only
    CPU = 0;
    GPU = 1;
  }
  optional SolverMode solver_mode = 17 [default = GPU];  // whether to run the solver on CPU or GPU
  // the device_id will that be used in GPU mode. Use device_id = 0 in default.
  optional int32 device_id = 18 [default = 0];  // GPU id; 0 (the first GPU) when training on a single GPU
  // If non-negative, the seed with which the Solver will initialize the Caffe
  // random number generator -- useful for reproducible results. Otherwise,
  // (and by default) initialize using a seed derived from the system clock.
  optional int64 random_seed = 20 [default = -1];  // if non-negative, seeds Caffe's random number generator so that runs are reproducible

  // type of the solver
  optional string type = 40 [default = "SGD"];  // update rule (solver type); SGD is the usual choice ...(1)

  // numerical stability for RMSProp, AdaGrad and AdaDelta and Adam
  optional float delta = 31 [default = 1e-8];  // numerical-stability epsilon for the RMSProp, AdaGrad, AdaDelta and Adam solvers
  // parameters for the Adam solver
  optional float momentum2 = 39 [default = 0.999];  // second momentum (beta2) of the Adam solver

  // RMSProp decay value
  // MeanSquare(t) = rms_decay*MeanSquare(t-1) + (1-rms_decay)*SquareGradient(t)
  optional float rms_decay = 38 [default = 0.99];  // decay rate of the RMSProp solver

  // If true, print information about the state of the net that may help with
  // debugging learning problems.
  optional bool debug_info = 23 [default = false];  // whether to print information about the net's state during training

  // If false, don't save a snapshot after training finishes.
  optional bool snapshot_after_train = 28 [default = true];  // if true, save one final snapshot when training finishes; false skips it

  // DEPRECATED: old solver enum types, use string instead
  enum SolverType {
    SGD = 0;
    NESTEROV = 1;
    ADAGRAD = 2;
    RMSPROP = 3;
    ADADELTA = 4;
    ADAM = 5;
  }
  // DEPRECATED: use type instead of solver_type
  optional SolverType solver_type = 30 [default = SGD];
}

   The inline comments above already explain each parameter in some detail. In solver.prototxt, the network-related parameters come first and the training-related ones follow, including those we touch most often: weight_decay, the learning rate and its decay policy, the solver mode, the update rule, and so on; see the code and comments above. Here I want to highlight the three points marked (1), (2) and (3). Point (1) is simple: when choosing the update rule I almost always pick SGD (Stochastic Gradient Descent) and get good results; I have hardly ever used the other solver types.
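   That said, switching the update rule only takes a few lines in solver.prototxt. Here is a minimal, hedged sketch of what an Adam configuration might look like; the field names all come from the SolverParameter definition above, but the values are illustrative rather than tuned:

# hypothetical solver.prototxt fragment: use the Adam solver instead of SGD
type: "Adam"
base_lr: 0.001      # Adam is usually run with a smaller base learning rate (illustrative value)
momentum: 0.9       # first moment decay (beta1)
momentum2: 0.999    # second moment decay (beta2), see momentum2 in the proto above
delta: 1e-8         # numerical-stability epsilon
lr_policy: "fixed"  # Adam is often paired with a fixed learning rate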

   The second point is the clip_gradients parameter, which bounds the overall magnitude of the gradients, i.e. their global L2 norm. Let us first look at the relevant source in sgd_solver.cpp:

template <typename Dtype>
void SGDSolver<Dtype>::ClipGradients() {
  const Dtype clip_gradients = this->param_.clip_gradients();  // read the clip_gradients setting from the solver
  if (clip_gradients < 0) { return; }  // negative (the default -1) means clipping is disabled
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  // all learnable parameters of the net
  Dtype sumsq_diff = 0;
  for (int i = 0; i < net_params.size(); ++i) {
    sumsq_diff += net_params[i]->sumsq_diff();  // accumulate the sum of squared gradient entries
  }
  const Dtype l2norm_diff = std::sqrt(sumsq_diff);  // global L2 norm of all gradients
  if (l2norm_diff > clip_gradients) {  // only clip when the norm exceeds the threshold
    Dtype scale_factor = clip_gradients / l2norm_diff;  // scale factor < 1 that brings the norm back to the threshold
    LOG(INFO) << "Gradient clipping: scaling down gradients (L2 norm "
        << l2norm_diff << " > " << clip_gradients << ") "
        << "by scale factor " << scale_factor;
    for (int i = 0; i < net_params.size(); ++i) {
      net_params[i]->scale_diff(scale_factor);  // rescale every gradient blob by the same factor
    }
  }
}

   As the source makes clear, clip_gradients rescales the gradients whenever their global L2 norm exceeds the given threshold. Its purpose is to mitigate exploding gradients: in the early iterations the gradients can become very large, and this parameter caps how large they are allowed to get. It is used most often when training recurrent networks such as LSTMs.
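   Enabling it is a one-line addition to the solver file. A hedged sketch (the threshold of 10 is purely illustrative, not a recommendation):

# hypothetical fragment added to solver.prototxt
clip_gradients: 10   # whenever the global L2 norm of the gradients exceeds 10, scale them down to that norm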

   The third point is the lr_policy parameter, which specifies how the learning rate evolves during training. The base learning rate is given by base_lr, but it usually changes as training progresses. caffe provides seven policies: fixed, step, exp, inv, multistep, poly and sigmoid. multistep works the same way as step except that the drop points are listed explicitly in stepvalue, so setting it aside, the remaining six are listed below (a small sketch that evaluates these formulas follows the list):

  //    - fixed: always return base_lr.                          // the learning rate never changes
  //    - step: return base_lr * gamma ^ (floor(iter / step))    // drops by a factor of gamma every stepsize iterations
  //    - exp: return base_lr * gamma ^ iter                     // exponential decay: gamma raised to the iteration count
  //    - inv: return base_lr * (1 + gamma * iter) ^ (- power)   // (1 + gamma * iter) raised to the power -power
  //    - poly: the effective learning rate follows a polynomial decay, to be
  //      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)    // polynomial decay, reaching zero at max_iter
  //    - sigmoid: the effective learning rate follows a sigmod decay           // sigmoid-shaped schedule
  //      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
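   To make the formulas concrete, here is a small, self-contained C++ sketch that evaluates them for a given iteration. It mirrors the formulas listed above (and the behaviour of caffe's SGDSolver::GetLearningRate), but it is an illustrative re-implementation, not code taken from the caffe tree; the function name EffectiveLR is my own:

#include <cmath>
#include <cstdio>
#include <string>

// Illustrative sketch: effective learning rate at a given iteration for each policy.
double EffectiveLR(const std::string& policy, int iter, double base_lr,
                   double gamma, double power, int stepsize, int max_iter) {
  if (policy == "fixed")   return base_lr;
  if (policy == "step")    return base_lr * std::pow(gamma, std::floor(double(iter) / stepsize));
  if (policy == "exp")     return base_lr * std::pow(gamma, iter);
  if (policy == "inv")     return base_lr * std::pow(1.0 + gamma * iter, -power);
  if (policy == "poly")    return base_lr * std::pow(1.0 - double(iter) / max_iter, power);
  if (policy == "sigmoid") return base_lr / (1.0 + std::exp(-gamma * (iter - stepsize)));
  return base_lr;  // unknown policy: fall back to the base learning rate
}

int main() {
  // Reproduce LeNet's "inv" schedule: base_lr 0.01, gamma 0.0001, power 0.75.
  for (int iter = 0; iter <= 10000; iter += 2000) {
    std::printf("iter %5d  lr %.6f\n", iter,
                EffectiveLR("inv", iter, 0.01, 0.0001, 0.75, 0, 10000));
  }
  return 0;
}

With LeNet's settings the learning rate falls smoothly from 0.01 at iteration 0 to roughly 0.0059 at iteration 10000, which matches the gentle inv curve described below.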

   Formulas alone do not convey much, though, so let me illustrate the schedules with a picture. Plotting them in MATLAB makes the behaviour vivid; the figure below shows the curves for five of the learning-rate policies:
[Figure: MATLAB plot of five learning-rate schedules, learning rate versus iteration]
   Among these policies, the one we use most often is step, where the learning rate drops by a fixed factor once per step period; it is a very solid way to decay the learning rate. LeNet, by contrast, uses inv, where the learning rate starts high and then falls off quickly; exp decays exponentially; poly decays fairly evenly; and sigmoid, as described here with a positive gamma, keeps the learning rate very low at first and rises toward base_lr after the first stepsize iterations.
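   Since step is the workhorse, here is what such a schedule could look like in solver.prototxt; this is a hedged sketch with values chosen only for illustration:

# hypothetical solver.prototxt fragment: step decay
base_lr: 0.01
lr_policy: "step"
gamma: 0.1       # multiply the learning rate by 0.1 at every step
stepsize: 2500   # take a step every 2500 iterations
max_iter: 10000

With these settings the learning rate is 0.01 for iterations 0-2499, 0.001 for 2500-4999, 0.0001 for 5000-7499, and 0.00001 thereafter.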

   Working through the learning-rate policies also ties the gamma and power fields of the solver back to something concrete: both exist purely to shape the learning-rate schedule, so when designing a training configuration we can choose the decay style deliberately and set these parameters to match.

   That wraps up this walkthrough of the parameter settings for training a caffe model. My strongest impression is this: when a training-related setting is unclear, first read the source, and second, take the question into practice: plan a concrete training run, watch how it behaves, and think it over; that is how experience accumulates.

   You are welcome to read my later posts; the support and encouragement of my readers is my greatest motivation!


written by jiong

So what if I go all in for the dream this once

