Practical guide: understanding the parameter settings in solver.prototxt when training a model with caffe

   I have previously published posts dissecting caffe's layers, and that series on the commonly used layers is still being updated. This post is an interlude whose goal is to settle, once and for all, how the parameters are set when training a model with caffe. Why write it? Recently, while defining a network of my own, I needed to write my own solver.prototxt, and back when I was reusing other people's networks I had left most of the settings untouched. As an example, here is the configuration file from the official caffe example for training LeNet:

# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
   This probably looks very familiar, because it is where we customize a few parameters ourselves, such as the maximum number of iterations, the base learning rate and the number of test iterations. Some of the other parameters, however, newcomers may never have tried to change. So let me walk through, in detail, how to set our own parameters when writing solver.prototxt. First, as usual, open caffe.proto and look at the definition of the solver parameters; the source with comments follows:

message SolverParameter {
  //////////////////////////////////////////////////////////////////////////////
  // Specifying the train and test networks
  //
  // Exactly one train net must be specified using one of the following fields:
  //     train_net_param, train_net, net_param, net
  // One or more test nets may be specified using any of the following fields:
  //     test_net_param, test_net, net_param, net
  // If more than one test net field is specified (e.g., both net and
  // test_net are specified), they will be evaluated in the field order given
  // above: (1) test_net_param, (2) test_net, (3) net_param/net.
  // A test_iter must be specified for each test_net.
  // A test_level and/or a test_stage may also be specified for each test_net.
  //////////////////////////////////////////////////////////////////////////////

  // Proto filename for the train net, possibly combined with one or more
  // test nets.
  optional string net = 24;  // prototxt file that defines the net
  // Inline train net param, possibly combined with one or more test nets.
  optional NetParameter net_param = 25;  // inline definition of the net

  optional string train_net = 1; // Proto filename for the train net.
  repeated string test_net = 2; // Proto filenames for the test nets.
  optional NetParameter train_net_param = 21; // Inline train net params.
  repeated NetParameter test_net_param = 22; // Inline test net params.

  // The states for the train/test nets. Must be unspecified or
  // specified once per net.
  //
  // By default, all states will have solver = true;
  // train_state will have phase = TRAIN,
  // and all test_state's will have phase = TEST.
  // Other defaults are set according to the NetState defaults.
  optional NetState train_state = 26;  // state (phase etc.) of the net during training
  repeated NetState test_state = 27;   // state (phase etc.) of the net during testing

  // The number of iterations for each test net.
  repeated int32 test_iter = 3;  // number of iterations per test pass; test_iter * test batch size should cover the whole test set

  // The number of iterations between two testing phases.
  optional int32 test_interval = 4 [default = 0];  // run a test pass every test_interval training iterations
  optional bool test_compute_loss = 19 [default = false];  // whether to also compute the loss during testing
  // If true, run an initial test pass before the first iteration,
  // ensuring memory availability and printing the starting value of the loss.
  optional bool test_initialization = 32 [default = true];  // if true, test the model with its initial weights before training starts; usually left true
  optional float base_lr = 5; // The base learning rate
  // the number of iterations between displaying info. If display = 0, no info
  // will be displayed.
  optional int32 display = 6;  // print training info every display iterations
  // Display the loss averaged over the last average_loss iterations
  optional int32 average_loss = 33 [default = 1];  // number of iterations over which the displayed loss is averaged
  optional int32 max_iter = 7; // the maximum number of iterations  // training stops after this many iterations
  // accumulate gradients over `iter_size` x `batch_size` instances
  optional int32 iter_size = 36 [default = 1];  // accumulate the gradients of iter_size batches before each update; default 1

  // The learning rate decay policy. The currently implemented learning rate
  // policies are as follows:
  //    - fixed: always return base_lr.
  //    - step: return base_lr * gamma ^ (floor(iter / step))
  //    - exp: return base_lr * gamma ^ iter
  //    - inv: return base_lr * (1 + gamma * iter) ^ (- power)
  //    - multistep: similar to step but it allows non uniform steps defined by
  //      stepvalue
  //    - poly: the effective learning rate follows a polynomial decay, to be
  //      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
  //    - sigmoid: the effective learning rate follows a sigmod decay
  //      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
  //
  // where base_lr, max_iter, gamma, step, stepvalue and power are defined
  // in the solver parameter protocol buffer, and iter is the current iteration.
  optional string lr_policy = 8;  // learning rate policy: one of the schedules listed above ...(3)
  optional float gamma = 9; // The parameter to compute the learning rate.  // used by several policies; see the lr_policy discussion below
  optional float power = 10; // The parameter to compute the learning rate.  // used by the inv and poly policies; see below
  optional float momentum = 11; // The momentum value.  // weight given to the previous update direction
  optional float weight_decay = 12; // The weight decay.  // coefficient of the regularization term in the loss
  // regularization types supported: L1 and L2
  // controlled by weight_decay
  optional string regularization_type = 29 [default = "L2"];  // form of the regularization term
  // the stepsize for learning rate policy "step"
  optional int32 stepsize = 13;  // step length for the "step" policy
  // the stepsize for learning rate policy "multistep"
  repeated int32 stepvalue = 34;  // step points for the "multistep" policy

  // Set clip_gradients to >= 0 to clip parameter gradients to that L2 norm,
  // whenever their actual L2 norm is larger.
  optional float clip_gradients = 35 [default = -1];  // if >= 0, scale gradients down whenever their global L2 norm exceeds this value ...(2)
  optional int32 snapshot = 14 [default = 0]; // The snapshot interval  // save the model every snapshot iterations
  optional string snapshot_prefix = 15; // The prefix for the snapshot.  // filename prefix for the saved snapshots
  // whether to snapshot diff in the results or not. Snapshotting diff will help
  // debugging but the final protocol buffer size will be much larger.
  optional bool snapshot_diff = 16 [default = false];  // whether to also save the gradients in the snapshot
  enum SnapshotFormat {  // format used to save the model parameters
    HDF5 = 0;
    BINARYPROTO = 1;
  }
  optional SnapshotFormat snapshot_format = 37 [default = BINARYPROTO];  // snapshot file format
  // the mode solver will use: 0 for CPU and 1 for GPU. Use GPU in default.
  enum SolverMode {  // training mode: CPU or GPU only
    CPU = 0;
    GPU = 1;
  }
  optional SolverMode solver_mode = 17 [default = GPU];  // whether to run the solver on CPU or GPU
  // the device_id will that be used in GPU mode. Use device_id = 0 in default.
  optional int32 device_id = 18 [default = 0];  // GPU id; 0 (the first GPU) when training on a single GPU
  // If non-negative, the seed with which the Solver will initialize the Caffe
  // random number generator -- useful for reproducible results. Otherwise,
  // (and by default) initialize using a seed derived from the system clock.
  optional int64 random_seed = 20 [default = -1];  // if non-negative, seeds Caffe's random number generator so that runs are reproducible

  // type of the solver
  optional string type = 40 [default = "SGD"];  // update rule (solver type); SGD is the usual choice ...(1)

  // numerical stability for RMSProp, AdaGrad and AdaDelta and Adam
  optional float delta = 31 [default = 1e-8];  // numerical-stability epsilon for the RMSProp, AdaGrad, AdaDelta and Adam solvers
  // parameters for the Adam solver
  optional float momentum2 = 39 [default = 0.999];  // second momentum (beta2) of the Adam solver

  // RMSProp decay value
  // MeanSquare(t) = rms_decay*MeanSquare(t-1) + (1-rms_decay)*SquareGradient(t)
  optional float rms_decay = 38 [default = 0.99];  // decay rate of the RMSProp solver

  // If true, print information about the state of the net that may help with
  // debugging learning problems.
  optional bool debug_info = 23 [default = false];  // whether to print information about the net's state during training

  // If false, don't save a snapshot after training finishes.
  optional bool snapshot_after_train = 28 [default = true];  // if true, save one final snapshot when training finishes; false skips it

  // DEPRECATED: old solver enum types, use string instead
  enum SolverType {
    SGD = 0;
    NESTEROV = 1;
    ADAGRAD = 2;
    RMSPROP = 3;
    ADADELTA = 4;
    ADAM = 5;
  }
  // DEPRECATED: use type instead of solver_type
  optional SolverType solver_type = 30 [default = SGD];
}

   The inline comments above already explain each parameter in some detail. In solver.prototxt, the network-related parameters come first and the training-related ones follow, including those we touch most often: weight_decay, the learning rate and its decay policy, the solver mode, the update rule, and so on; see the code and comments above. Here I want to highlight the three points marked (1), (2) and (3). Point (1) is simple: when choosing the update rule I almost always pick SGD (Stochastic Gradient Descent) and get good results; I have hardly ever used the other solver types.
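   That said, switching the update rule only takes a few lines in solver.prototxt. Here is a minimal, hedged sketch of what an Adam configuration might look like; the field names all come from the SolverParameter definition above, but the values are illustrative rather than tuned:

# hypothetical solver.prototxt fragment: use the Adam solver instead of SGD
type: "Adam"
base_lr: 0.001      # Adam is usually run with a smaller base learning rate (illustrative value)
momentum: 0.9       # first moment decay (beta1)
momentum2: 0.999    # second moment decay (beta2), see momentum2 in the proto above
delta: 1e-8         # numerical-stability epsilon
lr_policy: "fixed"  # Adam is often paired with a fixed learning rate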

   The second point is the clip_gradients parameter, which bounds the overall magnitude of the gradients, i.e. their global L2 norm. Let us first look at the relevant source in sgd_solver.cpp:

template <typename Dtype>
void SGDSolver<Dtype>::ClipGradients() {
  const Dtype clip_gradients = this->param_.clip_gradients();  // read the clip_gradients setting from the solver
  if (clip_gradients < 0) { return; }  // negative (the default -1) means clipping is disabled
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  // all learnable parameters of the net
  Dtype sumsq_diff = 0;
  for (int i = 0; i < net_params.size(); ++i) {
    sumsq_diff += net_params[i]->sumsq_diff();  // accumulate the sum of squared gradient entries
  }
  const Dtype l2norm_diff = std::sqrt(sumsq_diff);  // global L2 norm of all gradients
  if (l2norm_diff > clip_gradients) {  // only clip when the norm exceeds the threshold
    Dtype scale_factor = clip_gradients / l2norm_diff;  // scale factor < 1 that brings the norm back to the threshold
    LOG(INFO) << "Gradient clipping: scaling down gradients (L2 norm "
        << l2norm_diff << " > " << clip_gradients << ") "
        << "by scale factor " << scale_factor;
    for (int i = 0; i < net_params.size(); ++i) {
      net_params[i]->scale_diff(scale_factor);  // rescale every gradient blob by the same factor
    }
  }
}

   As the source makes clear, clip_gradients rescales the gradients whenever their global L2 norm exceeds the given threshold. Its purpose is to mitigate exploding gradients: in the early iterations the gradients can become very large, and this parameter caps how large they are allowed to get. It is used most often when training recurrent networks such as LSTMs.
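   Enabling it is a one-line addition to the solver file. A hedged sketch (the threshold of 10 is purely illustrative, not a recommendation):

# hypothetical fragment added to solver.prototxt
clip_gradients: 10   # whenever the global L2 norm of the gradients exceeds 10, scale them down to that norm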

   The third point is the lr_policy parameter, which specifies how the learning rate evolves during training. The base learning rate is given by base_lr, but it usually changes as training progresses. caffe provides seven policies: fixed, step, exp, inv, multistep, poly and sigmoid. multistep works the same way as step except that the drop points are listed explicitly in stepvalue, so setting it aside, the remaining six are listed below (a small sketch that evaluates these formulas follows the list):

  //    - fixed: always return base_lr.                          // the learning rate never changes
  //    - step: return base_lr * gamma ^ (floor(iter / step))    // drops by a factor of gamma every stepsize iterations
  //    - exp: return base_lr * gamma ^ iter                     // exponential decay: gamma raised to the iteration count
  //    - inv: return base_lr * (1 + gamma * iter) ^ (- power)   // (1 + gamma * iter) raised to the power -power
  //    - poly: the effective learning rate follows a polynomial decay, to be
  //      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)    // polynomial decay, reaching zero at max_iter
  //    - sigmoid: the effective learning rate follows a sigmod decay           // sigmoid-shaped schedule
  //      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
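   To make the formulas concrete, here is a small, self-contained C++ sketch that evaluates them for a given iteration. It mirrors the formulas listed above (and the behaviour of caffe's SGDSolver::GetLearningRate), but it is an illustrative re-implementation, not code taken from the caffe tree; the function name EffectiveLR is my own:

#include <cmath>
#include <cstdio>
#include <string>

// Illustrative sketch: effective learning rate at a given iteration for each policy.
double EffectiveLR(const std::string& policy, int iter, double base_lr,
                   double gamma, double power, int stepsize, int max_iter) {
  if (policy == "fixed")   return base_lr;
  if (policy == "step")    return base_lr * std::pow(gamma, std::floor(double(iter) / stepsize));
  if (policy == "exp")     return base_lr * std::pow(gamma, iter);
  if (policy == "inv")     return base_lr * std::pow(1.0 + gamma * iter, -power);
  if (policy == "poly")    return base_lr * std::pow(1.0 - double(iter) / max_iter, power);
  if (policy == "sigmoid") return base_lr / (1.0 + std::exp(-gamma * (iter - stepsize)));
  return base_lr;  // unknown policy: fall back to the base learning rate
}

int main() {
  // Reproduce LeNet's "inv" schedule: base_lr 0.01, gamma 0.0001, power 0.75.
  for (int iter = 0; iter <= 10000; iter += 2000) {
    std::printf("iter %5d  lr %.6f\n", iter,
                EffectiveLR("inv", iter, 0.01, 0.0001, 0.75, 0, 10000));
  }
  return 0;
}

With LeNet's settings the learning rate falls smoothly from 0.01 at iteration 0 to roughly 0.0059 at iteration 10000, which matches the gentle inv curve described below.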

   Formulas alone do not convey much, though, so let me illustrate the schedules with a picture. Plotting them in MATLAB makes the behaviour vivid; the figure below shows the curves for five of the learning-rate policies:
[Figure: MATLAB plot of five learning-rate schedules, learning rate versus iteration]
   Among these policies, the one we use most often is step, where the learning rate drops by a fixed factor once per step period; it is a very solid way to decay the learning rate. LeNet, by contrast, uses inv, where the learning rate starts high and then falls off quickly; exp decays exponentially; poly decays fairly evenly; and sigmoid, as described here with a positive gamma, keeps the learning rate very low at first and rises toward base_lr after the first stepsize iterations.
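   Since step is the workhorse, here is what such a schedule could look like in solver.prototxt; this is a hedged sketch with values chosen only for illustration:

# hypothetical solver.prototxt fragment: step decay
base_lr: 0.01
lr_policy: "step"
gamma: 0.1       # multiply the learning rate by 0.1 at every step
stepsize: 2500   # take a step every 2500 iterations
max_iter: 10000

With these settings the learning rate is 0.01 for iterations 0-2499, 0.001 for 2500-4999, 0.0001 for 5000-7499, and 0.00001 thereafter.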

   Working through the learning-rate policies also ties the gamma and power fields of the solver back to something concrete: both exist purely to shape the learning-rate schedule, so when designing a training configuration we can choose the decay style deliberately and set these parameters to match.

   That wraps up this walkthrough of the parameter settings for training a caffe model. My strongest impression is this: when a training-related setting is unclear, first read the source, and second, take the question into practice: plan a concrete training run, watch how it behaves, and think it over; that is how experience accumulates.

   You are welcome to read my later posts; the support and encouragement of my readers is my greatest motivation!


written by jiong

So what if I go all in for the dream this once

