I have previously published posts dissecting Caffe's layers, and that series on the commonly used layers is still being updated. This post is an interlude whose goal is to settle, once and for all, how to set the parameters when training a model with Caffe. Why this post? Because while building a custom network recently I had to write my own solver.prototxt, and when reusing other people's networks before, I had left most of the settings untouched. As an example, here is the solver configuration file for training LeNet from the official Caffe examples:
# The train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100
# Carry out testing every 500 training iterations.
test_interval: 500
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# Display every 100 iterations
display: 100
# The maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
This file probably looks familiar, because it customizes a handful of parameters we set ourselves, such as the maximum number of iterations, the base learning rate and the number of test iterations. Yet some of its other parameters are ones a newcomer may never have tried to change. So let me walk through, in detail, how to choose the parameters when writing a solver.prototxt. As usual, we start by opening caffe.proto and looking at the definition of the solver parameters; source and comments first:
message SolverParameter {
//////////////////////////////////////////////////////////////////////////////
// Specifying the train and test networks
//
// Exactly one train net must be specified using one of the following fields:
// train_net_param, train_net, net_param, net
// One or more test nets may be specified using any of the following fields:
// test_net_param, test_net, net_param, net
// If more than one test net field is specified (e.g., both net and
// test_net are specified), they will be evaluated in the field order given
// above: (1) test_net_param, (2) test_net, (3) net_param/net.
// A test_iter must be specified for each test_net.
// A test_level and/or a test_stage may also be specified for each test_net.
//////////////////////////////////////////////////////////////////////////////
// Proto filename for the train net, possibly combined with one or more
// test nets.
optional string net = 24;  // prototxt file that defines the net
// Inline train net param, possibly combined with one or more test nets.
optional NetParameter net_param = 25;  // inline net parameters
optional string train_net = 1; // Proto filename for the train net.
repeated string test_net = 2; // Proto filenames for the test nets.
optional NetParameter train_net_param = 21; // Inline train net params.
repeated NetParameter test_net_param = 22; // Inline test net params.
// The states for the train/test nets. Must be unspecified or
// specified once per net.
//
// By default, all states will have solver = true;
// train_state will have phase = TRAIN,
// and all test_state's will have phase = TEST.
// Other defaults are set according to the NetState defaults.
optional NetState train_state = 26;  // net state (mode) used when training
repeated NetState test_state = 27;   // net state (mode) used when testing
// The number of iterations for each test net.
repeated int32 test_iter = 3;  // iterations per test pass; test_iter * test batch size = test-set size
// The number of iterations between two testing phases.
optional int32 test_interval = 4 [default = 0];  // run a test every test_interval training iterations
optional bool test_compute_loss = 19 [default = false];  // whether to compute the loss during testing
// If true, run an initial test pass before the first iteration,
// ensuring memory availability and printing the starting value of the loss.
optional bool test_initialization = 32 [default = true];  // if true, test the randomly initialized model before training starts; usually left true
optional float base_lr = 5; // The base learning rate
// the number of iterations between displaying info. If display = 0, no info
// will be displayed.
optional int32 display = 6;  // display info every `display` iterations
// Display the loss averaged over the last average_loss iterations
optional int32 average_loss = 33 [default = 1];  // number of iterations over which the displayed loss is averaged
optional int32 max_iter = 7; // the maximum number of training iterations
// accumulate gradients over `iter_size` x `batch_size` instances
optional int32 iter_size = 36 [default = 1];  // number of batches over which gradients are accumulated; default 1
// The learning rate decay policy. The currently implemented learning rate
// policies are as follows:
// - fixed: always return base_lr.
// - step: return base_lr * gamma ^ (floor(iter / step))
// - exp: return base_lr * gamma ^ iter
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)
// - multistep: similar to step but it allows non uniform steps defined by
// stepvalue
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
// - sigmoid: the effective learning rate follows a sigmod decay
// return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
optional string lr_policy = 8;  // learning-rate policy; one of the options above ..................................................... (3)
optional float gamma = 9; // The parameter to compute the learning rate.
optional float power = 10; // The parameter to compute the learning rate.
optional float momentum = 11; // The momentum value: how strongly the previous update influences the current one.
optional float weight_decay = 12; // The weight decay: controls the regularization term of the loss.
// regularization types supported: L1 and L2
// controlled by weight_decay
optional string regularization_type = 29 [default = "L2"];  // form of the regularization term
// the stepsize for learning rate policy "step"
optional int32 stepsize = 13;  // step size for the "step" policy
// the stepsize for learning rate policy "multistep"
repeated int32 stepvalue = 34;  // step values for the "multistep" policy
// Set clip_gradients to >= 0 to clip parameter gradients to that L2 norm,
// whenever their actual L2 norm is larger.
optional float clip_gradients = 35 [default = -1];  // if >= 0, scale the gradients so that their global L2 norm never exceeds clip_gradients ...................... (2)
optional int32 snapshot = 14 [default = 0]; // The snapshot interval: save the model every `snapshot` iterations
optional string snapshot_prefix = 15; // The prefix for the snapshot files.
// whether to snapshot diff in the results or not. Snapshotting diff will help
// debugging but the final protocol buffer size will be much larger.
optional bool snapshot_diff = 16 [default = false];  // whether to also snapshot the gradients
enum SnapshotFormat {  // formats in which snapshots can be saved
HDF5 = 0;
BINARYPROTO = 1;
}
optional SnapshotFormat snapshot_format = 37 [default = BINARYPROTO];  // snapshot file format
// the mode solver will use: 0 for CPU and 1 for GPU. Use GPU in default.
enum SolverMode {  // training mode: CPU or GPU only
CPU = 0;
GPU = 1;
}
optional SolverMode solver_mode = 17 [default = GPU];  // training mode
// the device_id will that be used in GPU mode. Use device_id = 0 in default.
optional int32 device_id = 18 [default = 0];  // GPU device id; 0 (i.e. GPU 0) when training on a single GPU
// If non-negative, the seed with which the Solver will initialize the Caffe
// random number generator -- useful for reproducible results. Otherwise,
// (and by default) initialize using a seed derived from the system clock.
optional int64 random_seed = 20 [default = -1];  // a non-negative seed makes runs reproducible
// type of the solver
optional string type = 40 [default = "SGD"];  // gradient-descent variant; usually SGD ............................................ (1)
// numerical stability for RMSProp, AdaGrad and AdaDelta and Adam
optional float delta = 31 [default = 1e-8];  // delta parameter for RMSProp, AdaGrad, AdaDelta and Adam
// parameters for the Adam solver
optional float momentum2 = 39 [default = 0.999];  // second momentum parameter for Adam
// RMSProp decay value
// MeanSquare(t) = rms_decay*MeanSquare(t-1) + (1-rms_decay)*SquareGradient(t)
optional float rms_decay = 38 [default = 0.99];  // decay rate for RMSProp
// If true, print information about the state of the net that may help with
// debugging learning problems.
optional bool debug_info = 23 [default = false];  // whether to print net-state debugging info
// If false, don't save a snapshot after training finishes.
optional bool snapshot_after_train = 28 [default = true];  // if true, save one final snapshot when training finishes
// DEPRECATED: old solver enum types, use string instead
enum SolverType {
SGD = 0;
NESTEROV = 1;
ADAGRAD = 2;
RMSPROP = 3;
ADADELTA = 4;
ADAM = 5;
}
// DEPRECATED: use type instead of solver_type
optional SolverType solver_type = 30 [default = SGD];
}
The comments above already explain each parameter in detail. Broadly, a solver.prototxt first defines the parameters describing the networks, then those controlling training, including familiar ones such as weight_decay, the base learning rate, the learning-rate policy, the solver mode and the weight-update rule; see the code and comments above. Among them I want to highlight the three points I marked (1), (2) and (3). Point (1) is simple: when choosing the gradient-descent variant I almost always pick SGD (Stochastic Gradient Descent) and get good results; I have rarely needed any of the other solvers.
The second point is the clip_gradients parameter, which caps how large the gradients may get by bounding their global L2 norm. Let us first look at the source in sgd_solver.cpp where this parameter takes effect:
template <typename Dtype>
void SGDSolver<Dtype>::ClipGradients() {
const Dtype clip_gradients = this->param_.clip_gradients();  // fetch the clip_gradients parameter
if (clip_gradients < 0) { return; }  // negative means clipping is disabled
const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  // the net's learnable parameters
Dtype sumsq_diff = 0;
for (int i = 0; i < net_params.size(); ++i) {
sumsq_diff += net_params[i]->sumsq_diff();  // accumulate the squared gradients of every blob
}
const Dtype l2norm_diff = std::sqrt(sumsq_diff);  // global L2 norm of all gradients
if (l2norm_diff > clip_gradients) {  // the norm exceeds the threshold
Dtype scale_factor = clip_gradients / l2norm_diff;  // compute a uniform scale factor
LOG(INFO) << "Gradient clipping: scaling down gradients (L2 norm "
<< l2norm_diff << " > " << clip_gradients << ") "
<< "by scale factor " << scale_factor;
for (int i = 0; i < net_params.size(); ++i) {
net_params[i]->scale_diff(scale_factor);  // scale every gradient by that factor
}
}
}
As the source makes clear, clip_gradients rescales the gradients whenever their global L2 norm exceeds the threshold. This guards against exploding gradients: in the very first iterations the gradients can become very large, and this parameter keeps their norm bounded. It is used most often when training LSTMs.
The third point is the lr_policy parameter, which dictates how the learning rate evolves during training. The base learning rate is set by base_lr, but the effective rate usually changes as training progresses. Caffe provides seven policies: fixed, step, exp, inv, multistep, poly and sigmoid. Since multistep is essentially step with non-uniform step points, we set it aside and examine the remaining six:
// - fixed: always return base_lr.  // constant learning rate
// - step: return base_lr * gamma ^ (floor(iter / step))  // rate drops once every `stepsize` iterations
// - exp: return base_lr * gamma ^ iter  // exponential decay with base gamma and the iteration as exponent
// - inv: return base_lr * (1 + gamma * iter) ^ (- power)  // decay with base (1 + gamma * iter) and exponent -power
// - poly: the effective learning rate follows a polynomial decay, to be
// zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)  // polynomial decay, reaching zero at max_iter
// - sigmoid: the effective learning rate follows a sigmod decay  // sigmoid-shaped schedule
// return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
Formulas alone do not convey much, so I plotted the policies in MATLAB to make them concrete; the figure below shows the curves of five of these learning-rate policies:
Among these policies, step is the most commonly used: the learning rate drops by a fixed factor once per step period, which is a solid scheme for changing the rate. LeNet uses inv, where the rate starts high and then falls off quickly; exp decays the rate exponentially; poly decays it fairly evenly; and sigmoid keeps the rate very low at first, bringing it close to base_lr after the first step period.
With the learning-rate policies decoded, the gamma and power entries in the solver parameters finally make sense: both exist to shape the learning-rate schedule. When designing training parameters, we can therefore choose the decay policy deliberately and tune these values accordingly.
That concludes this walkthrough of parameter settings for training Caffe models. My biggest takeaway: when a training-related setting is unclear, first read the source; second, take the question into practice by planning and running a concrete training job, observing the results and thinking them through. That is how experience accumulates.
You are welcome to read my follow-up posts; the support and encouragement of my readers is my greatest motivation!
written by jiong
So what if I go all in for my dream one more time.