Caffe Solver Parameter Settings

http://caffe.berkeleyvision.org/tutorial/solver.html 
The solver controls parameter optimization by coordinating the parameter updates computed from forward-backward propagation. Learning a model is carried out by the Solver, which supervises the optimization and updates the parameters, and by the Net, which produces the loss and the gradients.
Caffe provides the following optimization methods:

  • Stochastic Gradient Descent (type: "SGD"),
  • AdaDelta (type: "AdaDelta"),
  • Adaptive Gradient (type: "AdaGrad"),
  • Adam (type: "Adam"),
  • Nesterov's Accelerated Gradient (type: "Nesterov"),
  • RMSprop (type: "RMSProp")

The solver

  • scaffolds the optimization bookkeeping and creates the training network for learning and test network(s) for evaluation
  • iteratively optimizes by calling forward / backward and updating parameters
  • (periodically) evaluates the test networks
  • snapshots the model and solver state throughout the optimization

where each iteration

  • calls network forward to compute the output and loss
  • calls network backward to compute the gradients
  • incorporates the gradients into parameter updates according to the solver method
  • updates the solver state according to learning rate, history, and method

to take the weights all the way from initialization to learned model.

Like Caffe models, Caffe solvers run in CPU / GPU modes.
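
As a rough sketch, this is what driving the solver looks like from Python (pycaffe), assuming a solver file named solver.prototxt already exists:

import caffe

caffe.set_mode_gpu()                     # or caffe.set_mode_cpu()

# Scaffolds the bookkeeping: builds the training net and the test net(s)
# described in the solver file.
solver = caffe.get_solver('solver.prototxt')

# Each step() iteration runs forward (output and loss), backward (gradients)
# and the parameter update for the chosen method; tests and snapshots fire
# at their configured intervals.
solver.step(1000)

# The underlying nets are also accessible directly, e.g. for a manual pass:
outs = solver.net.forward()              # training net output blobs
solver.test_nets[0].forward()            # first test net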

SGD

Stochastic gradient descent (type: "SGD") updates the weights W by a linear combination of the negative gradient ∇L(W) and the previous weight update V_t. The learning rate α is the weight of the negative gradient. The momentum μ is the weight of the previous update.

Formally, we have the following formulas to compute the update value V_{t+1} and the updated weights W_{t+1} at iteration t+1, given the previous weight update V_t and current weights W_t:

V_{t+1} = μ V_t − α ∇L(W_t)
W_{t+1} = W_t + V_{t+1}
The learning “hyperparameters” (α and μ) might require a bit of tuning for best results. If you’re not sure where to start, take a look at the “Rules of thumb” below, and for further information you might refer to Leon Bottou’s Stochastic Gradient Descent Tricks [1].

[1] L. Bottou. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade. Springer, 2012.
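
As a toy numerical illustration of the two formulas above (plain NumPy, not Caffe code), using the quadratic loss L(W) = 0.5·‖W‖², whose gradient is simply W:

import numpy as np

alpha, mu = 0.01, 0.9                # learning rate and momentum
W = np.array([1.0, -2.0])            # current weights W_t
V = np.zeros_like(W)                 # previous update V_t (zero at initialization)

grad = W                             # ∇L(W_t) for this toy loss
V = mu * V - alpha * grad            # V_{t+1} = μ V_t − α ∇L(W_t)
W = W + V                            # W_{t+1} = W_t + V_{t+1}
print(W)                             # [ 0.99 -1.98]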

Summary of the meaning of each parameter in the solver file

iteration: one forward-backward pass of training over a batch of data
batch_size: the number of images used in each training iteration
epoch: one epoch means that all training images have passed through the network once
For example, with 1,280,000 images and batch_size = 256, one epoch takes 1280000/256 = 5000 iterations;
with max_iter = 450000, training therefore covers 450000/5000 = 90 epochs.
When the learning rate decays is controlled by stepsize, and by how much is controlled by gamma. For example, with stepsize = 500, base_lr = 0.01 and gamma = 0.1, the learning rate decays for the first time at iteration 500, giving lr = lr * gamma = 0.01 * 0.1 = 0.001; the same decay then repeats every stepsize iterations. In short, stepsize is the decay step of the learning rate and gamma is its decay factor.
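
This is Caffe's "step" learning-rate policy, whose effective rate is base_lr * gamma^floor(iter / stepsize). A small sketch of that schedule in plain Python, using the values from the example above:

base_lr, gamma, stepsize = 0.01, 0.1, 500

def step_lr(iteration):
    # lr_policy: "step"  ->  base_lr * gamma ^ floor(iteration / stepsize)
    return base_lr * gamma ** (iteration // stepsize)

print(step_lr(0))      # 0.01
print(step_lr(500))    # 0.001  (first decay)
print(step_lr(1000))   # ~0.0001 (second decay)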
During training, the network is tested at regular intervals; how often is controlled by test_interval, e.g. with test_interval = 1000 the test net is evaluated after every 1000 training iterations.
How each test runs is determined by the test-phase batch size, test_iter, and the number of test images: the test batch size is the number of images fed per test iteration, and test_iter is the number of test iterations needed to cover all test images. For example, with 500 test images and test_iter = 100, the test batch size is 5 (the batch size itself is set in the net's TEST-phase data layer, not in the solver file). In the solver file you only need to set test_iter according to the total number of test images, and set test_interval as needed.
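
Putting these parameters together, here is a minimal sketch of a solver file matching the numbers used in this section, written out and loaded from Python (pycaffe). The net file name "train_val.prototxt", the snapshot prefix, and the momentum value are placeholders / typical choices, not prescribed by the text:

import caffe

solver_text = """
net: "train_val.prototxt"     # placeholder; its TEST data layer would use batch_size: 5
type: "SGD"
base_lr: 0.01
momentum: 0.9                 # typical value, not taken from the text
lr_policy: "step"
gamma: 0.1
stepsize: 500
max_iter: 450000              # 90 epochs at 5000 iterations per epoch
test_iter: 100                # 100 test iterations x batch_size 5 = 500 test images
test_interval: 1000           # evaluate the test net every 1000 training iterations
display: 100
snapshot: 5000
snapshot_prefix: "snapshots/example"   # placeholder
"""

with open('solver.prototxt', 'w') as f:
    f.write(solver_text)

solver = caffe.get_solver('solver.prototxt')
solver.solve()                # run the optimization all the way to max_iter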
