Caffe Solver Parameter Settings

http://caffe.berkeleyvision.org/tutorial/solver.html 
The solver controls parameter optimization by coordinating the parameter updates computed from forward-backward propagation. Learning a model is carried out by the Solver, which supervises the optimization and updates the parameters, and by the Net, which produces the loss and the gradients.
Caffe provides the following optimization methods:

  • Stochastic Gradient Descent (type: "SGD"),
  • AdaDelta (type: "AdaDelta"),
  • Adaptive Gradient (type: "AdaGrad"),
  • Adam (type: "Adam"),
  • Nesterov's Accelerated Gradient (type: "Nesterov"),
  • RMSprop (type: "RMSProp")

The solver

  • scaffolds the optimization bookkeeping and creates the training network for learning and test network(s) for evaluation
  • iteratively optimizes by calling forward / backward and updating parameters
  • (periodically) evaluates the test networks
  • snapshots the model and solver state throughout the optimization

where each iteration

  • calls network forward to compute the output and loss
  • calls network backward to compute the gradients
  • incorporates the gradients into parameter updates according to the solver method
  • updates the solver state according to learning rate, history, and method

to take the weights all the way from initialization to learned model.

Like Caffe models, Caffe solvers run in CPU / GPU modes.
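
As a rough sketch, this is what driving the solver looks like from Python (pycaffe), assuming a solver file named solver.prototxt already exists:

import caffe

caffe.set_mode_gpu()                     # or caffe.set_mode_cpu()

# Scaffolds the bookkeeping: builds the training net and the test net(s)
# described in the solver file.
solver = caffe.get_solver('solver.prototxt')

# Each step() iteration runs forward (output and loss), backward (gradients)
# and the parameter update for the chosen method; tests and snapshots fire
# at their configured intervals.
solver.step(1000)

# The underlying nets are also accessible directly, e.g. for a manual pass:
outs = solver.net.forward()              # training net output blobs
solver.test_nets[0].forward()            # first test net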

SGD

Stochastic gradient descent (type: "SGD") updates the weights W by a linear combination of the negative gradient ∇L(W) and the previous weight update V_t. The learning rate α is the weight of the negative gradient. The momentum μ is the weight of the previous update.

Formally, we have the following formulas to compute the update value V_{t+1} and the updated weights W_{t+1} at iteration t+1, given the previous weight update V_t and current weights W_t:

V_{t+1} = μ V_t − α ∇L(W_t)
W_{t+1} = W_t + V_{t+1}
The learning “hyperparameters” (α and μ) might require a bit of tuning for best results. If you’re not sure where to start, take a look at the “Rules of thumb” below, and for further information you might refer to Leon Bottou’s Stochastic Gradient Descent Tricks [1].

[1] L. Bottou. Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade. Springer, 2012.
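
As a toy numerical illustration of the two formulas above (plain NumPy, not Caffe code), using the quadratic loss L(W) = 0.5·‖W‖², whose gradient is simply W:

import numpy as np

alpha, mu = 0.01, 0.9                # learning rate and momentum
W = np.array([1.0, -2.0])            # current weights W_t
V = np.zeros_like(W)                 # previous update V_t (zero at initialization)

grad = W                             # ∇L(W_t) for this toy loss
V = mu * V - alpha * grad            # V_{t+1} = μ V_t − α ∇L(W_t)
W = W + V                            # W_{t+1} = W_t + V_{t+1}
print(W)                             # [ 0.99 -1.98]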

Summary of the meaning of each parameter in the solver file

iteration: one forward-backward pass of training over a batch of data
batch_size: the number of images used in each training iteration
epoch: one epoch means that all training images have passed through the network once
For example, with 1,280,000 images and batch_size = 256, one epoch takes 1280000/256 = 5000 iterations;
with max_iter = 450000, training therefore covers 450000/5000 = 90 epochs.
When the learning rate decays is controlled by stepsize, and by how much is controlled by gamma. For example, with stepsize = 500, base_lr = 0.01 and gamma = 0.1, the learning rate decays for the first time at iteration 500, giving lr = lr * gamma = 0.01 * 0.1 = 0.001; the same decay then repeats every stepsize iterations. In short, stepsize is the decay step of the learning rate and gamma is its decay factor.
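
This is Caffe's "step" learning-rate policy, whose effective rate is base_lr * gamma^floor(iter / stepsize). A small sketch of that schedule in plain Python, using the values from the example above:

base_lr, gamma, stepsize = 0.01, 0.1, 500

def step_lr(iteration):
    # lr_policy: "step"  ->  base_lr * gamma ^ floor(iteration / stepsize)
    return base_lr * gamma ** (iteration // stepsize)

print(step_lr(0))      # 0.01
print(step_lr(500))    # 0.001  (first decay)
print(step_lr(1000))   # ~0.0001 (second decay)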
During training, the network is tested at regular intervals; how often is controlled by test_interval, e.g. with test_interval = 1000 the test net is evaluated after every 1000 training iterations.
How each test runs is determined by the test-phase batch size, test_iter, and the number of test images: the test batch size is the number of images fed per test iteration, and test_iter is the number of test iterations needed to cover all test images. For example, with 500 test images and test_iter = 100, the test batch size is 5 (the batch size itself is set in the net's TEST-phase data layer, not in the solver file). In the solver file you only need to set test_iter according to the total number of test images, and set test_interval as needed.
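
Putting these parameters together, here is a minimal sketch of a solver file matching the numbers used in this section, written out and loaded from Python (pycaffe). The net file name "train_val.prototxt", the snapshot prefix, and the momentum value are placeholders / typical choices, not prescribed by the text:

import caffe

solver_text = """
net: "train_val.prototxt"     # placeholder; its TEST data layer would use batch_size: 5
type: "SGD"
base_lr: 0.01
momentum: 0.9                 # typical value, not taken from the text
lr_policy: "step"
gamma: 0.1
stepsize: 500
max_iter: 450000              # 90 epochs at 5000 iterations per epoch
test_iter: 100                # 100 test iterations x batch_size 5 = 500 test images
test_interval: 1000           # evaluate the test net every 1000 training iterations
display: 100
snapshot: 5000
snapshot_prefix: "snapshots/example"   # placeholder
"""

with open('solver.prototxt', 'w') as f:
    f.write(solver_text)

solver = caffe.get_solver('solver.prototxt')
solver.solve()                # run the optimization all the way to max_iter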
