Caffe: A Summary of Usage Experience











When using ResNet blocks or Inception-style modules, the branch outputs are combined with an element-wise operation rather than by channel-wise concatenation.

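Combining by element-wise multiplication in particular tends to produce values of wildly different magnitudes, which can lead to exploding gradients and a loss of NaN.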
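To illustrate why multiplication is the risky way to combine branches, here is a minimal NumPy sketch (not Caffe code). The function names and the activation value 1e20 are made up for the demonstration; real activations are far smaller, but repeated multiplication moves them in the same direction.

```python
import numpy as np

def combine_by_product(a, b):
    """Element-wise product of two branch outputs."""
    return a * b

def combine_by_concat(a, b):
    """Concatenate two branch outputs along axis 0 (the channel axis in this toy)."""
    return np.concatenate([a, b], axis=0)

# Two hypothetical branch outputs: large but still finite float32 values.
branch_a = np.full((2, 4), 1e20, dtype=np.float32)
branch_b = np.full((2, 4), 1e20, dtype=np.float32)

with np.errstate(over="ignore", invalid="ignore"):
    prod = combine_by_product(branch_a, branch_b)  # 1e40 exceeds the float32 range -> inf
    nan_result = prod - prod                       # inf - inf -> nan

concat = combine_by_concat(branch_a, branch_b)     # values are only copied, never scaled

print("product overflowed:", bool(np.isinf(prod).all()))
print("follow-up op gives nan:", bool(np.isnan(nan_result).all()))
print("concat stays finite:", bool(np.isfinite(concat).all()))
```

Concatenation leaves every value as it was, while the product multiplies the scales of both branches, so one oversized branch is enough to push the result out of range.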





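  • Reduce base_lr in solver.prototxt by at least an order of magnitude. If there are several loss layers, find out which one is causing the gradient explosion and reduce that layer's loss_weight in train_val.prototxt, rather than lowering the global base_lr.
  • Enable gradient clipping (clip_gradients in solver.prototxt) to limit overly large diffs.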
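The clipping idea can be sketched in NumPy as follows. This is an illustration of norm-based clipping as I understand Caffe's clip_gradients option to work (rescaling all parameter diffs when their joint L2 norm exceeds the threshold), not Caffe's actual code; the function name and the return value are my own.

```python
import numpy as np

def clip_gradients(diffs, clip_threshold):
    """Rescale a list of gradient arrays so their joint L2 norm is at most clip_threshold.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    sumsq = sum(float(np.sum(d.astype(np.float64) ** 2)) for d in diffs)
    l2norm = float(np.sqrt(sumsq))
    if clip_threshold > 0 and l2norm > clip_threshold:
        scale = clip_threshold / l2norm
        return [d * scale for d in diffs], l2norm
    return [d.copy() for d in diffs], l2norm

# Two toy parameter gradients with joint L2 norm sqrt(3^2 + 4^2) = 5.
grads = [np.array([3.0]), np.array([4.0])]
clipped, norm_before = clip_gradients(grads, clip_threshold=1.0)
print(norm_before, [g.tolist() for g in clipped])
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.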




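Measures: see whether you can reproduce the error, and add some printouts to the loss layer for debugging. Example: I once used a loss that was normalized by how often each label occurred in the batch. If a label never appeared in the batch, the loss became NaN. In that case, a sufficiently large batch is a way to mostly avoid the error.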
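The failure described in the example can be reproduced with a few lines of NumPy. This is a hypothetical loss written for illustration, not the loss layer from the anecdote; the function name and the guard flag are assumptions.

```python
import numpy as np

def per_label_mean_loss(losses, labels, num_labels, guard=False):
    """Average the per-sample losses within each label, then average over labels."""
    per_label = []
    for k in range(num_labels):
        mask = labels == k
        count = mask.sum()
        if guard and count == 0:
            continue  # skip labels that do not occur in this batch
        with np.errstate(invalid="ignore", divide="ignore"):
            # For an absent label this is 0.0 / 0, which evaluates to nan.
            per_label.append(losses[mask].sum() / count)
    return float(np.mean(per_label))

losses = np.array([0.5, 1.5, 1.0])
labels = np.array([0, 0, 2])          # label 1 never occurs in this batch

print(per_label_mean_loss(losses, labels, num_labels=3))              # nan
print(per_label_mean_loss(losses, labels, num_labels=3, guard=True))  # 1.0
```

A larger batch makes an empty label less likely, while an explicit guard such as the one above removes the division by zero altogether.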





Caffe Debug info

When training produces NaN or the loss fails to converge, set debug_info: true in solver.prototxt to track down the problem.

    I1109 ...]     [Forward] Layer data, top blob data data: 0.343971    
    I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
    I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
    I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0
    I1109 ...]     [Forward] Layer relu1, top blob conv1 data: 0.0337982
    I1109 ...]     [Forward] Layer conv2, top blob conv2 data: 0.0249297
    I1109 ...]     [Forward] Layer conv2, param blob 0 data: 0.00875855
    I1109 ...]     [Forward] Layer conv2, param blob 1 data: 0
    I1109 ...]     [Forward] Layer relu2, top blob conv2 data: 0.0128249
    I1109 ...]     [Forward] Layer fc1, top blob fc1 data: 0.00728743
    I1109 ...]     [Forward] Layer fc1, param blob 0 data: 0.00876866
    I1109 ...]     [Forward] Layer fc1, param blob 1 data: 0
    I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
    I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506
    I1109 ...]     [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
    I1109 ...]     [Backward] Layer fc1, param blob 0 diff: 0.483772
    I1109 ...]     [Backward] Layer fc1, param blob 1 diff: 4079.72
    I1109 ...]     [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
    I1109 ...]     [Backward] Layer conv2, param blob 0 diff: 0.00661093
    I1109 ...]     [Backward] Layer conv2, param blob 1 diff: 0.10995
    I1109 ...]     [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
    I1109 ...]     [Backward] Layer conv1, param blob 0 diff: 0.0220984
    I1109 ...]     [Backward] Layer conv1, param blob 1 diff: 0.0429201
    E1109 ...]     [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07) 

At first glance you can see this log section is divided into two parts: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation: a training example (batch) is fed to the net and a forward pass outputs the current prediction. Based on this prediction a loss is computed. The loss is then differentiated, and a gradient is estimated and propagated backward using the chain rule.

Caffe Blob data structure

Just a quick recap. Caffe uses the Blob data structure to store data, weights, parameters, etc. For this discussion it is important to note that a Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.
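As a rough mental model, the two parts can be pictured like this. ToyBlob and sgd_update are hypothetical names for illustration only; the real Blob is a C++ class with more functionality.

```python
import numpy as np

class ToyBlob:
    """Hypothetical stand-in for a Caffe Blob: values in `data`, gradients in `diff`."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float32)
        self.diff = np.zeros_like(self.data)  # same shape as data, one gradient per element

def sgd_update(blob, lr):
    """Plain gradient-descent step: move data against its gradient."""
    blob.data -= lr * blob.diff

weights = ToyBlob([1.0, 2.0])
weights.diff[:] = [0.5, -1.0]   # pretend the backward pass filled in these gradients
sgd_update(weights, lr=0.1)
print(weights.data)
```

The forward pass reads and writes data, the backward pass fills diff, and the solver uses diff to update data.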

Forward pass

You will see all the layers from bottom to top listed in this part of the log. For each layer you'll see:

    I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
    I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
    I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

Layer "conv1" is a convolution layer that has 2 param blobs: the filters and the bias. Consequently, the log has three lines. The filter blob (param blob 0) has data

    I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114

That is, the current magnitude of the convolution filter weights is 0.00899 (the number Caffe prints here is the mean absolute value of the blob, not a strict L2 norm). The current bias (param blob 1):

    I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

meaning that currently the bias is set to 0.

Last but not least, the "conv1" layer has an output ("top") named "conv1" (how original...). The magnitude (mean absolute value) of the output is

    I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037

Note that all magnitudes for the [Forward] pass are reported on the data part of the Blobs in question.
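As far as I can tell from Caffe's net.cpp, the per-blob number in these lines is the sum of absolute values divided by the element count, i.e. a mean absolute value. A minimal NumPy equivalent, with a function name of my own choosing, would be:

```python
import numpy as np

def debug_magnitude(values):
    """Mean absolute value of an array: sum(|x|) / number of elements."""
    values = np.asarray(values, dtype=np.float64)
    return float(np.abs(values).sum() / values.size)

print(debug_magnitude([1.0, -3.0]))   # 2.0
print(debug_magnitude(np.zeros(8)))   # 0.0, like a bias blob that is still all zeros
```

This is why a freshly initialized bias shows up as exactly 0, while weights and activations show small positive numbers.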

Loss and gradient

At the end of the [Forward] pass comes the loss layer:

    I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
    I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506

In this example the batch loss is 2031.85; the gradient of the loss w.r.t. fc1 is computed and passed to the diff part of the fc1 Blob. The magnitude (mean absolute value) of the gradient is 0.1245.

Backward pass

The rest of the layers are listed in this part, top to bottom. The magnitudes reported now are those of the diff part of the Blobs (params and layers' inputs).


The last log line of this iteration:

    [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

reports the total L1 and L2 magnitudes of both data and gradients.
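A sketch of how such totals can be computed, assuming they are accumulated over all learnable parameter arrays (sum of absolute values for L1, square root of the summed squares for L2). The function name and the toy arrays are made up.

```python
import numpy as np

def net_param_norms(arrays):
    """Return (L1 norm, L2 norm) accumulated over a list of arrays."""
    l1 = sum(float(np.abs(a).sum()) for a in arrays)
    l2 = float(np.sqrt(sum(float(np.square(a).sum()) for a in arrays)))
    return l1, l2

# Toy "network" with two parameter arrays; apply the same function to data and to diff.
param_data = [np.array([3.0, -4.0]), np.array([0.0])]
print(net_param_norms(param_data))   # (7.0, 5.0)
```

Comparing the diff norms with the data norms gives a quick sense of how large an update would be relative to the weights themselves.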

What should I look for?

  • If you have nans in your loss, see at what point your data or diff turns into nan: at which layer? at which iteration?
  • Look at the gradient magnitudes; they should be reasonable. If you start seeing values with e+8, your data/gradients are starting to blow up. Decrease your learning rate!
  • See that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning.
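The checks above can be automated with a small log scanner. This is a sketch that assumes the line format shown earlier; the regular expression, the function name, the 1e8 threshold default, and the sample lines (whose values are invented) are all my own.

```python
import math
import re

# Matches e.g. "[Forward] Layer conv1, param blob 0 data: 0.00899114"
LINE_RE = re.compile(
    r"\[(Forward|Backward)\] Layer (.+?), (top|bottom|param) blob (\S+) (data|diff): (\S+)"
)

def scan_debug_log(lines, blowup=1e8):
    """Return (issue, phase, layer, blob, part) tuples for suspicious log lines."""
    findings = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # e.g. the "All net params" summary line
        phase, layer, kind, blob, part, raw = m.groups()
        try:
            value = float(raw)  # also parses "nan" and "inf"
        except ValueError:
            continue
        if math.isnan(value) or math.isinf(value):
            issue = "nan"
        elif abs(value) >= blowup:
            issue = "blowup"
        elif part == "diff" and value == 0.0:
            issue = "zero_diff"
        else:
            continue
        findings.append((issue, phase, layer, kind + " blob " + blob, part))
    return findings

# Invented sample lines in the same format as the log above.
sample = [
    "I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037",
    "I1109 ...]     [Forward] Layer loss, top blob loss data: nan",
    "I1109 ...]     [Backward] Layer fc1, param blob 1 diff: 4.2e+09",
    "I1109 ...]     [Backward] Layer conv1, param blob 0 diff: 0",
]
for finding in scan_debug_log(sample):
    print(finding)
```

The first finding in file order is usually the most informative one, since a nan or an oversized value tends to contaminate everything computed after it.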

