Caffe: A Summary of Usage Experience

Introduction

Working with the deep learning framework Caffe has deepened my understanding of neural networks and turned that theoretical knowledge into practice. I hope to spend more time in the code and keep extending this summary.

Commonly tuned parameters in deep learning

Learning rate

Choosing the step size (learning rate) is a trade-off: the shorter each step, the less likely you are to step past a good solution, but the longer training takes. A very small learning rate makes it easy to get stuck in a local optimum (once you descend into a large valley, you cannot climb back out), while a larger one gives the search a better chance of reaching a good global solution. In practice a decaying schedule is used; ResNet, for example, trains with a fairly large rate of 0.1 for the first 32k iterations, drops to 0.01 as the iteration count grows, and then lowers the rate further toward the end.
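Such a decaying schedule can be written directly in solver.prototxt. Below is a minimal sketch using the "multistep" policy; the base_lr, gamma, and stepvalue numbers merely mirror the ResNet-style schedule described above and are assumptions to be tuned for your own run.

    # solver.prototxt (fragment; other solver fields omitted)
    base_lr: 0.1           # initial learning rate
    lr_policy: "multistep"
    gamma: 0.1             # multiply the rate by 0.1 at every stepvalue
    stepvalue: 32000       # 0.1  -> 0.01
    stepvalue: 48000       # 0.01 -> 0.001
    max_iter: 64000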

Causes of the loss becoming NaN during Caffe training

Expanding from small to large tends to produce NaN

Cause: while expanding from small to large, some gradient value becomes extremely large, making it impossible for training to continue.

Example: with a 10x10x256 input and a 20x20x256 or 10x10x512 output, NaNs appear very easily, whether an Inception-ResNet-v2-style block or a plain convolution is used.

Concrete solution:

With ResNet blocks or Inception-style modules, combine the branch outputs through an element-wise operation rather than concatenating them along the channel axis.

Be careful, though: combining with element-wise multiplication in particular often produces values of wildly different magnitudes, which leads to exploding gradients and hence a NaN loss.
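In Caffe terms the two ways of combining branches look as follows. This is only an illustrative sketch; the blob names branch_a and branch_b are assumed.

    # Element-wise combination of two branches (Eltwise layer).
    # operation may be SUM, PROD or MAX; PROD is the variant most prone to
    # producing huge values and hence exploding gradients.
    layer {
      name: "combine"
      type: "Eltwise"
      bottom: "branch_a"
      bottom: "branch_b"
      top: "combine"
      eltwise_param { operation: SUM }
    }

    # Channel-wise concatenation of the same two branches (Concat layer).
    layer {
      name: "combine"
      type: "Concat"
      bottom: "branch_a"
      bottom: "branch_b"
      top: "combine"
      concat_param { axis: 1 }   # axis 1 = channels in NCHW layout
    }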

Exploding gradients

Cause: the gradients become extremely large, making it impossible for training to continue.

Symptom: watch the log and note the loss after each iteration. The loss grows with every iteration, eventually exceeds the range representable by floating point, and becomes NaN.

Remedies

  • Reduce base_lr in solver.prototxt by at least an order of magnitude. If the net has multiple loss layers, find the loss layer responsible for the gradient explosion and reduce that layer's loss_weight in train_val.prototxt, instead of lowering the global base_lr.
  • Set gradient clipping (the clip_gradients field in solver.prototxt) to cap excessively large diffs; see the sketch after this list.
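A minimal sketch of both remedies is shown below; the layer names and concrete numbers are assumptions that need to be tuned for the task at hand.

    # solver.prototxt (fragment): lower the learning rate and clip gradients
    base_lr: 0.001        # reduced by an order of magnitude from e.g. 0.01
    clip_gradients: 10    # rescale gradients whose global L2 norm exceeds 10

    # train_val.prototxt (fragment): down-weight only the offending loss layer
    layer {
      name: "aux_loss"
      type: "SoftmaxWithLoss"
      bottom: "fc_aux"
      bottom: "label"
      top: "aux_loss"
      loss_weight: 0.1    # shrink this loss instead of touching base_lr
    }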

Faulty loss function

Cause: sometimes the loss computation inside the loss layer itself produces NaN. For example, feeding unnormalized values into an InfogainLoss layer (information-gain loss), or using a custom loss layer that contains a bug.

Symptom: looking at the training log, you see nothing unusual at first: the loss decreases steadily, and then NaN suddenly appears.

Remedy: see whether you can reproduce the error, and add some printouts to the loss layer to debug it. Example: I once used a loss that was normalized by the number of times each label occurred in the batch. If some label never appeared in the batch at all, the loss became NaN. In such a case, using a sufficiently large batch size helps avoid the error.

Faulty input

Cause: the input itself contains NaN.

Symptom: whenever training hits such a faulty input, the loss becomes NaN. Looking at the log you may not notice anything unusual: the loss decreases steadily, then suddenly turns into NaN.

Remedy: rebuild your dataset and make sure the training and validation sets contain no corrupted images. For debugging, you can use a simple network that only reads the input layer and has a trivial loss on top of it, then run it over all inputs; if one of the inputs is faulty, that trivial loss will also produce NaN, as in the sketch below.
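A minimal sketch of such a debug net follows; the LMDB path and batch size are assumptions. The Reduction layer simply sums every input value, so a single NaN pixel turns the reported loss into NaN at the corresponding iteration.

    # debug_net.prototxt: read the inputs only and reduce them to a scalar loss
    layer {
      name: "data"
      type: "Data"
      top: "data"
      top: "label"
      data_param { source: "train_lmdb" batch_size: 1 backend: LMDB }
    }
    layer {
      name: "silence"
      type: "Silence"
      bottom: "label"              # the label is not needed for this check
    }
    layer {
      name: "nan_check"
      type: "Reduction"
      bottom: "data"
      top: "nan_check"
      reduction_param { operation: SUM }
      loss_weight: 1               # make the summed input act as the net's loss
    }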

Caffe Debug info

When training runs into NaN or a non-converging loss, you can set debug_info: true in solver.prototxt to track down the problem.
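The flag goes into solver.prototxt (a minimal fragment; all other solver fields omitted):

    # solver.prototxt (fragment)
    debug_info: true    # print per-layer data/diff statistics every iteration

With it enabled, every iteration produces log output along the following lines: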

    I1109 ...]     [Forward] Layer data, top blob data data: 0.343971    
    I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
    I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
    I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0
    I1109 ...]     [Forward] Layer relu1, top blob conv1 data: 0.0337982
    I1109 ...]     [Forward] Layer conv2, top blob conv2 data: 0.0249297
    I1109 ...]     [Forward] Layer conv2, param blob 0 data: 0.00875855
    I1109 ...]     [Forward] Layer conv2, param blob 1 data: 0
    I1109 ...]     [Forward] Layer relu2, top blob conv2 data: 0.0128249
    . 
    .
    .
    I1109 ...]     [Forward] Layer fc1, top blob fc1 data: 0.00728743
    I1109 ...]     [Forward] Layer fc1, param blob 0 data: 0.00876866
    I1109 ...]     [Forward] Layer fc1, param blob 1 data: 0
    I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
    I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506
    I1109 ...]     [Backward] Layer fc1, bottom blob conv6 diff: 0.00107067
    I1109 ...]     [Backward] Layer fc1, param blob 0 diff: 0.483772
    I1109 ...]     [Backward] Layer fc1, param blob 1 diff: 4079.72
    .
    .
    .
    I1109 ...]     [Backward] Layer conv2, bottom blob conv1 diff: 5.99449e-06
    I1109 ...]     [Backward] Layer conv2, param blob 0 diff: 0.00661093
    I1109 ...]     [Backward] Layer conv2, param blob 1 diff: 0.10995
    I1109 ...]     [Backward] Layer relu1, bottom blob conv1 diff: 2.87345e-06
    I1109 ...]     [Backward] Layer conv1, param blob 0 diff: 0.0220984
    I1109 ...]     [Backward] Layer conv1, param blob 1 diff: 0.0429201
    E1109 ...]     [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07) 

At first glance you can see this log is divided into two parts: [Forward] and [Backward]. Recall that neural network training is done via forward-backward propagation: a training example (batch) is fed to the net and a forward pass outputs the current prediction. Based on this prediction a loss is computed. The loss is then differentiated, and a gradient is estimated and propagated backward using the chain rule.

Caffe Blob data structure

Just a quick recap. Caffe uses the Blob data structure to store data/weights/parameters etc. For this discussion it is important to note that a Blob has two "parts": data and diff. The values of the Blob are stored in the data part. The diff part is used to store element-wise gradients for the backpropagation step.

Forward pass

You will see all the layers from bottom to top listed in this part of the log. For each layer you'll see:

    I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037
    I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114
    I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

Layer "conv1" is a convolution layer that has 2 param blobs: the filters and the bias. Consequently, the log has three lines. The filter blob (param blob 0) has data

    I1109 ...]     [Forward] Layer conv1, param blob 0 data: 0.00899114

That is, the current L2 norm of the convolution filter weights is 0.00899. The current bias (param blob 1):

    I1109 ...]     [Forward] Layer conv1, param blob 1 data: 0

meaning that currently the bias is set to 0.

Last but not least, layer "conv1" has an output, a "top" blob also named "conv1" (how original...). The L2 norm of that output is

    I1109 ...]     [Forward] Layer conv1, top blob conv1 data: 0.0645037

Note that all L2 values for the [Forward] pass are reported on the data part of the Blobs in question.
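For reference, the layer producing these three lines might be defined as follows. Since the original train_val.prototxt is not shown, this is only an assumed sketch, and num_output/kernel_size are illustrative values.

    layer {
      name: "conv1"
      type: "Convolution"
      bottom: "data"
      top: "conv1"              # the "top blob conv1" in the log above
      convolution_param {
        num_output: 32          # filters -> param blob 0 (weights)
        kernel_size: 3
        bias_term: true         # bias    -> param blob 1
      }
    }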

Loss and gradient

At the end of the [Forward] pass comes the loss layer:

    I1109 ...]     [Forward] Layer loss, top blob loss data: 2031.85
    I1109 ...]     [Backward] Layer loss, bottom blob fc1 diff: 0.124506

In this example the batch loss is 2031.85. The gradient of the loss w.r.t. fc1 is computed and stored in the diff part of the fc1 Blob; the L2 magnitude of that gradient is 0.1245.

Backward pass

All the rest of the layers are listed in this part top to bottom. You can see that the L2 magnitudes reported now are of the diff part of the Blobs (params and layers' inputs).

Finally

The last log line of this iteration:

    [Backward] All net params (data, diff): L1 norm = (2711.42, 7086.66); L2 norm = (6.11659, 4085.07)

reports the total L1 and L2 magnitudes of both data and gradients.

What should I look for?

  • If you have NaNs in your loss, see at what point your data or diff turns into NaN: at which layer? At which iteration?
  • Look at the gradient magnitudes; they should be reasonable. If you start seeing values on the order of e+8, your data/gradients are starting to blow up. Decrease your learning rate!
  • Check that the diffs are not zero. Zero diffs mean no gradients = no updates = no learning.

References

  1. caffe︱深度學習參數調優雜記+caffe訓練時的問題+dropout/batch Normalization
  2. Common causes of nans during training
  3. Caffe debug info 的使用