Caffe-related issues (continuously updated...)

Original

Cbird-coder

2020-06-17 09:29

case1: syncedmem.cpp:56] Check failed: error == cudaSuccess (2 vs. 0) out of memory

In this case, another process is probably already using the GPU and has consumed most of its memory, so no more memory can be allocated.

Use nvidia-smi to check GPU memory usage:

Mon Aug 21 17:22:35 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960     Off  | 0000:01:00.0      On |                  N/A |
| 40%   60C    P2    37W / 120W |   1513MiB /  1993MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1438    G   /usr/lib/xorg/Xorg                             160MiB |
|    0      1500    C   /usr/bin/python                                 44MiB |
|    0      3268    G   compiz                                         108MiB |
|    0     25749    C   ../caffe/build/tools/caffe                    1195MiB |
+-----------------------------------------------------------------------------+
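
To spot the process holding the most GPU memory without reading the table by eye, the "Processes" rows can be parsed with a short script. This is a minimal sketch that assumes the older nvidia-smi layout shown above (other driver versions may print different columns); the names ROW and parse_processes are made up for this illustration.

```python
import re

# Matches one row of the "Processes" table in the layout shown above, e.g.
# "|    0     25749    C   ../caffe/build/tools/caffe                    1195MiB |"
# Assumption: columns are GPU, PID, Type, Process name, Usage in MiB.
ROW = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+([CG](?:\+G)?)\s+(\S.*?)\s+(\d+)MiB\s+\|$")

def parse_processes(text):
    """Return (gpu, pid, type, name, used_mib) tuples, largest memory user first."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            gpu, pid, ptype, name, mib = m.groups()
            rows.append((int(gpu), int(pid), ptype, name, int(mib)))
    return sorted(rows, key=lambda r: r[4], reverse=True)

# The process rows copied from the nvidia-smi output above.
sample = """\
|    0      1438    G   /usr/lib/xorg/Xorg                             160MiB |
|    0      1500    C   /usr/bin/python                                 44MiB |
|    0      3268    G   compiz                                         108MiB |
|    0     25749    C   ../caffe/build/tools/caffe                    1195MiB |
"""

for gpu, pid, ptype, name, mib in parse_processes(sample):
    print(f"GPU {gpu}  PID {pid:>6}  {ptype}  {mib:>5} MiB  {name}")
```

Here the caffe process (PID 25749) comes out on top with 1195MiB, which is most of the 1993MiB available on this card.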

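Another possibility is that batch_size in the train and test prototxt files is set too large, so the GPU cannot hold that much data at once. Reducing it fixes the problem: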

test:

data_param {
  source: "/home/gesture1/lmdb/gesture1_test_lmdb"
  batch_size: 1
  backend: LMDB
}

train:

data_param {
  source: "/home/gesture1/lmdb/gesture1_trainval_lmdb"
  batch_size: 1
  backend: LMDB
}

Now let's look at what batch_size actually means:

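It is the number of images the network processes in each iteration. If you have 12800 images and set batch_size to 128, then training on all of the images once takes at least 100 iterations.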
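
The arithmetic can be checked with a couple of lines of Python. This is only an illustrative sketch; iterations_per_epoch is a made-up helper name, not part of Caffe.

```python
import math

def iterations_per_epoch(num_images, batch_size):
    """Iterations needed to see every image once; the last batch may be partial."""
    return math.ceil(num_images / batch_size)

print(iterations_per_epoch(12800, 128))  # 100
print(iterations_per_epoch(12800, 1))    # 12800
```

A smaller batch_size lowers memory use per iteration, but it takes proportionally more iterations to go through the same data.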

#############################

When using SSD with MobileNet,

the network printed the following message:

Couldn't find any detections

and then it crashed outright.

Looking at the network definition, I found the following settings:

code_type: CENTER_SIZE 
keep_top_k: 100 
confidence_threshold: 0.25

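I am currently training on my own dataset, which has only four object classes. It did not crash before because there were 21 classes, so it was quite likely that something would be detected as some other class with a confidence of around 0.25. To verify this guess, I checked the output settings in the original SSD deploy file: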

code_type: CENTER_SIZE
keep_top_k: 200
confidence_threshold: 0.01

This shows that whether the crash happens depends on the confidence_threshold value set here.

So I changed this value to 0.001 and ran the network again; the crash no longer occurs.
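
To see why the threshold matters, here is a simplified sketch of score filtering. It is not the SSD detection output layer's actual code (which also does per-class non-maximum suppression); filter_detections and the (label, score) pairs are hypothetical, chosen only to mimic a model whose scores are all low.

```python
def filter_detections(detections, confidence_threshold, keep_top_k):
    """Keep detections scoring above the threshold, then the keep_top_k best of those."""
    kept = [d for d in detections if d[1] > confidence_threshold]
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:keep_top_k]

# Hypothetical (label, score) pairs for a model with low-confidence outputs.
detections = [("class_1", 0.12), ("class_2", 0.08), ("class_3", 0.03)]

# With the threshold that led to the crash, nothing survives the filter.
print(filter_detections(detections, confidence_threshold=0.25, keep_top_k=100))

# With the lowered threshold, every candidate is kept.
print(filter_detections(detections, confidence_threshold=0.001, keep_top_k=100))
```

With 0.25 the result is an empty list, which corresponds to the "Couldn't find any detections" situation; with 0.001 all three candidates pass.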
