Handling errors during model training and deployment

Original

陶瑞同学

2020-02-20 18:28

Contents

1. Resolving "Allocation of X exceeds 10% of system memory"
2. wget download error: connection reset by peer
3. Error: ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: libcublas.so.8.0: cannot open shared object file: No such file or directory
4. Error: Attempting to fetch value instead of handling error Failed precondition: could not dlopen DSO: libcupti.so.10.0; dlerror: libcupti.so.10.0: cannot open shared object file: No such file or directory
5. Error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 14 num_classes: 4563 labels: 2819,2524,3491,3526,2672 [[{{node CTCLoss}}]] [[{{node gradients/CTCLoss_grad/mul}}]]
6. Error: tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

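1. Resolving "Allocation of X exceeds 10% of system memory"

Kill the processes that are occupying the GPU so that enough memory is free. Run "nvidia-smi" to list the running processes, then "kill -9 <PID>" to terminate each one.
Make sure the network is not too large; check for oversized fully connected layers.
Check whether float64 is used; it takes twice the memory of float32.
Check whether the Adam or RMSprop optimizer is used. These optimizers keep running statistics of past gradients for every parameter, which can double the memory needed for the parameters or more.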
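
To make the checks above concrete, here is a rough back-of-the-envelope sketch of how the data type and the optimizer state scale memory. The function name, the layer size (4096 x 4096), and the simplification that only weights, gradients, and optimizer state are counted (activations are ignored) are assumptions for illustration, not measurements from a real run.

```python
def training_memory_bytes(num_params, bytes_per_value=4, optimizer_slots=0):
    """Rough memory for weights, gradients, and optimizer state.

    optimizer_slots is the number of extra values the optimizer keeps per
    parameter (for example 2 for Adam: first and second moment estimates).
    Activations and framework overhead are deliberately ignored.
    """
    copies = 1 + 1 + optimizer_slots  # weights + gradients + optimizer state
    return num_params * bytes_per_value * copies


# Hypothetical fully connected layer: 4096 inputs, 4096 outputs, plus biases.
fc_params = 4096 * 4096 + 4096

sgd_fp32 = training_memory_bytes(fc_params, bytes_per_value=4, optimizer_slots=0)
adam_fp32 = training_memory_bytes(fc_params, bytes_per_value=4, optimizer_slots=2)
adam_fp64 = training_memory_bytes(fc_params, bytes_per_value=8, optimizer_slots=2)

for name, size in [("SGD float32", sgd_fp32),
                   ("Adam float32", adam_fp32),
                   ("Adam float64", adam_fp64)]:
    print(f"{name}: {size / 2**20:.1f} MiB")
```

Under these assumptions a single large fully connected layer already costs over a hundred MiB, and switching to Adam or to float64 each doubles that figure.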

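2. wget download error: connection reset by peer

Connection closed
"Connection closed by peer" generally means that the connection was deliberately closed by the target machine (or by something else along the route).
If the download itself otherwise works, the server may be limiting the download volume or the number of simultaneous downloaders.
Fix: send the User-Agent string of a mainstream browser.

wget --user-agent "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 (KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 Chrome/25.0.1364.160 Safari/537.22" <URL>

Corrupted file
If the download is interrupted at the same size every time, the beginning of the file may be corrupted, and the download needs to pause for a few seconds.
Fix: set a reasonable randomized wait time.

wget --wait=15 --random-wait <URL>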
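
The same two ideas (a browser-like User-Agent and a randomized pause before trying again) can be sketched in Python with only the standard library. This is a minimal sketch, not a replacement for wget: the function names, the retry count, and the timeout are assumptions, and the 0.5x to 1.5x range mirrors how the wget manual describes --random-wait.

```python
import random
import shutil
import time
import urllib.request

BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.22 "
              "(KHTML, like Gecko) Ubuntu Chromium/25.0.1364.160 "
              "Chrome/25.0.1364.160 Safari/537.22")


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that presents a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def random_wait(base_seconds):
    """Pick a pause between 0.5x and 1.5x of the base wait."""
    return random.uniform(0.5 * base_seconds, 1.5 * base_seconds)


def download(url, dest, attempts=3, base_wait=15):
    """Try to download url to dest, pausing a random time between attempts."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(build_request(url), timeout=30) as resp, \
                    open(dest, "wb") as out:
                shutil.copyfileobj(resp, out)
            return True
        except OSError:  # covers URLError and ConnectionResetError
            if attempt < attempts:
                time.sleep(random_wait(base_wait))
    return False
```

A call would look like download("<URL>", "model.tar.gz"), where both arguments are placeholders.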

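3. Error: ERROR (theano.sandbox.cuda): Failed to compile cuda_ndarray.cu: libcublas.so.8.0: cannot open shared object file: No such file or directory

The solutions found online all say to check the environment variable:

LD_LIBRARY_PATH: /usr/local/cuda/lib64/

If setting the environment variable does not solve the problem, possibly because the CUDA/cuDNN installation directory on the ML machine goes through nested symbolic links, refresh the dynamic linker cache with ldconfig:

sudo ldconfig /usr/local/cuda-8.0/lib64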
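
Before and after running ldconfig, it can help to ask the dynamic loader directly whether it can open the library. The following is a small sketch using ctypes from the standard library; the helper name can_load is an assumption made for illustration.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the loader cannot find or open the library.
        return False


if __name__ == "__main__":
    name = "libcublas.so.8.0"
    status = "loadable" if can_load(name) else "NOT found by the loader"
    print(name, status)
```

If this still reports the library as not found after ldconfig, the directory passed to ldconfig is probably not the one that actually contains the file.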

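4. Error: Attempting to fetch value instead of handling error Failed precondition: could not dlopen DSO: libcupti.so.10.0; dlerror: libcupti.so.10.0: cannot open shared object file: No such file or directory

Add the CUDA and CUPTI library directories to the environment variable by running the command below. Note that it overwrites any existing LD_LIBRARY_PATH value.
export LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64"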
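
To confirm that the directories in the variable really contain the missing file, a quick search over the path can be sketched as follows. The helper name find_library is hypothetical; it only checks that the file exists and does not verify that the library itself is valid.

```python
import os


def find_library(filename, search_path):
    """Return the first directory in a path list that contains filename.

    search_path uses the platform separator (":" on Linux), the same
    format as LD_LIBRARY_PATH. Returns None if no directory matches.
    """
    for directory in search_path.split(os.pathsep):
        if directory and os.path.exists(os.path.join(directory, filename)):
            return directory
    return None


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print(find_library("libcupti.so.10.0", current))
```

Keep in mind that the variable has to be exported in the shell before the Python process starts; the loader typically reads it at process startup.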

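5. Error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Saw a non-null label (index >= num_classes - 1) following a null label, batch: 14 num_classes: 4563 labels: 2819,2524,3491,3526,2672 [[{{node CTCLoss}}]] [[{{node gradients/CTCLoss_grad/mul}}]]

Cause: with the Theano backend the label indices start from 1, whereas with the TensorFlow backend they start from 0. TensorFlow's CTC loss also reserves the last index, num_classes - 1, for the blank (null) label, so a real label carrying that index triggers this error.
Fix: change the index of the last character in the dictionary to 0.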
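
The fix can be sketched as a small remapping step applied to the label sequences before they reach the CTC loss. This is an illustration under one stated assumption: the dictionary was built with 1-based indices, so index 0 is unused and free to take over the last character. Both function names are hypothetical.

```python
def remap_blank_collisions(labels, num_classes):
    """Move any label equal to num_classes - 1 (reserved for blank) to index 0.

    Assumes index 0 is unused, as in a dictionary numbered from 1.
    """
    blank = num_classes - 1
    return [0 if label == blank else label for label in labels]


def labels_are_valid(labels, num_classes):
    """Check that every label lies in [0, num_classes - 2]."""
    return all(0 <= label < num_classes - 1 for label in labels)


# Example with the num_classes value from the error message; 4562 is a
# made-up label that collides with the blank index.
num_classes = 4563
labels = [2819, 2524, 4562]
fixed = remap_blank_collisions(labels, num_classes)
print(fixed, labels_are_valid(fixed, num_classes))
```

Running the validity check on the whole training set before training is a cheap way to catch this problem without waiting for the loss op to fail.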

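6. Error: tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

Search results on Baidu all attribute this to mismatched CUDA, cuDNN, and TensorFlow versions. If the environment has been verified, or has not been changed,
the cause may instead be the GPU memory fraction that is specified: here 0.95 triggered the error while 0.9 did not, possibly because the higher setting left too little GPU memory for cuDNN to initialize.
Bad case: config.gpu_options.per_process_gpu_memory_fraction = 0.95
Correct: config.gpu_options.per_process_gpu_memory_fraction = 0.9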
