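1. The MXNet version of LFFD requires CUDA 10.1 and cuDNN
If either is missing, the following errors occur.
CUDA is not installed, or the installed version is too old:
Traceback (most recent call last):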
File "configuration_10_320_20L_5scales_v2.py", line 17, in <module>
import mxnet
File "/usr/local/lib/python3.6/dist-packages/mxnet/__init__.py", line 24, in <module>
from .context import Context, current_context, cpu, gpu, cpu_pinned
File "/usr/local/lib/python3.6/dist-packages/mxnet/context.py", line 24, in <module>
from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 213, in <module>
_LIB = _load_lib()
File "/usr/local/lib/python3.6/dist-packages/mxnet/base.py", line 204, in _load_lib
lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libcudart.so.10.1: cannot open shared object file: No such file or directory
cuDNN is not installed:
terminate called after throwing an instance of 'dmlc::Error'
what(): [20:48:36] ../include/mshadow/./stream_gpu-inl.h:173: Check failed: err == CUDNN_STATUS_SUCCESS (4 vs. 0) : CUDNN_STATUS_INTERNAL_ERROR
Aborted (core dumped)
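Before importing MXNet, it can help to check whether the dynamic loader can find the CUDA and cuDNN libraries at all. The following is a minimal sketch using Python's ctypes. The helper name can_load is hypothetical, libcudart.so.10.1 is taken from the traceback above, and libcudnn.so.7 is an assumption (the usual library name for cuDNN 7.x; adjust it to the version you installed).

```python
import ctypes


def can_load(lib_name):
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Same error class as in the traceback: the shared object was not found
        # (or could not be loaded).
        return False


if __name__ == "__main__":
    # libcudnn.so.7 is an assumed name; change it to match your cuDNN version.
    for name in ("libcudart.so.10.1", "libcudnn.so.7"):
        status = "found" if can_load(name) else "NOT found"
        print(name, status)
```

A "NOT found" result usually means the library is not installed, or its directory is not on the loader search path (for example LD_LIBRARY_PATH). A "found" result only shows that the file can be loaded; it does not rule out other cuDNN problems.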
2. Use the correct Python and install the correct MXNet version
If CUDA and cuDNN are installed correctly but the following error still appears:
terminate called after throwing an instance of 'dmlc::Error'
what(): [20:48:36] ../include/mshadow/./stream_gpu-inl.h:173: Check failed: err == CUDNN_STATUS_SUCCESS (4 vs. 0) : CUDNN_STATUS_INTERNAL_ERROR
Aborted (core dumped)
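there are two things to check. First, confirm that the installed MXNet version is correct, that is, a build that matches CUDA 10.1 (such as the mxnet-cu101 pip package). Then, in configuration_10_560_25L_8scales_v1.py, comment out the following lines: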
# add mxnet python path to path env if need
mxnet_python_path = '/home/heyonghao/libs/incubator-mxnet/python'
sys.path.append(mxnet_python_path)
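These lines add the original author's own MXNet checkout to the import path. We only need the default local Python and the MXNet package installed for it.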
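To confirm which MXNet installation the default Python would actually pick up, a small check like the following may help. It is a sketch using only the standard library; the helper name locate_module is hypothetical.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    path = locate_module("mxnet")
    if path is None:
        print("mxnet is not importable from this Python")
    else:
        print("mxnet would be imported from:", path)
```

If the reported path lies under the Python package directory (for example /usr/local/lib/python3.6/dist-packages/mxnet, as in the traceback in section 1), the pip-installed package is being used. Once the import succeeds, mxnet.__version__ shows the installed version.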
3. Install OpenCV correctly
If the following error appears:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "_ctypes/callbacks.c", line 234, in 'calling callback function'
File "/root/work/mxnet/python/mxnet/operator.py", line 1052, in backward_entry
print('Error in CustomOp.backward: %s' % traceback.format_exc())
UnicodeEncodeError: 'ascii' codec can't encode characters in position 369-376: ordinal not in range(128)
it indicates that OpenCV is not installed correctly. Uninstall the old version, then install the following version:
pip install opencv-python==3.4.5.20
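After reinstalling, the version that Python actually imports can be checked with a short script. This is a minimal sketch; the helper name opencv_version is hypothetical, and it assumes the package is importable as cv2, which is the module name opencv-python provides.

```python
def opencv_version():
    """Return cv2.__version__ if OpenCV can be imported, else None."""
    try:
        import cv2
    except ImportError:
        return None
    return cv2.__version__


if __name__ == "__main__":
    version = opencv_version()
    if version is None:
        print("OpenCV (cv2) is not installed for this Python")
    else:
        print("OpenCV version:", version)
```

With the pip package above, the reported version should start with 3.4.5. If a different version is printed, an older installation is probably still being imported.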
4. Set batch_size correctly
The following error most likely means that batch_size is too large:
MXNetError: cudaMalloc retry failed: out of memory
Try setting batch_size=16.
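If 16 is still too large for your GPU, or you want to find the largest value that fits, one common approach is to halve the batch size until a trial step succeeds. The sketch below is not part of LFFD: find_batch_size and try_step are hypothetical names, and it assumes the out-of-memory failure is raised as a catchable Python exception, which depends on the MXNet version (some failures abort the process instead).

```python
def find_batch_size(try_step, start=64, minimum=1,
                    oom_errors=(RuntimeError, MemoryError)):
    """Halve the batch size until try_step(batch_size) succeeds.

    try_step is a user-supplied callable that runs one trial step and raises
    one of oom_errors on failure. Returns the first batch size that works,
    or None if even `minimum` fails.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except oom_errors:
            batch_size //= 2
    return None


if __name__ == "__main__":
    # Stand-in for a real training step: pretend anything above 16 runs out of memory.
    def fake_step(batch_size):
        if batch_size > 16:
            raise RuntimeError("cudaMalloc retry failed: out of memory")

    print(find_batch_size(fake_step))  # prints 16
```

In real use, try_step would run one forward and backward pass with the given batch size, and oom_errors would be set to whatever exception type your MXNet version raises.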