1. NCCL unhandled cuda error
Problem:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1565272271120/work/torch/lib/c10d/ProcessGroupNCCL.cpp:290, unhandled cuda error
Traceback (most recent call last):
…
subprocess.CalledProcessError: Command '['/home/user3/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/test.py', '--local_rank=2', 'configs/collin/dcn/faster_rcnn_dconv_c3-c5_r50_fpn_1x–hrrsd.py', 'work_dirs/faster_rcnn_dconv_c3-c5_r50_fpn_1x–hrrsd/epoch_12.pth', '--launcher', 'pytorch', '--out', 'work_dirs/faster_rcnn_dconv_c3-c5_r50_fpn_1x–hrrsd/results.pkl', '--show']' returned non-zero exit status 1.
Solution:
Change which GPUs are visible to the job, and make sure that no other processes are running on those GPUs.
export CUDA_VISIBLE_DEVICES=0,5,6
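
To pick the GPU list without reading nvidia-smi output by hand, something like the following sketch may help. It assumes nvidia-smi is on the PATH and that a GPU counts as free when no compute process is attached to it. The helper names idle_gpu_indices and query are hypothetical and are not part of PyTorch or mmdetection.

```python
import subprocess


def idle_gpu_indices(gpu_csv, apps_csv):
    """Return the indices of GPUs that have no compute process attached.

    gpu_csv:  text from `nvidia-smi --query-gpu=index,uuid --format=csv,noheader`
    apps_csv: text from `nvidia-smi --query-compute-apps=gpu_uuid --format=csv,noheader`
    """
    # UUIDs of GPUs that currently host at least one compute process.
    busy = {line.strip() for line in apps_csv.splitlines() if line.strip()}
    idle = []
    for line in gpu_csv.splitlines():
        if not line.strip():
            continue
        index, uuid = (field.strip() for field in line.split(","))
        if uuid not in busy:
            idle.append(int(index))
    return idle


def query(field_arg):
    """Run nvidia-smi with one --query-* argument and return its CSV text."""
    result = subprocess.run(
        ["nvidia-smi", field_arg, "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


if __name__ == "__main__":
    gpus = query("--query-gpu=index,uuid")
    apps = query("--query-compute-apps=gpu_uuid")
    ids = ",".join(str(i) for i in idle_gpu_indices(gpus, apps))
    # Print the line to paste into the shell before launching the job.
    print(f"export CUDA_VISIBLE_DEVICES={ids}")
```

Note that nvidia-smi numbers GPUs in PCI bus order, while CUDA may order them differently by default. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID alongside CUDA_VISIBLE_DEVICES should keep the two numberings consistent.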