PyTorch multi-GPU issue log

Shared memory issue: unable to open shared memory object </torch_> in read-write mode

I was doing NAS (neural architecture search); the network was too large to fit on a single GPU, so I tried DDP for multi-GPU training.

(py36torch15) xx@cluster:~/wang/FasterCrowdCountingNAS/FBNetBranch$ python main.py 
/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py:23: UserWarning: You requested multiple GPUs but did not specify a backend, e.g. Trainer(distributed_backend=dp) (or ddp, ddp2). Setting distributed_backend=ddp for you.
  warnings.warn(*args, **kwargs)
GPU available: True, used: True
No environment variable for node rank defined. Set as 0.
CUDA_VISIBLE_DEVICES: [0,1,2]
Traceback (most recent call last):
  File "main.py", line 29, in <module>
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 844, in fit
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 149, in start_processes
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/process.py", line 105, in start
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 333, in reduce_storage
RuntimeError: unable to open shared memory object </torch_12222_563474802> in read-write mode

At first I assumed I had made a mistake somewhere in my own code. Tracing into the spawn source showed it was actually the open-files limit. `ulimit -a` reports:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 514771
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 514771
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

A limit of 1024 is far too small; running `ulimit -SHn 51200` solved the problem.
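The same limit can also be inspected and raised from inside Python using the standard-library `resource` module, which is handy when you cannot change the shell's ulimit (e.g. inside a launcher script). This is a minimal sketch, assuming a Unix system; raising the soft limit above the hard limit still requires root, just as with `ulimit -SHn`.

```python
import resource

# Inspect the current open-files limit (soft, hard) for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# Raise the soft limit as far as the hard limit allows, up to 51200.
# (Raising the hard limit itself requires root privileges.)
target = 51200 if hard == resource.RLIM_INFINITY else min(51200, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE))
```

Calling this before the DataLoader workers start gives each spawned process the higher soft limit, since children inherit their parent's resource limits.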

Multiprocessing issue: The "freeze_support()" line can be omitted if the program is not going to be frozen to produce an executable.

  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "/home//anaconda3/envs/py36torch15/lib/python3.6/multiprocessing/spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Using multiprocessing, this error popped up next. It is really about how Python starts child processes.

On Unix the default start method is fork: the child is a copy of the parent that can go on to run a different function, and it inherits the parent's data, so data flows conveniently from parent to child. Windows does not support fork and uses spawn instead. Spawn also creates a new process, but the child re-executes the code in the main module, just as the parent did, before running its target function. Without any guard, each child would therefore keep copying itself and spawning new processes indefinitely. Python's designers anticipated this: if you try to create a new process while a spawned process is still in its bootstrapping phase, it errors out. How do you tell the main process apart from a child? The usual way is the `__name__` attribute.
Fix: guard all process creation behind the `__main__` check.

```python
import multiprocessing as mp

def worker():
    print("child process running")

if __name__ == '__main__':
    # Only the original process enters here; spawned children
    # re-import this module but skip this block.
    p = mp.Process(target=worker)
    p.start()
    p.join()
```
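On Linux the default start method is fork, so the error above may not reproduce until something (like PyTorch Lightning's DDP) switches to spawn. A small sketch that forces the spawn start method on Unix, so you can see the guarded idiom working under the Windows-style behavior described above (`worker` and the queue are illustrative names, not from the original post):

```python
import multiprocessing as mp

def worker(q):
    # With spawn, the child re-imports this module before
    # calling worker(), which is why the __main__ guard matters.
    q.put("hello from child")

if __name__ == '__main__':
    # Force the Windows-style start method on Unix.
    mp.set_start_method('spawn', force=True)
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```

Removing the `if __name__ == '__main__':` guard from this script and running it would raise exactly the `freeze_support()` RuntimeError shown above.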