Fixing "terminate called after throwing an instance of 'std::length_error'" in distributed training

While running distributed training, the job aborted with the following output:

INFO:tensorflow:Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).
I0408 04:01:41.507015 140706188736256 cross_device_ops.py:427] Reduce to /replica:0/task:0/device:CPU:0 then broadcast to ('/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Create CheckpointSaverHook.
I0408 04:01:44.424420 140706188736256 basic_session_run_hooks.py:541] Create CheckpointSaverHook.
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::append
Fatal Python error: Aborted 

After a long, roundabout troubleshooting session, reducing the number of GPUs used for training made the job run normally.
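The post does not say how the GPU count was reduced. One common way, sketched below, is to restrict which devices the process can see through the `CUDA_VISIBLE_DEVICES` environment variable before TensorFlow is imported. The helper name `limit_visible_gpus` and the count of 2 are illustrative assumptions, not values from the original setup.

```python
import os


def limit_visible_gpus(num_gpus, env=os.environ):
    """Expose only the first `num_gpus` CUDA devices to this process.

    Hypothetical helper for illustration. It writes CUDA_VISIBLE_DEVICES
    into `env` (os.environ by default) and returns the value it set.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    # Device indices are comma-separated; an empty string hides all GPUs.
    visible = ",".join(str(i) for i in range(num_gpus))
    env["CUDA_VISIBLE_DEVICES"] = visible
    return visible


# Call this before importing TensorFlow, because the CUDA runtime reads
# the variable when it initializes. The value 2 is an arbitrary example.
limit_visible_gpus(2)
```

The same effect can be had from the shell by setting `CUDA_VISIBLE_DEVICES=0,1` when launching the training script. If the error disappears with fewer devices, raising the count one step at a time may help find the point where it returns.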
