Running into "MPI worker exited on signal 9"

While running an mpi-operator demo (one that I had submitted myself…), I hit the following error.

An MPI communication peer process has unexpectedly disconnected.  This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).

Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate.  For
example, there may be a core file that you can examine.  More
generally: such peer hangups are frequently caused by application bugs
or other external events.

  Local host: mpi-sleep-worker-0
  Local PID:  99
  Peer host:  mpi-sleep-worker-1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 58 on node mpi-sleep-worker-1 exited on signal 9 (Killed).
--------------------------------------------------------------------------

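After digging for quite a while, I found that the Worker had been configured with too little memory (only 1Gi before). If you want to run this demo, raise the Worker's memory to 2Gi.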
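
For reference, the "signal 9 (Killed)" in the mpirun message is SIGKILL. This is the signal the Linux kernel's OOM killer sends when a container exceeds its memory limit, which is consistent with the memory fix above. A container killed this way typically reports exit code 137 (128 + 9). Below is a minimal Python sketch for decoding such numbers. The helper names `describe_signal` and `shell_exit_code` are made up for illustration and are not part of mpi-operator or Open MPI.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a signal number (hypothetical helper)."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Number is not a known signal on this platform.
        return f"unknown signal {signum}"


def shell_exit_code(signum: int) -> int:
    """Exit status conventionally reported for a process killed by a signal."""
    return 128 + signum


if __name__ == "__main__":
    # 9 is the signal number taken from the mpirun message above.
    print(describe_signal(9), shell_exit_code(9))
```

On Linux this should print `SIGKILL 137`. If a worker pod shows that exit code, the memory limit is probably the first thing to check.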
