I was running an mpi-operator demo (one that I had submitted myself, no less...) and hit the following error.
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: mpi-sleep-worker-0
Local PID: 99
Peer host: mpi-sleep-worker-1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 58 on node mpi-sleep-worker-1 exited on signal 9 (Killed).
--------------------------------------------------------------------------
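After staring at this for quite a while, I found that the Worker had been given too little memory (only 1Gi before). The "exited on signal 9 (Killed)" line fits that: a process that exceeds its container memory limit is typically terminated with SIGKILL by the kernel's OOM killer. If you want to run this demo, raise the Worker memory to 2Gi.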
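For reference, here is a rough sketch of where the memory setting would live in an MPIJob manifest. Only the 1Gi and 2Gi values come from the run above. The job name is inferred from the pod names in the log, and the API version, replica count, container name, and image are assumptions or placeholders, so adapt them to your own manifest. The Launcher section is omitted.

```yaml
# Fragment of an MPIJob manifest (Launcher omitted); not a complete, tested manifest.
apiVersion: kubeflow.org/v2beta1      # assumption: use the version your mpi-operator serves
kind: MPIJob
metadata:
  name: mpi-sleep                     # inferred from the pod name mpi-sleep-worker-0
spec:
  mpiReplicaSpecs:
    Worker:
      replicas: 2                     # assumption: the log shows worker-0 and worker-1
      template:
        spec:
          containers:
          - name: worker              # hypothetical container name
            image: example/mpi-sleep:latest   # hypothetical image
            resources:
              limits:
                memory: 2Gi           # was 1Gi, which got rank 1 killed
```

If you suspect the same cause, `kubectl describe pod mpi-sleep-worker-1` may show the container's last state as terminated with reason `OOMKilled`, which is a quicker way to confirm it than reading the mpirun output.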