Intel MPI: early exit due to job process stopped.


Hello, I am using Rosetta's docking_protocol compiled with Intel MPI (from Intel® oneAPI Toolkits 2020), and I have some problems / bug reports.

I requested 3 output poses with '-nstruct 3' and ran 'mpirun -n 3 docking_protocol.mpi.linuxiccrelease ...' to start one master rank and two slave ranks. However, when one slave (rank 2) finished its jobs, the master sent it the spin-down signal:

protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Sending spin down signal to node 2


The slave process then exited with a non-zero exit code, even though its poses were output correctly and no error appeared in the log file.
mpirun treated this process as failed, killed all the other processes, including the master (rank 0) and the other slave (rank 1), and left some jobs undone:

# Intel MPI error message:
---------------------------------------------------------------
[ ERROR ]: Error(s) were encountered when running jobs.
2 jobs failed;
Check the output further up for additional error messages.
---------------------------------------------------------------

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 1094200 RUNNING AT mu03
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 2 PID 1094202 RUNNING AT mu03
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

 

Therefore, I used the '-disable-auto-cleanup' option of mpirun to make it ignore the failure and avoid aborting the job. However, after all jobs finished, the master and the remaining slave process were stuck, each using 100% of one CPU core. They were waiting for an MPI synchronization (MPI_Barrier), but the exited process could never join, so they waited forever.

In Open MPI, an exited process is no longer tracked by the MPI synchronization (MPI_Barrier), so the remaining processes exit successfully.
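
To make the hang concrete, here is a minimal standalone sketch of the situation in plain C++ MPI (not Rosetta's actual job distributor code; the rank number and messages are only illustrative). One rank leaves early with a non-zero exit code and without calling MPI_Finalize, and the remaining ranks wait in MPI_Barrier on MPI_COMM_WORLD for a rank that will never arrive. Under Intel MPI I would expect running it as 'mpirun -n 3 -disable-auto-cleanup ./a.out' to reproduce the busy-wait, but that reproduction command is my assumption, not something from Rosetta's documentation.

#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 2)
    {
        // Like the spun-down slave: leave with a non-zero exit code before
        // the final barrier and without calling MPI_Finalize.
        std::printf("rank %d: done, exiting early\n", rank);
        std::exit(1);
    }

    // The other ranks never return from this call, because rank 2 never
    // joins the barrier; with a busy-polling MPI they spin at 100% of a core.
    std::printf("rank %d: entering final barrier\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}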

According to the MPI specification, MPI processes should not exit asynchronously, so perhaps it would be better to make slave processes wait until all jobs are done?

Another option is to call MPI_Finalize before exiting, so that mpirun no longer tracks the process and neither reports a failure nor waits for it to synchronize.
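
Here is a minimal sketch of the two suggestions combined, again in plain C++ MPI rather than Rosetta code, so the sleep stand-in for the docking jobs is purely illustrative: every rank waits at a final barrier until all jobs are done and then calls MPI_Finalize before returning, so no rank disappears while the others are still running and mpirun sees a clean exit from every process.

#include <mpi.h>
#include <cstdio>
#include <unistd.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Stand-in for this rank's share of the docking jobs: ranks finish at
    // different times, like a slave that receives its spin-down signal early.
    sleep(static_cast<unsigned int>(rank) + 1);
    std::printf("rank %d: jobs done, waiting for the others\n", rank);

    // Suggestion 1: no rank exits until every rank has reached this point ...
    MPI_Barrier(MPI_COMM_WORLD);

    // Suggestion 2: ... and every rank shuts MPI down before returning, so
    // mpirun treats all of them as clean exits instead of reporting
    // BAD TERMINATION or waiting forever on a vanished process.
    MPI_Finalize();
    return 0;
}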

If there is already a way to make Intel MPI run without the early stop or the hang, using some option of mpirun or Rosetta, please let me know. Thank you very much.

P.S. My Rosetta version is 3.13. I've searched the forums for related issues, most of which focus on memory constraints, but I'm running on a cluster node with 128 GB of memory, so that does not seem to be the cause.

Mon, 2023-05-15 02:57
jackzzs