Has anyone ever had any problems with the master MPI process stalling as it's cleaning up / giving more work to a slave or a little afterward?
I am running more tests tomorrow, but I have had it stall (while still showing 100% CPU) with its output garbled like this, just before it stops communicating with the rest of the slaves. Once it stops communicating, all the slaves continue to run, but nothing gets output and the job needs to be cancelled. Also, no error message is reported from MPI or Rosetta. It just silently stops:
protocols.jd2.MPIWorkPoolJobDistributor: (81) Slave Node 81: Finished job successfully! Sendiprotocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: Received message from ng output reque 81 with tag 30
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Nst to master.
ode: Receivedprotocols.jd2.MPIWorkPoolJobDistributor: (81) Slave job Node 81: Receivesd output confirmuccess message for jation from master.ob id 65 from node 81 blocking till output is d Writing oone
protocols.jd2.MPIWorkPoolJobDistributor: (81) Slave Node 81: Finished writing output in 0.3 seconds. Sendiprotocols.jd2.MPIWorkPoolJobDistributor: (0) Masterng messag Node: Received job oute to mastput finish messageer
for job id protocols.jd2.MPIWorkPoolJobDistributor: (81) Slave0 f Node 81: Requesting rom node new job id from mas81
protocols.jd2.MPIWorkPoolJobDistributor: (0) Mter
aster Node: Waiting for 56 slaves to finish jobs
protocols.jd2.MPIWorkPoolJobDistributor: (81) Slavprotocols.jd2.MPIWorkPoolJobDistributor: (0) Mastee Node 81: Receir Node: Received message from 81 ved job id 0 frwith tag 10
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master protocols.jd2.JobDistributor: (81) no more bNode: Sendinatches to procg spin down ess...
signal to node 81
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: protocols.jd2.JobDistributor: (81) 134 joWaiting bs considfor 5ered, 1 jo5 slaves to bs attempted finish jobs
in 191005 seconds
ode: Received jobprotocols.jd2.MPIWorkPoolJobDistributor: (21) Slave success messa Node 2ge for job id1: Received out 119 from nodput confirmation from master. Write 21 blocking ing output.
till output is done
protocols.jd2.MPIWorkPoolJobDistributor: (21) Slave Node 21: Finished writing output in 0.28 seconds. Sending message to master
protocols.jd2.MPIWorkPoolJobDistributor: (0) Masterprotocols.jd2.MPIWorkPoolJobDistributor: (21) Slav Node: Received job e Node 21: Requesting output finish mesnew job id from mastersage for job
id 133 from node 21
protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Nprotocols.jd2.MPIWorkPoolJobDistributor: (21) Slave ode: WaitinNode 21: Received job ig for job reqd 133 from master
protocols.jd2.PDBJobInputter: (21) PDBJprotocols.jd2.MPIWorkPoolJobDistributor: (0) Master obInputter::pose_frNode: Received messagom_job
protocols.jd2.PDBJobInputter: (21) fe fr
We are running mpiexec (OpenRTE) 1.6.3
Thanks for any help!
Are you using -mpi_tracer_to_file (filestem)? That will preclude garbling by putting different nodes in different output files. I haven't seen the behavior but it makes me wonder if you somehow have two head nodes...do the tracers' node tags add up correctly?
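If it helps, here is a rough Python sketch of one way to check whether the node tags add up. It scans a log for the "(rank) Master/Slave" prefix that MPIWorkPoolJobDistributor prints and reports which ranks identify themselves as the master. The function names are made up for illustration, and on a log as interleaved as the one above some lines will be too garbled to match, so treat a clean result as a hint rather than proof.

```python
import re
from collections import defaultdict

# Matches the tracer prefix seen in the log above, e.g.
# "protocols.jd2.MPIWorkPoolJobDistributor: (0) Master Node: ..."
TAG = re.compile(
    r"protocols\.jd2\.MPIWorkPoolJobDistributor: \((\d+)\) (Master|Slave)\b"
)

def roles_by_rank(log_text):
    """Map each MPI rank to the set of roles it claims in the log."""
    roles = defaultdict(set)
    for rank, role in TAG.findall(log_text):
        roles[int(rank)].add(role)
    return dict(roles)

def master_ranks(log_text):
    """Return the sorted ranks that print 'Master' lines (ideally just [0])."""
    return sorted(
        rank for rank, roles in roles_by_rank(log_text).items()
        if "Master" in roles
    )
```

If `master_ranks` returns anything other than a single rank, that would be worth a closer look; lines mangled by interleaving are silently skipped.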
I'm not using any extra MPI options, though the option you suggested may be handy (Is it generally recommended?).
I don't know how I would have two head nodes... Looking through the log file, there seems to be only one. I emailed you the log file if you can take a quick look. I have read that Open MPI can stall on occasion, and looking at its bug tracker, it's a bit overwhelming to try to work out how or why it's happening via some bug in the MPI libraries that just happens to dislike our cluster. Do you use Open MPI or MPICH2 on your cluster?
That flag is pretty much strictly necessary for MPI debugging, and strongly encouraged with MPI if you are not using -mute all instead...because otherwise you get garbage data. We've never discussed defaulting it to true; we could, but we'd need to put a system in to make sure the new log files won't overwrite any existing files.
I've used Rosetta with both OpenMPI and MPICH2.
The log file you sent me does not seem to indicate an error... the last report from the head node is that it's waiting for a request.
I'll add that to my flags and try to debug it further. Perhaps a slave node failed and something went wrong with the master-slave communication. Yea, no error, just stall. It stayed like that for almost a day, meanwhile on the cluster page, all my processes were running at 100 %. Other large runs came out fine, the structures that were complete came out fine as well. Do you know which version of OpenMPI you have run it on?
I've seen issues like this on the killdevil cluster at UNC, which I no longer have access to, but I generally just wrote it off as "hardware", and it recurred so rarely/irreproducibly that it was never worth worrying about further. (That's MPICH2.) (Given your situation, unless it reproduces reliably, I'd ignore it too.)
Contador in Brian's lab uses openMPI, apparently version 1.5. We don't use it as a "cluster" and I don't think anyone's seen this there.
Can you log into slave nodes directly to run "top" and see what they are actually doing? The clusters I have used have allowed you to directly ssh into slave nodes...they'll get angry if you use it to run jobs, but it's cool for debugging.
It happened twice in a row, out of 3 total runs. The first may have been because people were oversubscribing nodes and our grid engine has no MPI communication, so the master could have just given up. I just started using MPI, so I'm not sure how common it will be for us, but I'll run a few more and see what happens. There is a top-like webpage for our cluster that lists all the processes and their speed, memory, etc. I'll have our cluster admin update openMPI and see if it keeps happening. It didn't happen on the numerous test runs, which were short but used the same number of nodes/processors/structures.
If all else fails, I guess it's back to batch jobs...
This is "bad practice", but if they're failing at the _end_ of the run, just let them fail. Unless you need precisely 10000 structures for some reason (like statistics-sensitive thermodynamic ensemble work), 9998 is good enough, so let it produce what it will, then kill the job.
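To know when it is time to kill the job, something like the sketch below could flag a run whose output files have gone quiet. It is only an illustration: the function names and the idea of watching file modification times are my assumptions, not anything Rosetta provides, and you would have to pick the file list (log files, output PDBs or silent files) and the idle threshold yourself.

```python
import os
import time

def seconds_since_last_write(paths, now=None):
    """Seconds since the most recently modified file in `paths` was written.

    Raises ValueError if `paths` is empty.
    """
    now = time.time() if now is None else now
    newest = max(os.path.getmtime(p) for p in paths)
    return now - newest

def looks_stalled(paths, max_idle_seconds, now=None):
    """True if none of the files has been touched within the idle window."""
    return seconds_since_last_write(paths, now) > max_idle_seconds
```

The threshold should be comfortably longer than your slowest single job; otherwise a healthy run that is simply between outputs will look stalled.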
Will definitely consider this if the slave nodes keep going...
A follow up to the discussion from our cluster admin:
The mpd daemons on nodes 63 and 64 failed for some reason leaving the mpd ring broken. To list the state of the mpd ring use /apps/mpich2/bin/mpdtrace -l. I have restarted the ring and will start a search for more robust code.
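For anyone who wants to check the ring before submitting a long run, here is a small Python sketch built around the mpdtrace command mentioned above. It assumes that each line of `mpdtrace -l` output starts with `hostname_port` followed by the IP address in parentheses; please confirm that against your own MPICH2 install. The function names and the host names in the usage note are hypothetical.

```python
import subprocess

def hosts_in_ring(mpdtrace_output):
    """Extract host names from `mpdtrace -l` output.

    Assumes each non-empty line begins with 'hostname_port'.
    """
    hosts = set()
    for line in mpdtrace_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # rsplit keeps host names that themselves contain underscores
        hosts.add(fields[0].rsplit("_", 1)[0])
    return hosts

def missing_hosts(expected, mpdtrace_output):
    """Return the expected hosts that do not appear in the ring, sorted."""
    return sorted(set(expected) - hosts_in_ring(mpdtrace_output))

def run_mpdtrace(path="/apps/mpich2/bin/mpdtrace"):
    """Run `mpdtrace -l` and return its stdout (path as given by our admin)."""
    result = subprocess.run(
        [path, "-l"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Usage would be along the lines of `missing_hosts(["node63", "node64"], run_mpdtrace())`, with your own host list; an empty result means every expected host is in the ring.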