
running in MPI mode and multiple scores per output PDB file?


Hi Forum 
I recently did a Rosetta fixbb run with MPI and found that the score file had many more lines of output than there were actual PDB files: 353 scores, but only 12 PDB files. Is it possible that the parallel processes are simply overwriting each other's PDBs? Is there a flag I should include to avoid this?


Wed, 2019-10-30 09:42

353/12 is not a whole number, but otherwise this is 100% the symptom of "you didn't actually run in MPI". It's what happens if you run non-MPI-compiled Rosetta (with or without mpiexec). I assume you used -nstruct 12.
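A toy sketch of why those numbers diverge (the filenames here are made up, not Rosetta's actual output names): if N serial copies run side by side, every copy appends its score lines to the shared score file, but they all write the same PDB filenames, so the PDBs overwrite each other:

```shell
# Simulate 3 serial processes, each producing 2 "structures".
# Score lines accumulate, but the PDB names collide and overwrite.
rm -f score.sc
for proc in 1 2 3; do
  for n in 1 2; do
    echo "SCORE: -123.4 design_000${n}" >> score.sc       # every process appends
    echo "model from process $proc" > design_000${n}.pdb  # overwritten each time
  done
done
echo "score lines: $(wc -l < score.sc | tr -d ' '), pdb files: $(ls design_*.pdb | wc -l | tr -d ' ')"
# → score lines: 6, pdb files: 2
```

The same effect at scale gives you far more score lines than PDB files, which matches the 353-vs-12 observation.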

Does your Rosetta binary have `mpi` in its name? It should be rosetta-app-name.mpi.(system)(compiler)(mode).

Wed, 2019-10-30 11:47

Is it possible that, even though the binary has 'mpi' in the name, it wasn't compiled correctly? Is there a unit test or something for MPI-compiled Rosetta?

Fri, 2019-11-01 10:16

No particularly useful tests that I know of. Rocco's comment below about the tracer tags with the MPI rank might be diagnostic. The log files themselves should say something too; I haven't done a run in a while, but the job distributor choice is probably announced, and you'll see it in a log line near the top.


Mon, 2019-11-04 15:41

Yup, the binary does have mpi in the name: 

mpiexec $HOME/rosetta_src_2019.22.60749_bundle/main/source/bin/fixbb.mpi.linuxgccrelease -s filename.pdb -ex1 -ex2 -resfile resfile.txt -nstruct 15 -overwrite -linmem_ig 10


The numbers probably don't work out exactly because I hit the walltime limit and the machine killed the job before it finished.

Fri, 2019-11-01 10:17

(comment removed and resubmitted as direct reply to previous poster) 

Fri, 2019-11-01 12:27

I'm wondering if it might be an MPI version mismatch. That is, if you compile with OpenMPI libraries, say, but your mpiexec comes from an MPICH2 install, then the MPICH2 launcher won't necessarily set things up properly for OpenMPI, and you might end up with each process thinking it's running serially, despite being under an MPI launcher.

Double check your compilation settings and where your mpiexec is coming from (e.g. `which mpiexec`). Sometimes with clusters you get a mixed environment where mpiexec goes to MPICH2 (for example), but mpirun goes to OpenMPI (or vice versa, etc.).
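A quick sketch of that check (the launcher names are standard, but the resolved paths will be cluster-specific):

```shell
# See where each common MPI launcher resolves in the current PATH.
# On mixed clusters, mpiexec and mpirun can come from different MPI stacks.
for launcher in mpiexec mpirun; do
  path=$(command -v "$launcher" || echo "not found")
  echo "$launcher -> $path"
done
```

If the two paths point into different MPI installations (or one is missing), that's a strong hint you have the mixed environment described above.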

The other thing to take a look at is the tracer output. If MPI is properly set up, each line should carry an annotation with the MPI rank in parentheses. If that's missing, or if it's all '(0)' (with no other numbers, despite launching multiple processes), then the MPI environment may not be set up correctly for Rosetta to realize it's running under MPI, and it may be running serially. There may be other information in the tracer about how things are running under MPI as well.
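As a concrete version of that check (the log lines below are fabricated stand-ins; real Rosetta tracer output differs in detail, but the parenthesized rank tag is what to look for), you can count the distinct rank tags in a log:

```shell
# Fabricated example log standing in for Rosetta tracer output.
cat > sample.log <<'EOF'
core.init: (0) Rosetta version 2019.22
core.init: (1) Rosetta version 2019.22
core.init: (2) Rosetta version 2019.22
EOF

# Count how many distinct MPI ranks tagged their lines.
# Seeing only "(0)" despite launching several processes suggests
# each process thinks it is running serially.
ranks=$(grep -o '([0-9]*)' sample.log | sort -u | wc -l | tr -d ' ')
echo "distinct ranks: $ranks"
# → distinct ranks: 3
```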

Mon, 2019-11-04 11:56

Yes!!! That seems to have been the problem! The version of Open MPI on the head node was different from that on the compute node. All fixed now!
Thank you all for your help!!

Wed, 2019-11-06 09:51