Running in MPI mode and multiple scores per output PDB file?

Hi Forum,
I recently did a Rosetta fixbb run with MPI and found that the score file had many more lines of output than there were actual PDB files. Specifically, I've got 353 scores in score.sc but only 12 PDB files. Is it possible that the parallel processes are simply overwriting the PDBs? Is there a flag I should be including to avoid this?

Thanks!

Category:

Design

Scoring

Post Situation:

Solved

Wed, 2019-10-30 09:42

dantimatter

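353/12 is not a whole number, but otherwise this is 100% the symptom of "you didn't actually run in MPI". This is what happens if you run non-MPI-compiled Rosetta (with or without mpiexec): every process runs the whole job on its own, so they all write the same -nstruct output filenames (overwriting each other's PDBs) while each one appends its own scores to score.sc. I assume you used -nstruct 12.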
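One way to confirm this symptom is to compare the number of score lines with the number of distinct decoy names in score.sc. Below is a minimal bash sketch; `score_summary` is a hypothetical helper, not part of Rosetta, and it assumes the usual score-file layout where the column header line starts with `SCORE:` and contains `total_score`, and every data line starts with `SCORE:` and ends with the decoy name.

```bash
# Hypothetical helper (not part of Rosetta): compare how many score lines a
# Rosetta score file holds with how many distinct decoy names it contains.
# Assumes the column header line starts with "SCORE:" and contains
# "total_score", and each data line starts with "SCORE:" and ends with the
# decoy name (the description column).
score_summary() {
  local scorefile=$1 lines unique
  # Data lines: SCORE: lines that are not the column header.
  lines=$(grep '^SCORE:' "$scorefile" | grep -vc 'total_score')
  # Distinct decoy names: last field of each data line, de-duplicated.
  unique=$(awk '/^SCORE:/ && !/total_score/ {print $NF}' "$scorefile" | sort -u | wc -l)
  # Strip any padding spaces that some wc implementations add.
  echo "score_lines=${lines} unique_decoys=${unique// /}"
}
```

Run it as `score_summary score.sc` in the output directory. If the distinct-name count matches -nstruct while the line count is several times larger (roughly the number of launched processes times -nstruct, if every copy finished), the processes were most likely each running the full serial job.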

Does your Rosetta binary have `mpi` in its name? It should be `rosetta-app-name.mpi.(system)(compiler)(mode)`, e.g. `fixbb.mpi.linuxgccrelease`.
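A trivial bash sketch of that name check follows; `is_mpi_named` is a hypothetical helper. It only inspects the file name, so a match says how the binary was labeled, not whether MPI was actually linked in correctly.

```bash
# Hypothetical helper: report whether an executable's file name follows the
# <app>.mpi.<system><compiler><mode> pattern (e.g. fixbb.mpi.linuxgccrelease).
# Only the name is checked; nothing about the binary's contents is inspected.
is_mpi_named() {
  local name
  name=$(basename "$1")
  case "$name" in
    *.mpi.*) echo "mpi-named: $name" ;;
    *)       echo "not mpi-named: $name" ;;
  esac
}
```

For example, `is_mpi_named fixbb.linuxgccrelease` reports `not mpi-named: fixbb.linuxgccrelease`.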

Wed, 2019-10-30 11:47

smlewis

(Reply to #3)

Is it possible that, even though the binary has 'mpi' in its name, it wasn't compiled correctly? Is there a unit test or something similar for MPI-compiled Rosetta?

Fri, 2019-11-01 10:16

dantimatter

(Reply to #4)

No particularly useful tests that I know of. Rocco's comment below about the tracer tags showing the MPI rank might be diagnostic. The log files themselves should say something too; I haven't done a run in a while, but the job distributor choice is probably announced, and you'll see it in a log line near the top.
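If the log does name the job distributor, a bash sketch like this could pull the name out. It assumes (unverified) that the job distributor's class name, ending in `JobDistributor`, shows up somewhere in the tracer output; `job_distributor_names` is a hypothetical helper, and which names to expect for MPI versus non-MPI runs should be checked against your own logs rather than taken from here.

```bash
# Hypothetical helper: list the distinct identifiers ending in
# "JobDistributor" (plus any trailing word characters) found in a log file.
# Assumes the job distributor's class or tracer channel name is printed.
job_distributor_names() {
  local logfile=$1
  grep -oE '[A-Za-z0-9_.]*JobDistributor[A-Za-z0-9_]*' "$logfile" | sort -u
}
```

Something like `job_distributor_names run.log` (where `run.log` is wherever you saved stdout) prints one name per line, or nothing if no such name appears.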

Mon, 2019-11-04 15:41

smlewis

(Reply to #5)

Yup, the binary does have mpi in the name:

mpiexec $HOME/rosetta_src_2019.22.60749_bundle/main/source/bin/fixbb.mpi.linuxgccrelease -s filename.pdb -ex1 -ex2 -resfile resfile.txt -nstruct 15 -overwrite -linmem_ig 10

The numbers probably don't work out exactly because I hit the walltime and the machine killed the job before it finished.

Fri, 2019-11-01 10:17

dantimatter

(comment removed and resubmitted as direct reply to previous poster)

Fri, 2019-11-01 12:27

dantimatter

I'm wondering if it might be an MPI version mismatch. That is, if you compile against the OpenMPI libraries, say, but your mpiexec is from MPICH2, then the MPICH2 launcher won't necessarily set things up properly for OpenMPI, and each process may end up thinking it's running serially, despite being under an MPI launcher.

Double-check your compilation settings and where your mpiexec comes from (e.g. `which mpiexec`). On clusters you sometimes get a mixed environment where mpiexec points to MPICH2 (for example) but mpirun points to OpenMPI (or vice versa).
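These checks can be bundled into a small bash sketch that prints where `mpiexec` and `mpirun` resolve from, guesses which MPI family each belongs to from its `--version` banner, and lists the MPI libraries the Rosetta binary is dynamically linked against (via `ldd`), so a mismatch is easier to spot. The function names are hypothetical, and the banner keywords (`OpenRTE`/`Open MPI` for Open MPI, `HYDRA`/`MPICH` for MPICH-family launchers) are assumptions about typical output, not a specification.

```bash
# Hypothetical helper: classify a launcher's --version banner.
# Keyword choices are assumptions about typical Open MPI / MPICH banners.
mpi_flavor() {
  local banner=$1
  if printf '%s\n' "$banner" | grep -qiE 'open[ -]?mpi|openrte'; then
    echo openmpi
  elif printf '%s\n' "$banner" | grep -qiE 'hydra|mpich'; then
    echo mpich
  else
    echo unknown
  fi
}

# Hypothetical helper: show which launchers are on PATH, their guessed MPI
# family, and the MPI libraries a (dynamically linked) binary depends on.
report_mpi_env() {
  local binary=$1 launcher path flavor
  for launcher in mpiexec mpirun; do
    if path=$(command -v "$launcher"); then
      flavor=$(mpi_flavor "$("$launcher" --version 2>&1)")
      echo "$launcher -> $path [$flavor]"
    else
      echo "$launcher -> not found"
    fi
  done
  echo "MPI libraries linked by $(basename "$binary"):"
  ldd "$binary" | grep -i mpi || echo "  (none found by ldd)"
}
```

For example, `report_mpi_env $HOME/rosetta_src_2019.22.60749_bundle/main/source/bin/fixbb.mpi.linuxgccrelease`. Since a login node and the compute nodes can see different MPI installations, it is worth running it from inside a job as well as interactively.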

The other thing to take a look at is the tracer output. If MPI is properly set up, there should be an annotation of the MPI process rank in parentheses on each line. If that's missing, or if it's all '(0)' (with no other numbers, despite launching multiple processes under MPI), then the MPI environment may not be set up correctly for Rosetta to realize it's running under MPI, and it may be running serially. There may be other information in the tracer about how things are running under MPI as well.
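To check the rank annotations quickly, here is a bash sketch; `tracer_ranks` is a hypothetical helper, and it assumes tracer lines look like `channel.name: (RANK) message`. That layout is an assumption, so adjust the pattern to match your own logs.

```bash
# Hypothetical helper: list the distinct MPI ranks seen in tracer output.
# Assumes lines of the form "channel.name: (RANK) message".
tracer_ranks() {
  local logfile=$1 ranks
  # Extract the number in "(N)" after the channel name, unique and sorted.
  ranks=$(sed -nE 's/^[^ :]+: \(([0-9]+)\).*/\1/p' "$logfile" | sort -un | tr '\n' ' ')
  if [ -z "$ranks" ]; then
    echo "no rank annotations found"
  else
    echo "ranks seen: ${ranks% }"
  fi
}
```

If a run launched with several processes reports only rank 0, or no annotations at all, that matches the misconfigured case described above.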

Mon, 2019-11-04 11:56

rmoretti

(Reply to #8)

Yes!!! That seems to have been the problem! The version of Open MPI on the head node was different from the one on the compute node. All fixed now!
Thank you all for your help!!

Wed, 2019-11-06 09:51

dantimatter
