I'm running Rosetta single state design (though I imagine my question applies to MSD as well), which works great on a single core or my local machine. I'm using rosetta_scripts.mpi.linuxgccrelease as my package. It is compiled by my HPC administrator to function with the cluster.
I'm running this script on a single node, with 28 processors per node. I am generating 100 models for the time being, but when I view the design.fasc file after the design is completed, there are 28*100 = 2800 entries. However, there are only 100 PDBs produced as a result of the design. I believe that 2800 models were indeed designed, but the models are being overwritten by the last core to complete that particular numbered model (since they all have the same filename, i.e. pdbID_design_0001.pdb).
Why are my files getting overwritten by cores, and how do I make each core output a unique name for the design.fasc row entry and the .pdb output file?
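To put numbers on the mismatch between score rows and PDB files, a small shell helper can count the rows and the distinct model names in the score file. This is a minimal sketch, not part of Rosetta. `scorefile_summary` is a hypothetical name, and it assumes the usual score file layout: data rows begin with `SCORE:`, the header row ends with the word `description`, and the model name is the last column.

```shell
# Summarise a Rosetta-style score file: total SCORE rows vs. distinct model names.
# Assumption: data rows start with "SCORE:", the header row ends with
# "description", and the model name is the last whitespace-separated column.
scorefile_summary() {
    awk '$1 == "SCORE:" && $NF != "description" {
             rows++
             if (!($NF in seen)) { seen[$NF] = 1; distinct++ }
         }
         END { printf "rows=%d distinct=%d\n", rows, distinct }' "$1"
}
```

Run as `scorefile_summary design.fasc`. If the overwriting theory is right, the row count should be a multiple of the distinct count (2800 rows against 100 names in the case described).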
Sounds like you may be missing `mpirun`? That would mean you are running the MPI executable, but all 100 of them think they are the only one and there is no communication between them. I think MPI Rosetta appends the MPI rank to the tracer names in the log files; are they all 0? If you run a small test job with `-mpi_tracer_to_file proc`, how many files do you get?
Thanks for the suggestion. I ran it with `mpiexec`, which I thought would function equivalently to `mpirun`. I re-ran it with `mpirun` just to be sure, and I still get the same issue.
My command is the following:
PATH/mpich/3.1.4/bin/mpirun PATH/rosetta/rosetta_src_2018.09.60072_bundle/main/source/bin/rosetta_scripts.mpi.linuxgccrelease @design.options -parser:protocol design.xml -out:suffix _design -scorefile design.fasc
It works perfectly without `PATH/mpich/3.1.4/bin/mpirun` but uses just a single processor, of course.
The cores are still not talking to each other. I tried adding `-mpi_tracer_to_file proc` at the end, and also `-mpi_tracer_to_file log` and `-mpi_tracer_to_file log.out` (after creating a log.out file), but I don't really know what I'm doing. Each attempt threw an error. A `log.out_0` file was created after specifying `-mpi_tracer_to_file proc`, but the program quit due to errors. The error was "The OpenFabrics (openib) BTL failed to initialize while trying to allocate some locked memory..."
Do I need to add more than just `-mpi_tracer_to_file proc` at the end of my command, or am I doing something wrong?
-mpi_tracer_to_file is diagnostic, not a repair. The fact that only file 0 ever gets created is a solid confirmation you are running independent processes, not linked together by MPI.
Your mpirun command there makes no suggestion of how many processors should be used. I haven't run bare MPI in several years but I think the argument was -np #, where # is the number of processors? When you say you had 100 processes running - where was the 100?
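One way to check the launcher on its own, without Rosetta in the picture, is to have it start a few trivial shell processes that each print their rank, then tally the result. The sketch below is only an illustration. `rank_summary` is a hypothetical helper, and it relies on the assumption that the launcher exports a rank variable to each process (`PMI_RANK` under MPICH's Hydra, `OMPI_COMM_WORLD_RANK` under Open MPI).

```shell
# Reads one rank value per line on stdin (one line per launched process) and
# reports how many processes started and how many distinct ranks they saw.
rank_summary() {
    sort | uniq -c | awk '{ procs += $1; ranks++ }
                          END { printf "processes=%d distinct_ranks=%d\n", procs, ranks }'
}
```

A possible invocation is `mpirun -np 4 sh -c 'echo "${PMI_RANK:-${OMPI_COMM_WORLD_RANK:-unset}}"' | rank_summary`. Four processes with four distinct ranks would suggest the launcher itself is fine. Note that this says nothing about whether the Rosetta binary was built against the same MPI as the launcher.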
Thank you for the continued guidance! I now tried to include -np 28 in the command after mpirun.
PATH/mpich/3.1.4/bin/mpirun -np 28 PATH/rosetta_src_2018.09.60072_bundle/main/source/bin/rosetta_scripts.mpi.linuxgccrelease @design.options -parser:protocol design.xml -out:suffix _design -scorefile design.fasc
Still the same issue.
The 100 is protocol-specific: it is the number of models I ask the single-state design protocol to build, set on the nstruct line in design.options. I have reduced it to 10 for the time being.
I let this sit over the weekend to see if I'd have a brilliant idea, but I didn't. I think your "MPI" code wasn't actually compiled with MPI, or something similar. I don't currently have access to a machine on which to run a few MPI test jobs to try to replicate the behavior.
1) You could try recompiling it yourself? I think you said someone else compiled it?
2) Post a sample log snippet with and without `-mpi_tracer_to_file`. I just want to see where the processor ID number is on the tracer line and confirm it's always 0.
3) For playing with mpirun / mpiexec: try running the non-MPI executable within those, and try running the MPI executable without them. Try running with bad flags to Rosetta or to mpirun. The goal is to force some more errors. My usual way out of this kind of hole is to run something that should fail and then track down why it fails to fail.
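A further check in the same spirit is to look at which MPI library the executable is dynamically linked against and compare it with the launcher being used. "BTL" is Open MPI terminology, so the openib error quoted earlier hints that the binary may be linked against Open MPI while the launcher in the command comes from an MPICH 3.1.4 path. That is an inference, not something confirmed in this thread. The sketch below assumes a dynamically linked binary and that `ldd` is available. `mpi_libs` is a hypothetical helper name.

```shell
# Filter `ldd` output (read from stdin) down to MPI-related shared libraries.
# The library name alone can be ambiguous; the resolved path on each line
# usually shows which MPI installation it comes from.
mpi_libs() {
    grep -i -E 'libmpi|libopen-rte|libopen-pal' || echo "no MPI library found"
}
```

Usage would be something like `ldd /path/to/rosetta_scripts.mpi.linuxgccrelease | mpi_libs`, with the real path substituted. If the library paths point to a different MPI installation than the `mpirun` in the command, launching with that installation's own `mpirun` would be the next thing to try. `mpirun --version` typically shows which implementation a launcher belongs to.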