
Hi all,

I'm running a protein-protein docking (both partners are ~240 res. monomers) and I get the following error:

---------------------------------------------------------------
[ ERROR ]: Error(s) were encountered when running jobs.
1 jobs failed;
Check the output further up for additional error messages.
---------------------------------------------------------------

I tried to run a test with the tutorial inputs, either by using the "docking_protocol.mpi.linuxgccrelease" application for the tutorial in:

.../main/demos/tutorials/Protein-Protein-Docking/

or with the RosettaScripts interface, by using the inputs from the tutorial in :

If I set -nstruct 1 everything works fine, but if I try to generate more structures (e.g. -nstruct 100) I get the error.

It is worth noting that if I set nstruct to 5 or 10, the run generates all the requested structures and the score file seems to be fine.

If I increase nstruct to 100, the run always stops after generating ~80 structures.

Moreover, the error sometimes reports more than 1 failed job.

I think that this isn't an MPI issue, as I get the same error with the static release.

I have no idea where to start with this issue, as the attached .log file seems to be OK apart from the aforementioned error.

All help and suggestions are highly appreciated!
Thanks

Samuele

EDIT:

Fri, 2022-03-18 09:36
sam_dc

The log doesn't seem to be attached.

It's not entirely unexpected to encounter issues where a low -nstruct run looks to be fine, but a high -nstruct run fails. Most frequently, it's because there's some low probability of encountering an issue during the run (possibly due to the stochastic way Rosetta is sampling). If you just run things once, you're unlikely to hit the error, but if you run it hundreds of times, you're almost certain to encounter the issue before you finish all your models.
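To put rough numbers on that, here is a small sketch of the arithmetic. It assumes each model fails independently with the same probability, and the 1% failure rate is a made-up value for illustration, not something measured from Rosetta.

```python
def prob_at_least_one_failure(p_fail, n_models):
    """Chance that at least one of n_models independent attempts fails."""
    return 1.0 - (1.0 - p_fail) ** n_models


# Hypothetical per-model failure probability (an assumption, not a measurement).
P_FAIL = 0.01

for n in (1, 10, 100, 500):
    chance = prob_at_least_one_failure(P_FAIL, n)
    print(f"-nstruct {n:>3}: chance of at least one failure = {chance:.3f}")
```

With that assumed 1% rate, a single model almost always succeeds, while a 100-model run hits at least one failure roughly 63% of the time.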

The fact that the run ended isn't too helpful in debugging why it failed. As the error message you quoted mentions, you often need to look at the earlier error messages to see what sort of issue was encountered which brought the whole run down.

Note that, if worst comes to worst, you can work around this. Almost all Rosetta protocols are restartable. That is, each output structure is more or less independent of the others, so you can simply rerun the same command line which failed, and it will attempt to pick up where it left off. As long as you're using Rosetta's default "pick a new random seed each time" behavior, it probably won't encounter the same stochastic crash immediately on a restart, though if it runs long enough, it might. But that's a workaround. Often it is better to identify what issue Rosetta is running into, and make sure it doesn't stem from some mistake you've made in the input files.
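If you do go the restart route, a minimal wrapper along these lines could rerun the command for you. This is only a sketch: `run_until_success` is a hypothetical helper, not part of Rosetta, and it relies on the restart behavior described above (the rerun skips outputs that already exist). The command in the usage comment, including the `@flags` options file, is a placeholder for whatever command line failed for you.

```python
import subprocess


def run_until_success(command, max_attempts=5, runner=subprocess.run):
    """Rerun the same command until it exits with code 0.

    Returns the number of attempts used. `runner` is injectable so the
    loop can be exercised without launching a real process.
    """
    for attempt in range(1, max_attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt} exited with code {result.returncode}; restarting")
    raise RuntimeError(f"command still failing after {max_attempts} attempts")


# Hypothetical usage (placeholder command line, not taken from the thread's log):
# run_until_success(["docking_protocol.mpi.linuxgccrelease", "@flags", "-nstruct", "100"])
```

Capping the number of attempts matters here: if the failure is deterministic rather than stochastic, an uncapped loop would never finish.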

Fri, 2022-03-18 09:47
rmoretti

I apologize for the missing log. I've edited the topic with a link to download it in a zipped format. I hope that this will help.

I looked for other errors in the log, but I can't find any.

The input files from which I obtained this log are located in "/main/demos/tutorials/Protein-Protein-Docking/".

"...if you run it hundreds of times, you're almost certain to encounter the issue before you finish all your models."
I had the same intuition, but I tried dozens of times, using different hardware and different Rosetta versions (3.12 and 3.13), and I still get the same error.

I find it odd that none of the runs seems to end without errors. Sometimes, if I set nstruct to 100, all the requested structures are generated but I still find the error in the log file.

The maximum number of structures I obtained from a single run is around 100, with nstruct set to 1000. Consequently, the proposed workaround seems unfeasible when it comes to generating 20000 structures.

Many thanks for the help.

Fri, 2022-03-18 15:15
sam_dc

It looks like the way Rosetta is handling job distribution is conflicting with the way your clustering/MPI system is handling job distribution. In particular, there's a block in the log:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------


It's kind of hard to tell, as the tracer output is cut off, but it looks like the process returning with a non-zero exit code may be one of the jobs which completed successfully. This message isn't coming from Rosetta, so I'd look at the documentation for your MPI system and/or your cluster distribution system to figure out which setting is controlling that "per user-direction", and change it (at least temporarily) such that you don't abort the entire job system immediately when a subprocess exits.

This would be consistent with only seeing ~80 structures when you're running with 20 processes. Each process works on various outputs independently, and then one process finishes its structure, and there are no further structures to process. It exits, causing the rest of the processes (which are still finishing their last structure) to be killed, leaving ~19 structures still "in progress".
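A toy model makes the ~80 figure concrete. It is a sketch under simplifying assumptions, not a description of Rosetta's actual job distributor: every process is treated as a worker pulling from one shared queue, job durations are random, and the whole run is killed the moment any worker finds the queue empty. The function name is made up for illustration.

```python
import heapq
import random


def structures_before_abort(nstruct, workers, seed=0):
    """Count structures finished before the first idle worker exits."""
    if nstruct < workers:
        raise ValueError("this toy model assumes nstruct >= workers")
    rng = random.Random(seed)
    remaining = nstruct
    running = []  # min-heap of finish times, one entry per busy worker
    for _ in range(workers):  # every worker grabs a first job
        heapq.heappush(running, rng.uniform(1.0, 2.0))
        remaining -= 1
    completed = 0
    while running:
        now = heapq.heappop(running)  # next worker to finish its structure
        completed += 1
        if remaining == 0:
            break  # queue is empty: this worker exits and the rest are killed
        heapq.heappush(running, now + rng.uniform(1.0, 2.0))
        remaining -= 1
    return completed


print(structures_before_abort(nstruct=100, workers=20))
```

In this model the result is always `nstruct - (workers - 1)`, so 100 requested structures on 20 processes give 81 finished ones, in line with the ~80 seen in the failing runs.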

(Note that when you're doing MPI, the -nstruct is the _total_ number of structures you want, not the number of structures per process. With MPI, -nstruct 100 will only give you 100 structures total with a successful run, even if you have 20 processes. They'll be completed about 20x faster than if you just used one process, but you'll get the same number.)

The other thing I'd recommend is to use the -mpi_tracer_to_file <filename> option to redirect the tracer output to separate files for each MPI process. This will make interpreting things easier, as it will remove the interleaving of the multiple processes in the tracer logs.
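Once the output is split per process, a short script can pull out the error lines from each file. This is a sketch with assumptions: the `tracer.log_*` pattern in the usage comment is a hypothetical naming scheme (check what files your run actually produces), and searching for the string "ERROR" is only a starting point based on the `[ ERROR ]` banner quoted earlier in the thread.

```python
from pathlib import Path


def find_error_lines(paths, marker="ERROR"):
    """Return (file, line number, text) for every line containing marker."""
    hits = []
    for path in paths:
        # errors="replace" keeps the scan going if a log has odd bytes in it
        with open(path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if marker in line:
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits


# Hypothetical usage, assuming one tracer file per MPI process:
# for name, number, text in find_error_lines(sorted(Path(".").glob("tracer.log_*"))):
#     print(f"{name}:{number}: {text}")
```

The first hit in each file is usually the most informative one, since later messages are often consequences of it.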

There may still be an error, but hopefully with the "keep running" settings, along with the file-capture of the output it will be easier to see where that's occurring.

Tue, 2022-03-22 09:10
rmoretti