smoothly kill a mpirun process


Hi there,
Sorry if this question is off topic. I'm running a loop modelling job under mpirun, for which I requested 500k decoys. Well, I've changed my mind and decided to stop it before it finishes. I'm afraid, however, that an abrupt interruption will damage my big silent file. So, is there any way to smoothly kill such a process?
Thanks in advance.

Tue, 2014-02-18 06:06
fred

Silent files are rather granular. Structures are written to them one-by-one, and there isn't really any sort of coordination between the different structures in the silent file. Therefore there should be no issue if you end a run prematurely. The structures which have been written to the file will be as valid as if you completed the run.

The only risk of corruption arises if you "pull the plug" in the middle of writing a structure to the file. In that case you'd get a partial structure at the very end of the file. If you tried to read this file back in, Rosetta would complain. That's easily corrected, though, by adding the flag "-silent_read_through_errors" to the command line of the program reading the file back in. That flag causes Rosetta to discard the structures it can't read (i.e. the partial one at the end of the file) and read all the ones it can (i.e. all the rest of the structures in the file).
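
If you want a quick hint as to whether the last write was cut off, something like the following Python sketch might help. It only checks whether the file ends partway through a line; a structure that was interrupted between two complete lines would not be caught, so this is a rough heuristic and not a replacement for the flag above. The function name is made up for illustration and is not part of Rosetta.

```python
import os


def ends_mid_line(path):
    """Return True if the file is non-empty and its last byte is not a newline.

    Hypothetical helper: a True result suggests the final line was cut off
    mid-write. A False result does NOT prove the last structure is complete.
    """
    if os.path.getsize(path) == 0:
        return False
    with open(path, "rb") as handle:
        # Jump to the final byte and inspect it.
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) != b"\n"
```

If this returns True for your silent file, expect the reading program to complain about the tail unless you pass "-silent_read_through_errors".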

That's the view from the Rosetta perspective: Rosetta IO was built to be somewhat robust to programs dying or being killed. Whether there's anything special you need to do so that you don't mess up your queueing system is a question for the person who runs your cluster.

Tue, 2014-02-18 08:43
rmoretti

From my personal experience, dozens of my Rosetta jobs have been killed, whether by me via top/kill, by crashing runs, by reboots, etc., and not once was a silent file corrupted. Just to be safe, make a copy of your silent file and then kill the run, but I assume you should be fine. Good luck!
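
The copy-then-kill step could be sketched in Python roughly as below. This is only an illustration under a few assumptions: the function name and the ".bak" suffix are made up, the PID is that of the mpirun launcher on the same machine, and SIGTERM is used as the polite default signal. How the launcher passes the signal on to its ranks depends on your MPI implementation, and a copy taken while the run is still writing may itself end with a partial structure.

```python
import os
import shutil
import signal


def backup_then_terminate(silent_path, pid, send_signal=os.kill):
    """Copy the silent file, then send SIGTERM to the given process.

    Hypothetical helper. `send_signal` defaults to os.kill and can be
    swapped out for a stand-in when trying the function without a real job.
    Returns the path of the backup copy.
    """
    backup_path = silent_path + ".bak"
    # Copy first, so a usable file exists whatever happens to the run.
    shutil.copy2(silent_path, backup_path)
    # SIGTERM asks the process to exit; it is not the forced SIGKILL.
    send_signal(pid, signal.SIGTERM)
    return backup_path
```

You would call it with the path to your silent file and the PID of mpirun as reported by top or ps.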

Wed, 2014-02-19 06:53
Ashafix