
JobDistributor Bug


I've been using the PyRosetta JobDistributor, which should allow you to launch many instances of the same script in parallel, with each instance working on a different decoy. Thus, for your N decoys, each instance will handle approximately N/m of them, where m is the number of instances of the script you run. I have been running on our cluster and noticed that after launching, say, a dozen instances of the script, some instances would start dying within a few minutes. After a few more minutes, I'd be down to only a few instances still running.

I tracked down the problem to the way the JobDistributor manages which instance works on which decoy. Basically, at the start of each new decoy, the instance scans through, starting at decoy 1, and checks whether that decoy file is already present, or whether there is a file named _.pdb.in_progress. If there is, it moves on to the next decoy number. The first number it reaches that has neither a .pdb nor a .pdb.in_progress file is the decoy number it will work on. At that point, it opens a .pdb.in_progress file to let the other instances know that this decoy number is taken. Basically, this in_progress file is a semaphore: only one instance should be working on a given decoy. Once an instance finishes all of its processing, it writes out the decoy .pdb file and then removes the .pdb.in_progress file.
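To make that scan concrete, here is a minimal sketch of the claim logic as I've described it above. The function and file names (claim_next_decoy, the basename_%04d.pdb pattern) are illustrative, not the actual JobDistributor code:

    import os

    def claim_next_decoy(basename, n_decoys):
        # Scan decoy numbers from 1 upward, skipping any that look finished
        # or already claimed by another instance.
        for i in range(1, n_decoys + 1):
            pdb_name = "%s_%04d.pdb" % (basename, i)
            in_progress = pdb_name + ".in_progress"
            if os.path.exists(pdb_name) or os.path.exists(in_progress):
                continue
            # "Claim" the decoy by creating the in_progress file.
            # This is where the race lives: two instances can both reach this
            # point for the same i before either has created the file.
            open(in_progress, "w").close()
            return i, pdb_name
        return None, None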

The problem is that there is an inherent race condition. If two instances both start looking for the next decoy at about the same time, they may both reach a number for which there is neither a .pdb file nor a .pdb.in_progress file. Both instances will then decide to work on that decoy number, and both will open a .pdb.in_progress file for writing. This is perfectly allowed: two processes can have the same file open at once, and whichever writes last determines what ends up in the file, but the second open raises no error. That's the main problem: the file does not act like a true semaphore.
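You can see this behavior with a toy example (the filename below is hypothetical): opening an existing file for write simply succeeds, so a plain open() can never signal "this decoy is taken":

    path = "test_0007.pdb.in_progress"   # hypothetical name
    f1 = open(path, "w")   # instance A creates its "claim"
    f2 = open(path, "w")   # instance B opens the same file; no exception is raised
    f1.close()
    f2.close()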

So they both do their calculations, both write out the decoy pdb file (whichever instance writes last determines what ends up in the file), and then they both attempt to remove the in_progress file. That's where the crash happens. The first instance successfully removes the file. The second instance then tries to remove the in_progress file, but it's no longer there; this raises an exception and the instance crashes.

One way to fix this is to just wrap the removal step in a try/except block: if removing the file fails, the instance simply moves on. This will prevent the instances from dying, but it isn't optimal, because multiple instances will occasionally be working on the same decoy number, and all but one of them will be doing work that is never seen.
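Something along these lines would be a minimal sketch of that workaround (the pdb_name variable is illustrative):

    import os

    try:
        os.remove(pdb_name + ".in_progress")
    except OSError:
        # Another instance already removed the file; nothing more to do.
        pass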

A better solution that I came up with (although still kind of hacky) is to use an in_progress directory instead of an in_progress file. When an instance wants to claim a decoy number, it creates a _.pdb.in_progress directory, not a file. The nice thing here is that if a second instance tries to create a directory with the same name, it will raise an exception. In this way, only one instance can ever make that directory, preserving the semaphore behavior. When an instance hits that exception, it just moves on to the next decoy number and tries again. After a decoy is produced, the in_progress directory is removed.
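Here is a minimal sketch of that directory-based claim, with illustrative function names rather than the code in the attached file:

    import os

    def try_claim(pdb_name):
        # os.mkdir() is atomic: it either creates the directory or raises,
        # so at most one instance can ever succeed for a given decoy number.
        try:
            os.mkdir(pdb_name + ".in_progress")
            return True      # this instance owns the decoy
        except OSError:
            return False     # someone else claimed it first; try the next number

    def release_claim(pdb_name):
        # Called after the decoy .pdb has been written out.
        os.rmdir(pdb_name + ".in_progress")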

The changes to the __init__.py file to do this were quite straightforward, and I have attached my modified version. It seems to work fine, and I no longer see any instances die before all of the decoys have been produced.

I'm not sure if anyone else has run into this issue (I was quite surprised that the race condition was ever hit, but at least for me it happened fairly regularly), but if they have, hopefully this is helpful. I have renamed the extension to .txt, as the forum will not allow me to attach a .py file.

-Brett

Attachment: __init__.txt (10.79 KB)
Fri, 2011-03-11 07:00
brettth

For what it's worth, the groups that use the race-condition-sensitive in_progress semaphore are the groups whose cluster structure does not lend itself to the race condition being a problem. Where the race condition is a problem, a wide variety of other solutions exist in the C++ job distributors; I don't know how to use them in PyRosetta. (The standard solution is MPI communication of who does which job).

Using a directory is an interesting solution - good find!

Fri, 2011-03-11 08:12
smlewis