[SOLVED] swamped memory

Hi Rosetta users,
I have run into memory problems when running a minirosetta threading application. I requested 10000 structures per run and launched several processes on a cluster.
Rosetta slowly consumes memory until it starts using swap. At that point the system CPU load climbs higher and higher, so that almost no more structures are built.
The available memory is 2 GB per CPU. Is this behavior normal for Rosetta, or am I missing some expert Rosetta keyword?

Fri, 2011-09-16 08:19
fred

It's trying to store all 10000 structures in memory at once. This is 'normal' behavior as far as I can tell. I've tried to get someone to make the documentation explain how to do clustering on large sets, but nobody has. I guess your best bet is to just cluster the best 1000 or so.

Fri, 2011-09-16 08:23
smlewis

Hi smlewis,
I might not have been clear. It happened in the structure prediction/modeling phase, not the clustering one. I have read some threads in this forum reporting problems when clustering 100,000 decoys or more. I'll try asking for a smaller number of structures and increasing the number of processes.
Thanks.

Fri, 2011-09-16 08:43
fred

No, I just wasn't paying enough attention! I've seen the clustering question repeatedly, so now I see it when it's not there.

How big is your structure? How many fragments are you using? How many runs does it report before it hits swap? 2 GB/thread ought to be sufficient for this. There are a few tricks for low-memory systems we can try to employ; in particular, in the "rosetta_database/chemical/residue_type_sets/fa_standard" folder, move the "slim" files over the regular files (save the normally named ones under a different name). This won't help if you're in centroid mode in your structure prediction, which I think you are...

Fri, 2011-09-16 09:01
smlewis

The sequence has ~300 aa. I am using the entire frags3&9 files. Rosetta started swapping at model number 1903. I am using fullatom mode. The slim files there are pretty small (4k).

Fri, 2011-09-16 11:01
fred

My slim files advice got garbled; I guess it's a bad day for me to be giving advice. The files "residue_types.txt.slim" and "patches.txt.slim" are meant to overwrite residue_types.txt and patches.txt. The slim versions of the files result in a Rosetta run that uses several hundred MB less memory for residue type sets; the cost is that many uncommon types are unavailable (and thus cause crashes). Anyway, if you're getting as far as run 1903, then it's not going to be fixed with the database fiddling (it would have failed much sooner if it could).

Unfortunately, I'm not aware of any known memory leaks in the code that would trigger under these circumstances. To confirm, this is a single-input circumstance, where you have ONE input sequence/structure, with many results requested? (As opposed to a huge list of things to do once each?) There is a known (command-line-fixable) issue in the latter case.

As workarounds, your best bets are to either start multiple jobs at nstruct 1500 in different directories, or start sequential jobs (depending on your job submission system) that have nstructs of 1500, 3000, 4500... all in one directory. This should zero out the memory between jobs and keep you from crashing.
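A minimal sketch of the sequential variant, in Python, for anyone who wants to script it. It only builds the command lines for jobs whose nstruct grows by a fixed step, all meant to run one after another in the same directory. The executable name `minirosetta` and the `@flags` options file are placeholders, and passing the structure count as `-nstruct` is an assumption to check against your own command line.

```python
import shlex


def cumulative_nstruct_commands(base_cmd, total, step):
    """Build one command per sequential job.

    Each job asks for a larger cumulative nstruct (step, 2*step, ...),
    and the last job asks for exactly `total`.
    """
    if total <= 0 or step <= 0:
        raise ValueError("total and step must be positive")
    targets = list(range(step, total, step)) + [total]
    return [base_cmd + ["-nstruct", str(n)] for n in targets]


# Placeholder base command (assumption): adjust the executable name and
# options file to match your installation.
base = ["minirosetta", "@flags"]
for cmd in cumulative_nstruct_commands(base, total=10000, step=1500):
    print(shlex.join(cmd))
```

Each printed line can then be submitted in order, so that every job is a fresh process and starts with empty memory.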

Fri, 2011-09-16 12:08
smlewis

When it crashed, it was a multiple-input circumstance: several PDBs against a multiple alignment with several sequences. Yesterday, however, I started several jobs running two PDBs against two sequences. Things are not going well (see http://ompldr.org/vYWRzbA). In a few days it is going to start swapping again.

Fri, 2011-09-16 14:17
fred

First off, I would recommend trying the "-jd2:delete_old_poses" flag. From the full options documentation: "Delete poses after they have been processed. For jobs that process a large number of structures, the memory consumed by old poses is wasteful. Default: false" Depending on your protocol it may not do anything, but it's worth a try.

The other obvious short-term fix is to not let it run for several days. I can't say for certain regarding whatever protocol you're using, but most Rosetta protocols are "embarrassingly parallel" at the output structure level. That is, one run of 10,000 output structures is theoretically equivalent to 10 runs of 1,000 structures, or 100 runs of 100 structures. Honestly, I doubt most Rosetta developers ever go out to 2000+ structures on a single run - usually we'll split such an experiment up and run them in parallel.

What's more, almost all protocols have output-structure level checkpointing. That is, if you interrupt a process and then restart it with the exact same command line, Rosetta will pick up where it left off (so if it just finished structure 1902 when you pulled the plug, it will continue running at structure 1903). I can't say for certain, but I'd assume that when you restart the program, you'll reset that memory usage (so usage at structure 1912 on the restart would be similar to that of structure 10 on the first run).

If possible, break your 10,000 output structure runs into multiple different runs. If running on a single processor, you can serialize them with something like a bash script. Alternatively you can have a memory watchdog that kills and restarts the single Rosetta run if it uses too much memory.
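A rough sketch of the watchdog idea in Python, assuming a Linux system where resident memory can be read from `/proc/<pid>/status`. The function names, the polling interval, and the restart cap are illustrative choices, not anything Rosetta provides. It relies on a restart with the identical command line picking up at the next unfinished structure, as described above.

```python
import subprocess


def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # no VmRSS line, e.g. the process has already exited


def rss_kb(pid):
    """Resident memory of `pid` in kB (Linux only), or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as fh:
            return parse_vmrss_kb(fh.read())
    except (FileNotFoundError, ProcessLookupError):
        return None


def run_with_watchdog(cmd, limit_kb, poll_seconds=60, max_restarts=20):
    """Run `cmd`; whenever its memory exceeds `limit_kb`, stop and rerun it.

    Returns the exit code of the first run that ends on its own.
    """
    for _ in range(max_restarts + 1):
        proc = subprocess.Popen(cmd)
        while True:
            try:
                # Wait up to one polling interval for the process to finish.
                return proc.wait(timeout=poll_seconds)
            except subprocess.TimeoutExpired:
                pass
            used = rss_kb(proc.pid)
            if used is not None and used > limit_kb:
                proc.terminate()  # over the limit: stop, then restart
                proc.wait()
                break
    raise RuntimeError("restart limit reached without a clean finish")
```

A call such as `run_with_watchdog(["minirosetta", "@flags"], limit_kb=1_500_000)` would keep one process under roughly 1.5 GB; the command shown is a placeholder. The structure being built at the moment of the kill is interrupted, so a generous limit and a long polling interval waste less work.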

That said, Rosetta really *shouldn't* slowly increase in memory usage like that. If the -jd2:delete_old_poses flag doesn't work, it might be worth thinking about putting together a test case to send to the developers of whatever protocol you're using to see if there's some memory leak we're missing.

Sat, 2011-09-17 11:13
rmoretti

delete_old_poses (which I wanted to call dont_leak_memory) is meant to work in cases where you have a large input list, like passing 30,000 different PDBs as input to the -l flag (usually with an nstruct of 1). In other words, a case where you are sifting through many inputs.

If ThreadingInputter is creating new Job objects as it goes, AND those have different starting Pose objects, THEN delete_old_poses is likely to help here. I don't think ThreadingInputter should have that behavior...

Let us know what the flag does!

Sun, 2011-09-18 09:56
smlewis

Hi there, thanks for helping.
I've added the recommended option and have submitted 100 runs asking for the same number of structures as before (3000). At first glance things remain the same, but I need a couple of days to compare the memory consumption with the previous runs. I'll let you know the results.
By the way, could you please point me to Rosetta's full options documentation?

Mon, 2011-09-19 13:11
fred

The option -help will list all options available, but has only a limited idea of what options work with what executables; this usually has a blurb on what the option does. The manual (link at the top of this page) has pages on most of the executables, which list some but not all of the options that work with each executable and what they do. No master list of all options/executables exists, much less one that's reasonably accurate and up to date.

Mon, 2011-09-19 13:14
smlewis

Yep, I have been using the -help option; however, you never know whether the options are interchangeable from program to program. Minirosetta, for instance, doesn't have the option "delete_old_poses", and to tell the truth I didn't even know about such an option. Anyway, let's see if it works.

Mon, 2011-09-19 13:33
fred

The page I was referring to is at http://www.rosettacommons.org/manuals/archive/rosetta3.3_user_guide/opti... - there theoretically should be a link ("Option list with description/range/default values") on the main page of the documentation, but unfortunately that doesn't look like it's correctly linked.

The "never know if the options are interchangeable" is certainly an issue - most of the options listed on the page are ignored by most protocols (which varies based on the protocol) with little to no indication which is which. That's one of the reasons it's listed under Advanced Topics/Developer Guide - it's not much help unless you already know about the option, or are looking for possible starting points which you would have to confirm by looking through the code (or somewhere in the documentation if you get lucky).

Mon, 2011-09-19 18:16
rmoretti

Just to follow up, "-jd2:delete_old_poses" didn't help much. Minirosetta continues swamping the memory (http://ompldr.org/vYWdseA). I have decided to increase the number of parallel runs (150), each with a smaller number of structures (1000). Thanks for helping.

Wed, 2011-09-21 09:57
fred