I am preparing PDBs downloaded from pdb.org for use as scaffolds. More than 6000 PDBs were downloaded and are being prepared. However, I encountered a memory leak. I have tried using -linmem_ig 10 and -jd2:delete_old_poses to resolve the problem, but the memory leak is still there.
Could you help me to solve this problem?
Thanks a lot.
The workstation we are using has 48 cores and 128GB of memory; the Rosetta version is 2019.14.60699. I have also tried another version of Rosetta (2018.48.60516), and the memory leak occurs there as well.
mpirun -np 45 rosetta_scripts.mpi.linuxgccrelease -s scaffold/*.pdb.atoms @prepare.option > ppk_min.log &
<Rmsd name="rmsd" threshold="1.5" superimpose="1"/>
<Prepack name="ppk" jump_number="0"/>
<MinMover name="sc_bb_min" bb="1" chi="1"/>
I'm glad to see you found delete_old_poses. I don't think linmem_ig is relevant here (in fact I think you should remove it - it makes performance worse when not designing).
No cause is obvious to me. You've only got about four components in the system (two movers, one filter, and JD2). I would try commenting them out one at a time to see which is the culprit, but that might be too expensive in compute time. You can also just break up your input list so that Rosetta restarts repeatedly - I know this is a pain in terms of cluster time, but it will certainly sidestep a memory leak.
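To make the restart approach concrete, here is a minimal sketch of the chunking. It uses a synthetic file list and echoes the Rosetta command rather than running it - the chunk size, file names, and mpirun line are placeholders; substitute `ls scaffold/*.pdb.atoms` and your real command:

```shell
# Work in a scratch directory with a synthetic 13-entry input list;
# for the real job, generate the list with: ls scaffold/*.pdb.atoms
workdir=$(mktemp -d) && cd "$workdir"
seq -f "model_%04g.pdb.atoms" 1 13 > all_inputs.list
# Split into chunks of 5 inputs each (use something like 200 for real runs)
split -l 5 all_inputs.list chunk_
# One fresh Rosetta process per chunk: any leaked memory is returned
# to the OS when each process exits. Replace echo with the real command.
for list in chunk_*; do
    echo "mpirun -np 45 rosetta_scripts.mpi.linuxgccrelease -l $list @prepare.option"
done
```

Because each chunk runs in its own process, a slow leak never accumulates past one chunk's worth of poses.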
How many poses are you getting through before you start to see large memory use?
You are using MPI, but I don't see any output-control flags. Maybe the "memory" is some companion process multiplexing all your stdouts - I've seen that on large systems (it is a sysadmin problem, not a Rosetta problem, but they are lazy and want us to fix it). Try -mpi_tracer_to_file $filestem or -mute all?
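For example, the output-control flags could go in your options file (the filestem `ppk_min` is an arbitrary choice):

```
# each MPI rank writes its tracer output to its own file (ppk_min_<rank>)
-mpi_tracer_to_file ppk_min
# or silence tracer output entirely
-mute all
```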
I'm not sure multiple_processes_writing_to_one_directory is valid in an MPI context (I think it is just ignored). This is implicitly true for MPI anyway.
Thank you for your valuable suggestions. I will try them. Unfortunately, our workstation core dumped while running the protocol... We are now recovering the system and will try again.
Thank you again for taking the time.
Thank you for your suggestions. I have recovered our workstation and recompiled Rosetta 2019.14. This time, I expanded the swap partition to 131GB (the workstation has 128GB of RAM) and reran the protocol posted above. However, the application still core dumps. When I run the script now, the process is killed immediately with the error: " + 24111 segmentation fault (core dumped) mpirun -np 45 rosetta_scripts.mpi.linuxgccrelease -s ../scaffold/3*.pdb.atoms"
Could anyone help us get rosetta_scripts.mpi running again?
As you suggested, we will rerun this protocol:
A segfault / core dump has nothing to do with the memory leak you observed earlier. The memory leak meant Rosetta was consuming all the available memory. (I would not encourage you to let Rosetta use swap space: it will be glacially slow, and you are better off just splitting the job into smaller parts if the memory leak cannot be resolved.)
Segfaults mean the code has tried to access memory outside the bounds of the program - more often than not it is a vector overrun. There are a handful of PDBs out there that can cause Rosetta to segfault. The first things to try when you get segfaults are: 1) run in debug mode (after compiling in debug mode), and 2) if at all possible, do not capture the output with "> log" (redirection cuts output into 32 KB blocks and loses text when the crash occurs); instead, let stdout dump to the terminal and copy/paste it from there. Try to reproduce your error WITHOUT MPI; if you can reproduce it, paste the last 40-50 lines of the log here.

If you are getting segfaults in MPI mode ONLY (and not without MPI), then either 1) the MPI environment is misconfigured, or Rosetta was compiled against a different MPI library than the one you are running with (something best resolved by compiling again, I guess), or 2) you are observing a genuine MPI-specific segfault, in which case the best course of action is to abandon MPI entirely, as it is too hard to debug.
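To make those steps concrete, here is a sketch of the debug workflow. `$ROSETTA` and the failing input name are placeholders for your checkout path and whichever PDB reproduces the crash; the commands are echoed so you can adapt them before running:

```shell
# Placeholders: adjust to your installation and failing input.
ROSETTA="$HOME/rosetta/main"
FAILING_INPUT="one_failing.pdb.atoms"
# Step 1: compile the debug binaries (note the .linuxgccdebug suffix).
echo "cd $ROSETTA/source && ./scons.py -j8 bin mode=debug"
# Step 2: reproduce WITHOUT MPI and without '> log', so the full
# message reaches the terminal.
echo "rosetta_scripts.default.linuxgccdebug -s $FAILING_INPUT @prepare.option"
```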
In the extreme case you can always set up the "dumbest possible" shell script job distributor (see the JD0 files in the tools directory of your rosetta distribution to get started there, or ask in a new thread).
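The simplest possible version of that is just a shell loop over the inputs with the non-MPI binary. This is a sketch, not the actual JD0 scripts from the tools directory; replace the echo with the real command to run it:

```shell
# Run each input through its own short-lived, single-process Rosetta run,
# so no state (or leaked memory) survives from one input to the next.
for pdb in scaffold/*.pdb.atoms; do
    echo "rosetta_scripts.default.linuxgccrelease -s $pdb @prepare.option"
done
```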
Thank you for your immediate reply. I will recompile Rosetta again. Currently, mpich-3.3 was compiled with gcc 4.8.5, but Rosetta 2019.14 was compiled with gcc 8.2.1.
Initially, rosetta_scripts.mpi could run the protocol. However, after it was killed by the memory leak, it could not be run again: it terminates almost immediately with the segfault / core dump reported above.
As you suggested, I tried without MPI: rosetta_scripts.default can still run this protocol.
We can also compile mpich-3.3 with gcc 8.2.1 and try to run the protocol while splitting the job into smaller parts.
Would you mind giving us some comments and suggestions?
Thank you for your kindness.
I have worked around the memory leak by splitting the job into parallel parts. I recompiled Rosetta, but the memory leak is unavoidable when using rosetta_scripts.mpi.
Thank you for your help.