I have been running abinitio on a server using MPI to utilize 13 cores. A few times over the last month we have had to restart abinitio. My question is: will abinitio pick up the fold count where it left off, or will it start the count from one again?
Abinitio didn't overwrite the silent out files; it just kept appending to the end of the files. But it seems to be taking longer than I expected, though maybe I'm just overestimating how long this should take. I also looked at one of the silent out files, and after about 6700 folds the number went back down to 14. Does this mean the folding started over after we had to restart? This is the command I used to run abinitio:
mpirun -np 14 AbinitioRelax.linuxgccrelease -database /opt/rosetta-3.2.1/rosetta_database/ -in:file:fasta t000_.fasta -in:file:frag3 aat000_03_05.200_v1_3 -in:file:frag9 aat000_09_05.200_v1_3 -abinitio:relax -relax:fast -abinitio::increase_cycles 10 -abinitio::rg_reweight 0.5 -abinitio::rsd_wt_helix 0.5 -abinitio::rsd_wt_loop 0.5 -use_filters true -psipred_ss2 t000_.psipred_ss2 -out:file:silent t000_silent.out -nstruct 20000
I just want a total of 20000 folds. Is that what this command will give, or will it do 20000 folds 13 times?
Thanks for any help with this problem.
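For what it's worth, one quick way to check how far a run has actually gotten is to count the SCORE: lines in the silent file, since each finished decoy writes exactly one. This is a sketch against a mock silent file (the tags and scores are made up; a real file has more header and coordinate lines, but the counting idea is the same):

```shell
# Mock silent file standing in for t000_silent.out (hypothetical contents):
cat > t000_silent.out <<'EOF'
SEQUENCE: AAAA
SCORE: score description
SCORE: -386.981 S_00000001
SCORE: -457.134 S_00000002
SCORE: -431.520 S_00000003
EOF

# Each finished decoy contributes one SCORE: data line; subtract the header:
n=$(( $(grep -c '^SCORE:' t000_silent.out) - 1 ))
echo "$n decoys so far"
```

Comparing that count before and after a restart tells you directly whether the job resumed or started over.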
Abinitio is, non-obviously, not compatible with MPI in 3.2.1. I suspect your problem is related to that. If you look at your result file you will probably find that it does structure 1 thirteen times, then structure 2 thirteen times, etc.
It should know where it left off and restart there, but I can't say for sure that it does.
If you'd like the abinitio MPI patch, let me know, and I'll email it to the address you registered your forum account to. You can also take a look at these threads on the issue:
Damn, well that sucks. Thanks for the reply and yes I would like to get that patch from you. And I will look to the other posts that you listed for more info.
smlewis, thanks for e-mailing me that patch. I have another question. It seems like there were a lot of duplicate structures in the silent out files. I ran extract_pdbs on the silent out files and it reported "SilentStruct with S_00000715 already exists!" for a bunch of different structures, with a total of 24200 structures in one of the files; there were 13 silent out files from the MPI run, all with about the same file size and thus probably roughly the same number of structures. My question is: are the structures with the same number each a unique fold, or are they the same fold (does this question make sense)? In other words, will all the duplicate structures just give the same data/results, or does the Monte Carlo procedure produce a different fold every time it runs? Let me know if you need me to explain what I'm asking any further. Thanks for your help.
Whether you have different structures with the same name, or the same structure with the same name, depends on whether the different runs started with different random number seeds. If you didn't manually assign seeds, then what seed it gets is dependent on the system you run it on; I think it's from /dev/urandom on linux systems.
I would try just extracting the first one from each set. If those are different, then probably they're all different. If they're all the same, you can throw a lot of it out as duplicates. If you get something like 5 copies of A, 3 copies of B, 4 copies of C, and one of D, then you have 4 results, and you can throw out the duplicate trajectory sets.
You shouldn't see behavior like one output appearing from processor 1 as structure 0001 but then reappearing from processor 10 as structure 0245. If they're going to be the same, the numbers should be the same. (This might not hold if you had multiple job restarts in the same output file, which may cause it to number "first results" as one plus the end of the last run.)
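As a quick sanity check before extracting anything, duplicated tags are easy to spot by counting the description column of the SCORE: lines. A sketch against a mock silent file (tag names and scores are hypothetical):

```shell
# Mock silent file with one duplicated tag:
cat > silent.out <<'EOF'
SCORE: score description
SCORE: -386.9 S_00000001
SCORE: -457.1 S_00000001
SCORE: -431.5 S_00000002
EOF

# Print each tag that occurs more than once, with its count
# (NR > 1 skips the SCORE: header line):
grep '^SCORE:' silent.out | awk 'NR > 1 {print $NF}' \
  | sort | uniq -c | awk '$1 > 1 {print $2, "x" $1}'
```

If the duplicated tags all have different scores, the structures are almost certainly different and only the names collide.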
Okay, thanks for the informative reply. It seems like there was some duplication, at least in file name. After I extracted them, I deleted all the structures that were suffixed with a number, indicating a file with the same name had already been created.
I am also facing the same problem when restarting. The run stopped at "S_00002017.pdb", and I expected it to continue at "S_00002017.pdb" or "S_00002018.pdb", but with the same script it started again from "S_00000001.pdb".
(Note: I am using a single node with 8 processors, Rosetta version "rosetta_bin_linux_2016.17.58663".)
I would appreciate your support and suggestions.
This is the script I used to restart:
mpirun -np 8 \
-database /PATH/rosetta_bin_linux_2016.17.58663_bundle/main/database/ \
-in:file:fasta 74re.fasta \
-in:file:frag9 aat000_09_05.200_v1_3 \
-in:file:frag3 aat000_03_05.200_v1_3 \
-psipred_ss2 t000_.psipred_ss2 \
-use_filters true \
-abinitio::increase_cycles 10 \
-abinitio::rg_reweight 0.5 \
-abinitio::rsd_wt_helix 0.5 \
-abinitio::rsd_wt_loop 0.5 \
-nstruct 30000 \
-out:file:silent silent.out
It might be because you're using both -out:pdb and -out:file:silent.
Nobody has ever fixed this particular version of abrelax to use the better job distributor, and I've forgotten anything I knew about the old one from 8 years ago.
I think the combine_silent tool can just renumber the structures for you (to unique numbers) after the fact.
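I don't have the exact combine_silent flags handy, but for illustration, the renumbering it does amounts to something like this rough awk stand-in on a mock silent file (not a substitute for the real tool, which also handles the header and coordinate lines properly):

```shell
# Mock silent file with a duplicated tag (hypothetical contents):
cat > default.out <<'EOF'
SCORE: score description
SCORE: -386.9 S_00000001
SCORE: -457.1 S_00000001
SCORE: -431.5 S_00000002
EOF

# Rewrite every SCORE: data line's tag to a fresh sequential S_ number:
awk '/^SCORE:/ && NR > 1 {$NF = sprintf("S_%08d", ++i)} {print}' \
  default.out > renumbered.out
grep '^SCORE:' renumbered.out | awk 'NR > 1 {print $NF}'
```

After renumbering, every decoy has a unique tag, so extract_pdbs won't silently collide names.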
Using Rosetta 3.6
Like the previous user, I am also getting duplicates in the default.out, except that I did not use either -out:pdb or -out:file:silent. It looks like I am getting as many duplicates as the -np parameter (i.e., mpirun -np 10 yields 10 models with the same name in the default.out for each nstruct).
Sending this job to our SLURM:
resulted in a default.out where every other model has the same name as the previous one, although all the models were distinct (different scores), e.g.:
SCORE: score fa_atr fa_rep fa_sol fa_intra_rep fa_elec pro_close hbond_sr_bb hbond_lr_bb hbond_bb_sc hbond_sc dslf_fa13 atom_pair_constraint coordinate_constraint angle_constraint dihedral_constraint rama omega fa_dun p_aa_pp yhh_planarity ref Filter_Stage2_aBefore Filter_Stage2_bQuarter Filter_Stage2_cHalf Filter_Stage2_dEnd co clashes_total clashes_bb time description
SCORE: -386.981 -1063.484 133.980 590.879 2.937 -83.298 1.503 -67.975 -16.145 -14.916 -15.160 0.000 -85.736 0.000 0.000 0.000 -21.011 17.257 280.673 -41.167 0.122 -5.440 0.000 0.000 0.000 0.000 24.393 0.000 0.000 1192.000 S_00000001
SCORE: -457.134 -1139.665 151.095 638.124 2.957 -81.250 0.895 -63.663 -24.682 -23.702 -14.814 0.000 -151.746 0.000 0.000 0.000 -18.974 26.806 286.585 -39.979 0.319 -5.440 0.000 0.000 0.000 0.000 31.262 0.000 0.000 2150.000 S_00000001
SCORE: -431.520 -1114.407 146.608 623.038 2.875 -75.896 1.017 -71.586 -17.001 -19.279 -13.235 0.000 -147.458 0.000 0.000 0.000 -15.957 30.079 282.052 -36.953 0.024 -5.440 0.000 0.000 0.000 0.000 39.731 0.000 0.000 2102.000 S_00000002
SCORE: -456.403 -1142.623 146.135 646.783 3.101 -92.779 1.241 -62.741 -24.531 -16.490 -18.518 0.000 -151.755 0.000 0.000 0.000 -15.470 23.233 305.379 -51.934 0.005 -5.440 0.000 0.000 0.000 0.000 41.281 0.000 0.000 2112.000 S_00000002
SCORE: -441.175 -1110.798 136.214 621.876 3.043 -77.155 1.372 -60.945 -16.107 -24.633 -16.463 0.000 -151.997 0.000 0.000 0.000 -14.308 24.913 290.717 -41.872 0.409 -5.440 0.000 0.000 0.000 0.000 29.163 0.000 0.000 2162.000 S_00000003
SCORE: -441.159 -1157.146 164.662 634.912 3.143 -97.415 1.366 -66.306 -17.415 -17.080 -22.135 0.000 -143.394 0.000 0.000 0.000 -11.400 28.260 301.588 -37.477 0.119 -5.440 0.000 0.000 0.000 0.000 31.182 0.000 0.000 2267.000 S_00000003
You're using the non-MPI compile of AbinitioRelax (the "default" in AbinitioRelax.default.linuxgccrelease). This means that when you launch 10 copies of the program with the mpirun -np 10 command, they're all independent and don't talk to each other, so each of them is going to make its own S_00000001. (But if they get different random seeds, then each of the S_00000001s will be different, with different energies.)
If you want to run Rosetta programs with MPI, you should be using the MPI-compiled version of the program (AbinitioRelax.mpi.linuxgccrelease). To get this, you need to compile with "extras=mpi" on the scons command line, and you often have to tweak the compile settings (this is quite cluster-dependent). There are a fair number of threads on this forum about troubleshooting MPI compiles.
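For reference, the build step is usually something along these lines (a sketch; the checkout path is assumed, and the site.settings tweaks it may need depend entirely on your cluster's compiler and MPI install):

```shell
# From the Rosetta source tree (path assumed):
cd rosetta/main/source
./scons.py -j8 mode=release extras=mpi bin
```

This produces the *.mpi.linuxgccrelease binaries alongside the default ones.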
If your silent file is similarly problematic - combine_silent can renumber them.
If you have PDBs instead, well, the file system ate every other output; there's nothing to do about it now except not run it the same way again.