When I am using the design protocols (mostly fixbb design), I always wonder how much of the sequence space is effectively searched by the Monte Carlo simulated annealing method. For example, in a fixbb design at 13 positions, the output log file reported a total of 15459 rotamers, and I generated 1000 decoys. So there should be 1000 sampling trajectories of random substitutions drawn from those 15459 rotamers, right? In this case, I noticed the resulting sequences were pretty converged. My questions are:
1) How many substitutions were tried? I found one paper (Kuhlman and Baker 2000) giving an estimate of about 1 million per Monte Carlo run, and another (Leaver-Fay et al 2008) mentioning 200*(number of residues)*(rotamers per residue). Is there a reason why "200" was chosen? Is there some other specific number? If it is set somewhere in the Rosetta code, where should I look?
2) Theoretically, the result should be the same no matter what starting sequence I choose. In other words, the starting sequence shouldn't bias the sampling trajectory. Is this always true in practice?
3) I guess it is still possible that trajectories starting from the same sequence all get trapped in the same local minimum. In that case the results would still look converged. How can I distinguish that from a true global-minimum solution? And how can I avoid it? Should I randomize the starting sequence of each trajectory?
4) Are there any algorithms or parameters in the Rosetta design protocols that explicitly ensure the Monte Carlo simulated annealing algorithm samples a sufficiently large region of sequence space?
1) The code to look at here is src/core/pack/annealer/SimAnnealerBase.cc. The number 200*(number of residues)*(rotamers per residue) is sort-of visible there, as (factors) x (number of total rotamers). There's no exact factor of 200, but perhaps the default case falls out that way (there are lots of little overrides and scalers scattered about, like how the get_inneriterations function returns a number different from what setup_iterations sets up). My guess is it was chosen empirically; if it's not in the Leaver-Fay paper, it's still locked in Andrew's head.
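Just to make the scaling concrete, the quoted rule of thumb amounts to the little calculation below. This is not the actual code from SimAnnealerBase.cc; the function name and the default factor of 200 here are just illustrative stand-ins for the scattered scalers mentioned above.

```python
# Hypothetical sketch of how the annealer's iteration budget scales with
# problem size. The factor of 200 is the rule of thumb from the thread,
# not a value read out of SimAnnealerBase.cc.
def estimate_substitutions(num_residues, rotamers_per_residue, factor=200):
    """Rough total of rotamer substitutions attempted in one annealing run."""
    total_rotamers = num_residues * rotamers_per_residue
    return factor * total_rotamers

# For the 13-position example (~1189 rotamers per position, 15457 total):
attempts = estimate_substitutions(13, 1189)
print(attempts)  # 3091400 - about 3 million attempts per trajectory
```

That lands in the same ballpark as the "about 1 million per run" figure from Kuhlman and Baker, which is probably why both numbers get quoted.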
2-3) I've never seen Rosetta get trapped in the fashion you describe. If it concerns you, run your protein through with a resfile:
1 A PIKAA A #your first residue, change to correct number and chain ID
2 A PIKAA A #your second residue
to force the entire thing to alanine, then redesign from that. If you get the same solutions, it wasn't trapped. In my experience, a 13mer design should converge a reasonable fraction of the time.
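If you have more than a handful of positions, a quick script can generate those resfile lines for you. This is just a convenience sketch that assumes chain A and consecutive residue numbering starting at 1, matching the body syntax shown above; adjust both to your PDB.

```python
# Generate an all-alanine resfile body like the example above.
# Assumes chain A and residue numbers 1..n; adjust to match your PDB.
def alanine_resfile_lines(n_residues, chain="A"):
    return [f"{i} {chain} PIKAA A" for i in range(1, n_residues + 1)]

for line in alanine_resfile_lines(13):
    print(line)
```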
4) Not that I'm aware of. How would you even define or measure this from within a trajectory?
Thanks for the quick reply!
I've tried starting with all alanine. The resulting sequences differed somewhat from the original design results. However, my original design starting from the native sequence used Rosetta 2.3.0, whereas the all-alanine trial was done much later and used Rosetta 3.2. I guess that might matter, since some default parameters and flags may differ. I will run some more trials.
You mentioned that the 13mer design should converge. In that sense, I'd guess millions of substitution trials should, at least empirically, be enough to find the globally optimal solution in the entire sequence space of the 13mer (20**13)? In fact, maybe millions is far more than needed? Has the Rosetta team tried tuning the factor for different design lengths? It could speed up the calculation, since millions of substitutions are time-consuming.
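For a rough sense of scale (just back-of-the-envelope arithmetic, nothing from Rosetta itself): even millions of attempts cover only a vanishing fraction of the full sequence space, so any convergence has to come from the structure of the energy landscape rather than from exhaustive coverage.

```python
# 13 fully designable positions over the 20 amino acids:
space = 20 ** 13
print(space)  # 81920000000000000 sequences (~8.2e16)

# Compared to ~3 million substitution attempts per trajectory:
print(space // 3_000_000)  # ~27 billion times more sequences than attempts
```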
I wouldn't be surprised if results differed between 2.3 and 3.2. I would be surprised if they didn't. There are some minor scorefunction changes floating around; I think the distances for some of the shorter-range terms (fa_rep) were increased.
I believe Andrew put serious effort into optimizing the packer parameters. It is, nevertheless, an impossible problem: the optimal number of rotamer substitutions will vary from system to system and problem to problem, and because it's all Monte Carlo anyway, there's no way to know whether any given parameter set is truly optimal.
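To see why independent trajectories are the practical convergence check, here is a toy Metropolis simulated-annealing loop over an arbitrary discrete landscape. This is emphatically not Rosetta's annealer (no rotamer graph, no two-body energies); it just illustrates that separate seeded runs landing on the same state is the usable signal of convergence, while disagreement flags trapping.

```python
import math
import random

def anneal(energy, n_states, steps=2000, t_start=5.0, t_end=0.05, seed=None):
    """Toy Metropolis simulated annealing over a discrete state space.
    Not Rosetta's packer; just demonstrates run-to-run variability."""
    rng = random.Random(seed)
    state = rng.randrange(n_states)
    for step in range(steps):
        # Geometric cooling schedule from t_start down toward t_end.
        t = t_start * (t_end / t_start) ** (step / steps)
        candidate = rng.randrange(n_states)  # propose a random substitution
        delta = energy(candidate) - energy(state)
        # Metropolis criterion: always take downhill moves; take uphill
        # moves with probability exp(-delta/t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            state = candidate
    return state

# A rugged toy landscape of 100 states with random energies:
landscape = [random.Random(0).uniform(0, 10) for _ in range(100)]
finals = {anneal(landscape.__getitem__, 100, seed=s) for s in range(5)}
print(finals)  # agreement across independent runs suggests convergence
```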