I've been fairly successful at creating large lists of decoys for several of my projects (de novo, loop modeling, etc.). My runs typically produce 30,000 to 50,000 decoy structures, which I am under the impression is a decent number.
What is the best method for determining the "most accurate" structure from these decoys? Do I cluster first and take the lowest-energy structure (c.0.0.pdb)? Or do I look at the top-scoring structure from each cluster (c.0.0.pdb, c.1.0.pdb, etc.)? Or do I score the entire set and take the lowest-scoring structure to be the most accurate, without clustering?
Any advice would be great. I've tried a few of these options, and it just gets confusing when several decoy structures have similar scores, but very different structures.
It depends on what information you mean to extract from the models.
I generally sort by score and manually examine the models starting from the top until I'm sick of it, I have enough good designs to test, or I have a feel for what the top fraction of the ensemble is saying. For the projects I've worked on, clustering is inappropriate.
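For the sort-by-score step, here is a minimal Python sketch. It assumes a Rosetta-style score file in which every data line starts with "SCORE:", the first such line names the columns, and the columns of interest are called total_score and description. Those column names are assumptions (some outputs use "score" instead), so check your own file's header.

```python
def read_scores(lines, score_column="total_score", name_column="description"):
    """Return (score, decoy_name) pairs sorted best (lowest) first.

    The column names are assumptions; adjust them to match your header.
    """
    header = None
    rows = []
    for line in lines:
        if not line.startswith("SCORE:"):
            continue  # ignore SEQUENCE: lines and anything else
        fields = line.split()[1:]
        if header is None:
            header = fields  # the first SCORE: line names the columns
            continue
        if fields == header or len(fields) != len(header):
            continue  # repeated header or truncated line
        record = dict(zip(header, fields))
        rows.append((float(record[score_column]), record[name_column]))
    return sorted(rows)


# Made-up example lines, for illustration only.
example = [
    "SEQUENCE: ",
    "SCORE: total_score rms description",
    "SCORE: -101.5 2.1 loop_0001",
    "SCORE: -120.3 1.4 loop_0002",
    "SCORE: -95.0 3.3 loop_0003",
]
ranked = read_scores(example)
for score, name in ranked[:2]:
    print(f"{name}\t{score:.1f}")
```

With a real run you would pass an open file handle instead of the example list, then open the top-ranked decoys in a viewer and work down the list.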
The purpose of clustering is to tell you which individual models carry the most information, so you can look at the ten members representing clusters instead of trying to look at 10,000 PDBs. If you are doing something de novo, then cluster, look at the best energies from each cluster, and decide which model you think is right based on Rosetta's score, your biophysical intuition, and any experimental knowledge you may have.
"confusing when several decoy structures have similar scores, but very different structures": If these seem like good scores and good structures, it means Rosetta doesn't know what the right answer is; ideally you'd use some other data to bias the modeling with constraints or filters. If these are all poor structures, then either you're undersampling (unlikely from the rest of your post) or you're asking something too hard for Rosetta to do.
If you're more specific about what data you need from your ensembles, we (assuming Rocco is listening) can try to craft a cleaner answer.
Thanks Steven! My application (that I'm working on right now) is a loop modeling problem. I have a loop of 10 amino acids, and I am creating 30,000 decoys.
I do have biophysical data representing antibody binding to this loop, but no real information past that. So I am hoping to correlate the structure of my newly Rosetta-generated loop with the experimental binding data.
For this, I guess clustering might not be necessary then, huh? I should just look at the top scoring structures (until I'm sick of it too) and see if anything makes physical sense?
Clustering sounds appropriate to me, in that it will allow you to identify the major conformational wells. You should look at at least the top 20 or 30 structures directly as well (unless they're all one cluster) to get a feel for what's out there in your data set.
Rosetta generally produces models that obey some sort of normal-ish distribution in energy. You only ever really care about the models that are 2 or 3 SD better than the mean (the best ones, which in Rosetta means the lowest energies). Looking at the cluster bests and at the top fraction overall is a good way to spend your time on those, which have the best energies and are thus most justifiable.
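As a rough illustration of picking out that best tail, the sketch below keeps only decoys whose energy is at least n_sd standard deviations better than the mean. It assumes lower energies are better, so the tail of interest lies below the mean. The function name, the decoy names, and the energies are all made up for the example.

```python
from statistics import mean, stdev


def best_tail(scores, n_sd=2.0):
    """Return (cutoff, picked) where picked holds the (score, name) pairs
    at least n_sd standard deviations below the mean, best first."""
    values = list(scores.values())
    cutoff = mean(values) - n_sd * stdev(values)
    picked = sorted((s, name) for name, s in scores.items() if s <= cutoff)
    return cutoff, picked


# Toy data: 20 ordinary decoys and one clear outlier on the good side.
scores = {f"decoy_{i:04d}": -98.0 - (i % 5) for i in range(20)}
scores["decoy_best"] = -130.0

cutoff, picked = best_tail(scores)
print(f"cutoff = {cutoff:.1f}")
print([name for _, name in picked])
```

On a real ensemble the distribution is only roughly normal, so treat the cutoff as a way to decide how far down the list to read rather than as a statistical test.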
Flexible loops may change conformation when an antibody binds them, so don't read too much into the unbound loop conformation - it's good to know about, but Rosetta knows next to nothing about entropy, and won't predict a bound state in the absence of the partner.
I have an unreleased module for loop analysis if you want to use it - it's called LoopAnalyzer; it's in 3.4 but it doesn't have an application interface. I don't know if it has a parser interface (I never wrote one), but I can attach the app interface if you want. It's vaguely described in the AnchoredDesign paper; mostly it's good for prying out all the loop backbone data (phi/psi/omega) and looking for bad omega lengths.
Evaluating Rosetta results is more of an art than a science. While one would hope that the Rosetta energy would capture reality (so that the low energy structures would be the ones actually observed in solution), that's the ideal, rather than the reality (although the Rosetta energy function does about as well as any other computational method in that respect). In practice, evaluating results means combining the calculated energy of the structure with other factors that Rosetta might not quite capture.
What are those? It's often hard to say (which is part of the reason they aren't already captured by the Rosetta energy). Often a good practice is to critically examine the top results, looking for places to say "Well, that can't be quite right, because of X." You can then go back, make a filter or re-ranking which includes considerations of X, and come up with a new set of top structures. Lather, rinse, repeat until you've gotten one or more structures that look physically reasonable.
Biophysical data is really good for filtering out bad structures, if you can come up with a way to tie it to structural effects. For example, if you know a particular residue is critical for antibody binding, you can discard those structures where that residue is being modeled far from the known antibody binding site. Likewise if the critical residue is modeled as buried, although you have to consider that the effect may be an indirect one (e.g. because of backbone remodeling or subtle structural shifts).
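A filter of that kind could look something like the following sketch. It reads the C-alpha coordinate of one residue from standard fixed-column PDB ATOM records and checks its distance to a reference point. The reference point standing in for the "known antibody binding site", the distance cutoff, and the function names are hypothetical; you would need to define the site from your own data.

```python
import math


def ca_coord(pdb_lines, chain, resnum):
    """Return the (x, y, z) of the CA atom of one residue, or None."""
    for line in pdb_lines:
        if (line.startswith("ATOM")
                and line[12:16].strip() == "CA"
                and line[21] == chain
                and int(line[22:26]) == resnum):
            return (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return None


def keeps_residue_near_site(pdb_lines, chain, resnum, site_xyz, max_dist):
    """True if the residue's CA lies within max_dist (angstroms) of site_xyz."""
    xyz = ca_coord(pdb_lines, chain, resnum)
    return xyz is not None and math.dist(xyz, site_xyz) <= max_dist


# One made-up ATOM record, built with the PDB column layout.
model = ["ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f"
         % (1, " CA ", " ", "ALA", "A", 10, " ", 11.0, 12.0, 13.0)]

print(keeps_residue_near_site(model, "A", 10, (11.0, 12.0, 16.0), 5.0))
print(keeps_residue_near_site(model, "A", 10, (40.0, 12.0, 13.0), 5.0))
```

Decoys that fail the check can be set aside before you re-rank by energy. Keep in mind the caveat above about indirect effects before discarding anything permanently.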
By the way, if you've clustered with reasonably tight cluster radii, you probably don't need to look too much at anything but the lowest energy structure in a cluster. Because Rosetta is stochastic, and the energy landscape rough, you can end up with *very* similar structures with greatly different energies. In those sorts of cases, you'd want to trust the one with lower energy, as that's the one that's more likely to be representative of physical reality, with the higher energy one likely being caught in a local minimum that a physical protein would have no problem escaping. As Steven indicates, clustering is a good way to filter down a large number of potentially structurally similar structures into a smaller set that is easier to deal with manually.
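Once you have cluster assignments and energies in hand, pulling out the lowest-energy member of each cluster is a few lines of Python. This sketch assumes you have already built a list of (cluster_id, decoy_name, score) tuples yourself; the decoy names and energies below are invented for illustration.

```python
def cluster_bests(decoys):
    """Return [(cluster_id, (name, score)), ...] with one lowest-energy
    entry per cluster, ordered from best to worst energy."""
    best = {}
    for cluster_id, name, score in decoys:
        if cluster_id not in best or score < best[cluster_id][1]:
            best[cluster_id] = (name, score)
    return sorted(best.items(), key=lambda item: item[1][1])


# Hypothetical assignments: (cluster id, decoy name, energy).
decoys = [
    (0, "loop_0007", -118.2),
    (0, "loop_0311", -121.9),
    (1, "loop_0042", -125.4),
    (1, "loop_0990", -110.0),
    (2, "loop_0123", -119.5),
]

for cluster_id, (name, score) in cluster_bests(decoys):
    print(cluster_id, name, score)
```

The resulting short list, one structure per cluster, is the set worth inspecting by hand.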