I used Calibur to cluster the structures I produced with Rosetta, and I got an RMSD of ~13 Angstrom within each cluster, which seems very high to me.
What range of within-cluster RMSD is commonly considered acceptable (whether clustering with Rosetta itself or with Calibur)?
Also, what algorithm does the Cluster application of Rosetta use to calculate the distance?
Does it compare the structures residue by residue: the alpha carbon of residue 1 in structure 1 with the alpha carbon of residue 1 in structure 2, then the alpha carbons of residue 2 in the two structures, and so on until every residue has been compared with its counterpart?
And how do I pick the structure that best represents each cluster?
My overall goal is to pick, from all the structures simulated by Rosetta, the one that is closest to the native structure.
I appreciate your answers in advance!
Somewhere in the 2-3 Angstrom region is typically where people place the cutoff for "close to native" structures, when comparing to natives. So a cluster radius of about 2-3 Angstrom rmsd should give you a cluster center which would be "close enough", assuming that your native structure is somewhere within that cluster. It depends a little on your application, though. You may want to expand or contract that radius depending on how stringent your acceptance criteria are, or how convergent/divergent the protocol results are. For reference, the default cluster radius for the Rosetta cluster application is 3.0 Angstrom.
Typically the distance metric in structure clustering is Calpha root mean squared deviation (rmsd). What that means is that for each residue, the distance between the Calpha of the test structure and the corresponding Calpha of the reference structure is determined. The squares of those Calpha-Calpha distances are averaged across all residues, and the square root is taken to yield a value in units of Angstroms. (Typically the value for the best superposition - the rotation and translation that minimizes the rmsd - is the one reported.) Other methods are possible, for example averaging the squared distance across all backbone atoms or all backbone+sidechain atoms, instead of just the Calphas. For CASP-type structure prediction, a metric called GDT ("global distance test") is typically used instead of rmsd, as it's less sensitive to misplaced loops and the like. Rosetta clustering has options to use GDT instead of rmsd, though I don't know if Calibur can.
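To make the rmsd arithmetic concrete, here is a minimal Python sketch. It is only an illustration and not the code Rosetta or Calibur actually runs. It assumes the two structures have already been superimposed (the optimal rotation/translation step is left out), and the function names and the toy coordinates are made up for the example. The `gdt_ts_like` function is a simplified stand-in for GDT (fraction of residues within 1, 2, 4 and 8 Angstrom, averaged) on a single fixed superposition, whereas the real GDT searches over many superpositions.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """Calpha rmsd between two equal-length lists of (x, y, z) tuples.

    Assumption: the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum the squared Calpha-Calpha distance for each residue pair.
    total = 0.0
    for a, b in zip(coords_a, coords_b):
        total += sum((p - q) ** 2 for p, q in zip(a, b))
    # Average over residues, then take the square root.
    return math.sqrt(total / len(coords_a))

def gdt_ts_like(coords_a, coords_b, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT-style score (0-100) on one fixed superposition."""
    dists = [math.dist(a, b) for a, b in zip(coords_a, coords_b)]
    fractions = [sum(d <= c for d in dists) / len(dists) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy example (hypothetical data): 10 "residues" on a line, 3.8 Angstrom apart.
native = [(3.8 * i, 0.0, 0.0) for i in range(10)]
model = list(native)
model[-1] = (native[-1][0], 30.0, 0.0)  # one residue misplaced by 30 Angstrom

print("rmsd:", round(ca_rmsd(native, model), 2))
print("GDT-like score:", round(gdt_ts_like(native, model), 1))
```

In this toy case nine of the ten residues match exactly, yet the single misplaced residue pushes the rmsd to roughly 9.5 Angstrom while the GDT-like score stays near 90. That is the sense in which GDT is less sensitive to misplaced loops.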
The normal procedure to pick the best structure from a run is to do clustering, look at the largest cluster (the one with the most member structures) and then pick the structure in the cluster which has the best energy.
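As a rough sketch of that selection step in Python: the model names, cluster ids and scores below are invented for illustration, and in practice you would parse them from your clustering output and score file. The sketch assumes that a lower (more negative) Rosetta score means a better energy.

```python
from collections import defaultdict

# Hypothetical input: (model name, cluster id, Rosetta total score).
models = [
    ("model_0001", 0, -210.4),
    ("model_0002", 1, -225.9),
    ("model_0003", 0, -218.7),
    ("model_0004", 0, -214.1),
    ("model_0005", 1, -209.3),
    ("model_0006", 2, -230.2),
]

def pick_representative(models):
    """Return (cluster id, model name, score) for the lowest-scoring
    member of the largest cluster."""
    clusters = defaultdict(list)
    for name, cluster_id, score in models:
        clusters[cluster_id].append((score, name))
    # Largest cluster = the one with the most member structures.
    largest_id = max(clusters, key=lambda cid: len(clusters[cid]))
    # Best energy = lowest score within that cluster.
    best_score, best_name = min(clusters[largest_id])
    return largest_id, best_name, best_score

print(pick_representative(models))
```

Note that the overall lowest-scoring model (here the lone member of cluster 2) is deliberately not the one chosen: the procedure trusts the most populated cluster first and only then uses the energy to pick within it.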
Thanks for your reply!
I want to figure out why my samples have such a large RMSD within each cluster.
I read in the tutorial that one may need to generate ~10,000 models in Rosetta and then cluster them.
My question is: in a typical case, how big is the largest cluster with an RMSD within 2~3 Angstrom among the ~10,000 models? Is it thousands, hundreds, or fewer?
In my case, I only generated ~300 models to cluster in Calibur. Will this be a problem?
My guess is that 300 may be too few to give a representative sample of conformations. Say the largest cluster from 10,000 models usually contains 100 structures within 2~3 Angstrom. Then with my 300 models, the clustering would have to find only about 3 structures that are within 3 Angstrom of each other, which it will not be able to do reliably. (My largest cluster has ~220 structures out of 300, and the second largest has ~25.)
How many clusters you get, and how many models per cluster, is going to depend highly on how convergent your particular problem is. In the extreme (and extremely unlikely) case, Rosetta might be able to find the lowest energy structure nearly every run, and so would give you a single big cluster with most of the structures in it. On the other end you may have a particularly difficult problem, so Rosetta might not converge at all, and you'll get a large number of clusters with very few structures in each. (And no one cluster which stands out regarding the number of structures.) Best case scenario is typically somewhere in between, where you get a small but not tiny fraction of the structures in a single cluster. 1% sounds about in the ballpark, but again, it will be highly variable based on your particular problem.
If you're getting 220/300 structures in a single cluster, that explains why you're getting a high rmsd radius for the cluster. It's very unlikely that your problem is that convergent. Instead, your clustering settings are such that you're smashing widely different structures into a single cluster, leading to the large radius. To get a better idea of how things are going, adjust your clustering parameters so that you get cluster radii in the range you want. My guess is that you won't be able to see a signal at 300 structures - we don't recommend thousands of structures just for the heck of it. In most situations you really do need that much sampling to get interpretable results.
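To illustrate how the radius setting alone can lump everything together, here is a toy Python sketch of a simple greedy, neighbor-count clustering. This is an assumed scheme for illustration only and not necessarily what Rosetta or Calibur do internally. The "structures" are just numbers on a line and the distance is their absolute difference, standing in for a pairwise rmsd.

```python
def greedy_cluster(items, distance, radius):
    """Repeatedly take the item with the most neighbors within `radius`
    as a cluster center, assign those neighbors to it, and remove them."""
    remaining = list(items)
    clusters = []
    while remaining:
        center = max(
            remaining,
            key=lambda c: sum(distance(c, o) <= radius for o in remaining),
        )
        members = [o for o in remaining if distance(center, o) <= radius]
        clusters.append((center, members))
        remaining = [o for o in remaining if distance(center, o) > radius]
    return clusters

def pair_distance(a, b):
    # Stand-in for a pairwise rmsd between two structures.
    return abs(a - b)

# Hypothetical 1-D "structures": three tight groups plus one outlier.
toy = [0.0, 0.5, 1.0, 1.5, 6.0, 6.5, 7.0, 12.0, 12.4, 20.0]

for radius in (13.0, 1.0):
    sizes = [len(members) for _, members in greedy_cluster(toy, pair_distance, radius)]
    print("radius", radius, "-> cluster sizes", sizes)
```

With the large radius every toy structure ends up in one cluster, much like the 220/300 case; with the small radius the separate groups show up as distinct clusters.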