I have ~10,000 PDB output structures from Rosetta and am using the clustering software Calibur to cluster them.
My question is: are the alpha-carbon RMSDs for each cluster calculated after all the PDBs are superimposed, or are they calculated before superposition, so that the clusters represent the most populated states in space? (An answer about either Calibur or Rosetta's own cluster application would be very helpful!)
I ask because the RMSD for each of my clusters is very large (~10 Angstrom), whereas ideally it should be < 2 Angstrom to get a close-to-native structure, right?
Thanks for your help!
Rosetta can calculate the RMSD either with or without superposition. The typical Rosetta clustering application computes the value with superposition by default, because the ab initio structural modeling protocols in Rosetta have unrestrained rotational and translational freedom as a consequence of the sampling protocol. What your clustering run with Calibur does, though, depends on your settings. Clustering with superposition and clustering without it are both valid approaches, depending on what question you're trying to ask. I imagine that Calibur has an option to control which one is used, though I'd suspect (but do not know) that clustering with superposition is the default.
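To make the distinction concrete, here is a minimal NumPy sketch of the two calculations. It is only an illustration, not the code that Rosetta or Calibur actually runs: the function names are my own, and the inputs are assumed to be (N, 3) arrays of C-alpha coordinates in the same residue order. The superposed version uses the standard Kabsch algorithm.

```python
import numpy as np

def rmsd_no_fit(a, b):
    """RMSD between two (N, 3) coordinate arrays exactly as they sit in space."""
    diff = a - b
    return float(np.sqrt((diff ** 2).sum() / len(a)))

def rmsd_superposed(a, b):
    """RMSD after optimally superimposing a onto b (Kabsch algorithm)."""
    # Remove the translational difference by centering both structures.
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    h = a_c.T @ b_c
    u, _, vt = np.linalg.svd(h)
    # Guard against an improper rotation (a reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    # Apply the rotation to the row-vector coordinates of a.
    a_rot = a_c @ rot.T
    return rmsd_no_fit(a_rot, b_c)
```

For two copies of the same fold that differ only by a rigid-body move, `rmsd_superposed` is essentially zero while `rmsd_no_fit` can be arbitrarily large, which is why superposition is the usual choice for ab initio decoys.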
When you talk about RMSD for clusters, do you mean the within-cluster RMSD or the between-cluster RMSD? If it's the between-cluster RMSD, then for something like ab initio output you'd expect to have different clusters with >10 Angstrom RMSD from each other.
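If you want to check which of the two numbers you are looking at, something like the following sketch can help. It assumes you already have a symmetric pairwise RMSD matrix and one cluster label per structure; both inputs and the function name are hypothetical and not something Calibur or Rosetta hands you in this form.

```python
import numpy as np

def within_and_between(dist, labels):
    """Mean within-cluster and between-cluster RMSD.

    dist   : (N, N) symmetric pairwise RMSD matrix (assumed input)
    labels : length-N sequence of cluster ids (assumed input)
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    # True where the two structures belong to the same cluster.
    same = labels[:, None] == labels[None, :]
    # Exclude each structure's zero distance to itself.
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = dist[same & off_diag]
    between = dist[~same]
    mean_within = float(within.mean()) if within.size else float("nan")
    mean_between = float(between.mean()) if between.size else float("nan")
    return mean_within, mean_between
```

A small mean within-cluster value together with a large mean between-cluster value is the pattern you'd hope for from a sensible clustering of ab initio output.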
If it's the within-cluster RMSD, though, then that's a rather large cluster radius. One possibility is that you're restricting the number of clusters too much. Certain clustering algorithms (like K-means) have a pre-specified number of clusters, so if that number is poorly chosen, you can get clusters with too large a radius. Increasing the number of clusters allows for clusters with smaller radii.
Another possibility is that the RMSD you're calculating is not the metric you're clustering over. Loopy regions can dominate the RMSD metric, but are often the least interesting part of the structure for clustering purposes. This is why CASP uses the alternate GDT metrics, which are less sensitive to the variable loop regions. Tight clustering by GDT (or even by an RMSD over some "core" region) might result in clusters with large all-C-alpha RMSDs, as that number can be dominated by "uninteresting" loop regions.
A third possibility is that those large-radius clusters are poorly populated. If you have structures which aren't really close to anything else, then depending on your clustering algorithm, you may not want to make clusters with only a handful of structures, so instead you lump them in with whatever's closest. You then get either clusters where the majority of members are within a few Angstroms of each other plus a few outliers, or a "miscellaneous" cluster containing scatter-shot structures from all over the place. The Rosetta clustering application has the ability to enforce a cluster radius cutoff, so these outliers would be forced into their own clusters, but I'm not familiar enough with Calibur to know if it has a similar option.
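As a rough illustration of what a radius cutoff does to outliers, here is a generic greedy scheme in NumPy. This is a sketch under my own assumptions, not Rosetta's or Calibur's actual implementation: it takes a precomputed pairwise RMSD matrix, and the function name and `radius` parameter are made up for the example.

```python
import numpy as np

def cluster_by_radius(dist, radius):
    """Greedy clustering with a hard radius cutoff.

    Repeatedly picks the unassigned structure with the most unassigned
    neighbours within `radius`, and makes a cluster from it and those
    neighbours. Returns a list of (center_index, member_indices) tuples.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    unassigned = np.ones(n, dtype=bool)
    clusters = []
    while unassigned.any():
        # Neighbour table restricted to structures not yet clustered.
        near = (dist <= radius) & unassigned[None, :]
        counts = near.sum(axis=1)
        # Already-assigned structures cannot become new centers.
        counts[~unassigned] = -1
        center = int(np.argmax(counts))
        members = np.flatnonzero(near[center])
        clusters.append((center, members.tolist()))
        unassigned[members] = False
    return clusters
```

With a cutoff like this, a structure that is far from everything else ends up as a cluster of one instead of inflating the radius of its nearest neighbour's cluster.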
I'm not quite sure what you mean by your last statement. If you have a structure that's < 2 Angstroms C-alpha RMSD from the native, then that's a good model. And if you get a cluster of structures which are all within 2 Angstroms of each other, and within 2 Angstroms of the native, that's a good native-like cluster. However, just because you get a well-populated cluster with a radius of less than 2 Angstroms doesn't mean that it's a native-like cluster - it could be a false minimum. Likewise, just because a cluster has a large radius doesn't mean it won't contain the most native-like structures. The guideline is that the large, low-energy cluster from the results of a Rosetta run is the most likely to be native-like, but that's not a guarantee.
I realize this is old, but in case anyone else finds it, Calibur does indeed superimpose before finding the distance to cluster by.