analysis of clustering


Dear All,

I'm a new user of Rosetta and I suspect these are the usual questions, but as far as I've searched (manuals and web) I couldn't find satisfying answers.
I have generated 30000 structures with the abInitio protocol in parallel, 100 structures per run, taking care to vary the seed number. Afterwards I combined the 300 silent outputs with the combine protocol.
Then I clustered all these decoys with the cluster protocol. The result was 150 pdbs named c.*.0.pdb, which I understood to be the best-scoring structure from each cluster.
So far so good, but at this point I'd like more information about the clusters, if possible: the other members of each cluster, so that I can get the cluster sizes and compare structures and scores both within a cluster and across clusters.

How can I extract that information?

Thank you for your attention.
Loris

nb: I didn't attach any files because I think it's more of a methodological issue, but I'll be more than happy to provide whatever files you think will help.

Thu, 2012-04-19 02:52
loris.moretti

In the tracer output (what's printed to stdout, i.e. the console, assuming you haven't muted it), the clustering application should print a summary of the clustering run, including the size of each cluster.

If you don't set the -cluster:export_only_low option, you should actually get pdbs for all of the structures (all 30000 of them), each named in the form c.*.*.pdb, where the first number is the number of the cluster, and the second is the order of the structure within the cluster. Note that because of the number of structures, you may want to use the "-out:file:silent [outputfilename]" flag rather than having them be dumped as PDBs. You can then get the tags for those particular structures you want, and extract them from the silent file.
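If you do end up with a directory of PDBs, the naming scheme above is enough to recover cluster sizes and membership. Here is a minimal Python sketch of that bookkeeping. It is not part of Rosetta; the function names `group_by_cluster` and `cluster_sizes` are made up for illustration, and it only assumes the c.[cluster].[order].pdb naming described above.

```python
import os
import re
from collections import defaultdict

# Matches file names of the form c.<cluster>.<order>.pdb, as described above.
NAME_RE = re.compile(r"^c\.(\d+)\.(\d+)\.pdb$")

def group_by_cluster(filenames):
    """Map cluster number -> list of file names, sorted by order within the cluster."""
    clusters = defaultdict(list)
    for name in filenames:
        match = NAME_RE.match(os.path.basename(name))
        if match is None:
            continue  # ignore anything that is not a cluster PDB
        cluster_id, order = int(match.group(1)), int(match.group(2))
        clusters[cluster_id].append((order, name))
    return {cid: [n for _, n in sorted(members)]
            for cid, members in sorted(clusters.items())}

def cluster_sizes(filenames):
    """Map cluster number -> number of member structures."""
    return {cid: len(names) for cid, names in group_by_cluster(filenames).items()}
```

For example, `cluster_sizes(os.listdir("."))` run in the output directory would give one size per cluster number.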

Thu, 2012-04-19 09:07
rmoretti

Thank you for answering...
I collected the tracer output and attached it here (cluster_log.txt), but only the last portion because the whole file is too big. I cannot see where all this information is. In the very last portion there is, I guess, just the list of clusters, but each with only one member (the first, index 0), and I guess that's connected to the fact that I got only those pdb output files! But I have no idea why. The flags I used for the clustering protocol (cluster.linuxgccrelease @flags >&cluster.log) are attached too (flags.txt), and none of them should reduce the output like this.

Do you have any idea? Any help would be appreciated!
loris....

Fri, 2012-04-20 00:07
loris.moretti

Working up from the bottom of the log file, you have the timing, then the final cluster assignments (numeric id, tag name, cluster number, order in cluster), then the final clustering summary (a line for each cluster, with the size of the cluster, the group size, and a tag representing the pdb naming; beneath this is a line for each cluster member, showing the tag and the Rosetta energy), then some post-processing info, and a pre-post-processing summary.

Above this is the actual clustering. It is performed in two stages: the first 400 structures are used in all-against-all comparisons to seed the clusters, and then the remaining structures are added to the existing clusters, or to a new cluster if they're different enough from the existing ones. Every 150 structures, the number and size of the clusters is limited (these are the multiple summaries that you see; one is printed before and one after each trimming). -1 means no limit, so you're not limiting by cluster size or by the total number of structures, but you are limiting by the total number of clusters (because -cluster:limit_clusters is at its default value of 100). This limit is somewhat important, because keeping all the structures would eat up a lot of memory.
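To make the second stage and the trimming concrete, here is a toy Python sketch of greedy incremental assignment with periodic trimming. It is not Rosetta's code: it skips the all-against-all seeding stage, uses plain numbers and an absolute difference in place of structures and RMSD, and the names `incremental_cluster`, `trim`, `limit_clusters` and `trim_every` are illustrative only. As in the log, -1 is taken to mean no limit.

```python
def trim(clusters, limit_clusters):
    """Keep only the largest clusters (ties keep the earlier-created cluster)."""
    return sorted(clusters, key=len, reverse=True)[:limit_clusters]

def incremental_cluster(points, radius, limit_clusters=100, trim_every=150):
    """Toy leader-style clustering on 1-D values; NOT Rosetta's implementation."""
    clusters = []  # each cluster is a list whose first element acts as its centre
    for count, point in enumerate(points, start=1):
        # Find the nearest existing cluster centre.
        best, best_dist = None, None
        for cluster in clusters:
            dist = abs(point - cluster[0])
            if best_dist is None or dist < best_dist:
                best, best_dist = cluster, dist
        if best is not None and best_dist <= radius:
            best.append(point)        # close enough: join the existing cluster
        else:
            clusters.append([point])  # too different: start a new cluster
        # Periodically cap the number of clusters to bound memory use.
        if limit_clusters >= 0 and count % trim_every == 0:
            clusters = trim(clusters, limit_clusters)
    if limit_clusters >= 0:
        clusters = trim(clusters, limit_clusters)
    return clusters
```

With a radius that is too small for the data, every point starts its own cluster, and trimming then leaves exactly `limit_clusters` clusters of size one.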

So from your log, it looks like each structure is being assigned to a new cluster during clustering. You then limit that to the top 100 clusters, so you're left with just 100 clusters, with a single structure in each. My guess is that the radius of 3 is too narrow for your set of structures, so you may want to try increasing it. (What's the general rmsd between the various clusters from your previous runs? Do you agree with that grouping/separation?)
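One rough way to judge whether a radius is too narrow is to look at what fraction of pairwise RMSDs fall within it. The Python sketch below assumes you already have a list of pairwise RMSD values from some other tool; the function names `fraction_within` and `scan_radii` are hypothetical helpers, not Rosetta options.

```python
def fraction_within(pairwise_rmsds, radius):
    """Fraction of pairwise RMSD values that are at or below the given radius."""
    values = list(pairwise_rmsds)
    if not values:
        raise ValueError("need at least one pairwise RMSD value")
    return sum(1 for v in values if v <= radius) / len(values)

def scan_radii(pairwise_rmsds, radii):
    """Map each candidate radius to the fraction of pairs it would group together."""
    values = list(pairwise_rmsds)
    return {r: fraction_within(values, r) for r in radii}
```

If almost no pairs fall within the radius you are using, nearly every structure will end up alone in its own cluster.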

By the way, you should probably be aware that the Rosetta clustering code has been around in its current form for a while, so other clustering applications have made much more progress, and aren't saddled with Rosetta cluster's idiosyncrasies. A number of people (but certainly not all) have switched over to Calibur for clustering (http://www.ncbi.nlm.nih.gov/pubmed/20070892).

Fri, 2012-04-20 12:07
rmoretti