Can "clustering.py" be used for PDBs that contain two chains? It works fine when there is only one chain in the PDB. However, for PDBs with two chains, it seems that no structures are processed. Below are the inputs and outputs for a run on two-chain PDBs. Thank you very much.
python /home/lanselibai/Cheng/rosetta_2014.30.57114_bundle/tools/protein_tools/scripts/clustering.py --pdb_list /home/lanselibai/Cheng/cluster/list_of_pdbs.txt --rosetta /home/lanselibai/Cheng/rosetta_2014.30.57114_bundle/main/source/bin/cluster.linuxgccrelease --database /home/lanselibai/Cheng/rosetta_2014.30.57114_bundle/main/database --options /home/lanselibai/Cheng/cluster/cluster.options /home/lanselibai/Cheng/cluster/cluster_summary.txt /home/lanselibai/Cheng/cluster/cluster_histogram.txt >& /home/lanselibai/Cheng/cluster/cluster.log &
1) options file:
Parsing pdb file scores
Parsing cluster output
bin 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4 4.25 4.5 4.75 5 5.25 5.5 5.75 6 6.25 6.5 6.75 7 7.25 7.5 7.75 8 8.25 8.5 8.75 9 9.25 9.5 9.75 10 10.25 10.5 10.75 11 11.25 11.5 11.75 12 12.25 12.5 12.75 13 13.25 13.5 13.75 14 14.25 14.5 14.75 15 15.25 15.5 15.75 16 16.25 16.5 16.75 17 17.25 17.5 17.75 18 18.25 18.5 18.75 19 19.25 19.5 19.75
count 4950 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tag file_name score size
Could you pull out a few example structures (five or so) which you would expect to form at least two clusters and post them here? It'll be easier to debug with a (non-working) example.
Thank you. I just tested 10 structures. Could you please download the input/output files at
As you can see from the output files:
1) "Structures: 0" appears in "cluster.log", which seems to suggest that no structure has been processed.
2) "cluster_summary.txt" contains nothing except the header "tag file_name score size". It is fine with me if all of the structures fall into the same cluster, but there should still be a second line like the one below, giving the lowest-scoring structure, the cluster and structure positions, the score, and the number of structures in the cluster:
"/home/lanselibai/Cheng/20141112_cluster/input/4KMT_clean_0003.pdb c.0.3.pdb 7273.66 10"
Therefore, I am afraid those 10 two-chain structures have not been properly processed by the cluster application. What do you think?
Were you able to download the files from the link? Thank you very much.
All of your structures have near-identical backbones. The clustering application builds clusters based on Calpha rmsd, and if all of the structures are near-identical, it falls into an edge case where it looks like no clusters are output. (Because it's somewhat pointless to run a clustering that yields only a single cluster, it's not really a case the clustering script/application has been tested on.)
What you could do is add in a known-different structure to the mix. Once you do that, you'll get a clear clustering result, where the single outlier is in its own cluster, and all the other structures are in a single cluster.
But that's still going to get you a Calpha rmsd. Unfortunately there doesn't look to be an easy way to change the metric to an all-atom rmsd, which may be more what you're looking for if you have results that all have the same backbone.
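To make the Calpha point concrete, here is a minimal Python sketch of a Calpha-only rmsd. It is not the Rosetta cluster code: it assumes standard fixed-column PDB ATOM records, and it skips the superposition step a real clustering tool would normally perform, so it is only meaningful for structures already in the same frame. The function names are made up for illustration.

```python
import math


def ca_coords(pdb_lines):
    """Return (x, y, z) tuples for every Calpha ATOM record, in file order."""
    coords = []
    for line in pdb_lines:
        # Fixed-column PDB format: atom name in columns 13-16,
        # x/y/z coordinates in columns 31-38, 39-46 and 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def ca_rmsd(coords_a, coords_b):
    """RMSD over paired Calpha atoms (assumption: no superposition is done)."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of Calpha atoms")
    if not coords_a:
        raise ValueError("no Calpha atoms found")
    total = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(total / len(coords_a))
```

Because only the CA records are read, two models that differ only in side-chain placement give an rmsd of essentially zero, which is why a set of results sharing one backbone collapses into a single group.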
Hi R Moretti,
I tried clustering a mix of 10 structures. Four of them contain only the heavy chain (HC) and six contain both the HC and the light chain (LC).
Just as you said, it produced a proper cluster summary:
tag file_name score size
/home/lanselibai/Cheng/20141112_cluster/input/4KMT_clean_0007.pdb c.0.6.pdb -543.649 10
So from now on, if clustering.py finishes successfully but produces no "cluster.out" and no records in "cluster_summary", I will take that to mean the input structures are nearly identical.
I am happy with Calpha rmsd.
The clustering.py website says:
Note that the cluster application cannot sometimes handle very large sets of structures. It is recommended to use a program that has been optimized for such a purpose such as Calibur instead.
I ran Calibur and clustering.py on the same set of structures but got different cluster outputs. Also, the Calibur outputs are not as user-friendly as those of clustering.py (e.g. cluster_summary). To what extent can we trust clustering.py? What is the maximum number of inputs for clustering.py?
Thank you very much.
I wouldn't necessarily trust the results of clustering structures of differing composition/lengths (e.g. some with only heavy chains and some with HC+LC). In order to compute the rmsd, you need to be able to match up residues in the structures, and if you have differing lengths, the rmsd between the two structures might not be accurate. Either cluster your HC and HC+LC structures separately, or extract just the HC from the HC+LC structures so that all the structures are of the same length.
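If you go the extraction route, something along these lines might do as a starting point. It is only a sketch: it assumes standard fixed-column PDB records (chain ID in column 22) and that the heavy chain is labelled "H", which you should check against your own files. The function names are hypothetical, not part of Rosetta.

```python
def extract_chain(pdb_lines, chain_id="H"):
    """Keep only ATOM/HETATM records belonging to one chain.

    Assumption: the chain ID sits in column 22 (index 21), as in the
    standard PDB format, and "H" is the heavy chain label.
    """
    return [
        line
        for line in pdb_lines
        if line.startswith(("ATOM", "HETATM"))
        and len(line) > 21
        and line[21] == chain_id
    ]


def count_ca(pdb_lines):
    """Count Calpha ATOM records, a quick proxy for residue count."""
    return sum(
        1
        for line in pdb_lines
        if line.startswith("ATOM") and line[12:16].strip() == "CA"
    )


# Example usage (file names are placeholders):
# with open("model_HC_LC.pdb") as handle:
#     heavy = extract_chain(handle.read().splitlines(), chain_id="H")
# with open("model_HC_only.pdb", "w") as handle:
#     handle.write("\n".join(heavy) + "\nEND\n")
```

Running count_ca on every extracted file and confirming the numbers agree is a cheap sanity check before handing the list to the clustering script.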
I wouldn't expect the clusters to be the same, as they use slightly different algorithms. For efficiency, Rosetta clustering only does an all-against-all matrix for a limited number of structures (400, I believe), finds seed clusters from those structures, and then extends the clusters based on the new structures. (Adding new clusters as needed.) Calibur, as I understand it, has some tricks such that it can efficiently work with a larger number of cluster pairings, so it doesn't have to do the seed-and-grow process. (Note that due to algorithm differences, you would expect different clusters even with under 400 structures.)
Rosetta's clustering.py will work with a large number of structures. (It doesn't have a hard maximum). It's just that the type of clusters you'll get will be different than if you theoretically did a "full" all-against-all clustering. The way Calibur works allows it to get closer to the theoretical ideal. If the first 400 structures are representative, though, Rosetta clustering should be "close enough" for many applications, especially if you just want the biggest clusters.
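As a rough illustration of the seed-and-grow idea described above, here is a toy Python sketch. It is not Rosetta's implementation: the way a cluster center is picked (the seed with the most neighbors within the radius) and all names and parameters are assumptions made for the example, and the default seed limit of 400 is just the number quoted above from memory.

```python
def seed_and_grow(items, distance, radius, seed_limit=400):
    """Toy seed-and-grow clustering.

    Returns clusters as lists of indices into `items`; the first index in
    each list is the cluster center. `distance` is any symmetric function
    of two items (for structures it would be an rmsd).
    """
    clusters = []
    remaining = list(range(min(seed_limit, len(items))))

    # Phase 1: all-against-all comparison restricted to the seed structures.
    # Repeatedly pick the seed with the most neighbors within `radius`
    # and pull those neighbors out as one cluster.
    while remaining:
        def neighbors(i):
            return [j for j in remaining if distance(items[i], items[j]) <= radius]

        center = max(remaining, key=lambda i: len(neighbors(i)))
        members = [center] + [j for j in neighbors(center) if j != center]
        clusters.append(members)
        taken = set(members)
        remaining = [j for j in remaining if j not in taken]

    # Phase 2: each later structure joins the first cluster whose center is
    # within `radius`, or starts a new cluster if none is close enough.
    for i in range(seed_limit, len(items)):
        for cluster in clusters:
            if distance(items[cluster[0]], items[i]) <= radius:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

With 1-D points standing in for structures, `seed_and_grow([0.0, 0.1, 0.2, 5.0, 5.1, 0.15, 9.0], lambda a, b: abs(a - b), 0.5, seed_limit=5)` returns `[[0, 1, 2, 5], [3, 4], [6]]`: the first five points define two seed clusters, the sixth is absorbed into the first, and the seventh opens a new one. The result depends on which structures happen to be among the seeds, which is the sense in which this differs from a full all-against-all clustering.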
Hi R Moretti,
Thank you very much for your help.
I will only include structures with the same composition.
I understand that the outputs from the two methods are expected to differ. I will use Rosetta's clustering.py.