# "score vs RMSD plots" & "cluster models"

13 posts / 0 new
"score vs RMSD plots" & "cluster models"
#1

Dear friends,
After obtaining multiple (e.g.100) models from relax.linuxgccrelease, I think I need to select the most likely one for further design work. By looking at

2) Is this Model Good.pdf

in Meiler Lab's tutorials, I think I need to do "cluster model" and "score vs RMSD plots" to determine the best model.

I can find cluster protocol in
https://www.rosettacommons.org/docs/latest/cluster.html

Can I ask where is the protocol of "score vs RMSD plots"? Thank you very much.

Yours sincerely
Cheng

Post Situation:
Thu, 2014-09-25 06:51
lanselibai

There should be examples on how to do the plotting in the Meiler Lab tutorials, but there really isn't a defined protocol as such.

Basically, all you need to do is get the RMSD values for the poses by however you want to calculate them (CaRmsd, interface rmsd, ligand rmsd, etc.), and also get the relevant score for each model (total score, interface score, ligand score, etc.). Depending on what you want with respect to rmsd or score depends on how you calculate it. (Though many protocols will be able to provide both a relevant score and a relevant rmsd in the scorefile.)

Once that's done, you simply make a scatter plot with the rmsd on the x-axis and the score on the y-axis, with each point being a different model. You can use whatever plotting program you want - R, SigmaPlot, Gnuplot, Matplotlib, even Excel, if that's what you want.

That's basically it. you just plot the values and look to see if the structure you used as an rmsd reference shows good behavior with respect to the scores you calculate. (Generally, you want low energy structures to be close to your native-like reference structure, and structures which are far away from the reference to be higher scoring.)

Fri, 2014-09-26 15:51
rmoretti

Hi R Moretti,
Thank you very much for your help. I think I need to calculate CaRMS as I need to study the entire structure.

1) Which model should I use as the native reference to calculate RMSD? Should I use the original model before relaxation, or the model with lowest score after relaxation? (I think it should be the lowest score one)

2) How to calculate CaRMS? I tried "-in:file:native" by using "score_jd2.linuxgccrelease":

However, in the output "score.sc" file, I could not file an item that corresponds to CaRMS:

SCORE: total_score dslf_fa13 fa_atr fa_dun fa_elec fa_intra_rep fa_rep fa_sol hbond_bb_sc hbond_lr_bb hbond_sc hbond_sr_bb linear_chainbreak omega overlap_chainbreak p_aa_pp pro_close rama ref description

3) Do you think cluster is needed if I just want to select the most likely model for my subsequent design work? As I understand, cluster is to group models based on energy. But I do not think I really need that as long as the RMSD vs score plot makes sense.

I am going to do the following, can I ask are they reasonable?

Steps for "score vs RMSD plots"
1) Get all the PDB files from "relax.linuxgccrelease -nstruct 100", combine and convert it into a silent file.
2) Use score_jd2.linuxgccrelease to score all the 100 models
3) Choose the model with lowest score as the reference model for "RMSD vs score plot"
4) calculate the RMSD score for all the 100 models
5) make a scatter plot with the rmsd on the x-axis and the score on the y-axis
6) Evaluate the goodness of the "RMSD vs score plot". It is good if low score models are close to the reference structure and high score models are far away from the reference (like attachment).
7) If "RMSD vs score plot" is good, then the reference model (i.e. with lowest score) would be chosen for further design work.

Thank you very much.

Yours sincerely
Cheng

File attachments:
Mon, 2014-09-29 03:21
lanselibai

1) If you have a true native (know experimental structure), use that. (If you're using it just for rmsd, I wouldn't bother with relaxing, as you want to reference off of the experimental coordinates.)

If you don't have a true native, you will want to use what you think is the best representation of the native. Normally this is the lowest energy model, but often it's something like the lowest energy structure in the largest cluster. Keep in mind you can always recalculate the rmsd metrics to a new structure, so you can make multiple score-vs-rmsd plots, each to a different potential native-like structure.

2) I think that the rmsd calculation is one of the remaining differences in usage between score and score_jd2 - I think that the regular score application (e.g. score.linuxgccrelease) does rmsd calculation. Keep in mind this is the all-against-all CaRMSD calculation. If you're doing something like docking as opposed to folding, you may want to use something else that calculates single chain rmsd. (Usually there's a mode in the docking protocols that allows you to recalculate the rmsd for a different native without going through the whole docking process.)

3) Clustering and score-vs-rmsd serve a similar function - they group structures by similarity. The difference is that the score-vs-rmsd is based on similarity to a single structure, whereas clustering does a more extensive many-many comparison. Score-vs-rmsd is good for a quality check if you already know what your main structure is, whereas cluster is good for finding a potential reference structure, and for finding additional alternative minima. If you just want one representative structure and you have a selected structure which has a good score-vs-rmsd already, then clustering doesn't gain you much.

The steps that you've posted sound like a very reasonable way to proceed. (You don't have to convert to a silent file just for score_jd2, though - it can read PDBs with the -s option. Also, if you want a silent file, you can have relax produce one directly with something like "-out:file:silent silent_file_name.out -out:file:silent_struct_type binary", rather than doing it in two steps.)

Mon, 2014-09-29 13:05
rmoretti

Hi R Moretti,
Thanks a lot for your help!

1) I do not have an experimental structure. So I will firstly choose the lowest energy model. I will also do cluster to select the relatively low energy model among the largest cluster.

2) So I will use score.linuxgccrelease for rmsd. As I need to study the entire structure, I really need this "all-against-all CaRMSD calculation".

3) Your input on clustering vs score-vs-rmsd is quite helpful. As I do not know my own structure, I need to find a potential representative model based on an existing PDB with similar amino acid sequence. So I may use clustering to verify the model I choose is within the largest cluster.

Thank you for your agreement on my RMSD steps.

Yours sincerely
Cheng

Tue, 2014-09-30 03:29
lanselibai

Hi friends,
A) got 100 PDB models after relax
B) use score_jd2.linuxgccrelease to combine 100 PDB into a silent file "silent_100.out"
C) rank the 100 models in "silent_100.out" to identify the model with lowest total score.

However, when I was trying to use "score.linuxgccrelease" to get the RMSD score, I have the following problems

1) If the silent file "silent_100.out" is used as input:

~/Cheng/rosetta_2014.30.57114_bundle/main/source/bin/score.linuxgccrelease -database ~/Cheng/rosetta_2014.30.57114_bundle/main/database -in:file:silent /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/silent_100.out -in:file:native /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/4KMT_clean_0064.pdb -out:file:silent /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/score_with_RMSD_100.out

The error message is:

core.io.silent: ERROR: trying to index off the end of the secstruct array, idx is 433 secstruct_.size() is 432
caught exception failure to read decoy 4KMT_clean_0002_0001 from silent-file /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/silent_100.out

So I assumed the problem is due to the file "silent_100.out".

So I did a "dos2unix" to the "silent_100.out" and was told:

dos2unix: Binary symbol 0x00 found at line 883
dos2unix: Skipping binary file ./silent_100.out

As I checked the "silent_100.out", I found so many "NUL"s at line 883.

So what I can do if I want to use "silent_100.out" by "score.linuxgccrelease" to get the RMSD score?

2) To avoid using "silent_100.out", I have tried to use PDBs as input:

~/Cheng/rosetta_2014.30.57114_bundle/main/source/bin/score.linuxgccrelease -database ~/Cheng/rosetta_2014.30.57114_bundle/main/database -in:file:l /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/list_of_pdbs.txt -in:file:native /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/4KMT_clean_0064.pdb -out:file:silent /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/score_with_RMSD_100.out

The run is successful.

Does "rms" or "irms" correspond to CaRMSD?

Can I also ask how to properly extract all the CaRMSD? I tried:

grep SCORE /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/score_with_RMSD_100.out | sort -n -k28 > /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/RMSD_score_rank_100.txt

But there are so many "NUL"s and the indent is not good and cannot be correctly recognised by "Text Import Wizard" of Excel.

Thank you very much.

Yours sincerely
Cheng

Tue, 2014-10-07 10:06
lanselibai

I'm not sure why you're getting NULs in your silent file output. What's the content around line 883 where you're getting the NULs?

You actually don't need to use a silent file output with the score application, if you don't need the structures combined in a silent file. By default it will output the SCORE lines (without any of the other lines) to the file "default.sc", but this name can be changed with the commandline option "-out:file:scorefile". Try that and see if you get a cleaner tabular output.

The "rms" value is the Calpha-RMSD of the "final" structure to the structure given with -in:file:native. "irms" is the Calpha-RMSD of the "final" structure to the input structure. "srms" is the Calpha-RMSD of the input structure to the native structure. Since you're just scoring, the "irms" value should be zero, as the "final" structure is the same as the input structure. (This scheme is set up for other protocols, which change the pose and result in final structures being different from the initial structure.)

Tue, 2014-10-07 11:47
rmoretti

Hi R Moretti,
Thank you.

1) I tried to test a list file (attached) only containing 5 PDBs:

~/Cheng/rosetta_2014.30.57114_bundle/main/source/bin/score_jd2.linuxgccrelease -database ~/Cheng/rosetta_2014.30.57114_bundle/main/database -in:file:l /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/list_of_pdbs_5.txt -out:file:scorefile /mnt/hgfs/Mutagenesis_Rosetta/4KMT/2.0_Relax/20141007/total_score_5.sc > 5_log.txt

However, I still got numerous "NUL"s in "total_score_5.sc"(attached).

In addition, as you can see from the attachments (i.e. "total_score_5.sc" & "5_log.txt"), only 4 PDB files were conducted instead of 5.

1.1) I just used another PC with 2 G RAM to test. All the "NUL" has gone! :)

May be 1 G RAM is not so good.

However, the same as before, only the first 4 PDB are processed. Of course, I can simply amend one more line to that.

2) Yes, I am just scoring. So I will take "rms".

Thank you very much.

Yours sincerely
Cheng

File attachments:
Tue, 2014-10-07 15:37
lanselibai

Finally, I know the answer for the "NUL" now!
When you put more than 64 PDB files in the folder, the "NUL" ghost will come out!
I think it may be relate to some binary issue.

I know more in detail:
The number of PDB files can be put in a folder so as to avoid "NUL"s probably depends on the RAM allowcated.

4G RAM: maximum of 64 PDB files
1G RAM: maximum of 1 PDB files ！

Thu, 2014-10-09 03:11
lanselibai

After one day trial, now I become confused for the reason of "NUL" in the score file (.sc file). Here are a few ways to avoid the "NUL" based on priority:

(I know it seems to be superstitious, but it actually like that.)

1) Re-open the VMware Player
Every time I re-open it, the maximum PDBs I can process in one time varies.
2) Use less PDB files in the list_of_pdbs.txt
It is hard to say how many PDBs can be successfully processed without "nul". Sometimes 5, sometimes 10. I once accessed 64 in one go.
3) Delete the score file before running the next.
4) Save the list file after changing it.

PS: always add another line after the last line as the last line of PDB directory will not be considered.

Thu, 2014-10-09 14:09
lanselibai

Dear Fridends,
The attachment is my plotting of total score VS rms. All the rms values are so small (i.e. 0 to 0.003) and no ideal scattering format can be seen. Do I need to use a more free algorithm to relax the structure, or I can simply take the structure of lowest score for my further study? Thank you.

Yours sincerely
Cheng

The following is my method to obtain the plotting.

1) 100 models were previously obtained by using "relax.linuxgccrelease" based on paper "A Pareto-optimal refinement method for protein design scaffolds"
https://www.rosettacommons.org/node/3808

2) Then the model of lowest score was chosen by using "score_jd2.linuxgccrelease".

3) Afterwards, the lowest score model was selected as the native reference to obtain the rms of 100 models by using "score.linuxgccrelease".

File attachments:
Sat, 2014-10-11 09:07
lanselibai

The point of the constrained relax protocol is to minimize movement of the protein. 0 to 0.003 seems a little small, but I wouldn't necessarily be overly concerned about it. (Numbers in the paper are for all atom rmsd, rather than Calpha rmsd, and a large chunk of the rmsd seen is due to his, asn and gln sidechain flipping.)

Assuming that the energy of your models are in the 1.5 to 2 REU per residue range (i.e. you're relaxing a 300-400 residue protein) I would say you're probably safe in taking the lowest energy structure - after all, all the structures are rather similar to each other.

Tue, 2014-10-14 10:20
rmoretti

Hi R Moretti,
Thank you. I will take the lowest score model and proceed.

Yours sincerely
Cheng

Wed, 2014-10-15 01:54
lanselibai