I Need Help on Validation for a Ligand Docking Experiment.

Hello all,

I am looking to validate binding data obtained in Rosetta for small-molecule inhibitors of various SARS-CoV-2 proteins (main protease, spike, non-structural proteins, etc.). I was recently introduced to BindingDB, a database of experimentally determined binding parameters (e.g., Kd (dissociation constant), EC50, IC50, and Ki) for a great number of protein-ligand pairs. To validate the docking results, I intend to collect PDB files for a small number of protein targets listed on BindingDB, collect .sdf files for many small-molecule binders of those proteins, record the experimentally determined Kd value listed on BindingDB (I haven't used any other parameter yet, since I believe the others are less likely to correlate with binding affinity, but I could be wrong), and dock those protein-ligand pairs in Rosetta.

What I hope to see is that the ranking of the lowest REU values obtained for the docked pairs matches the ranking of the experimentally obtained Kd values for the same pairs. The pairs with the smallest reported Kd values in BindingDB should also have the lowest REU values, since a lower Kd indicates stronger binding, and the pairs with the highest Kd values should have the highest REU values, since a higher Kd indicates weaker binding. Overall, I hope that the order of the Kd values reported in BindingDB corresponds to the order of the REU values from Rosetta to a statistically significant degree. That would give real-world value to the computational data.

Is this a valid approach? How could we optimize this validation process and make it easier? Has anyone else here done anything similar with success? What resources have you used? How can I go about assessing the statistical significance of this?
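To make the significance question concrete, here is a minimal sketch of one possible test: a Spearman rank correlation between Kd and REU with a one-sided permutation p-value. Choosing Spearman and a permutation test is an assumption on my part, not an established Rosetta protocol, and the Kd and REU numbers below are invented purely for illustration.

```python
import random


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def permutation_p_value(x, y, n_permutations=10000, seed=0):
    """Observed rho and the one-sided p-value from shuffling y."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        # Small tolerance so equal correlations are not lost to rounding.
        if spearman(x, shuffled) >= observed - 1e-12:
            hits += 1
    # +1 in numerator and denominator counts the observed ordering itself.
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical values, NOT real BindingDB or Rosetta output.
kd_nM = [5.0, 12.0, 40.0, 150.0, 900.0, 2500.0]
reu = [-14.2, -13.1, -13.5, -11.0, -9.8, -10.1]

rho, p = permutation_p_value(kd_nM, reu)
print(f"Spearman rho = {rho:.3f}, one-sided permutation p = {p:.4f}")
```

Because the test uses only ranks, it makes no difference whether Kd or log(Kd) is supplied. With only a handful of ligands per target, even a high rho can come with a wide uncertainty, so more pairs per target help.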

When I collected the ligand-protein pairs from BindingDB, I looked for pairs investigated in the exact same assay by the exact same research group, since I know that Kd values differ when the binding assay changes. Is this a good enough control? Additionally, I want to validate the binding positions by comparing the docked structures from Rosetta to experimentally obtained crystal structures of the same protein-ligand pairs. How should I prepare the files for all of this to ensure optimal outcomes? Should I add all of the hydrogens to the .sdf files of the small molecules? Should I delete cofactors and ligands from the protein I'm docking into? Thus far, I've been cleaning the structures manually in PyMOL, removing all waters, cofactors, and ligands; I've seen that the clean_pdb.py script in Rosetta accomplishes the same thing. Do cofactors and hydrogens even affect anything in Rosetta? Should I clean the crystal structures of the protein-ligand complexes that I'm using for positional validation by removing cofactors, waters, glycerol molecules, etc.? How should I deal with the crystal structures?

Lastly, we have been using the CASTp protein topographical analysis web server to locate potential binding pockets on the proteins in order to specify coordinates for Rosetta. Is this a good approach? Are there other methods that work well for this purpose?

Thank you in advance for your suggestions!

Category: Small Molecules

Post Situation: Unsolved

Sat, 2021-03-20 13:27

tbelec

There are a lot of questions and discussion points there. I would like to touch upon a few, and hopefully others will do so too.

Should I add all of the hydrogens in the .sdf files of the small-molecules

You are talking here about implicit hydrogens, right? Say, the SMILES of ethanol is 'CCO' with implicit hydrogens and '[CH3][CH2][OH]' or '[H]OC([H])([H])C([H])([H])[H]' with explicit hydrogens. It depends on how you generate the topology/parameter files (params files) for Rosetta. The Python 2 script molfile_to_params.py will want all protons. I wrote an RDKit-to-params Python 3 module (with a rubbish unofficial webserver), which accepts SMILES etc., so it is okay with implicit hydrogens.

One important thing, whichever way it is done, is to make sure that the ligands are protonated correctly for pH 7. 'CC(=O)O' is acetic acid, which, even if it is the "default" form, is unlikely to remain an acid at neutral pH (it will be 'CC(=O)[O-]' and not 'CC(=O)[OH]'). Protonated acetic acid will not hydrogen bond properly with a proton donor, and one ends up with spurious repulsions. Thanks to Lipinski's rules there really should not be too many dissociable protons, but one should keep an eye out. For natural ligands I never trust the PDB_components DB and always provide my own.
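To put a rough number on the acetic acid point, here is a minimal sketch using the Henderson-Hasselbalch relation. It assumes the textbook aqueous pKa of about 4.76 for acetic acid, and the function name is made up for illustration. pKa values can shift inside a binding pocket, so this is a sanity check and not a replacement for a proper protonation tool.

```python
def fraction_deprotonated(pka, ph=7.0):
    """Henderson-Hasselbalch: fraction of an acid HA present as A- at a given pH."""
    return 1.0 / (1.0 + 10 ** (pka - ph))


# Assumed aqueous pKa of acetic acid (~4.76); pocket environments can shift it.
acetate_fraction = fraction_deprotonated(4.76, ph=7.0)
print(f"Fraction present as acetate at pH 7: {acetate_fraction:.3f}")
```

With these inputs over 99% of the molecules are in the carboxylate form, which is why 'CC(=O)[O-]' is the sensible form to parameterise.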
In the protease, the catalytic cysteine is deprotonated and the histidine is the HID tautomer; there's a neutron diffraction paper, IIRC. I use PyRosetta only, but I am guessing the MutateResidue mover behaves the same in RosettaScripts and can be tricked with the nasty codes CYZ and HIS_D.

Should I delete cofactors and ligands from the protein I'm conducting the docking on?

Best docking results are seen on a template derived from a substrate-bound structure (with the substrate removed).
Cofactors can be left in for most applications, but they have to be reflected in the FoldTree chain shorthand that some applications use ('AB_X'), and many, like ligand_dock, require the ligand to be the last residue. However, some inhibitors do mimic cofactors: the remdesivir metabolite is an ATP mimic for the macrodomain. So it depends on what the cofactor is. Also, ions are best treated with constraints; there are a few movers that do this automatically.
Whereas docking scores are improved with water (e.g., the waterdock mod of AutoDock), the Rosetta Hydrate mover (cf. the Spades paper) is a bit temperamental and not used much. It can add waters only to amino acid residues, but it can remove waters that are too unhappy from anywhere.
Beware of the command-line arguments for waters (-ignore_waters false) and unrecognised cofactors (-ignore_unrecognized_res true).
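If you do want to keep a cofactor or ion while dropping waters and crystallisation additives, a small filter can make the choice explicit. The following is only a sketch and assumes standard fixed-column PDB records; the function name and the example lines are hypothetical, it is not how clean_pdb.py works, and it ignores alternate locations, LINK records, and multi-model files.

```python
def clean_pdb_lines(lines, keep_hetero=()):
    """Keep protein ATOM records plus HETATM records whose residue name is listed.

    Waters (HOH) and any other hetero group not named in keep_hetero are dropped.
    """
    keep = {name.upper() for name in keep_hetero}
    kept = []
    for line in lines:
        record = line[:6].strip()
        if record == "ATOM":
            kept.append(line)
        elif record == "HETATM":
            # Residue name sits in columns 18-20 of a PDB record (0-based 17:20).
            if line[17:20].strip() in keep:
                kept.append(line)
        elif record in ("TER", "END"):
            kept.append(line)
    return kept


# Hypothetical records for illustration only.
example = [
    "ATOM      1  N   SER A   1      -2.203  44.470 -20.840  1.00 30.00           N",
    "HETATM 2001  O   HOH A 501       1.000   2.000   3.000  1.00 30.00           O",
    "HETATM 2002 ZN    ZN A 401       4.000   5.000   6.000  1.00 30.00          ZN",
    "HETATM 2003  C1  GOL A 502       7.000   8.000   9.000  1.00 30.00           C",
]

cleaned = clean_pdb_lines(example, keep_hetero=["ZN"])
print("\n".join(cleaned))
```

Here the zinc ion is kept while the water and glycerol are removed. Anything retained still needs a residue type Rosetta recognises (or a params file), otherwise the flags above decide what happens to it.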

Thus far, I've been manually cleaning them in PyMOL

PyMOL removes LINK lines and the metadata. LINK is used for covalent ligands, but if you are removing them that is fine. pymol.cmd.align is easier to use than the Rosetta alignment movers, which require equal

Additionally, I want to validate the binding positions by comparing the docked structures in Rosetta to experimentally obtained crystal structures of the same protein-ligand pair

Last year, in an open fragment-based lead discovery project (COVID Moonshot) against the SARS-CoV-2 main protease, where users submitted suggested follow-ups, fidelity of docking to the actual inspiration hits was a major headache. I ended up developing a fragment-merging and follow-up-placing module (Fragmenstein), initially to address this. Given two conformers, there are many ways to score the similarity between them, ranging from simple RMSD to hotspot and shape-and-colour based approaches, and no two computational chemists will be okay with the same one. And even with simple RMSD there's the problem of equally valid mappings (a benzene ring can be mapped onto another in 12 equivalent ways).
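The mapping problem can be shown with a toy case. The sketch below computes an in-place RMSD (no superposition, as is usual when comparing a docked pose with a crystal pose in the same receptor frame) for a single benzene ring and takes the minimum over the 12 ring mappings (6 rotations and 6 flips). The function names are made up, and the explicit enumeration only works for a lone ring; real ligands need the symmetry-equivalent atom mappings from a cheminformatics toolkit.

```python
import math


def rmsd(coords_a, coords_b):
    """In-place RMSD between two equally ordered coordinate lists (no fitting)."""
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def ring_mappings(n):
    """All 2n index mappings of an n-membered ring onto itself."""
    mappings = []
    for shift in range(n):
        mappings.append([(i + shift) % n for i in range(n)])  # rotation
        mappings.append([(shift - i) % n for i in range(n)])  # flip
    return mappings


def symmetric_rmsd(reference, pose):
    """Lowest RMSD over every symmetry-equivalent atom ordering of the ring."""
    return min(
        rmsd(reference, [pose[j] for j in mapping])
        for mapping in ring_mappings(len(reference))
    )


# Idealised planar benzene carbons; 1.39 Å is the approximate C-C bond length,
# which equals the circumradius of a regular hexagon.
radius = 1.39
reference = [
    (radius * math.cos(math.radians(60 * i)), radius * math.sin(math.radians(60 * i)), 0.0)
    for i in range(6)
]
# Same atoms in space, but the atom numbering starts one carbon later.
pose = reference[1:] + reference[:1]

naive = rmsd(reference, pose)
corrected = symmetric_rmsd(reference, pose)
print(f"naive RMSD = {naive:.2f} Å, symmetry-corrected RMSD = {corrected:.2f} Å")
```

The pose overlays the reference perfectly, yet the naive index-by-index RMSD reports a full bond length of error, while the symmetry-corrected value is zero.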

Mon, 2021-03-22 08:29

matteoferla
