I am looking to validate binding data obtained in Rosetta for small-molecule inhibitors of various SARS-CoV-2 proteins (main protease, spike, non-structural proteins, etc.). I was recently introduced to BindingDB, a database that reports experimentally determined binding parameters (e.g., Kd (dissociation constant), EC50, IC50, and Ki) for a great number of protein-ligand pairs.
To validate the docking results, I intend to collect PDB files for a small number of protein targets listed on BindingDB, collect .sdf files for many small-molecule binders of those proteins, record the experimentally determined Kd value listed on BindingDB (I haven't used any other parameter yet, since I believe the others are less likely to correlate with binding affinity, but I could be wrong), and dock those protein-ligand pairs in Rosetta.
What I hope to see is that the ranking of the lowest REU values obtained from docking matches the ranking of the experimental Kd values for the same pairs. The pairs with the smallest reported Kd values in BindingDB should also have the lowest REU values in Rosetta, since a lower Kd indicates stronger binding, and the pairs with the highest Kd values should have the highest REU values, since a higher Kd indicates weaker binding. Overall, I hope the order of the Kd values reported in BindingDB corresponds to the order of the REU values from Rosetta to a statistically significant degree. That would give real-world value to the computational data.
Is this a valid approach? How could we optimize this validation process and make it easier? Has anyone else here done anything similar with success? What resources have you used?
How can I assess the statistical significance of the agreement between the Kd ranking and the REU ranking?
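
To make the question concrete, here is a minimal sketch of the kind of test I have in mind: a Spearman rank correlation between Kd and the lowest REU score per pair, with a one-sided permutation test for significance. It uses only the Python standard library. The function names are my own, and the Kd and REU numbers are made-up placeholders, not real BindingDB or Rosetta values.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm=10000, seed=0):
    """One-sided p-value: share of random shuffles of y with rho >= observed."""
    rng = random.Random(seed)
    observed = spearman(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if spearman(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Placeholder data for illustration only (NOT real measurements).
kd_nM = [12.0, 85.0, 430.0, 1500.0, 9800.0]    # hypothetical Kd values
best_reu = [-14.2, -12.9, -13.1, -10.4, -8.7]  # hypothetical lowest REU per pair

rho = spearman(kd_nM, best_reu)
p = permutation_p_value(kd_nM, best_reu)
print(f"Spearman rho = {rho:.3f}, one-sided permutation p = {p:.4f}")
```

A positive rho here means low Kd goes with low REU, which is the agreement I am hoping for. With only a handful of ligands, the smallest attainable p-value is limited by the number of possible orderings (120 for five ligands), so I assume a reasonably large ligand set per target is needed.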
When I collected the ligand-protein pairs from BindingDB, I looked for pairs investigated in the exact same assay by the exact same research group, since I know that Kd values differ when the binding assay changes. Is this a good enough control? Additionally, I want to validate the binding positions by comparing the structures docked in Rosetta to experimentally obtained crystal structures of the same protein-ligand pairs.
How should I prepare the files for all of this to ensure optimal outcomes? Should I add all of the hydrogens to the .sdf files of the small molecules? Should I delete cofactors and ligands from the protein I'm docking into? Thus far, I've been cleaning the structures manually in PyMOL, removing all waters, cofactors, and ligands; I've seen that the clean_pdb.py script in Rosetta accomplishes the same thing. Do cofactors and hydrogens even affect anything in Rosetta? Should I clean the crystal structures of the protein-ligand complexes that I'm using for positional validation by removing cofactors, waters, glycerol molecules, etc.? How should I deal with the crystal structures?
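
For the comparison of docked poses to crystal poses mentioned above, this is a minimal sketch of the heavy-atom ligand RMSD I would compute. It rests on assumptions I am not sure hold in practice: the docked and crystal structures are already in the same coordinate frame (no superposition is done here), the ligand atom names match between the two files, the element column of the PDB file is filled in, and ligand symmetry is ignored. The function names and the residue name argument are my own.

```python
import math


def parse_ligand_coords(pdb_text, resname):
    """Return {atom_name: (x, y, z)} for heavy-atom HETATM records of one residue name.

    Uses the fixed-column PDB layout: atom name in columns 13-16, residue name
    in 18-20, x/y/z in 31-54, element symbol in 77-78.
    """
    coords = {}
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        if line[17:20].strip() != resname:
            continue
        if line[76:78].strip().upper() == "H":
            continue  # skip hydrogens; compare heavy atoms only
        name = line[12:16].strip()
        coords[name] = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords


def ligand_rmsd(coords_a, coords_b):
    """In-place RMSD over the atom names present in both poses (no fitting)."""
    common = sorted(set(coords_a) & set(coords_b))
    if not common:
        raise ValueError("no matching atom names between the two poses")
    total = 0.0
    for name in common:
        total += sum((p - q) ** 2 for p, q in zip(coords_a[name], coords_b[name]))
    return math.sqrt(total / len(common))
```

I would read each file's text, call `parse_ligand_coords` with the ligand's three-letter residue name, and pass the two dictionaries to `ligand_rmsd`. I have seen 2 Å used as a cutoff for a correctly docked pose, but I don't know whether that is the right threshold here.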
Lastly, we have been using the CASTp protein topographical analysis web server to locate potential binding pockets on the proteins in order to specify coordinates for Rosetta. Is this a good approach? Are there other methods that work well for this purpose?
Thank you in advance for your suggestions!