I have a problem with cleaning my protein before ligand docking. I use the command `clean_pdb.py 1i09 A` to download and clean GSK-3, but the final structure has some missing residues, even more than the initial structure in the Protein Data Bank. There is no gap in the amino acid sequence when I open the structure as a text file, but in PyMol there are dotted lines instead of solid lines in some areas of the structure. I have uploaded the initial and the cleaned PDB files for your consideration.
Also, how should I determine if I need to relax the protein before docking or not?
I have one more question regarding the double bonds in ligand PDB files. When using molfile_to_params.py, all the double bonds in the SDF file turn into single bonds in the PDB file. I used Open Babel to keep the double bonds, but when I place the ligand into the binding site, they disappear again. I see that in the material given in the Rosetta virtual workshop (2020) all the double bonds are present in the PDB files. I was wondering whether this causes a problem in the docking process.
|The protein structure obtained from protein data bank||489.64 KB|
|The protein structure after using clean_pdb.py 1i09 A||206.22 KB|
The issue you're having with missing residues/gaps is that certain residues in your input PDB have backbone atoms with zero occupancy. This is likely because the electron density in those regions is too weak to see, but the crystallographer could be pretty certain of the coordinates based on the surrounding atoms. By default, Rosetta ignores residues with missing or zero-occupancy backbone atoms, so the clean_pdb script throws those atoms out. (That's actually one of the things it's explicitly "cleaning".)
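To see which residues are affected before cleaning, something like the following rough Python sketch can scan the ATOM records. It assumes standard fixed-column PDB formatting (occupancy in columns 55-60); the function name `zero_occupancy_backbone_residues` is made up for illustration and is not part of Rosetta.

```python
# Illustrative helper, not part of Rosetta. Assumes fixed-column PDB records.
BACKBONE = {"N", "CA", "C", "O"}


def zero_occupancy_backbone_residues(pdb_lines, chain="A"):
    """Return (resname, resseq) pairs whose backbone has a zero-occupancy atom."""
    hits = []
    for line in pdb_lines:
        # Only protein ATOM records that are long enough to hold an occupancy.
        if not line.startswith("ATOM") or len(line) < 60:
            continue
        if line[21] != chain:
            continue
        if line[12:16].strip() not in BACKBONE:
            continue
        if float(line[54:60]) == 0.0:
            key = (line[17:20].strip(), int(line[22:26]))
            if key not in hits:
                hits.append(key)
    return hits
```

Calling it on the lines of the downloaded PDB file (for example `open(path).readlines()`, with `path` pointing at your own copy) lists the residues that Rosetta would skip by default.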
One fix might be to open the input file in a text editor and manually change the "0.00" in the occupancy columns of ASP 31 and GLY 34 to be "1.00". Another option may be to provide the `--keepzeroocc` flag to the clean_pdb.py script, and then pass `-ignore_zero_occupancy false` to the Rosetta application command line.
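If you'd rather not hand-edit the file, the manual occupancy fix could be scripted roughly as below. This is only a sketch under the assumption of standard fixed-column PDB records; `reset_zero_occupancy` is a hypothetical name, and it only touches atoms whose occupancy is exactly 0.00 in the residues you list.

```python
# Illustrative helper, not part of Rosetta. Assumes fixed-column PDB records.
def reset_zero_occupancy(pdb_lines, residues, chain="A", new_occupancy=1.00):
    """Return a copy of the lines with 0.00 occupancies in `residues` replaced."""
    wanted = set(residues)
    out = []
    for line in pdb_lines:
        if (line.startswith("ATOM") and len(line) >= 60
                and line[21] == chain
                and int(line[22:26]) in wanted
                and float(line[54:60]) == 0.0):
            # Occupancy lives in columns 55-60 (a 6-character, right-aligned field).
            line = line[:54] + f"{new_occupancy:6.2f}" + line[60:]
        out.append(line)
    return out
```

For the case discussed here you would pass `residues=[31, 34]` and write the returned lines to a new file, keeping the original untouched.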
A third option is to skip the clean_pdb step altogether. Rosetta is much more robust to reading in PDBs than it was in the past, so cleaning is often not necessary. You may want to edit the PDB to remove extra chains and/or some of the crystallization buffers (e.g. the phosphates), but Rosetta should be able to read in the PDB directly without cleaning. (Though you will probably still want to pass `-ignore_zero_occupancy false` in your case.)
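The light manual edit (one chain, no buffer molecules) could be sketched in Python as follows. The name `keep_chain` is invented for this example, and `PO4` as the residue name for the phosphates is an assumption; check the HETATM records in your own file. Note that this sketch drops all header and other non-coordinate records.

```python
# Illustrative helper, not part of Rosetta. Assumes fixed-column PDB records.
def keep_chain(pdb_lines, chain="A", drop_resnames=("PO4",)):
    """Keep ATOM/HETATM records of one chain, minus the listed residue names."""
    drop = set(drop_resnames)
    kept = []
    for line in pdb_lines:
        # Header, remark, and other non-coordinate records are discarded.
        if not line.startswith(("ATOM", "HETATM")):
            continue
        if len(line) < 22 or line[21] != chain:
            continue
        if line[17:20].strip() in drop:
            continue
        kept.append(line)
    return kept
```

Because the residue numbers are left alone, the output keeps the original PDB numbering rather than the renumbered form that clean_pdb produces.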
Regarding prep, doing a pre-relax probably won't hurt. But I'd recommend that you add the flags `-constrain_relax_to_start_coords` and `-relax:ramp_constraints false` to your relax command line to keep the backbone from moving too much. (You could also constrain the sidechains, but as ligand docking allows the sidechains to move, it probably isn't necessary.)
Regarding the double bonds, the PDB format strictly speaking doesn't have any bond order information. There are some conventions which can be used to represent it, but that depends on the program reading the file recognizing those non-standard additions. Rosetta does not. Instead, the presence/absence of double bonds is driven entirely by the designation in the params files. If you have the proper single/double annotations in the Mol file you pass to molfile_to_params.py, then the params file will have the correct bonding, and Rosetta will properly recognize the bonds, even if there's no such annotation in the PDB file.
Thanks for your quick response,
I changed the "0.00" in the occupancy columns of ASP 31 and GLY 34 to be "1.00" and I no longer get new missing residues, but I still have the missing residues (dotted lines) of the input file when I open it in PyMol. What if I change all the 0.00 occupancies to be 1.00? Is this the same as providing the `--keepzeroocc` flag to the clean_pdb.py script, and then passing `-ignore_zero_occupancy false`?
The binding site is the area around Cys199 and it seems to be far from the missing residues. Is it a good idea to use other software such as Modeller to fill in the missing residues?
If I skip the clean_pdb step altogether, the protein sequence will start from number 25, do I need to renumber it so that it starts from number 1?
Taking a closer look, it looks like your original input file has those gaps, and a fair number of missing residues (e.g. there are missing residues between 285 and 301). PyMol doesn't draw those as dotted lines, as there are too many missing residues in between. However, once clean_pdb renumbers things, PyMol draws the dotted lines, as it no longer sees any missing residues in the gap (the numbering goes directly from 252 to 253).
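One way to check the original file for such gaps is to look for jumps in the residue numbers. The following is a minimal sketch, assuming fixed-column PDB records and ignoring insertion codes; `numbering_gaps` is a made-up name.

```python
# Illustrative helper, not part of Rosetta. Insertion codes are ignored.
def numbering_gaps(pdb_lines, chain="A"):
    """Return (last_before, first_after, n_missing) for each jump in numbering."""
    seen = []
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) >= 26 and line[21] == chain:
            resseq = int(line[22:26])
            # Record each residue number once, in file order.
            if not seen or seen[-1] != resseq:
                seen.append(resseq)
    gaps = []
    for prev, curr in zip(seen, seen[1:]):
        if curr - prev > 1:
            gaps.append((prev, curr, curr - prev - 1))
    return gaps
```

On a file that has already been renumbered this returns an empty list, because the numbers no longer jump; the physical gap is still there, which is what PyMol's dotted lines are showing.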
Ligand docking is pretty local. What's happening with the protein 10+ Å away from the binding site doesn't have much of an effect on the results. I wouldn't be too concerned about missing residues far from the binding site. While you certainly could use a program like Modeller (or RosettaCM) to try to fill in the gaps, if they're not close to the binding site it's probably not worth it.
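To judge whether a gap is "far" from the binding site, you could measure the closest atom-atom distance between the binding-site residue and the residues flanking the gap. Here is a rough sketch, assuming fixed-column PDB records and Python 3.8+ for `math.dist`; the function names are hypothetical.

```python
import math


# Illustrative helpers, not part of Rosetta. Assumes fixed-column PDB records.
def residue_coords(pdb_lines, resseq, chain="A"):
    """Collect (x, y, z) tuples for every atom of one residue."""
    coords = []
    for line in pdb_lines:
        if (line.startswith("ATOM") and len(line) >= 54
                and line[21] == chain and int(line[22:26]) == resseq):
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords


def min_distance(pdb_lines, res_a, res_b, chain="A"):
    """Smallest atom-atom distance (in Å) between two residues of one chain."""
    a = residue_coords(pdb_lines, res_a, chain)
    b = residue_coords(pdb_lines, res_b, chain)
    if not a or not b:
        raise ValueError("one of the residues has no ATOM records")
    return min(math.dist(p, q) for p in a for q in b)
```

For example, `min_distance(lines, 199, 285)` would give the distance between Cys199 and one edge of the 285-301 gap in original PDB numbering; values well above 10 Å suggest the gap is unlikely to matter much for docking.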
Ligand docking should also be able to work with protein sequences which don't start from 1 or which have gaps. These days Rosetta is pretty robust to working with oddly numbered PDB structures. The biggest issue is normally from a user perspective. Depending on the feature, various inputs and outputs give things either in PDB numbering (with a chain letter) or in "Pose numbering" (without a chain letter), which is start-at-one-no-gaps. One benefit of clean_pdb is that it makes PDB numbering and Pose numbering match up. But it's often not needed, so long as you as a user remember to do the conversion in any situation where you need to. (There shouldn't be many of those for ligand docking.)
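If you do need to convert between the two numbering schemes by hand, a lookup table can be approximated from the PDB file itself. The sketch below simply counts residues in file order; it is an approximation, since the numbering Rosetta actually assigns can differ when it skips residues (e.g. zero-occupancy ones) or loads ligands. `pose_numbering` is a made-up name.

```python
# Illustrative helper, not part of Rosetta; only approximates pose numbering.
def pose_numbering(pdb_lines):
    """Map (chain, resseq, icode) to a 1-based, gap-free index in file order."""
    mapping = {}
    for line in pdb_lines:
        if line.startswith("ATOM") and len(line) >= 27:
            key = (line[21], int(line[22:26]), line[26].strip())
            if key not in mapping:
                # First time this residue appears: give it the next index.
                mapping[key] = len(mapping) + 1
    return mapping
```

With a chain that starts at residue 25, `pose_numbering(lines)[("A", 25, "")]` would be 1 under these assumptions.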