I'm trying to use ab initio modelling to predict the structure of a ~140 AA natively disordered domain of a protein, which I plan to stitch onto the rest of the protein later. I produced 100,000 model structures following this tutorial (https://www.rosettacommons.org/docs/latest/application_documentation/structure_prediction/abinitio-relax), except without the -kill-hairpins option since it’s intrinsically disordered. I then used Calibur to cluster the structures, taking subsets ranked by score, with the following thresholds:
All – 14.01284
Top 10,000 – 13.79987
Top 5,000 – 13.75388
Top 1,000 – 13.65071
Top 500 – 13.63761
These almost always gave two top clusters with margin < 10%. I’m new to this and was worried that this might indicate that the modelling isn’t working since the thresholds are still high after this many models. Additionally, clustering with a threshold of 4.0 gives clusters of size 1.
Is a threshold this high expected, or is there something I should change to potentially get better models? Also, does the potential for multiple states (e.g. a two-state model) interfere with clustering?
Thanks in advance!
The phrase *intrinsically disordered* sticks out to me.
First off, Rosetta ab initio isn't really designed for modeling intrinsically disordered proteins. There are a number of assumptions in the settings that are tuned for folding globular proteins and won't necessarily carry over to disordered ones.
Secondly, I wouldn't necessarily expect a disordered protein to converge onto clusters without a *large* amount of sampling. The conformational space accessible to a disordered protein is much, much larger than that of a globular protein. Rosetta ab initio relies on the energy funnel seen in well-folded proteins to help guide the independent Monte Carlo runs toward the low-energy, native-like state. Absent that energy funnel (which will be lacking in disordered proteins), the sampling will be spread out across a large number of states.
I don't have a good sense of the state space for a disordered protein, but it may be that 100,000 structures are just too sparse to get reasonable convergence in any one location.
Instead, I might suggest omitting the clustering step from your run (as there's no guarantee that an intrinsically disordered protein will form clusters at all), and treating all the low-energy structures as a representative ensemble of possible states. You can then run statistics over the set (e.g. looking at the distribution of particular atom-atom distances). Statistics over large ensembles are a better way of thinking about intrinsically disordered proteins than attempting to get one or a few "representative" structures.
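As a rough illustration of the ensemble-statistics idea, here is a minimal Python sketch that collects one CA-CA distance across a set of models and summarizes its distribution. It assumes the models are plain PDB-format text with standard fixed-column ATOM records; the function names (`ca_coords`, `ca_distance`, `distance_summary`) are made up for this example and are not part of Rosetta.

```python
import math
import statistics


def ca_coords(pdb_text):
    """Return {residue_number: (x, y, z)} for the CA atoms in a PDB-format string.

    Assumes standard fixed-column ATOM records and a single chain.
    """
    coords = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            resnum = int(line[22:26])
            coords[resnum] = (
                float(line[30:38]),
                float(line[38:46]),
                float(line[46:54]),
            )
    return coords


def ca_distance(pdb_text, res_i, res_j):
    """Distance between the CA atoms of two residues in one model."""
    coords = ca_coords(pdb_text)
    return math.dist(coords[res_i], coords[res_j])


def distance_summary(models, res_i, res_j):
    """Summarize one CA-CA distance over an ensemble of PDB-format strings."""
    dists = [ca_distance(m, res_i, res_j) for m in models]
    return {
        "n": len(dists),
        "mean": statistics.fmean(dists),
        "stdev": statistics.pstdev(dists),
        "min": min(dists),
        "max": max(dists),
    }
```

To try it on real decoys, read each PDB file into a string (for example with `pathlib.Path.read_text`) and pass the list to `distance_summary` along with the two residue numbers you care about. A histogram of the individual distances is usually more informative than the summary alone, since a disordered ensemble may well be broad or multimodal.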
But again, be wary of the results of this. As mentioned, Rosetta ab initio isn't necessarily benchmarked for intrinsically disordered proteins, so it might not work quite as well as you would hope. I'd recommend doing some "positive control" experiments, calculating values for which you already know the answer, to see if the computational results match the experimental ones.
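One way such a positive control could look, purely as a sketch: compute an ensemble-level observable from the models and compare it with a measured value. The example below uses radius of gyration as the observable, which is my assumption and not something from your system; any quantity you have experimental data for would do. It treats every atom as equal mass (e.g. CA atoms only), and the function names are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Radius of gyration of a list of (x, y, z) points, equal weights assumed."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    mean_sq = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    ) / n
    return math.sqrt(mean_sq)


def ensemble_rg(models):
    """Root-mean-square Rg over an ensemble (one coordinate list per model).

    Averages Rg^2 rather than Rg, the usual convention when comparing
    against scattering-derived values.
    """
    mean_rg_sq = sum(radius_of_gyration(m) ** 2 for m in models) / len(models)
    return math.sqrt(mean_rg_sq)


def relative_deviation(predicted, experimental):
    """Signed fractional difference between a predicted and a measured value."""
    return (predicted - experimental) / experimental
```

If the predicted value lands far from the measured one, that is a sign the ensemble is not capturing the real behavior, and other statistics drawn from it should be treated with corresponding caution.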