Q1) For a protein of <100 residues, is it "enough" to generate 20,000 structures before comparing/clustering?
Q2) (I apologize if the question is beyond this forum) Are there any formal publications that directly or indirectly mention the "minimum/typical" number of conformations needed?
Thank you in advance,
I presume you're referring to doing ab initio folding -- different protocols will require a different amount of sampling.
For that matter, different proteins will require a different amount of sampling, even under the same protocol. Generally, the larger the protein, the more sampling is needed to get equivalently good results.
To get a sense of the number of samples needed, I'd recommend taking a look at the CASP papers which have come out of David Baker's lab. That should give you a good sense of how many structures were produced for production runs on a diverse set of proteins of varying sizes. The one caveat there is that the Baker Lab runs Rosetta@Home, so they have access to much more computational power than most other labs are likely to have. So they're likely to go for "thorough" rather than just "sufficient" when it comes to number of structures.
An approach that might work for you, especially if you have computational limitations, is to take a progressive approach. Run X,000 structures. Score/cluster/examine/post process, and get your results. Then run an additional Y,000 structures. Combine all the (X+Y),000 structures and repeat the analysis. Do you get the same results? Are things marginally better? Do you sample new structures? If so, you can continue, progressively adding more and more structures to your sampling set until adding additional structures no longer improves/changes the result enough to justify spending additional computational time.
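The progressive approach above can be sketched as a simple convergence loop. This is only an illustration, not Rosetta code: `generate_batch` is a hypothetical stand-in for producing and scoring a batch of decoys, and the stopping rule here just watches the best (lowest) score, whereas in practice you would also compare clustering results between rounds.

```python
import random


def generate_batch(n, seed=None):
    # Hypothetical stand-in for an ab initio run: in a real workflow each
    # value would be the Rosetta score of a generated decoy. Here we just
    # draw simulated scores so the loop is self-contained and runnable.
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def progressive_sampling(batch_size=1000, tol=0.01, max_batches=20):
    """Add batches of decoys until extra sampling stops improving the result.

    Stops when the best score improves by less than `tol` after a new batch,
    mirroring the "run X,000 more, re-analyze, compare" strategy.
    """
    scores = []
    best_prev = float("inf")
    best_now = float("inf")
    for batch in range(max_batches):
        scores.extend(generate_batch(batch_size, seed=batch))
        best_now = min(scores)  # lower Rosetta score = better
        if best_prev - best_now < tol:
            break  # converged: additional structures no longer change the answer
        best_prev = best_now
    return len(scores), best_now
```

In a real pipeline the convergence test would be richer than a single number: you might check whether the largest clusters keep the same members, or whether new low-scoring basins still appear, before deciding that more sampling isn't worth the compute.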
Thank you so much, a truly informative answer!