match_dspos1 and other matcher options

I am just wondering about the options for matching. Particularly, the manual says:

"There are two flavors of matches that should be mentioned [...] The second definition is that a match is defined as the unique combination of the upstream portion of N hits, and the unique location of the downstream partner as defined by a particular geometric constraint. In this case, multiple groupings of hits that are enumerated can map to the same match. This second definition, in the code referred to as a "match_dspos1" since the down-stream (ds) position is defined by a single geometric constraint, reduces the number of matches that need to be considered for output considerably; this makes matching faster and avoids the problem of producing thousands of nearly-identical output files which frustrate post-processing."

But the manual doesn't seem to clearly indicate which options enable or disable match_dspos1.

Also, a question for veterans of matcher and enzdes: in your experience, what are the best options to enable to speed up matching and minimize redundant or very similar results, while keeping as many unique matches as possible? How many matches are considered reasonable for subsequent analysis with enzdes? What is the best grouper to use for subsequent enzdes?

Thank you for your advice!

Sat, 2012-10-20 09:22
petrikigor

I get the vague impression that it's on by default. I've sent this along to some of the enzyme folks.

Sat, 2012-10-20 09:41
smlewis

To my understanding, that paragraph is simply describing the internal workings of the match outputter machinery, to help you understand the various options for outputting matches. The naive approach is to output one file for each combination found by the matcher. The problem with that is that if you've done fine sampling, you could have two matches which differ only by a slight rotamer shift. If you're going to run enzyme_design on your output matches, this is superfluous, as you'll sample both of those during design anyway. But you wouldn't want to scale down the fineness of the rotamer sampling, as you might miss matches which aren't necessarily as "wide" as the oversampled one. The solution is to tweak the match outputter to combine those matches which are more-or-less equivalent from a redesign perspective.

The two techniques are controlled by the -match:output_format command-line option. The value "PDB" gives the first, exhaustive output, and the value "CloudPDB" gives the second. By default in Rosetta 3.3 and 3.4 (and possibly in earlier versions) this is CloudPDB. From what I can tell, pretty much everyone in the Baker lab uses the CloudPDB output technique. The "PDB" output just creates too many files, for little to negative gain. (Not only do you need to store the files on disk, but you also lengthen your design run, as you need to process each separate file.) Note that the CloudPDB output comes at a cost - it's a slightly specialized format, so you need support to load it. The enzyme_design application should be able to read it natively, but if you're using RosettaScripts, you need to use the command-line option -parser_read_cloud_pdb to get it to work correctly.

Regarding how many matches you need, that entirely depends on your system. It's not just the number of matches, it's getting matches that are designable (and no, there's no a priori way of telling which matches are designable). It also depends on how degenerate your constraints are - if you have a lot of atom_type constraints where you're ambiguously matching E/D, Q/N, S/T/Y, F/W/Y, etc., you'll probably want more matches to cover the possibilities than if you're matching a more precisely defined constraint (enzyme design won't really swap the identity of constrained amino acids).

Mon, 2012-10-22 10:48
rmoretti
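
To make the grouping idea from the reply above concrete, here is a toy Python sketch. It is not Rosetta's algorithm or data model: the `Hit` type, its fields, and the choice of grouping key (same residue type at the same scaffold position, ignoring the rotamer) are all assumptions made for illustration. It only shows how two matches that differ by a rotamer shift can collapse into one group while the full list of members is kept.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical stand-in for one matcher hit (not a Rosetta class)."""
    position: int   # scaffold residue number
    resname: str    # three-letter residue name
    rotamer: int    # index of the sampled rotamer


def group_matches(matches):
    """Group matches that place the same residue types at the same positions.

    The rotamer index is deliberately left out of the key, so matches that
    differ only by a rotamer shift end up in the same group.
    """
    groups = defaultdict(list)
    for match in matches:
        key = tuple((hit.position, hit.resname) for hit in match)
        groups[key].append(match)
    return dict(groups)


matches = [
    (Hit(40, "MET", 3), Hit(122, "ASN", 7)),
    (Hit(40, "MET", 4), Hit(122, "ASN", 7)),  # differs only by a rotamer shift
    (Hit(40, "MET", 3), Hit(51, "ASN", 2)),
]

groups = group_matches(matches)
for key, members in groups.items():
    print(key, "->", len(members), "match(es)")
```

In this toy example, writing one file per group instead of one file per match reduces three outputs to two, which is the kind of saving the grouped output is meant to provide.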

Thanks for your advice!

I am getting a lot of match groups (>400), but my constraints are fairly degenerate. I'm trying to test enzyme design on some of these now.

My next issue comes in with the enzdes application (Rosetta 3.4):

There seems to be some incompatibility with enzyme design when using variable blocks in match.
If I use the same cst file that I used for match (with variable blocks), the following error occurs:
protocols.toolbox.match_enzdes_util.EnzConstraintIO: read enzyme constraints from ../HASs_MC_NQbb_NQbb_KR_1_match.cst ... done, 4 cst blocks were read.
protocols.toolbox.match_enzdes_util.EnzConstraintIO: Generating constraints for pose...
protocols.toolbox.match_enzdes_util.EnzConstraintParameters: for block 1, 1 newly generated constraints were added
protocols.toolbox.match_enzdes_util.EnzConstraintIO: checking cst data consistency for block 1... done
protocols.toolbox.match_enzdes_util.EnzConstraintIO: Cst Block 1done...
Error: residue GLN2found in pdb header is not allowed by data in cstfile.
ERROR:: Exit from: src/protocols/toolbox/match_enzdes_util/EnzCstTemplateRes.cc line: 272
protocols.toolbox.match_enzdes_util.EnzConstraintIO: checking cst data consistency for block 2...

I also tried editing the cst file so that it only contained the blocks that were used to make a particular match, and feeding that into enzdes for that match. I get the following error in that case:
protocols.toolbox.match_enzdes_util.EnzConstraintIO: read enzyme constraints from ../HASs_M_Q_N_K.cst ... done, 4 cst blocks were read.
ERROR: The external geometry ID specified for cst block 1 is larger than the number of sub-blocks for that block in the cst file.
ERROR:: Exit from: src/protocols/toolbox/match_enzdes_util/EnzConstraintIO.cc line: 238

What is the preferred protocol for enzyme design after variable cst blocks were used in match?

Wed, 2012-10-24 12:40
petrikigor

Well I think I figured that out - here is the header of one of my cloud PDB files:
MODEL 1
REMARK 666 MATCH TEMPLATE X HAS 0 MATCH MOTIF A MET 40 1 2
REMARK 666 MATCH TEMPLATE X HAS 0 MATCH MOTIF A ASN 122 2 2
REMARK 666 MATCH TEMPLATE X HAS 0 MATCH MOTIF A ASN 51 3 2
REMARK 666 MATCH TEMPLATE X HAS 0 MATCH MOTIF A LYS 131 4 2

As you can see, for each of the blocks it recorded that the sub-block used was 2. The problem is that it is NOT 2 for all of them: ASN and LYS are in sub-block 1 of their respective variable blocks. In other words, the PDB was telling enzdes to look at the wrong sub-blocks.

So now the question is why doesn't match put the correct sub-block number and how to fix it... I don't even know where to start. Where is the code that writes the header remarks for CloudPDBs?

Wed, 2012-10-24 14:08
petrikigor
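
The REMARK 666 header lines quoted in the post above can be checked across many match files with a short script. The Python sketch below is a minimal parser, assuming the field layout implied by this thread: the last two numbers on each line are the cst block and the sub-block recorded for it. The `MatchRemark` type and `parse_match_remarks` function are hypothetical names, not part of Rosetta, and splitting on whitespace assumes that no field (such as the chain ID) is blank.

```python
from typing import NamedTuple


class MatchRemark(NamedTuple):
    """One REMARK 666 line; field meanings are assumed from the thread."""
    ligand_chain: str
    ligand_name: str
    ligand_resnum: int
    chain: str
    resname: str
    resnum: int
    cst_block: int
    sub_block: int


def parse_match_remarks(lines):
    """Return a MatchRemark for every well-formed REMARK 666 line."""
    remarks = []
    for line in lines:
        tokens = line.split()
        if tokens[:4] != ["REMARK", "666", "MATCH", "TEMPLATE"]:
            continue  # not a match header line
        if len(tokens) < 14 or tokens[7:9] != ["MATCH", "MOTIF"]:
            continue  # unexpected layout, skip rather than guess
        remarks.append(MatchRemark(
            tokens[4], tokens[5], int(tokens[6]),
            tokens[9], tokens[10], int(tokens[11]),
            int(tokens[12]), int(tokens[13]),
        ))
    return remarks


header = """\
MODEL 1
REMARK 666 MATCH TEMPLATE X HAS 0 MATCH MOTIF A MET 40 1 2
REMARK 666 MATCH TEMPLATE X HAS 0 MATCH MOTIF A ASN 122 2 2
REMARK 666 MATCH TEMPLATE X HAS 0 MATCH MOTIF A ASN 51 3 2
REMARK 666 MATCH TEMPLATE X HAS 0 MATCH MOTIF A LYS 131 4 2
""".splitlines()

remarks = parse_match_remarks(header)
for r in remarks:
    print(f"block {r.cst_block}: {r.resname} {r.resnum} -> sub-block {r.sub_block}")
```

Comparing the printed sub-block for each block against the sub-block of the cst file that actually allows that residue type makes a mismatch like the one described here easy to spot.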

The code that handles CloudPDB output is the protocols::match::output::CloudPDBWriter::write_match_groups() function, in rosetta_source/src/protocols/match/output/PDBWriter.cc, but the header output is in protocols::match::output::PDBWriter::assemble_remark_lines() (in the same file) - though be warned that the block number gets assigned far upstream of those. My guess is that the issue is likely to be somewhere around where it's first stored, which is handled by the protocols::match::Matcher class (see rosetta_source/src/protocols/match/Matcher.cc).

Wed, 2012-10-24 20:05
rmoretti