
Difference between checkpoint file and PSSM for making fragments


A PSSM (position-specific scoring matrix) is a commonly used representation of motifs in biological sequences, and PSI-BLAST (psiblast) can be used to generate one. A portion of a PSSM looks as follows:
      A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
1 G   0 -2  0 -1 -2 -2 -2  5 -2 -4 -3 -1 -3 -3 -2  0 -2 -2 -3 -3
2 S   0 -1  4  1 -2  0  0  0  0 -3 -3  0 -2 -3 -1  3  1 -3 -2 -2
3 G   1 -2  0 -1 -2 -1 -1  5 -2 -3 -3 -1 -2 -3 -2  2  0 -3 -3 -3
4 M   2 -2 -2 -3 -1 -1 -2 -2 -2  1  2 -2  3  0 -2 -1 -1 -2 -1  0

A PSSM is often normalized with a sigmoid function, which scales the values to the range 0-1. I used to think a PSSM was similar to a checkpoint file, but I found they were different when I looked at make_fragments.pl.
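To make the sigmoid normalization concrete, here is a minimal Python sketch applied to the first PSSM row shown above. The plain logistic function 1 / (1 + exp(-x)) is assumed here; other tools may rescale the scores before applying it, and the names sigmoid and pssm_row_g are only illustrative.

```python
import math

def sigmoid(score):
    """Standard logistic function; maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Row "1 G" of the PSSM above, columns in the order A R N D C Q E G H I L K M F P S T W Y V.
pssm_row_g = [0, -2, 0, -1, -2, -2, -2, 5, -2, -4, -3, -1, -3, -3, -2, 0, -2, -2, -3, -3]

normalized = [round(sigmoid(s), 3) for s in pssm_row_g]
print(normalized)
```

A score of 0 maps to 0.5, strongly positive scores approach 1, and strongly negative scores approach 0. Note that the normalized row does not sum to 1, so it is not a probability distribution over amino acids.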

In make_fragments.pl, there is a step that parses a PSI-BLAST binary checkpoint file and generates a protein.checkpoint file (an N x 20 array of checkpoint weight values, where N is the length of the protein), as follows:

G 0.0783 0.0108 0.0337 0.0256 0.0162 0.5101 0.0135 0.0189 0.0337 0.0283 0.0094 0.0391 0.0189 0.0189 0.0229 0.0513 0.0297 0.0243 0.0054 0.0108
S 0.1099 0.0175 0.0489 0.0524 0.0209 0.0663 0.0192 0.0297 0.0541 0.0419 0.0157 0.0541 0.0297 0.0332 0.0401 0.2199 0.0820 0.0419 0.0052 0.0175
G 0.0783 0.0108 0.0337 0.0256 0.0162 0.5101 0.0135 0.0189 0.0337 0.0283 0.0094 0.0391 0.0189 0.0189 0.0229 0.0513 0.0297 0.0243 0.0054 0.0108
M 0.0522 0.0161 0.0201 0.0281 0.0482 0.0281 0.0161 0.1004 0.0361 0.1968 0.1606 0.0201 0.0161 0.0281 0.0321 0.0361 0.0402 0.0924 0.0080 0.0241
K 0.0570 0.0086 0.0415 0.0708 0.0155 0.0432 0.0207 0.0276 0.2781 0.0432 0.0155 0.0415 0.0276 0.0535 0.1071 0.0535 0.0397 0.0328 0.0052 0.0173
E 0.0552 0.0074 0.0902 0.2965 0.0166 0.0350 0.0258 0.0221 0.0755 0.0368 0.0129 0.0405 0.0258 0.0645 0.0497 0.0552 0.0368 0.0313 0.0055 0.0166
F 0.0448 0.0116 0.0155 0.0229 0.0984 0.0218 0.0096 0.0794 0.0236 0.2947 0.0403 0.0143 0.0184 0.0191 0.0234 0.0307 0.0365 0.0773 0.0097 0.0356

Each row sums to approximately 1, and parsing the checkpoint file also uses BLOSUM, as the PSSM does. However, the PSSM and checkpoint values are very different. What is each of them used for in this field, and why do they differ?
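The row-sum observation can be checked with a short Python sketch. It assumes the text layout shown above (one residue letter followed by 20 weights per line); the function name parse_checkpoint_rows is hypothetical and is not part of make_fragments.pl.

```python
# First, second and fourth rows copied from the protein.checkpoint excerpt above.
CHECKPOINT_TEXT = """\
G 0.0783 0.0108 0.0337 0.0256 0.0162 0.5101 0.0135 0.0189 0.0337 0.0283 0.0094 0.0391 0.0189 0.0189 0.0229 0.0513 0.0297 0.0243 0.0054 0.0108
S 0.1099 0.0175 0.0489 0.0524 0.0209 0.0663 0.0192 0.0297 0.0541 0.0419 0.0157 0.0541 0.0297 0.0332 0.0401 0.2199 0.0820 0.0419 0.0052 0.0175
M 0.0522 0.0161 0.0201 0.0281 0.0482 0.0281 0.0161 0.1004 0.0361 0.1968 0.1606 0.0201 0.0161 0.0281 0.0321 0.0361 0.0402 0.0924 0.0080 0.0241
"""

def parse_checkpoint_rows(text):
    """Return a list of (residue, [20 weights]) tuples, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 21:  # one residue letter + 20 weights
            continue
        rows.append((fields[0], [float(x) for x in fields[1:]]))
    return rows

rows = parse_checkpoint_rows(CHECKPOINT_TEXT)
row_sums = [sum(weights) for _, weights in rows]

for (residue, _), total in zip(rows, row_sums):
    print(residue, round(total, 4))
```

Rows that sum to about 1 behave like per-position probability distributions over the 20 amino acids, which the integer PSSM rows do not.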

Fri, 2014-03-21 22:00
qlj

They contain similar information, but the way it's represented is different.

For example, do you store probabilities, raw counts, or perhaps log odds? Do you do any scaling? Do you add pseudocounts? Do you do any rounding of the values?

My understanding of most PSSMs is that they're normally represented as log-odds values, scaled by some factor, and then rounded to the nearest integer. This may be the ideal format for some purposes, but not for others.

For the parsed PSI-BLAST checkpoint file, the numbers you get are the "weight" values from PSI-BLAST, with missing observations "filled in" with generalized frequency values. My understanding is that the numbers are analogous to the frequency of each amino acid at that position. These can be further manipulated into the standard PSSM log-odds values, but for the use make_fragments.pl puts them to, the frequency values are slightly more useful.
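As a rough illustration of how frequency-like values relate to integer log-odds scores, here is a Python sketch. It is not PSI-BLAST's actual procedure: the uniform background of 0.05 per amino acid and the scale factor of 2 (half-bit units) are simplifying assumptions, and the column order is only inferred from the sample rows above (the largest weights in the G, S, K and E rows fall where those residues sit in alphabetical one-letter order).

```python
import math

# Column order inferred from the sample checkpoint rows; treat it as an assumption.
CHECKPOINT_ORDER = "ACDEFGHIKLMNPQRSTVWY"

# Placeholder background: uniform over 20 amino acids (real tools use observed frequencies).
UNIFORM_BACKGROUND = [0.05] * 20

def to_log_odds(frequencies, background, scale=2.0, floor=1e-6):
    """Convert per-position frequencies to scaled, rounded log2-odds scores."""
    scores = []
    for freq, bg in zip(frequencies, background):
        ratio = max(freq, floor) / bg  # floor guards against log(0)
        scores.append(round(scale * math.log2(ratio)))
    return scores

# First row ("G") of the protein.checkpoint excerpt above.
checkpoint_row_g = [0.0783, 0.0108, 0.0337, 0.0256, 0.0162, 0.5101, 0.0135, 0.0189,
                    0.0337, 0.0283, 0.0094, 0.0391, 0.0189, 0.0189, 0.0229, 0.0513,
                    0.0297, 0.0243, 0.0054, 0.0108]

scores = to_log_odds(checkpoint_row_g, UNIFORM_BACKGROUND)
for residue, score in zip(CHECKPOINT_ORDER, scores):
    print(residue, score)
```

With these assumptions the scores will not match the PSSM shown in the question exactly, but the pattern is the same: the dominant residue gets a large positive score and rarely observed residues get negative ones. The rounding step also shows why the integer PSSM carries less detail than the checkpoint weights.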

For details, the PSI-BLAST paper ( http://nar.oxfordjournals.org/content/25/17/3389.long ) is probably a good place to start. As for PSSMs, they're usually constructed to be analogous to the BLOSUM matrices (http://en.wikipedia.org/wiki/BLOSUM and http://www.pnas.org/content/89/22/10915), but on a per-position basis rather than an amino acid identity basis.

Sat, 2014-03-22 11:40
rmoretti