Difference between revisions of "Mutation Data Matrices"
(21 intermediate revisions by one other user not shown) | |||
Line 1: | Line 1: | ||
+ | <div style="padding: 5px; background: #A6AFD0; border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;"> | ||
+ | Mutation Data Matrices | ||
+ | </div> | ||
+ | |||
+ | | ||
+ | | ||
+ | |||
+ | |||
+ | |||
;Mutation Data Matrices are computational tools to quantify how well an observed pair-exchange conforms to a particular model of exchanges, usually a model of evolution. | ;Mutation Data Matrices are computational tools to quantify how well an observed pair-exchange conforms to a particular model of exchanges, usually a model of evolution. | ||
One really has to understand the above sentence, if one wants to '''understand''' the whole business of sequence aligning and similarity scores in bioinformatics. We have algorithms that compute ''optimal sequence alignments'' - either global (Needleman-Wunsch), or local (Smith-Waterman). But what does ''optimal'' really mean? Optimal means that the alignment gives us the highest score that is possible, '''for a given mutation matrix'''. Actually: for a given matrix plus a given set of parameters for insertions and deletions, but since the judicious choice of these parameters depends on the matrix as well, for this discussion here we will assume it is only the matrix. | One really has to understand the above sentence, if one wants to '''understand''' the whole business of sequence aligning and similarity scores in bioinformatics. We have algorithms that compute ''optimal sequence alignments'' - either global (Needleman-Wunsch), or local (Smith-Waterman). But what does ''optimal'' really mean? Optimal means that the alignment gives us the highest score that is possible, '''for a given mutation matrix'''. Actually: for a given matrix plus a given set of parameters for insertions and deletions, but since the judicious choice of these parameters depends on the matrix as well, for this discussion here we will assume it is only the matrix. | ||
− | So since everything depends on the matrix, what '''is''' such a matrix? It is an ordered set of numbers that associates every possible pair of amino acids with a score. The crux of the matter is how one comes up with such a score. '''Ultimately, all such scores measure the probability that two amino acids are related by evolution.''' | + | So since everything depends on the matrix, what '''is''' such a matrix? It is an ordered set of numbers that associates every possible pair of amino acids with a score. The crux of the matter is how one comes up with such a score. '''Ultimately, all such scores measure the probability that two amino acids are related by evolution.''' And in order to that, a quantitative, computable model of the evolutionary process has to be defined in order to construct such a matrix. Since we can devise several alternative computational models of how to do such a measurement, several different matrices are in common use. Which one to use depends entirely on which model of evolution is the most appropriate for the question at hand. Here is a selction of models: |
− | *We can argue that homologous sequences should not have mutated at all. Thus we should assign a value of 1 for identical amino acids, a value of 0 for mutations. This is the '''identity matrix'''. | + | *We can argue that homologous sequences should not have mutated at all. Thus we should assign a value of 1 for identical amino acids, a value of 0 for mutations. This is the '''identity matrix'''. It is appropriate only for sequences with a very high degree of identity. |
*We can argue that evolution proceeds by point mutations in the coding sequence. Thus the number of nucleotide changes required to change one codon into another should be small. We score amino acids with similar codons highly, those with dissimilar codons poorly. This is the '''genetic code matrix'''. | *We can argue that evolution proceeds by point mutations in the coding sequence. Thus the number of nucleotide changes required to change one codon into another should be small. We score amino acids with similar codons highly, those with dissimilar codons poorly. This is the '''genetic code matrix'''. | ||
− | *We can argue that selection after mutation favours similar amino acids. Thus if two amino acids are related, they have a higher probability of being similar than being dissimilar. Similarity can be defined from first principles with some quantitative measure of size, charge, hydrophopicity etc. This lead us to a '''biophysical matrix'''. | + | *We can argue that selection after mutation favours similar amino acids. Thus if two amino acids are related by mutation and selection, they have a higher probability of being biophysically similar than being dissimilar. Similarity can be defined from first principles with some quantitative measure of size, charge, hydrophopicity etc. This lead us to a '''biophysical matrix'''. |
− | However all these arguments from first principles have associated problems. An '''empirical approach''' | + | However all these arguments from first principles have associated problems. The identity matrix does not capture the fact that related sequences have a higher likelihood of conservaive substitutions. The genetic code matrix does not capture similar properties well for some pairs: e.g. glycine (GGG) → tryptophan (TGG). The biophysical matrix cannot take context into account, but it is context that determines of whether e.g. tyrosine is consevred because it is similar to phenylalanine (hydrophobic) or because it is an H-bonf acceptor (polar). An alternative is the '''empirical approach''': observe what nature allows as replacements in amino acid pairs from sequences that we '''know''' to be homologous. |
− | This approach postulates that once the evolutionary relationship of two sequences is established, the residues that did exchange can be called '''similar'''. This is the principle behind the mutation matrices compiled by Margaret O. Dayhoff and colleagues at the National Biomedical Research Foundation in the early 70s (Dayhoff, M.O. et al. (1978) Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3 National Biomedical Research Foundation, Washington D.C. U.S.A). They developed a precise and rigorous approach to implement a model of evolutionary change in their mutation data matrix.Their model allows to quantify the odds that a given alignment of two protein sequences would be observed by chance or would | + | This approach postulates that once the evolutionary relationship of two sequences is established, the residues that did exchange can be called '''similar'''. This is the principle behind the mutation matrices compiled by Margaret O. Dayhoff and colleagues at the National Biomedical Research Foundation in the early 70s (Dayhoff, M.O. et al. (1978) Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3 National Biomedical Research Foundation, Washington D.C. U.S.A). They developed a precise and rigorous approach to implement a model of evolutionary change in their mutation data matrix.Their model allows to quantify the odds that a given alignment of two protein sequences would either be observed by chance or would reflect divergent evolution from a common ancestral sequence. |
− | We have since obtained | + | We have since obtained a vastly larger number of sequences, and we can compile directly some of the numbers that Dayhoff had to infer from the limited dataset she had available. But the principle of the process is necessarily the same for any related construction of a computational tool that quantifies probabilities. It is a paradigm for the construction of any mutation data matrix. |
<br> | <br> | ||
Line 21: | Line 30: | ||
− | How is the Dayhoff mutation data matrix constructed ? Since we are looking for a matrix that will | + | How is the Dayhoff mutation data matrix constructed ? Since we are looking for a matrix that will quantify the probabiltiy of an evolutionary relationship, we must first define a quantitative model of evolution. The model used by Dayhoff states that proteins evolve through a succesion of independent point mutations that are accepted in a population and subsequently can be observed in the sequence pool. Formally, the fact that the mutations are treated independently of their neigborhood and of their history makes this a Markov model. We can define an evolutionary distance between two sequences as the number of point mutations that was necessary to evolve one sequence into the other. (More precisely, the distance is the number of accepted mutations.) Two aspects of this process cause the '''evolutionary distance''' to be unequal in general to the number of '''observed differences''' between the sequences: |
− | *First, there is a chance that a certain residue may have mutated, than reverted, hiding the effect of the mutation. This phenomenon is important in the evaluation of biological clocks and the question of how many mutational events may become fixed per time unit | + | *First, there is a chance that a certain residue may have mutated, than reverted, hiding the effect of the mutation. This phenomenon is especially important in the evaluation of biological clocks and the question of how many mutational events may become fixed per time unit. |
− | *Second, specific residues may have mutated more than once, thus the number of point mutations is likely to be larger than the number of differences between the two sequences | + | *Second, specific residues may have mutated more than once, thus the number of point mutations is likely to be larger than the number of differences between the two sequences. |
+ | Both factors combine so that sequence identity drops progressively more slowly as the number of accepted mutations increases. Even when very many mutations have been accepted, sequence identity will remain around 5%. | ||
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
Line 39: | Line 49: | ||
<br> | <br> | ||
− | In order to identify accepted point mutations, a complete phylogenetic tree including all ancestral sequences has to be constructed. (There are standard procedures for this.) To avoid a large degree of ambiguities in this step, Dayhoff and colleagues restricted their analysis to sequence families with more than 85% identity. Since the evolutionary distance between these highly | + | In order to identify accepted point mutations, a complete phylogenetic tree including all ancestral sequences has to be constructed. (There are standard procedures for this.) To avoid a large degree of ambiguities in this step, Dayhoff and colleagues restricted their analysis to sequence families with more than 85% identity. Since the evolutionary distance between these highly similar proteins is small, the construction of the phylogenetic tree can be achieved without too many complicating assumptions. For each of the observed and inferred sequences, the amino acid pair exchanges are tabulated into a 20x20 matrix. It is assumed, that the likelihood of an amino-acid ''i'' being replaced by an amino acid ''j'' is the same as ''j'' replacing ''i''. Hence the matrix is symmetric. (This assumption is probably largely true, it is equivalent to postulating that the amino acid distributions are in equilibrium regarding the evolutionary process.) Note that this process is different from comparing observed sequences directly with each other; a large number of inferred sequences are added to the data. |
− | ''A<sub>ij</sub>'' is the number of accepted mutations observed | + | ''A<sub>ij</sub>'' is the number of accepted mutations observed in which amino acid ''i'' replaces amino acid ''j''. |
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
+ | |||
==Second step: Frequencies of occurrence== | ==Second step: Frequencies of occurrence== | ||
</div> | </div> | ||
<br> | <br> | ||
− | + | Since the properties of amino acids differ and since they occur with different frequencies in proteins, all statements we can make about the average properties of sequences will depend on the '''frequencies of occurence''' of the individual amino acids. These frequencies of occurence are approximated by the frequencies of observation. They are the number of occurences of a given amino acid divided by the total number of amino acids that have been observed. Writing ''aa<sub>i</sub>'' for the ''i''-th amino acid | |
<math>f_i=\frac{\mbox{observations of }aa_i}{\mbox{all observations}}</math> | <math>f_i=\frac{\mbox{observations of }aa_i}{\mbox{all observations}}</math> | ||
Line 55: | Line 66: | ||
<math>\,\sum_{i=aa} f_i=1</math> | <math>\,\sum_{i=aa} f_i=1</math> | ||
− | The appropriate frequencies | + | The appropriate frequencies do depend on the organism under consideration (yes, scoring matrices are improved if they take organism-specific amino acid frequencies into account) and they may depend on the type of amino acid, e.g. amino acid frequencies for membrane proteins are different from those of soluble proteins. However from a practical point of view the theoretical increase in accuracy does not appear sufficient to justify the signifcant additional computational overhead that the work with specialised scoring matrices would require. |
=====Amino acid frequencies===== | =====Amino acid frequencies===== | ||
Line 92: | Line 103: | ||
</div> | </div> | ||
<br> | <br> | ||
+ | |||
+ | To obtain a complete picture of the mutational process, the amino-acids that do not mutate must be taken into account as well. We need to know: what is the chance, on average, that a given amino acid will mutate at all. This is the relative mutability of the amino acid. It is obtained by multiplying the number of observed changes by the amino acid's frequency of occurence. | ||
+ | |||
+ | <math>\,m_i=f_i(\mbox{number of times }aa_i\mbox{ is observed to change})</math> | ||
+ | |||
+ | where <math>m_{Ala}</math> is arbitrarily set to 100%. | ||
+ | |||
+ | =====Relative mutabilities===== | ||
+ | <br> | ||
+ | <table style="border:1px solid #000000;" cellpadding="5" cellspacing="0"> | ||
+ | <tr style="background:#BDC3DC;"><td style="border-bottom:1px solid #000000;">'''AA'''</td><td style="border-bottom:1px solid #000000;">'''1978<sup> a</sup>'''</td><td style="border-bottom:1px solid #000000;">'''1991<sup> b</sup>'''</td></tr> | ||
+ | |||
+ | <tr><td>S</td><td><tt> 120 </tt></td><td><tt> 117 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>T</td><td><tt> 97 </tt></td><td><tt> 107 </tt></td></tr> | ||
+ | <tr><td>N</td><td><tt> 134 </tt></td><td><tt> 104 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>I</td><td><tt> 96 </tt></td><td><tt> 103 </tt></td></tr> | ||
+ | <tr style="background: #BDC3DC;"><td>A</td><td><tt> 100 </tt></td><td><tt> 100 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>V</td><td><tt> 74 </tt></td><td><tt> 98 </tt></td></tr> | ||
+ | <tr><td>M</td><td><tt> 94 </tt></td><td><tt> 93 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>H</td><td><tt> 66 </tt></td><td><tt> 91 </tt></td></tr> | ||
+ | <tr><td>D</td><td><tt> 106 </tt></td><td><tt> 86 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>Q</td><td><tt> 93 </tt></td><td><tt> 84 </tt></td></tr> | ||
+ | <tr><td>R</td><td><tt> 65 </tt></td><td><tt> 83 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>E</td><td><tt> 102 </tt></td><td><tt> 77 </tt></td></tr> | ||
+ | <tr><td>K</td><td><tt> 56 </tt></td><td><tt> 72 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>P</td><td><tt> 56 </tt></td><td><tt> 58 </tt></td></tr> | ||
+ | <tr><td>L</td><td><tt> 40 </tt></td><td><tt> 54 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>F</td><td><tt> 41 </tt></td><td><tt> 51 </tt></td></tr> | ||
+ | <tr><td>G</td><td><tt> 49 </tt></td><td><tt> 50 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>Y</td><td><tt> 41 </tt></td><td><tt> 50 </tt></td></tr> | ||
+ | <tr><td>C</td><td><tt> 20 </tt></td><td><tt> 44 </tt></td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>W</td><td><tt> 18 </tt></td><td><tt> 25 </tt></td></tr> | ||
+ | |||
+ | </table> | ||
+ | |||
+ | <small>All values are taken relative to alanine, which is arbitrarily set at 100. As above: '''a)''' Frequencies taken from Dayhoff (1978); '''b)''' Frequencies taken from the 1991 recompilation of the mutation matrices by Jones ''et al.'' ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=1633570&query_hl=1&itool=pubmed_docsum Jones, D.T. Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282]). The difference for some amino acids are quite significant, especially for those amino acids, for which hardly any exchanges have been observed in 1978 (C and W). Serine and threonine are the most mutable amino acids, cysteine and tryptophan are the least likely to be succesfully replaced in evolution.</small> | ||
+ | |||
+ | |||
+ | |||
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
+ | |||
==Fourth step: Mutation probability matrix== | ==Fourth step: Mutation probability matrix== | ||
</div> | </div> | ||
<br> | <br> | ||
+ | With the numbers we have calculated and compiled above we can quantify the probability that an amino acid ''i'' will replace an amino acid ''j'' in an alignment of two '''homologous''' sequences: it is the mutability of amino acid ''j'', multiplied by the ''<u>relative</u> pair exchange frequency'' for ''i'' ← ''j'' (i.e. the pair exchange frequency for ''i←j'' divided by the sum of all pair exchange frequencies for amino acid ''j''). | ||
+ | |||
+ | <math>M_{ij}=m_j\frac{A_{ij}}{\sum_{i=aa}{A_{ij}}}</math> | ||
+ | |||
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
+ | |||
==Fifth step: Evolutionary distance scale== | ==Fifth step: Evolutionary distance scale== | ||
</div> | </div> | ||
+ | |||
+ | Since the diagonal elements of the matrix <math>M_{ii}</math> represent the probability for an amino acid to remain conserved, if we scale all cells of our matrix by a constant factor <math>\lambda</math> we can scale the matrix to reflect a specific overall probability of change. We may chose <math>\lambda</math> so that the expected number of changes is 1 %, this matrix then represents the '''exchange probabilities for the evolutionary distance of 1 PAM'''. An average protein will have the composition: | ||
+ | |||
+ | <math>n_i=f_iN</math> | ||
+ | |||
+ | |||
+ | with <math>N</math> being the length of the protein and <math>n_i</math> the number of amino acids of a certain type. If we take <math>N</math> to be 100, <math>n_i</math> will equal the percentage of aminoa acids. The percentage of amino acids of type <math>i</math> that will change in the evolutionary interval represented by a given matrix is | ||
+ | |||
+ | <math>n_i(1-M_{ii})</math> ; | ||
+ | |||
+ | |||
+ | and the number of total changes - '''PAM''' - represented by the matrix is the sum over all individual changes. We can use this sum to rescale the <math>M_{ii}</math> so that the percentage of change is 1, then the matrix represents one PAM, i.e. <math>\lambda</math> can be calculated as: | ||
+ | |||
+ | <math>\frac{1}{\lambda}=\sum_i n_i(1-M_{ii})</math> | ||
+ | |||
+ | |||
+ | with the scaling factor <math>\lambda</math>, the mutation probability elements become | ||
+ | |||
+ | |||
+ | <math>M_{ij}=\lambda m_j\frac{A_{ij}}{\sum_{i=aa}{A_{ij}}}</math> | ||
+ | |||
+ | |||
+ | and the diagonal elements become | ||
+ | |||
+ | |||
+ | <math>M_{ii}=1-\lambda m_i</math> | ||
+ | |||
+ | |||
+ | It is apparent that the number of amino acids that are expected to change according to the matrix depends only on the factor <math>\lambda</math>. But this is correct only for the evolutionary distances from which the data were compiled. In the case of the Dayhoff matrices, where sequences with less than 15 % differences were used, scaling for 1 to 5 PAM should be permissible, from the original data. For higher evolutionary distances, extrapolation can not be simply done by adjusting the <math>M_{ii}</math> with a scaling factor <math>\lambda</math>, since this would neglect overlapping mutations. | ||
+ | |||
+ | In the framework of Dayhoff's model, '''a mutation probability matrix for any distance can be obtained by multiplying the 1 PAM matrix with itself the required number of times'''. Thus the 3 PAM matrix is simply the cube of the 1 PAM matrix. Obviously, the accuracy of this process of extrapolation will depend on the accuracy of the 1 PAM matrix - but up to that accuracy, it is rigorously correct. | ||
+ | |||
+ | Two facts should be pointed out: | ||
+ | |||
+ | The matrix at time PAM 0 is simply the unitary matrix. PAM 0 means that identities are certain and exchange is impossible: | ||
+ | |||
+ | <math>M^0= | ||
+ | \begin{bmatrix} | ||
+ | 1 & 0 & \cdots & 0 \\ | ||
+ | 0 & 1 & \cdots & 0 \\ | ||
+ | \vdots & \vdots & \ddots & \vdots \\ | ||
+ | 0 & 0 & \cdots & 1 | ||
+ | \end{bmatrix} | ||
+ | </math> | ||
+ | |||
+ | The Matrix at infinite evolutionary disatance is: | ||
+ | |||
+ | <math>M^\infty= | ||
+ | \begin{bmatrix} | ||
+ | f_1 & f_1 & \cdots & f_1 \\ | ||
+ | f_2 & f_2 & \cdots & f_2 \\ | ||
+ | \vdots & \vdots & \ddots & \vdots \\ | ||
+ | f_20 & f_20 & \cdots & f_20 | ||
+ | \end{bmatrix} | ||
+ | </math> | ||
+ | |||
+ | |||
+ | |||
+ | Thus at large distances, the amino acids are simply expected to be replaced according to the the average composition - the protein mutates beyond recognizabiltiy. | ||
+ | |||
+ | |||
+ | ;Scaling to evolutionary distance: | ||
+ | |||
+ | As noted above, the diagonal matrix elements give an indication of the evolutionary interval.It is clear, that the evolutionary interval is only equivalent to the % difference between two sequences at very low PAM distances. The correspondence is given below: | ||
+ | |||
<br> | <br> | ||
+ | <table style="border:1px solid #000000;" cellpadding="5" cellspacing="0"> | ||
+ | <tr style="background:#BDC3DC;"><td style="border-bottom:1px solid #000000;">'''%difference'''</td><td style="border-bottom:1px solid #000000;">'''PAM'''</td></tr> | ||
+ | |||
+ | <tr><td>1</td><td>1</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>5</td><td>5</td></tr> | ||
+ | <tr><td>10</td><td>11</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>15</td><td>17</td></tr> | ||
+ | <tr><td>20</td><td>23</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>25</td><td>30</td></tr> | ||
+ | <tr><td>30</td><td>38</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>35</td><td>47</td></tr> | ||
+ | <tr><td>40</td><td>56</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>45</td><td>67</td></tr> | ||
+ | <tr><td>50</td><td>80</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>55</td><td>94</td></tr> | ||
+ | <tr><td>60</td><td>112</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>65</td><td>133</td></tr> | ||
+ | <tr><td>70</td><td>159</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>75</td><td>195</td></tr> | ||
+ | <tr><td>80</td><td>246</td></tr> | ||
+ | <tr style="background: #E9EBF3;"><td>85</td><td>328</td></tr> | ||
+ | |||
+ | </table> | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | Note that the PAM250 matrix corresponds to approxiamtely 20% identities. the 50% identities level is approximately PAM100 | ||
+ | |||
+ | <!-- insert PAM/% image here --> | ||
+ | |||
+ | The asymptote of this function is the % difference score of two randomly aligned sequences. The probability for the chance occurence of an identical amino acid pair is the square of the probability of occurrence for that amino acid. Thus the expected percent difference for random alignments is: | ||
+ | |||
+ | <math> \mathrm{diff}_{random}=100\left(1-\sum_{i=1}^{20} f_i^2\right)</math>. | ||
+ | |||
+ | For the Jones and Taylor tabulation of frequencies this difference expectation value is 94.2%. This is close to the value of 95% that you would expect if all amino acids occured with equal frequencies. This means: even for randomly aligned sequences, you would expect to get around 5% identities. | ||
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
+ | |||
==Sixth step: Relatedness odds== | ==Sixth step: Relatedness odds== | ||
</div> | </div> | ||
+ | |||
+ | The word '''"odds"''' describes the ratio of two probabilities. In particular we frequently speak about the odds of one event happening versus that event not happening. | ||
+ | |||
+ | <math>\mathrm{odds}\left(\mathrm{event}\right) = \frac{P\left({\mathrm{event}}\right)}{1-P\left({\mathrm{event}}\right)}</math> | ||
+ | |||
+ | In that case odds and probabilities are related, but not identical. In particular the product of two odds will be different from the product of their corresponding probabilities. ''However'' another use of the word ''odds'' is to express the ratio of two probabilities and that's the way Dayhoff has used it in the term ''relatedness odds''. | ||
+ | |||
+ | The mutation probability matrix gives the probability <math>M_{ij}</math>, that an amino acid <math>i</math> will replace an amino acid of type <math>j</math> in a given evolutionary interval, in two '''homologous''' sequences, given the evolutionary model we have developed above. By comparison, the probability that that same "replacement" is observed by random chance, is simply given by the frequency of occurence of amino acid i. | ||
+ | |||
+ | <math>p_i^{\mathrm{random}}=f_i</math> | ||
+ | |||
+ | |||
+ | Accordingly the '''relatedness odds''', i.e. relative odds that a given event is due to evolution, rather than due to chance are | ||
+ | |||
+ | |||
+ | <math>R_{ij}=\frac{M_{ij}}{p_i^{\mathrm{random}}}</math> | ||
+ | |||
+ | |||
+ | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
+ | |||
+ | ==Last step: the log-odds matrix== | ||
+ | </div> | ||
+ | |||
+ | <br> | ||
+ | <br> | ||
+ | <div style="padding: 5px; background: #FF4560; border:solid 2px #000000;"> | ||
+ | '''Update Warning!''' | ||
+ | This page is under revision. Material below this tag is missing formulas and has not been updated.</div> | ||
+ | <br> | ||
<br> | <br> | ||
+ | Finally we have a tool to quantify the probability that two sequences are homologous, i.e. related by evolution. The relatedness odds for one aligned pair are given by the corresponding matrix element. The relatedness odds for a second pair are multiplied with those of the first pair to give the joint probability. This is done for all aligned pairs of the whole sequence. Since multiplication is a computationally expensive process, it is preferrable to add the logarithms of the matrix elements. This matrix, the log odds matrix, is the foundation of quantitative sequence comparisons under an evolutionary model. Since the Dayhoff matrix was taken as the log to base 10, a value of +1 would mean that the corresponding pair has been observed 10 times more frequently than expected by chance. A value of -0.2 would mean that the observed pair was observed 1.6 times less frequently than chance would predict. The most commonly used matrix is the matrix from the 1978 edition of the Dayhoff atlas, at PAM 250: this is also frequently referred to as the MDM78 PAM250 matrix. | ||
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
− | == | + | |
+ | ==The MDM78 PAM250 matrix== | ||
</div> | </div> | ||
<br> | <br> | ||
+ | <pre> | ||
+ | Alternate symbol comparison table for the comparison of peptide | ||
+ | sequences. | ||
+ | |||
+ | Dayhoff table (Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. [1979] | ||
+ | in Atlas of Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. | ||
+ | 345-352 (Figure 84), National Biomedical Research Foundation, Washington | ||
+ | D.C.) rescaled by dividing each value by 5.0 | ||
+ | |||
+ | |||
+ | May 24, 1990 16:47 | ||
+ | |||
+ | A B C D E F G H I K L M N P Q R S T V W Y Z | ||
+ | 0.4 0.0 -0.4 0.0 0.0 -0.8 0.2 -0.2 -0.2 -0.2 -0.4 -0.2 0.0 0.2 0.0 -0.4 0.2 0.2 0.0 -1.2 -0.6 0.0 A | ||
+ | 0.5 -0.9 0.6 0.4 -1.0 0.1 0.3 -0.4 0.1 -0.7 -0.5 0.4 -0.2 0.3 -0.1 0.1 0.0 -0.4 -1.1 -0.6 0.4 B | ||
+ | 2.4 -1.0 -1.0 -0.8 -0.6 -0.6 -0.4 -1.0 -1.2 -1.0 -0.8 -0.6 -1.0 -0.8 0.0 -0.4 -0.4 -1.6 0.0 -1.0 C | ||
+ | 0.8 0.6 -1.2 0.2 0.2 -0.4 0.0 -0.8 -0.6 0.4 -0.2 0.4 -0.2 0.0 0.0 -0.4 -1.4 -0.8 0.5 D | ||
+ | 0.8 -1.0 0.0 0.2 -0.4 0.0 -0.6 -0.4 0.2 -0.2 0.4 -0.2 0.0 0.0 -0.4 -1.4 -0.8 0.6 E | ||
+ | 1.8 -1.0 -0.4 0.2 -1.0 0.4 0.0 -0.8 -1.0 -1.0 -0.8 -0.6 -0.6 -0.2 0.0 1.4 -1.0 F | ||
+ | 1.0 -0.4 -0.6 -0.4 -0.8 -0.6 0.0 -0.2 -0.2 -0.6 0.2 0.0 -0.2 -1.4 -1.0 -0.1 G | ||
+ | 1.2 -0.4 0.0 -0.4 -0.4 0.4 0.0 0.6 0.4 -0.2 -0.2 -0.4 -0.6 0.0 -0.4 H | ||
+ | 1.0 -0.4 0.4 0.4 -0.4 -0.4 -0.4 -0.4 -0.2 0.0 0.8 -1.0 -0.2 -0.4 I | ||
+ | 1.0 -0.6 0.0 0.2 -0.2 0.2 0.6 0.0 0.0 -0.4 -0.6 -0.8 0.1 K | ||
+ | 1.2 0.8 -0.6 -0.6 -0.4 -0.6 -0.6 -0.4 0.4 -0.4 -0.2 -0.5 L | ||
+ | 1.2 -0.4 -0.4 -0.2 0.0 -0.4 -0.2 0.4 -0.8 -0.4 -0.3 M | ||
+ | 0.4 -0.2 0.2 0.0 0.2 0.0 -0.4 -0.8 -0.4 0.2 N | ||
+ | 1.2 0.0 0.0 0.2 0.0 -0.2 -1.2 -1.0 -0.1 P | ||
+ | 0.8 0.2 -0.2 -0.2 -0.4 -1.0 -0.8 0.6 Q | ||
+ | 1.2 0.0 -0.2 -0.4 0.4 -0.8 0.6 R | ||
+ | 0.4 0.2 -0.2 -0.4 -0.6 -0.1 S | ||
+ | 0.6 0.0 -1.0 -0.6 -0.1 T | ||
+ | 0.8 -1.2 -0.4 -0.4 V | ||
+ | 3.4 0.0 -1.2 W | ||
+ | 2.0 -0.8 Y | ||
+ | 0.6 Z | ||
+ | </pre> | ||
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
Line 122: | Line 348: | ||
</div> | </div> | ||
<br> | <br> | ||
+ | The matrix in most widespread use for many years was entirely derived from the Dayhoff MDM78 (Gribskov, M. & Burgess, R.R. (1986) NAR 14:6745-6763). It has been renormalized to give a score of 1.5 for identical residues, and the scores for non-identical residues have been adjusted to give a mean of -0.17 and a standard deviation of 0.364. The negative expectation value is necessary for local alignment routines, since a positive expectation value would allow to increase the score simply by extending random alignments. Note that these normalizations constitute a significant departure from the evolutionary model as discussed above. Specifically, the interpretation in terms of a specific evolutionary distance is now no longer possible. | ||
+ | |||
+ | ;The GCG matrix: | ||
+ | <pre> | ||
+ | Default scoring matrix used by COMPARE for the comparison of | ||
+ | protein sequences. | ||
+ | Dayhoff table (Schwartz, R. M. and Dayhoff, M. O. [1979] in Atlas of | ||
+ | Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. 353-358, National | ||
+ | Biomedical Research Foundation, Washington D.C.) rescaled by dividing | ||
+ | each value by the sum of its row and column, and normalizing to a mean | ||
+ | of 0 and standard deviation of 1.0. The value for FY (Phe-Tyr) = RW = | ||
+ | 1.425. Perfect matches are set to 1.5 and no matches on any row are | ||
+ | better than perfect matches. | ||
+ | |||
+ | Table used by Gribskov and Burgess NAR 14(16) 6745-6763 | ||
+ | |||
+ | |||
+ | December 29, 1986 12:46 | ||
+ | |||
+ | A B C D E F G H I K L M N P Q R S T V W Y Z .. | ||
+ | 1.5 0.2 0.3 0.3 0.3 -0.5 0.7 -0.1 0.0 0.0 -0.1 0.0 0.2 0.5 0.2 -0.3 0.4 0.4 0.2 -0.8 -0.3 0.2 A | ||
+ | 1.1 -0.4 1.1 0.7 -0.7 0.6 0.4 -0.2 0.4 -0.5 -0.3 1.1 0.1 0.5 0.1 0.3 0.2 -0.2 -0.7 -0.3 0.6 B | ||
+ | 1.5 -0.5 -0.6 -0.1 0.2 -0.1 0.2 -0.6 -0.8 -0.6 -0.3 0.1 -0.6 -0.3 0.7 0.2 0.2 -1.2 1.0 -0.6 C | ||
+ | 1.5 1.0 -1.0 0.7 0.4 -0.2 0.3 -0.5 -0.4 0.7 0.1 0.7 0.0 0.2 0.2 -0.2 -1.1 -0.5 0.9 D | ||
+ | 1.5 -0.7 0.5 0.4 -0.2 0.3 -0.3 -0.2 0.5 0.1 0.7 0.0 0.2 0.2 -0.2 -1.1 -0.5 1.1 E | ||
+ | 1.5 -0.6 -0.1 0.7 -0.7 1.2 0.5 -0.5 -0.7 -0.8 -0.5 -0.3 -0.3 0.2 1.3 1.4 -0.7 F | ||
+ | 1.5 -0.2 -0.3 -0.1 -0.5 -0.3 0.4 0.3 0.2 -0.3 0.6 0.4 0.2 -1.0 -0.7 0.3 G | ||
+ | 1.5 -0.3 0.1 -0.2 -0.3 0.5 0.2 0.7 0.5 -0.2 -0.1 -0.3 -0.1 0.3 0.5 H | ||
+ | 1.5 -0.2 0.8 0.6 -0.3 -0.2 -0.3 -0.3 -0.1 0.2 1.1 -0.5 0.1 -0.2 I | ||
+ | 1.5 -0.3 0.2 0.4 0.1 0.4 0.8 0.2 0.2 -0.2 0.1 -0.6 0.4 K | ||
+ | 1.5 1.3 -0.4 -0.3 -0.1 -0.4 -0.4 -0.1 0.8 0.5 0.3 -0.2 L | ||
+ | 1.5 -0.3 -0.2 0.0 0.2 -0.3 0.0 0.6 -0.3 -0.1 -0.1 M | ||
+ | 1.5 0.0 0.4 0.1 0.3 0.2 -0.3 -0.3 -0.1 0.4 N | ||
+ | 1.5 0.3 0.3 0.4 0.3 0.1 -0.8 -0.8 0.2 P | ||
+ | 1.5 0.4 -0.1 -0.1 -0.2 -0.5 -0.6 1.1 Q | ||
+ | 1.5 0.1 -0.1 -0.3 1.4 -0.6 0.2 R | ||
+ | 1.5 0.3 -0.1 0.3 -0.4 0.0 S | ||
+ | 1.5 0.2 -0.6 -0.3 0.1 T | ||
+ | 1.5 -0.8 -0.1 -0.2 V | ||
+ | 1.5 1.1 -0.8 W | ||
+ | 1.5 -0.6 Y | ||
+ | 1.1 Z | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | </pre> | ||
<div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | ||
Line 128: | Line 401: | ||
</div> | </div> | ||
<br> | <br> | ||
+ | |||
+ | |||
+ | ---- | ||
+ | <small>(Derived and updated from material I originally posted at the University of Munich Gene Center in 1999. <!-- http://www.lmb.uni-muenchen.de/groups/bioinformatics/04/ch_04_3.html) --></small> |
Latest revision as of 16:00, 9 September 2008
Mutation Data Matrices
- Mutation Data Matrices are computational tools to quantify how well an observed pair-exchange conforms to a particular model of exchanges, usually a model of evolution.
One really has to understand the above sentence, if one wants to understand the whole business of sequence aligning and similarity scores in bioinformatics. We have algorithms that compute optimal sequence alignments - either global (Needleman-Wunsch), or local (Smith-Waterman). But what does optimal really mean? Optimal means that the alignment gives us the highest score that is possible, for a given mutation matrix. Actually: for a given matrix plus a given set of parameters for insertions and deletions, but since the judicious choice of these parameters depends on the matrix as well, for this discussion here we will assume it is only the matrix.
So since everything depends on the matrix, what is such a matrix? It is an ordered set of numbers that associates every possible pair of amino acids with a score. The crux of the matter is how one comes up with such a score. Ultimately, all such scores measure the probability that two amino acids are related by evolution. And in order to that, a quantitative, computable model of the evolutionary process has to be defined in order to construct such a matrix. Since we can devise several alternative computational models of how to do such a measurement, several different matrices are in common use. Which one to use depends entirely on which model of evolution is the most appropriate for the question at hand. Here is a selction of models:
- We can argue that homologous sequences should not have mutated at all. Thus we should assign a value of 1 for identical amino acids, a value of 0 for mutations. This is the identity matrix. It is appropriate only for sequences with a very high degree of identity.
- We can argue that evolution proceeds by point mutations in the coding sequence. Thus the number of nucleotide changes required to change one codon into another should be small. We score amino acids with similar codons highly, those with dissimilar codons poorly. This is the genetic code matrix.
- We can argue that selection after mutation favours similar amino acids. Thus if two amino acids are related by mutation and selection, they have a higher probability of being biophysically similar than being dissimilar. Similarity can be defined from first principles with some quantitative measure of size, charge, hydrophopicity etc. This lead us to a biophysical matrix.
However all these arguments from first principles have associated problems. The identity matrix does not capture the fact that related sequences have a higher likelihood of conservaive substitutions. The genetic code matrix does not capture similar properties well for some pairs: e.g. glycine (GGG) → tryptophan (TGG). The biophysical matrix cannot take context into account, but it is context that determines of whether e.g. tyrosine is consevred because it is similar to phenylalanine (hydrophobic) or because it is an H-bonf acceptor (polar). An alternative is the empirical approach: observe what nature allows as replacements in amino acid pairs from sequences that we know to be homologous.
This approach postulates that once the evolutionary relationship of two sequences is established, the residues that did exchange can be called similar. This is the principle behind the mutation matrices compiled by Margaret O. Dayhoff and colleagues at the National Biomedical Research Foundation in the early 70s (Dayhoff, M.O. et al. (1978) Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3 National Biomedical Research Foundation, Washington D.C. U.S.A). They developed a precise and rigorous approach to implement a model of evolutionary change in their mutation data matrix.Their model allows to quantify the odds that a given alignment of two protein sequences would either be observed by chance or would reflect divergent evolution from a common ancestral sequence.
We have since obtained a vastly larger number of sequences, and we can compile directly some of the numbers that Dayhoff had to infer from the limited dataset she had available. But the principle of the process is necessarily the same for any related construction of a computational tool that quantifies probabilities. It is a paradigm for the construction of any mutation data matrix.
Contents
- 1 First step: Pair Exchange Frequencies
- 2 Second step: Frequencies of occurrence
- 3 Third step: Relative mutabilities
- 4 Fourth step: Mutation probability matrix
- 5 Fifth step: Evolutionary distance scale
- 6 Sixth step: Relatedness odds
- 7 Last step: the log-odds matrix
- 8 The MDM78 PAM250 matrix
- 9 A variant of the MDM78 PAM250 matrix used in many sequence alignment applications
- 10 The BLOSUM matrices
How is the Dayhoff mutation data matrix constructed ? Since we are looking for a matrix that will quantify the probabiltiy of an evolutionary relationship, we must first define a quantitative model of evolution. The model used by Dayhoff states that proteins evolve through a succesion of independent point mutations that are accepted in a population and subsequently can be observed in the sequence pool. Formally, the fact that the mutations are treated independently of their neigborhood and of their history makes this a Markov model. We can define an evolutionary distance between two sequences as the number of point mutations that was necessary to evolve one sequence into the other. (More precisely, the distance is the number of accepted mutations.) Two aspects of this process cause the evolutionary distance to be unequal in general to the number of observed differences between the sequences:
- First, there is a chance that a certain residue may have mutated, than reverted, hiding the effect of the mutation. This phenomenon is especially important in the evaluation of biological clocks and the question of how many mutational events may become fixed per time unit.
- Second, specific residues may have mutated more than once, thus the number of point mutations is likely to be larger than the number of differences between the two sequences.
Both factors combine so that sequence identity drops progressively more slowly as the number of accepted mutations increases. Even when very many mutations have been accepted, sequence identity will remain around 5%.
First step: Pair Exchange Frequencies
M.O. Dayhoff and colleagues introduced the term "accepted point mutation" for a mutation that is stably fixed in the gene pool in the course of evolution. Thus a measure of evolutionary distance between two sequences can be defined:
- PAM
A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues.
In order to identify accepted point mutations, a complete phylogenetic tree including all ancestral sequences has to be constructed. (There are standard procedures for this.) To avoid a large degree of ambiguities in this step, Dayhoff and colleagues restricted their analysis to sequence families with more than 85% identity. Since the evolutionary distance between these highly similar proteins is small, the construction of the phylogenetic tree can be achieved without too many complicating assumptions. For each of the observed and inferred sequences, the amino acid pair exchanges are tabulated into a 20x20 matrix. It is assumed, that the likelihood of an amino-acid i being replaced by an amino acid j is the same as j replacing i. Hence the matrix is symmetric. (This assumption is probably largely true, it is equivalent to postulating that the amino acid distributions are in equilibrium regarding the evolutionary process.) Note that this process is different from comparing observed sequences directly with each other; a large number of inferred sequences are added to the data.
Aij is the number of accepted mutations observed in which amino acid i replaces amino acid j.
Second step: Frequencies of occurrence
Since the properties of amino acids differ and since they occur with different frequencies in proteins, all statements we can make about the average properties of sequences will depend on the frequencies of occurence of the individual amino acids. These frequencies of occurence are approximated by the frequencies of observation. They are the number of occurences of a given amino acid divided by the total number of amino acids that have been observed. Writing aai for the i-th amino acid
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f_i=\frac{\mbox{observations of }aa_i}{\mbox{all observations}}}
writing i=aa to denote the index running over all 20 amino acids, obviously
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \,\sum_{i=aa} f_i=1}
The appropriate frequencies do depend on the organism under consideration (yes, scoring matrices are improved if they take organism-specific amino acid frequencies into account) and they may depend on the type of amino acid, e.g. amino acid frequencies for membrane proteins are different from those of soluble proteins. However from a practical point of view the theoretical increase in accuracy does not appear sufficient to justify the signifcant additional computational overhead that the work with specialised scoring matrices would require.
Amino acid frequencies
AA | 1978 a | 1991 b |
L | 0.085 | 0.091 |
A | 0.087 | 0.077 |
G | 0.089 | 0.074 |
S | 0.070 | 0.069 |
V | 0.065 | 0.066 |
E | 0.050 | 0.062 |
T | 0.058 | 0.059 |
K | 0.081 | 0.059 |
I | 0.037 | 0.053 |
D | 0.047 | 0.052 |
R | 0.041 | 0.051 |
P | 0.051 | 0.051 |
N | 0.040 | 0.043 |
Q | 0.038 | 0.041 |
F | 0.040 | 0.040 |
Y | 0.030 | 0.032 |
M | 0.015 | 0.024 |
H | 0.034 | 0.023 |
C | 0.033 | 0.020 |
W | 0.010 | 0.014 |
a) Frequencies taken from Dayhoff (1978); b) Frequencies taken from the 1991 recompilation of the mutation matrices by Jones et al. (Jones, D.T. Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282) representing a database of observations that is approximately 40 times larger than that available to Dayhoff. Reassuringly, the changes are small. Note the higher abundance of hydrophobic residues, in particular Leu, this may reflect the addition of many membrane proteins to the sequence databases between 1978 and 1991.
Third step: Relative mutabilities
To obtain a complete picture of the mutational process, the amino-acids that do not mutate must be taken into account as well. We need to know: what is the chance, on average, that a given amino acid will mutate at all. This is the relative mutability of the amino acid. It is obtained by multiplying the number of observed changes by the amino acid's frequency of occurence.
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \,m_i=f_i(\mbox{number of times }aa_i\mbox{ is observed to change})}
where Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle m_{Ala}} is arbitrarily set to 100%.
Relative mutabilities
AA | 1978 a | 1991 b |
S | 120 | 117 |
T | 97 | 107 |
N | 134 | 104 |
I | 96 | 103 |
A | 100 | 100 |
V | 74 | 98 |
M | 94 | 93 |
H | 66 | 91 |
D | 106 | 86 |
Q | 93 | 84 |
R | 65 | 83 |
E | 102 | 77 |
K | 56 | 72 |
P | 56 | 58 |
L | 40 | 54 |
F | 41 | 51 |
G | 49 | 50 |
Y | 41 | 50 |
C | 20 | 44 |
W | 18 | 25 |
All values are taken relative to alanine, which is arbitrarily set at 100. As above: a) Frequencies taken from Dayhoff (1978); b) Frequencies taken from the 1991 recompilation of the mutation matrices by Jones et al. (Jones, D.T. Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282). The difference for some amino acids are quite significant, especially for those amino acids, for which hardly any exchanges have been observed in 1978 (C and W). Serine and threonine are the most mutable amino acids, cysteine and tryptophan are the least likely to be succesfully replaced in evolution.
Fourth step: Mutation probability matrix
With the numbers we have calculated and compiled above we can quantify the probability that an amino acid i will replace an amino acid j in an alignment of two homologous sequences: it is the mutability of amino acid j, multiplied by the relative pair exchange frequency for i ← j (i.e. the pair exchange frequency for i←j divided by the sum of all pair exchange frequencies for amino acid j).
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ij}=m_j\frac{A_{ij}}{\sum_{i=aa}{A_{ij}}}}
Fifth step: Evolutionary distance scale
Since the diagonal elements of the matrix Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ii}} represent the probability for an amino acid to remain conserved, if we scale all cells of our matrix by a constant factor Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda} we can scale the matrix to reflect a specific overall probability of change. We may chose Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda} so that the expected number of changes is 1 %, this matrix then represents the exchange probabilities for the evolutionary distance of 1 PAM. An average protein will have the composition:
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n_i=f_iN}
with Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N}
being the length of the protein and Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n_i}
the number of amino acids of a certain type. If we take Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle N}
to be 100, Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n_i}
will equal the percentage of aminoa acids. The percentage of amino acids of type Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle i}
that will change in the evolutionary interval represented by a given matrix is
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle n_i(1-M_{ii})} ;
and the number of total changes - PAM - represented by the matrix is the sum over all individual changes. We can use this sum to rescale the Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ii}}
so that the percentage of change is 1, then the matrix represents one PAM, i.e. Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda}
can be calculated as:
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \frac{1}{\lambda}=\sum_i n_i(1-M_{ii})}
with the scaling factor Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda}
, the mutation probability elements become
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ij}=\lambda m_j\frac{A_{ij}}{\sum_{i=aa}{A_{ij}}}}
and the diagonal elements become
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ii}=1-\lambda m_i}
It is apparent that the number of amino acids that are expected to change according to the matrix depends only on the factor Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda}
. But this is correct only for the evolutionary distances from which the data were compiled. In the case of the Dayhoff matrices, where sequences with less than 15 % differences were used, scaling for 1 to 5 PAM should be permissible, from the original data. For higher evolutionary distances, extrapolation can not be simply done by adjusting the Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ii}}
with a scaling factor Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \lambda}
, since this would neglect overlapping mutations.
In the framework of Dayhoff's model, a mutation probability matrix for any distance can be obtained by multiplying the 1 PAM matrix with itself the required number of times. Thus the 3 PAM matrix is simply the cube of the 1 PAM matrix. Obviously, the accuracy of this process of extrapolation will depend on the accuracy of the 1 PAM matrix - but up to that accuracy, it is rigorously correct.
Two facts should be pointed out:
The matrix at time PAM 0 is simply the unitary matrix. PAM 0 means that identities are certain and exchange is impossible:
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M^0= \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix} }
The Matrix at infinite evolutionary disatance is:
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M^\infty= \begin{bmatrix} f_1 & f_1 & \cdots & f_1 \\ f_2 & f_2 & \cdots & f_2 \\ \vdots & \vdots & \ddots & \vdots \\ f_20 & f_20 & \cdots & f_20 \end{bmatrix} }
Thus at large distances, the amino acids are simply expected to be replaced according to the the average composition - the protein mutates beyond recognizabiltiy.
- Scaling to evolutionary distance
As noted above, the diagonal matrix elements give an indication of the evolutionary interval.It is clear, that the evolutionary interval is only equivalent to the % difference between two sequences at very low PAM distances. The correspondence is given below:
%difference | PAM |
1 | 1 |
5 | 5 |
10 | 11 |
15 | 17 |
20 | 23 |
25 | 30 |
30 | 38 |
35 | 47 |
40 | 56 |
45 | 67 |
50 | 80 |
55 | 94 |
60 | 112 |
65 | 133 |
70 | 159 |
75 | 195 |
80 | 246 |
85 | 328 |
Note that the PAM250 matrix corresponds to approxiamtely 20% identities. the 50% identities level is approximately PAM100
The asymptote of this function is the % difference score of two randomly aligned sequences. The probability for the chance occurence of an identical amino acid pair is the square of the probability of occurrence for that amino acid. Thus the expected percent difference for random alignments is:
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{diff}_{random}=100\left(1-\sum_{i=1}^{20} f_i^2\right)} .
For the Jones and Taylor tabulation of frequencies this difference expectation value is 94.2%. This is close to the value of 95% that you would expect if all amino acids occured with equal frequencies. This means: even for randomly aligned sequences, you would expect to get around 5% identities.
Sixth step: Relatedness odds
The word "odds" describes the ratio of two probabilities. In particular we frequently speak about the odds of one event happening versus that event not happening.
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathrm{odds}\left(\mathrm{event}\right) = \frac{P\left({\mathrm{event}}\right)}{1-P\left({\mathrm{event}}\right)}}
In that case odds and probabilities are related, but not identical. In particular the product of two odds will be different from the product of their corresponding probabilities. However another use of the word odds is to express the ratio of two probabilities and that's the way Dayhoff has used it in the term relatedness odds.
The mutation probability matrix gives the probability Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ij}} , that an amino acid Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle i} will replace an amino acid of type Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle j} in a given evolutionary interval, in two homologous sequences, given the evolutionary model we have developed above. By comparison, the probability that that same "replacement" is observed by random chance, is simply given by the frequency of occurence of amino acid i.
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p_i^{\mathrm{random}}=f_i}
Accordingly the relatedness odds, i.e. relative odds that a given event is due to evolution, rather than due to chance are
Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle R_{ij}=\frac{M_{ij}}{p_i^{\mathrm{random}}}}
Last step: the log-odds matrix
Update Warning!
This page is under revision. Material below this tag is missing formulas and has not been updated.
Finally we have a tool to quantify the probability that two sequences are homologous, i.e. related by evolution. The relatedness odds for one aligned pair are given by the corresponding matrix element. The relatedness odds for a second pair are multiplied with those of the first pair to give the joint probability. This is done for all aligned pairs of the whole sequence. Since multiplication is a computationally expensive process, it is preferrable to add the logarithms of the matrix elements. This matrix, the log odds matrix, is the foundation of quantitative sequence comparisons under an evolutionary model. Since the Dayhoff matrix was taken as the log to base 10, a value of +1 would mean that the corresponding pair has been observed 10 times more frequently than expected by chance. A value of -0.2 would mean that the observed pair was observed 1.6 times less frequently than chance would predict. The most commonly used matrix is the matrix from the 1978 edition of the Dayhoff atlas, at PAM 250: this is also frequently referred to as the MDM78 PAM250 matrix.
The MDM78 PAM250 matrix
Alternate symbol comparison table for the comparison of peptide sequences. Dayhoff table (Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. [1979] in Atlas of Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. 345-352 (Figure 84), National Biomedical Research Foundation, Washington D.C.) rescaled by dividing each value by 5.0 May 24, 1990 16:47 A B C D E F G H I K L M N P Q R S T V W Y Z 0.4 0.0 -0.4 0.0 0.0 -0.8 0.2 -0.2 -0.2 -0.2 -0.4 -0.2 0.0 0.2 0.0 -0.4 0.2 0.2 0.0 -1.2 -0.6 0.0 A 0.5 -0.9 0.6 0.4 -1.0 0.1 0.3 -0.4 0.1 -0.7 -0.5 0.4 -0.2 0.3 -0.1 0.1 0.0 -0.4 -1.1 -0.6 0.4 B 2.4 -1.0 -1.0 -0.8 -0.6 -0.6 -0.4 -1.0 -1.2 -1.0 -0.8 -0.6 -1.0 -0.8 0.0 -0.4 -0.4 -1.6 0.0 -1.0 C 0.8 0.6 -1.2 0.2 0.2 -0.4 0.0 -0.8 -0.6 0.4 -0.2 0.4 -0.2 0.0 0.0 -0.4 -1.4 -0.8 0.5 D 0.8 -1.0 0.0 0.2 -0.4 0.0 -0.6 -0.4 0.2 -0.2 0.4 -0.2 0.0 0.0 -0.4 -1.4 -0.8 0.6 E 1.8 -1.0 -0.4 0.2 -1.0 0.4 0.0 -0.8 -1.0 -1.0 -0.8 -0.6 -0.6 -0.2 0.0 1.4 -1.0 F 1.0 -0.4 -0.6 -0.4 -0.8 -0.6 0.0 -0.2 -0.2 -0.6 0.2 0.0 -0.2 -1.4 -1.0 -0.1 G 1.2 -0.4 0.0 -0.4 -0.4 0.4 0.0 0.6 0.4 -0.2 -0.2 -0.4 -0.6 0.0 -0.4 H 1.0 -0.4 0.4 0.4 -0.4 -0.4 -0.4 -0.4 -0.2 0.0 0.8 -1.0 -0.2 -0.4 I 1.0 -0.6 0.0 0.2 -0.2 0.2 0.6 0.0 0.0 -0.4 -0.6 -0.8 0.1 K 1.2 0.8 -0.6 -0.6 -0.4 -0.6 -0.6 -0.4 0.4 -0.4 -0.2 -0.5 L 1.2 -0.4 -0.4 -0.2 0.0 -0.4 -0.2 0.4 -0.8 -0.4 -0.3 M 0.4 -0.2 0.2 0.0 0.2 0.0 -0.4 -0.8 -0.4 0.2 N 1.2 0.0 0.0 0.2 0.0 -0.2 -1.2 -1.0 -0.1 P 0.8 0.2 -0.2 -0.2 -0.4 -1.0 -0.8 0.6 Q 1.2 0.0 -0.2 -0.4 0.4 -0.8 0.6 R 0.4 0.2 -0.2 -0.4 -0.6 -0.1 S 0.6 0.0 -1.0 -0.6 -0.1 T 0.8 -1.2 -0.4 -0.4 V 3.4 0.0 -1.2 W 2.0 -0.8 Y 0.6 Z
A variant of the MDM78 PAM250 matrix used in many sequence alignment applications
The matrix in most widespread use for many years was entirely derived from the Dayhoff MDM78 (Gribskov, M. & Burgess, R.R. (1986) NAR 14:6745-6763). It has been renormalized to give a score of 1.5 for identical residues, and the scores for non-identical residues have been adjusted to give a mean of -0.17 and a standard deviation of 0.364. The negative expectation value is necessary for local alignment routines, since a positive expectation value would allow to increase the score simply by extending random alignments. Note that these normalizations constitute a significant departure from the evolutionary model as discussed above. Specifically, the interpretation in terms of a specific evolutionary distance is now no longer possible.
- The GCG matrix
Default scoring matrix used by COMPARE for the comparison of protein sequences. Dayhoff table (Schwartz, R. M. and Dayhoff, M. O. [1979] in Atlas of Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. 353-358, National Biomedical Research Foundation, Washington D.C.) rescaled by dividing each value by the sum of its row and column, and normalizing to a mean of 0 and standard deviation of 1.0. The value for FY (Phe-Tyr) = RW = 1.425. Perfect matches are set to 1.5 and no matches on any row are better than perfect matches. Table used by Gribskov and Burgess NAR 14(16) 6745-6763 December 29, 1986 12:46 A B C D E F G H I K L M N P Q R S T V W Y Z .. 1.5 0.2 0.3 0.3 0.3 -0.5 0.7 -0.1 0.0 0.0 -0.1 0.0 0.2 0.5 0.2 -0.3 0.4 0.4 0.2 -0.8 -0.3 0.2 A 1.1 -0.4 1.1 0.7 -0.7 0.6 0.4 -0.2 0.4 -0.5 -0.3 1.1 0.1 0.5 0.1 0.3 0.2 -0.2 -0.7 -0.3 0.6 B 1.5 -0.5 -0.6 -0.1 0.2 -0.1 0.2 -0.6 -0.8 -0.6 -0.3 0.1 -0.6 -0.3 0.7 0.2 0.2 -1.2 1.0 -0.6 C 1.5 1.0 -1.0 0.7 0.4 -0.2 0.3 -0.5 -0.4 0.7 0.1 0.7 0.0 0.2 0.2 -0.2 -1.1 -0.5 0.9 D 1.5 -0.7 0.5 0.4 -0.2 0.3 -0.3 -0.2 0.5 0.1 0.7 0.0 0.2 0.2 -0.2 -1.1 -0.5 1.1 E 1.5 -0.6 -0.1 0.7 -0.7 1.2 0.5 -0.5 -0.7 -0.8 -0.5 -0.3 -0.3 0.2 1.3 1.4 -0.7 F 1.5 -0.2 -0.3 -0.1 -0.5 -0.3 0.4 0.3 0.2 -0.3 0.6 0.4 0.2 -1.0 -0.7 0.3 G 1.5 -0.3 0.1 -0.2 -0.3 0.5 0.2 0.7 0.5 -0.2 -0.1 -0.3 -0.1 0.3 0.5 H 1.5 -0.2 0.8 0.6 -0.3 -0.2 -0.3 -0.3 -0.1 0.2 1.1 -0.5 0.1 -0.2 I 1.5 -0.3 0.2 0.4 0.1 0.4 0.8 0.2 0.2 -0.2 0.1 -0.6 0.4 K 1.5 1.3 -0.4 -0.3 -0.1 -0.4 -0.4 -0.1 0.8 0.5 0.3 -0.2 L 1.5 -0.3 -0.2 0.0 0.2 -0.3 0.0 0.6 -0.3 -0.1 -0.1 M 1.5 0.0 0.4 0.1 0.3 0.2 -0.3 -0.3 -0.1 0.4 N 1.5 0.3 0.3 0.4 0.3 0.1 -0.8 -0.8 0.2 P 1.5 0.4 -0.1 -0.1 -0.2 -0.5 -0.6 1.1 Q 1.5 0.1 -0.1 -0.3 1.4 -0.6 0.2 R 1.5 0.3 -0.1 0.3 -0.4 0.0 S 1.5 0.2 -0.6 -0.3 0.1 T 1.5 -0.8 -0.1 -0.2 V 1.5 1.1 -0.8 W 1.5 -0.6 Y 1.1 Z
The BLOSUM matrices
(Derived and updated from material I originally posted at the University of Munich Gene Center in 1999.