Mutation Data Matrices

From "A B C"
Jump to navigation Jump to search

Mutation Data Matrices

   


Mutation Data Matrices are computational tools to quantify how well an observed pair-exchange conforms to a particular model of exchanges, usually a model of evolution.

One really has to understand the above sentence, if one wants to understand the whole business of sequence aligning and similarity scores in bioinformatics. We have algorithms that compute optimal sequence alignments - either global (Needleman-Wunsch), or local (Smith-Waterman). But what does optimal really mean? Optimal means that the alignment gives us the highest score that is possible, for a given mutation matrix. Actually: for a given matrix plus a given set of parameters for insertions and deletions, but since the judicious choice of these parameters depends on the matrix as well, for this discussion here we will assume it is only the matrix.

So since everything depends on the matrix, what is such a matrix? It is an ordered set of numbers that associates every possible pair of amino acids with a score. The crux of the matter is how one comes up with such a score. Ultimately, all such scores measure the probability that two amino acids are related by evolution. And in order to that, a quantitative, computable model of the evolutionary process has to be defined in order to construct such a matrix. Since we can devise several alternative computational models of how to do such a measurement, several different matrices are in common use. Which one to use depends entirely on which model of evolution is the most appropriate for the question at hand. Here is a selction of models:

  • We can argue that homologous sequences should not have mutated at all. Thus we should assign a value of 1 for identical amino acids, a value of 0 for mutations. This is the identity matrix. It is appropriate only for sequences with a very high degree of identity.
  • We can argue that evolution proceeds by point mutations in the coding sequence. Thus the number of nucleotide changes required to change one codon into another should be small. We score amino acids with similar codons highly, those with dissimilar codons poorly. This is the genetic code matrix.
  • We can argue that selection after mutation favours similar amino acids. Thus if two amino acids are related by mutation and selection, they have a higher probability of being biophysically similar than being dissimilar. Similarity can be defined from first principles with some quantitative measure of size, charge, hydrophopicity etc. This lead us to a biophysical matrix.

However all these arguments from first principles have associated problems. The identity matrix does not capture the fact that related sequences have a higher likelihood of conservaive substitutions. The genetic code matrix does not capture similar properties well for some pairs: e.g. glycine (GGG) → tryptophan (TGG). The biophysical matrix cannot take context into account, but it is context that determines of whether e.g. tyrosine is consevred because it is similar to phenylalanine (hydrophobic) or because it is an H-bonf acceptor (polar). An alternative is the empirical approach: observe what nature allows as replacements in amino acid pairs from sequences that we know to be homologous.

This approach postulates that once the evolutionary relationship of two sequences is established, the residues that did exchange can be called similar. This is the principle behind the mutation matrices compiled by Margaret O. Dayhoff and colleagues at the National Biomedical Research Foundation in the early 70s (Dayhoff, M.O. et al. (1978) Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3 National Biomedical Research Foundation, Washington D.C. U.S.A). They developed a precise and rigorous approach to implement a model of evolutionary change in their mutation data matrix.Their model allows to quantify the odds that a given alignment of two protein sequences would either be observed by chance or would reflect divergent evolution from a common ancestral sequence.

We have since obtained a vastly larger number of sequences, and we can compile directly some of the numbers that Dayhoff had to infer from the limited dataset she had available. But the principle of the process is necessarily the same for any related construction of a computational tool that quantifies probabilities. It is a paradigm for the construction of any mutation data matrix.

 


How is the Dayhoff mutation data matrix constructed ? Since we are looking for a matrix that will quantify the probabiltiy of an evolutionary relationship, we must first define a quantitative model of evolution. The model used by Dayhoff states that proteins evolve through a succesion of independent point mutations that are accepted in a population and subsequently can be observed in the sequence pool. Formally, the fact that the mutations are treated independently of their neigborhood and of their history makes this a Markov model. We can define an evolutionary distance between two sequences as the number of point mutations that was necessary to evolve one sequence into the other. (More precisely, the distance is the number of accepted mutations.) Two aspects of this process cause the evolutionary distance to be unequal in general to the number of observed differences between the sequences:

  • First, there is a chance that a certain residue may have mutated, than reverted, hiding the effect of the mutation. This phenomenon is especially important in the evaluation of biological clocks and the question of how many mutational events may become fixed per time unit.
  • Second, specific residues may have mutated more than once, thus the number of point mutations is likely to be larger than the number of differences between the two sequences.

Both factors combine so that sequence identity drops progressively more slowly as the number of accepted mutations increases. Even when very many mutations have been accepted, sequence identity will remain around 5%.

First step: Pair Exchange Frequencies

 
M.O. Dayhoff and colleagues introduced the term "accepted point mutation" for a mutation that is stably fixed in the gene pool in the course of evolution. Thus a measure of evolutionary distance between two sequences can be defined:
 

PAM

A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues.

 

In order to identify accepted point mutations, a complete phylogenetic tree including all ancestral sequences has to be constructed. (There are standard procedures for this.) To avoid a large degree of ambiguities in this step, Dayhoff and colleagues restricted their analysis to sequence families with more than 85% identity. Since the evolutionary distance between these highly similar proteins is small, the construction of the phylogenetic tree can be achieved without too many complicating assumptions. For each of the observed and inferred sequences, the amino acid pair exchanges are tabulated into a 20x20 matrix. It is assumed, that the likelihood of an amino-acid i being replaced by an amino acid j is the same as j replacing i. Hence the matrix is symmetric. (This assumption is probably largely true, it is equivalent to postulating that the amino acid distributions are in equilibrium regarding the evolutionary process.) Note that this process is different from comparing observed sequences directly with each other; a large number of inferred sequences are added to the data.

Aij is the number of accepted mutations observed in which amino acid i replaces amino acid j.

Second step: Frequencies of occurrence

 
Since the properties of amino acids differ and since they occur with different frequencies in proteins, all statements we can make about the average properties of sequences will depend on the frequencies of occurence of the individual amino acids. These frequencies of occurence are approximated by the frequencies of observation. They are the number of occurences of a given amino acid divided by the total number of amino acids that have been observed. Writing aai for the i-th amino acid

writing i=aa to denote the index running over all 20 amino acids, obviously

The appropriate frequencies do depend on the organism under consideration (yes, scoring matrices are improved if they take organism-specific amino acid frequencies into account) and they may depend on the type of amino acid, e.g. amino acid frequencies for membrane proteins are different from those of soluble proteins. However from a practical point of view the theoretical increase in accuracy does not appear sufficient to justify the signifcant additional computational overhead that the work with specialised scoring matrices would require.

Amino acid frequencies

 

AA1978 a1991 b
L 0.085 0.091
A 0.087 0.077
G 0.089 0.074
S 0.070 0.069
V 0.065 0.066
E 0.050 0.062
T 0.058 0.059
K 0.081 0.059
I 0.037 0.053
D 0.047 0.052
R 0.041 0.051
P 0.051 0.051
N 0.040 0.043
Q 0.038 0.041
F 0.040 0.040
Y 0.030 0.032
M 0.015 0.024
H 0.034 0.023
C 0.033 0.020
W 0.010 0.014

a) Frequencies taken from Dayhoff (1978); b) Frequencies taken from the 1991 recompilation of the mutation matrices by Jones et al. (Jones, D.T. Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282) representing a database of observations that is approximately 40 times larger than that available to Dayhoff. Reassuringly, the changes are small. Note the higher abundance of hydrophobic residues, in particular Leu, this may reflect the addition of many membrane proteins to the sequence databases between 1978 and 1991.

Third step: Relative mutabilities

 

To obtain a complete picture of the mutational process, the amino-acids that do not mutate must be taken into account as well. We need to know: what is the chance, on average, that a given amino acid will mutate at all. This is the relative mutability of the amino acid. It is obtained by multiplying the number of observed changes by the amino acid's frequency of occurence.

where is arbitrarily set to 100%.

Relative mutabilities

 

AA1978 a1991 b
S 120 117
T 97 107
N 134 104
I 96 103
A 100 100
V 74 98
M 94 93
H 66 91
D 106 86
Q 93 84
R 65 83
E 102 77
K 56 72
P 56 58
L 40 54
F 41 51
G 49 50
Y 41 50
C 20 44
W 18 25

All values are taken relative to alanine, which is arbitrarily set at 100. As above: a) Frequencies taken from Dayhoff (1978); b) Frequencies taken from the 1991 recompilation of the mutation matrices by Jones et al. (Jones, D.T. Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282). The difference for some amino acids are quite significant, especially for those amino acids, for which hardly any exchanges have been observed in 1978 (C and W). Serine and threonine are the most mutable amino acids, cysteine and tryptophan are the least likely to be succesfully replaced in evolution.



Fourth step: Mutation probability matrix

 
With the numbers we have calculated and compiled above we can quantify the probability that an amino acid i will replace an amino acid j in an alignment of two homologous sequences: it is the mutability of amino acid j, multiplied by the relative pair exchange frequency for ij (i.e. the pair exchange frequency for i←j divided by the sum of all pair exchange frequencies for amino acid j).


Fifth step: Evolutionary distance scale

Since the diagonal elements of the matrix represent the probability for an amino acid to remain conserved, if we scale all cells of our matrix by a constant factor we can scale the matrix to reflect a specific overall probability of change. We may chose so that the expected number of changes is 1 %, this matrix then represents the exchange probabilities for the evolutionary distance of 1 PAM. An average protein will have the composition:


with being the length of the protein and the number of amino acids of a certain type. If we take to be 100, will equal the percentage of aminoa acids. The percentage of amino acids of type that will change in the evolutionary interval represented by a given matrix is

 ;


and the number of total changes - PAM - represented by the matrix is the sum over all individual changes. We can use this sum to rescale the so that the percentage of change is 1, then the matrix represents one PAM, i.e. can be calculated as:


with the scaling factor , the mutation probability elements become



and the diagonal elements become



It is apparent that the number of amino acids that are expected to change according to the matrix depends only on the factor . But this is correct only for the evolutionary distances from which the data were compiled. In the case of the Dayhoff matrices, where sequences with less than 15 % differences were used, scaling for 1 to 5 PAM should be permissible, from the original data. For higher evolutionary distances, extrapolation can not be simply done by adjusting the with a scaling factor , since this would neglect overlapping mutations.

In the framework of Dayhoff's model, a mutation probability matrix for any distance can be obtained by multiplying the 1 PAM matrix with itself the required number of times. Thus the 3 PAM matrix is simply the cube of the 1 PAM matrix. Obviously, the accuracy of this process of extrapolation will depend on the accuracy of the 1 PAM matrix - but up to that accuracy, it is rigorously correct.

Two facts should be pointed out:

The matrix at time PAM 0 is simply the unitary matrix. PAM 0 means that identities are certain and exchange is impossible:

The Matrix at infinite evolutionary disatance is:


Thus at large distances, the amino acids are simply expected to be replaced according to the the average composition - the protein mutates beyond recognizabiltiy.


Scaling to evolutionary distance

As noted above, the diagonal matrix elements give an indication of the evolutionary interval.It is clear, that the evolutionary interval is only equivalent to the % difference between two sequences at very low PAM distances. The correspondence is given below:

 

%differencePAM
11
55
1011
1517
2023
2530
3038
3547
4056
4567
5080
5594
60112
65133
70159
75195
80246
85328



Note that the PAM250 matrix corresponds to approxiamtely 20% identities. the 50% identities level is approximately PAM100


The asymptote of this function is the % difference score of two randomly aligned sequences. The probability for the chance occurence of an identical amino acid pair is the square of the probability of occurrence for that amino acid. Thus the expected percent difference for random alignments is:

.

For the Jones and Taylor tabulation of frequencies this difference expectation value is 94.2%. This is close to the value of 95% that you would expect if all amino acids occured with equal frequencies. This means: even for randomly aligned sequences, you would expect to get around 5% identities.


Sixth step: Relatedness odds

The word "odds" describes the ratio of two probabilities. In particular we frequently speak about the odds of one event happening versus that event not happening.

In that case odds and probabilities are related, but not identical. In particular the product of two odds will be different from the product of their corresponding probabilities. However another use of the word odds is to express the ratio of two probabilities and that's the way Dayhoff has used it in the term relatedness odds.

The mutation probability matrix gives the probability , that an amino acid will replace an amino acid of type in a given evolutionary interval, in two homologous sequences, given the evolutionary model we have developed above. By comparison, the probability that that same "replacement" is observed by random chance, is simply given by the frequency of occurence of amino acid i.


Accordingly the relatedness odds, i.e. relative odds that a given event is due to evolution, rather than due to chance are



Last step: the log-odds matrix

 
 

Update Warning!

This page is under revision. Material below this tag is missing formulas and has not been updated.

 
 

Finally we have a tool to quantify the probability that two sequences are homologous, i.e. related by evolution. The relatedness odds for one aligned pair are given by the corresponding matrix element. The relatedness odds for a second pair are multiplied with those of the first pair to give the joint probability. This is done for all aligned pairs of the whole sequence. Since multiplication is a computationally expensive process, it is preferrable to add the logarithms of the matrix elements. This matrix, the log odds matrix, is the foundation of quantitative sequence comparisons under an evolutionary model. Since the Dayhoff matrix was taken as the log to base 10, a value of +1 would mean that the corresponding pair has been observed 10 times more frequently than expected by chance. A value of -0.2 would mean that the observed pair was observed 1.6 times less frequently than chance would predict. The most commonly used matrix is the matrix from the 1978 edition of the Dayhoff atlas, at PAM 250: this is also frequently referred to as the MDM78 PAM250 matrix.

The MDM78 PAM250 matrix

 

Alternate symbol comparison table for the comparison of peptide
sequences.

Dayhoff table (Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. [1979]
in Atlas of Protein Sequence and Structure, Dayhoff, M. O. Ed, pp.
345-352 (Figure 84), National Biomedical Research Foundation, Washington
D.C.) rescaled by dividing each value by 5.0


                     May 24, 1990  16:47

  A    B    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y    Z   
 0.4  0.0 -0.4  0.0  0.0 -0.8  0.2 -0.2 -0.2 -0.2 -0.4 -0.2  0.0  0.2  0.0 -0.4  0.2  0.2  0.0 -1.2 -0.6  0.0 A
      0.5 -0.9  0.6  0.4 -1.0  0.1  0.3 -0.4  0.1 -0.7 -0.5  0.4 -0.2  0.3 -0.1  0.1  0.0 -0.4 -1.1 -0.6  0.4 B
           2.4 -1.0 -1.0 -0.8 -0.6 -0.6 -0.4 -1.0 -1.2 -1.0 -0.8 -0.6 -1.0 -0.8  0.0 -0.4 -0.4 -1.6  0.0 -1.0 C
                0.8  0.6 -1.2  0.2  0.2 -0.4  0.0 -0.8 -0.6  0.4 -0.2  0.4 -0.2  0.0  0.0 -0.4 -1.4 -0.8  0.5 D
                     0.8 -1.0  0.0  0.2 -0.4  0.0 -0.6 -0.4  0.2 -0.2  0.4 -0.2  0.0  0.0 -0.4 -1.4 -0.8  0.6 E
                          1.8 -1.0 -0.4  0.2 -1.0  0.4  0.0 -0.8 -1.0 -1.0 -0.8 -0.6 -0.6 -0.2  0.0  1.4 -1.0 F
                               1.0 -0.4 -0.6 -0.4 -0.8 -0.6  0.0 -0.2 -0.2 -0.6  0.2  0.0 -0.2 -1.4 -1.0 -0.1 G
                                    1.2 -0.4  0.0 -0.4 -0.4  0.4  0.0  0.6  0.4 -0.2 -0.2 -0.4 -0.6  0.0 -0.4 H
                                         1.0 -0.4  0.4  0.4 -0.4 -0.4 -0.4 -0.4 -0.2  0.0  0.8 -1.0 -0.2 -0.4 I
                                              1.0 -0.6  0.0  0.2 -0.2  0.2  0.6  0.0  0.0 -0.4 -0.6 -0.8  0.1 K
                                                   1.2  0.8 -0.6 -0.6 -0.4 -0.6 -0.6 -0.4  0.4 -0.4 -0.2 -0.5 L
                                                        1.2 -0.4 -0.4 -0.2  0.0 -0.4 -0.2  0.4 -0.8 -0.4 -0.3 M
                                                             0.4 -0.2  0.2  0.0  0.2  0.0 -0.4 -0.8 -0.4  0.2 N
                                                                  1.2  0.0  0.0  0.2  0.0 -0.2 -1.2 -1.0 -0.1 P
                                                                       0.8  0.2 -0.2 -0.2 -0.4 -1.0 -0.8  0.6 Q
                                                                            1.2  0.0 -0.2 -0.4  0.4 -0.8  0.6 R
                                                                                 0.4  0.2 -0.2 -0.4 -0.6 -0.1 S
                                                                                      0.6  0.0 -1.0 -0.6 -0.1 T
                                                                                           0.8 -1.2 -0.4 -0.4 V
                                                                                                3.4  0.0 -1.2 W
                                                                                                     2.0 -0.8 Y
                                                                                                          0.6 Z

A variant of the MDM78 PAM250 matrix used in many sequence alignment applications

 
The matrix in most widespread use for many years was entirely derived from the Dayhoff MDM78 (Gribskov, M. & Burgess, R.R. (1986) NAR 14:6745-6763). It has been renormalized to give a score of 1.5 for identical residues, and the scores for non-identical residues have been adjusted to give a mean of -0.17 and a standard deviation of 0.364. The negative expectation value is necessary for local alignment routines, since a positive expectation value would allow to increase the score simply by extending random alignments. Note that these normalizations constitute a significant departure from the evolutionary model as discussed above. Specifically, the interpretation in terms of a specific evolutionary distance is now no longer possible.

The GCG matrix
Default scoring matrix used by COMPARE for the comparison of
protein sequences.

Dayhoff table (Schwartz, R. M. and Dayhoff, M. O. [1979] in Atlas of
Protein Sequence and Structure, Dayhoff, M. O. Ed, pp. 353-358, National
Biomedical Research Foundation, Washington D.C.) rescaled by dividing
each value by the sum of its row and column, and normalizing to a mean
of 0 and standard deviation of 1.0.  The value for FY (Phe-Tyr) = RW =
1.425.  Perfect matches are set to 1.5 and no matches on any row are
better than perfect matches.

Table used by Gribskov and Burgess NAR 14(16) 6745-6763


                     December 29, 1986  12:46

  A    B    C    D    E    F    G    H    I    K    L    M    N    P    Q    R    S    T    V    W    Y    Z  ..
 1.5  0.2  0.3  0.3  0.3 -0.5  0.7 -0.1  0.0  0.0 -0.1  0.0  0.2  0.5  0.2 -0.3  0.4  0.4  0.2 -0.8 -0.3  0.2 A
      1.1 -0.4  1.1  0.7 -0.7  0.6  0.4 -0.2  0.4 -0.5 -0.3  1.1  0.1  0.5  0.1  0.3  0.2 -0.2 -0.7 -0.3  0.6 B
           1.5 -0.5 -0.6 -0.1  0.2 -0.1  0.2 -0.6 -0.8 -0.6 -0.3  0.1 -0.6 -0.3  0.7  0.2  0.2 -1.2  1.0 -0.6 C
                1.5  1.0 -1.0  0.7  0.4 -0.2  0.3 -0.5 -0.4  0.7  0.1  0.7  0.0  0.2  0.2 -0.2 -1.1 -0.5  0.9 D
                     1.5 -0.7  0.5  0.4 -0.2  0.3 -0.3 -0.2  0.5  0.1  0.7  0.0  0.2  0.2 -0.2 -1.1 -0.5  1.1 E
                          1.5 -0.6 -0.1  0.7 -0.7  1.2  0.5 -0.5 -0.7 -0.8 -0.5 -0.3 -0.3  0.2  1.3  1.4 -0.7 F
                               1.5 -0.2 -0.3 -0.1 -0.5 -0.3  0.4  0.3  0.2 -0.3  0.6  0.4  0.2 -1.0 -0.7  0.3 G
                                    1.5 -0.3  0.1 -0.2 -0.3  0.5  0.2  0.7  0.5 -0.2 -0.1 -0.3 -0.1  0.3  0.5 H
                                         1.5 -0.2  0.8  0.6 -0.3 -0.2 -0.3 -0.3 -0.1  0.2  1.1 -0.5  0.1 -0.2 I
                                              1.5 -0.3  0.2  0.4  0.1  0.4  0.8  0.2  0.2 -0.2  0.1 -0.6  0.4 K
                                                   1.5  1.3 -0.4 -0.3 -0.1 -0.4 -0.4 -0.1  0.8  0.5  0.3 -0.2 L
                                                        1.5 -0.3 -0.2  0.0  0.2 -0.3  0.0  0.6 -0.3 -0.1 -0.1 M
                                                             1.5  0.0  0.4  0.1  0.3  0.2 -0.3 -0.3 -0.1  0.4 N
                                                                  1.5  0.3  0.3  0.4  0.3  0.1 -0.8 -0.8  0.2 P
                                                                       1.5  0.4 -0.1 -0.1 -0.2 -0.5 -0.6  1.1 Q
                                                                            1.5  0.1 -0.1 -0.3  1.4 -0.6  0.2 R
                                                                                 1.5  0.3 -0.1  0.3 -0.4  0.0 S
                                                                                      1.5  0.2 -0.6 -0.3  0.1 T
                                                                                           1.5 -0.8 -0.1 -0.2 V
                                                                                                1.5  1.1 -0.8 W
                                                                                                     1.5 -0.6 Y
                                                                                                          1.1 Z




The BLOSUM matrices

 



(Derived and updated from material I originally posted at the University of Munich Gene Center in 1999.