Mutation Data Matrices

From "A B C"
Revision as of 23:44, 17 January 2007 by Boris (talk | contribs)
Jump to navigation Jump to search

(Derived and upodtaed from material I originally posted at http://www.lmb.uni-muenchen.de/groups/bioinformatics/04/ch_04_3.html)</small


Mutation Data Matrices are computational tools to quantify how well an observed pair-exchange conforms to a particular model of exchanges, usually a model of evolution.

One really has to understand the above sentence, if one wants to understand the whole business of sequence aligning and similarity scores in bioinformatics. We have algorithms that compute optimal sequence alignments - either global (Needleman-Wunsch), or local (Smith-Waterman). But what does optimal really mean? Optimal means that the alignment gives us the highest score that is possible, for a given mutation matrix. Actually: for a given matrix plus a given set of parameters for insertions and deletions, but since the judicious choice of these parameters depends on the matrix as well, for this discussion here we will assume it is only the matrix.

So since everything depends on the matrix, what is such a matrix? It is an ordered set of numbers that associates every possible pair of amino acids with a score. The crux of the matter is how one comes up with such a score. Ultimately, all such scores measure the probability that two amino acids are related by evolution. We can devise several computational models of how to do such a measurement.

  • We can argue that homologous sequences should not have mutated at all. Thus we should assign a value of 1 for identical amino acids, a value of 0 for mutations. This is the identity matrix.
  • We can argue that evolution proceeds by point mutations in the coding sequence. Thus the number of nucleotide changes required to change one codon into another should be small. We score amino acids with similar codons highly, those with dissimilar codons poorly. This is the genetic code matrix.
  • We can argue that selection after mutation favours similar amino acids. Thus if two amino acids are related, they have a higher probability of being similar than being dissimilar. Similarity can be defined from first principles with some quantitative measure of size, charge, hydrophopicity etc. This lead us to a biophysical matrix.

However all these arguments from first principles have associated problems. An empirical approach seems more appropriate to score amino acid similarity: let's observe what nature allows as replacements in sequences thaqt we know are homologous.

This approach postulates that once the evolutionary relationship of two sequences is established, the residues that did exchange can be called similar. This is the principle behind the mutation matrices compiled by Margaret O. Dayhoff and colleagues at the National Biomedical Research Foundation in the early 70s (Dayhoff, M.O. et al. (1978) Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3 National Biomedical Research Foundation, Washington D.C. U.S.A). They developed a precise and rigorous approach to implement a model of evolutionary change in their mutation data matrix.Their model allows to quantify the odds that a given alignment of two protein sequences would be observed by chance or would demonstrate an echo of a common ancestral sequence.

We have since obtained access to a vastly larger number of sequences, and we can compile directly some of the numbers that Dayhoff had to infer from the limited dataset she had available. But the principle of the process is necessarily the same for any related construction of a computational tool that quantifies probabilities. It is a paradigm for the construction of any mutation data matrix.

 


How is the Dayhoff mutation data matrix constructed ? Since we are looking for a matrix that will recognize significant evolutionary relationships, we must first define a quantitative model of evolution. The model used by Dayhoff states that proteins evolve through a succesion of independent point mutations, that are accepted in a population and subsequently can be observed in the sequence pool. Formally, the fact that the mutations are treated independently of their neigborhood and of their history makes this a Markov model. We can define an evolutionary distance between two sequences as the number of point mutations that was necessary to evolve one sequence into the other. (More precisely, the distance is the minimal number of mutations.) Two aspects of this process cause the evolutionary distance to be unequal in general to the number of observed differences between the sequences:

  • First, there is a chance that a certain residue may have mutated, than reverted, hiding the effect of the mutation. This phenomenon is important in the evaluation of biological clocks and the question of how many mutational events may become fixed per time unit. In the context of the discussion of mutation matrices we do not need to consider this effect.
  • Second, specific residues may have mutated more than once, thus the number of point mutations is likely to be larger than the number of differences between the two sequences. This has to be taken into account.


First step: Pair Exchange Frequencies

 
M.O. Dayhoff and colleagues introduced the term "accepted point mutation" for a mutation that is stably fixed in the gene pool in the course of evolution. Thus a measure of evolutionary distance between two sequences can be defined:
 

PAM

A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues.

 

In order to identify accepted point mutations, a complete phylogenetic tree including all ancestral sequences has to be constructed. (There are standard procedures for this.) To avoid a large degree of ambiguities in this step, Dayhoff and colleagues restricted their analysis to sequence families with more than 85% identity. Since the evolutionary distance between these highly homologous proteins is small, the construction of the phylogenetic tree can be achieved without too many complicating assumptions. For each of the observed and inferred sequences, the amino acid pair exchanges are tabulated into a 20x20 matrix. It is assumed, that the likelihood of an amino-acid X being replaced by an amino acid Y is the same as Y replacing X. Hence the matrix is constructed symmetrically. (This assumption is probably largely true, if it were not, the amino acid composition of proteins would be in evolutionary disequilibrium.) Note that this process is different from comparing observed sequences directly with each other !

Aij is the number of accepted mutations observed where amino acid i replaces amino acid j.

Second step: Frequencies of occurrence

 
If the properties of amino acids differ and if they occur with different frequencies in proteins, all statements we can make about the average properties of sequences will depend on the frequencies of occurence of the individual amino acids. These frequencies of occurence are approximated by the frequencies of observation. They are the number of occurences of a given amino acid divided by the total number of amino acids that have been observed. Writing aai for the i-th amino acid

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle f_i=\frac{\mbox{observations of }aa_i}{\mbox{all observations}}}

writing i=aa to denote the index running over all 20 amino acids, obviously

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \,\sum_{i=aa} f_i=1}

The appropriate frequencies to use do depend on the organism under consideration (yes, scoring matrices are improved if they take organism-specific amino acid frequencies into account) and they may depend on the type of amino acid, e.g. amino acid frequencies for membrabe proteins are different from those of soluble proteins. However from a practical point of view the theoretical increase in accuracy does not appear sufficient to justify the signifcant additional computational overhead that the work with specialised scoring matrices requires.

Amino acid frequencies

 

AA1978 a1991 b
L 0.085 0.091
A 0.087 0.077
G 0.089 0.074
S 0.070 0.069
V 0.065 0.066
E 0.050 0.062
T 0.058 0.059
K 0.081 0.059
I 0.037 0.053
D 0.047 0.052
R 0.041 0.051
P 0.051 0.051
N 0.040 0.043
Q 0.038 0.041
F 0.040 0.040
Y 0.030 0.032
M 0.015 0.024
H 0.034 0.023
C 0.033 0.020
W 0.010 0.014

a) Frequencies taken from Dayhoff (1978); b) Frequencies taken from the 1991 recompilation of the mutation matrices by Jones et al. (Jones, D.T. Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282) representing a database of observations that is approximately 40 times larger than that available to Dayhoff. Reassuringly, the changes are small. Note the higher abundance of hydrophobic residues, in particular Leu, this may reflect the addition of many membrane proteins to the sequence databases between 1978 and 1991.

Third step: Relative mutabilities

 

To obtain a complete picture of the mutational process, the amino-acids that do not mutate must be taken into account too. We need to know: what is the chance, on average, that a given amino acid will mutate at all. This is the relative mutability of the amino acid. It is obtained by multiplying the number of observed changes by the amino acids frequency of occurence.

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \,m_i=f_i(\mbox{number of times }aa_i\mbox{ is observed to change})}

where Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle m_Ala} is arbitrarily set to 100%.

Relative mutabilities

 

AA1978 a1991 b
S 120 117
T 97 107
N 134 104
I 96 103
A 100 100
V 74 98
M 94 93
H 66 91
D 106 86
Q 93 84
R 65 83
E 102 77
K 56 72
P 56 58
L 40 54
F 41 51
G 49 50
Y 41 50
C 20 44
W 18 25

All values are taken relative to alanine, which is arbitrarily set at 100. As above: a) Frequencies taken from Dayhoff (1978); b) Frequencies taken from the 1991 recompilation of the mutation matrices by Jones et al. (Jones, D.T. Taylor, W.R. & Thornton, J.M. (1991) CABIOS 8:275-282). The difference for some amino acids are quite significant, especially for those amino acids, for which hardly any exchanges have been observed in 1978 (C and W). Serine and threonine are the most mutable amino acids, cysteine and tryptophane are the most immutable.



Fourth step: Mutation probability matrix

 
With the numbers we have calculated and compiled above we can quantify the probability that an amino acid i will replace an amino acid j in an alignment of two homologous sequences: it is the mutability of amino acid j, multiplied by the relative pair exchange frequency for ij (the pair exchange frequency for ij divided by the sum of all pair exchange frequencies for amino acid i).

Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle M_{ij}=m_j\frac{A_{ij}}{\sum_{i=aa}{A_{ij}}}}


Fifth step: Evolutionary distance scale

 


Sixth step: Relatedness odds

 


Last step: the log-odds matrix

 


A variant of the MDM78 PAM250 matrix used in many sequence alignment applications

 


The BLOSUM matrices