Mutation Data Matrices

From "A B C"
Revision as of 16:41, 18 December 2006 by Boris (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search
Mutation Data Matrices are computational tools to quantify how well an observed pair-exchange conforms to a particular model of exchanges, usually a model of evolution.

One really has to understand the above sentence, if one wants to understand the whole business of sequence aligning and similarity scores in bioinformatics. We have algorithms that compute optimal sequence alignments - either global (Needleman-Wunsch), or local (Smith-Waterman). But what does optimal really mean? Optimal means that the alignment gives us the highest score that is possible, for a given mutation matrix. Actually: for a given matrix plus a given set of parameters for insertions and deletions, but since the judicious choice of these parameters depends on the matrix as well, for this discussion here we will assume it is only the matrix.

So since evrything depends on the matrix, what is such a matrix? It is an ordered set of numbers that associates avery possible pair of amino acids with a score. The crux of the matter is how one comes up with such a score. Ultimately, all such scores measure the probability that two amino acids are related by evolution. We can devise several computational models of how to do such a measurement.

  • We can argue that homologous sequences should not have mutated at all. Thus we should assign a value of 1 for identical amino acids, a value of 0 for mutations. This is the identity matrix.
  • We can argue that evolution proceeds by point mutations in the coding sequence. Thus the number of nucleotide changes required to change one codon into another should be small. We score amino acids with similar codons highly, those with dissimilar codons poorly. This is the genetic code matrix.
  • We can argue that selection after mutation favours similar amino acids. Thus if two amino acids are related, they have a higher probability of being similar than being dissimilar. Similarity can be defined from first principles from some quantitative measure of size, charge, hydrophopicity etc. This lead us tho an biophysical matrix.

However all these arguments from first principles have associated problems. An empirical approach seems more appropriate to score amino acid similarity: let's observe what nature allows as replacements in sequences thaqt we know are homologous.

This approach postulates that once the evolutionary relationship of two sequences is established, the residues that did exchange can be called similar. This is the principle behind the mutation matrices compiled by Margaret O. Dayhoff and colleagues at the National Biomedical Research Foundation in the early 70s (Dayhoff, M.O. et al. (1978) Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3 National Biomedical Research Foundation, Washington D.C. U.S.A). They developed a precise and rigorous approach to implement a model of evolutionary change in their mutation data matrix.Their model allows to quantify the odds that a given alignment of two protein sequences would be observed by chance or would demonstrate an echo of a common ancestral sequence.

We have since obtained access to a very large number of sequences, and we can compile directly some of the numbers that Dayhoff had to infer from the limited dataset she had available. But the principle of the process is necessarily the same for any related construction of a computationla tool that quantifies probabilities. It is a paradigm for the construction of any mutation data matrix.


How is the Dayhoff mutation data matrix constructed ? Since we are looking for a matrix that will recognize significant evolutionary relationships, we must first define a quantitative model of evolution. The model used by Dayhoff states that proteins evolve through a succesion of independent point mutations, that are accepted in a population and subsequently can be observed in the sequence pool. Formally, the fact that the mutations are treated independent of their neigborhood and of their history makes this a Markov model. We can define an evolutionary distance between two sequences as the number of point mutations that was necessary to evolve one sequence into the other. (More precisely, the distance is the minimal number of mutations.) Two aspects of this process cause the evolutionary distance to be unequal in general to the number of observed differences between the sequences:

  • First, there is a chance that a certain residue may have mutated, than reverted, hiding the effect of the mutation. This phenomenon is important in the evaluation of biological clocks and the question of how many mutational events may become fixed per time unit. In the context of the discussion of mutation matrices we do not need to consider this effect.
  • Second, specific residues may have mutated more than once, thus the number of point mutations is likely to be larger than the number of differences between the two sequences. This has to be taken into account.


First step: Pair Exchange Frequencies