From "A B C"

Revision as of 14:52, 28 October 2012

Information theory


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


This is an introduction to information theory for the bioinformatics lab.



 

<math>H = - \sum_{i=0}^{n} p_i \log_{2} p_i</math>

<math>I = H_{ref} - H_{obs}</math>
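
As a rough illustration of the two definitions above, H = -Σ p_i log2 p_i and I = H_ref - H_obs, the following Python sketch computes both for a single alignment column. The function names `shannon_entropy` and `column_information` are hypothetical, and the choice of H_ref as the entropy of a uniform distribution over a four-letter (DNA) alphabet is an assumption made for this example; the page itself does not say how H_ref is chosen, and a background frequency distribution could be used instead.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Return H = -sum(p_i * log2(p_i)) in bits.

    Terms with p_i == 0 are skipped, following the convention 0 * log 0 = 0.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def column_information(column, alphabet_size=4):
    """Return I = H_ref - H_obs for one alignment column.

    Assumption (not from the page): H_ref is the entropy of a uniform
    distribution over `alphabet_size` symbols, i.e. log2(alphabet_size).
    """
    counts = Counter(column)
    total = sum(counts.values())
    # Observed entropy from the relative frequencies in the column.
    h_obs = shannon_entropy(count / total for count in counts.values())
    h_ref = math.log2(alphabet_size)
    return h_ref - h_obs


if __name__ == "__main__":
    print(column_information("AAAAAAAA"))  # fully conserved column: 2.0 bits
    print(column_information("ACGTACGT"))  # uniform column: 0.0 bits
```

Under these assumptions a fully conserved column carries the maximum of 2 bits and a column with equal frequencies of all four nucleotides carries none. For protein alignments one would pass `alphabet_size=20`.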


   

Further reading and resources

Shannon's "Mathematical Theory of Communication" (at Bell Labs): http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

Thomsen & Nielsen (2012) Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion. Nucleic Acids Res 40:W281-7. (pmid: 22638583)

Seq2Logo is a web-based sequence logo generator. Sequence logos are a graphical representation of the information content stored in a multiple sequence alignment (MSA) and provide a compact and highly intuitive representation of the position-specific amino acid composition of binding motifs, active sites, etc. in biological sequences. Accurate generation of sequence logos is often compromised by sequence redundancy and low number of observations. Moreover, most methods available for sequence logo generation focus on displaying the position-specific enrichment of amino acids, discarding the equally valuable information related to amino acid depletion. Seq2logo aims at resolving these issues allowing the user to include sequence weighting to correct for data redundancy, pseudo counts to correct for low number of observations and different logotype representations each capturing different aspects related to amino acid enrichment and depletion. Besides allowing input in the format of peptides and MSA, Seq2Logo accepts input as Blast sequence profiles, providing easy access for non-expert end-users to characterize and identify functionally conserved/variable amino acids in any given protein of interest. The output from the server is a sequence logo and a PSSM. Seq2Logo is available at http://www.cbs.dtu.dk/biotools/Seq2Logo (14 May 2012, date last accessed).

Johansson & Toh (2010) A comparative study of conservation and variation scores. BMC Bioinformatics 11:388. (pmid: 20663120)

BACKGROUND: Conservation and variation scores are used when evaluating sites in a multiple sequence alignment, in order to identify residues critical for structure or function. A variety of scores are available today but it is not clear how different scores relate to each other. RESULTS: We applied 25 conservation and variation scores to alignments from the Catalytic Site Atlas (CSA). We calculated distances among scores based on correlation coefficients, and constructed a dendrogram of the scores by average linking cluster analysis. The cluster analysis showed that most scores fall into one of two groups--substitution matrix based group and frequency based group respectively. We also evaluated the scores' performance in predicting catalytic sites and found that frequency based scores generally perform best. CONCLUSIONS: Conservation and variation scores can be classified into mainly two large groups. When using a score to predict catalytic sites, frequency based scores that also consider a background distribution are most successful.

Dou et al. (2010) Several appropriate background distributions for entropy-based protein sequence conservation measures. J Theor Biol 262:317-22. (pmid: 19808039)

Amino acid background distribution is an important factor for entropy-based methods which extract sequence conservation information from protein multiple sequence alignments (MSAs). However, MSAs are usually not large enough to allow a reliable observed background distribution. In this paper, we propose two new estimations of background distribution. One is an integration of the observed background distribution and the position-specific residue distribution, and the other is a normalized square root of observed background frequency. To validate these new background distributions, they are applied to the relative entropy model to find catalytic sites and ligand binding sites from protein MSAs. Experimental results show that they are superior to the observed background distribution in predicting functionally important residues.

Capra & Singh (2007) Predicting functionally important residues from sequence conservation. Bioinformatics 23:1875-82. (pmid: 17519246)

MOTIVATION: All residues in a protein are not equally important. Some are essential for the proper structure and function of the protein, whereas others can be readily replaced. Conservation analysis is one of the most widely used methods for predicting these functionally important residues in protein sequences. RESULTS: We introduce an information-theoretic approach for estimating sequence conservation based on Jensen-Shannon divergence. We also develop a general heuristic that considers the estimated conservation of sequentially neighboring sites. In large-scale testing, we demonstrate that our combined approach outperforms previous conservation-based measures in identifying functionally important residues; in particular, it is significantly better than the commonly used Shannon entropy measure. We find that considering conservation at sequential neighbors improves the performance of all methods tested. Our analysis also reveals that many existing methods that attempt to incorporate the relationships between amino acids do not lead to better identification of functionally important sites. Finally, we find that while conservation is highly predictive in identifying catalytic sites and residues near bound ligands, it is much less effective in identifying residues in protein-protein interfaces. AVAILABILITY: Data sets and code for all conservation measures evaluated are available at http://compbio.cs.princeton.edu/conservation/

Wang & Samudrala (2006) Incorporating background frequency improves entropy-based residue conservation measures. BMC Bioinformatics 7:385. (pmid: 16916457)

BACKGROUND: Several entropy-based methods have been developed for scoring sequence conservation in protein multiple sequence alignments. High scoring amino acid positions may correlate with structurally or functionally important residues. However, amino acid background frequencies are usually not taken into account in these entropy-based scoring schemes. RESULTS: We demonstrate that using a relative entropy measure that incorporates amino acid background frequency results in improved performance in identifying functional sites from protein multiple sequence alignments. CONCLUSION: Our results suggest that the application of appropriate background frequency information may lead to more biologically relevant results in many areas of bioinformatics.

Vingron & Sibbald (1993) Weighting in sequence space: a comparison of methods in terms of generalized sequences. Proc Natl Acad Sci U S A 90:8777-81. (pmid: 8415606)

Four methods for weighting aligned biological sequences have recently appeared that differ mathematically, philosophically, and in their results. Thus, while there is consensus about the need to weight sequences, the method to use is contentious. A geometric analysis based on a continuous sequence space is presented that provides a common framework in which to compare the methods. It is concluded that there are two "best" methods. When the sequences are known to be phylogenetically related and a tree can be generated without introducing excessive stress into the data, the method of Altschul et al. [Altschul, S. F., Carroll, R. J. & Lipman, D. J. (1989) J. Mol. Biol. 207, 647-653] is appropriate. When the sequences are not known to be phylogenetically related or a tree cannot be produced without unduly distorting the distances between the sequences, a modification of the method of Sibbald and Argos [Sibbald, P. R. & Argos, P. (1990) J. Mol. Biol. 216, 813-818] is preferable.

Henikoff & Henikoff (1994) Position-based sequence weights. J Mol Biol 243:574-8. (pmid: 7966282)

Sequence weighting methods have been used to reduce redundancy and emphasize diversity in multiple sequence alignment and searching applications. Each of these methods is based on a notion of distance between a sequence and an ancestral or generalized sequence. We describe a different approach, which bases weights on the diversity observed at each position in the alignment, rather than on a sequence distance measure. These position-based weights make minimal assumptions, are simple to compute, and perform well in comprehensive evaluations.