CSB Systems extraction

Mutual information

This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.

A powerful concept within the mathematical theory of information, the mutual information of two variables measures how much knowing one variable reduces uncertainty about the other. For example, if two genes always either occur together in a genome or are both absent from it, knowing whether one is present is enough to know whether the other is present too. In biology, genes with high mutual information are often components of the same physical complex or collaborate functionally. Measuring mutual information in large datasets can therefore be used to infer such relationships.
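The gene presence/absence example can be made concrete with a minimal sketch in R. It uses the standard definition I(X;Y) = Σ p(x,y) log2( p(x,y) / (p(x) p(y)) ), estimated from observed frequencies. The function name `mutual_information` and the three toy profiles are hypothetical and chosen only for illustration; they are not taken from any dataset.

```r
# Mutual information (in bits) between two discrete vectors of equal length,
# estimated from observed frequencies.
mutual_information <- function(x, y) {
  stopifnot(length(x) == length(y))
  joint <- table(x, y) / length(x)   # joint probabilities p(x, y)
  px <- rowSums(joint)               # marginal p(x)
  py <- colSums(joint)               # marginal p(y)
  expected <- outer(px, py)          # p(x) * p(y), i.e. independence
  nonzero <- joint > 0               # treat 0 * log(0) as 0
  sum(joint[nonzero] * log2(joint[nonzero] / expected[nonzero]))
}

# Hypothetical presence (1) / absence (0) profiles across eight genomes.
gene_a <- c(1, 1, 0, 0, 1, 1, 0, 0)
gene_b <- gene_a                     # always present or absent together with gene_a
gene_c <- c(1, 0, 1, 0, 1, 0, 1, 0)  # occurs independently of gene_a in this toy set

mi_ab <- mutual_information(gene_a, gene_b)  # 1 bit: one gene determines the other
mi_ac <- mutual_information(gene_a, gene_c)  # 0 bits: one gene says nothing about the other
print(c(mi_ab, mi_ac))
```

With real genomes the profiles are rarely this clean, so values fall between these two extremes, and small samples bias the estimate upwards.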

Introductory reading

Here is a useful introduction to the use of information theory, in particular mutual information, for the analysis of signal transduction networks.

Waltermann & Klipp (2011) Information theory based approaches to cellular signaling. Biochim Biophys Acta 1810:924-32. (pmid: 21798319)

[ PubMed ] [ DOI ] Abstract

Mutual information is at the core of a novel approach to quantify non-linear correlations in data. Read the perspective on this recent work here:

Speed (2011) Mathematics. A correlation for the 21st century. Science 334:1502-3. (pmid: 22174235)

[ PubMed ] [ DOI ]

The actual paper is here; have a look, but its contents will not be material for the quiz.

Reshef et al. (2011) Detecting novel associations in large data sets. Science 334:1518-24. (pmid: 22174245)

[ PubMed ] [ DOI ] Abstract
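To see why an information-based measure can pick up relationships that a linear correlation misses, here is a rough sketch on simulated data. It assumes a naive estimate of mutual information based on equal-width binning. This is not the MIC algorithm of Reshef et al., which searches over many grids and normalizes the result. The function name `binned_mi` and all data are hypothetical.

```r
# Naive mutual information estimate (bits) for two numeric vectors,
# obtained by cutting each into `bins` equal-width intervals.
# Assumption: a simple stand-in for illustration, NOT the MIC algorithm.
binned_mi <- function(x, y, bins = 5) {
  joint <- table(cut(x, bins), cut(y, bins)) / length(x)
  expected <- outer(rowSums(joint), colSums(joint))
  nonzero <- joint > 0               # treat 0 * log(0) as 0
  sum(joint[nonzero] * log2(joint[nonzero] / expected[nonzero]))
}

set.seed(1)                                 # reproducible toy data
x <- runif(500, -1, 1)
y_parabola <- x^2 + rnorm(500, sd = 0.05)   # strong but non-linear dependence
y_noise <- runif(500, -1, 1)                # no dependence on x

r_parabola <- cor(x, y_parabola)            # Pearson correlation, near zero
mi_parabola <- binned_mi(x, y_parabola)     # clearly larger than for noise
mi_noise <- binned_mi(x, y_noise)           # small, sampling noise only
print(c(r_parabola, mi_parabola, mi_noise))
```

The parabola is symmetric about zero, so positive and negative slopes cancel in the Pearson coefficient, while the binned estimate still registers that x constrains y.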

Exercises

Try out MINE

Create a working folder on your computer (e.g. name it MINE).
Navigate to http://www.exploredata.net/ and follow the link to Downloads.
Follow the link to the Gene Expression Data Set in the side-bar and download Spellman.csv to your folder.
Edit Spellman.csv by duplicating the first row and renaming "time" in the first row to "Name". (Don't use MSWord!!!)
Follow the link to MINE application in the side-bar.
Download MINE.jar and MINE.r to your folder.
Follow the link to Parameters in the side-bar and study your options.
Click on the link to Usage-instructions and follow the instructions: How to run MINE in R.
1. Start R and set your working folder as the working directory (command: setwd(...)).
2. Use File → Open Document... to open MINE.r
3. run: install.packages("rJava") ... to download the rJava package from CRAN if it hasn't been installed before
4. Use File → Source File... to execute the commands in MINE.r ... this executes library("rJava") and .jinit(classpath="MINE.jar") and defines the functions MINE and rMINE.
5. run: MINE("Spellman.csv","two.pairs",1,5) to verify that the installation is OK and you can access the data.
6. The MCM3 gene that was discussed in Reshef et al. (2011) has the systematic name YEL032W:

What is its index in the table?

genes <- read.csv("Spellman.csv")
genes[grep("YEL032W",genes$Name),]

Plot the gene's expression profile:

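genes <- read.csv("Spellman.csv")
# Row 1 holds the time points; columns 2 to 24 hold the 23 measurements.
time <- data.matrix(genes[1, 2:24])
# Row 1017 is YEL032W (the index found above).
plot(time, data.matrix(genes[1017, 2:24]))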

Find genes with a high MIC with YEL032W.

MINE("Spellman.csv","master.variable",1017)

Looking at the output text file, you see that YDR191W has a high MIC. Plot its expression profile as an overlay plot, then plot the expression values of one gene against the other. Is this a positive or a negative correlation? Then explore more genes. Can you find a gene that is negatively correlated with YEL032W?
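One possible way to organize the overlay plot and the gene-against-gene plot is sketched below. The helper `compare_profiles` is hypothetical and is not part of MINE.r. The data are synthetic stand-ins so that the example runs without Spellman.csv.

```r
# Compare two expression profiles: overlay plot, scatter plot, and the
# Pearson correlation, whose sign gives the direction of a linear trend.
compare_profiles <- function(time, a, b, labels = c("gene A", "gene B")) {
  # Overlay: both profiles against time on shared axes.
  plot(time, a, type = "b", col = "black", ylim = range(c(a, b)),
       xlab = "time", ylab = "expression")
  lines(time, b, type = "b", col = "red")
  legend("topright", legend = labels, col = c("black", "red"), lty = 1)
  # Scatter: expression values of one gene against the other.
  plot(a, b, xlab = labels[1], ylab = labels[2])
  cor(a, b)                          # > 0 positive, < 0 negative
}

# Synthetic stand-in data (hypothetical, not from Spellman.csv).
time <- seq(40, 260, by = 10)        # 23 made-up time points
a <- sin(time / 20)
b <- -a + 0.1 * cos(time / 7)        # roughly the mirror image of a
r <- compare_profiles(time, a, b)
print(r)
```

For the exercise, pass `data.matrix(genes[1, 2:24])` as `time` and the corresponding columns of the two gene rows as `a` and `b`; the row of YDR191W can be looked up with the same `grep` call used for YEL032W. Keep in mind that a high MIC can also come from a non-linear relationship, in which case the sign of the Pearson correlation says little.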

Have fun!

Further reading and resources

Wu et al. (2003) Identification of functional links between genes using phylogenetic profiles. Bioinformatics 19:1524-30. (pmid: 12912833)

[ PubMed ] [ DOI ] Abstract

MOTIVATION: Genes with identical patterns of occurrence across the phyla tend to function together in the same protein complexes or participate in the same biochemical pathway. However, the requirement that the profiles be identical (i) severely restricts the number of functional links that can be established by such phylogenetic profiling; (ii) limits detection to very strong functional links, failing to capture relations between genes that are not in the same pathway, but nevertheless subserve a common function and (iii) misses relations between analogous genes. Here we present and apply a method for relaxing the restriction, based on the probability that a given arbitrary degree of similarity between two profiles would occur by chance, with no biological pressure. Function is then inferred at any desired level of confidence. RESULTS: We derive an expression for the probability distribution of a given number of chance co-occurrences of a pair of non-homologous orthologs across a set of genomes. The method is applied to 2905 clusters of orthologous genes (COGs) from 44 fully sequenced microbial genomes representing all three domains of life. Among the results are the following. (1) Of the 51 000 annotated intrapathway gene pairs, 8935 are linked at a level of significance of 0.01. This is over 30-fold greater than the 271 intrapathway pairs obtained at the same confidence level when identical profiles are used. (2) Of the 540 000 interpathway genes pairs, some 65 000 are linked at the 0.01 level of significance, some 12 standard deviations beyond the number expected by chance at this confidence level. We speculate that many of these links involve nearest-neighbor path, and discuss some examples. (3) The difference in the percentage of linked interpathway and intrapathway genes is highly significant, consistent with the intuitive expectation that genes in the same pathway are generally under greater selective pressure than those that are not. 
(4) The method appears to recover well metabolic networks. This is illustrated by the TCA cycle which is recovered as a highly connected, weighted edge network of 30 of its 31 COGs. (5) The fraction of pairs having a common pathway is a symmetric function of the Hamming distance between their profiles. This finding, that the functional correlation between profiles with near maximum Hamming distance is as large as between profiles with near zero Hamming distance, and as statistically significant, is plausibly explained if the former group represents analogous genes.

Wu et al. (2005) Deciphering protein network organization using phylogenetic profile groups. Genome Inform 16:142-9. (pmid: 16362916)

[ PubMed ] Abstract

Rao et al. (2008) Using directed information to build biologically relevant influence networks. J Bioinform Comput Biol 6:493-519. (pmid: 18574860)

[ PubMed ] [ DOI ] Abstract

Luo & Woolf (2010) Reconstructing transcriptional regulatory networks using three-way mutual information and Bayesian networks. Methods Mol Biol 674:401-18. (pmid: 20827604)

[ PubMed ] [ DOI ] Abstract

Speed (2011) Mathematics. A correlation for the 21st century. Science 334:1502-3. (pmid: 22174235)

[ PubMed ] [ DOI ]

Reshef et al. (2011) Detecting novel associations in large data sets. Science 334:1518-24. (pmid: 22174245)

[ PubMed ] [ DOI ] Abstract
