Difference between revisions of "CSB Systems extraction"

From "A B C"

Latest revision as of 17:37, 9 September 2014

Systems extraction


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.

An obvious challenge of computational systems biology is to extract and define systems from -omics scale data. A powerful mathematical concept that has been brought to bear on this problem is the mutual information of two variables, which measures how much knowledge of one variable reduces uncertainty about the other. For example, if two genes either always occur together in a genome or are both absent from it, knowing whether one is present is sufficient to know about the other. In biology, genes with high mutual information are typically either components of the same physical complex or collaborate functionally. Measuring mutual information in large datasets can therefore be used to infer such relationships. It is, however, far from the only productive approach in this field.
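To make the presence/absence example concrete, here is a minimal sketch in R that estimates mutual information (in bits) from observed joint frequencies. The function name mutualInformation and the three gene profiles are made up for illustration; real phylogenetic profiles span many more genomes and would need a correction for small-sample bias.

```r
# Mutual information (in bits) between two discrete variables,
# estimated from their observed joint frequencies.
mutualInformation <- function(x, y) {
  joint <- table(x, y) / length(x)   # joint probabilities p(x, y)
  px <- rowSums(joint)               # marginal p(x)
  py <- colSums(joint)               # marginal p(y)
  expected <- outer(px, py)          # p(x) * p(y) under independence
  nonzero <- joint > 0               # 0 * log(0) is taken as 0
  sum(joint[nonzero] * log2(joint[nonzero] / expected[nonzero]))
}

# Hypothetical presence (1) / absence (0) profiles across eight genomes.
geneA <- c(1, 1, 0, 0, 1, 0, 1, 0)
geneB <- geneA                       # always co-occurs with geneA
geneC <- c(1, 0, 1, 0, 1, 1, 0, 0)   # occurs independently of geneA

miAB <- mutualInformation(geneA, geneB)
miAC <- mutualInformation(geneA, geneC)
print(c(AB = miAB, AC = miAC))
```

In this toy case the co-occurring pair shares one full bit of information, while the independent pair shares none. Note that a gene that is present exactly when geneA is absent would also score one bit: mutual information measures dependence, not its direction.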



 

Introductory reading


Villaverde & Banga (2014) Reverse engineering and identification in systems biology: strategies, perspectives and challenges. J R Soc Interface 11:20130505. (pmid: 24307566)

The interplay of mathematical modelling with experiments is one of the central elements in systems biology. The aim of reverse engineering is to infer, analyse and understand, through this interplay, the functional and regulatory mechanisms of biological systems. Reverse engineering is not exclusive of systems biology and has been studied in different areas, such as inverse problem theory, machine learning, nonlinear physics, (bio)chemical kinetics, control theory and optimization, among others. However, it seems that many of these areas have been relatively closed to outsiders. In this contribution, we aim to compare and highlight the different perspectives and contributions from these fields, with emphasis on two key questions: (i) why are reverse engineering problems so hard to solve, and (ii) what methods are available for the particular problems arising from systems biology?

Here is a useful introduction to the use of information theory, in particular mutual information, for the analysis of signal transduction networks.

Waltermann & Klipp (2011) Information theory based approaches to cellular signaling. Biochim Biophys Acta 1810:924-32. (pmid: 21798319)

BACKGROUND: Cells interact with their environment and they have to react adequately to internal and external changes such changes in nutrient composition, physical properties like temperature or osmolarity and other stresses. More specifically, they must be able to evaluate whether the external change is significant or just in the range of noise. Based on multiple external parameters they have to compute an optimal response. Cellular signaling pathways are considered as the major means of information perception and transmission in cells. SCOPE OF REVIEW: Here, we review different attempts to quantify information processing on the level of individual cells. We refer to Shannon entropy, mutual information, and informal measures of signaling pathway cross-talk and specificity. MAJOR CONCLUSIONS: Information theory in systems biology has been successfully applied to identification of optimal pathway structures, mutual information and entropy as system response in sensitivity analysis, and quantification of input and output information. GENERAL SIGNIFICANCE: While the study of information transmission within the framework of information theory in technical systems is an advanced field with high impact in engineering and telecommunication, its application to biological objects and processes is still restricted to specific fields such as neuroscience, structural and molecular biology. However, in systems biology dealing with a holistic understanding of biochemical systems and cellular signaling only recently a number of examples for the application of information theory have emerged. This article is part of a Special Issue entitled Systems Biology of Microorganisms.

Mutual information is at the core of a novel approach to quantify non-linear correlations in data. Read the perspective on this recent work here:

Speed (2011) Mathematics. A correlation for the 21st century. Science 334:1502-3. (pmid: 22174235)



 

Reverse engineering

Villaverde & Banga (2014) Reverse engineering and identification in systems biology: strategies, perspectives and challenges. J R Soc Interface 11:20130505. (pmid: 24307566)


Belcastro et al. (2012) Reverse engineering and analysis of genome-wide gene regulatory networks from gene expression profiles using high-performance computing. IEEE/ACM Trans Comput Biol Bioinform 9:668-78. (pmid: 21464509)

Regulation of gene expression is a carefully regulated phenomenon in the cell. “Reverse-engineering” algorithms try to reconstruct the regulatory interactions among genes from genome-scale measurements of gene expression profiles (microarrays). Mammalian cells express tens of thousands of genes; hence, hundreds of gene expression profiles are necessary in order to have acceptable statistical evidence of interactions between genes. As the number of profiles to be analyzed increases, so do computational costs and memory requirements. In this work, we designed and developed a parallel computing algorithm to reverse-engineer genome-scale gene regulatory networks from thousands of gene expression profiles. The algorithm is based on computing pairwise Mutual Information between each gene-pair. We successfully tested it to reverse engineer the Mus Musculus (mouse) gene regulatory network in liver from gene expression profiles collected from a public repository. A parallel hierarchical clustering algorithm was implemented to discover “communities” within the gene network. Network communities are enriched for genes involved in the same biological functions. The inferred network was used to identify two mitochondrial proteins.

Correlations

This is the paper discussed in the perspective by Speed (2011) listed in the introductory reading section.

Reshef et al. (2011) Detecting novel associations in large data sets. Science 334:1518-24. (pmid: 22174245)

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.


...

 

Exercises

Try out MINE
  1. Create a working folder on your computer (e.g. name it MINE).
  2. Navigate to http://www.exploredata.net/ and follow the link to Downloads.
  3. Follow the link to the Gene Expression Data Set in the side-bar and download Spellman.csv to your folder.
  4. Edit Spellman.csv by duplicating the first row and renaming "time" in the first row to "Name". (Use a plain-text editor, not MS Word!)
  5. Follow the link to MINE application in the side-bar.
  6. Download MINE.jar and MINE.r to your folder.
  7. Follow the link to Parameters in the side-bar and study your options.
  8. Click on the link to Usage-instructions and follow the instructions: How to run MINE in R.
    1. Start R and set your working folder as the working directory (command: setwd(...)).
    2. Use File > Open Document... to open MINE.r
    3. run: install.packages("rJava") ... to download the rJava package from CRAN if it hasn't been installed before
    4. Use File > Source File... to execute the commands in MINE.r. This executes library("rJava") and .jinit(classpath="MINE.jar") and defines the functions MINE and rMINE.
    5. run: MINE("Spellman.csv","two.pairs",1,5) to verify that the installation is OK and you can access the data.
    6. The MCM3 gene that was discussed in Reshef et al. (2011) has the systematic name YEL032W:
What is its index in the table?
# Read the edited table; its first data row holds the time points.
genes <- read.csv("Spellman.csv")
# Show the row for YEL032W; the row name printed on the left is its index.
genes[grep("YEL032W", genes$Name), ]
Plot the gene's expression profile:
# Time points are in row 1; expression values are in columns 2 to 24.
timepoints <- as.numeric(unlist(genes[1, 2:24]))
plot(timepoints, as.numeric(unlist(genes[1017, 2:24])), type = "b", xlab = "time", ylab = "YEL032W expression")
Find genes with a high MIC with YEL032W.
MINE("Spellman.csv", "master.variable", 1017)
Looking at the output text file, you see that YDR191W has a high MIC. Plot its expression profile as an overlay plot, then plot the expression values of one gene against the other. Is this a positive or a negative correlation? Then explore more genes. Can you find a gene that is negatively correlated with YEL032W?
Have fun!
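One possible way to approach the overlay and gene-against-gene plots is sketched below. It is only a sketch: the helper compareProfiles is a made-up name, it assumes the table layout produced in step 4 (a "Name" column, time points in the first data row, one gene per following row), and the small demo table contains invented values rather than data from Spellman.csv.

```r
# Overlay two expression profiles, plot one gene against the other, and
# return their Pearson correlation. Assumes a "Name" column, the time
# points in row 1, and one gene per subsequent row.
compareProfiles <- function(genes, idA, idB) {
  valueCols <- 2:ncol(genes)
  timepoints <- as.numeric(unlist(genes[1, valueCols]))
  a <- as.numeric(unlist(genes[genes$Name == idA, valueCols]))
  b <- as.numeric(unlist(genes[genes$Name == idB, valueCols]))

  # Overlay plot: both profiles against time.
  plot(timepoints, a, type = "b", col = "black",
       ylim = range(c(a, b), na.rm = TRUE),
       xlab = "time", ylab = "expression")
  lines(timepoints, b, type = "b", col = "red")
  legend("topright", legend = c(idA, idB), col = c("black", "red"), lty = 1)

  # Scatter plot: expression of one gene against the other.
  plot(a, b, xlab = idA, ylab = idB)

  # The sign of the Pearson correlation gives the direction of the relationship.
  cor(a, b, use = "complete.obs")
}

# Made-up stand-in for the edited Spellman.csv (values are NOT real data).
demo <- data.frame(
  Name = c("time", "geneX", "geneY"),
  t1 = c(0, 0.1, -0.2), t2 = c(10, 0.8, -0.9),
  t3 = c(20, 0.2, -0.1), t4 = c(30, -0.7, 0.6),
  stringsAsFactors = FALSE
)
r <- compareProfiles(demo, "geneX", "geneY")
print(r)
```

With the real table loaded via read.csv("Spellman.csv"), the call would be compareProfiles(genes, "YEL032W", "YDR191W"). The Pearson correlation is used here only to read off the sign; MIC itself reports the strength of a relationship but not whether it is positive or negative.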


 

Further reading and resources

Reshef et al. (2011) Detecting novel associations in large data sets. Science 334:1518-24. (pmid: 22174245)


Speed (2011) Mathematics. A correlation for the 21st century. Science 334:1502-3. (pmid: 22174235)


Luo & Woolf (2010) Reconstructing transcriptional regulatory networks using three-way mutual information and Bayesian networks. Methods Mol Biol 674:401-18. (pmid: 20827604)

Probabilistic methods such as mutual information and Bayesian networks have become a major category of tools for the reconstruction of regulatory relationships from quantitative biological data. In this chapter, we describe the theoretic framework and the implementation for learning gene regulatory networks using high-order mutual information via the MI3 method (Luo et al. (2008) BMC Bioinformatics 9, 467; Luo (2008) Gene regulatory network reconstruction and pathway inference from high throughput gene expression data. PhD thesis). We also cover the closely related Bayesian network method in detail.

Chou & Voit (2009) Recent developments in parameter estimation and structure identification of biochemical and genomic systems. Math Biosci 219:57-83. (pmid: 19327372)

The organization, regulation and dynamical responses of biological systems are in many cases too complex to allow intuitive predictions and require the support of mathematical modeling for quantitative assessments and a reliable understanding of system functioning. All steps of constructing mathematical models for biological systems are challenging, but arguably the most difficult task among them is the estimation of model parameters and the identification of the structure and regulation of the underlying biological networks. Recent advancements in modern high-throughput techniques have been allowing the generation of time series data that characterize the dynamics of genomic, proteomic, metabolic, and physiological responses and enable us, at least in principle, to tackle estimation and identification tasks using 'top-down' or 'inverse' approaches. While the rewards of a successful inverse estimation or identification are great, the process of extracting structural and regulatory information is technically difficult. The challenges can generally be categorized into four areas, namely, issues related to the data, the model, the mathematical structure of the system, and the optimization and support algorithms. Many recent articles have addressed inverse problems within the modeling framework of Biochemical Systems Theory (BST). BST was chosen for these tasks because of its unique structural flexibility and the fact that the structure and regulation of a biological system are mapped essentially one-to-one onto the parameters of the describing model. The proposed methods mainly focused on various optimization algorithms, but also on support techniques, including methods for circumventing the time consuming numerical integration of systems of differential equations, smoothing overly noisy data, estimating slopes of time series, reducing the complexity of the inference task, and constraining the parameter search space. Other methods targeted issues of data preprocessing, detection and amelioration of model redundancy, and model-free or model-based structure identification. The total number of proposed methods and their applications has by now exceeded one hundred, which makes it difficult for the newcomer, as well as the expert, to gain a comprehensive overview of available algorithmic options and limitations. To facilitate the entry into the field of inverse modeling within BST and related modeling areas, the article presented here reviews the field and proposes an operational 'work-flow' that guides the user through the estimation process, identifies possibly problematic steps, and suggests corresponding solutions based on the specific characteristics of the various available algorithms. The article concludes with a discussion of the present state of the art and with a description of open questions.

Rao et al. (2008) Using directed information to build biologically relevant influence networks. J Bioinform Comput Biol 6:493-519. (pmid: 18574860)

The systematic inference of biologically relevant influence networks remains a challenging problem in computational biology. Even though the availability of high-throughput data has enabled the use of probabilistic models to infer the plausible structure of such networks, their true interpretation of the biology of the process is questionable. In this work, we propose a network inference methodology, based on the directed information (DTI) criterion, that incorporates the biology of transcription within the framework so as to enable experimentally verifiable inference. We use publicly available embryonic kidney and T-cell microarray datasets to demonstrate our results. We present two variants of network inference via DTI--supervised and unsupervised--and the inferred networks relevant to mammalian nephrogenesis and T-cell activation. Conformity of the obtained interactions with the literature as well as comparison with the coefficient of determination (CoD) method are demonstrated. Apart from network inference, the proposed framework enables the exploration of specific interactions, not just those revealed by data. To illustrate the latter point, a DTI-based framework to resolve interactions between transcription factor modules and target coregulated genes is proposed. Additionally, we show that DTI can be used in conjunction with mutual information to infer higher-order influence networks involving cooperative gene interactions.

Wu et al. (2005) Deciphering protein network organization using phylogenetic profile groups. Genome Inform 16:142-9. (pmid: 16362916)

Phylogenetic profiling is now an effective computational method to detect functional associations between proteins. The method links two proteins in accordance with the similarity of their phyletic distributions across a set of genomes. While pair-wise linkage is useful, it misses correlations in higher order groups: triplets, quadruplets, and so on. Here we assess the probability of observing co-occurrence patterns of 3 binary profiles by chance and show that this probability is asymptotically the same as the mutual information in three profiles. We demonstrate the utility of the probability and the mutual information metrics in detecting overly represented triplets of orthologous proteins which could not be detected using pairwise profiles. These triplets serve as small building blocks, i.e. motifs in protein networks; they allow us to infer the function of uncharacterized members, and facilitate analysis of the local structure and global organization of the protein network. Our method is extendable to N-component clusters, and therefore serves as a general tool for high order protein function annotation.

Wu et al. (2003) Identification of functional links between genes using phylogenetic profiles. Bioinformatics 19:1524-30. (pmid: 12912833)

MOTIVATION: Genes with identical patterns of occurrence across the phyla tend to function together in the same protein complexes or participate in the same biochemical pathway. However, the requirement that the profiles be identical (i) severely restricts the number of functional links that can be established by such phylogenetic profiling; (ii) limits detection to very strong functional links, failing to capture relations between genes that are not in the same pathway, but nevertheless subserve a common function and (iii) misses relations between analogous genes. Here we present and apply a method for relaxing the restriction, based on the probability that a given arbitrary degree of similarity between two profiles would occur by chance, with no biological pressure. Function is then inferred at any desired level of confidence. RESULTS: We derive an expression for the probability distribution of a given number of chance co-occurrences of a pair of non-homologous orthologs across a set of genomes. The method is applied to 2905 clusters of orthologous genes (COGs) from 44 fully sequenced microbial genomes representing all three domains of life. Among the results are the following. (1) Of the 51 000 annotated intrapathway gene pairs, 8935 are linked at a level of significance of 0.01. This is over 30-fold greater than the 271 intrapathway pairs obtained at the same confidence level when identical profiles are used. (2) Of the 540 000 interpathway genes pairs, some 65 000 are linked at the 0.01 level of significance, some 12 standard deviations beyond the number expected by chance at this confidence level. We speculate that many of these links involve nearest-neighbor path, and discuss some examples. (3) The difference in the percentage of linked interpathway and intrapathway genes is highly significant, consistent with the intuitive expectation that genes in the same pathway are generally under greater selective pressure than those that are not. (4) The method appears to recover well metabolic networks. This is illustrated by the TCA cycle which is recovered as a highly connected, weighted edge network of 30 of its 31 COGs. (5) The fraction of pairs having a common pathway is a symmetric function of the Hamming distance between their profiles. This finding, that the functional correlation between profiles with near maximum Hamming distance is as large as between profiles with near zero Hamming distance, and as statistically significant, is plausibly explained if the former group represents analogous genes.