Gene regulatory networks




This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


The discovery and definition of gene regulatory networks is one of the major topics of systems biology, not only because of the biological importance of these networks, but also because the basic data can be acquired from the first high-throughput assays in biology: microarray expression profiles.



 

Introductory reading

Baitaluk (2009) System biology of gene regulation. Methods Mol Biol 569:55-87. (pmid: 19623486)

A famous joke story that exhibits the traditionally awkward alliance between theory and experiment and showing the differences between experimental biologists and theoretical modelers is when a University sends a biologist, a mathematician, a physicist, and a computer scientist to a walking trip in an attempt to stimulate interdisciplinary research. During a break, they watch a cow in a field nearby and the leader of the group asks, "I wonder how one could decide on the size of a cow?" Since a cow is a biological object, the biologist responded first: "I have seen many cows in this area and know it is a big cow." The mathematician argued, "The true volume is determined by integrating the mathematical function that describes the outer surface of the cow's body." The physicist suggested: "Let's assume the cow is a sphere...." Finally the computer scientist became nervous and said that he didn't bring his computer because there is no Internet connection up there on the hill. In this humorous but explanatory story suggestions proposed by theorists can be taken to reflect the view of many experimental biologists that computer scientists and theorists are too far removed from biological reality and therefore their theories and approaches are not of much immediate usefulness. Conversely, the statement of the biologist mirrors the view of many traditional theoretical and computational scientists that biological experiments are for the most part simply descriptive, lack rigor, and that much of the resulting biological data are of questionable functional relevance. One of the goals of current biology as a multidisciplinary science is to bring people from different scientific areas together on the same "hill" and teach them to speak the same "language."
In fact, of course, when presenting their data, most experimentalist biologists do provide an interpretation and explanation for the results, and many theorists/computer scientists aim to answer (or at least to fully describe) questions of biological relevance. Thus systems biology could be treated as such a socioscientific phenomenon and a new approach to both experiments and theory that is defined by the strategy of pursuing integration of complex data about the interactions in biological systems from diverse experimental sources using interdisciplinary tools and personnel.


 

Contents

  • Principles


Annotation of transcription factor binding sites

Network discovery from expression time-series

  • The nature of time-series
  • Causality
  • Bayesian methods
  • (Partial) Granger causality
Zou et al. (2009) Granger causality vs. dynamic Bayesian network inference: a comparative study. BMC Bioinformatics 10:122. (pmid: 19393071)

BACKGROUND: In computational biology, one often faces the problem of deriving the causal relationship among different elements such as genes, proteins, metabolites, neurons and so on, based upon multi-dimensional temporal data. Currently, there are two common approaches used to explore the network structure among elements. One is the Granger causality approach, and the other is the dynamic Bayesian network inference approach. Both have at least a few thousand publications reported in the literature. A key issue is to choose which approach is used to tackle the data, in particular when they give rise to contradictory results. RESULTS: In this paper, we provide an answer by focusing on a systematic and computationally intensive comparison between the two approaches on both synthesized and experimental data. For synthesized data, a critical point of the data length is found: the dynamic Bayesian network outperforms the Granger causality approach when the data length is short, and vice versa. We then test our results in experimental data of short length which is a common scenario in current biological experiments: it is again confirmed that the dynamic Bayesian network works better. CONCLUSION: When the data size is short, the dynamic Bayesian network inference performs better than the Granger causality approach; otherwise the Granger causality approach is better.
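
To make the Granger causality idea discussed above concrete, here is a minimal sketch in plain Python. It is not the procedure used by Zou et al.; it assumes only two series, a single lag, and synthetic data in which x drives y by construction. The function names (ols_rss, granger_f) and all parameter values are hypothetical choices made for illustration. The test asks whether adding the lagged "cause" to an autoregressive model of the "effect" reduces the residual sum of squares.

```python
import random

def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            f = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= f * aug[col][c]
    # Back substitution for the regression coefficients.
    beta = [0.0] * k
    for i in reversed(range(k)):
        s = aug[i][k] - sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = s / aug[i][i]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, targets))

def granger_f(cause, effect):
    """F-statistic for 'cause Granger-causes effect', using one lag (assumption)."""
    targets = effect[1:]
    # Restricted model: effect[t] ~ 1 + effect[t-1]
    restricted = [[1.0, effect[t - 1]] for t in range(1, len(effect))]
    # Full model: effect[t] ~ 1 + effect[t-1] + cause[t-1]
    full = [[1.0, effect[t - 1], cause[t - 1]] for t in range(1, len(effect))]
    rss_r = ols_rss(restricted, targets)
    rss_f = ols_rss(full, targets)
    dof = len(targets) - 3  # observations minus parameters of the full model
    return (rss_r - rss_f) / (rss_f / dof)

# Synthetic example (assumption): x is noise, y follows x with a one-step delay.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1) for t in range(1, 200)]

f_xy = granger_f(x, y)  # expected to be large: past x helps predict y
f_yx = granger_f(y, x)  # expected to be small: past y does not help predict x
print(f_xy, f_yx)
```

In practice the statistic would be compared against an F distribution to obtain a p-value, and more lags and conditioning variables (as in partial Granger causality) would be included; both are omitted here to keep the sketch short.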

Yuan et al. (2011) Directed partial correlation: inferring large-scale gene regulatory network through induced topology disruptions. PLoS ONE 6:e16835. (pmid: 21494330)

Inferring regulatory relationships among many genes based on their temporal variation in transcript abundance has been a popular research topic. Due to the nature of microarray experiments, classical tools for time series analysis lose power since the number of variables far exceeds the number of the samples. In this paper, we describe some of the existing multivariate inference techniques that are applicable to hundreds of variables and show the potential challenges for small-sample, large-scale data. We propose a directed partial correlation (DPC) method as an efficient and effective solution to regulatory network inference using these data. Specifically for genomic data, the proposed method is designed to deal with large-scale datasets. It combines the efficiency of partial correlation for setting up network topology by testing conditional independence, and the concept of Granger causality to assess topology change with induced interruptions. The idea is that when a transcription factor is induced artificially within a gene network, the disruption of the network by the induction signifies a gene's role in transcriptional regulation. The benchmarking results using GeneNetWeaver, the simulator for the DREAM challenges, provide strong evidence of the outstanding performance of the proposed DPC method. When applied to real biological data, the inferred starch metabolism network in Arabidopsis reveals many biologically meaningful network modules worthy of further investigation. These results collectively suggest DPC is a versatile tool for genomics research. The R package DPC is available for download (http://code.google.com/p/dpcnet/).


 

References


 

Further reading and resources

Principles
Röttger et al. (2012) How little do we actually know? On the size of gene regulatory networks. IEEE/ACM Trans Comput Biol Bioinform 9:1293-300. (pmid: 22585140)

The National Center for Biotechnology Information (NCBI) recently announced the availability of whole genome sequences for more than 1,000 species. And the number of sequenced individual organisms is growing. Ongoing improvement of DNA sequencing technology will further contribute to this, enabling large-scale evolution and population genetics studies. However, the availability of sequence information is only the first step in understanding how cells survive, reproduce, and adjust their behavior. The genetic control behind organized development and adaptation of complex organisms still remains widely undetermined. One major molecular control mechanism is transcriptional gene regulation. The direct juxtaposition of the total number of sequenced species to the handful of model organisms with known regulations is surprising. Here, we investigate how little we even know about these model organisms. We aim to predict the sizes of the whole-organism regulatory networks of seven species. In particular, we provide statistical lower bounds for the expected number of regulations. For Escherichia coli we estimate at most 37 percent of the expected gene regulatory interactions to be already discovered, 24 percent for Bacillus subtilis, and <3% human, respectively. We conclude that even for our best researched model organisms we still lack substantial understanding of fundamental molecular control mechanisms, at least on a large scale.

El-Samad & Weissman (2011) Genetics: Noise rules. Nature 480:188-9. (pmid: 22158239)

Vaquerizas et al. (2012) How do you find transcription factors? Computational approaches to compile and annotate repertoires of regulators for any genome. Methods Mol Biol 786:3-19. (pmid: 21938617)

Transcription factors (TFs) play an important role in regulating gene expression. The availability of complete genome sequences and associated functional genomic data offer excellent opportunities to understand the transcriptional regulatory system of an entire organism. To do so, however, it is essential to compile a reliable dataset of regulatory components. Here, we review computational methods and publicly accessible resources that help identify TF-coding genes in prokaryotic and eukaryotic genomes. Since the regulatory functions of most TFs remain unknown, we also discuss approaches for combining diverse genomic datasets that will help elucidate their chromosomal organisation, expression, and evolutionary conservation. These analysis methods provide a solid foundation for further investigations of the transcriptional regulatory system.

Pilpel (2011) Noise in biological systems: pros, cons, and mechanisms of control. Methods Mol Biol 759:407-25. (pmid: 21863500)

Genetic regulatory circuits are often regarded as precise machines that accurately determine the level of expression of each protein. Most experimental technologies used to measure gene expression levels are incapable of testing and challenging this notion, as they often measure levels averaged over entire populations of cells. Yet, when expression levels are measured at the single cell level of even genetically identical cells, substantial cell-to-cell variation (or "noise") may be observed. Sometimes different genes in a given genome may display different levels of noise; even the same gene, expressed under different environmental conditions, may display greater cell-to-cell variability in specific conditions and more tight control in other situations. While at first glance noise may seem to be an undesired property of biological networks, it might be beneficial in some cases. For instance, noise will increase functional heterogeneity in a population of microorganisms facing variable, often unpredictable, environmental changes, increasing the probability that some cells may survive the stress. In that respect, we can speculate that the population is implementing a risk distribution strategy, long before genetic heterogeneity could be acquired. Organisms may have evolved to regulate not only the averaged gene expression levels but also the extent of allowed deviations from such an average, setting it at the desired level for every gene under each specific condition. Here we review the evolving understanding of noise, its molecular underpinnings, and its effect on phenotype and fitness--when it can be detrimental, beneficial, or neutral and which regulatory tools eukaryotic cells may use to optimally control it.

Knabe et al. (2010) Genetic algorithms and their application to in silico evolution of genetic regulatory networks. Methods Mol Biol 673:297-321. (pmid: 20835807)

A genetic algorithm (GA) is a procedure that mimics processes occurring in Darwinian evolution to solve computational problems. A GA introduces variation through "mutation" and "recombination" in a "population" of possible solutions to a problem, encoded as strings of characters in "genomes," and allows this population to evolve, using selection procedures that favor the gradual enrichment of the gene pool with the genomes of the "fitter" individuals. GAs are particularly suitable for optimization problems in which an effective system design or set of parameter values is sought. In nature, genetic regulatory networks (GRNs) form the basic control layer in the regulation of gene expression levels. GRNs are composed of regulatory interactions between genes and their gene products, and are, inter alia, at the basis of the development of single fertilized cells into fully grown organisms. This paper describes how GAs may be applied to find functional regulatory schemes and parameter values for models that capture the fundamental GRN characteristics. The central ideas behind evolutionary computation and GRN modeling, and the considerations in GA design and use are discussed, and illustrated with an extended example. In this example, a GRN-like controller is sought for a developmental system based on Lewis Wolpert's French flag model for positional specification, in which cells in a growing embryo secrete and detect morphogens to attain a specific spatial pattern of cellular differentiation.

Harbison et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature 431:99-104. (pmid: 15343339)

DNA-binding transcriptional regulators interpret the genome's regulatory code by binding to specific sequences to induce or repress gene expression. Comparative genomics has recently been used to identify potential cis-regulatory sequences within the yeast genome on the basis of phylogenetic conservation, but this information alone does not reveal if or when transcriptional regulators occupy these binding sites. We have constructed an initial map of yeast's transcriptional regulatory code by identifying the sequence elements that are bound by regulators under various conditions and that are conserved among Saccharomyces species. The organization of regulatory elements in promoters and the environment-dependent use of these elements by regulators are discussed. We find that environment-specific use of regulatory elements predicts mechanistic models for the function of a large population of yeast's transcriptional regulators.


TFBS and Network discovery
Ma et al. (2013) An integrated toolkit for accurate prediction and analysis of cis-regulatory motifs at a genome scale. Bioinformatics 29:2261-8. (pmid: 23846744)

MOTIVATION: We present an integrated toolkit, BoBro2.0, for prediction and analysis of cis-regulatory motifs. This toolkit can (i) reliably identify statistically significant cis-regulatory motifs at a genome scale; (ii) accurately scan for all motif instances of a query motif in specified genomic regions using a novel method for P-value estimation; (iii) provide highly reliable comparisons and clustering of identified motifs, which takes into consideration the weak signals from the flanking regions of the motifs; and (iv) analyze co-occurring motifs in the regulatory regions. RESULTS: We have carried out systematic comparisons between motif predictions using BoBro2.0 and the MEME package. The comparison results on Escherichia coli K12 genome and the human genome show that BoBro2.0 can identify the statistically significant motifs at a genome scale more efficiently, identify motif instances more accurately and get more reliable motif clusters than MEME. In addition, BoBro2.0 provides correlational analyses among the identified motifs to facilitate the inference of joint regulation relationships of transcription factors. AVAILABILITY: The source code of the program is freely available for noncommercial uses at http://code.google.com/p/bobro/. CONTACT: xyn@bmb.uga.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Belcastro et al. (2012) Reverse engineering and analysis of genome-wide gene regulatory networks from gene expression profiles using high-performance computing. IEEE/ACM Trans Comput Biol Bioinform 9:668-78. (pmid: 21464509)

Regulation of gene expression is a carefully regulated phenomenon in the cell. “Reverse-engineering” algorithms try to reconstruct the regulatory interactions among genes from genome-scale measurements of gene expression profiles (microarrays). Mammalian cells express tens of thousands of genes; hence, hundreds of gene expression profiles are necessary in order to have acceptable statistical evidence of interactions between genes. As the number of profiles to be analyzed increases, so do computational costs and memory requirements. In this work, we designed and developed a parallel computing algorithm to reverse-engineer genome-scale gene regulatory networks from thousands of gene expression profiles. The algorithm is based on computing pairwise Mutual Information between each gene-pair. We successfully tested it to reverse engineer the Mus Musculus (mouse) gene regulatory network in liver from gene expression profiles collected from a public repository. A parallel hierarchical clustering algorithm was implemented to discover “communities” within the gene network. Network communities are enriched for genes involved in the same biological functions. The inferred network was used to identify two mitochondrial proteins.
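
The core quantity in the abstract above, pairwise mutual information between gene expression profiles, can be illustrated with a short serial sketch. This is not the parallel algorithm of Belcastro et al.; it assumes that expression values have already been discretized into "low" (0) and "high" (1), and the gene names and profiles are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2

def mutual_information(a, b):
    """Mutual information (in bits) between two discretized profiles."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # Sum over observed value pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
    return sum((c / n) * log2(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())

# Hypothetical discretized expression profiles over eight samples.
profiles = {
    "geneA": [0, 0, 1, 1, 0, 1, 0, 1],
    "geneB": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to geneA
    "geneC": [0, 1, 0, 1, 0, 0, 1, 1],  # statistically independent of geneA
}

# Score every gene pair; high scores are candidate network edges.
mi = {(g, h): mutual_information(profiles[g], profiles[h])
      for g, h in combinations(profiles, 2)}
for pair, score in mi.items():
    print(pair, round(score, 3))
```

A real analysis would add a discretization or density-estimation step and a significance threshold on the scores. Because the number of gene pairs grows quadratically with the number of genes, parallel computation becomes attractive at genome scale.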

Schultheiss (2010) Kernel-based identification of regulatory modules. Methods Mol Biol 674:213-23. (pmid: 20827594)

The challenge of identifying cis-regulatory modules (CRMs) is an important milestone for the ultimate goal of understanding transcriptional regulation in eukaryotic cells. It has been approached, among others, by motif-finding algorithms that identify overrepresented motifs in regulatory sequences. These methods succeed in finding single, well-conserved motifs, but fail to identify combinations of degenerate binding sites, like the ones often found in CRMs. We have developed a method that combines the abilities of existing motif finding with the discriminative power of a machine learning technique to model the regulation of genes (Schultheiss et al. (2009) Bioinformatics 25, 2126-2133). Our software is called KIRMES, which stands for kernel-based identification of regulatory modules in eukaryotic sequences. Starting from a set of genes thought to be co-regulated, KIRMES can identify the key CRMs responsible for this behavior and can be used to determine for any other gene not included on that list if it is also regulated by the same mechanism. Such gene sets can be derived from microarrays, chromatin immunoprecipitation experiments combined with next-generation sequencing or promoter/whole genome microarrays. The use of an established machine learning method makes the approach fast to use and robust with respect to noise. By providing easily understood visualizations for the results returned, they become interpretable and serve as a starting point for further analysis. Even for complex regulatory relationships, KIRMES can be a helpful tool in directing the design of biological experiments.

Hickman & Hodgman (2009) Inference of gene regulatory networks using boolean-network inference methods. J Bioinform Comput Biol 7:1013-29. (pmid: 20014476)

The modeling of genetic networks especially from microarray and related data has become an important aspect of the biosciences. This review takes a fresh look at a specific family of models used for constructing genetic networks, the so-called Boolean networks. The review outlines the various different types of Boolean network developed to date, from the original Random Boolean Network to the current Probabilistic Boolean Network. In addition, some of the different inference methods available to infer these genetic networks are also examined. Where possible, particular attention is paid to input requirements as well as the efficiency, advantages and drawbacks of each method. Though the Boolean network model is one of many models available for network inference today, it is well established and remains a topic of considerable interest in the field of genetic network inference. Hybrids of Boolean networks with other approaches may well be the way forward in inferring the most informative networks.
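
As a small illustration of what a Boolean network model looks like, the sketch below simulates a hypothetical three-gene network with synchronous updates and finds the attractor reached from a given start state. The wiring (C represses A, A activates B, B activates C) is an invented example, not taken from the review, and the function names are assumptions.

```python
def step(state):
    """Synchronous update of a hypothetical 3-gene Boolean network."""
    a, b, c = state
    # Assumed rules: A is ON unless C is ON; B copies A; C copies B.
    return (1 - c, a, b)

def attractor(state):
    """Iterate until a state repeats and return the cycle that was entered."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]

cycle_from_100 = attractor((1, 0, 0))
cycle_from_101 = attractor((1, 0, 1))
print(cycle_from_100)  # a cycle of six states
print(cycle_from_101)  # a cycle of two states
```

Inference methods for Boolean networks work in the opposite direction: given observed state transitions (for example, binarized expression time-series), they search for update rules that are consistent with the data.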

Chan et al. (2009) Discovering multiple realistic TFBS motifs based on a generalized model. BMC Bioinformatics 10:321. (pmid: 19811641)

BACKGROUND: Identification of transcription factor binding sites (TFBSs) is a central problem in Bioinformatics on gene regulation. de novo motif discovery serves as a promising way to predict and better understand TFBSs for biological verifications. Real TFBSs of a motif may vary in their widths and their conservation degrees within a certain range. Deciding a single motif width by existing models may be biased and misleading. Additionally, multiple, possibly overlapping, candidate motifs are desired and necessary for biological verification in practice. However, current techniques either prohibit overlapping TFBSs or lack explicit control of different motifs. RESULTS: We propose a new generalized model to tackle the motif widths by considering and evaluating a width range of interest simultaneously, which should better address the width uncertainty. Moreover, a meta-convergence framework for genetic algorithms (GAs), is proposed to provide multiple overlapping optimal motifs simultaneously in an effective and flexible way. Users can easily specify the difference amongst expected motif kinds via similarity test. Incorporating Genetic Algorithm with Local Filtering (GALF) for searching, the new GALF-G (G for generalized) algorithm is proposed based on the generalized model and meta-convergence framework. CONCLUSION: GALF-G was tested extensively on over 970 synthetic, real and benchmark datasets, and is usually better than the state-of-the-art methods. The range model shows an increase in sensitivity compared with the single-width ones, while providing competitive precisions on the E. coli benchmark. Effectiveness can be maintained even using a very small population, exhibiting very competitive efficiency. In discovering multiple overlapping motifs in a real liver-specific dataset, GALF-G outperforms MEME by up to 73% in overall F-scores. GALF-G also helps to discover an additional motif which has probably not been annotated in the dataset.
http://www.cse.cuhk.edu.hk/%7Etmchan/GALFG/

Myers et al. (2009) Discovering biological networks from diverse functional genomic data. Methods Mol Biol 563:157-75. (pmid: 19597785)

Recent advances in biotechnology have produced a wealth of genomic data, which capture a variety of complementary cellular features. While these data promise to yield key insights into molecular biology, much of the available information remains underutilized because of the lack of scalable approaches for integrating signals across large, diverse data sets. A proper framework for capturing these numerous snapshots of complementary phenomena under a variety of conditions can provide the holistic view necessary for developing precise systems-level hypotheses. Here we describe bioPIXIE, a system for combining information from diverse genomic data sets to predict biological networks. bioPIXIE utilizes a Bayesian framework for probabilistic integration of several high-throughput genomic data types including gene expression, protein-protein interactions, genetic interactions, protein localization, and sequence data to predict biological networks. The main purpose of the system is to support user-driven exploration through the inferred functional network, which is enabled by a public, web-based interface. We describe the features and supporting methods of this integration and discovery framework and present case examples where bioPIXIE has been used to generate specific, testable hypotheses for Saccharomyces cerevisiae, many of which have been confirmed experimentally.

Lee & Tzou (2009) Computational methods for discovering gene networks from expression data. Brief Bioinformatics 10:408-23. (pmid: 19505889)

Designing and conducting experiments are routine practices for modern biologists. The real challenge, especially in the post-genome era, usually comes not from acquiring data, but from subsequent activities such as data processing, analysis, knowledge generation and gaining insight into the research question of interest. The approach of inferring gene regulatory networks (GRNs) has been flourishing for many years, and new methods from mathematics, information science, engineering and social sciences have been applied. We review different kinds of computational methods biologists use to infer networks of varying levels of accuracy and complexity. The primary concern of biologists is how to translate the inferred network into hypotheses that can be tested with real-life experiments. Taking the biologists' viewpoint, we scrutinized several methods for predicting GRNs in mammalian cells, and more importantly show how the power of different knowledge databases of different types can be used to identify modules and subnetworks, thereby reducing complexity and facilitating the generation of testable hypotheses.

Hecker et al. (2009) Gene regulatory network inference: data integration in dynamic models-a review. BioSystems 96:86-103. (pmid: 19150482)

Systems biology aims to develop mathematical models of biological systems by integrating experimental and theoretical techniques. During the last decade, many systems biological approaches that base on genome-wide data have been developed to unravel the complexity of gene regulation. This review deals with the reconstruction of gene regulatory networks (GRNs) from experimental data through computational methods. Standard GRN inference methods primarily use gene expression data derived from microarrays. However, the incorporation of additional information from heterogeneous data sources, e.g. genome sequence and protein-DNA interaction data, clearly supports the network inference process. This review focuses on promising modelling approaches that use such diverse types of molecular biological information. In particular, approaches are discussed that enable the modelling of the dynamics of gene regulatory systems. The review provides an overview of common modelling schemes and learning algorithms and outlines current challenges in GRN modelling.

Segal et al. (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 34:166-76. (pmid: 12740579)

Much of a cell's activity is organized as a network of interacting modules: sets of genes coregulated to respond to different conditions. We present a probabilistic method for identifying regulatory modules from gene expression data. Our procedure identifies modules of coregulated genes, their regulators and the conditions under which regulation occurs, generating testable hypotheses in the form 'regulator X regulates module Y under conditions W'. We applied the method to a Saccharomyces cerevisiae expression data set, showing its ability to identify functionally coherent modules and their correct regulators. We present microarray experiments supporting three novel predictions, suggesting regulatory roles for previously uncharacterized proteins.


Applications
Ishihama (2012) Prokaryotic genome regulation: a revolutionary paradigm. Proc Jpn Acad., Ser B, Phys Biol Sci 88:485-508. (pmid: 23138451)

After determination of the whole genome sequence, the research frontier of bacterial molecular genetics has shifted to reveal the genome regulation under stressful conditions in nature. The gene selectivity of RNA polymerase is modulated after interaction with two groups of regulatory proteins, 7 sigma factors and 300 transcription factors. For identification of regulation targets of transcription factors in Escherichia coli, we have developed Genomic SELEX system and subjected to screening the binding sites of these factors on the genome. The number of regulation targets by a single transcription factor was more than those hitherto recognized, ranging up to hundreds of promoters. The number of transcription factors involved in regulation of a single promoter also increased to as many as 30 regulators. The multi-target transcription factors and the multi-factor promoters were assembled into complex networks of transcription regulation. The most complex network was identified in the regulation cascades of transcription of two master regulators for planktonic growth and biofilm formation.

Alberghina et al. (2009) Systems biology of the cell cycle of Saccharomyces cerevisiae: From network mining to system-level properties. Biotechnol Adv 27:960-978. (pmid: 19465107)

Following a brief description of the operational procedures of systems biology (SB), the cell cycle of budding yeast is discussed as a successful example of a top-down SB analysis. After the reconstruction of the steps that have led to the identification of a sizer plus timer network in the G1 to S transition, it is shown that basic functions of the cell cycle (the setting of the critical cell size and the accuracy of DNA replication) are system-level properties, detected only by integrating molecular analysis with modelling and simulation of their underlying networks. A detailed network structure of a second relevant regulatory step of the cell cycle, the exit from mitosis, derived from extensive data mining, is constructed and discussed. To reach a quantitative understanding of how nutrients control, through signalling, metabolism and transcription, cell growth and cycle is a very relevant aim of SB. Since we know that about 900 gene products are required for cell cycle execution and control in budding yeast, it is quite clear that a purely systematic approach would require too much time. Therefore lines for a modular SB approach, which prioritises molecular and computational investigations for faster cell cycle understanding, are proposed. The relevance of the insight coming from the cell cycle SB studies in developing a new framework for tackling very complex biological processes, such as cancer and aging, is discussed.

Csikász-Nagy (2009) Computational systems biology of the cell cycle. Brief Bioinformatics 10:424-34. (pmid: 19270018)

One of the early success stories of computational systems biology was the work done on cell-cycle regulation. The earliest mathematical descriptions of cell-cycle control evolved into very complex, detailed computational models that describe the regulation of cell division in many different cell types. On the way these models predicted several dynamical properties and unknown components of the system that were later experimentally verified/identified. Still, research on this field is far from over. We need to understand how the core cell-cycle machinery is controlled by internal and external signals, also in yeast cells and in the more complex regulatory networks of higher eukaryotes. Furthermore, there are many computational challenges what we face as new types of data appear thanks to continuing advances in experimental techniques. We have to deal with cell-to-cell variations, revealed by single cell measurements, as well as the tremendous amount of data flowing from high throughput machines. We need new computational concepts and tools to handle these data and develop more detailed, more precise models of cell-cycle regulation in various organisms. Here we review past and present of computational modeling of cell-cycle regulation, and discuss possible future directions of the field.

Efroni et al. (2007) Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS ONE 2:e425. (pmid: 17487280)

Cancer is recognized to be a family of gene-based diseases whose causes are to be found in disruptions of basic biologic processes. An increasingly deep catalogue of canonical networks details the specific molecular interaction of genes and their products. However, mapping of disease phenotypes to alterations of these networks of interactions is accomplished indirectly and non-systematically. Here we objectively identify pathways associated with malignancy, staging, and outcome in cancer through application of an analytic approach that systematically evaluates differences in the activity and consistency of interactions within canonical biologic processes. Using large collections of publicly accessible genome-wide gene expression, we identify small, common sets of pathways - Trka Receptor, Apoptosis response to DNA Damage, Ceramide, Telomerase, CD40L and Calcineurin - whose differences robustly distinguish diverse tumor types from corresponding normal samples, predict tumor grade, and distinguish phenotypes such as estrogen receptor status and p53 mutation state. Pathways identified through this analysis perform as well or better than phenotypes used in the original studies in predicting cancer outcome. This approach provides a means to use genome-wide characterizations to map key biological processes to important clinical features in disease.