Enrichment
Enrichment Analysis
Enrichment analysis addresses the question: do genes in a set have a non-trivial property in common? The methodologies discussed here have applications in many fields of computational biology.
Introductory reading
Tilford & Siemers (2009) Gene set enrichment analysis. Methods Mol Biol 563:99-121. (pmid: 19597782) |
[ PubMed ] [ DOI ] Set enrichment analytical methods have become commonplace tools applied to the analysis and interpretation of biological data. The statistical techniques are used to identify categorical biases within lists of genes, proteins, or metabolites. The goal is to discover the shared functions or properties of the biological items represented within the lists. Application of these methods can provide great biological insight, including the discovery of participation in the same biological activity or pathway, shared interacting genes or regulators, common cellular compartmentalization, or association with disease. The methods require ordered or unordered lists of biological items as input, understanding of the reference set from which the lists were selected, categorical classifiers describing the items, and a statistical algorithm to assess bias of each classifier. Due to the complexity of most algorithms and the number of calculations performed, computer software is almost always used for execution of the algorithm, as well as for presentation of the results. This chapter will provide an overview of the statistical methods used to perform an enrichment analysis. Guidelines for assembly of the requisite information will be presented, with a focus on careful definition of the sets used by the statistical algorithms. The need for multiple test correction when working with large libraries of classifiers is emphasized, and we outline several options for performing the corrections. Finally, interpreting the results of such analysis will be discussed along with examples of recent research utilizing the techniques. |
Relative Enrichment
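Relative Enrichment is the ratio of the fraction of elements of interest in an observed set to the fraction of elements of interest in a reference set. A value greater than 1 means the elements of interest are over-represented in the observed set relative to the reference; a value less than 1 means they are under-represented.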
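The definition above can be written out as a small calculation. The following is a minimal sketch in Python; the function name, the argument names and the example counts are made up for illustration and do not come from any particular tool.

```python
def relative_enrichment(observed_hits, observed_size, reference_hits, reference_size):
    """Ratio of (fraction of elements of interest in the observed set)
    to (fraction of elements of interest in the reference set)."""
    if observed_size <= 0 or reference_size <= 0:
        raise ValueError("set sizes must be positive")
    if reference_hits <= 0:
        # The ratio is undefined if the property never occurs in the reference.
        raise ValueError("reference set contains no elements of interest")
    observed_fraction = observed_hits / observed_size
    reference_fraction = reference_hits / reference_size
    return observed_fraction / reference_fraction


# Hypothetical counts: 20 of 200 observed genes carry some annotation,
# compared with 500 of 20000 genes in the reference set.
example = relative_enrichment(20, 200, 500, 20000)
print(example)
```

With these hypothetical counts the observed fraction is 0.1 and the reference fraction is 0.025, so the annotation is roughly four times more frequent in the observed set than in the reference. Note that the ratio alone says nothing about statistical significance; small sets can give large ratios by chance.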
Functional Annotation Analysis (FAA)
Functional Annotation Analysis (FAA) analyses the enrichment of properties in a set of genes. Such properties may include GO terms, EC codes, membership in pathways, coregulation, etc. A good resource for FAA is the DAVID database and server.
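To judge whether the enrichment of a property in a gene list could be due to chance, a common choice is a one-sided hypergeometric test: given a reference set of N genes of which K carry the property, how likely is it to find k or more annotated genes in a list of n genes drawn at random? The sketch below uses only the Python standard library. The function and parameter names are illustrative assumptions, and the code is not a description of how DAVID computes its scores.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: number of genes in the reference set
    K: number of reference genes that carry the property
    n: number of genes in the observed list
    k: number of genes in the observed list that carry the property
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)  # number of ways to draw n genes from N
    upper = min(n, K)   # largest possible number of annotated genes in the list
    # comb() returns 0 when more unannotated genes are requested than exist,
    # so impossible terms drop out of the sum automatically.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 20 of 200 listed genes carry the property,
# versus 500 of 20000 genes in the reference set.
p_value = hypergeom_enrichment_p(20, 200, 500, 20000)
print(p_value)
```

When many properties are tested against the same list, as is usual in FAA, the resulting p-values need a multiple test correction; Tilford & Siemers (2009) discuss several options.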
Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment Analysis (GSEA) analyses the enrichment of members of a predefined gene set in a set of experimentally observed genes. Such predefined gene sets may come from pathway components, interaction clusters, genes that have particular transcription factor binding sites in common, etc. The default resource is the GSEA software, distributed via the Broad Institute GSEA homepage.
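The core idea can be illustrated with a running-sum statistic over a ranked gene list: walk down the list, step up whenever a gene belongs to the predefined set, step down otherwise, and record the largest deviation from zero. The sketch below is a simplified, unweighted version written for illustration. It is not the GSEA software's implementation, which weights the steps by each gene's correlation with the phenotype and estimates significance by permutation; the function name and gene identifiers are hypothetical, and the ranked list is assumed to contain each gene only once.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    ranked_genes: gene identifiers ordered from most to least up-regulated
    gene_set: identifiers of the predefined gene set
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    members = set(gene_set)
    n_hits = sum(1 for gene in ranked_genes if gene in members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("ranked list must contain both set members and non-members")
    step_up = 1.0 / n_hits      # increment at a set member
    step_down = 1.0 / n_miss    # decrement at a non-member
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += step_up if gene in members else -step_down
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list of six genes.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
print(enrichment_score(ranked, {"g1", "g2"}))  # set members at the top of the list
print(enrichment_score(ranked, {"g5", "g6"}))  # set members at the bottom of the list
```

A score near +1 indicates that the set members cluster at the top of the ranking, a score near -1 that they cluster at the bottom, and a score near 0 that they are spread throughout the list.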
Exercises
Task:
- Work through the DAVID tutorial published in Nature Protocols:
Huang et al. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4:44-57. (pmid: 19131956) |
[ PubMed ] [ DOI ] DAVID bioinformatics resources consists of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning from large gene/protein lists. This protocol explains how to use DAVID, a high-throughput and integrated data-mining environment, to analyze gene lists derived from high-throughput genomic experiments. The procedure first requires uploading a gene list containing any number of common gene identifiers followed by analysis using one or more text and pathway-mining tools such as gene functional classification, functional annotation chart or clustering and functional annotation table. By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies. |
- Access the Web version of the article; it conveniently contains the required links.
- Use Demo List 2, provided on the DAVID site, for your analysis. Remember to read the description of the gene list.
- Do not use any of the Java tools. As of this writing, Java applets in Web browsers are considered fundamentally insecure; Java should be disabled in your browser.
- For each of the analysis steps, think clearly about whether the results support or contradict your expectations about the data. Feel free to discuss your expectations and findings on the mailing list.
- If there are any problems with the assignment, contact me!
Further reading and resources
Tan et al. (2013) Network2Canvas: network visualization on a canvas with enrichment analysis. Bioinformatics 29:1872-8. (pmid: 23749960) |
[ PubMed ] [ DOI ] MOTIVATION: Networks are vital to computational systems biology research, but visualizing them is a challenge. For networks larger than ∼100 nodes and ∼200 links, ball-and-stick diagrams fail to convey much information. To address this, we developed Network2Canvas (N2C), a web application that provides an alternative way to view networks. N2C visualizes networks by placing nodes on a square toroidal canvas. The network nodes are clustered on the canvas using simulated annealing to maximize local connections where a node's brightness is made proportional to its local fitness. The interactive canvas is implemented in HyperText Markup Language (HTML)5 with the JavaScript library Data-Driven Documents (D3). We applied N2C to visualize 30 canvases made from human and mouse gene-set libraries and 6 canvases made from the Food and Drug Administration (FDA)-approved drug-set libraries. Given lists of genes or drugs, enriched terms are highlighted on the canvases, and their degree of clustering is computed. Because N2C produces visual patterns of enriched terms on canvases, a trained eye can detect signatures instantly. In summary, N2C provides a new flexible method to visualize large networks and can be used to perform and visualize gene-set and drug-set enrichment analyses. AVAILABILITY: N2C is freely available at http://www.maayanlab.net/N2C and is open source. CONTACT: avi.maayan@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
Takemasa et al. (2012) Potential biological insights revealed by an integrated assessment of proteomic and transcriptomic data in human colorectal cancer. Int J Oncol 40:551-9. (pmid: 22025299) |
[ PubMed ] [ DOI ] In the post-genomic era, the main aim of cancer research is organizing the large amount of data on gene expression and protein abundance into a meaningful biological context. Performing integrated analysis of genomic and proteomic data sets is a challenging task. To comprehensively assess the correlation between mRNA and protein expression, we focused on the gene set enrichment analysis, a recently described powerful analytical method. When the differentially expressed proteins in 12 colorectal cancer tissue samples were considered a collective set, they exhibited significant concordance with primary tumor gene expression data in 180 colorectal cancer tissue samples. We found that 53 upregulated proteins were significantly enriched in genes exhibiting elevated gene expression levels (P<0.001, ES=0.53), indicating a positive correlation between the proteomic and transcriptomic data. Similarly, 44 downregulated proteins were significantly enriched in genes exhibiting elevated gene expression levels (P<0.001, ES -0.65). Moreover, we applied gene set enrichment analysis to identify functional genetic pathways in CRC. A relatively large number of upregulated proteins were related to the two principal pathways; ECM receptor interaction was related to heparan sulfate proteoglycan 2 and vitronectin, and ribosome to RPL13, RPL27A, RPL4, RPS18, and RPS29. In conclusion, the integrated understanding of both genomic and proteomic data sets can lead to a better understanding of functional inference at the physiological level and potential molecular targets in clinical settings. |
Merico et al. (2011) Visualizing gene-set enrichment results using the Cytoscape plug-in enrichment map. Methods Mol Biol 781:257-77. (pmid: 21877285) |
[ PubMed ] [ DOI ] Gene-set enrichment analysis finds functionally coherent gene-sets, such as pathways, that are statistically overrepresented in a given gene list. Ideally, the number of resulting sets is smaller than the number of genes in the list, thus simplifying interpretation. However, the increasing number and redundancy of gene-sets used by many current enrichment analysis resources work against this ideal. "Enrichment Map" is a Cytoscape plug-in that helps overcome gene-set redundancy and aids in the interpretation of enrichment results. Gene-sets are organized in a network, where each set is a node and links represent gene overlap between sets. Automated network layout groups related gene-sets into network clusters, enabling the user to quickly identify the major enriched functional themes and more easily interpret enrichment results. |
Irizarry et al. (2009) Gene set enrichment analysis made simple. Stat Methods Med Res 18:565-75. (pmid: 20048385) |
[ PubMed ] [ DOI ] Among the many applications of microarray technology, one of the most popular is the identification of genes that are differentially expressed in two conditions. A common statistical approach is to quantify the interest of each gene with a p-value, adjust these p-values for multiple comparisons, choose an appropriate cut-off, and create a list of candidate genes. This approach has been criticised for ignoring biological knowledge regarding how genes work together. Recently a series of methods, that do incorporate biological knowledge, have been proposed. However, the most popular method, gene set enrichment analysis (GSEA), seems overly complicated. Furthermore, GSEA is based on a statistical test known for its lack of sensitivity. In this article we compare the performance of a simple alternative to GSEA. We find that this simple solution clearly outperforms GSEA. We demonstrate this with eight different microarray datasets. |
Abatangelo et al. (2009) Comparative study of gene set enrichment methods. BMC Bioinformatics 10:275. (pmid: 19725948) |
[ PubMed ] [ DOI ] BACKGROUND: The analysis of high-throughput gene expression data with respect to sets of genes rather than individual genes has many advantages. A variety of methods have been developed for assessing the enrichment of sets of genes with respect to differential expression. In this paper we provide a comparative study of four of these methods: Fisher's exact test, Gene Set Enrichment Analysis (GSEA), Random-Sets (RS), and Gene List Analysis with Prediction Accuracy (GLAPA). The first three methods use associative statistics, while the fourth uses predictive statistics. We first compare all four methods on simulated data sets to verify that Fisher's exact test is markedly worse than the other three approaches. We then validate the other three methods on seven real data sets with known genetic perturbations and then compare the methods on two cancer data sets where our a priori knowledge is limited. RESULTS: The simulation study highlights that none of the three method outperforms all others consistently. GSEA and RS are able to detect weak signals of deregulation and they perform differently when genes in a gene set are both differentially up and down regulated. GLAPA is more conservative and large differences between the two phenotypes are required to allow the method to detect differential deregulation in gene sets. This is due to the fact that the enrichment statistic in GLAPA is prediction error which is a stronger criteria than classical two sample statistic as used in RS and GSEA. This was reflected in the analysis on real data sets as GSEA and RS were seen to be significant for particular gene sets while GLAPA was not, suggesting a small effect size. We find that the rank of gene set enrichment induced by GLAPA is more similar to RS than GSEA. More importantly, the rankings of the three methods share significant overlap. 
CONCLUSION: The three methods considered in our study recover relevant gene sets known to be deregulated in the experimental conditions and pathologies analyzed. There are differences between the three methods and GSEA seems to be more consistent in finding enriched gene sets, although no method uniformly dominates over all data sets. Our analysis highlights the deep difference existing between associative and predictive methods for detecting enrichment and the use of both to better interpret results of pathway analysis. We close with suggestions for users of gene set methods. |
Subramanian et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U.S.A 102:15545-50. (pmid: 16199517) |
[ PubMed ] [ DOI ] Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets. |