Enrichment
Enrichment Analysis
Enrichment analysis addresses the question: do genes in a set have a non-trivial property in common? The methodologies discussed here have applications in many fields of computational biology.
Introductory reading
Tilford & Siemers (2009) Gene set enrichment analysis. Methods Mol Biol 563:99-121. (pmid: 19597782)
Set enrichment analytical methods have become commonplace tools applied to the analysis and interpretation of biological data. The statistical techniques are used to identify categorical biases within lists of genes, proteins, or metabolites. The goal is to discover the shared functions or properties of the biological items represented within the lists. Application of these methods can provide great biological insight, including the discovery of participation in the same biological activity or pathway, shared interacting genes or regulators, common cellular compartmentalization, or association with disease. The methods require ordered or unordered lists of biological items as input, understanding of the reference set from which the lists were selected, categorical classifiers describing the items, and a statistical algorithm to assess bias of each classifier. Due to the complexity of most algorithms and the number of calculations performed, computer software is almost always used for execution of the algorithm, as well as for presentation of the results. This chapter will provide an overview of the statistical methods used to perform an enrichment analysis. Guidelines for assembly of the requisite information will be presented, with a focus on careful definition of the sets used by the statistical algorithms. The need for multiple test correction when working with large libraries of classifiers is emphasized, and we outline several options for performing the corrections. Finally, interpreting the results of such analysis will be discussed along with examples of recent research utilizing the techniques.
Relative Enrichment
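Relative Enrichment is the ratio of the fraction of elements of interest in an observed set to the fraction of elements of interest in a reference set. A value above 1 means the elements of interest are over-represented in the observed set; a value below 1 means they are under-represented.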
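The definition above can be sketched in a few lines of Python. The function name, its parameters, and the example counts are hypothetical and chosen only for illustration; they do not come from any particular tool.

```python
def relative_enrichment(observed_hits, observed_size, reference_hits, reference_size):
    """Ratio of the fraction of elements of interest in the observed set
    to the fraction of elements of interest in the reference set."""
    if observed_size <= 0 or reference_size <= 0:
        raise ValueError("set sizes must be positive")
    if reference_hits <= 0:
        raise ValueError("reference set contains no elements of interest")
    observed_fraction = observed_hits / observed_size
    reference_fraction = reference_hits / reference_size
    return observed_fraction / reference_fraction


# Hypothetical counts: 12 of 100 observed genes carry an annotation
# that 300 of 6000 reference genes carry.
example = relative_enrichment(12, 100, 300, 6000)
print(f"relative enrichment: {example:.2f}")
```

With these made-up counts the observed fraction is 0.12 and the reference fraction is 0.05, so the annotation is over-represented in the observed set. The ratio alone says nothing about statistical significance, which depends on the set sizes as well.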
Functional Annotation Analysis (FAA)
Functional Annotation Analysis (FAA) analyses the enrichment of properties in a set of genes. Such properties may include GO terms, EC codes, membership in pathways, coregulation etc. A good resource for FAA is the DAVID database and server.
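One common way to judge whether a property is enriched in a gene list is a one-sided hypergeometric test: given how many genes in the reference set carry the property, how likely is it to see at least the observed number in a list of this size by chance? The sketch below uses only the Python standard library. The function name and the counts are hypothetical, and this is not necessarily the statistic that DAVID reports.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    N: number of genes in the reference set
    K: number of reference genes that carry the property
    n: number of genes in the observed list
    k: number of observed genes that carry the property
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("counts are inconsistent")
    total = comb(N, n)  # all ways to draw n genes from the reference
    upper = min(n, K)   # largest possible number of annotated genes in the list
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 12 of 100 listed genes are annotated,
# against 300 annotated genes in a reference set of 6000.
p_value = hypergeom_enrichment_p(12, 100, 300, 6000)
print(f"p = {p_value:.3g}")
```

When many properties are tested against the same list, the resulting p-values need a multiple test correction, as the introductory reading emphasizes.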
Gene Set Enrichment Analysis (GSEA)
Gene Set Enrichment Analysis (GSEA) analyses the enrichment of members of a predefined gene set in a set of experimentally observed genes. Such predefined gene sets may come from pathway components, interaction clusters, genes that have particular transcription factor binding sites in common, etc. The default resource is the GSEA software, distributed via the Broad Institute GSEA homepage.
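The core idea behind GSEA can be illustrated with a running-sum statistic over a ranked gene list: walk down the list, step up at each member of the gene set and down at each non-member, and record the largest deviation from zero. The sketch below is a simplified, unweighted version (every hit counts equally). The published method (Subramanian et al. 2005) additionally weights hits by their correlation with the phenotype and estimates significance by permutation, neither of which is shown here. The function name and gene labels are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for a ranked gene list.

    Returns the running-sum value with the largest absolute deviation
    from zero; positive values mean the set is concentrated near the
    top of the ranking, negative values near the bottom.
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("ranked list needs both set members and non-members")
    hit_step = 1.0 / n_hits    # step up for each gene in the set
    miss_step = 1.0 / n_miss   # step down for each gene outside the set
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += hit_step if is_hit else -miss_step
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranking of six genes, most up-regulated first.
ranking = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_heavy = enrichment_score(ranking, {"g1", "g2"})
bottom_heavy = enrichment_score(ranking, {"g5", "g6"})
print(top_heavy, bottom_heavy)
```

In this toy ranking, a set made of the two top-ranked genes yields the maximal positive score, and a set made of the two bottom-ranked genes yields the maximal negative score.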
Exercises
Task:
- Work through the DAVID tutorial published in Nature Protocols:
Huang et al. (2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 4:44-57. (pmid: 19131956)
DAVID bioinformatics resources consists of an integrated biological knowledgebase and analytic tools aimed at systematically extracting biological meaning from large gene/protein lists. This protocol explains how to use DAVID, a high-throughput and integrated data-mining environment, to analyze gene lists derived from high-throughput genomic experiments. The procedure first requires uploading a gene list containing any number of common gene identifiers followed by analysis using one or more text and pathway-mining tools such as gene functional classification, functional annotation chart or clustering and functional annotation table. By following this protocol, investigators are able to gain an in-depth understanding of the biological themes in lists of genes that are enriched in genome-scale studies.
- Access the Web version of the article; it conveniently contains the required links.
- Use Demo List 2, provided on the DAVID site for your analysis. Remember to read the description of the gene list.
- Do not use any of the Java tools. As of this writing, Java applets in Web browsers are considered fundamentally insecure and should be disabled in your browser.
- For each of the analysis steps, think clearly about whether the results support or contradict your expectations about the data. Feel free to discuss your expectations and findings on the mailing list.
- If there are any problems with the assignment, contact me!
Further reading and resources
Merico et al. (2011) Visualizing gene-set enrichment results using the Cytoscape plug-in enrichment map. Methods Mol Biol 781:257-77. (pmid: 21877285)
Gene-set enrichment analysis finds functionally coherent gene-sets, such as pathways, that are statistically overrepresented in a given gene list. Ideally, the number of resulting sets is smaller than the number of genes in the list, thus simplifying interpretation. However, the increasing number and redundancy of gene-sets used by many current enrichment analysis resources work against this ideal. "Enrichment Map" is a Cytoscape plug-in that helps overcome gene-set redundancy and aids in the interpretation of enrichment results. Gene-sets are organized in a network, where each set is a node and links represent gene overlap between sets. Automated network layout groups related gene-sets into network clusters, enabling the user to quickly identify the major enriched functional themes and more easily interpret enrichment results.
Subramanian et al. (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U.S.A 102:15545-50. (pmid: 16199517)
Although genomewide RNA expression analysis has become a routine tool in biomedical research, extracting biological insight from such information remains a major challenge. Here, we describe a powerful analytical method called Gene Set Enrichment Analysis (GSEA) for interpreting gene expression data. The method derives its power by focusing on gene sets, that is, groups of genes that share common biological function, chromosomal location, or regulation. We demonstrate how GSEA yields insights into several cancer-related data sets, including leukemia and lung cancer. Notably, where single-gene analysis finds little similarity between two independent studies of patient survival in lung cancer, GSEA reveals many biological pathways in common. The GSEA method is embodied in a freely available software package, together with an initial database of 1,325 biologically defined gene sets.