CSB Gene lists


Gene Lists


This page was used in a previous iteration of a course. The material on this page is correct, but may be incomplete and is almost certainly a bit out of date.


Although there are many different types of -omics data, many high-throughput or cross-sectional studies in molecular and systems biology produce a list of genes or proteins as their result. Whether these are significantly differentially expressed genes in a microarray study, chromosomal loci in a ChIP-chip experiment, functionally related genes in a synthetic lethality screen, or co-purified proteins in a tandem-affinity MS experiment, the "list of genes" is a common denominator of all these approaches. Accordingly, similar or identical principles can be applied to their interpretation.


Introductory reading

Kim (2012) Chapter 8: Biological knowledge assembly and interpretation. PLoS Comput Biol 8:e1002858. (pmid: 23300429)

Most methods for large-scale gene expression microarray and RNA-Seq data analysis are designed to determine the lists of genes or gene products that show distinct patterns and/or significant differences. The most challenging and rate-limiting step, however, is to determine what the resulting lists of genes and/or transcripts biologically mean. Biomedical ontology and pathway-based functional enrichment analysis is widely used to interpret the functional role of tightly correlated or differentially expressed genes. The groups of genes are assigned to the associated biological annotations using Gene Ontology terms or biological pathways and then tested if they are significantly enriched with the corresponding annotations. Unlike previous approaches, Gene Set Enrichment Analysis takes quite the reverse approach by using pre-defined gene sets. Differential co-expression analysis determines the degree of co-expression difference of paired gene sets across different conditions. Outcomes in DNA microarray and RNA-Seq data can be transformed into the graphical structure that represents biological semantics. A number of biomedical annotation and external repositories including clinical resources can be systematically integrated by biological semantics within the framework of concept lattice analysis. This array of methods for biological knowledge assembly and interpretation has been developed during the past decade and clearly improved our biological understanding of large-scale genomic data from the high-throughput technologies.
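
The enrichment test mentioned in the abstract above is often carried out as a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The following Python sketch illustrates that idea under simplifying assumptions: a single annotation term, no multiple-testing correction, and gene names that are invented placeholders rather than real identifiers. The function names are hypothetical and do not belong to any package cited on this page.

```python
from math import comb


def enrichment_p_value(universe_size, annotated_in_universe,
                       list_size, annotated_in_list):
    """One-sided hypergeometric p-value.

    Probability of seeing at least `annotated_in_list` annotated genes when
    `list_size` genes are drawn without replacement from a universe of
    `universe_size` genes, of which `annotated_in_universe` carry the term.
    """
    largest_possible = min(annotated_in_universe, list_size)
    all_draws = comb(universe_size, list_size)
    tail = sum(
        comb(annotated_in_universe, k)
        * comb(universe_size - annotated_in_universe, list_size - k)
        for k in range(annotated_in_list, largest_possible + 1)
    )
    return tail / all_draws


def term_enrichment(gene_list, annotated_genes, universe):
    """Compute the p-value for one term from plain Python sets."""
    universe = set(universe)
    # Restrict both sets to the universe so the counts are consistent.
    gene_list = set(gene_list) & universe
    annotated_genes = set(annotated_genes) & universe
    overlap = gene_list & annotated_genes
    return enrichment_p_value(len(universe), len(annotated_genes),
                              len(gene_list), len(overlap))


# Toy data: 20 placeholder genes, 5 of which carry a hypothetical annotation.
universe = {f"gene{i}" for i in range(1, 21)}
annotated = {"gene1", "gene2", "gene3", "gene4", "gene5"}
hits = {"gene1", "gene2", "gene3", "gene4", "gene10"}

p = term_enrichment(hits, annotated, universe)
print(f"p-value for the toy term: {p:.6f}")
```

In practice the test is repeated for every term, so the resulting p-values would need a multiple-testing correction before interpretation. The choice of universe (all genes in the genome versus all genes measured in the experiment) also changes the result.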


Further reading and resources

Durinck et al. (2009) Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4:1184-91. (pmid: 19617889)

Genomic experiments produce multiple views of biological systems, among them are DNA sequence and copy number variation, and mRNA and protein abundance. Understanding these systems needs integrated bioinformatic analysis. Public databases such as Ensembl provide relationships and mappings between the relevant sets of probe and target molecules. However, the relationships can be biologically complex and the content of the databases is dynamic. We demonstrate how to use the computational environment R to integrate and jointly analyze experimental datasets, employing BioMart web services to provide the molecule mappings. We also discuss typical problems that are encountered in making gene-to-transcript-to-protein mappings. The approach provides a flexible, programmable and reproducible basis for state-of-the-art bioinformatic data integration.
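
One of the typical mapping problems the abstract alludes to is that identifier relationships are not one-to-one: an identifier may map to several targets, or to none. The sketch below is a toy illustration in Python, not the biomaRt workflow described in the paper. It does not query any BioMart service, and all identifiers in it are invented placeholders.

```python
from collections import defaultdict


def build_mapping(pairs):
    """Collect (source_id, target_id) pairs into a source -> set-of-targets table."""
    mapping = defaultdict(set)
    for source_id, target_id in pairs:
        mapping[source_id].add(target_id)
    return mapping


def classify_ids(ids, mapping):
    """Sort identifiers by how cleanly they map: not at all, uniquely, or ambiguously."""
    report = {"unmapped": [], "unique": [], "ambiguous": []}
    for identifier in ids:
        targets = mapping.get(identifier, set())
        if not targets:
            report["unmapped"].append(identifier)
        elif len(targets) == 1:
            report["unique"].append(identifier)
        else:
            report["ambiguous"].append(identifier)
    return report


# Hypothetical probe-to-gene pairs; none of these are real database IDs.
pairs = [
    ("probe_A", "GENE_1"),
    ("probe_B", "GENE_2"),
    ("probe_B", "GENE_3"),  # one probe matching two genes
]
mapping = build_mapping(pairs)
report = classify_ids(["probe_A", "probe_B", "probe_C"], mapping)
print(report)
```

Reporting the unmapped and ambiguous identifiers explicitly, rather than silently dropping or duplicating them, makes it easier to see how much of a gene list survives the mapping step.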

Boulesteix & Slawski (2009) Stability and aggregation of ranked gene lists. Brief Bioinformatics 10:556-68. (pmid: 19679825)

Ranked gene lists are highly instable in the sense that similar measures of differential gene expression may yield very different rankings, and that a small change of the data set usually affects the obtained gene list considerably. Stability issues have long been under-considered in the literature, but they have grown to a hot topic in the last few years, perhaps as a consequence of the increasing skepticism on the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for the assessment of stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.
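
To make the notions of stability and aggregation concrete, the following Python sketch shows two deliberately simple measures: the fraction of genes shared by the top k positions of two rankings, and aggregation by mean rank. These are generic illustrations, not the methods implemented in GeneSelector. The sketch assumes that every ranking contains exactly the same genes, and the gene names are placeholders.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of genes shared by the top-k positions of two rankings (0 to 1)."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k])) / k


def mean_rank_aggregate(rankings):
    """Aggregate rankings by ordering genes on their mean position.

    Assumes all rankings are permutations of the same genes.
    Ties are broken alphabetically to keep the result deterministic.
    """
    positions = {gene: [] for gene in rankings[0]}
    for ranking in rankings:
        for position, gene in enumerate(ranking, start=1):
            positions[gene].append(position)
    mean_rank = {gene: sum(p) / len(p) for gene, p in positions.items()}
    return sorted(mean_rank, key=lambda gene: (mean_rank[gene], gene))


# Three hypothetical rankings, e.g. from different statistics or resampled data.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "c", "d"],
    ["a", "c", "b", "d"],
]

overlap = top_k_overlap(rankings[0], rankings[2], k=2)
consensus = mean_rank_aggregate(rankings)
print(overlap, consensus)
```

A low top-k overlap between rankings derived from perturbed versions of the same data set is one sign that the list is unstable and should be interpreted with caution.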

Feng et al. (2012) Using the bioconductor GeneAnswers package to interpret gene lists. Methods Mol Biol 802:101-12. (pmid: 22130876)

Use of microarray data to generate expression profiles of genes associated with disease can aid in identification of markers of disease and potential therapeutic targets. Pathway analysis methods further extend expression profiling by creating inferred networks that provide an interpretable structure of the gene list and visualize gene interactions. This chapter describes GeneAnswers, a novel gene-concept network analysis tool available as an open source Bioconductor package. GeneAnswers creates a gene-concept network and also can be used to build protein-protein interaction networks. The package includes an example multiple myeloma cell line dataset and tutorial. Several network analysis methods are included in GeneAnswers, and the tutorial highlights the conditions under which each type of analysis is most beneficial and provides sample code.

Warde-Farley et al. (2010) The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res 38:W214-20. (pmid: 20576703)

GeneMANIA (http://www.genemania.org) is a flexible, user-friendly web interface for generating hypotheses about gene function, analyzing gene lists and prioritizing genes for functional assays. Given a query list, GeneMANIA extends the list with functionally similar genes that it identifies using available genomics and proteomics data. GeneMANIA also reports weights that indicate the predictive value of each selected data set for the query. Six organisms are currently supported (Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus, Homo sapiens and Saccharomyces cerevisiae) and hundreds of data sets have been collected from GEO, BioGRID, Pathway Commons and I2D, as well as organism-specific functional genomics data sets. Users can select arbitrary subsets of the data sets associated with an organism to perform their analyses and can upload their own data sets to analyze. The GeneMANIA algorithm performs as well or better than other gene function prediction methods on yeast and mouse benchmarks. The high accuracy of the GeneMANIA prediction algorithm, an intuitive user interface and large database make GeneMANIA a useful tool for any biologist.