Transcriptome



This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


The transcriptome is the set of a cell's mRNA molecules. Microarray technology, the quantitative, sequence-specific hybridization of nucleic acids, was the first domain of massively parallel, high-throughput biology. Quantifying gene expression levels in a tissue-, development-, or response-specific manner has yielded detailed insight into cellular function at the molecular level. Yet, while the questions remain the same, high-throughput sequencing methods are rapidly supplanting microarrays as the source of the data. Moreover, we now realize that the transcriptome is not just a passive buffer of expressed information: an entire, complex, intrinsic level of regulation through hybridization of small RNAs, such as microRNAs, has been discovered.



 

Introductory reading

Malone & Oliver (2011) Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol 9:34. (pmid: 21627854)

Microarrays first made the analysis of the transcriptome possible, and have produced much important information. Today, however, researchers are increasingly turning to direct high-throughput sequencing -- RNA-Seq -- which has considerable advantages for examining transcriptome fine structure -- for example in the detection of allele-specific expression and splice junctions. In this article, we discuss the relative merits of the two techniques, the inherent biases in each, and whether all of the vast body of array work needs to be revisited using the newer technology. We conclude that microarrays remain useful and accurate tools for measuring expression levels, and RNA-Seq complements and extends microarray measurements.


 

Background

The transcriptome originates from the genome (mostly, that is), and it results in the proteome (again: mostly). RNA that is transcribed from the genome is not yet fit for translation but must first be processed: splicing is ubiquitous[1], and RNA editing has been encountered in many species. Some authors therefore refer to the exome, the set of transcribed exons, to indicate the actual coding sequence.

The dark matter of the transcriptome may just be noise[2].


  • Microarray standards and databases
  • Working with expression data
  • Interpretation


 

Exercises

Please review the organisation and services of GEO, the microarray data repository at the NCBI.

Barrett et al. (2011) NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 39:D1005-10. (pmid: 21097893)

A decade ago, the Gene Expression Omnibus (GEO) database was established at the National Center for Biotechnology Information (NCBI). The original objective of GEO was to serve as a public repository for high-throughput gene expression data generated mostly by microarray technology. However, the research community quickly applied microarrays to non-gene-expression studies, including examination of genome copy number variation and genome-wide profiling of DNA-binding proteins. Because the GEO database was designed with a flexible structure, it was possible to quickly adapt the repository to store these data types. More recently, as the microarray community switches to next-generation sequencing technologies, GEO has again adapted to host these data sets. Today, GEO stores over 20,000 microarray- and sequence-based functional genomics studies, and continues to handle the majority of direct high-throughput data submissions from the research community. Multiple mechanisms are provided to help users effectively search, browse, download and visualize the data at the level of individual genes or entire studies. This paper describes recent database enhancements, including new search and data representation tools, as well as a brief review of how the community uses GEO data. GEO is freely accessible at http://www.ncbi.nlm.nih.gov/geo/.

Barrett et al. (2013) NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41:D991-5. (pmid: 23193258)

The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.

Task:


  • Navigate to the GEO browser page.
  • In the "Series" tab, enter metastasis as a search term.
  • Click on the GEO accession code for GSE42952.
  • Click on the Analyze with GEO2R link below the experiment summary.
  • Click on the Tissue column sort-icon to sort the series by tissue type. Define a group for Primary tumor and a group for Metastasis as explained in the video, then associate all PDAC samples with the former group and all metastasis-associated samples with the latter.
  • Click on the Top 250 button to execute the analysis of significantly differentially expressed genes.
  • By clicking on gene names, you can view the expression profiles. Find the gene for which the expression in metastasis samples is most consistently upregulated.
  • Review the R script for your analysis. Ask on the mailing list if there are any aspects of the script that are not straightforward to understand. But no worries, R code analysis will not be required on the next quiz.

There is much more exploring you can do, but this will be enough for a first introduction to expression analysis.
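The comparison performed in the task above can be illustrated with a small, self-contained sketch. The script that GEO2R generates is written in R and relies on the Bioconductor packages GEOquery and limma; the Python code below is not that script. It is a simplified stand-in that ranks genes by a plain Welch t-statistic rather than limma's moderated statistic, and the gene names and expression values are hypothetical toy data, not values from GSE42952. It is meant only to show why a gene that is consistently upregulated in every metastasis sample ranks above a gene whose average is raised by a single outlier.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples.

    Assumes each sample has at least two values and that the combined
    standard error is nonzero (an assumption of this sketch).
    """
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def rank_genes(expression, groups, case="Metastasis", control="Primary tumor"):
    """Rank genes by the t-statistic of case versus control samples.

    expression: dict mapping gene -> list of log2 expression values,
                one value per sample
    groups:     list of group labels in the same sample order
    Returns (gene, log2 fold change, t) tuples, highest t first.
    """
    case_idx = [i for i, g in enumerate(groups) if g == case]
    ctrl_idx = [i for i, g in enumerate(groups) if g == control]
    results = []
    for gene, values in expression.items():
        a = [values[i] for i in case_idx]
        b = [values[i] for i in ctrl_idx]
        # On log2-scale data, the difference of means is the log2 fold change.
        log_fc = statistics.mean(a) - statistics.mean(b)
        results.append((gene, log_fc, welch_t(a, b)))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Hypothetical toy data: three primary tumor and three metastasis samples.
groups = ["Primary tumor"] * 3 + ["Metastasis"] * 3
expression = {
    "GENE_A": [5.0, 5.2, 4.9, 7.1, 7.0, 7.2],  # consistently up in metastasis
    "GENE_B": [5.0, 5.1, 4.9, 5.0, 9.5, 5.1],  # raised in one sample only
    "GENE_C": [6.0, 6.1, 5.9, 6.0, 6.1, 5.9],  # unchanged
}

ranked = rank_genes(expression, groups)
for gene, log_fc, t in ranked:
    print(f"{gene}\tlog2FC={log_fc:+.2f}\tt={t:.2f}")
```

Both GENE_A and GENE_B have a positive mean fold change, but the t-statistic divides the difference by its standard error, so the gene with low within-group variability comes out on top. A real analysis would additionally correct the resulting p-values for multiple testing, which this sketch omits.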


 

References

  1. Splicing is primarily a eukaryotic achievement, but strictly speaking it is not exclusive to eukaryotes: many instances of splicing have been recognized in prokaryotes as well.
  2. Jarvis & Robertson (2011) The noncoding universe. BMC Biol 9:52. (pmid: 21798102)



 

Further reading and resources

Vera et al. (2013) MicroRNA-regulated networks: the perfect storm for classical molecular biology, the ideal scenario for systems biology. Adv Exp Med Biol 774:55-76. (pmid: 23377968)

MicroRNAs (miRNAs) are involved in many regulatory pathways some of which are complex networks enriched in regulatory motifs like positive or negative feedback loops or coherent and incoherent feedforward loops. Their complexity makes the understanding of their regulation difficult and the interpretation of experimental data cumbersome. In this book chapter we claim that systems biology is the appropriate approach to investigate the regulation of these miRNA-regulated networks. Systems biology is an interdisciplinary approach by which biomedical questions on biochemical networks are addressed by integrating experiments with mathematical modelling and simulation. We here introduce the foundations of the systems biology approach, the basic theoretical and computational tools used to perform model-based analyses of miRNA-regulated networks and review the scientific literature in systems biology of miRNA regulation, with a focus on cancer.

Barbosa-Morais et al. (2012) The evolutionary landscape of alternative splicing in vertebrate species. Science 338:1587-93. (pmid: 23258890)

How species with similar repertoires of protein-coding genes differ so markedly at the phenotypic level is poorly understood. By comparing organ transcriptomes from vertebrate species spanning ~350 million years of evolution, we observed significant differences in alternative splicing complexity between vertebrate lineages, with the highest complexity in primates. Within 6 million years, the splicing profiles of physiologically equivalent organs diverged such that they are more strongly related to the identity of a species than they are to organ type. Most vertebrate species-specific splicing patterns are cis-directed. However, a subset of pronounced splicing changes are predicted to remodel protein interactions involving trans-acting regulators. These events likely further contributed to the diversification of splicing and other transcriptomic changes that underlie phenotypic differences among vertebrate species.

Han et al. (2011) SnapShot: High-throughput sequencing applications. Cell 146:1044, 1044.e1-2. (pmid: 21925324)


Zheng & Tao (2011) Stochastic analysis of gene expression. Methods Mol Biol 734:123-51. (pmid: 21468988)

In this chapter, stochasticity in gene expression is investigated using Ω-expansion technique. Two theoretical models are considered here, one concern the stochastic fluctuations in a single-gene network with negative feedback regulation, and the other the additivity of noise propagation in a protein cascade. All of these theoretical analyses may provide a basic framework for understanding stochastic gene expression.

Parkinson et al. (2011) ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 39:D1002-4. (pmid: 21071405)

The ArrayExpress Archive (http://www.ebi.ac.uk/arrayexpress) is one of the three international public repositories of functional genomics data supporting publications. It includes data generated by sequencing or array-based technologies. Data are submitted by users and imported directly from the NCBI Gene Expression Omnibus. The ArrayExpress Archive is closely integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Advanced queries provided via ontology enabled interfaces include queries based on technology and sample attributes such as disease, cell types and anatomy.

Xie & Ahn (2010) Statistical methods for integrating multiple types of high-throughput data. Methods Mol Biol 620:511-29. (pmid: 20652519)

Large-scale sequencing, copy number, mRNA, and protein data have given great promise to the biomedical research, while posing great challenges to data management and data analysis. Integrating different types of high-throughput data from diverse sources can increase the statistical power of data analysis and provide deeper biological understanding. This chapter uses two biomedical research examples to illustrate why there is an urgent need to develop reliable and robust methods for integrating the heterogeneous data. We then introduce and review some recently developed statistical methods for integrative analysis for both statistical inference and classification purposes. Finally, we present some useful public access databases and program code to facilitate the integrative analysis in practice.

Reimers (2010) Making informed choices about microarray data analysis. PLoS Comput Biol 6:e1000786. (pmid: 20523743)


Hubble et al. (2009) Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 37:D898-901. (pmid: 18953035)

Hundreds of researchers across the world use the Stanford Microarray Database (SMD; http://smd.stanford.edu/) to store, annotate, view, analyze and share microarray data. In addition to providing registered users at Stanford access to their own data, SMD also provides access to public data, and tools with which to analyze those data, to any public user anywhere in the world. Previously, the addition of new microarray data analysis tools to SMD has been limited by available engineering resources, and in addition, the existing suite of tools did not provide a simple way to design, execute and share analysis pipelines, or to document such pipelines for the purposes of publication. To address this, we have incorporated the GenePattern software package directly into SMD, providing access to many new analysis tools, as well as a plug-in architecture that allows users to directly integrate and share additional tools through SMD. In this article, we describe our implementation of the GenePattern microarray analysis software package into the SMD code base. This extension is available with the SMD source code that is fully and freely available to others under an Open Source license, enabling other groups to create a local installation of SMD with an enriched data analysis capability.

Chuang et al. (2007) Network-based classification of breast cancer metastasis. Mol Syst Biol 3:140. (pmid: 17940530)

Mapping the pathways that give rise to metastasis is one of the key challenges of breast cancer research. Recently, several large-scale studies have shed light on this problem through analysis of gene expression profiles to identify markers correlated with metastasis. Here, we apply a protein-network-based approach that identifies markers not as individual genes but as subnetworks extracted from protein interaction databases. The resulting subnetworks provide novel hypotheses for pathways involved in tumor progression. Although genes with known breast cancer mutations are typically not detected through analysis of differential expression, they play a central role in the protein network by interconnecting many differentially expressed genes. We find that the subnetwork markers are more reproducible than individual marker genes selected without network information, and that they achieve higher accuracy in the classification of metastatic versus non-metastatic tumors.

Carninci (2007) Constructing the landscape of the mammalian transcriptome. J Exp Biol 210:1497-506. (pmid: 17449815)

The principal route to understanding the biological significance of the genome sequence comes from discovery and characterization of that portion of the genome that is transcribed into RNA products. We now know that this 'transcriptome' is unexpectedly complex and its precise definition in any one species requires multiple technical approaches and an ability to work on a very large scale. A key step is the development of technologies able to capture snapshots of the complexity of the various kinds of RNA generated by the genome. As the human, mouse and other model genome sequencing projects approach completion, considerable effort has been focused on identifying and annotating the protein-coding genes as the principal output of the genome. In pursuing this aim, several key technologies have been developed to generate large numbers and highly diverse sets of full-length cDNAs and their variants. However, the search has identified another hidden transcriptional universe comprising a wide variety of non-protein coding RNA transcripts. Despite initial scepticism, various experiments and complementary technologies have demonstrated that these RNAs are dynamically transcribed and a subset of them can act as sense-antisense RNAs, which influence the transcriptional output of the genome. Recent experimental evidence suggests that the list of non-protein coding RNAs is still largely incomplete and that transcription is substantially more complex even than currently thought.

Barrett & Edgar (2006) Mining microarray data at NCBI's Gene Expression Omnibus (GEO)*. Methods Mol Biol 338:175-90. (pmid: 16888359)

The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) has emerged as the leading fully public repository for gene expression data. This chapter describes how to use Web-based interfaces, applications, and graphics to effectively explore, visualize, and interpret the hundreds of microarray studies and millions of gene expression patterns stored in GEO. Data can be examined from both experiment-centric and gene-centric perspectives using user-friendly tools that do not require specialized expertise in microarray analysis or time-consuming download of massive data sets. The GEO database is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.