BIO Assignment Week 10

Assignment for Week 10
Expression Analysis

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.

Introduction

The transcriptome is the set of a cell's mRNA molecules. Microarray technology, the quantitative, sequence-specific hybridization of nucleotides, was the first domain of massively parallel, high-throughput biology. Quantifying gene expression levels in a tissue-, development-, or response-specific manner has yielded detailed insight into cellular function at the molecular level. Yet, while the questions remain, high-throughput sequencing methods are rapidly supplanting microarrays as the source of the data. Moreover, we now realize that the transcriptome is not just a passive buffer of expressed information: an entire, complex, intrinsic level of regulation through hybridization of small regulatory RNAs, such as microRNAs, has been discovered.

Barrett et al. (2011) NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 39:D1005-10. (pmid: 21097893)

[ PubMed ] [ DOI ] A decade ago, the Gene Expression Omnibus (GEO) database was established at the National Center for Biotechnology Information (NCBI). The original objective of GEO was to serve as a public repository for high-throughput gene expression data generated mostly by microarray technology. However, the research community quickly applied microarrays to non-gene-expression studies, including examination of genome copy number variation and genome-wide profiling of DNA-binding proteins. Because the GEO database was designed with a flexible structure, it was possible to quickly adapt the repository to store these data types. More recently, as the microarray community switches to next-generation sequencing technologies, GEO has again adapted to host these data sets. Today, GEO stores over 20,000 microarray- and sequence-based functional genomics studies, and continues to handle the majority of direct high-throughput data submissions from the research community. Multiple mechanisms are provided to help users effectively search, browse, download and visualize the data at the level of individual genes or entire studies. This paper describes recent database enhancements, including new search and data representation tools, as well as a brief review of how the community uses GEO data. GEO is freely accessible at http://www.ncbi.nlm.nih.gov/geo/.

Barrett et al. (2013) NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41:D991-5. (pmid: 23193258)

[ PubMed ] [ DOI ] The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.

The transcriptome originates from the genome (mostly), and it results in the proteome (again, mostly). RNA that is transcribed from the genome is not yet fit for translation but must first be processed: splicing is ubiquitous^[1], and in addition RNA editing has been encountered in many species. Some authors therefore refer to the exome (the set of transcribed exons) to indicate the actual coding sequence.

The dark matter of the transcriptome may just be noise^[2].

Microarray standards and databases
Working with expression data
Interpretation

http://coxpresdb.jp/cgi-bin/coex_list.cgi?gene=851503&sp=Sce

http://www.geneticsofgeneexpression.org/network/index.php?gene=CDK4

Exercises

In this exercise we will attempt to extract a set of relevant genes for the pluripotency network from deposited expression data.

Task:
A recent paper has highlighted the lineage-specific roles of SOX2, OCT4 and NANOG in human cells.

Wang et al. (2012) Distinct lineage specification roles for NANOG, OCT4, and SOX2 in human embryonic stem cells. Cell Stem Cell 10:440-54. (pmid: 22482508)

[ PubMed ] [ DOI ] Nanog, Oct4, and Sox2 are the core regulators of mouse (m)ESC pluripotency. Although their basic importance in human (h)ESCs has been demonstrated, the mechanistic functions are not well defined. Here, we identify general and cell-line-specific requirements for NANOG, OCT4, and SOX2 in hESCs. We show that OCT4 regulates, and interacts with, the BMP4 pathway to specify four developmental fates. High levels of OCT4 enable self-renewal in the absence of BMP4 but specify mesendoderm in the presence of BMP4. Low levels of OCT4 induce embryonic ectoderm differentiation in the absence of BMP4 but specify extraembryonic lineages in the presence of BMP4. NANOG represses embryonic ectoderm differentiation but has little effect on other lineages, whereas SOX2 and SOX3 are redundant and repress mesendoderm differentiation. Thus, instead of being panrepressors of differentiation, each factor controls specific cell fates. Our study revises the view of how self-renewal is orchestrated in hESCs.

First, we will access the relevant data series on GEO, the NCBI's database for expression data.

Navigate to the PubMed page of the article via the link provided in the reference box above.
Follow the link to associated GEO records on the right-hand side of the PubMed page (under Related Information). The top hit is a Superseries, composed of a number of Subseries of experiments.
Open its link in a new tab.
Examine the samples that are included in this study by expanding the list of samples. The sample titles tell you a bit about the experiment, and the Subseries pages describe more. Here, as in general, you will need to read the actual paper for a reasonable understanding of the experimental variables.
That is not needed for this first-look exercise, however. Just note: shXXX samples are knock-downs (KD) using a lentiviral short-hairpin RNA, OE is overexpression, and H1 and H9 are human embryonic stem cell lines.
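
The link from the PubMed record to its GEO records can also be looked up programmatically. The following is a minimal sketch in Python, assuming NCBI's E-utilities ELink service queried with dbfrom=pubmed and db=gds (the Entrez name of the GEO DataSets database). The helper name geo_link_url is made up for illustration, and the script only builds the query URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ELink service.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def geo_link_url(pmid: str) -> str:
    """Build an ELink query for GEO DataSets records linked to a PubMed ID.

    geo_link_url is a hypothetical helper, not part of any library.
    """
    params = {"dbfrom": "pubmed", "db": "gds", "id": pmid}
    return f"{EUTILS_ELINK}?{urlencode(params)}"


# PubMed ID of Wang et al. (2012), as given in the reference box.
print(geo_link_url("22482508"))
```

Opening the printed URL in a browser should return an XML document listing the linked GEO record identifiers; for this exercise the web interface described above is all you need.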

We can pursue the question: if any or all of the pluripotency-maintaining transcription factors are knocked down (presumably a surrogate for a differentiation signal), what are the downstream targets and what do they have in common? Conversely, what complementary effects are observed when these factors are overexpressed? The first step therefore is to identify differentially expressed genes. Conveniently, GEO offers the GEO2R utility to help perform differential expression analysis.
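
To make "differentially expressed" concrete, here is a minimal sketch in Python of the underlying idea: compare each gene's expression between two groups of samples and rank the genes by the strength of the difference. This is not what GEO2R runs; GEO2R builds on the limma R package, whereas this sketch swaps in a plain Welch t-statistic. The gene names, sample names and values are made up, and the values are assumed to be log2-transformed already.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups with possibly unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def rank_genes(expr, group_a, group_b):
    """Return (gene, log2 fold change, t) tuples sorted by decreasing |t|.

    expr maps each gene to a dict of sample name -> log2 expression value.
    """
    rows = []
    for gene, values in expr.items():
        a = [values[s] for s in group_a]
        b = [values[s] for s in group_b]
        # On the log2 scale, a difference of means is the log2 fold change.
        rows.append((gene, mean(a) - mean(b), welch_t(a, b)))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


# Toy data: three knock-down (kd) and three control (c) samples.
expr = {
    "GENE_A": {"kd1": 9.0, "kd2": 9.2, "kd3": 9.1, "c1": 7.0, "c2": 7.1, "c3": 6.9},
    "GENE_B": {"kd1": 5.0, "kd2": 5.3, "kd3": 4.9, "c1": 5.1, "c2": 5.0, "c3": 5.2},
    "GENE_C": {"kd1": 6.0, "kd2": 6.4, "kd3": 6.2, "c1": 8.0, "c2": 8.3, "c3": 8.1},
}
ranked = rank_genes(expr, ["kd1", "kd2", "kd3"], ["c1", "c2", "c3"])
for gene, log_fc, t in ranked:
    print(f"{gene}  log2FC={log_fc:+.2f}  t={t:+.1f}")
```

In this toy set, GENE_A rises and GENE_C falls in the knock-down samples, while GENE_B barely changes and ends up last. A real analysis would also convert the statistics to p-values and correct them for multiple testing.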

View the GEO2R video tutorial on YouTube.

Now proceed to apply this to the stem-cell transcription factor study:

On the Superseries page, click on the Analyze with GEO2R link.
Click on the Treatment column header to sort the series by experimental variable.
Define meaningful groups: you could name them SOX2 KD, SOX2 OE, the same for NANOG and OCT4, and CTRL. (Note that these are just names, you could also have called the groups Capitoline, Palatine, Esquiline, Aventine, Caelian, Viminal, and Quirinal – if you remember what the names stand for.)
Then associate the group names with relevant experiments, as shown in the video. For the control samples, you can combine the H1 "controls" and the H1 "untreated" samples from the BMP4 treatment series.
Confirm that the value distributions are unbiased: overall, in such experiments, the bulk of the expression values should not change, and thus the means and quantiles of the expression levels should be about the same across samples. You should note that the OE samples are systematically different from the others, and that one of the NANOG samples has very low values. Remove that sample from your list and rerun the distribution analysis to confirm that it is no longer included.
In the GEO2R tab, click on the Top 250 button to execute the analysis of significantly differentially expressed genes.
By clicking on a few of the gene names in the Gene.symbol column, you can view the expression profiles that tell you why the genes were found to be differentially expressed. Can you identify a gene that increases in expression in response to all three factors?
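
The check of the value distributions described in the steps above can be sketched in a few lines of Python. This is an illustration of the reasoning only, not GEO2R's own procedure: it compares each sample's median with the median of all sample medians and flags samples that deviate by more than a tolerance. The function name flag_outlier_samples, the tolerance of 1.0 log2 units, the sample names and the values are all assumptions made for this example.

```python
from statistics import median


def flag_outlier_samples(samples, tolerance=1.0):
    """Return the names of samples whose median is far from the typical median.

    samples maps each sample name to a list of log2 expression values.
    tolerance is an arbitrary cutoff in log2 units, chosen for illustration.
    """
    medians = {name: median(values) for name, values in samples.items()}
    center = median(medians.values())
    return sorted(name for name, m in medians.items() if abs(m - center) > tolerance)


# Toy data: three well-behaved samples and one with very low values.
samples = {
    "ctrl_1": [6.8, 7.0, 7.2, 7.1, 6.9],
    "ctrl_2": [6.9, 7.1, 7.0, 7.3, 6.8],
    "kd_1": [7.0, 7.2, 6.9, 7.1, 7.0],
    "kd_2": [2.1, 2.0, 2.3, 1.9, 2.2],
}
flagged = flag_outlier_samples(samples)
print(flagged)

# Remove the flagged samples and repeat the check, as in the exercise.
cleaned = {name: v for name, v in samples.items() if name not in flagged}
print(flag_outlier_samples(cleaned))
```

The boxplot that GEO2R draws shows the same information graphically, along with the quartiles of each sample.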

Finally, review the R script for your analysis. Check if there are any aspects of the code that you don't understand. That will give you an idea of the level to which you ought to bring your R skills. But not right now – and: no worries, R code analysis will not be required on Wednesday's quiz.

References

[1] Strictly speaking, splicing is a eukaryotic achievement, but many instances of splicing have been recognized in prokaryotes as well.
[2] Jarvis & Robertson (2011) The noncoding universe. BMC Biol 9:52. (pmid: 21798102)

[ PubMed ] [ DOI ]

Links and resources

Further reading

Ray et al. (2013) A compendium of RNA-binding motifs for decoding gene regulation. Nature 499:172-7. (pmid: 23846655)

[ PubMed ] [ DOI ] RNA-binding proteins are key regulators of gene expression, yet only a small fraction have been functionally characterized. Here we report a systematic analysis of the RNA motifs recognized by RNA-binding proteins, encompassing 205 distinct genes from 24 diverse eukaryotes. The sequence specificities of RNA-binding proteins display deep evolutionary conservation, and the recognition preferences for a large fraction of metazoan RNA-binding proteins can thus be inferred from their RNA-binding domain sequence. The motifs that we identify in vitro correlate well with in vivo RNA-binding data. Moreover, we can associate them with distinct functional roles in diverse types of post-transcriptional regulation, enabling new insights into the functions of RNA-binding proteins both in normal physiology and in human disease. These data provide an unprecedented overview of RNA-binding proteins and their targets, and constitute an invaluable resource for determining post-transcriptional regulatory mechanisms in eukaryotes.

Vera et al. (2013) MicroRNA-regulated networks: the perfect storm for classical molecular biology, the ideal scenario for systems biology. Adv Exp Med Biol 774:55-76. (pmid: 23377968)

[ PubMed ] [ DOI ] MicroRNAs (miRNAs) are involved in many regulatory pathways some of which are complex networks enriched in regulatory motifs like positive or negative feedback loops or coherent and incoherent feedforward loops. Their complexity makes the understanding of their regulation difficult and the interpretation of experimental data cumbersome. In this book chapter we claim that systems biology is the appropriate approach to investigate the regulation of these miRNA-regulated networks. Systems biology is an interdisciplinary approach by which biomedical questions on biochemical networks are addressed by integrating experiments with mathematical modelling and simulation. We here introduce the foundations of the systems biology approach, the basic theoretical and computational tools used to perform model-based analyses of miRNA-regulated networks and review the scientific literature in systems biology of miRNA regulation, with a focus on cancer.

Barbosa-Morais et al. (2012) The evolutionary landscape of alternative splicing in vertebrate species. Science 338:1587-93. (pmid: 23258890)

[ PubMed ] [ DOI ] How species with similar repertoires of protein-coding genes differ so markedly at the phenotypic level is poorly understood. By comparing organ transcriptomes from vertebrate species spanning ~350 million years of evolution, we observed significant differences in alternative splicing complexity between vertebrate lineages, with the highest complexity in primates. Within 6 million years, the splicing profiles of physiologically equivalent organs diverged such that they are more strongly related to the identity of a species than they are to organ type. Most vertebrate species-specific splicing patterns are cis-directed. However, a subset of pronounced splicing changes are predicted to remodel protein interactions involving trans-acting regulators. These events likely further contributed to the diversification of splicing and other transcriptomic changes that underlie phenotypic differences among vertebrate species.

Han et al. (2011) SnapShot: High-throughput sequencing applications. Cell 146:1044, 1044.e1-2. (pmid: 21925324)

[ PubMed ] [ DOI ]

Malone & Oliver (2011) Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol 9:34. (pmid: 21627854)

[ PubMed ] [ DOI ] Microarrays first made the analysis of the transcriptome possible, and have produced much important information. Today, however, researchers are increasingly turning to direct high-throughput sequencing -- RNA-Seq -- which has considerable advantages for examining transcriptome fine structure -- for example in the detection of allele-specific expression and splice junctions. In this article, we discuss the relative merits of the two techniques, the inherent biases in each, and whether all of the vast body of array work needs to be revisited using the newer technology. We conclude that microarrays remain useful and accurate tools for measuring expression levels, and RNA-Seq complements and extends microarray measurements.

Zheng & Tao (2011) Stochastic analysis of gene expression. Methods Mol Biol 734:123-51. (pmid: 21468988)

[ PubMed ] [ DOI ] In this chapter, stochasticity in gene expression is investigated using Ω-expansion technique. Two theoretical models are considered here, one concern the stochastic fluctuations in a single-gene network with negative feedback regulation, and the other the additivity of noise propagation in a protein cascade. All of these theoretical analyses may provide a basic framework for understanding stochastic gene expression.

Parkinson et al. (2011) ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 39:D1002-4. (pmid: 21071405)

[ PubMed ] [ DOI ] The ArrayExpress Archive (http://www.ebi.ac.uk/arrayexpress) is one of the three international public repositories of functional genomics data supporting publications. It includes data generated by sequencing or array-based technologies. Data are submitted by users and imported directly from the NCBI Gene Expression Omnibus. The ArrayExpress Archive is closely integrated with the Gene Expression Atlas and the sequence databases at the European Bioinformatics Institute. Advanced queries provided via ontology enabled interfaces include queries based on technology and sample attributes such as disease, cell types and anatomy.

Xie & Ahn (2010) Statistical methods for integrating multiple types of high-throughput data. Methods Mol Biol 620:511-29. (pmid: 20652519)

[ PubMed ] [ DOI ] Large-scale sequencing, copy number, mRNA, and protein data have given great promise to the biomedical research, while posing great challenges to data management and data analysis. Integrating different types of high-throughput data from diverse sources can increase the statistical power of data analysis and provide deeper biological understanding. This chapter uses two biomedical research examples to illustrate why there is an urgent need to develop reliable and robust methods for integrating the heterogeneous data. We then introduce and review some recently developed statistical methods for integrative analysis for both statistical inference and classification purposes. Finally, we present some useful public access databases and program code to facilitate the integrative analysis in practice.

Reimers (2010) Making informed choices about microarray data analysis. PLoS Comput Biol 6:e1000786. (pmid: 20523743)

[ PubMed ] [ DOI ]

Hubble et al. (2009) Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 37:D898-901. (pmid: 18953035)

[ PubMed ] [ DOI ] Hundreds of researchers across the world use the Stanford Microarray Database (SMD; http://smd.stanford.edu/) to store, annotate, view, analyze and share microarray data. In addition to providing registered users at Stanford access to their own data, SMD also provides access to public data, and tools with which to analyze those data, to any public user anywhere in the world. Previously, the addition of new microarray data analysis tools to SMD has been limited by available engineering resources, and in addition, the existing suite of tools did not provide a simple way to design, execute and share analysis pipelines, or to document such pipelines for the purposes of publication. To address this, we have incorporated the GenePattern software package directly into SMD, providing access to many new analysis tools, as well as a plug-in architecture that allows users to directly integrate and share additional tools through SMD. In this article, we describe our implementation of the GenePattern microarray analysis software package into the SMD code base. This extension is available with the SMD source code that is fully and freely available to others under an Open Source license, enabling other groups to create a local installation of SMD with an enriched data analysis capability.

Chuang et al. (2007) Network-based classification of breast cancer metastasis. Mol Syst Biol 3:140. (pmid: 17940530)

[ PubMed ] [ DOI ] Mapping the pathways that give rise to metastasis is one of the key challenges of breast cancer research. Recently, several large-scale studies have shed light on this problem through analysis of gene expression profiles to identify markers correlated with metastasis. Here, we apply a protein-network-based approach that identifies markers not as individual genes but as subnetworks extracted from protein interaction databases. The resulting subnetworks provide novel hypotheses for pathways involved in tumor progression. Although genes with known breast cancer mutations are typically not detected through analysis of differential expression, they play a central role in the protein network by interconnecting many differentially expressed genes. We find that the subnetwork markers are more reproducible than individual marker genes selected without network information, and that they achieve higher accuracy in the classification of metastatic versus non-metastatic tumors.

Carninci (2007) Constructing the landscape of the mammalian transcriptome. J Exp Biol 210:1497-506. (pmid: 17449815)

[ PubMed ] [ DOI ] The principal route to understanding the biological significance of the genome sequence comes from discovery and characterization of that portion of the genome that is transcribed into RNA products. We now know that this ;transcriptome' is unexpectedly complex and its precise definition in any one species requires multiple technical approaches and an ability to work on a very large scale. A key step is the development of technologies able to capture snapshots of the complexity of the various kinds of RNA generated by the genome. As the human, mouse and other model genome sequencing projects approach completion, considerable effort has been focused on identifying and annotating the protein-coding genes as the principal output of the genome. In pursuing this aim, several key technologies have been developed to generate large numbers and highly diverse sets of full-length cDNAs and their variants. However, the search has identified another hidden transcriptional universe comprising a wide variety of non-protein coding RNA transcripts. Despite initial scepticism, various experiments and complementary technologies have demonstrated that these RNAs are dynamically transcribed and a subset of them can act as sense-antisense RNAs, which influence the transcriptional output of the genome. Recent experimental evidence suggests that the list of non-protein coding RNAs is still largely incomplete and that transcription is substantially more complex even than currently thought.

Barrett & Edgar (2006) Mining microarray data at NCBI's Gene Expression Omnibus (GEO)*. Methods Mol Biol 338:175-90. (pmid: 16888359)

[ PubMed ] [ DOI ] The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) has emerged as the leading fully public repository for gene expression data. This chapter describes how to use Web-based interfaces, applications, and graphics to effectively explore, visualize, and interpret the hundreds of microarray studies and millions of gene expression patterns stored in GEO. Data can be examined from both experiment-centric and gene-centric perspectives using user-friendly tools that do not require specialized expertise in microarray analysis or time-consuming download of massive data sets. The GEO database is publicly accessible through the World Wide Web at http://www.ncbi.nlm.nih.gov/geo.

Footnotes and references

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.

Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.
