Difference between revisions of "Transcriptome"

From "A B C"
 
(11 intermediate revisions by the same user not shown)
Line 17: Line 17:
 
==Introductory reading==
 
==Introductory reading==
 
<section begin=reading />
 
<section begin=reading />
{{#pmid:21627854}}  
+
{{#pmid: 21097893}}
 +
{{#pmid: 23193258}}  
 
<section end=reading />
 
<section end=reading />
  
Line 25: Line 26:
  
 
===Background===
 
===Background===
The transcriptome originates from the genome, mostly, that is, and it results in the proteome, again: mostly. RNA that is {{WP|Transcription|transcribed}} from the genome is not yet fit for translation but must be processed: {{WP|RNA splicing|splicing}} is ubiquitous<ref>Strictly speaking, splicing is an {{WP|Eukaryote|eukaryotic}} achievement, many instances of splicing have been recognized in {{WP|Prokaryote|prokaryotes}} as well.</ref> and in addition {{WP|RNA editing}} has been encountered in many species. Some authors therefore refer to the ''exome''&mdash;the set of transcribed {{WP|exons}}&mdash; to indicate the actual coding sequence.
+
The transcriptome originates from the genome, mostly, that is, and it results in the proteome, again: mostly. RNA that is {{WP|Transcription (genetics)|transcribed}} from the genome is not yet fit for translation but must be processed: {{WP|RNA splicing|splicing}} is ubiquitous<ref>Strictly speaking, splicing is an {{WP|Eukaryote|eukaryotic}} achievement, many instances of splicing have been recognized in {{WP|Prokaryote|prokaryotes}} as well.</ref> and in addition {{WP|RNA editing}} has been encountered in many species. Some authors therefore refer to the ''exome''&mdash;the set of transcribed {{WP|exons}}&mdash; to indicate the actual coding sequence.
  
 
The dark matter of the transcriptome may just be noise{{#pmid: 21798102|Jarvis2011}}.
 
The dark matter of the transcriptome may just be noise{{#pmid: 21798102|Jarvis2011}}.
Line 33: Line 34:
 
* Working with expression data
 
* Working with expression data
 
* Interpretation
 
* Interpretation
 
  
 
&nbsp;
 
&nbsp;
  
 
==Exercises==
 
==Exercises==
 +
 
<section begin=exercises />
 
<section begin=exercises />
{{#pmid:16888359}}   
+
 
 +
In this exercise we will attempt to extract a set of relevant genes for the pluripotency network from deposited expression data.
 +
 
 +
{{task|1=
 +
 
 +
A recent paper has highlighted the lineage-specific roles of SOX2, OCT4 and NANOG in human cells.
 +
{{#pmid: 22482508}}  
 +
 
 +
;First, we will access the relevant data series on GEO, the NCBI's database for expression data.
 +
#Navigate to the PubMed page of the article via the link provided in the reference box above.
 +
#Follow the link to associated GEO records in the right-hand side of the PubMed page (under ''Related Information''). The top hit is a ''Superseries'', composed of a number of ''Subseries'' of experiments.
 +
#Open its link in a new tab.
 +
#Examine the samples that are included in this study by expanding the list of samples. You will notice that the sample titles tell you a bit about the experiment, and the actual ''Subseries'' page describes more. But here, and in general, for a reasonable understanding of the experimental variables, you will need to read the actual paper.
 +
#Not for this first-look exercise, however &ndash; just note: ''shXXX'' samples are knock-downs (''KD'') using a lentiviral ''short-hairpin'' RNA, ''OE'' is ''overexpression'', and ''H1'' and ''H9'' are human embryonic stem-cell lines.
 +
 
 +
We can pursue the question: if any or all of the pluripotency-maintaining transcription factors are knocked down &ndash; presumably a surrogate for a differentiation signal &ndash; what are the downstream targets, and what do they have in common? Conversely, what complementary effects are observed when these factors are overexpressed? The first step therefore is to identify differentially expressed genes. Conveniently, GEO offers the '''GEO2R''' utility to help perform differential expression analysis.
 +
 
 +
* View the [http://www.youtube.com/watch?v=EUPmGWS8ik0 '''GEO2R''' video tutorial] on YouTube.
 +
 
 +
;Now proceed to apply this to the stem-cell transcription factor study:
 +
# On the ''Superseries'' page, click on the [http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE34921 '''Analyze with GEO2R'''] link.
 +
# Click on the '''Treatment''' column header to sort the series by experimental variable.
 +
# Define meaningful groups: you could name them SOX2 KD, SOX2 OE, the same for NANOG and OCT4, and CTRL. (<small>Note that these are just names, you could also have called the groups ''Capitoline'', ''Palatine'', ''Esquiline'', ''Aventine'', ''Caelian'', ''Viminal'', and ''Quirinal'' &ndash; if you remember what the names stand for.</small>)
 +
# Then associate the group names with relevant experiments, as shown in the video. For the control samples, you can combine the H1 "controls" and the H1 "untreated" samples from the BMP4 treatment series.
 +
# Confirm that the value distributions are unbiased: in such experiments the bulk of the expression values should not change, and thus the means and quantiles of the expression levels should be about the same across samples. You should note that the OE samples are systematically different from the others, and that one of the NANOG samples has very low values. Remove that sample from your list and rerun the value distribution to confirm that it is no longer included.
 +
# In the GEO2R tab, click on the '''Top 250''' button to execute the analysis of significantly differentially expressed genes.
 +
# By clicking on  a few of the gene names in the ''Gene.symbol'' column, you can view the expression profiles that tell you ''why'' the genes were found to be differentially expressed. Can you identify a gene that increases in expression in response to all three factors?
 +
 
 +
* Finally, review the '''R''' script for your analysis. Check if there are any aspects of the code that you don't understand. That will give you an idea of the level to which you ought to bring your '''R''' skills. But not right now &ndash; and: no worries, '''R''' code analysis will not be required on Wednesday's quiz.
 +
 
 +
 
 +
}}
 
<section end=exercises />
 
<section end=exercises />
  
  
 
&nbsp;
 
&nbsp;
 +
 
==References==
 
==References==
 
<references />
 
<references />
Line 51: Line 84:
  
 
==Further reading and resources==
 
==Further reading and resources==
{{#pmid:17449815}}
+
{{#pmid: 23846655}}
{{#pmid:17940530}}
+
{{#pmid: 23377968}}
{{#pmid:18953035}}
+
{{#pmid: 23258890}}
{{#pmid:20652519}}
+
{{#pmid: 21925324}}
{{#pmid:21071405}}
+
{{#pmid: 21627854}}  
{{#pmid:21468988}}
+
{{#pmid: 21468988}}
{{#pmid:21925324}}
+
{{#pmid: 21071405}}
 +
{{#pmid: 20652519}}
 +
{{#pmid: 20523743}}
 +
{{#pmid: 18953035}}
 +
{{#pmid: 17940530}}
 +
{{#pmid: 17449815}}
 +
{{#pmid: 16888359}}
  
 
<!-- {{WWW|WWW_UniProt}} -->
 
<!-- {{WWW|WWW_UniProt}} -->

Latest revision as of 03:15, 3 February 2014

Transcriptome


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


The transcriptome is the set of a cell's mRNA molecules. Microarray technology - the quantitative, sequence-specific hybridization of nucleic acids - was the first domain of massively parallel, high-throughput biology. Quantifying gene expression levels in a tissue-, development-, or response-specific manner has yielded detailed insight into cellular function at the molecular level. Yet, while the questions remain, high-throughput sequencing methods are rapidly supplanting microarrays as the source of the data. Moreover, we now realize that the transcriptome is not just a passive buffer of expressed information: an entire, complex, intrinsic level of regulation through hybridization of small nuclear RNAs has been discovered.



 

Introductory reading

Barrett et al. (2011) NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 39:D1005-10. (pmid: 21097893)

Barrett et al. (2013) NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41:D991-5. (pmid: 23193258)


 

Background

The transcriptome originates, for the most part, from the genome, and it gives rise, again for the most part, to the proteome. RNA that is transcribed from the genome is not yet fit for translation but must first be processed: splicing is ubiquitous[1], and in addition RNA editing has been encountered in many species. Some authors therefore refer to the exome, the set of transcribed exons, to indicate the actual coding sequence.

The dark matter of the transcriptome may just be noise[2].


  • Microarray standards and databases
  • Working with expression data
  • Interpretation

 

Exercises


In this exercise we will attempt to extract a set of genes relevant to the pluripotency network from deposited expression data.

Task:
A recent paper has highlighted the lineage-specific roles of SOX2, OCT4 and NANOG in human cells.

Wang et al. (2012) Distinct lineage specification roles for NANOG, OCT4, and SOX2 in human embryonic stem cells. Cell Stem Cell 10:440-54. (pmid: 22482508)

First, we will access the relevant data series on GEO, the NCBI's database for expression data.
  1. Navigate to the PubMed page of the article via the link provided in the reference box above.
  2. Follow the link to associated GEO records in the right-hand side of the PubMed page (under Related Information). The top hit is a Superseries, composed of a number of Subseries of experiments.
  3. Open its link in a new tab.
  4. Examine the samples that are included in this study by expanding the list of samples. You will notice that the sample titles tell you a bit about the experiment, and the actual Subseries page describes more. But here, and in general, for a reasonable understanding of the experimental variables, you will need to read the actual paper.
  5. Not for this first-look exercise, however – just note: shXXX samples are knock-downs (KD) using a lentiviral short-hairpin RNA, OE is overexpression, and H1 and H9 are human embryonic stem-cell lines.

We can pursue the question: if any or all of the pluripotency-maintaining transcription factors are knocked down – presumably a surrogate for a differentiation signal – what are the downstream targets, and what do they have in common? Conversely, what complementary effects are observed when these factors are overexpressed? The first step therefore is to identify differentially expressed genes. Conveniently, GEO offers the GEO2R utility to help perform differential expression analysis.
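
To make the idea of "differentially expressed" concrete, here is a minimal Python sketch. It is not the GEO2R procedure itself: GEO2R runs an R script built on the limma package, whereas this stand-in simply computes a log2 fold change and Welch's t-statistic per gene and ranks the genes by the absolute value of t. The gene names, sample names and expression values are invented for illustration, and welch_t and rank_genes are hypothetical helper functions.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of log2 expression values."""
    var_term = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(var_term)


def rank_genes(expr, treated, control):
    """Return (gene, log2 fold change, t) tuples sorted by decreasing |t|."""
    rows = []
    for gene, values in expr.items():
        a = [values[s] for s in treated]
        b = [values[s] for s in control]
        # Values are already on a log2 scale, so the difference of the
        # group means is the log2 fold change (treated vs. control).
        rows.append((gene, mean(a) - mean(b), welch_t(a, b)))
    return sorted(rows, key=lambda r: abs(r[2]), reverse=True)


# Invented log2 expression values; gene and sample names are placeholders.
expr = {
    "GENE_A": {"kd1": 5.1, "kd2": 4.9, "kd3": 5.0, "c1": 8.0, "c2": 8.2, "c3": 7.9},
    "GENE_B": {"kd1": 7.0, "kd2": 7.3, "kd3": 6.8, "c1": 7.1, "c2": 6.9, "c3": 7.2},
    "GENE_C": {"kd1": 9.6, "kd2": 9.4, "kd3": 9.5, "c1": 6.1, "c2": 6.4, "c3": 5.9},
}

ranked = rank_genes(expr, ["kd1", "kd2", "kd3"], ["c1", "c2", "c3"])
for gene, lfc, t in ranked:
    print(f"{gene}\tlog2FC={lfc:+.2f}\tt={t:+.1f}")
```

With these toy values, GENE_A (strongly down in the knock-down) and GENE_C (strongly up) rank ahead of the unchanged GENE_B. A real analysis would also need a correction for multiple testing, which this sketch omits.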

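  • View the GEO2R video tutorial on YouTube (http://www.youtube.com/watch?v=EUPmGWS8ik0).

Now proceed to apply this to the stem-cell transcription factor study:
  1. On the Superseries page, click on the Analyze with GEO2R link (http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE34921).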
  2. Click on the Treatment column header to sort the samples by experimental variable.
  3. Define meaningful groups: you could name them SOX2 KD, SOX2 OE, the same for NANOG and OCT4, and CTRL. (Note that these are just names; you could also have called the groups Capitoline, Palatine, Esquiline, Aventine, Caelian, Viminal, and Quirinal – if you remember what the names stand for.)
  4. Then associate the group names with relevant experiments, as shown in the video. For the control samples, you can combine the H1 "controls" and the H1 "untreated" samples from the BMP4 treatment series.
  5. Confirm that the value distributions are unbiased: in such experiments the bulk of the expression values should not change, and thus the means and quantiles of the expression levels should be about the same across samples. You should note that the OE samples are systematically different from the others, and that one of the NANOG samples has very low values. Remove that sample from your list and rerun the value distribution to confirm that it is no longer included.
  6. In the GEO2R tab, click on the Top 250 button to execute the analysis of significantly differentially expressed genes.
  7. By clicking on a few of the gene names in the Gene.symbol column, you can view the expression profiles that tell you why the genes were found to be differentially expressed. Can you identify a gene that increases in expression in response to all three factors?
  • Finally, review the R script for your analysis. Check if there are any aspects of the code that you don't understand. That will give you an idea of the level to which you ought to bring your R skills. But not right now – and: no worries, R code analysis will not be required on Wednesday's quiz.
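
The distribution check in step 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not what GEO2R does internally: the sample names and values are invented, sample_summary and flag_outlier_samples are hypothetical helpers, and the tolerance of one log2 unit is an arbitrary choice.

```python
from statistics import median, quantiles


def sample_summary(values):
    """Return (first quartile, median, third quartile) of one sample."""
    q1, q2, q3 = quantiles(values, n=4)
    return q1, q2, q3


def flag_outlier_samples(samples, tolerance=1.0):
    """Names of samples whose median lies more than `tolerance` away
    from the median of all sample medians."""
    medians = {name: median(vals) for name, vals in samples.items()}
    centre = median(medians.values())
    return sorted(n for n, m in medians.items() if abs(m - centre) > tolerance)


# Invented log2 expression values for four placeholder samples.
samples = {
    "ctrl_1": [6.8, 7.1, 7.9, 8.4, 9.0, 10.2],
    "ctrl_2": [6.9, 7.0, 8.1, 8.3, 9.1, 10.0],
    "kd_1": [6.7, 7.2, 8.0, 8.5, 8.9, 10.1],
    "kd_2": [2.1, 2.4, 3.0, 3.3, 3.9, 4.8],  # systematically low values
}

for name, vals in samples.items():
    q1, q2, q3 = sample_summary(vals)
    print(f"{name}\tQ1={q1:.2f}\tmedian={q2:.2f}\tQ3={q3:.2f}")

flagged = flag_outlier_samples(samples)
print("Samples to inspect:", flagged)
```

Here only kd_2 is flagged, because its median lies far below those of the other samples. Whether such a sample should be removed remains a judgment call that depends on the experiment.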


 

References

  1. Strictly speaking, splicing is a eukaryotic achievement, but many instances of splicing have been recognized in prokaryotes as well.
  2. Jarvis & Robertson (2011) The noncoding universe. BMC Biol 9:52. (pmid: 21798102)


 

Further reading and resources

Ray et al. (2013) A compendium of RNA-binding motifs for decoding gene regulation. Nature 499:172-7. (pmid: 23846655)

Vera et al. (2013) MicroRNA-regulated networks: the perfect storm for classical molecular biology, the ideal scenario for systems biology. Adv Exp Med Biol 774:55-76. (pmid: 23377968)

Barbosa-Morais et al. (2012) The evolutionary landscape of alternative splicing in vertebrate species. Science 338:1587-93. (pmid: 23258890)

Han et al. (2011) SnapShot: High-throughput sequencing applications. Cell 146:1044, 1044.e1-2. (pmid: 21925324)

Malone & Oliver (2011) Microarrays, deep sequencing and the true measure of the transcriptome. BMC Biol 9:34. (pmid: 21627854)

Zheng & Tao (2011) Stochastic analysis of gene expression. Methods Mol Biol 734:123-51. (pmid: 21468988)

Parkinson et al. (2011) ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 39:D1002-4. (pmid: 21071405)

Xie & Ahn (2010) Statistical methods for integrating multiple types of high-throughput data. Methods Mol Biol 620:511-29. (pmid: 20652519)

Reimers (2010) Making informed choices about microarray data analysis. PLoS Comput Biol 6:e1000786. (pmid: 20523743)

Hubble et al. (2009) Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 37:D898-901. (pmid: 18953035)

Chuang et al. (2007) Network-based classification of breast cancer metastasis. Mol Syst Biol 3:140. (pmid: 17940530)

Carninci (2007) Constructing the landscape of the mammalian transcriptome. J Exp Biol 210:1497-506. (pmid: 17449815)

Barrett & Edgar (2006) Mining microarray data at NCBI's Gene Expression Omnibus (GEO)*. Methods Mol Biol 338:175-90. (pmid: 16888359)
