Difference between revisions of "BIO Assignment Week 10"


Revision as of 01:11, 7 December 2015

Assignment for Week 10
Expression Analysis


Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

The transcriptome is the set of a cell's mRNA molecules. The transcriptome originates from the genome (mostly), and it results in the proteome (again, mostly). RNA that is transcribed from the genome is not yet fit for translation but must be processed: splicing is ubiquitous[1] and in addition RNA editing has been encountered in many species. Some authors therefore refer to the exome (the set of transcribed exons) to indicate the actual coding sequence.

Microarray technology (the quantitative, sequence-specific hybridization of labelled nucleotides in chip format) was the first domain of "high-throughput biology". Today, it has largely been replaced by RNA-seq: quantification of transcribed mRNA by high-throughput sequencing and mapping of reads to genes. Quantifying gene expression levels in a tissue-, development-, or response-specific way has yielded detailed insight into cellular function at the molecular level, with recent single-cell sequencing experiments adding a new level of precision. But not all transcripts map to genes: we increasingly realize that the transcriptome is not merely a passive buffer of expressed information on its way to being translated into proteins, but contains multiple levels of complex regulation through hybridization of small non-coding RNAs[2].

In this assignment, we will look at differential expression of Mbp1 and its target genes.


 

GEO2R


In this exercise we will use the analysis facilities of the GEO database at the NCBI.

Task:

First, we will search for relevant data sets on GEO, the NCBI's database for expression data.
  1. Navigate to the entry page for GEO data sets.
  2. Enter the following query in the usual Entrez query format: "cell cycle"[ti] AND "saccharomyces cerevisiae"[organism].
  3. You should get two datasets among the top hits that analyze wild-type yeast (W303a cells) across two cell cycles after release from alpha-factor arrest. Choose the experiment with lower resolution (13 samples).
  4. On the linked GEO DataSet Browser page, follow the link to the Accession Viewer page: the "Reference series".
  5. Read about the experiment and samples, then follow the link to analyze with GEO2R
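
The same Entrez query can also be expressed as a URL for the NCBI E-utilities. The following is only a sketch that builds such a URL without sending it; the helper name geo_search_url and the retmax value are illustrative assumptions, and "gds" is assumed to be the database name for GEO DataSets.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# E-utilities search endpoint; "gds" is assumed to address GEO DataSets.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def geo_search_url(term, retmax=20):
    """Build (but do not send) an esearch URL for a GEO DataSets query.

    geo_search_url is a hypothetical helper written for this sketch.
    """
    params = {"db": "gds", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# The query from step 2, unchanged.
query = '"cell cycle"[ti] AND "saccharomyces cerevisiae"[organism]'
url = geo_search_url(query)
print(url)
```

Pasting the printed URL into a browser should return an XML list of record identifiers; for this assignment, working through the web interface as described above is all that is required.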
Now proceed to apply this to the yeast cell-cycle study:
Figure caption: Value distribution for the yeast cell-cycle experiment GSE3635. Experiments are grouped approximately into equivalent time-points on a cell cycle.
  1. Define groups: the associated publication shows us that one cell cycle takes almost exactly 60 minutes. Create timepoints T0, T1, T2, ... T5. Then associate the 0 and 60 min. samples with "T0"; 10 and 70 minutes get grouped as "T1"; 20 and 80 minutes are "T2", etc. up to T5. The final sample does not get assigned.
  2. Confirm that the Value distributions are unbiased by accessing the value distribution tab. Overall, in such experiments, the bulk of the expression values should not change and thus means and quantiles of the expression levels should be about the same.
  3. Your distribution should look like the image on the right: properly grouped into six categories, and unbiased regarding absolute expression levels and trends.
  4. Look for differentially expressed genes: open the GEO2R tab and click on Top 250.
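
The grouping rule in step 1 amounts to pairing samples that lie one cell cycle (60 minutes) apart. A minimal sketch of that rule follows, assuming the 13 samples were taken every 10 minutes from 0 to 120 minutes; the sampling interval and the function name group_for are assumptions made for illustration.

```python
CYCLE_MIN = 60  # length of one cell cycle, as stated in step 1
STEP_MIN = 10   # assumed sampling interval


def group_for(minutes):
    """Return the group label for a sample, or None for the unassigned final sample."""
    if minutes >= 2 * CYCLE_MIN:
        return None
    return f"T{(minutes % CYCLE_MIN) // STEP_MIN}"


samples = list(range(0, 121, STEP_MIN))  # assumed: 13 samples, 0 ... 120 min

# Collect the sample times that fall into each group.
groups = {}
for t in samples:
    label = group_for(t)
    if label is not None:
        groups.setdefault(label, []).append(t)

for label, times in groups.items():
    print(label, times)
```

Under these assumptions each of the six groups contains two samples that sit at the same position within the cycle, and the last sample remains without a group.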
Analyze the results.
  1. Examine the top hits. Click on a few of the gene names in the Gene.symbol column to view the expression profiles that tell you why the genes were found to be differentially expressed. What do you think? Is this what you would have expected for genes' responses to the cell cycle? What seems to be the algorithm's notion of what "differentially expressed" means?
  2. Look for expected genes. Here are a few genes that are known to be differentially expressed in the cell cycle as target genes of the MBF complex: DSE1, DSE2, ERF3, HTA2, HTB2, and GAS3. But what about the MBF complex proteins themselves: Mbp1 and Swi6?

The notions of "differential expression" and "cell-cycle dependent expression" do not overlap completely. Significant differential expression is mathematically determined for genes that have low variance within groups and large differences between groups. This algorithm has no notion of any expectation you might have about the shape of the expression profile. All it finds are genes for which differential expression between some groups is statistically supported. The algorithm returns the top 250 of those. Consistency within groups is very important, whereas we might intuitively give more weight to conformance with our expectation of a cyclical pattern.
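
To make the "low variance within groups, large differences between groups" idea concrete, here is a small sketch that uses a plain one-way ANOVA F statistic as a stand-in. This is not the statistic GEO2R itself computes, and the expression values are invented for illustration only.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group variance divided by within-group variance."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical log-ratios, two samples per group (T0 ... T5).
# Both cycles agree closely, so the within-group variance is small.
consistent = [[0.1, 0.2], [0.9, 1.0], [1.1, 1.0], [0.5, 0.4], [0.1, 0.0], [0.0, 0.1]]

# A similar cyclic profile, but the second cycle is damped, so the two
# samples in a group disagree and the within-group variance is large.
damped = [[0.1, 0.1], [1.5, 0.4], [1.6, 0.5], [0.7, 0.2], [0.1, 0.0], [0.0, 0.1]]

print(round(f_statistic(consistent), 1))
print(round(f_statistic(damped), 1))
```

Both invented genes are clearly cyclic to the eye, yet the damped one scores far lower because its two cycles disagree within each group. A gene like this can drop out of a top-250 list even though its profile looks as expected.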

Let's see if we can group our time points differently to enhance the contrast between expression levels for cyclically expressed genes. Let's define only two groups: one set before and between the two cycles, one set at the peaks - and we'll omit some of the intermediate values.

  1. Remove all of your groups and define two groups only. Call them "A" and "B".
  2. Assign the samples for T = 0, 10, 60 and 70 min. to the "A" group. Assign the samples for 30, 40, 90, and 100 min. to the "B" group.
  3. Recalculate the Top 250 differentially expressed genes (you might have to refresh the page to get the "Top 250" button back). Which of the "known" MBF targets are now contained in the set? What about Mbp1 and Swi6?
  4. Let's compare the expression profiles for Mbp1, Swi6 and Swi4. It is not obvious that transcription factors are themselves under transcriptional control, as opposed to being expressed at a basal level and activated by phosphorylation or ligand binding. In a new page, navigate to the Geo profiles page and enter (Mbp1 OR Swi6 OR Swi4) AND GSE3635 (GSE3635 is the ID of the GEO data set we have just studied). (You could have got similar results in the Profile graph tab of the GEO2R page.) What do you find? What does this tell you? Would this information allow you to define groups that are even better suited for finding cyclically expressed genes?
  • Finally, review the R script for the GEO2R analysis in the R script tab. This code will run on your machine and make the expression analysis available. Once the datasets are loaded and prepared, you could, for example, do a "real" time series analysis, or search for genes that are co-regulated with your genes of interest.


 

Co-Expression

Task:
COXPRESdb is a well-curated database of pre-calculated co-expression profiles for model organisms. Expression values across a large number of published experiments on the same platform are compared via their coefficient of correlation. Highly correlated genes are either co-regulated, or one gene influences the expression level of the other.

  • Navigate to COXPRESdb.
  • Enter Mbp1 as a "gene alias" in the search field.
  • Click on the link to the coexpressed gene list. Do any of the "known" target genes appear here? How do you interpret this result?

Unfortunately, the support for yeast genes is very limited. COXPRESdb is, however, an excellent resource for studying genes of higher eukaryotes, especially human genes. You might want to consider it for its additional capabilities for your "systems" term project. Refer to the YouTube tutorials for details.
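
The coefficient of correlation that underlies such co-expression lists can be sketched in a few lines. The example below computes a Pearson correlation between invented expression profiles. The gene names and values are hypothetical, and the database's actual procedure may differ in its details.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical expression values across six experiments.
gene_a = [0.1, 0.9, 1.1, 0.5, 0.1, 0.0]
gene_b = [0.3, 1.9, 2.1, 1.2, 0.2, 0.1]  # rises and falls together with gene_a
gene_c = [1.0, 0.2, 0.1, 0.4, 0.9, 1.1]  # high where gene_a is low

print(round(pearson(gene_a, gene_b), 2))
print(round(pearson(gene_a, gene_c), 2))
```

A coefficient near +1 marks profiles that rise and fall together, and a coefficient near -1 marks opposite behaviour. Neither value says which gene, if any, influences the other.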


 

Further reading and resources

Okamura et al. (2015) COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 43:D82-6. (pmid: 25392420)

The COXPRESdb (http://coxpresdb.jp) provides gene coexpression relationships for animal species. Here, we report the updates of the database, mainly focusing on the following two points. For the first point, we added RNAseq-based gene coexpression data for three species (human, mouse and fly), and largely increased the number of microarray experiments to nine species. The increase of the number of expression data with multiple platforms could enhance the reliability of coexpression data. For the second point, we refined the data assessment procedures, for each coexpressed gene list and for the total performance of a platform. The assessment of coexpressed gene list now uses more reasonable P-values derived from platform-specific null distribution. These developments greatly reduced pseudo-predictions for directly associated genes, thus expanding the reliability of coexpression data to design new experiments and to discuss experimental results.

Barrett et al. (2013) NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41:D991-5. (pmid: 23193258)

The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.




 


Footnotes and references

  1. Strictly speaking, splicing is a eukaryotic achievement; however, there are examples of splicing in prokaryotes as well.
  2. (2015) The noncoding explosion. Nat Struct Mol Biol 22:1. (pmid: 25565024)

    Jarvis & Robertson (2011) The noncoding universe. BMC Biol 9:52. (pmid: 21798102)


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.


