Difference between revisions of "BIO Assignment Week 10"

From "A B C"
m
m
 
(18 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 10<br />
 
Assignment for Week 10<br />
<span style="font-size: 70%">Genome Browsers</span>
+
<span style="font-size: 70%">Expression Analysis</span>
 
</div>
 
</div>
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_9|&lt;&nbsp;Assignment&nbsp;9]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_11|Assignment&nbsp;11&nbsp;&gt;]]</td>
 +
</tr></table>
  
 
{{Template:Inactive}}
 
{{Template:Inactive}}
Line 14: Line 18:
  
 
&nbsp;
 
&nbsp;
 +
 
==Introduction==
 
==Introduction==
  
Large scale genome sequencing and annotation has made a wealth of information available that is all related to the same biological objects: the DNA. The information however can be of very different types, it includes:
+
The transcriptome is the set of a cell's mRNA molecules. It originates (mostly) from the genome, and it results (again, mostly) in the proteome. RNA that is {{WP|Transcription (genetics)|transcribed}} from the genome is not yet fit for translation but must be processed: {{WP|RNA splicing|splicing}} is ubiquitous<ref>Strictly speaking, splicing is a {{WP|Eukaryote|eukaryotic}} achievement; however, there are examples of splicing in {{WP|Prokaryote|prokaryotes}} as well.</ref> and, in addition, {{WP|RNA editing}} has been encountered in many species. Some authors therefore refer to the ''exome'', the set of transcribed {{WP|exons}}, to indicate the actual coding sequence.
* the actual sequence
 
* sequence variants (SNPs and CNVs)
 
* conservation between related species
 
* genes (with introns and exons)
 
* mRNAs
 
* expression levels
 
* regulatory features such as transcription factor binding sites
 
and much more.
 
  
Since all of this information relates to specific positions or ranges on the chromosome, displaying it alongside the chromosomal coordinates is a useful way to integrate and visualize it. We call such strips of annotation ''tracks'' and display them in '''genome browsers'''. Quite a number of such browsers exist, and most work on the same principle: server hosted databases are queried through a Web interface; the resulting data is displayed graphically in a Web browser window. The large data centres each have their own browsers, but arguably the best engineered, most informative and most widely used one is provided by the University of California Santa Cruz (UCSC) Genome Browser Project.
+
'''Microarray technology''', the quantitative, sequence-specific hybridization of labelled nucleotides in chip-format, was the first domain of "high-throughput biology". Today, it has largely been replaced by {{WP|RNA-Seq|'''RNA-seq'''}}: quantification of transcribed mRNA by high-throughput sequencing and mapping reads to genes. Quantifying gene expression levels in a tissue-, development-, or response-specific way has yielded detailed insight into cellular function at the molecular level, with recent results of single-cell sequencing experiments adding a new level of precision. But not all transcripts are mapped to genes: we increasingly realize that the transcriptome is not merely a passive buffer of expressed information on its way to be translated into proteins, but contains multiple levels of complex regulation through hybridization of small nuclear RNAs<ref>{{#pmid: 25565024}} {{#pmid: 21798102}}</ref>.
 
 
In this assignment you will explore some of the browsers and we will go through an exercise that relates fungal replication genes to human genes. We have previously focused a lot on Mbp1 homologs, but these have no clear equivalences in "higher" eukaryotes. However one of the key target genes of Mbp1 is the cell cycle protein {{WP|Cdc6}}, and CDC6 is universally conserved in eukaryotes and has a {{WP|CDC6|human homolog}}.
 
  
 +
In this assignment, we will look at differential expression of Mbp1 and its target genes.
  
  
 
&nbsp;
 
&nbsp;
==GBrowse==
 
  
[http://gmod.org/wiki/GBrowse '''GBrowse'''] - the Generic genome Browser - is the browser developed by the [http://gmod.org/wiki/Main_Page Generic Model Organism Database] project that aims to make industry-strength bioinformatics tools and software available for the model organism community. One of the many databases that uses GMod tools is [http://www.yeastgenome.org/ the Saccharomyces Genome Database].
+
==GEO2R==
  
{{task|1=
+
<section begin=exercises />
In this task you will access the SGD GBrowse page for Cdc6 and explore some of the options.
 
# Navigate to the [http://www.yeastgenome.org/ the Saccharomyces Genome Database], enter Cdc6 into the site search field and on the result page click on the '''GBrowse''' link at the '''Chromosome location''' heading.
 
# Locate CDC6 (YJL194W) as a red bar in the graph. Note that the triangle at the end points in the direction of transcription.
 
# Note how the shape of the cursor changes over different regions of the window. For example, you can click/hold the graph and slide it left and right (this changes the overview indicator that shows where on the chromosome the currently displayed window of sequence is located). You can click on and follow annotation information. You can also select a stretch of nucleotides and dump it as FASTA (hover over the ruler in the ''Details'' pane).
 
# Zoom in by selecting '''Show 5 kbp''' at the scroll/zoom controls.
 
# Click on the '''Select Tracks''' tab. This gives you access to a fine-grained selection of all tracks that have been created as genome annotations.
 
# Find the section for '''Transcription Factors'''. Click on the star next to '''TF ChIP chip''' to mark this experiment as a "favorite". Then click on '''Show Favorites Only''' at the top of the page. Finally check '''All on''' for the '''Transcription Factors''' track and '''Back to browser'''.
 
}}
 
 
 
 
 
This view shows you the ChIP-chip validated TF-binding sites in the upstream regulatory region of Cdc6. Note that Mbp1 is among them. Curiously, Swi6 is also listed there - but you know that [http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=YLR182W Swi6] does not actually bind DNA directly, but forms a complex with the APSES domain transcription factors Mbp1/Swi4 which form the [http://www.yeastgenome.org/cgi-bin/GO/goTerm.pl?goid=0030907 MBF] complex. However, crosslinking of the complex and immunoprecipitation with anti-Swi6 would certainly identify this region. You should be aware that an annotation of a protein in a ChIP-chip experiment is not the same as demonstrating a protein's physical interaction with DNA.
 
 
 
 
 
 
 
&nbsp;
 
 
 
==NCBI Map Viewer==
 
  
 +
In this exercise we will use the analysis facilities of the GEO database at the NCBI.
  
 
{{task|1=
 
{{task|1=
  
In this task you will locate and display a map view at the NCBI for the yeast Cdc6 gene.
+
;First, we will search for relevant data sets on GEO, the NCBI's database for expression data.
 
+
#Navigate to the entry page for [http://www.ncbi.nlm.nih.gov/gds/ '''GEO data sets'''].
# Navigate to the [http://www.ncbi.nlm.nih.gov/ '''NCBI''' home page] and follow the link to '''Genomes & maps''' in the left-hand menu.
+
#Enter the following query in the usual Entrez query format: <code>"cell cycle"[ti] AND "saccharomyces cerevisiae"[organism]</code>.
# Click on the '''Tools''' tab and find the link to the [http://www.ncbi.nlm.nih.gov/mapview/ '''Map Viewer''']
+
#You should get two datasets among the top hits that analyze wild-type yeast (W303a cells) across two cell-cycles after release from alpha-factor arrest. Choose the [http://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS2347 experiment with lower resolution] (13 samples).
# In the '''Fungi''' section, click on the latest "build" of the ''Saccharomyces cerevisiae'' genome. This takes you to an overview page of the status of the Genome project. Each chromosome is linked to its map. If you did not know which chromosome to look for, you would need to search by keyword, or gene name in the nucleotide database. Regarding Cdc6, you remember from the task above that it is located on [http://www.ncbi.nlm.nih.gov/projects/mapview/maps.cgi?taxid=4932&chr=X Chromosome X] (''i.e.'' the {{WP|Roman numerals|roman numeral}} ten, not the "X-Chromosome"). You will arrive at the actual mapview of the entire Chromosome with the RefSeq accession number <code>NC_001142.9</code>. This large nucleotide record containing the entire chromosomal sequence underlies the display.
+
#On the linked GEO DataSet Browser page, follow the link to the [http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE3635 Accession Viewer page: the "Reference series"].
# Enter '''Cdc6''' into the Search field and click the '''Find in This View''' button. Then zoom in a few levels.
+
#Read about the experiment and samples, then follow the link to [http://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE3635 '''analyze with GEO2R''']
}}
 
 
 
 
 
The [http://www.ncbi.nlm.nih.gov/projects/mapview/maps.cgi?TAXID=4932&CHR=X&MAPS=cntg-r,genes%5B36220.54%3A43678.04%5D&QUERY=Cdc6&zoom=10 resulting view] shows you the location and orientation of the gene on the chromosome. A number of links to various NCBI databases are given for each gene. Note that this is primarily a tool for database crossreferencing, not for integrating and displaying annotations.
 
 
 
 
 
  
&nbsp;
+
* View the [http://www.youtube.com/watch?v=EUPmGWS8ik0 '''GEO2R''' video tutorial] on youtube.
  
==Ensembl==
+
;Now proceed to apply this to the yeast cell-cycle study:[[File:GSE3635_ValueDistribution.png|frame|right|Value distribution for the yeast cell-cycle experiment GSE3635. Experiments are grouped approximately into equivalent time-points on a cell cycle.]]
  
The EBI offers its own version of genome browsers through the Ensembl project. A large number of genomes have been annotated, cross-referenced and made available for viewing. The EBI has spent a lot of effort on automated curation of their genome offerings. '''The Ensembl offerings are therefore more comprehensive and complete than those of other sources'''. In particular, you will find a genome view for YFO.
+
# '''Define groups''': the associated publication shows us that one cell-cycle takes almost exactly 60 minutes. Create timepoints T0, T1, T2, ... T5. Then associate the 0 and 60 min. samples with "T0"; 10 and 70 minutes get grouped as "T1"; 20 and 80 minutes are T2, etc. up to T5. The final sample does not get assigned.
 +
# Confirm that the '''Value distributions''' are unbiased by accessing the value distribution tab - overall, in such experiments, the bulk of the expression values should not change and thus means and quantiles of the expression levels should be about the same.
 +
# Your distribution should look like the image on the right: properly grouped into six categories, and unbiased regarding absolute expression levels and trends.
 +
# '''Look for differentially expressed genes''': open the GEO2R tab and click on '''Top 250'''.
  
{{task|1=
+
;Analyze the results.
  
In this task you will review the ensembl view of the YFO ortholog to yeast CDC6.
+
# Examine the top hits. Click on a few of the gene names in the ''Gene.symbol'' column to view the expression profiles that tell you ''why'' the genes were found to be differentially expressed. What do you think? Is this what you would have expected for genes' responses to the cell-cycle? What seems to be the algorithm's notion of what "differentially expressed" means?
 +
# Look for expected genes. Here are a few genes that are known to be differentially expressed in the cell-cycle as target genes of the MBF complex: <code>DSE1</code>, <code>DSE2</code>, <code>ERF3</code>, <code>HTA2</code>, <code>HTB2</code>, and <code>GAS3</code>. But what about the MBF complex proteins themselves: Mbp1 and Swi6?
  
# Navigate to the [http://fungi.ensembl.org/index.html '''EnsemblFungi'''] page (easy to find via Google).
+
The notions of "differential expression" and "cell-cycle dependent expression" do not overlap completely. Significant differential expression is mathematically determined for genes that have low variance within groups and large differences between groups. This algorithm has no notion of any expectation you might have about the shape of the expression profile. All it finds are genes for which differential expression between some groups is statistically supported. The algorithm returns the top 250 of those. Consistency within groups is very important, whereas we might intuitively give more weight to conformance to our expectations of a cyclical pattern.
# Select ''Saccharomyces cerevisiae'' from the species list.
 
# '''Search''' for  Cdc6 as a search term in the ''Search Saccharomyces cerevisiae ...'' field.
 
# Click on [http://fungi.ensembl.org/Saccharomyces_cerevisiae/Gene/Summary?g=YJL194W;r=X:69338-70879;t=YJL194W CDC6 (YJL194W)]
 
  
You will be taken to a browser view of the genome. Tracks can be switched on and off through the menu on the left hand side.
+
Let's see if we can group our time points differently to enhance the contrast between expression levels for cyclically expressed genes. Let's define only two groups: one set before and between the two cycles, one set at the peaks - and we'll omit some of the intermediate values.
 
 
# Find the link to [http://fungi.ensembl.org/Saccharomyces_cerevisiae/Gene/Compara_Ortholog?g=YJL194W;r=X:69338-70879;t=YJL194W '''Orthologues'''] under the '''Fungal Compara''' section in the menu.
 
# In the resulting page, find the YFO orthologue and click on the link in the '''Location''' column.
 
# On the Browser page, click on the cogwheel icon in the bottom left bar of the lower pane to configure tracks.
 
# On the configuration page, click on '''Sequence''' in the left-hand menu and click the (check)-boxes to turn '''Contigs''' off and '''Translated sequence''' on. Click the checkmark in the top-right corner of the configuration window to return to the browser view.
 
# Zoom in until you see the display of the actual nucleotides and the six reading frames.
 
  
 +
# Remove all of your groups and define two groups only. Call them "A" and "B".
 +
# Assign the samples for T = 0, 10, 60 and 70 min. to the "A" group. Assign the samples for T = 30, 40, 90, and 100 min. to the "B" group.
 +
# Recalculate the '''Top 250''' differentially expressed genes (you might have to refresh the page to get the "Top 250" button back.) Which of the "known" MBF targets are now contained in the set? What about Mbp1 and Swi6?
 +
# Finally: Let's compare the expression profiles for Mbp1, Swi6 and Swi4. It is not obvious that transcription factors are themselves under '''transcriptional''' control, as opposed to being expressed at a basal level and ''activated'' by phosphorylation or ligand binding. In a new page, navigate to the [http://www.ncbi.nlm.nih.gov/geoprofiles '''GEO Profiles'''] page and enter <code>(Mbp1 OR Swi6 OR Swi4 OR Nrm1 OR Cln1 OR Clb6 OR Act1 OR Alg9) AND GSE3635</code>. Nrm1, Cln1, and Clb6 are Mbp1 target genes. Act1 and Alg9 are beta-Actin and mannosyltransferase; these are often used as "housekeeping genes", i.e. genes with condition-independent expression levels, especially for qPCR studies, although Alg9 is also an Mbp1 target. We include them here as negative controls. GSE3635 is the ID of the GEO data set we have just studied. You could have got similar results in the '''Profile graph''' tab of the GEO2R page. What do you find? What does this tell you? Would this information allow you to define groups that are even better suited for finding cyclically expressed genes?
 +
# Click on the profile graph for Mbp1 and print out the page. Write your name and student number on the page. With a red pen, '''in one sentence''' describe the evidence you find '''on that page''' that allows us to conclude '''whether or not''' Mbp1 is a cell-cycle gene. You'll probably want to think for a moment what this question really means, how a cell-cycle gene could be defined, and what can be considered "evidence", before you write. I will mark your response for a maximum of four marks.
 +
<!--
 +
* Finally, review the '''R''' script for the GEO2R analysis in the '''R script''' tab. This code will run on your machine and make the expression analysis available. Once the datasets are loaded and prepared, you could - for example - perform a "real" time series analysis, calculate correlation coefficients with an idealized sine wave, or search for genes that are '''co-regulated''' with your genes of interest.
 +
-->
 
}}
 
}}
 +
<section end=exercises />
  
  
This is a very comprehensive offering in terms of sequences. However, Ensembl too offers little in terms of annotations of DNA elements, expression levels and the like. Nevertheless, since it is the only database that has YFO annotated, it would be the tool to go to if you were to compare syntenic regions or genomic context between different species.
+
&nbsp;
  
  
 +
<!--
 +
==Co-Expression==
  
 
&nbsp;
 
 
==The UCSC genome browser==
 
 
The University of California Santa Cruz (UCSC) Genome Browser Project has the largest offering of annotation information. However it is strictly model-organism oriented and you will probably not find YFO among its curated genomes. Nevertheless, if you are studying eg. human genes, or yeast, the UCSC browser should be your first choice.
 
  
 
{{task|1=
 
{{task|1=
  
In this task you will explore the UCSC genome browser view of the yeast Cdc6 gene and its human orthologue. You will explore some of the very large number of tracks that are available for both and compare transcription factor binding regions.
+
[http://coxpresdb.jp/ '''COXPRESdb'''] is a well curated database of pre-calculated co-expression profiles for model organisms. Expression values across a large number of published experiments on the same platform are compared via their coefficient of correlation. Highly correlated genes are either co-regulated, or one gene influences the expression level of the other.
  
# Navigate to the [http://genome.ucsc.edu/ '''UCSC''' Genome Bioinformatics entry page] and follow the link to the '''Genome Browser''' in the left-hand menu.
+
* Navigate to [http://coxpresdb.jp/ '''COXPRESdb'''].
# From the available menus, access the ''S. cerevisiae'' information ('''group &rarr; other''') and enter Cdc6 as the '''search term'''.
+
* Enter <code>Mbp1</code> as a "gene alias" in the search field.
# Click on the link to the [http://genome.ucsc.edu/cgi-bin/hgTracks?position=chrX:69338-70879&hgsid=311433759&sgdGene=pack&hgFind.matches=YJL194W, Cdc6 gene] on chromosome X.
+
* Click on the link to the coexpressed gene list. Do any of the "known" target genes appear here? How do you interpret this result?
# Click on the button to zoom out '''3x''' - we want to see the upstream regulatory region.
 
# In the subsection for '''Expression and Regulation''', find the menu for '''Regulatory Code''' and select '''full'''; select '''hide''' for all other expression tracks. Click '''refresh'''.
 
  
Up to now, this looks very similar to the SGD genome browser.
+
Unfortunately, the support for yeast genes is very limited. COXPRESdb is however an excellent resource to study higher eukaryotic, especially human genes. You might want to consider it for its additional capabilities for your "systems" term project. Refer to the [http://coxpresdb.jp/help/movie/ YouTube tutorials for details].
  
# Open a second window, and access the UCSC Genome browser for the '''human genome'''. Search for CDC6 and click the link to the ''Homo sapiens'' cell division cycle 6 homolog (''S. cerevisiae'') (CDC6) on chromosome 17.
 
# Study the [http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr17:38444146-38459413&hgsid=353080265&knownGene=pack&hgFind.matches=uc002huj.1, Genome Browser view of the CDC6 homolog].
 
## In particular, note the extensive functional annotations of DNA and the alignments of vertebrate syntenic regions that allow detailed genomic comparisons.
 
## Distinguish between exon and intron sequence.
 
## Note that the mammal Conservation track has high values for all of the exons, but not only for exons.
 
# Zoom out '''1.5x''' and click/slide the gene to the right to view the upstream regulatory region.
 
# On the page, note the '''large''' number of available tracks that have been integrated into this view. Most of them are switched off. Find the '''Regulation''' section, and click on '''ENCODE Transcription Factor Binding Tracks''' to access the information page on where exactly this data originates from. Note that you can switch individual experiments on or off on this page, as well as setting the display format for all of the results. Leave all of the experiments checked, set the display to '''show''' and click the '''Submit''' button.
 
# Get a sense of the amount of information that is displayed here and note that all experiments agree on a regulatory region that ranges from about 1.5kb upstream to 0.5 kb downstream of the transcription start.
 
# Go back to the '''ENCODE Transcription Factor Binding Tracks''' page uncheck all of the data sources except for the ENCODE/Stanford/Yale/USC/Harvard Chip-seq experiment (SYDH TFBS), set the format to '''full''', '''Display mode: show''' and click '''submit'''.
 
# The resulting tracks are an excellent view of the kind of information that is provided by ChIP-seq experiments in which bound transcription factors are crosslinked to the DNA, immuno-precipitated with transcription factor specific antibodies, and the co-precipitated DNA sequenced with high-throughput sequencing methods. Note that most sequence tags are found in a unimodal distribution close to the transcription start, but some TFs (e.g. Rad21) apparently have more than one binding site.
 
# Now scroll down to the track sections, '''hide''' the '''ENCODE TF binding data''' and show the '''full''' view of the '''TFBS conserved''' track - a consensus of human/mouse and rat annotated TF binding sites. Click on the small vertical bar in the <code>V$E2F_02</code> row, this will take you to a detailed information page on this transcription factor, with cross-references to the databases.
 
 
}}
 
}}
  
  
Based on this kind of information, it should be straightforward to identify human transcription factors that potentially regulate human Cdc6 and determine - via sequence comparisons - whether any of them are homologous to any of the yeast transcription factors. Through a detailed analysis of existing systems, their regulatory components and the conservation of regulation, one can in principle establish functional equivalences across large evolutionary distances.
+
{{Vspace}}
 +
-->
  
 +
==Further reading and resources==
  
The UCSC browser has a sometimes bewildering amount of information available. But its curators are aware of the need for educating users regarding the utility of their tools.
+
{{#pmid: 25392420}}
 +
{{#pmid: 23193258}}
  
{{task|1=
 
  
In this task you will access some of the tutorial information that UCSC provides.
+
<!--
# Return to the [http://genome.ucsc.edu/ '''UCSC''' Genome Bioinformatics entry page] and follow the link to '''Training''' in the left-hand menu.
+
{{#pmid: 23846655}}
# Follow the link to the [http://www.openhelix.com/ucsc '''OpenHelix UCSC tutorials'''].
+
{{#pmid: 23377968}}
# Download the Hands-on exercise PDF file and work through '''Exercise 2'''
+
{{#pmid: 23258890}}
}}
+
{{#pmid: 21925324}}
 
+
{{#pmid: 21627854}}
This exercise includes a number of interesting options to work with the UCSC data - the BLAT tool for genomic region alignment and the selective display of SNP annotations.
+
{{#pmid: 21468988}}
 
+
{{#pmid: 21097893}}
; Optional
+
{{#pmid: 21071405}}
* Work through exercise one and three of the OpenHelix UCSC introduction.
+
{{#pmid: 20652519}}
* Access the [http://www.openhelix.com/ENCODE2 OpenHelix '''ENCODE''' tutorial], download the '''Hands-on Exercises''' pdf and work through the exercises. Exercise 3 is particularly valuable, as it teaches you how to create results from complex intersections of queries.
+
{{#pmid: 20523743}}
* Study the ''User's guide to ENCODE'' paper linked below.
+
{{#pmid: 18953035}}
 
+
{{#pmid: 17940530}}
 
+
{{#pmid: 17449815}}
&nbsp;
+
{{#pmid: 16888359}}
 
+
-->
== Links and resources ==
 
{{#pmid: 22764121}}
 
{{#pmid: 21526222}}
 
  
  
<!-- {{#pmid: 19957275}} -->
 
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
Line 170: Line 128:
 
&nbsp;
 
&nbsp;
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 
{{#lst:BIO_Assignment_Week_1|assignment_footer}}
 +
 +
<table style="width:100%;"><tr>
 +
<td style="height:30px; vertical-align:middle; text-align:left; font-size:80%;">[[BIO_Assignment_Week_9|&lt;&nbsp;Assignment&nbsp;9]]</td>
 +
<td style="height:30px; vertical-align:middle; text-align:right; font-size:80%;">[[BIO_Assignment_Week_11|Assignment&nbsp;11&nbsp;&gt;]]</td>
 +
</tr></table>
  
  

Latest revision as of 04:12, 13 December 2016

Assignment for Week 10
Expression Analysis

< Assignment 9 Assignment 11 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 

Introduction

The transcriptome is the set of a cell's mRNA molecules. It originates (mostly) from the genome, and it results (again, mostly) in the proteome. RNA that is transcribed from the genome is not yet fit for translation but must be processed: splicing is ubiquitous[1] and, in addition, RNA editing has been encountered in many species. Some authors therefore refer to the exome, the set of transcribed exons, to indicate the actual coding sequence.

Microarray technology, the quantitative, sequence-specific hybridization of labelled nucleotides in chip-format, was the first domain of "high-throughput biology". Today, it has largely been replaced by RNA-seq: quantification of transcribed mRNA by high-throughput sequencing and mapping reads to genes. Quantifying gene expression levels in a tissue-, development-, or response-specific way has yielded detailed insight into cellular function at the molecular level, with recent results of single-cell sequencing experiments adding a new level of precision. But not all transcripts are mapped to genes: we increasingly realize that the transcriptome is not merely a passive buffer of expressed information on its way to be translated into proteins, but contains multiple levels of complex regulation through hybridization of small nuclear RNAs[2].

In this assignment, we will look at differential expression of Mbp1 and its target genes.


 

GEO2R


In this exercise we will use the analysis facilities of the GEO database at the NCBI.

Task:

First, we will search for relevant data sets on GEO, the NCBI's database for expression data.
  1. Navigate to the entry page for GEO data sets.
  2. Enter the following query in the usual Entrez query format: "cell cycle"[ti] AND "saccharomyces cerevisiae"[organism].
  3. You should get two datasets among the top hits that analyze wild-type yeast (W303a cells) across two cell-cycles after release from alpha-factor arrest. Choose the experiment with lower resolution (13 samples).
  4. On the linked GEO DataSet Browser page, follow the link to the Accession Viewer page: the "Reference series".
  5. Read about the experiment and samples, then follow the link to analyze with GEO2R
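
The query from step 2 can also be written as a request URL for NCBI's E-utilities. The following is a minimal sketch that only builds the URL and does not contact the server. It assumes the public `esearch.fcgi` endpoint and the Entrez database name `gds` for GEO DataSets; the helper name `build_geo_query_url` and the `retmax` value are illustrative choices, not part of the assignment.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_query_url(term, retmax=20):
    """Return an esearch URL for the GEO DataSets database (db=gds).

    `build_geo_query_url` and `retmax=20` are hypothetical, for illustration.
    """
    params = {"db": "gds", "term": term, "retmax": retmax}
    # urlencode percent-encodes quotes and brackets and turns spaces into "+".
    return ESEARCH + "?" + urlencode(params)


# The Entrez query used in step 2 of the task.
query = '"cell cycle"[ti] AND "saccharomyces cerevisiae"[organism]'
url = build_geo_query_url(query)
print(url)
```

Pasting the printed URL into a browser should return an XML list of matching record IDs, which is one way to check that a query is well formed before using it in the web interface.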
Now proceed to apply this to the yeast cell-cycle study:
[Figure: Value distribution for the yeast cell-cycle experiment GSE3635. Experiments are grouped approximately into equivalent time-points on a cell cycle.]
  1. Define groups: the associated publication shows us that one cell-cycle takes almost exactly 60 minutes. Create timepoints T0, T1, T2, ... T5. Then associate the 0 and 60 min. samples with "T0"; 10 and 70 minutes get grouped as "T1"; 20 and 80 minutes are T2, etc. up to T5. The final sample does not get assigned.
  2. Confirm that the Value distributions are unbiased by accessing the value distribution tab - overall, in such experiments, the bulk of the expression values should not change and thus means and quantiles of the expression levels should be about the same.
  3. Your distribution should look like the image on the right: properly grouped into six categories, and unbiased regarding absolute expression levels and trends.
  4. Look for differentially expressed genes: open the GEO2R tab and click on Top 250.
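
The grouping rule in step 1 amounts to taking each sampling time modulo the 60 minute cycle length. The sketch below illustrates this under the assumption that the 13 samples were taken every 10 minutes from 0 to 120 min (inferred from the grouping described above, not stated explicitly); the function name `assign_timepoint` is hypothetical.

```python
def assign_timepoint(minutes, period=60, step=10, n_cycles=2):
    """Map a sampling time to a group label "T0".."T5", or None if unassigned.

    Assumes one cell cycle lasts `period` minutes and samples are `step`
    minutes apart; samples beyond `n_cycles` full cycles stay unassigned.
    """
    if minutes < 0 or minutes >= period * n_cycles:
        return None  # e.g. the final sample (120 min) is not assigned
    return "T{}".format((minutes % period) // step)


# Assumed sampling scheme: 13 samples, every 10 minutes from 0 to 120 min.
sample_times = range(0, 130, 10)
groups = {t: assign_timepoint(t) for t in sample_times}

for t, label in groups.items():
    print(t, label)
```

Each label T0 to T5 then collects exactly two samples, one from each cycle, which is what the GEO2R group definition in step 1 asks for.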
Analyze the results.
  1. Examine the top hits. Click on a few of the gene names in the Gene.symbol column to view the expression profiles that tell you why the genes were found to be differentially expressed. What do you think? Is this what you would have expected for genes' responses to the cell-cycle? What seems to be the algorithm's notion of what "differentially expressed" means?
  2. Look for expected genes. Here are a few genes that are known to be differentially expressed in the cell-cycle as target genes of the MBF complex: DSE1, DSE2, ERF3, HTA2, HTB2, and GAS3. But what about the MBF complex proteins themselves: Mbp1 and Swi6?

The notions of "differential expression" and "cell-cycle dependent expression" do not overlap completely. Significant differential expression is mathematically determined for genes that have low variance within groups and large differences between groups. This algorithm has no notion of any expectation you might have about the shape of the expression profile. All it finds are genes for which differential expression between some groups is statistically supported. The algorithm returns the top 250 of those. Consistency within groups is very important, whereas we might intuitively give more weight to conformance to our expectations of a cyclical pattern.
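
To make "low variance within groups and large differences between groups" concrete, here is a minimal sketch of a plain one-way ANOVA F statistic. This is a stand-in chosen for illustration: GEO2R's own statistics are more elaborate, and the function name `f_statistic` and both example profiles are hypothetical values, not data from GSE3635.

```python
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square / within-group mean square."""
    k = len(groups)                       # number of groups
    n = sum(len(g) for g in groups)       # total number of samples
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        return float("inf")               # perfectly consistent groups
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Two hypothetical genes with identical group means (1.05, 2.05, 3.05):
tight = [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]   # replicates agree closely
noisy = [[0.0, 2.1], [1.0, 3.1], [2.0, 4.1]]   # replicates disagree

print(f_statistic(tight), f_statistic(noisy))
```

Both genes change by the same amount between groups, but only the first one has consistent replicates, so only it receives a large score. Note that nothing in the calculation refers to the order of the groups or to a cyclical shape.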

Let's see if we can group our time points differently to enhance the contrast between expression levels for cyclically expressed genes. Let's define only two groups: one set before and between the two cycles, one set at the peaks - and we'll omit some of the intermediate values.

  1. Remove all of your groups and define two groups only. Call them "A" and "B".
  2. Assign the samples for T = 0, 10, 60 and 70 min. to the "A" group. Assign the samples for T = 30, 40, 90, and 100 min. to the "B" group.
  3. Recalculate the Top 250 differentially expressed genes (you might have to refresh the page to get the "Top 250" button back.) Which of the "known" MBF targets are now contained in the set? What about Mbp1 and Swi6?
  4. Finally: Let's compare the expression profiles for Mbp1, Swi6 and Swi4. It is not obvious that transcription factors are themselves under transcriptional control, as opposed to being expressed at a basal level and activated by phosphorylation or ligand binding. In a new page, navigate to the GEO Profiles page and enter (Mbp1 OR Swi6 OR Swi4 OR Nrm1 OR Cln1 OR Clb6 OR Act1 OR Alg9) AND GSE3635. Nrm1, Cln1, and Clb6 are Mbp1 target genes. Act1 and Alg9 are beta-Actin and mannosyltransferase; these are often used as "housekeeping genes", i.e. genes with condition-independent expression levels, especially for qPCR studies, although Alg9 is also an Mbp1 target. We include them here as negative controls. GSE3635 is the ID of the GEO data set we have just studied. You could have got similar results in the Profile graph tab of the GEO2R page. What do you find? What does this tell you? Would this information allow you to define groups that are even better suited for finding cyclically expressed genes?
  5. Click on the profile graph for Mbp1 and print out the page. Write your name and student number on the page. With a red pen, in one sentence describe the evidence you find on that page that allows us to conclude whether or not Mbp1 is a cell-cycle gene. You'll probably want to think for a moment what this question really means, how a cell-cycle gene could be defined, and what can be considered "evidence", before you write. I will mark your response for a maximum of four marks.
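
The two-group comparison from steps 1 to 3 can be sketched as a simple difference of group means. This is only an illustration of the grouping idea: the function name `ab_contrast` is hypothetical, the profiles are invented values rather than data from GSE3635, and GEO2R additionally weighs the difference against the within-group variance.

```python
from statistics import mean

GROUP_A = (0, 10, 60, 70)     # minutes: before and between the two cycles
GROUP_B = (30, 40, 90, 100)   # minutes: at the peaks


def ab_contrast(profile):
    """Return mean(B) - mean(A) for a dict mapping minutes -> expression value."""
    a = mean(profile[t] for t in GROUP_A)
    b = mean(profile[t] for t in GROUP_B)
    return b - a


# Hypothetical profiles for illustration only, not data from GSE3635.
cyclic = {0: -1.0, 10: -0.8, 30: 0.9, 40: 1.0,
          60: -0.9, 70: -0.7, 90: 0.8, 100: 1.1}
flat = {t: 0.1 for t in cyclic}

print(ab_contrast(cyclic), ab_contrast(flat))
```

A gene whose peak falls at other time points (for example 20 and 80 min, which are omitted here) would score poorly with this grouping, which is why the choice of groups matters when looking for cyclically expressed genes.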


 


Further reading and resources

Okamura et al. (2015) COXPRESdb in 2015: coexpression database for animal species by DNA-microarray and RNAseq-based expression data with multiple quality assessment systems. Nucleic Acids Res 43:D82-6. (pmid: 25392420)

The COXPRESdb (http://coxpresdb.jp) provides gene coexpression relationships for animal species. Here, we report the updates of the database, mainly focusing on the following two points. For the first point, we added RNAseq-based gene coexpression data for three species (human, mouse and fly), and largely increased the number of microarray experiments to nine species. The increase of the number of expression data with multiple platforms could enhance the reliability of coexpression data. For the second point, we refined the data assessment procedures, for each coexpressed gene list and for the total performance of a platform. The assessment of coexpressed gene list now uses more reasonable P-values derived from platform-specific null distribution. These developments greatly reduced pseudo-predictions for directly associated genes, thus expanding the reliability of coexpression data to design new experiments and to discuss experimental results.

Barrett et al. (2013) NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 41:D991-5. (pmid: 23193258)

The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is an international public repository for high-throughput microarray and next-generation sequence functional genomic data sets submitted by the research community. The resource supports archiving of raw data, processed data and metadata which are indexed, cross-linked and searchable. All data are freely available for download in a variety of formats. GEO also provides several web-based tools and strategies to assist users to query, analyse and visualize data. This article reports current status and recent database developments, including the release of GEO2R, an R-based web application that helps users analyse GEO data.




 


Footnotes and references

  1. Strictly speaking, splicing is a eukaryotic achievement; however, there are examples of splicing in prokaryotes as well.
  2. (2015) The noncoding explosion. Nat Struct Mol Biol 22:1. (pmid: 25565024)
     Jarvis & Robertson (2011) The noncoding universe. BMC Biol 9:52. (pmid: 21798102)


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.


