R expression analysis

Revision as of 15:45, 9 February 2013 by Boris


A short tutorial for expression analysis in R.




 

Exercises


In this exercise we will search for and download a dataset from GEO (the NCBI Gene Expression Omnibus), open it in R, normalize the data, and finally search for differentially expressed genes.

If you are not (or no longer) familiar with R, work through the introductory tutorial on this Wiki.

The R code for this exercise should run if you simply copy and paste it. However, exercises with contributed code tend to be more effective if you actually type the code rather than copy it from a source. Whichever way you do it, the code should end up in an editor window in your local R session. You can then save it, modify it, and run it in the usual way: select a piece of code and hit <cmd-enter> (<ctrl-R> on Windows).

For this exercise, we will work with a human cell cycle dataset, simply because it may allow us to ask some more complex questions than just "which gene is overexpressed?" (albeit not in this specific exercise).

Task:

Part 1
Identify a dataset


There are many platforms available, each with different pros and cons. For our purpose, we should work with a gene-centric array, as a reasonable compromise between accuracy and size. We could choose one of the much larger exon arrays, or even a full tiling array, instead, but we would then have to add an extra step to compile the different measurements for each gene into a single value. Affymetrix arrays are one of the standard technologies in the field.
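The extra step mentioned above, compiling several measurements per gene into a single value, could look roughly like this. This is a minimal sketch in base R: the probe values and the gene labels (GENE_A, GENE_B) are made up, and the median is only one possible choice of summary. Dedicated exon-array pipelines use more elaborate summarization methods.

```r
# Toy data (hypothetical values): five probes that measure two genes
# in three samples.
probes <- data.frame(
  gene = c("GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_B"),
  s1   = c(7.1, 7.4, 6.9, 10.2, 10.0),
  s2   = c(7.3, 7.2, 7.0, 11.1, 10.9),
  s3   = c(8.0, 8.3, 7.9, 10.1, 10.3)
)

# Collapse all probes of a gene to one value per sample,
# here using the median as the summary statistic.
geneLevel <- aggregate(. ~ gene, data = probes, FUN = median)
geneLevel
```

With a gene-centric array such as the one chosen below, this step is not needed, because each row of the data already corresponds to one transcript cluster.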

  • Click on the "Platforms" tab.
  • Enter Affymetrix Human as a search term. To get an idea of the data volume of modern arrays, click on the column header for Data rows, then click again to sort in descending order: the largest arrays have over 4,000,000 data values!
  • Now sort the list again by number of Series, descending, to pick out the chips for which the largest number of experiments is available. A good, relatively modern and widely used platform is [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version] - published in 2007 and the most popular gene-level array in the list, with about 33,000 data rows.
  • Click on the platform ID GPL6244 and briefly review the data that is available for the platform.
  • You could open the list of experiments in this page and text-search for "cell cycle" ... but a more principled approach is to go back to the list and click on the number in the "Series" column for this platform.
  • Then enter cell cycle in the search field.

The most "physiological" experimental design (if anything involving HeLa cells can be considered physiological at all) is the "Cell cycle expression profiles in HeLa cells" experiment.

  • Access its description page.
  • Briefly study the description, then follow the link to the experiment's citation and skim the paper to get a sense of what to expect. It's a lot more fun to work with data whose background you understand, because you can evaluate your results, form hypotheses, and so on. The details of the paper will not be on the quiz, however.


Sadasivam et al. (2012) The MuvB complex sequentially recruits B-Myb and FoxM1 to promote mitotic gene expression. Genes Dev 26:474-89. (pmid: 22391450)

Cell cycle progression is dependent on two major waves of gene expression. Early cell cycle gene expression occurs during G1/S to generate factors required for DNA replication, while late cell cycle gene expression begins during G2 to prepare for mitosis. Here we demonstrate that the MuvB complex, comprised of LIN9, LIN37, LIN52, LIN54, and RBBP4, serves an essential role in three distinct transcription complexes to regulate cell cycle gene expression. The MuvB complex, together with the Rb-like protein p130, E2F4, and DP1, forms the DREAM complex during quiescence and represses expression of both early and late genes. Upon cell cycle entry, the MuvB complex dissociates from p130/DREAM, binds to B-Myb, and reassociates with the promoters of late genes during S phase. MuvB and B-Myb are required for the subsequent recruitment of FoxM1 to late gene promoters during G2. The MuvB complex remains bound to FoxM1 during peak late cell cycle gene expression, while B-Myb binding is lost when it undergoes phosphorylation-dependent, proteasome-mediated degradation during late S phase. Our results reveal a novel role for the MuvB complex in recruiting B-Myb and FoxM1 to promote late cell cycle gene expression and in regulating cell cycle gene expression from quiescence through mitosis.


Task:

Part 2
Download the data in R
  • Open an R session on your computer, choose a (new) working directory, then choose File → New Document and save the file in your working directory.

<lang source="RSplus">

# ABC_R_Expression_Analysis.R
# Analysis of differential expression and expression profiles with
# a dataset sourced from GEO.

# It is good practice to set variables you might want to change
# in a header block so you don't need to hunt all over the code
# for strings you need to update.

setwd("/your/R/working/directory")
# For example:
# setwd("~/Documents/07.TEACHING/50.5-BCB420-JTB2020 2013/ExpressionAnalysis/")

# ================================================
# Download required packages
# ================================================

# The required packages are available via the Bioconductor project.
# Download and install them if you haven't done so already.
# Update all old packages to their newest version if prompted.
# Otherwise, skip this step.
# Note: this script originally used source("http://bioconductor.org/biocLite.R")
# and biocLite(); current Bioconductor releases install via BiocManager.

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install(c("Biobase", "GEOquery", "limma", "samr"))

# ================================================
# Load required libraries
# ================================================

library(Biobase)
library(GEOquery)
library(limma)

# If you get a warning that the packages were built under a later version
# than the R you are currently running, consider updating R to
# the newest version.

# ================================================
# Load series and platform data from GEO
# ================================================

# This follows the GEO2R script, available on the GEO site.
# First, check what getGEO() does:

?getGEO

# Now download the GSE26922 dataset:

gset <- getGEO("GSE26922", GSEMatrix = TRUE)

# Characterize briefly what you have. getGEO() returns a list of
# ExpressionSet objects; print it, then keep the first element:

gset
gset <- gset[[1]]

# Make proper column names to match toptable:

fvarLabels(gset) <- make.names(fvarLabels(gset))

# The expression values are available via the function exprs():

head(exprs(gset))


</lang>
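Before normalizing, it helps to check what scale the expression values are on. The following is a minimal sketch under stated assumptions: the matrix `ex` holds made-up numbers that stand in for `exprs(gset)`, so the code runs without a download, and `looksUnlogged()` is a hypothetical helper with an arbitrary cutoff, not a function from GEOquery or limma. It is loosely modelled on the kind of quantile check that GEO2R-generated scripts apply before a log2 transformation.

```r
# Toy matrix standing in for exprs(gset): rows are probes, columns are samples.
# The values are made up for illustration and lie on the raw intensity scale.
ex <- matrix(2^seq(4, 12, length.out = 40), nrow = 10,
             dimnames = list(paste0("probe", 1:10), paste0("GSM", 1:4)))

# Hypothetical helper: raw intensities typically reach into the hundreds or
# thousands, whereas log2-transformed values rarely exceed about 20.
looksUnlogged <- function(x, cutoff = 100) {
  as.numeric(quantile(x, 0.99, na.rm = TRUE)) > cutoff
}

dim(ex)                  # number of probes, number of samples
summary(as.vector(ex))   # range and quartiles of all values

if (looksUnlogged(ex)) {
  ex[ex <= 0] <- NA      # log2 is undefined for non-positive values
  ex <- log2(ex)
}
```

With the real data you would replace the toy matrix with `ex <- exprs(gset)`. Whether a transformation is needed depends on how the submitters processed the series, so inspect the summary before deciding.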

...TBC


 

Further reading and resources