R expression analysis

From "A B C"
Jump to navigation Jump to search

R Expression Analysis


A short tutorial for expression analysis in R.




 

Exercises


In this exercise we will search for and download a dataset in GEO, open it using R, perform some data normalizations and finally search for differentially expressed genes.

If you are not (or no longer) familiar with R, work through the introductory tutorial on this Wiki.

The R code for this exercise should run if you simply copy and paste it. However, it has been shown that exercises with contributed code are significantly more effective if you actually type the code, rather than copying it from a source. In whatever way you do it, the code should end up in an editor window in your local R session, you can then save it, modify it, and run it in the usual way: by selecting a piece of code and simply hitting <cmd-enter> (<ctrl-R> on Windows).

For this exercise, we will work with a human cell cycle dataset, simply because it may allow us to ask some more complex questions then just "which gene is overexpressed?" (albeit not in this specific exercise).

Task:

Part 1
Identify a dataset


There are many platforms available, each with different pros and cons. For our purpose, we should work with a gene-centric array, as a reasonable compromise between accuracy and size. We could chose some of the much larger exon arrays, or even full tiling arrays instead, but would then have to add the extra step to compile the different measurements for each gene into a single value. One of the standard technologies in the field are Affymetrix arrays.

  • Click on the "Platforms" tab.
  • Enter Affymetrix Human as a search term. To get an idea of the datavolume of modern arrays, click on the column header for Data rows, then again to sort descending: the largest arrays have over 4,000,000 data values!
  • Now sort the list again by number of Series, descending, to pick up the chips for which the largest number of experiments are available. A good, relatively modern and widely used platform is [HuGene-1_0-st] Affymetrix Human Gene 1.0 ST Array [transcript (gene) version] - published in 2007 and the most popular gene level array in the list, with 33,000 data columns.
  • Click on the platform ID GPL6244 and briefly review the data that is available for the platform.
  • You could open the list of experiments in this page and text-search for "cell cycle" ... but a more principled approach is to go back to the list and click on the number in the "Series" column for this platform.
  • Then enter cell cycle in the search field.

The most "physiological" (if anything to do with HeLa cells can be considered to be physiological at all, that is) experimental design is the "Cell cycle expression profiles in HeLa cells" experiment.

  • Access its description page.
  • Briefly study the description, then follow the link to the Experiment's citation. Review it briefly to get a sense of what to expect. It's a lot more fun to work with data for which you understand the background, so you can evaluate your results, form hypotheses etc. - but the details of the paper will not be on the quiz.


Sadasivam et al. (2012) The MuvB complex sequentially recruits B-Myb and FoxM1 to promote mitotic gene expression. Genes Dev 26:474-89. (pmid: 22391450)

PubMed ] [ DOI ] Cell cycle progression is dependent on two major waves of gene expression. Early cell cycle gene expression occurs during G1/S to generate factors required for DNA replication, while late cell cycle gene expression begins during G2 to prepare for mitosis. Here we demonstrate that the MuvB complex-comprised of LIN9, LIN37, LIN52, LIN54, and RBBP4-serves an essential role in three distinct transcription complexes to regulate cell cycle gene expression. The MuvB complex, together with the Rb-like protein p130, E2F4, and DP1, forms the DREAM complex during quiescence and represses expression of both early and late genes. Upon cell cycle entry, the MuvB complex dissociates from p130/DREAM, binds to B-Myb, and reassociates with the promoters of late genes during S phase. MuvB and B-Myb are required for the subsequent recruitment of FoxM1 to late gene promoters during G2. The MuvB complex remains bound to FoxM1 during peak late cell cycle gene expression, while B-Myb binding is lost when it undergoes phosphorylation-dependent, proteasome-mediated degradation during late S phase. Our results reveal a novel role for the MuvB complex in recruiting B-Myb and FoxM1 to promote late cell cycle gene expression and in regulating cell cycle gene expression from quiescence through mitosis.


Task:

Part 2
Download the data in R and mine for differential expression
  • Open an R session on your computer, choose a (new) working directory, then choose FileNew Document and save the file in your working directory.
# ABC_R_Expression_Analysis.R
# Analysis of differential expression with
# a dataset sourced from GEO.
 
# It is good practice to set variables you might want to change
# in a header block so you don't need to hunt all over the code
# for strings you need to update.
#
setwd("/your/R/working/directory")
 
# ================================================
#    Download required packages
# ================================================

# The required packages are available via the BioConductor project.
# Download and install them if you haven't done so already.
# Update all old packages to their newest version if prompted.
# Otherwise, skip this step.
source("http://bioconductor.org/biocLite.R")
biocLite("Biobase")
biocLite("GEOquery")
biocLite("limma")
biocLite("samr")


# ================================================
#    Load required libraries
# ================================================

library(Biobase)
library(GEOquery)
library(limma)

# If you get a warning that the packages were built under a later version
# than the R you are currently running, you may consider to update R to
# the newest version.

# ================================================
#    Load series and platform data from GEO
# ================================================

# First, check what getGEO does:
?getGEO()

# now download the GSE26922 expression data series
gse <- getGEO("GSE26922")
# setting up the datastructures takes a little while ... patience.

# getGEO returns a list of expression objects, but ...
length(gse)
# ... shows us there is only one object in it. We assign it to
# the same variable.
gse <- gse[[1]]

# characterize briefly what we have:
show(gse)

# The actual expression data are accessible in the "exprs" ... an
# Expression Set, the generic data class that BioConductor uses
# for expression data
head(exprs(gse))
length(exprs(gse)) # 32321 rows times 18 columns

# exprs(gse) is a matrix that we can assign to its own variable, to
# conveniently access the data rows and columns
ex <- exprs(gse)
dim(ex)
colnames(ex)
ex[1:5,]

# ================================================
#    Analyze value distributions
# ================================================

boxplot(ex)
# The boxplot shows that the data are very similar for range and mean;
# this is explained on the documentation accompanying the samples on GEO:
# http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM662895
#     " After loading the raw CEL files in R, RMA (Robust Multi-Array Averaging)
#     [a function of the BioConductor Affy package] was used to subtract
#     background, normalize intensities and summarize gene expression levels."

# This means the required preparation of the measured values for comparison
# has already been perfomred. Had it not, we would need to this ourselves. 

# Let us look more closely at the data distributions in the samples. A
# common question is whether the variance in the data is independent of
# the size of the signal - which is a common assumption for many analyses
# of significance.

# In principle, we can write a loop to calculate values for each row
# of our matrix. Run this code for the first three columns of the expression
# matrix and note how very long it takes.

tmp <- rep(0, nrow(ex))
for (i in 1:nrow(ex)) {
	tmp[i] <- mean(c(ex[i,1], ex[i,2], ex[i,3]))
}

# Such for-loops are easy to understand, but very slow in R. The R-like
# solution here is always to use one of the apply() functions.
??apply

# Now compare execution speed: 
# apply function "mean" to columns 1 to 3 of array ex, by rows
mean <- apply(ex[, 1:3], 1, "mean")

# We calculate standard deviation as well, and plot.
sd <- apply(ex[, 1:3], 1, "sd")
plot(mean, sd)

# The plot looks fine, but is too dense. No wonder, it has over 30,000 points.
# For expression analyses, there are several alternatives to plot very high
# density data.

# hexbin 
install.packages("hexbin")
library(hexbin)
?hexbin
plot(hexbin(mean,sd))

# smoothScatter
?smoothScatter
smoothScatter(mean,sd, nrpoints=200, pch=20, cex=0.5, col="#6633BB55")

#densCols
?densCols
plot (mean, sd, col=densCols(mean,sd), pch=20)

# We can overlay this plot with a LOWESS curve:
?lowess
trend <- lowess(mean, sd)
lines(trend, col="firebrick", lwd=2)

# This plot again confirms that our data is very well behaved.

# ================================================
#    Housekeeping: gene names
# ================================================

# The row names in our expression data are chip specific IDs not
# gene names.
featureNames(gse)[1:10]

# We need to load the platform specific annotations to get the genes.
# The platform is recorded as part of the expression set metadata ...
annotation(gse)
# ... and annotation can be easily downloaded from GEO ...
platf <- getGEO(annotation(gse), AnnotGPL=TRUE)
# This is very extensive data for each gene, including the IDs,
# gene symbols, description, GO codes ...
show(platf)

# the attr() function allows us to extract specific data from this
# very large table. For example: row 65
attr(dataTable(platf), "table")[65,]

# We can get the IDs and symbols we need. Study this expresion carefully.
attr(dataTable(platf), "table")[1:100,c("ID", "Gene symbol")]

# We store this in the matrix IDs
IDs <- attr(dataTable(platf), "table")[,c("ID", "Gene symbol")]


# Now all we need to do is to create a vector of rownames from our data, 
# then to replace the rownames with the corresponding gene symbols.
rows <- rownames(ex)

# The replacement is done using which(), which returns the index
# of a row that matches a particular condition, and rownames(), which
# gets or sets rownames of a table.

# consider the following expressions:
rows[4146]
which(IDs[,1] == rows[4146])
IDs[which(IDs[,1] == rows[4146]), 2]
# The "20327 levels" message comes about since R treats character data
# as factors, unless told not to.
# To assign the row names, we need to cast the value as.character()
as.character(IDs[which(IDs[,1] == rows[4146]), 2])

# We loop over the array to create a vector of gene symbols
symb <- rep(0, length(rows))
for (i in 1:length(rows)) {
	symb[i] <- as.character(IDs[which(IDs[,1] == rows[i]), 2])
}
# ... a bit slow, isn't it? But hopefully easy to understand.
# now we can replace the rownames of our matrix with the symbols ...
rownames(ex) <- symb
# ... and check that we did not change anything.
rownames(ex)[4146]

# But there are two problems we need to address: Consider ...
tail(rownames(ex), 10)
# (A) We don't actually have gene symbols for all rows. Many of them are empty strings.
# (B) Some of the symbols are not unique - apparently we have multiple measurements
# for some genes.
# Let's address (A) first.
# With a few keystrokes we can create a new submatrix that includes only non-
# empty row names. First, create a vector for our selection: 

sel <- apply(as.matrix(symb, ncol=1), 1,
             function(x){if (x == "") return(FALSE) else return(TRUE)})
# then select from the matrix.

selExpr <- c()
selSymb <- c()

for (i in 1:nrow(ex)) {
	if (sel[i]) {
		selExpr <- rbind(selExpr, ex[i,])
		selSymb <- rbind(selSymb, symb[i])
	}
}

rownames(selExpr) <- selSymb

# Regarding the duplicate rows, let's see what we have:
a <- table(rownames(selExpr)) # count the number of occurrences of each rowname
b <- a[a > 1] # select all those that occur more than once
length(b)     # how many are there?
sum(b)        # how many genes do they affect

# Less than 10% of genes affected ...
# I could iterate over b, calculate means and remove the offenders from the table,
# but I'll leave that as an exercise. Remember that you can exclude a rowname
# from a matrix, with the matching operator %in%, like so ...
help("%in%")

c <- tail(selExpr, 10)
c
excluded <- rownames(c) %in% c("NPM1")
c[!excluded,]

# ================================================
#    Differential expression
# ================================================

# In order to analyze differential expression, first we take
# the log of all data values.

# Let's make sure there are no values less or equal to zero in the matrix:
min(selExpr)

# Then ...
logSelExpr <- log2(selExpr)

# The typical way to look at expression differences is to plot
# a so called MA-plot:
# M (minus) is the log ratio.
# A (average) is the overall intensity.

# Here we plot MA for the 4 hour sample:
M <- apply(logSelExpr, 1, function(x) {mean(x[7:9])-mean(x[1:3])} )
A <- apply(logSelExpr, 1, function(x) {sum(x[c(1:3, 7:9)])/6} )

plot(A, M, xlab="A", ylab="M", col=densCols(A, M), pch=20, cex=0.5)

# Let's calculate which expression values have significantly changed
# between 0h and 4h. We compare significance levels with and
# without Bonferroni adjustment. The probabilities that there is no
# change in expression levels can be calculated with a t-test. 
# What we need here is a two-tailed, unpaired t-test that does not assume equal
# sample variances.
?t.test

# Let's look at two examples - we simply take the rows for which M has the
# maximum and minimum values:
which.min(abs(M))
t.test(logSelExpr[1846,1:3], logSelExpr[1846,7:9])

which.max(abs(M))
t.test(logSelExpr[14834,1:3], logSelExpr[14834,7:9])

# Let's apply this to the entire dataset, and plot significant differences...

n <- nrow(logSelExpr)
pVal   <- rep(0, n)
for(i in 1:n) {
  pVal[i] <- t.test(logSelExpr[i,1:3], logSelExpr[i,7:9])$p.value
}
points(A[pVal<0.05], M[pVal<0.05], col="#445588", cex=0.6)

# Let's also calculate Bonferroni corrected statistics.
pValBC  <- p.adjust(pVal, method="bonferroni")
points(A[pValBC<0.05], M[pValBC<0.05], col="#FFCC00", cex=1.1, pch=20)

# As you see, of all the potential differentially expressed genes, only three
# remain significant under Bonferroni corrected significance levels.
# Since this is an unsatisfactory situation, alternative statistical
# approaches have been developed focussing on the so-called
# False Discovery Rate.
# fdr is one of the options R offers for p-value adjustments. Here we
# use the "Benjamini-Hochberg" method.
?p.adjust
pValBHC  <- p.adjust(pVal, method="BH")
# We allow a 2% false positive rate:
points(A[pValBHC<0.02], M[pValBHC<0.02], col="#BB0000", cex=0.7, pch=20)

# Now we have 43 hits, (just check ...
sum(pValBHC<0.02)
# ...) but note that at a 2% FDR we actually don't even expect a
# false positive in the collection.

# What is it, that FDR is looking for? Consider the following plot
# of M against the average standard deviation of the two replicate sets:
SD <- apply(logSelExpr, 1, function(x) {mean( c(sd(x[7:9]), sd(x[1:3])) )} )

plot(A, SD, xlab="Average", ylab="Standard Deviation", col=densCols(A, SD), pch=20, cex=0.5)
points(A[pValBHC<0.02], SD[pValBHC<0.02], col="#BB0000", cex=0.7, pch=20)

# We find that all values found through the FDR criterion have high
# absolute values and small variances. Whether this is the biologically
# most meaningful criterion is a valid question. 

# Note that the NCBI GEO2R method uses a diferent approach - based on the
# function eBayes(), a part of the BioConductor limma package.
# For details of the implementation, refer to the 
# R code at the GEO site.
?eBayes

# [End]


 

Further reading and resources