Difference between revisions of "BIO Assignment Week 5"
Revision as of 04:37, 11 October 2013
Assignment for Week 5
Sequence alignment
Note! This assignment is currently active. All significant changes will be announced on the mailing list.
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
Introduction
In this assignment we will perform an optimal global and local sequence alignment, and use R to plot the alignment quality as a colored bar-graph.
Optimal sequence alignments
Online programs for optimal sequence alignment are part of the EMBOSS tools. The programs take FASTA files as input.
- Local optimal SEQUENCE alignment "water"
Task:
- Retrieve the FASTA file for the YFO Mbp1 protein and for Saccharomyces.
- Save the files as text files to your computer (if you haven't done so already). You could give them an extension of .fa.
- Access the EMBOSS Explorer site (if you haven't done so yet, you might want to bookmark it).
- Look for ALIGNMENT LOCAL, click on water, paste your FASTA sequences and run the program with default parameters.
- Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
- Considering the sequence identity cutoff we discussed in class (25% over the length of a domain), do you believe that the APSES domains are homologous?
- Change the Gap opening and Gap extension parameters to high values (e.g. 30 and 5). Then run the alignment again.
- Note what is different.
- You could try getting only an alignment for the ankyrin domains, by deleting the approximate region of the APSES domains from your input.
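To get a feeling for why raising the gap parameters changes the alignment, here is a small R sketch that tabulates affine gap penalties. It assumes the convention used in the script comments later in this assignment (the opening penalty plus one extension penalty per gap position). Alignment programs differ in whether the first gap position also pays the extension penalty, so treat the numbers as illustrative. The function name gapCost is made up for this example.

```r
# Affine gap penalty: a sketch, assuming cost = open + n * extend
# (some programs charge open + (n - 1) * extend instead).
gapCost <- function(n, open = 10, extend = 0.5) {
  open + n * extend
}

gapLengths <- c(1, 3, 10, 30)

# penalties with the defaults quoted later in this assignment (10 and 0.5)
defaultCost <- gapCost(gapLengths)

# penalties with the high values suggested in the task (30 and 5)
highCost <- gapCost(gapLengths, open = 30, extend = 5)

# side-by-side comparison
data.frame(gapLengths, defaultCost, highCost)
```

With high penalties even a short gap costs more than several mismatched residue pairs, which is why the alignment tends to break into fewer, longer ungapped stretches.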
- Global optimal SEQUENCE alignment "needle"
Task:
- Look for ALIGNMENT GLOBAL, click on needle, paste your FASTA sequences and run the program with default parameters.
- Study the results. You will find that the alignment extends over the entire protein, likely with long indels at the termini.
- Change the Output alignment format to FASTA pairwise simple, to retrieve the aligned FASTA files with indels.
- Copy the aligned sequences (with indels) and save them to your computer. You could give them an extension of .fal to remind you that they are aligned FASTA sequences.
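Once you have aligned sequences with indels, you can check the percent identity yourself. The sketch below uses two made-up aligned strings (not Mbp1 sequences) and base R only. Note that the result depends on the denominator you choose: aligned columns only, or all alignment columns including indels.

```r
# Percent identity of two aligned sequences (gap character "-").
# The two strings are made-up examples, not Mbp1 sequences.
ali1 <- "MKV-LTAERQ"
ali2 <- "MRVALT-EKQ"

a <- strsplit(ali1, "")[[1]]
b <- strsplit(ali2, "")[[1]]
stopifnot(length(a) == length(b))   # aligned sequences must have equal length

aligned <- a != "-" & b != "-"      # columns without indels
same <- aligned & a == b            # aligned columns with identical residues

pctAligned <- 100 * sum(same) / sum(aligned)   # identity over aligned columns
pctColumns <- 100 * sum(same) / length(a)      # identity over all columns

pctAligned
pctColumns
```

To apply the cutoff discussed in class to a single domain, you would restrict a and b to the alignment columns that cover that domain before counting.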
The Mutation Data Matrix
The NCBI makes its alignment matrices available by ftp at ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 . Access that site and download the BLOSUM62 matrix to your computer. You could give it a filename of BLOSUM62.mdm.
It should look like this.
# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
A R N D C Q E G H I L K M F P S T W Y V B Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4
B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4
Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4
X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
Task:
- Study this and make sure you understand what this table is, how it can be used, and what a reasonable range of values for identities and pairscores for non-identical, similar and dissimilar residues is. Ask on the mailing list in case you have questions.
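As a starting point for thinking about value ranges, the following sketch copies a 5 x 5 excerpt of the table above (I, L, V, D, W) into an R matrix and compares identities with similar and dissimilar pairs. The grouping of I, L and V as "similar" and their pairing with D and W as "dissimilar" is an assumption made for illustration only.

```r
# A 5 x 5 excerpt of BLOSUM62, values copied from the table above
aa <- c("I", "L", "V", "D", "W")
sub62 <- matrix(c( 4,  2,  3, -3, -3,
                   2,  4,  1, -4, -2,
                   3,  1,  4, -3, -3,
                  -3, -4, -3,  6, -4,
                  -3, -2, -3, -4, 11),
                nrow = 5, byrow = TRUE, dimnames = list(aa, aa))

# identities are on the diagonal
identities <- diag(sub62)

# pairs among the aliphatic residues I, L, V (assumed "similar")
aliphatic <- c("I", "L", "V")
similar <- sub62[aliphatic, aliphatic][upper.tri(diag(3))]

# aliphatic residues paired with D or W (assumed "dissimilar")
dissimilar <- sub62[aliphatic, c("D", "W")]

range(identities)
range(similar)
range(dissimilar)
```

Note that identities do not all score the same: a conserved tryptophan contributes much more to an alignment score than a conserved isoleucine.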
The DNA binding site
Now that you know how YFO Mbp1 aligns with yeast Mbp1, you can evaluate functional conservation in these homologous proteins. You probably already downloaded the two Biochemistry papers by Taylor et al. (2000) and by Deleeuw et al. (2008) that we encountered in Assignment 2. These discuss the residues involved in DNA binding[1]. In particular, residues 50-74 have been proposed to comprise the DNA recognition domain.
Task:
- Using the APSES domain alignment you have just constructed, find the YFO Mbp1 residues that correspond to the range 50-74 in yeast.
- Note whether the sequences are especially highly conserved in this region.
- Using VMD, look at the region. Use the sequence viewer to make sure that the sequence numbering between the paper and the PDB file are the same (they are often not identical!). Then select the residues - the proposed recognition domain - and color them differently for emphasis. Study this in stereo to get a sense of the spatial relationships. Check where the conserved residues are.
- A good representation is Licorice - but other representations that include sidechains will also serve well. You may want to reduce the thickness of bonds to declutter the image a bit.
- Calculate a solvent accessible surface of the protein in a separate representation and make it transparent.
- You could combine three representations: (1) the backbone (in new cartoon), (2) the sidechains of residues that presumably contact DNA, distinctly colored, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface. Note: VMD makes smart use of GPU capabilities of your computer. Try setting the VMD graphics parameters to visualize with GLSL - your transparent surface may look much better.
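Finding the residues that correspond to a range such as 50-74 means translating sequence numbers of one protein into alignment columns and then into sequence numbers of the other. The sketch below shows one way to do this in base R, using a short made-up alignment rather than real Mbp1 sequences. The helper name mapRange is hypothetical.

```r
# Made-up aligned pair; "-" marks indels.
ali1 <- "MS--NQIYSA"   # stands in for the reference (yeast) sequence
ali2 <- "MSPKNQ-YTA"   # stands in for the YFO sequence

a <- strsplit(ali1, "")[[1]]
b <- strsplit(ali2, "")[[1]]
stopifnot(length(a) == length(b))

# residue number of each sequence at every alignment column (NA at indels)
num1 <- ifelse(a == "-", NA, cumsum(a != "-"))
num2 <- ifelse(b == "-", NA, cumsum(b != "-"))

# hypothetical helper: alignment columns and partner residues
# for the residue range from..to of sequence 1
mapRange <- function(from, to) {
  cols <- which(!is.na(num1) & num1 >= from & num1 <= to)
  data.frame(column = cols,
             res1 = a[cols], num1 = num1[cols],
             res2 = b[cols], num2 = num2[cols])
}

mapped <- mapRange(3, 6)
mapped
```

For the real task you would read your two .fal files instead of the made-up strings and call mapRange(50, 74). Rows with NA in num2 are positions where the second sequence has an indel.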
DNA binding interfaces are expected to comprise a number of positively charged amino acids that might form salt-bridges with the phosphate backbone.
Task:
- Study and consider whether this is the case here and which residues might be included.
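A quick count can support what you see in VMD. The sketch below tallies lysine and arginine in a sequence segment and reports their positions. The peptide is a made-up example, not an Mbp1 sequence. Histidine is counted separately because it is only partly protonated near neutral pH.

```r
# Count positively charged residues in a segment.
# The peptide below is a made-up example, not an Mbp1 sequence.
segment <- "GKRTRVSHAKLE"
res <- strsplit(segment, "")[[1]]

nBasic  <- sum(res %in% c("K", "R"))    # Lys and Arg
nHis    <- sum(res == "H")              # His, reported separately
nAcidic <- sum(res %in% c("D", "E"))    # Asp and Glu, for comparison

# positions of Lys and Arg within the segment (1-based)
posBasic <- which(res %in% c("K", "R"))

nBasic
nHis
nAcidic
posBasic
```

The positions are relative to the start of the segment, so add the segment's offset to get sequence numbers that you can select in VMD.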
R code: coloring the alignment by quality
Task:
- Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
# BiostringsExample.R
# Short tutorial on sequence alignment with the Biostrings package.
# Boris Steipe, October 2013
#
setwd("~/path/to/your/R_files/")
# Biostrings is a package within the Bioconductor project.
# Bioconductor packages have their own installation system,
# they are normally not installed via CRAN.
# http://www.bioconductor.org/packages/2.13/bioc/vignettes/Biostrings/inst/doc/PairwiseAlignments.pdf
source("http://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
library(help=Biostrings)
# Read in two fasta files - you will need to edit this for YFO
sacce <- readAAStringSet("mbp1_sacce.fa", format="fasta")
ustma <- readAAStringSet("mbp1_ustma.fa", format="fasta")
sacce
names(sacce)
names(sacce) <- "Mbp1 SACCE"
names(ustma) <- "Mbp1 USTMA" # Example only ... modify for YFO
width(sacce)
as.character(sacce)
# Biostrings takes a sophisticated approach to sequence alignment ...
?pairwiseAlignment
# ... but the use in practice is quite simple:
ali <- pairwiseAlignment(sacce, ustma, substitutionMatrix = "BLOSUM50")
ali
pattern(ali)
subject(ali)
writePairwiseAlignments(ali)
p <- aligned(pattern(ali))
names(p) <- "Mbp1 SACCE aligned"
s <- aligned(subject(ali))
names(s) <- "Mbp1 USTMA aligned"
# don't overwrite your EMBOSS .fal files
writeXStringSet(p, "mbp1_sacce.R.fal", append=FALSE, format="fasta")
writeXStringSet(s, "mbp1_ustma.R.fal", append=FALSE, format="fasta")
# Done.
- Compare the alignments you received from the EMBOSS server with those you computed using R. Are they approximately the same? Exactly? You did use different matrices and gap parameters, so minor differences are to be expected. But by and large you should get the same alignments.
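Comparing two alignments by eye gets tedious for long sequences. One way to quantify the agreement is to list which residue of the first sequence is paired with which residue of the second in each alignment, and then see how many pairs the two alignments share. The sketch below does this for two made-up alignments of the same pair of short sequences. The helper name alignedPairs is hypothetical.

```r
# Hypothetical helper: list the residue pairs "i:j" that an alignment matches up
alignedPairs <- function(ali1, ali2) {
  a <- strsplit(ali1, "")[[1]]
  b <- strsplit(ali2, "")[[1]]
  stopifnot(length(a) == length(b))
  i <- cumsum(a != "-")            # residue numbers in sequence 1
  j <- cumsum(b != "-")            # residue numbers in sequence 2
  keep <- a != "-" & b != "-"      # only columns without indels form pairs
  paste(i[keep], j[keep], sep = ":")
}

# Two made-up alignments of the same pair of sequences (ACDEF vs. ACDGEF)
pairsA <- alignedPairs("ACD-EF", "ACDGEF")
pairsB <- alignedPairs("ACDE-F", "ACDGEF")

shared <- intersect(pairsA, pairsB)
allPairs <- union(pairsA, pairsB)

# fraction of residue pairs on which both alignments agree
length(shared) / length(allPairs)
```

A value of 1 means the two alignments are exactly the same. Lower values tell you how much they differ, and setdiff(pairsA, pairsB) shows where.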
We will now use the aligned sequences to compute a graphical display of alignment quality.
Task:
- Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
# aliScore.R
# Evaluating an alignment with a sliding window score
# Boris Steipe, October 2012. Update October 2013
setwd("~/path/to/your/R_files/")
# Scoring matrices can be found at the NCBI.
# ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
# It is good practice to set variables you might want to change
# in a header block so you don't need to hunt all over the code
# for strings you need to update.
#
fa1 <- "mbp1_sacce.R.fal"
fa2 <- "mbp1_ustma.R.fal"
code1 <- "SACCE"
code2 <- "USTMA"
mdmFile <- "BLOSUM62.mdm"
window <- 9 # window-size (should be an odd integer)
# ================================================
# Read data files
# ================================================
# read fasta datafiles using seqinr function read.fasta()
install.packages("seqinr")
library(seqinr)
tmp <- unlist(read.fasta(fa1, seqtype="AA", as.string=FALSE, seqonly=TRUE))
seq1 <- unlist(strsplit(as.character(tmp), split=""))
tmp <- unlist(read.fasta(fa2, seqtype="AA", as.string=FALSE, seqonly=TRUE))
seq2 <- unlist(strsplit(as.character(tmp), split=""))
if (length(seq1) != length(seq2)) {
  # halt here: the pairscore loop below requires equal lengths
  stop("Sequences have unequal length!")
}
lSeq <- length(seq1)
# ================================================
# Read scoring matrix
# ================================================
MDM <- read.table(mdmFile, skip=6)
# This is a dataframe. Study how it can be accessed:
MDM
MDM[1,]
MDM[,1]
MDM[5,5] # Cys-Cys
MDM[20,20] # Val-Val
MDM[,"W"] # the tryptophan column
MDM["R","W"] # Arg-Trp pairscore
MDM["W","R"] # Trp-Arg pairscore: pairscores are symmetric
colnames(MDM) # names of columns
rownames(MDM) # names of rows
colnames(MDM)[3] # third column
rownames(MDM)[12] # twelfth row
# change the two "*" names to "-" so we can use them to score
# indels of the alignment. This is a bit of a hack, since this
# does not reflect the actual indel penalty (which is, as you
# remember from your lectures, calculated as a gap opening
# + gap extension penalty; it can't be calculated in a pairwise
# manner). EMBOSS defaults for BLOSUM62 are opening -10 and
# extension -0.5 i.e. a gap of size 3 (-11.5) has approximately
# the same penalty as a 3-character score of "-" matches (-12)
# so a pairscore of -4 is not entirely unreasonable.
colnames(MDM)[24]
rownames(MDM)[24]
colnames(MDM)[24] <- "-"
rownames(MDM)[24] <- "-"
colnames(MDM)[24]
rownames(MDM)[24]
MDM["Q", "-"]
MDM["-", "D"]
# so far so good.
# ================================================
# Tabulate pairscores for alignment
# ================================================
# It is trivial to create a pairscore vector along the
# length of the aligned sequences.
PS <- vector()
for (i in 1:lSeq) {
  aa1 <- seq1[i]
  aa2 <- seq2[i]
  PS[i] <- MDM[aa1, aa2]
}
PS
# ================================================
# Calculate moving averages
# ================================================
# In order to evaluate the alignment, we will calculate a
# sliding window average over the pairscores. Somewhat surprisingly
# R doesn't (yet) have a native function for moving averages: options
# that are quoted are:
# - rollmean() in the "zoo" package http://rss.acs.unt.edu/Rdoc/library/zoo/html/rollmean.html
# - MovingAverages() in "TTR" http://rss.acs.unt.edu/Rdoc/library/TTR/html/MovingAverages.html
# - ma() in "forecast" http://robjhyndman.com/software/forecast/
# But since this is easy to code, we shall implement it ourselves.
PSma <- vector() # will hold the averages
winS <- floor(window/2) # span of elements above/below the centre
winC <- winS+1 # centre of the window
# extend the vector PS with zeros (virtual observations) above and below
PS <- c(rep(0, winS), PS , rep(0, winS))
# initialize the window score for the first position
winScore <- sum(PS[1:window])
# write the first score to PSma
PSma[1] <- winScore
# Slide the window along the sequence, and recalculate sum()
# Loop from the next position, to the last position that does not exceed the vector...
for (i in (winC + 1):(lSeq + winS)) {
  # subtract the value that has just dropped out of the window
  winScore <- winScore - PS[(i-winS-1)]
  # add the value that has just entered the window
  winScore <- winScore + PS[(i+winS)]
  # put score into PSma
  PSma[i-winS] <- winScore
}
# convert the sums to averages
PSma <- PSma / window
# have a quick look at the score distributions
boxplot(PSma)
hist(PSma)
# ================================================
# Plot the alignment scores
# ================================================
# normalize the scores
PSma <- (PSma-min(PSma))/(max(PSma) - min(PSma) + 0.0001)
# spread the normalized values to a desired range, n
nCol <- 10
PSma <- floor(PSma * nCol) + 1
# Assign a colorspectrum to a vector (with a bit of colormagic,
# don't worry about that for now). Dark colors are poor scores,
# "hot" colors are high scores
spect <- colorRampPalette(c("black", "red", "yellow", "white"), bias=0.4)(nCol)
# Color is an often abused aspect of plotting. One can use color to label
# *quantities* or *qualities*. For the most part, our pairscores measure amino
# acid similarity. That is a quantity and with the spectrum that we just defined
# we associate the measured quantities with the color of a glowing piece
# of metal: we start with black #000000, then first we ramp up the red
# (i.e. low-energy) part of the visible spectrum to red #FF0000, then we
# add and ramp up the green spectrum giving us yellow #FFFF00 and finally we
# add blue, giving us white #FFFFFF. Let's have a look at the spectrum:
s <- rep(1, nCol)
barplot(s, col=spect, axes=F, main="Color spectrum")
# But one aspect of our data is not quantitatively different: indels.
# We valued indels with pairscores of -4. But indels are not simply poor alignment,
# rather they are non-alignment. This means stretches of -4 values are really
# *qualitatively* different. Let's color them differently by changing the lowest
# level of the spectrum to grey.
spect[1] <- "#CCCCCC"
barplot(s, col=spect, axes=F, main="Color spectrum")
# Now we can display our alignment score vector with colored rectangles.
# Convert the integers in PSma to color values from spect
PScol <- vector()
for (i in 1:length(PSma)) {
  PScol[i] <- spect[ PSma[i] ] # this is how a value from PSma is used as an index of spect
}
# Plot the scores. The code is similar to the last assignment.
# Create an empty plot window of appropriate size
plot(1,1, xlim=c(-100, lSeq), ylim=c(0, 2) , type="n", yaxt="n", bty="n", xlab="position in alignment", ylab="")
# Add a label to the left
text (-30, 1, adj=1, labels=c(paste("Mbp1:\n", code1, "\nvs.\n", code2)), cex=0.9 )
# Loop over the vector and draw boxes without border, filled with color.
for (i in 1:lSeq) {
  rect(i, 0.9, i+1, 1.1, border=NA, col=PScol[i])
}
# Note that the numbers along the X-axis are not sequence numbers, but numbers
# of the alignment, i.e. sequence number + indel length. That is important to
# realize: if you would like to add the annotations from the last assignment
# which I will leave as an exercise, you need to map your sequence numbering
# into alignment numbering. Let me know in case you try that but need some help.
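If you want to cross-check the two loops in aliScore.R, the pairscore lookup and the sliding-window average can also be written without explicit loops. The sketch below uses a made-up aligned pair and a tiny score table so that it runs on its own. The variable names mirror the script, but the data are illustrative only. It also compares the result with filter() from base R's stats package, which computes a centered moving average when called with sides=2.

```r
# Made-up aligned pair and a tiny score table (values taken from BLOSUM62,
# with -4 for any pairing of a residue with "-", as in the script)
seq1 <- c("I", "L", "-", "V", "I", "L", "V")
seq2 <- c("I", "V", "L", "V", "-", "L", "I")
aa <- c("I", "L", "V", "-")
MDM <- matrix(c( 4,  2,  3, -4,
                 2,  4,  1, -4,
                 3,  1,  4, -4,
                -4, -4, -4,  1),
              nrow = 4, byrow = TRUE, dimnames = list(aa, aa))
window <- 3

# Vectorized lookup: a two-column character matrix indexes by row/column name.
# (For the data frame in the script, use as.matrix(MDM) first.)
PS <- MDM[cbind(seq1, seq2)]

# Sliding-window average with zero padding, via cumulative sums
n <- length(PS)
winS <- floor(window / 2)
padded <- c(rep(0, winS), PS, rep(0, winS))
cs <- c(0, cumsum(padded))
PSma <- (cs[(1:n) + window] - cs[1:n]) / window

# Same result from stats::filter(), keeping only the original positions
viaFilter <- as.numeric(stats::filter(padded, rep(1 / window, window), sides = 2))
viaFilter <- viaFilter[(winS + 1):(winS + n)]

PS
PSma
all.equal(PSma, viaFilter)
```

The explicit loops in the script are easier to follow step by step, which is the point of the exercise. The vectorized forms are mainly useful for checking your results.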
- That is all.
Links and resources
Notes and references
Footnotes and references
Ask, if things don't work for you!
- If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.
- Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.