Difference between revisions of "BIO Assignment Week 12"

From "A B C"
Jump to navigation Jump to search
m
m
Line 16: Line 16:
  
  
 +
== Some material that is left-over from the major rearrangement==
 +
 +
===Computing alignments===
 +
 +
 +
try two MSA's algorithms and load them in Jalview.
 +
Locally: which one do you prefer? Modify the consensus. Annotate domains.
 +
 +
 +
The EBI has a very convenient [http://www.ebi.ac.uk/Tools/msa/ page to access a number of MSA algorithms]. This is especially convenient when you want to compare, e.g. T-Coffee and Muscle and MAFFT results to see which regions of your alignment are robust. You could use any of these tools, just paste your sequences into a Webform, download the results and load into Jalview. Easy.
 +
 +
But even easier is to calculate the alignments directly from Jalview.  available. (Not today. <small>Bummer.</small>)
 +
 +
 +
No. Claculate an external alignment and import.
 +
 +
;Calculate a MAFFT alignment using the Jalview Web service option:
 +
 +
{{task|1=
 +
#In Jalview, select '''Web Service &rarr; Alignment &rarr; MAFFT with defaults...'''. The alignment is calculated in a few minutes and displayed in a new window.
 +
}}
 +
 +
;Calculate a MAFFT alignment when the Jalview Web service is NOT available:
 +
 +
{{task|1=
 +
#In Jalview, select '''File &rarr; Output to Textbox &rarr; FASTA'''
 +
#Copy the sequences.
 +
#Navigate to the [http://www.ebi.ac.uk/Tools/msa/mafft/ '''MAFFT Input form'''] at the EBI.
 +
#Paste your sequences into the form.
 +
#Click on '''Submit'''.
 +
#Close the Jalview sequence window and either save your MAFFT alignment to file and load in Jalview, or simply ''''File &rarr; Input Alignment &rarr; from Textbox''', paste and click '''New Window'''.
 +
}}
 +
 +
 +
In any case, you should now have an alignment.
 +
 +
{{task|1=
 +
#Choose '''Colour &rarr; Hydrophobicity''' and '''&rarr; by Conservation'''. Then adjust the slider left or right to see which columns are highly conserved. You will notice that the Swi6 sequence that was supposed to align only to the ankyrin domains was in fact aligned to other parts of the sequence as well. This is one part of the MSA that we will have to correct manually and a common problem when aligning sequences of different lengths.
 +
}}
 +
 +
 +
 +
&nbsp;
 +
 +
==R code: load alignment and compute information scores==
 +
<!-- Add sequence weighting and sampling bias correction ? -->
 +
 +
As discussed in the lecture, Shannon information is calculated as the difference between expected and observed entropy, where entropy is the negative sum over probabilities times the log of those probabilities:
 +
 +
 +
*a review of regex range characters +?*{min,max}, and greedy.
 +
*build an AT-hook motif matcher https://en.wikipedia.org/wiki/AT-hook
 +
 +
 +
Here we compute Shannon information scores for aligned positions of the APSES domain, and plot the values in '''R'''. You can try this with any part of your alignment, but I have used only the aligned residues for the APSES domain for my example. This is a good choice for a first try, since there are (almost) no gaps.
 +
 +
{{task|1=
 +
# Export only the sequences of the aligned APSES domains to a file on your computer, in FASTA format as explained below. You could call this: <code>Mbp1_All_APSES.fa</code>.
 +
##Use your mouse and clik and drag to ''select'' the aligned APSES domains in the alignment window.
 +
##Copy your selection to the clipboard.
 +
##Use the main menu (not the menu of your alignment window) and select '''File &rarr; Input alignment &rarr; from Textbox'''; paste the selection into the textbox and click '''New Window'''.
 +
##Use '''File &rarr; save as''' to save the aligned siequences in multi-FASTA format under the filename you want in your '''R''' project directory.
 +
 
 +
# Explore the R-code below. Be sure that you understand it correctly. Note that this code does not implement any sampling bias correction, so positions with large numbers of gaps will receive artificially high scores (the alignment looks like the gap charecter were a conserved character).
 +
 +
 +
<source lang="rsplus">
 +
 +
# CalculateInformation.R
 +
# Calculate Shannon information for positions in a multiple sequence alignment.
 +
# Requires: an MSA in multi FASTA format
 +
 +
# It is good practice to set variables you might want to change
 +
# in a header block so you don't need to hunt all over the code
 +
# for strings you need to update.
 +
#
 +
setwd("/your/R/working/directory")
 +
mfa      <- "MBP1_All_APSES.fa"
 +
 +
# ================================================
 +
#    Read sequence alignment fasta file
 +
# ================================================
 +
 +
# read MFA datafile using seqinr function read.fasta()
 +
library(seqinr)
 +
tmp  <- read.alignment(mfa, format="fasta")
 +
MSA  <- as.matrix(tmp)  # convert the list into a characterwise matrix
 +
                        # with appropriate row and column names using
 +
                        # the seqinr function as.matrix.alignment()
 +
                        # You could have a look under the hood of this
 +
                        # function to understand beter how to convert a
 +
                        # list into something else ... simply type
 +
                        # "as.matrix.alignment" - without the parentheses
 +
                        # to retrieve the function source code (as for any
 +
                        # function btw).
 +
 +
### Explore contents of and access to the matrix of sequences
 +
MSA
 +
MSA[1,]
 +
MSA[,1]
 +
length(MSA[,1])
 +
 +
 +
# ================================================
 +
#    define function to calculate entropy
 +
# ================================================
 +
 +
entropy <- function(v) { # calculate shannon entropy for the aa vector v
 +
                    # Note: we are not correcting for small sample sizes
 +
                    # here. Thus if there are a large number of gaps in
 +
                    # the alignment, this will look like small entropy
 +
                    # since only a few amino acids are present. In the
 +
                    # extreme case: if a position is only present in
 +
                    # one sequence, that one amino acid will be treated
 +
                    # as 100% conserved - zero entropy. Sampling error
 +
                    # corrections are discussed eg. in Schneider et al.
 +
                    # (1986) JMB 188:414
 +
l <- length(v)
 +
a <- rep(0, 21)      # initialize a vector with 21 elements (20 aa plus gap)
 +
                    # the set the name of each row to the one letter
 +
                    # code. Through this, we can access a row by its
 +
                    # one letter code.
 +
names(a)  <- unlist(strsplit("acdefghiklmnpqrstvwy-", ""))
 +
 +
for (i in 1:l) {      # for the whole vector of amino acids
 +
c <- v[i]          # retrieve the character
 +
a[c] <- a[c] + 1  # increment its count by one
 +
} # note: we could also have used the table() function for this
 +
 +
tot <- sum(a) - a["-"] # calculate number of observed amino acids
 +
                      # i.e. subtract gaps
 +
a <- a/tot            # frequency is observations of one amino acid
 +
                      # divided by all observations. We assume that
 +
                      # frequency equals probability.
 +
a["-"] <- 0                             
 +
for (i in 1:length(a)) {
 +
if (a[i] != 0) { # if a[i] is not zero, otherwise leave as is.
 +
            # By definition, 0*log(0) = 0  but R calculates
 +
            # this in parts and returns NaN for log(0).
 +
a[i] <- a[i] * (log(a[i])/log(2)) # replace a[i] with
 +
                                  # p(i) log_2(p(i))
 +
}
 +
}
 +
return(-sum(a)) # return Shannon entropy
 +
}
 +
 +
# ================================================
 +
#    calculate entropy for reference distribution
 +
#    (from UniProt, c.f. Assignment 2)
 +
# ================================================
 +
 +
refData <- c(
 +
    "A"=8.26,
 +
    "Q"=3.93,
 +
    "L"=9.66,
 +
    "S"=6.56,
 +
    "R"=5.53,
 +
    "E"=6.75,
 +
    "K"=5.84,
 +
    "T"=5.34,
 +
    "N"=4.06,
 +
    "G"=7.08,
 +
    "M"=2.42,
 +
    "W"=1.08,
 +
    "D"=5.45,
 +
    "H"=2.27,
 +
    "F"=3.86,
 +
    "Y"=2.92,
 +
    "C"=1.37,
 +
    "I"=5.96,
 +
    "P"=4.70,
 +
    "V"=6.87
 +
    )
 +
 +
### Calculate the entropy of this distribution
 +
 +
H.ref <- 0
 +
for (i in 1:length(refData)) {
 +
p <- refData[i]/sum(refData) # convert % to probabilities
 +
    H.ref <- H.ref - (p * (log(p)/log(2)))
 +
}
 +
 +
# ================================================
 +
#    calculate information for each position of
 +
#    multiple sequence alignment
 +
# ================================================
 +
 +
lAli <- dim(MSA)[2] # length of row in matrix is second element of dim(<matrix>).
 +
I <- rep(0, lAli)  # initialize result vector
 +
for (i in 1:lAli) {
 +
I[i] = H.ref - entropy(MSA[,i])  # I = H_ref - H_obs
 +
}
 +
 +
### evaluate I
 +
I
 +
quantile(I)
 +
hist(I)
 +
plot(I)
 +
 +
# you can see that we have quite a large number of columns with the same,
 +
# high value ... what are these?
 +
 +
which(I > 4)
 +
MSA[,which(I > 4)]
 +
 +
# And what is in the columns with low values?
 +
MSA[,which(I < 1.5)]
 +
 +
 +
# ===================================================
 +
#    plot the information
 +
#    (c.f. Assignment 5, see there for explanations)
 +
# ===================================================
 +
 +
IP <- (I-min(I))/(max(I) - min(I) + 0.0001)
 +
nCol <- 15
 +
IP <- floor(IP * nCol) + 1
 +
spect <- colorRampPalette(c("#DD0033", "#00BB66", "#3300DD"), bias=0.6)(nCol)
 +
# lets set the information scores from single informations to grey. We 
 +
# change the highest level of the spectrum to grey.
 +
#spect[nCol] <- "#CCCCCC"
 +
Icol <- vector()
 +
for (i in 1:length(I)) {
 +
Icol[i] <- spect[ IP[i] ]
 +
}
 +
 +
plot(1,1, xlim=c(0, lAli), ylim=c(-0.5, 5) ,
 +
    type="n", bty="n", xlab="position in alignment", ylab="Information (bits)")
 +
 +
# plot as rectangles: height is information and color is coded to information
 +
for (i in 1:lAli) {
 +
  rect(i, 0, i+1, I[i], border=NA, col=Icol[i])
 +
}
 +
 +
# As you can see, some of the columns reach very high values, but they are not
 +
# contiguous in sequence. Are they contiguous in structure? We will find out in
 +
# a later assignment, when we map computed values to structure.
 +
 +
</source>
 +
}}
 +
 +
 +
[[Image:InformationPlot.jpg|frame|none|Plot of information vs. sequence position produced by the '''R''' script above, for an alignment of Mbp1 ortholog APSES domains.]]
 +
 +
== Calculating conservation scores ==
 +
 +
 +
 +
{{task|1=
 +
 +
* Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
 +
 +
<source lang="R">
 +
# BiostringsExample.R
 +
# Short tutorial on sequence alignment with the Biostrings package.
 +
# Boris Steipe for BCH441, 2013 - 2014
 +
#
 +
setwd("~/path/to/your/R_files/")
 +
setwd("~/Documents/07.TEACHING/37-BCH441 Bioinformatics 2014/05-Materials/Assignment_5 data")
 +
 +
# Biostrings is a package within the bioconductor project.
 +
# bioconducter packages have their own installation system,
 +
# they are normally not installed via CRAN.
 +
 +
# First, you load the BioConductor installer...
 +
source("http://bioconductor.org/biocLite.R")
 +
 +
# Then you can install the Biostrings package and all of its dependencies.
 +
biocLite("Biostrings")
 +
 +
# ... and load the library.
 +
library(Biostrings)
 +
 +
# Some basic (technical) information is available ...
 +
library(help=Biostrings)
 +
 +
# ... but for more in depth documentation, use the
 +
# so called "vignettes" that are provided with every R package.
 +
browseVignettes("Biostrings")
 +
 +
# In this code, we mostly use functions that are discussed in the
 +
# pairwise alignement vignette.
 +

# Read in two fasta files - you will need to edit this for YFO
 +
sacce <- readAAStringSet("mbp1-sacce.fa", format="fasta")
 +
 +
# "USTMA" is used only as an example here - modify for YFO  :-)
 +
ustma <- readAAStringSet("mbp1-ustma.fa", format="fasta")
 +
 +
sacce
 +
names(sacce)
 +
names(sacce) <- "Mbp1 SACCE"
 +
names(ustma) <- "Mbp1 USTMA" # Example only ... modify for YFO
 +
 +
width(sacce)
 +
as.character(sacce)
 +
 +
# Biostrings takes a sophisticated approach to sequence alignment ...
 +
?pairwiseAlignment
 +
 +
# ... but the use in practice is quite simple:
 +
ali <- pairwiseAlignment(sacce, ustma, substitutionMatrix = "BLOSUM50")
 +
ali
 +
 +
pattern(ali)
 +
subject(ali)
 +
 +
writePairwiseAlignments(ali)
 +
 +
p <- aligned(pattern(ali))
 +
names(p) <- "Mbp1 SACCE aligned"
 +
s <- aligned(subject(ali))
 +
names(s) <- "Mbp1 USTMA aligned"
 +
 +
# don't overwrite your EMBOSS .fal files
 +
writeXStringSet(p, "mbp1-sacce.R.fal", append=FALSE, format="fasta")
 +
writeXStringSet(s, "mbp1-ustma.R.fal", append=FALSE, format="fasta")
 +
 +
# Done.
 +
 +
</source>
 +
 +
* Compare the alignments you received from the EMBOSS server, and that you computed using '''R'''. Are they approximately the same? Exactly? You did use different matrices and gap parameters, so minor differences are to be expected. But by and large you should get the same alignments.
 +
 +
}}
 +
 +
We will now use the aligned sequences to compute a graphical display of alignment quality.
 +
 +
 +
{{task|1=
 +
 +
* Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
 +
 +
<source lang="R">
 +
# aliScore.R
 +
# Evaluating an alignment with a sliding window score
 +
# Boris Steipe, October 2012. Update October 2013
 +
setwd("~/path/to/your/R_files/")
 +
 +
# Scoring matrices can be found at the NCBI.
 +
# ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
 +
 +
# It is good practice to set variables you might want to change
 +
# in a header block so you don't need to hunt all over the code
 +
# for strings you need to update.
 +
#
 +
fa1      <- "mbp1-sacce.R.fal"
 +
fa2      <- "mbp1-ustma.R.fal"
 +
code1    <- "SACCE"
 +
code2    <- "USTMA"
 +
mdmFile  <- "BLOSUM62.mdm"
 +
window  <- 9  # window-size (should be an odd integer)
 +
 +
# ================================================
 +
#    Read data files
 +
# ================================================
 +
 +
# read fasta datafiles using seqinr function read.fasta()
 +
install.packages("seqinr")
 +
library(seqinr)
 +
tmp  <- unlist(read.fasta(fa1, seqtype="AA", as.string=FALSE, seqonly=TRUE))
 +
seq1 <- unlist(strsplit(as.character(tmp), split=""))
 +
 +
tmp  <- unlist(read.fasta(fa2, seqtype="AA", as.string=FALSE, seqonly=TRUE))
 +
seq2 <- unlist(strsplit(as.character(tmp), split=""))
 +
 +
if (length(seq1) != length(seq2)) {
 +
print("Error: Sequences have unequal length!")
 +
}
 +
 +
lSeq <- length(seq1)
 +
 +
# ================================================
 +
#    Read scoring matrix
 +
# ================================================
 +
 +
MDM <- read.table(mdmFile, skip=6)
 +
 +
# This is a dataframe. Study how it can be accessed:
 +
 +
MDM
 +
MDM[1,]
 +
MDM[,1]
 +
MDM[5,5]  # Cys-Cys
 +
MDM[20,20] # Val-Val
 +
MDM[,"W"]  # the tryptophan column
 +
MDM["R","W"]  # Arg-Trp pairscore
 +
MDM["W","R"]  # Trp-Arg pairscore: pairscores are symmetric
 +
 +
colnames(MDM)  # names of columns
 +
rownames(MDM)  # names of rows
 +
colnames(MDM)[3]  # third column
 +
rownames(MDM)[12]  # twelfth row
 +
 +
# change the two "*" names to "-" so we can use them to score
 +
# indels of the alignment. This is a bit of a hack, since this
 +
# does not reflect the actual indel penalties (which is, as you)
 +
# remember from your lectures, calculated as a gap opening
 +
# + gap extension penalty; it can't be calculated in a pairwise
 +
# manner) EMBOSS defaults for BLODSUM62 are opening -10 and
 +
# extension -0.5 i.e. a gap of size 3 (-11.5) has approximately
 +
# the same penalty as a 3-character score of "-" matches (-12)
 +
# so a pairscore of -4 is not entirely unreasonable.
 +
 +
colnames(MDM)[24]
 +
rownames(MDM)[24]
 +
colnames(MDM)[24] <- "-"
 +
rownames(MDM)[24] <- "-"
 +
colnames(MDM)[24]
 +
rownames(MDM)[24]
 +
MDM["Q", "-"]
 +
MDM["-", "D"]
 +
# so far so good.
 +
 +
# ================================================
 +
#    Tabulate pairscores for alignment
 +
# ================================================
 +
 +
 +
# It is trivial to create a pairscore vector along the
 +
# length of the aligned sequences.
 +
 +
PS <- vector()
 +
for (i in 1:lSeq) {
 +
  aa1 <- seq1[i]
 +
  aa2 <- seq2[i]
 +
  PS[i] = MDM[aa1, aa2]
 +
}
 +
 +
PS
 +
 +
 +
# The same vector could be created - albeit perhaps not so
 +
# easy to understand - with the expression ...
 +
MDM[cbind(seq1,seq2)]
 +
 +
 +
 +
# ================================================
 +
#    Calculate moving averages
 +
# ================================================
 +
 +
# In order to evaluate the alignment, we will calculate a
 +
# sliding window average over the pairscores. Somewhat surprisingly
 +
# R doesn't (yet) have a native function for moving averages: options
 +
# that are quoted are:
 +
#  - rollmean() in the "zoo" package http://rss.acs.unt.edu/Rdoc/library/zoo/html/rollmean.html
 +
#  - MovingAverages() in "TTR"  http://rss.acs.unt.edu/Rdoc/library/TTR/html/MovingAverages.html
 +
#  - ma() in "forecast"  http://robjhyndman.com/software/forecast/
 +
# But since this is easy to code, we shall implement it ourselves.
 +
 +
PSma <- vector()          # will hold the averages
 +
winS <- floor(window/2)    # span of elements above/below the centre
 +
winC <- winS+1            # centre of the window
 +
 +
# extend the vector PS with zeros (virtual observations) above and below
 +
PS <- c(rep(0, winS), PS , rep(0, winS))
 +
 +
# initialize the window score for the first position
 +
winScore <- sum(PS[1:window])
 +
 +
# write the first score to PSma
 +
PSma[1] <- winScore
 +
 +
# Slide the window along the sequence, and recalculate sum()
 +
# Loop from the next position, to the last position that does not exceed the vector...
 +
for (i in (winC + 1):(lSeq + winS)) {
 +
  # subtract the value that has just dropped out of the window
 +
  winScore <- winScore - PS[(i-winS-1)]
 +
  # add the value that has just entered the window
 +
  winScore <- winScore + PS[(i+winS)] 
 +
  # put score into PSma
 +
  PSma[i-winS] <- winScore
 +
}
 +
 +
# convert the sums to averages
 +
PSma <- PSma / window
 +
 +
# have a quick look at the score distributions
 +
 +
boxplot(PSma)
 +
hist(PSma)
 +
 +
# ================================================
 +
#    Plot the alignment scores
 +
# ================================================
 +
 +
# normalize the scores
 +
PSma <- (PSma-min(PSma))/(max(PSma) - min(PSma) + 0.0001)
 +
# spread the normalized values to a desired range, n
 +
nCol <- 10
 +
PSma <- floor(PSma * nCol) + 1
 +
 +
# Assign a colorspectrum to a vector (with a bit of colormagic,
 +
# don't worry about that for now). Dark colors are poor scores,
 +
# "hot" colors are high scores
 +
spect <- colorRampPalette(c("black", "red", "yellow", "white"), bias=0.4)(nCol)
 +
 +
# Color is an often abused aspect of plotting. One can use color to label
 +
# *quantities* or *qualities*. For the most part, our pairscores measure amino
 +
# acid similarity. That is a quantity and with the spectrum that we just defined
 +
# we associte the measured quantities with the color of a glowing piece
 +
# of metal: we start with black #000000, then first we ramp up the red
 +
# (i.e. low-energy) part of the visible spectrum to red #FF0000, then we
 +
# add and ramp up the green spectrum giving us yellow #FFFF00 and finally we
 +
# add blue, giving us white #FFFFFF. Let's have a look at the spectrum:
 +
 +
s <- rep(1, nCol)
 +
barplot(s, col=spect, axes=F, main="Color spectrum")
 +
 +
# But one aspect of our data is not quantitatively different: indels.
 +
# We valued indels with pairscores of -4. But indels are not simply poor alignment,
 +
# rather they are non-alignment. This means stretches of -4 values are really
 +
# *qualitatively* different. Let's color them differently by changing the lowest
 +
# level of the spectrum to grey.
 +
 +
spect[1] <- "#CCCCCC"
 +
barplot(s, col=spect, axes=F, main="Color spectrum")
 +
 +
# Now we can display our alignment score vector with colored rectangles.
 +
 +
# Convert the integers in PSma to color values from spect
 +
PScol <- vector()
 +
for (i in 1:length(PSma)) {
 +
PScol[i] <- spect[ PSma[i] ]  # this is how a value from PSma is used as an index of spect
 +
}
 +
 +
# Plot the scores. The code is similar to the last assignment.
 +
# Create an empty plot window of appropriate size
 +
plot(1,1, xlim=c(-100, lSeq), ylim=c(0, 2) , type="n", yaxt="n", bty="n", xlab="position in alignment", ylab="")
 +
 +
# Add a label to the left
 +
text (-30, 1, adj=1, labels=c(paste("Mbp1:\n", code1, "\nvs.\n", code2)), cex=0.9 )
 +
 +
# Loop over the vector and draw boxes  without border, filled with color.
 +
for (i in 1:lSeq) {
 +
  rect(i, 0.9, i+1, 1.1, border=NA, col=PScol[i])
 +
}
 +
 +
# Note that the numbers along the X-axis are not sequence numbers, but numbers
 +
# of the alignment, i.e. sequence number + indel length. That is important to
 +
# realize: if you would like to add the annotations from the last assignment
 +
# which I will leave as an exercise, you need to map your sequence numbering
 +
# into alignment numbering. Let me know in case you try that but need some help.
 +
 +
</source>
 +
}}
 +
 +
 +
&nbsp;
  
  

Revision as of 23:56, 12 October 2015

Assignment for Week 12
No more ...

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 


Some material that is left-over from the major rearrangement

Computing alignments

try two MSA's algorithms and load them in Jalview.
Locally: which one do you prefer? Modify the consensus. Annotate domains.


The EBI has a very convenient page to access a number of MSA algorithms. This is especially convenient when you want to compare, e.g. T-Coffee and Muscle and MAFFT results to see which regions of your alignment are robust. You could use any of these tools, just paste your sequences into a Webform, download the results and load into Jalview. Easy.

But even easier is to calculate the alignments directly from Jalview. available. (Not today. Bummer.)


No. Claculate an external alignment and import.
Calculate a MAFFT alignment using the Jalview Web service option

Task:

  1. In Jalview, select Web Service → Alignment → MAFFT with defaults.... The alignment is calculated in a few minutes and displayed in a new window.
Calculate a MAFFT alignment when the Jalview Web service is NOT available

Task:

  1. In Jalview, select File → Output to Textbox → FASTA
  2. Copy the sequences.
  3. Navigate to the MAFFT Input form at the EBI.
  4. Paste your sequences into the form.
  5. Click on Submit.
  6. Close the Jalview sequence window and either save your MAFFT alignment to file and load in Jalview, or simply 'File → Input Alignment → from Textbox, paste and click New Window.


In any case, you should now have an alignment.

Task:

  1. Choose Colour → Hydrophobicity and → by Conservation. Then adjust the slider left or right to see which columns are highly conserved. You will notice that the Swi6 sequence that was supposed to align only to the ankyrin domains was in fact aligned to other parts of the sequence as well. This is one part of the MSA that we will have to correct manually and a common problem when aligning sequences of different lengths.


 

R code: load alignment and compute information scores

As discussed in the lecture, Shannon information is calculated as the difference between expected and observed entropy, where entropy is the negative sum over probabilities times the log of those probabilities:


*a review of regex range characters +?*{min,max}, and greedy.
*build an AT-hook motif matcher https://en.wikipedia.org/wiki/AT-hook


Here we compute Shannon information scores for aligned positions of the APSES domain, and plot the values in R. You can try this with any part of your alignment, but I have used only the aligned residues for the APSES domain for my example. This is a good choice for a first try, since there are (almost) no gaps.

Task:

  1. Export only the sequences of the aligned APSES domains to a file on your computer, in FASTA format as explained below. You could call this: Mbp1_All_APSES.fa.
    1. Use your mouse and clik and drag to select the aligned APSES domains in the alignment window.
    2. Copy your selection to the clipboard.
    3. Use the main menu (not the menu of your alignment window) and select File → Input alignment → from Textbox; paste the selection into the textbox and click New Window.
    4. Use File → save as to save the aligned siequences in multi-FASTA format under the filename you want in your R project directory.
  1. Explore the R-code below. Be sure that you understand it correctly. Note that this code does not implement any sampling bias correction, so positions with large numbers of gaps will receive artificially high scores (the alignment looks like the gap charecter were a conserved character).


# CalculateInformation.R
# Calculate Shannon information for positions in a multiple sequence alignment.
# Requires: an MSA in multi FASTA format
 
# It is good practice to set variables you might want to change
# in a header block so you don't need to hunt all over the code
# for strings you need to update.
#
setwd("/your/R/working/directory")
mfa      <- "MBP1_All_APSES.fa"
 
# ================================================
#    Read sequence alignment fasta file
# ================================================
 
# read MFA datafile using seqinr function read.fasta()
library(seqinr)
tmp  <- read.alignment(mfa, format="fasta")
MSA  <- as.matrix(tmp)  # convert the list into a characterwise matrix
                        # with appropriate row and column names using
                        # the seqinr function as.matrix.alignment()
                        # You could have a look under the hood of this
                        # function to understand beter how to convert a
                        # list into something else ... simply type
                        # "as.matrix.alignment" - without the parentheses
                        # to retrieve the function source code (as for any
                        # function btw).

### Explore contents of and access to the matrix of sequences
MSA
MSA[1,]
MSA[,1]
length(MSA[,1])


# ================================================
#    define function to calculate entropy
# ================================================

entropy <- function(v) { # calculate shannon entropy for the aa vector v
	                     # Note: we are not correcting for small sample sizes
	                     # here. Thus if there are a large number of gaps in
	                     # the alignment, this will look like small entropy
	                     # since only a few amino acids are present. In the 
	                     # extreme case: if a position is only present in 
	                     # one sequence, that one amino acid will be treated
	                     # as 100% conserved - zero entropy. Sampling error
	                     # corrections are discussed eg. in Schneider et al.
	                     # (1986) JMB 188:414
	l <- length(v)
	a <- rep(0, 21)      # initialize a vector with 21 elements (20 aa plus gap)
	                     # the set the name of each row to the one letter
	                     # code. Through this, we can access a row by its
	                     # one letter code.
	names(a)  <- unlist(strsplit("acdefghiklmnpqrstvwy-", ""))
	
	for (i in 1:l) {       # for the whole vector of amino acids
		c <- v[i]          # retrieve the character
		a[c] <- a[c] + 1   # increment its count by one
	} # note: we could also have used the table() function for this
	
	tot <- sum(a) - a["-"] # calculate number of observed amino acids
	                       # i.e. subtract gaps
	a <- a/tot             # frequency is observations of one amino acid
	                       # divided by all observations. We assume that
	                       # frequency equals probability.
	a["-"] <- 0       	                        
	for (i in 1:length(a)) {
		if (a[i] != 0) { # if a[i] is not zero, otherwise leave as is.
			             # By definition, 0*log(0) = 0  but R calculates
			             # this in parts and returns NaN for log(0).
			a[i] <- a[i] * (log(a[i])/log(2)) # replace a[i] with
			                                  # p(i) log_2(p(i))
		}
	}
	return(-sum(a)) # return Shannon entropy
}

# ================================================
#    calculate entropy for reference distribution
#    (from UniProt, c.f. Assignment 2)
# ================================================

refData <- c(
    "A"=8.26,
    "Q"=3.93,
    "L"=9.66,
    "S"=6.56,
    "R"=5.53,
    "E"=6.75,
    "K"=5.84,
    "T"=5.34,
    "N"=4.06,
    "G"=7.08,
    "M"=2.42,
    "W"=1.08,
    "D"=5.45,
    "H"=2.27,
    "F"=3.86,
    "Y"=2.92,
    "C"=1.37,
    "I"=5.96,
    "P"=4.70,
    "V"=6.87
    )

### Calculate the entropy of this distribution

H.ref <- 0
for (i in 1:length(refData)) {
	p <- refData[i]/sum(refData) # convert % to probabilities
    H.ref <- H.ref - (p * (log(p)/log(2)))
}

# ================================================
#    calculate information for each position of 
#    multiple sequence alignment
# ================================================

lAli <- dim(MSA)[2] # length of row in matrix is second element of dim(<matrix>).
I <- rep(0, lAli)   # initialize result vector
for (i in 1:lAli) { 
	I[i] = H.ref - entropy(MSA[,i])  # I = H_ref - H_obs
}

### evaluate I
I
quantile(I)
hist(I)
plot(I)

# you can see that we have quite a large number of columns with the same,
# high value ... what are these?

which(I > 4)
MSA[,which(I > 4)]

# And what is in the columns with low values?
MSA[,which(I < 1.5)]


# ===================================================
#    plot the information
#    (c.f. Assignment 5, see there for explanations)
# ===================================================

IP <- (I-min(I))/(max(I) - min(I) + 0.0001)
nCol <- 15
IP <- floor(IP * nCol) + 1
spect <- colorRampPalette(c("#DD0033", "#00BB66", "#3300DD"), bias=0.6)(nCol)
# lets set the information scores from single informations to grey. We   
# change the highest level of the spectrum to grey.
#spect[nCol] <- "#CCCCCC"
Icol <- vector()
for (i in 1:length(I)) {
	Icol[i] <- spect[ IP[i] ] 
}
 
plot(1,1, xlim=c(0, lAli), ylim=c(-0.5, 5) ,
     type="n", bty="n", xlab="position in alignment", ylab="Information (bits)")

# plot as rectangles: height is information and color is coded to information
for (i in 1:lAli) {
   rect(i, 0, i+1, I[i], border=NA, col=Icol[i])
}

# As you can see, some of the columns reach very high values, but they are not
# contiguous in sequence. Are they contiguous in structure? We will find out in
# a later assignment, when we map computed values to structure.


Plot of information vs. sequence position produced by the R script above, for an alignment of Mbp1 ortholog APSES domains.

Calculating conservation scores

Task:

  • Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
# BiostringsExample.R
# Short tutorial on sequence alignment with the Biostrings package.
# Boris Steipe for BCH441, 2013 - 2014
#
setwd("~/path/to/your/R_files/")
setwd("~/Documents/07.TEACHING/37-BCH441 Bioinformatics 2014/05-Materials/Assignment_5 data")

# Biostrings is a package within the bioconductor project.
# bioconducter packages have their own installation system,
# they are normally not installed via CRAN.

# First, you load the BioConductor installer...
source("http://bioconductor.org/biocLite.R")

# Then you can install the Biostrings package and all of its dependencies.
biocLite("Biostrings")

# ... and load the library.
library(Biostrings)

# Some basic (technical) information is available ...
library(help=Biostrings)

# ... but for more in depth documentation, use the
# so called "vignettes" that are provided with every R package.
browseVignettes("Biostrings")

# In this code, we mostly use functions that are discussed in the 
# pairwise alignement vignette.

# Read in two fasta files - you will need to edit this for YFO
sacce <- readAAStringSet("mbp1-sacce.fa", format="fasta")

# "USTMA" is used only as an example here - modify for YFO  :-)
ustma <- readAAStringSet("mbp1-ustma.fa", format="fasta")

sacce
names(sacce) 
names(sacce) <- "Mbp1 SACCE"
names(ustma) <- "Mbp1 USTMA" # Example only ... modify for YFO

width(sacce)
as.character(sacce)

# Biostrings takes a sophisticated approach to sequence alignment ...
?pairwiseAlignment

# ... but the use in practice is quite simple:
ali <- pairwiseAlignment(sacce, ustma, substitutionMatrix = "BLOSUM50")
ali

pattern(ali)
subject(ali)

writePairwiseAlignments(ali)

p <- aligned(pattern(ali))
names(p) <- "Mbp1 SACCE aligned"
s <- aligned(subject(ali))
names(s) <- "Mbp1 USTMA aligned"

# don't overwrite your EMBOSS .fal files
writeXStringSet(p, "mbp1-sacce.R.fal", append=FALSE, format="fasta")
writeXStringSet(s, "mbp1-ustma.R.fal", append=FALSE, format="fasta")

# Done.
  • Compare the alignments you received from the EMBOSS server, and that you computed using R. Are they approximately the same? Exactly? You did use different matrices and gap parameters, so minor differences are to be expected. But by and large you should get the same alignments.

We will now use the aligned sequences to compute a graphical display of alignment quality.


Task:

  • Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
# aliScore.R
# Evaluating an alignment with a sliding window score
# Boris Steipe, October 2012. Update October 2013
setwd("~/path/to/your/R_files/")

# Scoring matrices can be found at the NCBI. 
# ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62

# It is good practice to set variables you might want to change
# in a header block so you don't need to hunt all over the code
# for strings you need to update.
#
fa1      <- "mbp1-sacce.R.fal"
fa2      <- "mbp1-ustma.R.fal"
code1    <- "SACCE"
code2    <- "USTMA"
mdmFile  <- "BLOSUM62.mdm"
window   <- 9   # window-size (should be an odd integer)

# ================================================
#    Read data files
# ================================================

# read fasta datafiles using seqinr function read.fasta()
install.packages("seqinr")
library(seqinr)
tmp  <- unlist(read.fasta(fa1, seqtype="AA", as.string=FALSE, seqonly=TRUE))
seq1 <- unlist(strsplit(as.character(tmp), split=""))

tmp  <- unlist(read.fasta(fa2, seqtype="AA", as.string=FALSE, seqonly=TRUE))
seq2 <- unlist(strsplit(as.character(tmp), split=""))

if (length(seq1) != length(seq2)) {
	print("Error: Sequences have unequal length!")
	}
	
lSeq <- length(seq1)

# ================================================
#    Read scoring matrix
# ================================================

MDM <- read.table(mdmFile, skip=6)

# This is a dataframe. Study how it can be accessed:

MDM
MDM[1,]
MDM[,1]
MDM[5,5]   # Cys-Cys
MDM[20,20] # Val-Val
MDM[,"W"]  # the tryptophan column
MDM["R","W"]  # Arg-Trp pairscore
MDM["W","R"]  # Trp-Arg pairscore: pairscores are symmetric

colnames(MDM)  # names of columns
rownames(MDM)  # names of rows
colnames(MDM)[3]   # third column
rownames(MDM)[12]  # twelfth row

# change the two "*" names to "-" so we can use them to score
# indels of the alignment. This is a bit of a hack, since this
# does not reflect the actual indel penalties (which is, as you)
# remember from your lectures, calculated as a gap opening
# + gap extension penalty; it can't be calculated in a pairwise
# manner) EMBOSS defaults for BLODSUM62 are opening -10 and
# extension -0.5 i.e. a gap of size 3 (-11.5) has approximately
# the same penalty as a 3-character score of "-" matches (-12)
# so a pairscore of -4 is not entirely unreasonable.

colnames(MDM)[24] 
rownames(MDM)[24]
colnames(MDM)[24] <- "-"
rownames(MDM)[24] <- "-"
colnames(MDM)[24] 
rownames(MDM)[24]
MDM["Q", "-"]
MDM["-", "D"]
# so far so good.

# ================================================
#    Tabulate pairscores for alignment
# ================================================


# It is trivial to create a pairscore vector along the
# length of the aligned sequences.

PS <- vector()
for (i in 1:lSeq) {
   aa1 <- seq1[i] 
   aa2 <- seq2[i] 
   PS[i] = MDM[aa1, aa2]
}

PS


# The same vector could be created - albeit perhaps not so
# easy to understand - with the expression ...
MDM[cbind(seq1,seq2)]



# ================================================
#    Calculate moving averages
# ================================================

# In order to evaluate the alignment, we will calculate a 
# sliding window average over the pairscores. Somewhat surprisingly
# R doesn't (yet) have a native function for moving averages: options
# that are quoted are:
#   - rollmean() in the "zoo" package http://rss.acs.unt.edu/Rdoc/library/zoo/html/rollmean.html
#   - MovingAverages() in "TTR"  http://rss.acs.unt.edu/Rdoc/library/TTR/html/MovingAverages.html
#   - ma() in "forecast"  http://robjhyndman.com/software/forecast/
# But since this is easy to code, we shall implement it ourselves.

PSma <- vector()           # will hold the averages
winS <- floor(window/2)    # span of elements above/below the centre
winC <- winS+1             # centre of the window

# extend the vector PS with zeros (virtual observations) above and below
PS <- c(rep(0, winS), PS , rep(0, winS))

# initialize the window score for the first position
winScore <- sum(PS[1:window])

# write the first score to PSma
PSma[1] <- winScore

# Slide the window along the sequence, and recalculate sum()
# Loop from the next position, to the last position that does not exceed the vector...
for (i in (winC + 1):(lSeq + winS)) { 
   # subtract the value that has just dropped out of the window
   winScore <- winScore - PS[(i-winS-1)] 
   # add the value that has just entered the window
   winScore <- winScore + PS[(i+winS)]  
   # put score into PSma
   PSma[i-winS] <- winScore
}

# convert the sums to averages
PSma <- PSma / window

# have a quick look at the score distributions

boxplot(PSma)
hist(PSma)

# ================================================
#    Plot the alignment scores
# ================================================

# normalize the scores 
PSma <- (PSma-min(PSma))/(max(PSma) - min(PSma) + 0.0001)
# spread the normalized values to a desired range, n
nCol <- 10
PSma <- floor(PSma * nCol) + 1

# Assign a colorspectrum to a vector (with a bit of colormagic,
# don't worry about that for now). Dark colors are poor scores,
# "hot" colors are high scores
spect <- colorRampPalette(c("black", "red", "yellow", "white"), bias=0.4)(nCol)

# Color is an often abused aspect of plotting. One can use color to label
# *quantities* or *qualities*. For the most part, our pairscores measure amino
# acid similarity. That is a quantity and with the spectrum that we just defined
# we associte the measured quantities with the color of a glowing piece
# of metal: we start with black #000000, then first we ramp up the red
# (i.e. low-energy) part of the visible spectrum to red #FF0000, then we
# add and ramp up the green spectrum giving us yellow #FFFF00 and finally we 
# add blue, giving us white #FFFFFF. Let's have a look at the spectrum:

s <- rep(1, nCol)
barplot(s, col=spect, axes=F, main="Color spectrum")

# But one aspect of our data is not quantitatively different: indels.
# We valued indels with pairscores of -4. But indels are not simply poor alignment, 
# rather they are non-alignment. This means stretches of -4 values are really 
# *qualitatively* different. Let's color them differently by changing the lowest 
# level of the spectrum to grey.

spect[1] <- "#CCCCCC"
barplot(s, col=spect, axes=F, main="Color spectrum")

# Now we can display our alignment score vector with colored rectangles.

# Convert the integers in PSma to color values from spect
PScol <- vector()
for (i in 1:length(PSma)) {
	PScol[i] <- spect[ PSma[i] ]  # this is how a value from PSma is used as an index of spect
}

# Plot the scores. The code is similar to the last assignment.
# Create an empty plot window of appropriate size
plot(1,1, xlim=c(-100, lSeq), ylim=c(0, 2) , type="n", yaxt="n", bty="n", xlab="position in alignment", ylab="")

# Add a label to the left
text (-30, 1, adj=1, labels=c(paste("Mbp1:\n", code1, "\nvs.\n", code2)), cex=0.9 )

# Loop over the vector and draw boxes  without border, filled with color.
for (i in 1:lSeq) {
   rect(i, 0.9, i+1, 1.1, border=NA, col=PScol[i])
}

# Note that the numbers along the X-axis are not sequence numbers, but numbers
# of the alignment, i.e. sequence number + indel length. That is important to
# realize: if you would like to add the annotations from the last assignment 
# which I will leave as an exercise, you need to map your sequence numbering
# into alignment numbering. Let me know in case you try that but need some help.


 


That is all.


 

Links and resources

 


Footnotes and references


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.