Difference between revisions of "BIO Assignment Week 4"

From "A B C"
Jump to navigation Jump to search
m
m
Line 2: Line 2:
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 4<br />
 
Assignment for Week 4<br />
<span style="font-size: 70%">Domain annotations</span>
+
<span style="font-size: 70%">Sequence alignment</span>
 
</div>
 
</div>
 
<table style="width:100%;"><tr>
 
<table style="width:100%;"><tr>
Line 16: Line 16:
 
__TOC__
 
__TOC__
  
 +
&nbsp;
  
&nbsp;
 
 
==Introduction==
 
==Introduction==
 +
 +
In this assignment we will perform an optimal global and local sequence alignment, and use '''R''' to plot the alignment quality as a colored bar-graph.
 +
 +
 +
=== Optimal sequence alignments ===
 +
 +
 +
Online programs for optimal sequence alignment are part of the EMBOSS tools. The programs take FASTA files as input.
 +
 +
;Local optimal SEQUENCE alignment "water"
 +
{{task|1=
 +
# Retrieve the FASTA file for the YFO Mbp1 protein and for [http://www.ncbi.nlm.nih.gov/protein/NP_010227?report=fasta&log$=seqview&format=text ''Saccharomyces cerevisiae''].
 +
# Save the files as text files to your computer, (if you haven't done so already). You could give them an extension of <code>.fa</code>.
 +
# Access the [http://emboss.bioinformatics.nl/ EMBOSS Explorer site] (if you haven't done so yet, you might want to bookmark it.)
 +
# Look for '''ALIGNMENT LOCAL''', click on '''water''', paste your FASTA sequences and run the program with default parameters.
 +
# Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
 +
# Considering the sequence identy cutoff we discussed in class (25% over the length of a domain), do you believe that the APSES domains are homologous?
 +
# Change the '''Gap opening''' and '''Gap extension''' parameters to high values (e.g. 30 and 5). Then run the alignment again.
 +
# Note what is different.
 +
# You could try getting only an alignment for the ankyrin domains that you have found in the last assignment, by deleting the approximate region of the APSES domains from your input.
 +
}}
 +
 +
 +
;Global optimal SEQUENCE alignment "needle"
 +
{{task|1=
 +
# Look for '''ALIGNMENT GLOBAL''', click on '''needle''', paste your FASTA sequences and run the program with default parameters.
 +
# Study the results. You will find that the alignment extends over the entire protein, likely with long ''indels'' at the termini.
 +
# Change the '''Output alignment format''' to '''FASTA pairwise simple''', to retrieve the aligned FASTA files with indels.
 +
# Copy the aligned sequences (with indels) and save them to your computer. You could give them an extension of <code>.fal</code> to remind you that they are aligned FASTA sequences.
 +
}}
 +
  
 
&nbsp;
 
&nbsp;
  
In this assignment we will perform a few computations with coordinate data in PDB files,  look more in depth at domain annotations, and compare them across different proteins related to yeast Mbp1. We will write a function in '''R''' to help us plot a graphic of the comparison and collaborate to share data for our plots. But first we need to go through a few more '''R''' concepts.
+
== The Mutation Data Matrix ==
 +
 
 +
The NCBI makes its alignment matrices available by ftp. They are located at  ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 '''BLOSUM62 matrix''']<ref>The directory also contains sourcecode to generte the PAM matrices. This may be of interest for you if you ever want to produce scoring matrices from your own datasets.</ref>. Access that site and download the <code>BLOSUM62</code> matrix to your computer. You could give it a filename of <code>BLOSUM62.mdm</code>.
 +
 
 +
It should look like this.
 +
 
 +
<source lang="text">
 +
#  Matrix made by matblas from blosum62.iij
 +
#  * column uses minimum score
 +
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
 +
#  Blocks Database = /data/blocks_5.0/blocks.dat
 +
#  Cluster Percentage: >= 62
 +
#  Entropy =  0.6979, Expected =  -0.5209
 +
  A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
 +
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
 +
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
 +
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
 +
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
 +
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
 +
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
 +
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
 +
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
 +
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
 +
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
 +
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
 +
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
 +
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
 +
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
 +
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
 +
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
 +
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
 +
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
 +
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
 +
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
 +
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
 +
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
 +
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
 +
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
 +
</source>
 +
 
 +
 
 +
{{task|
 +
* Study this and make sure you understand what this table is, how it can be used, and what a reasonable range of values for identities and pairscores for non-identical, similar and dissimilar residues is. Ask on the mailing list in case you have questions.
 +
}}
 +
 
  
  
 
&nbsp;
 
&nbsp;
 +
== The DNA binding site ==
 +
  
== Programming '''R''' code  ==
+
Now, that you know how YFO Mbp1 aligns with yeast Mbp1, you can evaluate functional conservation in these homologous proteins. You probably already downloaded the two Biochemistry papers by Taylor et al. (2000) and by Deleeuw et al. (2008) that we encountered in Assignment 2. These discuss the residues involved in DNA binding<ref>([http://www.ncbi.nlm.nih.gov/pubmed/10747782 Taylor ''et al.'' (2000) ''Biochemistry'' '''39''': 3943-3954] and [http://www.ncbi.nlm.nih.gov/pubmed/18491920 Deleeuw ''et al.'' (2008) Biochemistry. '''47''':6378-6385])</ref>. In particular the residues between 50-74 have been proposed to comprise the DNA recognition domain.
  
First, we will cover essentials of '''R''' programming: the fundamental statements that are needed to write programs&ndash;conditional expressions and loops, and how to define ''functions'' that allow us to use programming code. But let's start with a few more data types of '''R''' so we can use the concepts later on: matrices, lists and data frames.
+
{{task|
 +
# Using the APSES domain alignment you have just constructed, find the YFO Mbp1 residues that correspond to the range 50-74 in yeast.
 +
# Note whether the sequences are especially highly conserved in this region.
 +
# Using Chimera, look at the region. Use the sequence window '''to make sure''' that the sequence numbering between the paper and the PDB file are the same (they are often not identical!). Then select the residues - the proposed recognition domain -  and color them differently for emphasis. Study this in stereo to get a sense of the spatial relationships. Check where the conserved residues are.
 +
# A good representation is '''stick''' - but other representations that include sidechains will also serve well.
 +
# Calculate a solvent accessible surface of the protein in a separate representation and make it transparent.
 +
# You could  combine three representations: (1) the backbone (in '''ribbon view'''), (2) the sidechains of residues that presumably contact DNA, distinctly colored, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface.
 +
}}
 +
 
 +
 
 +
DNA binding interfaces are expected to comprise a number of positively charged amino acids, that might form salt-bridges with the phosphate backbone.  
  
  
 
{{task|
 
{{task|
Please begin by working through the short [http://biochemistry.utoronto.ca/steipe/abc/index.php/R_tutorial#Matrices '''R''' - tutorial: matrices] section and the following sections on "Lists" and "Data frames".
+
*Study and consider whether this is the case here and which residues might be included.
 +
}}
 +
 
 +
 
 +
&nbsp;
 +
== R code: coloring the alignment by quality ==
 +
 
 +
 
 +
 
 +
{{task|1=
  
Note that what we have done here is just the bare minimum on vectors, matrices and lists. The concepts are very generally useful, and there are many useful ways to extract subsets of values. We'll come across these in various parts of '''R''' sample code. But unless you read the provided code examples line by line, make sure you understand every single statement and '''ask''' if you are not clear about the syntax, this will all be quickly forgotten. Use it or lose it!
+
* Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
}}
+
 
 +
<source lang="R">
 +
# BiostringsExample.R
 +
# Short tutorial on sequence alignment with the Biostrings package.
 +
# Boris Steipe for BCH441, 2013 - 2014
 +
#
 +
setwd("~/path/to/your/R_files/")
 +
setwd("~/Documents/07.TEACHING/37-BCH441 Bioinformatics 2014/05-Materials/Assignment_5 data")
 +
 
 +
# Biostrings is a package within the bioconductor project.
 +
# bioconducter packages have their own installation system,
 +
# they are normally not installed via CRAN.
  
 +
# First, you load the BioConductor installer...
 +
source("http://bioconductor.org/biocLite.R")
  
'''R''' is a full featured programming language and in order to be that, it needs the ability to manipulate '''Control Flow''', i.e. the order in which instructions are executed. By far the most important control flow statement is a '''conditional branch''' i.e. the instruction to execute code at alternative points of the program, depending on the presence or absence of a condition. Using such conditional branch instructions, two main types of programming constructs can be realized: ''conditional expressions'' and ''loops''.
+
# Then you can install the Biostrings package and all of its dependencies.
 +
biocLite("Biostrings")
  
=== Conditional expressions ===
+
# ... and load the library.
 +
library(Biostrings)
  
The template for conditional expression in R is:
+
# Some basic (technical) information is available ...
 +
library(help=Biostrings)
  
<source lang="rsplus">
+
# ... but for more in depth documentation, use the
if( <expression 1> ) {
+
# so called "vignettes" that are provided with every R package.
  <statement 1>
+
browseVignettes("Biostrings")
}
 
else if ( <expresssion 2> ) {
 
  <statement 2>
 
}
 
else {
 
  <statement 3>
 
}
 
  
</source>
+
# In this code, we mostly use functions that are discussed in the
 +
# pairwise alignement vignette.
 +

# Read in two fasta files - you will need to edit this for YFO
 +
sacce <- readAAStringSet("mbp1-sacce.fa", format="fasta")
  
...where both the <code>else if (...) { ... }</code> and the <code>else (...)</code> block are optional. We have encountered this construct previously, when we assigned the appropriate colors for amino acids in the frequency plot:
+
# "USTMA" is used only as an example here - modify for YFO  :-)
 +
ustma <- readAAStringSet("mbp1-ustma.fa", format="fasta")
  
<source lang="rsplus">
+
sacce
if      (names(logRatio[i]) == "F") { barColors[i] <- hydrophobic }
+
names(sacce)
else if (names(logRatio[i]) == "G") { barColors[i] <- plain }
+
names(sacce) <- "Mbp1 SACCE"
  [... etc ...]
+
names(ustma) <- "Mbp1 USTMA" # Example only ... modify for YFO
else                                { barColors[i] <- plain }
 
  
</source>
+
width(sacce)
+
as.character(sacce)
==== Logical expressions ====
 
  
We have to consider the <code>&lt;expression&gt;</code> in a bit more detail: anything that is, or produces, or can be interpreted as, a Boolean TRUE or FALSE value can serve as an expression in a conditional statement.  
+
# Biostrings takes a sophisticated approach to sequence alignment ...
 +
?pairwiseAlignment
  
{{task|1=
+
# ... but the use in practice is quite simple:
Here are some examples. Copy the code to an '''R''' script, predict what will happen on in each line and try it out:
+
ali <- pairwiseAlignment(sacce, ustma, substitutionMatrix = "BLOSUM50")
 +
ali
  
<source lang="rsplus">
+
pattern(ali)
# A boolean constant is interpreted as is:
+
subject(ali)
if (TRUE)    {print("true")} else {print("false")}
 
if (FALSE)  {print("true")} else {print("false")}
 
  
# Strings of "true" and "false" are coerced to their
+
writePairwiseAlignments(ali)
# Boolean equivalent, but contrary to other programming languages
 
# arbitrary, non-empty or empty strings are not interpreted.
 
if ("true") {print("true")} else {print("false")}
 
if ("false") {print("true")} else {print("false")}
 
if ("widdershins")    {print("true")} else {print("false")}
 
if ("") {print("true")} else {print("false")}
 
  
# All non-zero, defined numbers are TRUE
+
p <- aligned(pattern(ali))
if (1)      {print("true")} else {print("false")}
+
names(p) <- "Mbp1 SACCE aligned"
if (0)     {print("true")} else {print("false")}
+
s <- aligned(subject(ali))
if (-1)    {print("true")} else {print("false")}
+
names(s) <- "Mbp1 USTMA aligned"
if (pi)    {print("true")} else {print("false")}
 
if (NULL)   {print("true")} else {print("false")}
 
if (NA)    {print("true")} else {print("false")}
 
if (NaN)    {print("true")} else {print("false")}
 
if (Inf)    {print("true")} else {print("false")}
 
  
# functions can return Boolean values
+
# don't overwrite your EMBOSS .fal files
affirm <- function() { return(TRUE) }
+
writeXStringSet(p, "mbp1-sacce.R.fal", append=FALSE, format="fasta")
deny <- function() { return(FALSE) }
+
writeXStringSet(s, "mbp1-ustma.R.fal", append=FALSE, format="fasta")
if (affirm())    {print("true")} else {print("false")}
 
if (deny())    {print("true")} else {print("false")}
 
  
# N.B. coercion of Booleans into numbers can be done as well
+
# Done.
and is sometimes useful: consider ...
 
a <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
 
a
 
as.numeric(a)
 
sum(a)
 
  
# ... or coercing the other way ...
 
as.logical(-1:1)
 
 
</source>
 
</source>
 +
 +
* Compare the alignments you received from the EMBOSS server, and that you co puted using '''R'''. Are they aproximately the same? Exactly? You did use different matrices and gap aameters, so minor differences are to be expected. But by and large you should get the same alignments.
 +
 
}}
 
}}
  
==== Logical operators ====
+
We will now use the aligned sequences to compute a graphical display of alignment quality.
  
To actually write a conditional statement, we have to be able to '''test a condition''' and this is what logical operators do. Is something '''equal''' to something else? Is it less? Does something exist? Is it a number?
 
  
 
{{task|1=
 
{{task|1=
  
Here are some examples. Again, predict what will happen ...
+
* Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
  
<source lang="rsplus">
+
<source lang="R">
TRUE            # Just a statement.
+
# aliScore.R
 +
# Evaluating an alignment with a sliding window score
 +
# Boris Steipe, October 2012. Update October 2013
 +
setwd("~/path/to/your/R_files/")
  
#   unary operator
+
# Scoring matrices can be found at the NCBI.
! TRUE          # NOT ...
+
# ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
  
# binary operators
+
# It is good practice to set variables you might want to change
FALSE > TRUE    # GREATER THAN ...
+
# in a header block so you don't need to hunt all over the code
FALSE < TRUE    # ... these are coerced to numbers
+
# for strings you need to update.
FALSE < -1       
+
#
0 == FALSE      # Careful! == compares, = assigns!!!
+
fa1      <- "mbp1-sacce.R.fal"
 +
fa2      <- "mbp1-ustma.R.fal"
 +
code1    <- "SACCE"
 +
code2    <- "USTMA"
 +
mdmFile  <- "BLOSUM62.mdm"
 +
window  <- # window-size (should be an odd integer)
  
"x" == "u"      # using lexical sort order ...
+
# ================================================
"x" >= "u"
+
#   Read data files
"x" <= "u"
+
# ================================================
"x" != "u"
 
"aa" > "u"      # ... not just length, if different.
 
"abc" < "u" 
 
  
TRUE | FALSE    # OR: TRUE if either is true
+
# read fasta datafiles using seqinr function read.fasta()
TRUE & FALSE   # AND: TRUE if both are TRUE
+
install.packages("seqinr")
 +
library(seqinr)
 +
tmp  <- unlist(read.fasta(fa1, seqtype="AA", as.string=FALSE, seqonly=TRUE))
 +
seq1 <- unlist(strsplit(as.character(tmp), split=""))
  
# equality and identity
+
tmp  <- unlist(read.fasta(fa2, seqtype="AA", as.string=FALSE, seqonly=TRUE))
?identical
+
seq2 <- unlist(strsplit(as.character(tmp), split=""))
a <- c(TRUE)
 
b <- c(TRUE)
 
a; b
 
a == b
 
identical(a, b)
 
  
b <- 1
+
if (length(seq1) != length(seq2)) {
a; b
+
print("Error: Sequences have unequal length!")
a == b
+
}
identical(a, b)   # Aha: equal, but not identical
+
 +
lSeq <- length(seq1)
  
 +
# ================================================
 +
#    Read scoring matrix
 +
# ================================================
  
# some other useful tests for conditional expressions
+
MDM <- read.table(mdmFile, skip=6)
?all
+
 
?any
+
# This is a dataframe. Study how it can be accessed:
?duplicated
+
 
?exists
+
MDM
?is.character
+
MDM[1,]
?is.factor
+
MDM[,1]
?is.integer
+
MDM[5,5]  # Cys-Cys
?is.null
+
MDM[20,20] # Val-Val
?is.numeric
+
MDM[,"W"]  # the tryptophan column
?is.unsorted
+
MDM["R","W"]  # Arg-Trp pairscore
?is.vector
+
MDM["W","R"]  # Trp-Arg pairscore: pairscores are symmetric
</source>
+
 
}}
+
colnames(MDM)  # names of columns
 +
rownames(MDM)  # names of rows
 +
colnames(MDM)[3]  # third column
 +
rownames(MDM)[12]  # twelfth row
 +
 
 +
# change the two "*" names to "-" so we can use them to score
 +
# indels of the alignment. This is a bit of a hack, since this
 +
# does not reflect the actual indel penalties (which is, as you)
 +
# remember from your lectures, calculated as a gap opening
 +
# + gap extension penalty; it can't be calculated in a pairwise
 +
# manner) EMBOSS defaults for BLODSUM62 are opening -10 and
 +
# extension -0.5 i.e. a gap of size 3 (-11.5) has approximately
 +
# the same penalty as a 3-character score of "-" matches (-12)
 +
# so a pairscore of -4 is not entirely unreasonable.
 +
 
 +
colnames(MDM)[24]
 +
rownames(MDM)[24]
 +
colnames(MDM)[24] <- "-"
 +
rownames(MDM)[24] <- "-"
 +
colnames(MDM)[24]
 +
rownames(MDM)[24]
 +
MDM["Q", "-"]
 +
MDM["-", "D"]
 +
# so far so good.
  
 +
# ================================================
 +
#    Tabulate pairscores for alignment
 +
# ================================================
  
=== Loops ===
 
  
Loops allow you to repeat tasks many times over. The template is:
+
# It is trivial to create a pairscore vector along the
 +
# length of the aligned sequences.
  
<source lang="rsplus">
+
PS <- vector()
for (<name> in <vector>) {
+
for (i in 1:lSeq) {
   <statement>
+
   aa1 <- seq1[i]
 +
  aa2 <- seq2[i]
 +
  PS[i] = MDM[aa1, aa2]
 
}
 
}
</source>
 
  
{{task|1=
+
PS
Consider the following: Again, copy the code to a script, study it, predict what will happen and then run it.
+
 
 +
 
 +
# The same vector could be created - albeit perhaps not so
 +
# easy to understand - with the expression ...
 +
MDM[cbind(seq1,seq2)]
 +
 
 +
 
 +
 
 +
# ================================================
 +
#    Calculate moving averages
 +
# ================================================
 +
 
 +
# In order to evaluate the alignment, we will calculate a
 +
# sliding window average over the pairscores. Somewhat surprisingly
 +
# R doesn't (yet) have a native function for moving averages: options
 +
# that are quoted are:
 +
#  - rollmean() in the "zoo" package http://rss.acs.unt.edu/Rdoc/library/zoo/html/rollmean.html
 +
#  - MovingAverages() in "TTR"  http://rss.acs.unt.edu/Rdoc/library/TTR/html/MovingAverages.html
 +
#  - ma() in "forecast"  http://robjhyndman.com/software/forecast/
 +
# But since this is easy to code, we shall implement it ourselves.
 +
 
 +
PSma <- vector()          # will hold the averages
 +
winS <- floor(window/2)    # span of elements above/below the centre
 +
winC <- winS+1            # centre of the window
 +
 
 +
# extend the vector PS with zeros (virtual observations) above and below
 +
PS <- c(rep(0, winS), PS , rep(0, winS))
 +
 
 +
# initialize the window score for the first position
 +
winScore <- sum(PS[1:window])
 +
 
 +
# write the first score to PSma
 +
PSma[1] <- winScore
  
<source lang="rsplus">
+
# Slide the window along the sequence, and recalculate sum()
# simple for loop
+
# Loop from the next position, to the last position that does not exceed the vector...
for (i in 1:10) {
+
for (i in (winC + 1):(lSeq + winS)) {  
print(c(i, i^2, i^3))
+
  # subtract the value that has just dropped out of the window
 +
  winScore <- winScore - PS[(i-winS-1)]
 +
  # add the value that has just entered the window
 +
  winScore <- winScore + PS[(i+winS)] 
 +
  # put score into PSma
 +
  PSma[i-winS] <- winScore
 
}
 
}
  
# Compare excution times: one million square roots from a vector ...
+
# convert the sums to averages
n <- 1000000
+
PSma <- PSma / window
x <- 1:n
+
 
y <- sqrt(x)
+
# have a quick look at the score distributions
 +
 
 +
boxplot(PSma)
 +
hist(PSma)
  
# ... or done explicitly in a for-loop
+
# ================================================
for (i in 1:n) {
+
#    Plot the alignment scores
  y[i] <- sqrt (x[i])
+
# ================================================
}
+
 
 +
# normalize the scores
 +
PSma <- (PSma-min(PSma))/(max(PSma) - min(PSma) + 0.0001)
 +
# spread the normalized values to a desired range, n
 +
nCol <- 10
 +
PSma <- floor(PSma * nCol) + 1
 +
 
 +
# Assign a colorspectrum to a vector (with a bit of colormagic,
 +
# don't worry about that for now). Dark colors are poor scores,
 +
# "hot" colors are high scores
 +
spect <- colorRampPalette(c("black", "red", "yellow", "white"), bias=0.4)(nCol)
  
</source>
+
# Color is an often abused aspect of plotting. One can use color to label
 +
# *quantities* or *qualities*. For the most part, our pairscores measure amino
 +
# acid similarity. That is a quantity and with the spectrum that we just defined
 +
# we associte the measured quantities with the color of a glowing piece
 +
# of metal: we start with black #000000, then first we ramp up the red
 +
# (i.e. low-energy) part of the visible spectrum to red #FF0000, then we
 +
# add and ramp up the green spectrum giving us yellow #FFFF00 and finally we
 +
# add blue, giving us white #FFFFFF. Let's have a look at the spectrum:
  
''If'' you can achieve your result with an '''R''' vector expression, it will be faster than using a loop. But sometimes you need to do things explicitly, for example if you need to access intermediate results.
+
s <- rep(1, nCol)
 +
barplot(s, col=spect, axes=F, main="Color spectrum")
  
}}
+
# But one aspect of our data is not quantitatively different: indels.
 +
# We valued indels with pairscores of -4. But indels are not simply poor alignment,
 +
# rather they are non-alignment. This means stretches of -4 values are really
 +
# *qualitatively* different. Let's color them differently by changing the lowest
 +
# level of the spectrum to grey.
  
 +
spect[1] <- "#CCCCCC"
 +
barplot(s, col=spect, axes=F, main="Color spectrum")
  
Here is an example to play with loops: a password generator. Passwords are a '''pain'''. We need them everywhere, they are frustrating to type, to remember and since the cracking programs are getting smarter they become more and more likely to be broken. Here is a simple password generator that creates random strings with consonant/vowel alterations. These are melodic and easy to memorize, but actually as '''strong''' as an 8-character, fully random password that uses all characters of the keyboard such as <code>)He.{2jJ</code> or <code>#h$bB2X^</code> (which is pretty much unmemorizable). The former is taken from 20<sup>7</sup> * 7<sup>7</sup> 10<sup>15</sup> possibilities, the latter is from 94<sup>8</sup> ~ 6*10<sup>15</sup> possibilities. HIgh-end GPU supported {{WP|Password cracking|password crackers}} can test about 10<sup>9</sup> passwords a second, the passwords generated by this little algorithm would thus take on the order of 10<sup>6</sup> seconds or eleven days to crack<ref>That's assuming the worst case in that the attacker needs to know the pattern with which the password is formed, i.e. the number of characters and the alphabet that we chose from. But note that there is an even worse case: if the attacker had access to our code and the seed to our random number generator. When the random number generator starts up, a new seed is generated from system time, thus the possible space of seeds can be devastatingly small. But even if a seed is set explicitly with the <code>set.seed()</code> function, that seed is a 32-bit integer and thus can take only a bit more than 4*10<sup>9</sup> values, six orders of magnitude less than the 10<sup>15</sup> password complexity we thought we had. It turns out that the code may be a much greater vulnerability than the password itself. Keep that in mind. <small>Keep it secret. <small>Keep it safe.</small></small></ref>. This is probably good enough to deter a casual attack.
+
# Now we can display our alignment score vector with colored rectangles.
  
{{task|1=
+
# Convert the integers in PSma to color values from spect
Copy, study and run ...
+
PScol <- vector()
<source lang="rsplus">
+
for (i in 1:length(PSma)) {
# Suggest memorizable passwords
+
PScol[i] <- spect[ PSma[i] ]  # this is how a value from PSma is used as an index of spect
# Below we use the functions:
+
}
?nchar
 
?sample
 
?substr
 
?paste
 
?print
 
  
#define a string of  consonants ...
+
# Plot the scores. The code is similar to the last assignment.
con <- "bcdfghjklmnpqrstvwxz"
+
# Create an empty plot window of appropriate size
# ... and a string of of vowels
+
plot(1,1, xlim=c(-100, lSeq), ylim=c(0, 2) , type="n", yaxt="n", bty="n", xlab="position in alignment", ylab="")
vow <- "aeiouy"
 
  
for (i in 1:10) {  # ten sample passwords to choose from ...
+
# Add a label to the left
    pass = rep("", 14)  # make an empty character vector
+
text (-30, 1, adj=1, labels=c(paste("Mbp1:\n", code1, "\nvs.\n", code2)), cex=0.9 )
    for (j in 1:7) {    # seven consonant/vowel pairs to be created ...
 
        k  <- sample(1:nchar(con), 1)  # pick a random index for consonants ...
 
        ch  <- substr(con,k,k)         #  ... get the corresponding character ...
 
        idx <- (2*j)-1                  # ... compute the position (index) of where to put the consonant ...
 
        pass[idx] <- ch                # ...  and put it in the right spot
 
  
        # same thing for the vowel, but coded with fewer intermediate assignments
+
# Loop over the vector and draw boxes  without border, filled with color.
        # of results to variables
+
for (i in 1:lSeq) {
        k <- sample(1:nchar(vow), 1)
+
  rect(i, 0.9, i+1, 1.1, border=NA, col=PScol[i])
        pass[(2*j)] <- substr(vow,k,k)
 
    }
 
    print( paste(pass, collapse="") )  # collapse the vector in to a string and print
 
 
}
 
}
 +
 +
# Note that the numbers along the X-axis are not sequence numbers, but numbers
 +
# of the alignment, i.e. sequence number + indel length. That is important to
 +
# realize: if you would like to add the annotations from the last assignment
 +
# which I will leave as an exercise, you need to map your sequence numbering
 +
# into alignment numbering. Let me know in case you try that but need some help.
 +
 
</source>
 
</source>
 
}}
 
}}
  
=== Functions ===
 
  
Finally: functions. Functions look very much like the statements we have seen above. the template looks like:
+
;That is all.
 +
 
 +
 
 +
&nbsp;
 +
 
 +
&nbsp;
 +
==Introduction==
 +
 
 +
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 +
 
 +
 
 +
&nbsp;<br>
 +
 
 +
;Take care of things, and they will take care of you.
 +
:''Shunryu Suzuki''
 +
</div>
 +
 
 +
 
 +
Anyone can click buttons on a Web page, but to use the powerful sequence database search tools ''right'' often takes considerable more care, caution and consideration.
 +
 
 +
Much of what we know about a protein's physiological function is based on the '''conservation''' of that function as the species evolves. We assess conservation by comparing sequences between related proteins. Conservation - or its opposite: ''variation'' - is a consequence of '''selection under constraints''': protein sequences change as a consequence of DNA mutations, this changes the protein's structure, this in turn changes functions and that has the multiple effects on a species' fitness function. Detrimental variants may be removed. Variation that is tolerated is largely neutral and therefore found only in positions that are neither structurally nor functionally critical. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, or amino acid propensities as predictors for protein engineering and design tasks.
 +
 
 +
Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment ('''MSA''') is a cornerstone for the annotation of the essential properties a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for
 +
* functional annotation;
 +
* protein homology modeling;
 +
* phylogenetic analyses, and
 +
* sensitive homology searches in databases.
 +
 
 +
In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is where the trouble begins. All interpretation of MSA results depends '''absolutely''' on how the input sequences were chosen. Should we include only orthologs, or paralogs as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of ''representative'' sequences? All of these choices influence our interpretation:
 +
*orthologs are expected to be functionally and structurally conserved;
 +
*paralogs may have divergent function but have similar structure;
 +
*missing genes may make paralogs look like orthologs; and
 +
*selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.
 +
 
 +
 
 +
In this assignment, we will set ourselves the task to use PSI-BLAST and '''find all orthologs and paralogs of the APSES domain containing transcription factors in YFO'''. We will use these sequences later for multiple alignments, calculation of conservation ''etc''. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search '''in one organism'''. There is an issue to consider:
 +
* If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are '''in''' YFO is too small. Thus the search will not become very sensitive.
 +
* If we don't restrict our search, but search in all species, the number of hits may become too large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage, and how will we evaluate the fringe cases of marginal E-value, where we need to decide whether to include a new sequence in the profile, or whether to hold off on it for one or two iterations, to see whether the E-value drops significantly. Profile corruption would make the search useless. This is maybe still manageable if we restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added.  
 +
 
 +
Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile.
 +
 
 +
 
 +
Thus in practice, a sensitive PSI-BLAST search needs to address two issues before we begin:
 +
# We need to define the sequence we are searching with; and
 +
# We need to define the dataset we are searching in.
 +
 
 +
 
 +
 
 +
 
 +
 
 +
==Defining the sequence to search with==
  
<source lang="rsplus">
 
<name> <- function (<parameters>) {
 
  <statements>
 
}
 
</source>
 
  
In this statement, the function is assigned to the ''name'' - any valid name in '''R'''. Once it is assigned, it the function can be invoked with <code>name()</code>. The parameter list (the values we write into the parentheses followin the function name) can be empty, or hold a list of variable names. If variable names are present, you need to enter the corresponding parameters when you execute the function. These assigned variables are available inside the function, and can be used for computations. This is called "passing the variable into the function".
+
Consider again the task we set out from: '''find all orthologs and paralogs of the APSES domain containing transcription factors in YFO'''.  
  
You have encountered a function to choose YFO names. In this function, your Student ID was the parameter. Here is another example to play with: a function that calculates how old you are. In days. This is neat - you can celebrate your 10,000 birth'''day''' - or so.
 
  
 
{{task|1=
 
{{task|1=
 +
What query sequence should you use? Should you ...
  
Copy, explore and run ...
 
  
;Define the function ...
+
# Search with the full-length Mbp1 sequence from ''Saccharomyces cerevisiae''?
<source lang = "rsplus">
+
# Search with the full-length Mbp1 homolog that you found in YFO?
# A lifedays calculator function
+
# Search with the structurally defined ''S. cerevisiae'' APSES domain sequence?
 +
# Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
 +
# Search with the KilA-N domain sequence?
  
myLifeDays <- function(date = NULL) { # give "date" a default value so we can test whether it has been set
 
    if (is.null(date)) {
 
        print ("Enter your birthday as a string in \"YYYY-MM-DD\" format.")
 
        return()
 
    }
 
    x <- strptime(date, "%Y-%m-%d") # convert string to time
 
    y <- format(Sys.time(), "%Y-%m-%d") # convert "now" to time
 
    diff <- round(as.numeric(difftime(y, x, unit="days")))
 
    print(paste("This date was ", diff, " days ago."))
 
}
 
</source>
 
  
;Use the function (example):
+
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">Reflect on this (pretend this is a quiz question) and come up with a reasoned answer. Then click on "Expand" to read my opinion on this question.
<source lang = "rsplus">
+
<div  class="mw-collapsible-content">
  myLifeDays("1932-09-25")  # Glenn Gould's birthday
+
;The full-length Mbp1 sequence from ''Saccharomyces cerevisiae''
</source>
+
:Since this sequence contains multiple domains (in particular the ubiquitous Ankyrin domains) it is not suitable for BLAST database searches. You must restrict your search to the domain of greatest interest for your question. That would be the APSES domain.
}}
+
 
 +
;The full-length Mbp1 homolog that you found in YFO
 +
:What organism the search sequence comes from does not make a difference. Since you aim to find '''all''' homologs in YFO, it is not necessary to have your search sequence more or less similar to '''any particular''' homologs. In fact '''any''' APSES sequence should give you the same result, since they are '''all''' homologous. But the full-length sequence in YFO has the same problem as the ''Saccharomyces'' sequence.
 +
 
 +
;The structurally defined ''S. cerevisiae'' APSES domain sequence?
 +
:That would be my first choice, just because it is structurally well defined as a complete domain, and the sequence is easy to obtain from the <code>1BM8</code> PDB entry. (<code>1MB1</code> would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)
  
Here is a good opportunity to play and practice programming: modify this function to accept a second argument. When a second argument is present (e.g. 10000) the function should print the calendar date on which the input date will be that number of days ago. Then you could use it to know when to celebrate your 10,000<sup>th</sup> lifeDay, or your 777<sup>th</sup> anniversary day or whatever.
+
;The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
 +
:As argued above: since they are all homologs, any of them should lead to the same set of results.
  
Enjoy.
+
;The KilA-N domain sequence?
 +
:This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search '''more sensitive'''. The fact that it is shorter might make the search '''less specific'''. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be '''whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.'''
  
== Protein structure annotation ==
 
  
Let's do something very quick in Chimera: calculate a Ramachandran plot.
+
So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the [http://www.pdb.org/pdb/explore/explore.do?structureId=1BM8 '''PDB'''] (i.e. [http://www.pdb.org/pdb/files/fasta.txt?structureIdList=1BM8 here], or from [[BIO_Assignment_Week_3#Search input|Assignment 3]].
  
{{task|1=
+
</div>
# Open Chimera and load <code>1BM8</code>.
+
</div>
# Choose '''Favorites''' &rarr; '''Model Panel'''
 
# Look for the Option '''Ramachandran plot...''' in the choices on the right.
 
# Click the button and study the result. What do you see? What does this mean? What options are there? What do they do? What significance does this have?
 
 
}}
 
}}
  
 +
&nbsp;
  
&nbsp;
+
==Selecting species for a PSI-BLAST search==
 +
 
 +
 
 +
As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on '''BLAST''' to find all of them, we need to use the much more sensitive search of '''PSI-BLAST''' instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale why we include sequences in the search that we are not actually interested in.
  
== CDD domain annotation ==
 
  
In the last assignment, you followed a link to '''CDD Search Results''' from the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1] and briefly looked at the information offered by the NCBI's Conserved Domain Database, a database of ''Position Specific Scoring Matrices'' that embody domain definitions. Rather than access precomputed results, you can also search CDD with sequences: assuming you have saved the YFO Mbp1 sequence in FASTA format, this is straightforward. If you did not save this sequence, return to [[BIO_Assignment_Week_3|Assignment 3]] and retrieve it again.
+
But you can also search with '''too many''' species: if the number of species is large and PSI-BLAST finds a large number of results:
 +
# it becomes unwieldy to check the newly included sequences at each iteration, inclusion of false-positive hits may result, profile corruption and loss of specificity. The search will fail.
 +
# since genomes from some parts of the Tree Of Life are over represented, the inclusion of all sequences leads to selection bias and loss of sensitivity.
  
  
{{task|1=
+
We should therefore try to find a subset of species
# Access the [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml '''CDD database'''] at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
+
# that represent as large a '''range''' as possible on the evolutionary tree;
# Read the information. CDD is a superset of various other database domain annotations as well as NCBI-curated domain definitions.
+
# that are as well '''distributed''' as possible on the tree; and
# Copy the YFO Mbp1 FASTA sequence, paste it into the search form and click '''Submit'''.
+
# whose '''genomes''' are fully sequenced.
## On the result page, clik on '''View full result'''
 
## Note that there are a number of partially overlapping ankyrin domain modules. We will study ankyrin domains in a later assignment.
 
## Also note that there may be blocks of sequence colored cyan in the sequence bar. Hover your mouse over the blocks to see what these blocks signify.
 
## Open the link to '''Search for similar domain architecture''' in a separate window and study it. This is the '''CDART''' database. Think about what these results may be useful for.
 
## Click on one of the ANK superfamily graphics and see what the associated information looks like: there is a summary of structure and function, links to specific literature and a tree of the relationship of related sequences.
 
}}
 
  
== SMART domain annotation ==
+
These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.
  
 +
How can we '''define''' a list of such species, and how can we '''use''' the list?
  
The [http://smart.embl-heidelberg.de/ SMART database] at the EMBL in Heidelberg offers an alternative view on domain architectures. I personally find it more useful for annotations because it integrates a number of additional features. You can search by sequence, or by accession number and that raises the question of how to retrieve a database cross-reference from an NCBI sequence identifier to a UniProt sequence ID:
+
The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes". Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. Regrettably you can't combine such information easily with the resources available from the NCBI.
  
 +
We will use an approach that is conceptually similar: selecting a set of species according to their shared taxonomic rank in the tree of life. {{WP|Biological classification|'''Biological classification'''}} provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called {{WP|Taxonomic rank|'''taxonomic ranks'''}}. These ranks are defined in ''Codes of Nomenclature'' that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired&ndash;but not mandated&ndash;that ranks represent ''clades'' (a group of related species, or a "branch" of a phylogeny), and it is desired&ndash;but not madated&ndash;that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux.
  
===ID mapping===
+
If we follow a link to an entry in the NCBI's Taxonomy database, eg. [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 ''Saccharomyces cerevisiae S228c''], the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
  
  
{{task|
+
<source lang="text">
<!-- Yeast:  NP_010227 ... P39678 -->
+
cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya;
# Access the [http://www.uniprot.org/mapping/ UniProt ID mapping service] at http://www.uniprot.org/mapping/
+
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes;
# Paste the RefSeq identifier for YFO Mbp1 into the search field.
+
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae
# Use the menu to choose '''From''' ''RefSeq Protein'' and '''To''' ''UniprotKB AC''&ndash;the UniProt Knowledge Base Accession number.
+
</source>
# Click on '''Map''' to execute the search.
 
# Note the ID - it probably starts with a letter, followed by numbers and letters. Here are some examples for fungal Mbp1-like proteins: <code>P39678 Q5B8H6 Q5ANP5 P41412</code> ''etc.''
 
# Click on the link, and explore how the UniProt sequence page is similar or different from the RefSeq page.
 
}}
 
  
===SMART search===
 
  
{{task|1=
+
These names can be mapped into taxonomic ranks ranks, since the suffixes of these names e.g. ''-mycotina'', ''-mycetaceae'' are specific to defined ranks. (NCBI does not provide this mapping, but {{WP|Taxonomic rank|Wikipedia}} is helpful here.)
# Access the [http://smart.embl-heidelberg.de/ '''SMART database'''] at http://smart.embl-heidelberg.de/
 
# Click the lick to access SMART in the '''normal''' mode.
 
# Paste the YFO Mbp1 UniProtKB Accession number into the '''Sequence ID or ACC''' field.
 
# Check the boxes for:
 
## '''PFAM domains''' (domains defined by sequence similarity in the PFAM database)
 
## '''signal peptides''' (using the Gunnar von Heijne's SignalP 4.0 server at the Technical University in Lyngby, Denmark)
 
## '''internal repeats''' (using the programs ''ariadne'' and ''prospero'' at the Wellcome Trust Centre for Human Genetics at Oxford University, England)
 
## '''intrinsic protein disorder''' (using Rune Linding's DisEMBL program at the EMBL in Heidelberg, Germany)
 
# Click on '''Sequence SMART''' to run the search and annotation. <small>(In case you get an error like: "Sorry, your entry seems to have no SMART domain ...", let me know and repeat the search with the actual FASTA sequence instead of the accession number.)</small>
 
  
Study the results. Specifically, have a look at the proteins with similar domain '''ORGANISATION''' and '''COMPOSITION'''. This is similar to the NCBI's CDART.
+
<table>
  
}}
+
<tr class="sh">
 +
<td>Rank</td>
 +
<td>Suffix</td>
 +
<td>Example</td>
 +
</tr>
  
=== Visual comparison of domain annotations in '''R''' ===
+
<tr class="s1">
 +
<td>Domain</td>
 +
<td></td>
 +
<td>Eukaryota (Eukarya)</td>
 +
</tr>
  
 +
<tr class="s2">
 +
<td>&nbsp;&nbsp;Subdomain</td>
 +
<td>&nbsp;</td>
 +
<td>Opisthokonta</td>
 +
</tr>
  
The versatile plotting functions of '''R''' allow us to compare domain annotations. The distribution of segments that are annotated as being "low-complexity" or "disordered is particulalry interesting: these are functional features of the amino acid sequence that are often not associated with sequence similarity.
+
<tr class="s1">
 +
<td>'''Kingdom'''</td>
 +
<td>&nbsp;</td>
 +
<td>Fungi</td>
 +
</tr>
  
In the following code tutorial, we create a plot similar to the CDD and SMART displays. It is based on the SMART domain annotations of the six-fungal reference species for the course.
+
<tr class="s2">
 +
<td>&nbsp;&nbsp;Subkingdom</td>
 +
<td>&nbsp;</td>
 +
<td>Dikarya</td>
 +
</tr>
  
{{task|1=
+
<tr class="s1">
 +
<td>'''Phylum'''</td>
 +
<td>&nbsp;</td>
 +
<td>Ascomycota</td>
 +
</tr>
  
Copy the code to an '''R''' script, study and execute it.
+
<tr class="s2">
<source lang="R">
+
<td>&nbsp;&nbsp;''rankless taxon''<ref>The -myceta are well supported groups above the Class rank. See {{WP|Leotiomyceta|''Leotiomyceta''}} for details and references.</ref></td>
 +
<td>-myceta</td>
 +
<td>Saccharomyceta</td>
 +
</tr>
  
# plotDomains
+
<tr class="s1">
# tutorial and functions to plot a colored rectangle from a list of domain annotations
+
<td>&nbsp;&nbsp;Subphylum</td>
 +
<td>-mycotina</td>
 +
<td>Saccharomycotina</td>
 +
</tr>
  
 +
<tr class="s2">
 +
<td>'''Class'''</td>
 +
<td>-mycetes</td>
 +
<td>Saccharomycetes</td>
 +
</tr>
  
# First task: create a list structure for the annotations: this is a list of lists
+
<tr class="s1">
# As you see below, we need to mix strings, numbers and vectors of numbers. In R
+
<td>&nbsp;&nbsp;Subclass</td>
# such mixed data types must go into a list.
+
<td>-mycetidae</td>
 +
<td>&nbsp;</td>
 +
</tr>
  
Mbp1Domains <- list()  # start with an empty list
+
<tr class="s2">
 +
<td>'''Order'''</td>
 +
<td>-ales</td>
 +
<td>Saccharomycetales</td>
 +
</tr>
  
# For each species annotation, compile the SMART domain annotations in a list.
+
<tr class="s1">
Mbp1Domains <- rbind(Mbp1Domains, list(  # rbind() appends the list to the existing
+
<td>'''Family'''</td>
    species = "Saccharomyces cerevisiae",
+
<td>-aceae</td>
    code    = "SACCE",
+
<td>Saccharomycetaceae</td>
    ACC    = "P39678",
+
</tr>
    length  = 833,
 
    KilAN  = c(18,102),  # Note: Vector of (start, end) pairs
 
    AThook  = NULL,      # Note: NULL, because this annotation was not observed in this sequence
 
    Seg    = c(108,122,236,241,279,307,700,717),
 
    DisEMBL = NULL,
 
    Ankyrin = c(394,423,427,463,512,541),  # Note: Merge overlapping domains, if present
 
    Coils  = c(633, 655)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
<tr class="s2">
    species = "Emericella nidulans",
+
<td>&nbsp;&nbsp;Subfamily</td>
    code    = "ASPNI",
+
<td>-oideae</td>
    ACC    = "Q5B8H6",
+
<td>&nbsp;</td>
    length  = 695,
+
</tr>
    KilAN  = c(23,94),
 
    AThook  = NULL,
 
    Seg    = c(529,543),
 
    DisEMBL = NULL,
 
    Ankyrin = c(260,289,381,413),
 
    Coils  = c(509,572)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
<tr class="s1">
    species = "Candida albicans",
+
<td>&nbsp;&nbsp;Tribe</td>
    code    = "CANAL",
+
<td>-eae</td>
    ACC    = "Q5ANP5",
+
<td>&nbsp;</td>
    length  = 852,
+
</tr>
    KilAN  = c(19,102),
 
    AThook  = NULL,
 
    Seg    = c(351,365,678,692),
 
    DisEMBL = NULL,
 
    Ankyrin = c(376,408,412,448,516,545),
 
    Coils  = c(665,692)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
<tr class="s2">
    species = "Neurospora crassa",
+
<td>&nbsp;&nbsp;Subtribe</td>
    code    = "NEUCR",
+
<td>-ineae</td>
    ACC    = "Q7RW59",
+
<td>&nbsp;</td>
    length  = 833,
+
</tr>
    KilAN  = c(31,110),
 
    AThook  = NULL,
 
    Seg    = c(130,141,253,266,514,525,554,564,601,618,620,629,636,652,658,672,725,735,752,771),
 
    DisEMBL = NULL,
 
    Ankyrin = c(268,297,390,419),
 
    Coils  = c(500,550)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
<tr class="s1">
    species = "Schizosaccharomyces pombe",
+
<td>'''Genus'''</td>
    code    = "SCHPO",
+
<td>&nbsp;</td>
    ACC    = "P41412",
+
<td>Saccharomyces</td>
    length  = 657,
+
</tr>
    KilAN  = c(21,97),
 
    AThook  = NULL,
 
    Seg    = c(111,125,136,145,176,191,422,447),
 
    DisEMBL = NULL,
 
    Ankyrin = c(247,276,368,397),
 
    Coils  = c(457,538)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
<tr class="s2">
    species = "Ustilago maydis",
+
<td>'''Species'''</td>
    code    = "USTMA",
+
<td>&nbsp;</td>
    ACC    = "Q4P117",
+
<td>''Saccharomyces cerevisiae''</td>
    length  = 956,
+
</tr>
    KilAN  = c(21,98),
 
    AThook  = NULL,
 
    Seg    = c(106,116,161,183,657,672,776,796),
 
    DisEMBL = NULL,
 
    Ankyrin = c(245,274,355,384),
 
    Coils  = c(581,609)
 
    ))
 
  
 +
<table>
  
# Working with data in lists and dataframes can be awkward, since the result
 
# of filters and slices are themselves lists, not vectors.
 
# Therefore we need to use the unlist() function a lot. When in doubt: unlist()
 
  
#### Boxes #####
+
You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record. But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major '''domains''' of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=40674&lvl=4 '''Mammalia'''] (a {{WP|Mammal_classification|'''class rank'''}}, and depending how many species you would want to include, use the
# Define a function to draw colored boxes, given input of a vector with
+
subclass-, order-, or family rank (hover over the names to see their taxonomic rank.)
# (start,end) pairs, a color, and the height where the box should go.
 
drawBoxes <- function(v, c, h) {  # vector of xleft, xright pairs; color; height
 
    if (is.null(v)) { return() }
 
    for (i in seq(1,length(v),by=2)) {
 
        rect(v[i], h-0.1, v[i+1], h+0.1, border="black", col=c)
 
    }
 
}
 
  
#### Annotations ####
+
There will still be quite a bit of manual work involved and an exploration of different options on the Web may be useful. For our purposes here we can retrieve a good set of organisms from the [http://fungi.ensembl.org/info/website/species.html '''ensembl fungal genomes page'''] - maintained by the EBI's genome annotation group - that lists species grouped by taxonomic ''order''. All of these organisms are genome-sequenced, we can pick a set of representatives:
# Define a function to write the species code, draw a grey
 
# horizontal line and call drawBoxes() for every annotation type
 
# in the list
 
drawGene <- function(rIndex) {
 
    # define colors:
 
    kil <- "#32344F"
 
    ank <- "#691A2C"
 
    seg <- "#598C9E"
 
    coi <- "#8B9998"
 
    xxx <- "#EDF7F7"
 
   
 
    text (-30, rIndex, adj=1, labels=unlist(Mbp1Domains[rIndex,"code"]), cex=0.75 )
 
    lines(c(0, unlist(Mbp1Domains[rIndex,"length"])), c(rIndex, rIndex), lwd=3, col="#999999")
 
  
    drawBoxes(unlist(Mbp1Domains[rIndex,"KilAN"]),  kil, rIndex)
+
# Capnodiales&nbsp;&nbsp;&nbsp;''Zymoseptoria tritici''
    drawBoxes(unlist(Mbp1Domains[rIndex,"AThook"]),  xxx, rIndex)
+
# Erysiphales&nbsp;&nbsp;&nbsp;''Blumeria graminis''
    drawBoxes(unlist(Mbp1Domains[rIndex,"Seg"]),    seg, rIndex)
+
# Eurotiales&nbsp;&nbsp;&nbsp;''Aspergillus nidulans''
    drawBoxes(unlist(Mbp1Domains[rIndex,"DisEMBL"]), xxx, rIndex)
+
# Glomerellales&nbsp;&nbsp;&nbsp;''Glomerella graminicola''
    drawBoxes(unlist(Mbp1Domains[rIndex,"Ankyrin"]), ank, rIndex)
+
# Hypocreales&nbsp;&nbsp;&nbsp;''Trichoderma reesei''
    drawBoxes(unlist(Mbp1Domains[rIndex,"Coils"]),  coi, rIndex)
+
# Magnaporthales&nbsp;&nbsp;&nbsp;''Magnaporthe oryzae''
}
+
# Microbotryales&nbsp;&nbsp;&nbsp;''Microbotryum violaceum''
 +
# Pezizales&nbsp;&nbsp;&nbsp;''Tuber melanosporum''
 +
# Pleosporales&nbsp;&nbsp;&nbsp;''Phaeosphaeria nodorum''
 +
# Pucciniales&nbsp;&nbsp;&nbsp;''Puccinia graminis''
 +
# Saccharomycetales&nbsp;&nbsp;&nbsp;''Saccharomyces cerevisiae''
 +
# Schizosaccharomycetales&nbsp;&nbsp;&nbsp;''Schizosaccharomyces pombe''
 +
# Sclerotiniaceae&nbsp;&nbsp;&nbsp;''Sclerotinia sclerotiorum''
 +
# Sordariales&nbsp;&nbsp;&nbsp;''Neurospora crassa''
 +
# Tremellales&nbsp;&nbsp;&nbsp;''Cryptococcus neoformans''
 +
# Ustilaginales&nbsp;&nbsp;&nbsp;''Ustilago maydis''
  
#### Plot ####
+
This set of organisms thus can be used to generate a PSI-BLAST search in a well-distributed set of species. Of course '''you must also include YFO''' (<small>if YFO is not in this list already</small>).
# define the size of the plot-frame
 
yMax <- length(Mbp1Domains[,1])  # number of domains in list
 
xMax <- max(unlist(Mbp1Domains[,"length"])) # largest sequence length
 
  
# plot an empty frame
+
To enter these 16 species as an Entrez restriction, they need to be formatted as below. (<small>One could also enter species one by one, by pressing the '''(+)''' button after the organism list</small>)
plot(1,1, xlim=c(-100,xMax), ylim=c(0, yMax) , type="n", yaxt="n", bty="n", xlab="sequence number", ylab="")
 
  
# Finally, iterate over all species and call drawGene()
 
for (i in 1:length(Mbp1Domains[,1])) {
 
    drawGene(i)
 
}
 
  
# end
+
<source lang="text">
 +
Aspergillus nidulans[orgn]
 +
OR Blumeria graminis[orgn]
 +
OR Cryptococcus neoformans[orgn]
 +
OR Glomerella graminicola[orgn]
 +
OR Magnaporthe oryzae[orgn]
 +
OR Microbotryum violaceum[orgn]
 +
OR Neurospora crassa[orgn]
 +
OR Phaeosphaeria nodorum[orgn]
 +
OR Puccinia graminis[orgn]
 +
OR Sclerotinia sclerotiorum[orgn]
 +
OR Trichoderma reesei[orgn]
 +
OR Tuber melanosporum[orgn]
 +
OR Saccharomyces cerevisiae[orgn]
 +
OR Schizosaccharomyces pombe[orgn]
 +
OR Ustilago maydis[orgn]
 +
OR Zymoseptoria tritici[orgn]
  
 
</source>
 
</source>
 +
 +
 +
 +
&nbsp;
 +
 +
==Executing the PSI-BLAST search==
 +
 +
We have a list of species. Good. Next up: how do we '''use''' it.
 +
 +
{{task|1=
 +
 +
 +
# Navigate to the BLAST homepage.
 +
# Select '''protein BLAST'''.
 +
# Paste the APSES domain sequence into the search field.
 +
# Select '''refseq''' as the database.
 +
# Copy the organism restriction list from above '''and enter the correct name for YFO''' into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the '''Entrez Query''' field.
 +
# In the '''Algorithm''' section, select PSI-BLAST.
 +
#Click on '''BLAST'''.
 
}}
 
}}
  
  
When you execute the code, your plot should look similar to this one:
+
Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an '''E-value''' of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits&ndash; in the '''Sequences producing significant alignments...''' section&ndash; there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.
  
[[Image:DomainAnnotations.jpg|frame|none|SMART domain annotations for Mbp1 proteins from six fungal species.
 
]]
 
  
 
{{task|1=
 
{{task|1=
 +
#In the header section, click on '''Formatting options''' and in the line "Format for..." set the '''with inclusion threshold''' to <code>0.001</code> (This means E-values can't be above 10<sup>-03</sup> for the sequence to be included.)
 +
# Click on the '''Reformat''' button (top right).
 +
# In the table of sequence descriptions (not alignments!), click on the '''Query coverage''' to sort the table by coverage, not by score.
 +
# Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
 +
# '''Deselect''' the check mark next to these sequences in the right-hand column '''Select for PSI blast'''. (For me these are six sequences, but with YFO included that may be a bit different.)
 +
# Then scroll to '''Run PSI-BLAST iteration 2 ...''' and click on '''<code>Go</code>'''.
 +
}}
 +
 +
 +
This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the '''profile''', not with any individual sequence. Note that we are controlling what goes into the profile in two ways:
 +
# we are explicitly removing sequences with poor coverage; and
 +
# we are requiring a more stringent minimum E-value for each sequence.
  
On the Student Wiki, add the annotations for YFO to the plot:
 
  
# Copy one of the list definitions for Mbp1 domains and edit it with the appropriate values for your own annotations.
+
{{task|1=
# Test that you can add the YFO annotation to the plot.
+
#Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%.  
# Submit your validated code block to the [http://biochemistry.utoronto.ca/steipe/abc/students/index.php/BCH441_2014_Assignment_4_domain_annotations '''Student Wiki here''']. The goal is to compile an overview of all species we are studying in class. 
+
# Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
# If your working annotation block is in the Wiki before noontime on Wednesday, you will be awarded a 10% bonus on the quiz.
+
# Iterate the search in this way until no more "New" sequences are added to the profile. Then scan the list of excluded hits ... are there any from YFO that seem like they could potentially make the list? Marginal E-value perhaps, or reasonable E-value but less coverage? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
 
}}
 
}}
  
 +
 +
Once no "new" sequences have been added, if we were to repeat the process again and again, we would always get the same result because the profile stays the same. We say that the search has '''converged'''. Good. Time to harvest.
 +
 +
 +
{{task|1=
 +
# At the header, click on '''Taxonomy reports''' and find YFO in the '''Organism Report''' section. These are your APSES domain homologs. All of them. Actually, perhaps more than all: the report may also include sequences with E-values above the inclusion threshold.
 +
# From the report copy the sequence identifiers
 +
## from YFO,
 +
## with E-values above your defined threshold.
 +
}}
 +
 +
For example, the list of ''Saccharomyces'' genes is the following:
 +
 +
<code>
 +
<b>[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 Saccharomyces cerevisiae S288c]</b> [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4890 [ascomycetes]] taxid 559292<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320147&dopt=GenPept ref|NP_010227.1|] Mbp1p [Saccharomyces cerevisiae S288c]          [ 131]  1e-38<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320957&dopt=GenPept ref|NP_011036.1|] Swi4p [Saccharomyces cerevisiae S288c]          [ 123]  1e-35<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322808&dopt=GenPept ref|NP_012881.1|] Phd1p [Saccharomyces cerevisiae S288c]          [  91]  1e-25<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6323658&dopt=GenPept ref|NP_013729.1|] Sok2p [Saccharomyces cerevisiae S288c]          [  93]  3e-25<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322090&dopt=GenPept ref|NP_012165.1|] Xbp1p [Saccharomyces cerevisiae S288c]          [  40]  5e-07<br \>
 +
</code>
 +
 +
[[Saccharomyces cerevisiae Xbp1|Xbp1]] is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.
 +
 +
 +
Next we need to retrieve the sequences. Tedious to retrieve them one by one, but we can get them all at the same time:
 +
 +
 +
{{task|1=
 +
 +
# Return to the BLAST results page and again open the '''Formatting options'''.
 +
# Find the '''Limit results''' section and enter YFO's name into the field. For example <code>Saccharomyces cerevisiae [ORGN]</code>
 +
# Click on '''Reformat'''
 +
# Scroll to the '''Descriptions''' section, check the box at the left-hand margin, next to each sequence you want to keep. Then click on '''Download &rarr; FASTA complete sequence &rarr; Continue'''.
 +
 +
 +
 +
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">There are actually several ways to download lists of sequences. Using the results page utility is only one. But if you know the GIs of the sequences you need, you can get them more directly by putting them into the URL...
 +
<div  class="mw-collapsible-content">
 +
 +
* http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=docsum  - The default report
 +
* http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=fasta - FASTA sequences with NCBI HTML markup
 +
 +
Even more flexible is the [http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records '''eUtils'''] interface to the NCBI databases. For example you can download the dataset in text format by clicking below.
 +
 +
* http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=6320147,6320957,6322808,6323658,6322090&rettype=fasta&retmode=text
 +
 +
Note that this utility does not '''show''' anything, but downloads the (multi) fasta file to your default download directory.
 +
 +
</div>
 +
</div>
 +
}}
 +
<!--
 +
 +
Add to this assignment:
 +
- study the BLAST output format, links, tools, scores ...
 +
- compare the improvement in E-values to standard BLAST
 +
- examine this in terms of sensitivity and specificity
 +
 +
-->
  
  

Revision as of 14:28, 2 October 2015

Assignment for Week 4
Sequence alignment

< Assignment 3 Assignment 5 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.


 

Introduction

In this assignment we will perform an optimal global and local sequence alignment, and use R to plot the alignment quality as a colored bar-graph.


Optimal sequence alignments

Online programs for optimal sequence alignment are part of the EMBOSS tools. The programs take FASTA files as input.

Local optimal SEQUENCE alignment "water"

Task:

  1. Retrieve the FASTA file for the YFO Mbp1 protein and for Saccharomyces cerevisiae.
  2. Save the files as text files to your computer, (if you haven't done so already). You could give them an extension of .fa.
  3. Access the EMBOSS Explorer site (if you haven't done so yet, you might want to bookmark it.)
  4. Look for ALIGNMENT LOCAL, click on water, paste your FASTA sequences and run the program with default parameters.
  5. Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
  6. Considering the sequence identy cutoff we discussed in class (25% over the length of a domain), do you believe that the APSES domains are homologous?
  7. Change the Gap opening and Gap extension parameters to high values (e.g. 30 and 5). Then run the alignment again.
  8. Note what is different.
  9. You could try getting only an alignment for the ankyrin domains that you have found in the last assignment, by deleting the approximate region of the APSES domains from your input.


Global optimal SEQUENCE alignment "needle"

Task:

  1. Look for ALIGNMENT GLOBAL, click on needle, paste your FASTA sequences and run the program with default parameters.
  2. Study the results. You will find that the alignment extends over the entire protein, likely with long indels at the termini.
  3. Change the Output alignment format to FASTA pairwise simple, to retrieve the aligned FASTA files with indels.
  4. Copy the aligned sequences (with indels) and save them to your computer. You could give them an extension of .fal to remind you that they are aligned FASTA sequences.


 

The Mutation Data Matrix

The NCBI makes its alignment matrices available by ftp. They are located at ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the BLOSUM62 matrix[1]. Access that site and download the BLOSUM62 matrix to your computer. You could give it a filename of BLOSUM62.mdm.

It should look like this.

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4 
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4 
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4 
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4 
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4 
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4 
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4 
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4 
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4 
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4 
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4 
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4 
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4 
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4 
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4 
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4 
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4 
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4 
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4 
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4 
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1


Task:

  • Study this and make sure you understand what this table is, how it can be used, and what a reasonable range of values for identities and pairscores for non-identical, similar and dissimilar residues is. Ask on the mailing list in case you have questions.


 

The DNA binding site

Now, that you know how YFO Mbp1 aligns with yeast Mbp1, you can evaluate functional conservation in these homologous proteins. You probably already downloaded the two Biochemistry papers by Taylor et al. (2000) and by Deleeuw et al. (2008) that we encountered in Assignment 2. These discuss the residues involved in DNA binding[2]. In particular the residues between 50-74 have been proposed to comprise the DNA recognition domain.

Task:

  1. Using the APSES domain alignment you have just constructed, find the YFO Mbp1 residues that correspond to the range 50-74 in yeast.
  2. Note whether the sequences are especially highly conserved in this region.
  3. Using Chimera, look at the region. Use the sequence window to make sure that the sequence numbering between the paper and the PDB file are the same (they are often not identical!). Then select the residues - the proposed recognition domain - and color them differently for emphasis. Study this in stereo to get a sense of the spatial relationships. Check where the conserved residues are.
  4. A good representation is stick - but other representations that include sidechains will also serve well.
  5. Calculate a solvent accessible surface of the protein in a separate representation and make it transparent.
  6. You could combine three representations: (1) the backbone (in ribbon view), (2) the sidechains of residues that presumably contact DNA, distinctly colored, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface.


DNA binding interfaces are expected to comprise a number of positively charged amino acids, that might form salt-bridges with the phosphate backbone.


Task:

  • Study and consider whether this is the case here and which residues might be included.


 

R code: coloring the alignment by quality

Task:

  • Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
# BiostringsExample.R
# Short tutorial on sequence alignment with the Biostrings package.
# Boris Steipe for BCH441, 2013 - 2014
#
setwd("~/path/to/your/R_files/")
setwd("~/Documents/07.TEACHING/37-BCH441 Bioinformatics 2014/05-Materials/Assignment_5 data")

# Biostrings is a package within the bioconductor project.
# bioconducter packages have their own installation system,
# they are normally not installed via CRAN.

# First, you load the BioConductor installer...
source("http://bioconductor.org/biocLite.R")

# Then you can install the Biostrings package and all of its dependencies.
biocLite("Biostrings")

# ... and load the library.
library(Biostrings)

# Some basic (technical) information is available ...
library(help=Biostrings)

# ... but for more in depth documentation, use the
# so called "vignettes" that are provided with every R package.
browseVignettes("Biostrings")

# In this code, we mostly use functions that are discussed in the 
# pairwise alignement vignette.

# Read in two fasta files - you will need to edit this for YFO
sacce <- readAAStringSet("mbp1-sacce.fa", format="fasta")

# "USTMA" is used only as an example here - modify for YFO  :-)
ustma <- readAAStringSet("mbp1-ustma.fa", format="fasta")

sacce
names(sacce) 
names(sacce) <- "Mbp1 SACCE"
names(ustma) <- "Mbp1 USTMA" # Example only ... modify for YFO

width(sacce)
as.character(sacce)

# Biostrings takes a sophisticated approach to sequence alignment ...
?pairwiseAlignment

# ... but the use in practice is quite simple:
ali <- pairwiseAlignment(sacce, ustma, substitutionMatrix = "BLOSUM50")
ali

pattern(ali)
subject(ali)

writePairwiseAlignments(ali)

p <- aligned(pattern(ali))
names(p) <- "Mbp1 SACCE aligned"
s <- aligned(subject(ali))
names(s) <- "Mbp1 USTMA aligned"

# don't overwrite your EMBOSS .fal files
writeXStringSet(p, "mbp1-sacce.R.fal", append=FALSE, format="fasta")
writeXStringSet(s, "mbp1-ustma.R.fal", append=FALSE, format="fasta")

# Done.
  • Compare the alignments you received from the EMBOSS server, and that you co puted using R. Are they aproximately the same? Exactly? You did use different matrices and gap aameters, so minor differences are to be expected. But by and large you should get the same alignments.

We will now use the aligned sequences to compute a graphical display of alignment quality.


Task:

  • Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
# aliScore.R
# Evaluating an alignment with a sliding window score
# Boris Steipe, October 2012. Update October 2013
setwd("~/path/to/your/R_files/")

# Scoring matrices can be found at the NCBI. 
# ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62

# It is good practice to set variables you might want to change
# in a header block so you don't need to hunt all over the code
# for strings you need to update.
#
fa1      <- "mbp1-sacce.R.fal"
fa2      <- "mbp1-ustma.R.fal"
code1    <- "SACCE"
code2    <- "USTMA"
mdmFile  <- "BLOSUM62.mdm"
window   <- 9   # window-size (should be an odd integer)

# ================================================
#    Read data files
# ================================================

# read fasta datafiles using seqinr function read.fasta()
install.packages("seqinr")
library(seqinr)
tmp  <- unlist(read.fasta(fa1, seqtype="AA", as.string=FALSE, seqonly=TRUE))
seq1 <- unlist(strsplit(as.character(tmp), split=""))

tmp  <- unlist(read.fasta(fa2, seqtype="AA", as.string=FALSE, seqonly=TRUE))
seq2 <- unlist(strsplit(as.character(tmp), split=""))

if (length(seq1) != length(seq2)) {
	print("Error: Sequences have unequal length!")
	}
	
lSeq <- length(seq1)

# ================================================
#    Read scoring matrix
# ================================================

MDM <- read.table(mdmFile, skip=6)

# This is a dataframe. Study how it can be accessed:

MDM
MDM[1,]
MDM[,1]
MDM[5,5]   # Cys-Cys
MDM[20,20] # Val-Val
MDM[,"W"]  # the tryptophan column
MDM["R","W"]  # Arg-Trp pairscore
MDM["W","R"]  # Trp-Arg pairscore: pairscores are symmetric

colnames(MDM)  # names of columns
rownames(MDM)  # names of rows
colnames(MDM)[3]   # third column
rownames(MDM)[12]  # twelfth row

# change the two "*" names to "-" so we can use them to score
# indels of the alignment. This is a bit of a hack, since this
# does not reflect the actual indel penalties (which is, as you)
# remember from your lectures, calculated as a gap opening
# + gap extension penalty; it can't be calculated in a pairwise
# manner) EMBOSS defaults for BLODSUM62 are opening -10 and
# extension -0.5 i.e. a gap of size 3 (-11.5) has approximately
# the same penalty as a 3-character score of "-" matches (-12)
# so a pairscore of -4 is not entirely unreasonable.

colnames(MDM)[24] 
rownames(MDM)[24]
colnames(MDM)[24] <- "-"
rownames(MDM)[24] <- "-"
colnames(MDM)[24] 
rownames(MDM)[24]
MDM["Q", "-"]
MDM["-", "D"]
# so far so good.

# ================================================
#    Tabulate pairscores for alignment
# ================================================


# It is trivial to create a pairscore vector along the
# length of the aligned sequences.

PS <- vector()
for (i in 1:lSeq) {
   aa1 <- seq1[i] 
   aa2 <- seq2[i] 
   PS[i] = MDM[aa1, aa2]
}

PS


# The same vector could be created - albeit perhaps not so
# easy to understand - with the expression ...
MDM[cbind(seq1,seq2)]



# ================================================
#    Calculate moving averages
# ================================================

# In order to evaluate the alignment, we will calculate a 
# sliding window average over the pairscores. Somewhat surprisingly
# R doesn't (yet) have a native function for moving averages: options
# that are quoted are:
#   - rollmean() in the "zoo" package http://rss.acs.unt.edu/Rdoc/library/zoo/html/rollmean.html
#   - MovingAverages() in "TTR"  http://rss.acs.unt.edu/Rdoc/library/TTR/html/MovingAverages.html
#   - ma() in "forecast"  http://robjhyndman.com/software/forecast/
# But since this is easy to code, we shall implement it ourselves.

PSma <- vector()           # will hold the averages
winS <- floor(window/2)    # span of elements above/below the centre
winC <- winS+1             # centre of the window

# extend the vector PS with zeros (virtual observations) above and below
PS <- c(rep(0, winS), PS , rep(0, winS))

# initialize the window score for the first position
winScore <- sum(PS[1:window])

# write the first score to PSma
PSma[1] <- winScore

# Slide the window along the sequence, and recalculate sum()
# Loop from the next position, to the last position that does not exceed the vector...
for (i in (winC + 1):(lSeq + winS)) { 
   # subtract the value that has just dropped out of the window
   winScore <- winScore - PS[(i-winS-1)] 
   # add the value that has just entered the window
   winScore <- winScore + PS[(i+winS)]  
   # put score into PSma
   PSma[i-winS] <- winScore
}

# convert the sums to averages
PSma <- PSma / window

# have a quick look at the score distributions

boxplot(PSma)
hist(PSma)

# ================================================
#    Plot the alignment scores
# ================================================

# normalize the scores 
PSma <- (PSma-min(PSma))/(max(PSma) - min(PSma) + 0.0001)
# spread the normalized values to a desired range, n
nCol <- 10
PSma <- floor(PSma * nCol) + 1

# Assign a colorspectrum to a vector (with a bit of colormagic,
# don't worry about that for now). Dark colors are poor scores,
# "hot" colors are high scores
spect <- colorRampPalette(c("black", "red", "yellow", "white"), bias=0.4)(nCol)

# Color is an often abused aspect of plotting. One can use color to label
# *quantities* or *qualities*. For the most part, our pairscores measure amino
# acid similarity. That is a quantity and with the spectrum that we just defined
# we associte the measured quantities with the color of a glowing piece
# of metal: we start with black #000000, then first we ramp up the red
# (i.e. low-energy) part of the visible spectrum to red #FF0000, then we
# add and ramp up the green spectrum giving us yellow #FFFF00 and finally we 
# add blue, giving us white #FFFFFF. Let's have a look at the spectrum:

s <- rep(1, nCol)
barplot(s, col=spect, axes=F, main="Color spectrum")

# But one aspect of our data is not quantitatively different: indels.
# We valued indels with pairscores of -4. But indels are not simply poor alignment, 
# rather they are non-alignment. This means stretches of -4 values are really 
# *qualitatively* different. Let's color them differently by changing the lowest 
# level of the spectrum to grey.

spect[1] <- "#CCCCCC"
barplot(s, col=spect, axes=F, main="Color spectrum")

# Now we can display our alignment score vector with colored rectangles.

# Convert the integers in PSma to color values from spect
PScol <- vector()
for (i in 1:length(PSma)) {
	PScol[i] <- spect[ PSma[i] ]  # this is how a value from PSma is used as an index of spect
}

# Plot the scores. The code is similar to the last assignment.
# Create an empty plot window of appropriate size
plot(1,1, xlim=c(-100, lSeq), ylim=c(0, 2) , type="n", yaxt="n", bty="n", xlab="position in alignment", ylab="")

# Add a label to the left
text (-30, 1, adj=1, labels=c(paste("Mbp1:\n", code1, "\nvs.\n", code2)), cex=0.9 )

# Loop over the vector and draw boxes  without border, filled with color.
for (i in 1:lSeq) {
   rect(i, 0.9, i+1, 1.1, border=NA, col=PScol[i])
}

# Note that the numbers along the X-axis are not sequence numbers, but numbers
# of the alignment, i.e. sequence number + indel length. That is important to
# realize: if you would like to add the annotations from the last assignment 
# which I will leave as an exercise, you need to map your sequence numbering
# into alignment numbering. Let me know in case you try that but need some help.


That is all.


 

 

Introduction


 

Take care of things, and they will take care of you.
Shunryu Suzuki


Anyone can click buttons on a Web page, but to use the powerful sequence database search tools right often takes considerable more care, caution and consideration.

Much of what we know about a protein's physiological function is based on the conservation of that function as the species evolves. We assess conservation by comparing sequences between related proteins. Conservation - or its opposite: variation - is a consequence of selection under constraints: protein sequences change as a consequence of DNA mutations, this changes the protein's structure, this in turn changes functions and that has the multiple effects on a species' fitness function. Detrimental variants may be removed. Variation that is tolerated is largely neutral and therefore found only in positions that are neither structurally nor functionally critical. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, or amino acid propensities as predictors for protein engineering and design tasks.

Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of the essential properties a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for

  • functional annotation;
  • protein homology modeling;
  • phylogenetic analyses, and
  • sensitive homology searches in databases.

In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is where the trouble begins. All interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologs, or paralogs as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:

  • orthologs are expected to be functionally and structurally conserved;
  • paralogs may have divergent function but have similar structure;
  • missing genes may make paralogs look like orthologs; and
  • selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.


In this assignment, we will set ourselves the task to use PSI-BLAST and find all orthologs and paralogs of the APSES domain containing transcription factors in YFO. We will use these sequences later for multiple alignments, calculation of conservation etc. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search in one organism. There is an issue to consider:

  • If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are in YFO is too small. Thus the search will not become very sensitive.
  • If we don't restrict our search, but search in all species, the number of hits may become too large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage, and how will we evaluate the fringe cases of marginal E-value, where we need to decide whether to include a new sequence in the profile, or whether to hold off on it for one or two iterations, to see whether the E-value drops significantly. Profile corruption would make the search useless. This is maybe still manageable if we restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added.

Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile.


Thus in practice, a sensitive PSI-BLAST search needs to address two issues before we begin:

  1. We need to define the sequence we are searching with; and
  2. We need to define the dataset we are searching in.



Defining the sequence to search with

Consider again the task we set out from: find all orthologs and paralogs of the APSES domain containing transcription factors in YFO.


Task:
What query sequence should you use? Should you ...


  1. Search with the full-length Mbp1 sequence from Saccharomyces cerevisiae?
  2. Search with the full-length Mbp1 homolog that you found in YFO?
  3. Search with the structurally defined S. cerevisiae APSES domain sequence?
  4. Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
  5. Search with the KilA-N domain sequence?


Reflect on this (pretend this is a quiz question) and come up with a reasoned answer. Then click on "Expand" to read my opinion on this question.
The full-length Mbp1 sequence from Saccharomyces cerevisiae
Since this sequence contains multiple domains (in particular the ubiquitous Ankyrin domains) it is not suitable for BLAST database searches. You must restrict your search to the domain of greatest interest for your question. That would be the APSES domain.
The full-length Mbp1 homolog that you found in YFO
What organism the search sequence comes from does not make a difference. Since you aim to find all homologs in YFO, it is not necessary to have your search sequence more or less similar to any particular homologs. In fact any APSES sequence should give you the same result, since they are all homologous. But the full-length sequence in YFO has the same problem as the Saccharomyces sequence.
The structurally defined S. cerevisiae APSES domain sequence?
That would be my first choice, just because it is structurally well defined as a complete domain, and the sequence is easy to obtain from the 1BM8 PDB entry. (1MB1 would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)
The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
As argued above: since they are all homologs, any of them should lead to the same set of results.
The KilA-N domain sequence?
This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search more sensitive. The fact that it is shorter might make the search less specific. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.


So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the PDB (i.e. here, or from Assignment 3.

 

Selecting species for a PSI-BLAST search

As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on BLAST to find all of them, we need to use the much more sensitive search of PSI-BLAST instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale why we include sequences in the search that we are not actually interested in.


But you can also search with too many species: if the number of species is large and PSI-BLAST finds a large number of results:

  1. it becomes unwieldy to check the newly included sequences at each iteration, inclusion of false-positive hits may result, profile corruption and loss of specificity. The search will fail.
  2. since genomes from some parts of the Tree Of Life are over represented, the inclusion of all sequences leads to selection bias and loss of sensitivity.


We should therefore try to find a subset of species

  1. that represent as large a range as possible on the evolutionary tree;
  2. that are as well distributed as possible on the tree; and
  3. whose genomes are fully sequenced.

These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.

How can we define a list of such species, and how can we use the list?

The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes". Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. Regrettably you can't combine such information easily with the resources available from the NCBI.

We will use an approach that is conceptually similar: selecting a set of species according to their shared taxonomic rank in the tree of life. Biological classification provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired–but not mandated–that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired–but not madated–that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say that it is in flux.

If we follow a link to an entry in the NCBI's Taxonomy database, eg. Saccharomyces cerevisiae S228c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:


cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; 
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes; 
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae


These names can be mapped into taxonomic ranks ranks, since the suffixes of these names e.g. -mycotina, -mycetaceae are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)

Rank Suffix Example
Domain Eukaryota (Eukarya)
  Subdomain   Opisthokonta
Kingdom   Fungi
  Subkingdom   Dikarya
Phylum   Ascomycota
  rankless taxon[3] -myceta Saccharomyceta
  Subphylum -mycotina Saccharomycotina
Class -mycetes Saccharomycetes
  Subclass -mycetidae  
Order -ales Saccharomycetales
Family -aceae Saccharomycetaceae
  Subfamily -oideae  
  Tribe -eae  
  Subtribe -ineae  
Genus   Saccharomyces
Species   Saccharomyces cerevisiae
You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record. But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major domains of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under Mammalia (a class rank, and depending how many species you would want to include, use the subclass-, order-, or family rank (hover over the names to see their taxonomic rank.) There will still be quite a bit of manual work involved and an exploration of different options on the Web may be useful. For our purposes here we can retrieve a good set of organisms from the ensembl fungal genomes page - maintained by the EBI's genome annotation group - that lists species grouped by taxonomic order. All of these organisms are genome-sequenced, we can pick a set of representatives:
  1. Capnodiales   Zymoseptoria tritici
  2. Erysiphales   Blumeria graminis
  3. Eurotiales   Aspergillus nidulans
  4. Glomerellales   Glomerella graminicola
  5. Hypocreales   Trichoderma reesei
  6. Magnaporthales   Magnaporthe oryzae
  7. Microbotryales   Microbotryum violaceum
  8. Pezizales   Tuber melanosporum
  9. Pleosporales   Phaeosphaeria nodorum
  10. Pucciniales   Puccinia graminis
  11. Saccharomycetales   Saccharomyces cerevisiae
  12. Schizosaccharomycetales   Schizosaccharomyces pombe
  13. Sclerotiniaceae   Sclerotinia sclerotiorum
  14. Sordariales   Neurospora crassa
  15. Tremellales   Cryptococcus neoformans
  16. Ustilaginales   Ustilago maydis
This set of organisms thus can be used to generate a PSI-BLAST search in a well-distributed set of species. Of course you must also include YFO (if YFO is not in this list already). To enter these 16 species as an Entrez restriction, they need to be formatted as below. (One could also enter species one by one, by pressing the (+) button after the organism list)
Aspergillus nidulans[orgn]
OR Blumeria graminis[orgn]
OR Cryptococcus neoformans[orgn]
OR Glomerella graminicola[orgn]
OR Magnaporthe oryzae[orgn]
OR Microbotryum violaceum[orgn] 
OR Neurospora crassa[orgn]
OR Phaeosphaeria nodorum[orgn]
OR Puccinia graminis[orgn]
OR Sclerotinia sclerotiorum[orgn]
OR Trichoderma reesei[orgn]
OR Tuber melanosporum[orgn]
OR Saccharomyces cerevisiae[orgn]
OR Schizosaccharomyces pombe[orgn]
OR Ustilago maydis[orgn]
OR Zymoseptoria tritici[orgn]


 

Executing the PSI-BLAST search

We have a list of species. Good. Next up: how do we use it.

Task:

  1. Navigate to the BLAST homepage.
  2. Select protein BLAST.
  3. Paste the APSES domain sequence into the search field.
  4. Select refseq as the database.
  5. Copy the organism restriction list from above and enter the correct name for YFO into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the Entrez Query field.
  6. In the Algorithm section, select PSI-BLAST.
  7. Click on BLAST.


Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an E-value of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits– in the Sequences producing significant alignments... section– there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, the will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.


Task:

  1. In the header section, click on Formatting options and in the line "Format for..." set the with inclusion threshold to 0.001 (This means E-values can't be above 10-03 for the sequence to be included.)
  2. Click on the Reformat button (top right).
  3. In the table of sequence descriptions (not alignments!), click on the Query coverage to sort the table by coverage, not by score.
  4. Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
  5. Deselect the check mark next to these sequences in the right-hand column Select for PSI blast. (For me these are six sequences, but with YFO included that may be a bit different.)
  6. Then scroll to Run PSI-BLAST iteration 2 ... and click on Go.


This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:

  1. we are explicitly removing sequences with poor coverage; and
  2. we are requiring a more stringent minimum E-value for each sequence.


Task:

  1. Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%.
  2. Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
  3. Iterate the search in this way until no more "New" sequences are added to the profile. Then scan the list of excluded hits ... are there any from YFO that seem like they could potentially make the list? Marginal E-value perhaps, or reasonable E-value but less coverage? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...


Once no "new" sequences have been added, if we were to repeat the process again and again, we would always get the same result because the profile stays the same. We say that the search has converged. Good. Time to harvest.


Task:

  1. At the header, click on Taxonomy reports and find YFO in the Organism Report section. These are your APSES domain homologs. All of them. Actually, perhaps more than all: the report may also include sequences with E-values above the inclusion threshold.
  2. From the report copy the sequence identifiers
    1. from YFO,
    2. with E-values above your defined threshold.

For example, the list of Saccharomyces genes is the following:

Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292
ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] [ 131] 1e-38
ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c] [ 123] 1e-35
ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c] [ 91] 1e-25
ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c] [ 93] 3e-25
ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c] [ 40] 5e-07

Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.


Next we need to retrieve the sequences. Tedious to retrieve them one by one, but we can get them all at the same time:


Task:

  1. Return to the BLAST results page and again open the Formatting options.
  2. Find the Limit results section and enter YFO's name into the field. For example Saccharomyces cerevisiae [ORGN]
  3. Click on Reformat
  4. Scroll to the Descriptions section, check the box at the left-hand margin, next to each sequence you want to keep. Then click on Download → FASTA complete sequence → Continue.


There are actually several ways to download lists of sequences. Using the results page utility is only one. But if you know the GIs of the sequences you need, you can get them more directly by putting them into the URL...

Even more flexible is the eUtils interface to the NCBI databases. For example you can download the dataset in text format by clicking below.

Note that this utility does not show anything, but downloads the (multi) fasta file to your default download directory.


That is all.


 

Links and resources

 


Footnotes and references

  1. The directory also contains sourcecode to generte the PAM matrices. This may be of interest for you if you ever want to produce scoring matrices from your own datasets.
  2. (Taylor et al. (2000) Biochemistry 39: 3943-3954 and Deleeuw et al. (2008) Biochemistry. 47:6378-6385)
  3. The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 3 Assignment 5 >