Difference between revisions of "BIO Assignment Week 4"

From "A B C"
Line 53: Line 53:
 
 
 
 
  
== The Mutation Data Matrix ==
+
=== The Mutation Data Matrix ===
  
 
The NCBI makes its alignment matrices available by ftp. They are located at ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 '''BLOSUM62 matrix''']<ref>The directory also contains source code to generate the PAM matrices. This may be of interest to you if you ever want to produce scoring matrices from your own datasets.</ref>. Access that site and download the <code>BLOSUM62</code> matrix to your computer. You could give it a filename of <code>BLOSUM62.mdm</code>.
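The matrix files are plain text: comment lines that begin with <code>#</code>, a header row of residue codes, then one labelled row per residue. The following is a minimal sketch of how such a file can be read in '''R'''. It uses a four-residue excerpt in the same layout instead of the downloaded file, and the names <code>mdmText</code> and <code>toyMDM</code> are made up for this illustration.

```r
# A four-residue excerpt in the layout of the NCBI matrix files:
# comment lines start with "#", then a header row, then labelled rows.
# (Toy data for illustration only - use the downloaded file for real work.)
mdmText <- "#  Toy excerpt for illustration only
   A  R  N  D
A  4 -1 -2 -2
R -1  5  0 -2
N -2  0  6  1
D -2 -2  1  6"

# read.table() skips "#" comment lines by default (comment.char = "#").
# Because the header row has one field fewer than the data rows,
# the first column is used as row names.
toyMDM <- as.matrix(read.table(text = mdmText))

toyMDM["A", "R"]      # look up a single pairscore by residue names
isSymmetric(toyMDM)   # pairscores should be symmetric
```

For the real file, replacing <code>text = mdmText</code> with the filename should work the same way, but check the row and column names after reading: the full matrix contains a <code>*</code> label that '''R''' may rename.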
Line 101: Line 101:
  
 
&nbsp;
 
&nbsp;
== The DNA binding site ==
+
=== The DNA binding site ===
  
  
Line 125: Line 125:
  
 
&nbsp;
 
&nbsp;
== R code: coloring the alignment by quality ==
 
  
 +
&nbsp;
  
 +
==BLAST==
  
{{task|1=
 
  
* Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
+
One of the foundations of bioinformatics is the empirical observation that related sequences conserve structure, and often function. This is the basis on which we can make inferences from well-studied model organisms in species that have not been studied as deeply. The model case for our assignments is to take annotations from baker's yeast, ''Saccharomyces cerevisiae'' and apply them to YFO.
 +
 
 +
Therefore, in this assignment we will
 +
* use the sequence search program BLAST to retrieve a sequence similar to yeast Mbp1 in YFO;
 +
* use a number of tools to annotate the sequence.
  
<source lang="R">
+
Keeping with our theme of sequence analysis, we will
# BiostringsExample.R
+
* explore EMBOSS tools;
# Short tutorial on sequence alignment with the Biostrings package.
+
* compute and plot relative amino acid frequencies in '''R''';
# Boris Steipe for BCH441, 2013 - 2014
+
* and (optionally) use Chimera to explore H-bond patterns in the Mbp1 APSES domain structure.
#
 
setwd("~/path/to/your/R_files/")
 
setwd("~/Documents/07.TEACHING/37-BCH441 Bioinformatics 2014/05-Materials/Assignment_5 data")
 
  
# Biostrings is a package within the Bioconductor project.
+
&nbsp;
# Bioconductor packages have their own installation system,
 
# they are normally not installed via CRAN.
 
  
# First, you load the Bioconductor installer...
+
===Retrieve===
source("http://bioconductor.org/biocLite.R")
 
  
# Then you can install the Biostrings package and all of its dependencies.
 
biocLite("Biostrings")
 
  
# ... and load the library.
+
In [[BIO_Assignment_Week_2#Protein|Assignment 2]] you looked at sequences in YFO that are [http://www.ncbi.nlm.nih.gov/protein?LinkName=protein_protein&from_uid=6320147 related to yeast Mbp1], by following a link from the RefSeq record. I mentioned that there are more principled ways to find related proteins: that principle is to search for similar sequences. Exactly how this works will be the subject of later lectures, but the tool that is most commonly used for this task is called '''BLAST''' (Basic Local Alignment Search Tool). The task of this assignment is to perform a number of sequence annotations on the sequence from YFO that is '''most similar''' to Mbp1, or, more precisely, that contains an APSES domain that is most similar<ref>As you will see later on in the assignment, Mbp1-related proteins contain "Ankyrin" domains, a very widely distributed protein-protein interaction motif that may give rise to false-positive similarities for full-length sequence searches. Therefore, we search only with the DNA binding domain sequence, since this is the functionality that best characterizes the "function" of the protein we are interested in.</ref>.
library(Biostrings)
 
  
# Some basic (technical) information is available ...
+
&nbsp;
library(help=Biostrings)
+
===Search input===
  
# ... but for more in depth documentation, use the
 
# so called "vignettes" that are provided with every R package.
 
browseVignettes("Biostrings")
 
  
# In this code, we mostly use functions that are discussed in the  
+
First, we need to '''define the sequence''' we will search with, as the search input.  
# pairwise alignment vignette.
 

# Read in two fasta files - you will need to edit this for YFO
 
sacce <- readAAStringSet("mbp1-sacce.fa", format="fasta")
 
  
# "USTMA" is used only as an example here - modify for YFO  :-)
 
ustma <- readAAStringSet("mbp1-ustma.fa", format="fasta")
 
  
sacce
+
====Defining the sequence to search with====
names(sacce)
 
names(sacce) <- "Mbp1 SACCE"
 
names(ustma) <- "Mbp1 USTMA" # Example only ... modify for YFO
 
  
width(sacce)
+
I have highlighted the extent of the APSES domain sequence in the previous assignment, but when you explored the corresponding structure in Chimera, you saw that the structured protein domain is larger and the additional secondary structure elements are in fact well integrated into the overall domain. This is not surprising: canonical domain definitions are compiled from many species and examples, and they generally comprise only the common core. Looking up the source of the domain annotations for Mbp1 is very easy:
as.character(sacce)
 
  
# Biostrings takes a sophisticated approach to sequence alignment ...
 
?pairwiseAlignment
 
  
# ... but the use in practice is quite simple:
+
{{task|1=
ali <- pairwiseAlignment(sacce, ustma, substitutionMatrix = "BLOSUM50")
+
<ol>
ali
+
<li> Access the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1].</li>
 +
<li> While you are here, download a FASTA formatted version of the sequence to your '''R''' working directory and give it a filename of <code>mbp1-sacce.fa</code>. We will need it later. <small>It should be straightforward from the NCBI page how to achieve that. As a hint, you need to use the '''Send to...''' link to actually download the file.</small></li>
 +
<li>  On the RefSeq page, look for the link '''Related Information''' &rarr; '''CDD Search Results''' and  follow it.</li>
 +
</ol>
  
pattern(ali)
 
subject(ali)
 
  
writePairwiseAlignments(ali)
+
This is a domain annotation: CDD is the NCBI's '''C'''onserved '''D'''omain '''D'''atabase and the annotation was done by a tool that scanned the sequence of Mbp1 for segments that are similar to any of the domain definitions stored in the CDD. We will return to CDD in the next assignment.
  
p <- aligned(pattern(ali))
+
<ol start="4">
names(p) <- "Mbp1 SACCE aligned"
+
<li>Click on the blue box labeled Kila-N in the graph to access the CDD entry for this domain.</li>
s <- aligned(subject(ali))
+
<li>Read the abstract. You should understand the relationship between Kila-N and APSES domains. One is a subfamily of the other.</li>
names(s) <- "Mbp1 USTMA aligned"
+
<li>Confirm that the domain definition &ndash; as applied to the Mbp1 sequence (which is labeled as "query") &ndash; corresponds to the region we highlighted in the last assignment.</li>
 +
</ol>
  
# don't overwrite your EMBOSS .fal files
 
writeXStringSet(p, "mbp1-sacce.R.fal", append=FALSE, format="fasta")
 
writeXStringSet(s, "mbp1-ustma.R.fal", append=FALSE, format="fasta")
 
  
# Done.
+
What precisely constitutes an APSES domain however is a matter of definition, as you can explore in the following (optional) task.
  
</source>
 
  
* Compare the alignments you received from the EMBOSS server with those you computed using '''R'''. Are they approximately the same? Exactly? You did use different matrices and gap parameters, so minor differences are to be expected. But by and large you should get the same alignments.
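One way to make such a comparison concrete is to ask, for each residue of the first sequence, which residue of the second sequence it is paired with, and then to count how often two alignments agree. The sketch below does this for made-up toy alignments; the function name <code>pairing</code> and all sequences are hypothetical, and it assumes that each alignment is given as two rows of equal length with <code>-</code> marking gaps.

```r
# Hypothetical helper: for each residue of the first sequence, return the
# index of the residue of the second sequence it is paired with (NA = gap).
pairing <- function(row1, row2) {
  x <- strsplit(row1, "")[[1]]
  y <- strsplit(row2, "")[[1]]
  stopifnot(length(x) == length(y))
  idx2 <- ifelse(y == "-", NA, cumsum(y != "-"))  # residue numbers in sequence 2
  idx2[x != "-"]                                  # keep columns holding a sequence-1 residue
}

# Two made-up alignments of the same pair of toy sequences.
pA <- pairing("MSNQ-IYS", "MS-QAIYS")   # e.g. from one program
pB <- pairing("MSNQIYS",  "MSQAIYS")    # e.g. from another program

# Two alignments agree at a residue if they pair it with the same partner,
# or if both pair it with a gap.
agree <- (pA == pB) | (is.na(pA) & is.na(pB))
agree[is.na(agree)] <- FALSE

sum(agree) / length(agree)   # fraction of residues paired identically
```

Because the comparison is made per residue and not per alignment column, it still works when the two alignments have different numbers of gap columns.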
+
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">Optional: Load the structure in Chimera, like you did in the last assignment and switch on stereo viewing ... (more) <div  class="mw-collapsible-content">
 +
<ol start="7">
 +
<li>Display the protein in ribbon style, e.g. with the '''Interactive 1''' preset.
 +
<li>Access the '''Interpro''' information page for Mbp1 at the EBI: http://www.ebi.ac.uk/interpro/protein/P39678
 +
<li>In the section '''Domains and repeats''', mouse over the red annotations and note down the residue numbers for the annotated domains. Also follow the links to the respective Interpro domain definition pages.
 +
</ol>
  
}}
+
At this point we have definitions for the following regions on the Mbp1 protein ...
 +
*The KilA-N (pfam 04383) domain definition as applied to the Mbp1 protein sequence by CDD;
 +
*The InterPro ''KilA, N-terminal/APSES-type HTH, DNA-binding (IPR018004)'' definition annotated on the Mbp1 sequence;
 +
*The InterPro ''Transcription regulator HTH, APSES-type DNA-binding domain (IPR003163)'' definition annotated on the Mbp1 sequence;
 +
*<small>(... in addition &ndash; without following the source here &ndash; the UniProt record for Mbp1 annotates a "HTH APSES-type" domain from residues 5-111)</small>
  
We will now use the aligned sequences to compute a graphical display of alignment quality.
+
... each with its distinct and partially overlapping sequence range. Back to Chimera:
  
 +
<!-- For reference:
 +
1MB1: 3-100
 +
1BM8: 4-102
 +
CDD KilA-N: 19-93
 +
InterPro KilA-N: 23-88
 +
InterPro APSES: 3-133
 +
Uniprot HTH/APSES: 5-111
 +
-->
  
{{task|1=
+
<ol start="10">
 +
<li>In the sequence window, select the sequence corresponding to the '''Interpro KilA-N''' annotation and colour this fragment red. <small>Remember that you can get the sequence numbers of a residue in the sequence window when you hover the pointer over it - but do confirm that the sequence numbering that Chimera displays matches the numbering of the Interpro domain definition.</small></li>
  
* Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
+
<li>Then select the residue range(s) by which the '''CDD KilA-N''' definition is larger, and colour that fragment orange.</li>
  
<source lang="R">
+
<li>Then select the residue range(s) by which the '''InterPro APSES domain''' definition is larger, and colour that fragment yellow.</li>
# aliScore.R
 
# Evaluating an alignment with a sliding window score
 
# Boris Steipe, October 2012. Update October 2013
 
setwd("~/path/to/your/R_files/")
 
  
# Scoring matrices can be found at the NCBI.  
+
<li>If the structure contains residues outside these ranges, colour these white.</li>
# ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
 
  
# It is good practice to set variables you might want to change
+
<li>Study this in a side-by-side stereo view and get a sense for how the ''extra'' sequence beyond the KilA-N domain(s) is part of the structure, and how the integrity of the folded structure would be affected if these fragments were missing.</li>
# in a header block so you don't need to hunt all over the code
 
# for strings you need to update.
 
#
 
fa1      <- "mbp1-sacce.R.fal"
 
fa2      <- "mbp1-ustma.R.fal"
 
code1    <- "SACCE"
 
code2    <- "USTMA"
 
mdmFile  <- "BLOSUM62.mdm"
 
window  <- 9  # window-size (should be an odd integer)
 
  
# ================================================
+
<li>Display Hydrogen bonds, to get a sense of interactions between residues from the differently colored parts. First show the protein as a stick model, with sticks that are thicker than the default to give a better sense of sidechain packing:<br />
#    Read data files
+
::(i) '''Select''' &rarr; '''Select all''' <br />
# ================================================
+
::(ii) '''Actions''' &rarr; '''Ribbon''' &rarr; '''hide''' <br />
 +
::(iii) '''Select''' &rarr; '''Structure''' &rarr; '''protein''' <br />
 +
::(iv) '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''show''' <br />
 +
::(v)  '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''stick''' <br />
 +
::(vi) click on the looking glass icon at the bottom right of the graphics window to bring up the inspector window and choose '''Inspect ... Bond'''. Change the radius to 0.4.<br />
 +
</li>
  
# read fasta datafiles using seqinr function read.fasta()
+
<li>Then calculate and display the hydrogen bonds:<br />
install.packages("seqinr")
+
::(vii) '''Tools''' &rarr; '''Surface/Binding Analysis''' &rarr; '''FindHbond''' <br />
library(seqinr)
+
::(viii) Set the '''Line width''' to 3.0, leave all other parameters at their default values and click '''Apply'''<br />
tmp  <- unlist(read.fasta(fa1, seqtype="AA", as.string=FALSE, seqonly=TRUE))
+
:: Clear the selection.<br />
seq1 <- unlist(strsplit(as.character(tmp), split=""))
+
Study this view, especially regarding side chain H-bonds. Are there many? Do side chains interact more with other sidechains, or with the backbone?
 +
</li>
  
tmp  <- unlist(read.fasta(fa2, seqtype="AA", as.string=FALSE, seqonly=TRUE))
+
<li>Let's now simplify the scene a bit and focus on backbone/backbone H-bonds:<br />
seq2 <- unlist(strsplit(as.character(tmp), split=""))
+
::(ix) '''Select''' &rarr; '''Structure''' &rarr; '''Backbone''' &rarr; '''full'''<br />
 +
::(x) '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''show only'''<br /><br />
 +
:: Clear the selection.<br />
 +
In this way you can appreciate how H-bonds build secondary structure - &alpha;-helices and &beta;-sheets - and how these interact with each other ... in part '''across the KilA-N boundary'''.
 +
</li>
  
if (length(seq1) != length(seq2)) {
 
  stop("Sequences have unequal length!")  # halt: the scoring loop below needs equal lengths
 
}
 
 
lSeq <- length(seq1)
 
  
# ================================================
+
<li>Save the resulting image as a jpeg no larger than 600px across and upload it to your Lab notebook on the Wiki.</li>
#    Read scoring matrix
+
<li>When you are done, congratulate yourself on having earned a bonus of 10% on the next quiz.</li>
# ================================================
+
</ol>
  
MDM <- read.table(mdmFile, skip=6)
+
</div>
 +
</div>
  
# This is a dataframe. Study how it can be accessed:
 
  
MDM
+
There is a rather important lesson in this: domain definitions may be fluid. Their boundaries may be computationally derived from sequence comparisons across many families, and they do not necessarily correspond to individual structures. Make sure you understand this well.
MDM[1,]
+
}}
MDM[,1]
 
MDM[5,5]  # Cys-Cys
 
MDM[20,20] # Val-Val
 
MDM[,"W"]  # the tryptophan column
 
MDM["R","W"]  # Arg-Trp pairscore
 
MDM["W","R"]  # Trp-Arg pairscore: pairscores are symmetric
 
  
colnames(MDM)  # names of columns
 
rownames(MDM)  # names of rows
 
colnames(MDM)[3]  # third column
 
rownames(MDM)[12]  # twelfth row
 
  
# change the two "*" names to "-" so we can use them to score
+
Given this, it seems appropriate to search the sequence database with the sequence of an Mbp1 structure&ndash;this being a structured, stable, subdomain of the whole that presumably contains the protein's most unique and specific function. Let us retrieve this sequence. All PDB structures have their sequences stored in the NCBI protein database. They can be accessed simply via the PDB-ID, which serves as an identifier both for the NCBI and the PDB databases. However, there is a small catch (isn't there always?). PDB files can contain more than one protein, e.g. if the crystal structure contains a complex<ref>Think of the [http://www.pdb.org/pdb/101/motm.do?momID=121 ribosome] or [http://www.pdb.org/pdb/101/motm.do?momID=3 DNA-polymerase] as extreme examples.</ref>. Each of the individual proteins gets a so-called '''chain ID'''&ndash;a one letter identifier&ndash; to identify them uniquely. To find their unique sequence in the database, you need to know the PDB ID as well as the chain ID. If the file contains only a single protein (as in our case), the chain ID is always '''<code>A</code>'''<ref>Otherwise, you need to study the PDB Web page for the structure, or the text in the PDB file itself, to identify which part of the complex is labeled with which chain ID. For example, immunoglobulin structures sometimes label the ''light-'' and ''heavy chain'' fragments as "L" and "H", and sometimes as "A" and "B"&ndash;there are no fixed rules. You can also load the structure in VMD, color "by chain" and use the mouse to click on residues in each chain to identify it.</ref>. Make sure you understand the concept of protein chains, and chain IDs.
# indels of the alignment. This is a bit of a hack, since this
 
# does not reflect the actual indel penalties (which are, as you
 
# remember from your lectures, calculated as a gap opening
 
# + gap extension penalty; it can't be calculated in a pairwise
 
# manner) EMBOSS defaults for BLOSUM62 are opening -10 and
 
# extension -0.5 i.e. a gap of size 3 (-11.5) has approximately
 
# the same penalty as a 3-character score of "-" matches (-12)
 
# so a pairscore of -4 is not entirely unreasonable.
 
  
colnames(MDM)[24]
 
rownames(MDM)[24]
 
colnames(MDM)[24] <- "-"
 
rownames(MDM)[24] <- "-"
 
colnames(MDM)[24]
 
rownames(MDM)[24]
 
MDM["Q", "-"]
 
MDM["-", "D"]
 
# so far so good.
 
  
# ================================================
+
{{task|1=
#    Tabulate pairscores for alignment
+
<ol>
# ================================================
+
<li> Back at the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1], enter the '''PDB-ID''', an underscore, and the '''chain ID''' for one of the crystal structures into the search field. You can use <code>1MB1_A</code> or <code>1BM8_A</code>, but don't use <code>1L3G</code>: this NMR structure includes a large stretch of unstructured residues.</li>
 
+
<li> Click on '''Display settings''' and choose '''FASTA (text)'''. You should get something like:
 
+
<source lang="text">
# It is trivial to create a pairscore vector along the
+
>gi|157830387|pdb|1BM8|A Chain A, Dna-Binding Domain Of Mbp1
# length of the aligned sequences.
+
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
 +
QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
 +
</source></li>
 +
<li> Save this sequence in your notebook, in case we need it later.</li>
 +
</ol>
 +
}}
 +
 
 +
 
 +
Next, we use this sequence to find its most similar relative in YFO using BLAST.
 +
 
  
PS <- vector()
+
&nbsp;
for (i in 1:lSeq) {
 
  aa1 <- seq1[i]
 
  aa2 <- seq2[i]
 
  PS[i] = MDM[aa1, aa2]
 
}
 
  
PS
+
====BLAST search====
  
  
# The same vector could be created - albeit perhaps not so
+
{{task|1=
# easy to understand - with the expression ...
+
# Navigate to the [http://www.ncbi.nlm.nih.gov/blast '''BLAST''' entry page at the NCBI].
MDM[cbind(seq1,seq2)]
+
# Click on '''protein blast''' as the BLAST program to run.
 +
# Paste the sequence of the yeast Mbp1 DNA-binding domain into the search field.
 +
# Set the following parameters:
 +
## As '''Database''' option choose '''Reference proteins (refseq_protein)'''
 +
## As '''Organism''' enter the binomial name of YFO. Make sure you spell it right; the page will try to autocomplete your entry. Species level is detailed enough; you don't have to specify the strain (e.g. I would specify "''Ustilago maydis''" '''not''' "''Ustilago maydis'' 521").
 +
# Then click on the '''BLAST''' button and wait for the result to appear. You will first see a graph of any conserved domains in your query sequence; this is not yet what you are waiting for...
 +
# Patience.
 +
# Patience. The database is large.
 +
# Patience. Execution times vary greatly by time of day.
 +
# The top "hit" on the results page is what you are looking for. Its alignment and alignment score are shown in the '''Alignments''' section a bit further down the page. Your hit should have more than about 40% identities to the query and match at least 80 residues or so. <small>If your match is shorter or less similar than that, please email me to troubleshoot.</small>
 +
# The first item for each hit is a link to its database entry, right next to the checkbox.  It says something like <code>ref&#124;NP_123456789</code> or <code>ref&#124;XP_123456789</code> ... follow that link.
 +
# Note the RefSeq ID, and save the sequence in FASTA format into your '''R''' working directory, as you did for Mbp1 at the beginning of the assignment. Give this a filename of <code>mbp1-xxxxx.fa</code>, but replace <code>xxxxx</code> with its short species label for YFO. For simplicity I will refer to this sequence as "''YFO'' Mbp1" in the future.
 +
}}
  
  
 +
&nbsp;
  
# ================================================
 
#    Calculate moving averages
 
# ================================================
 
  
# In order to evaluate the alignment, we will calculate a
 
# sliding window average over the pairscores. Somewhat surprisingly
 
# R doesn't (yet) have a native function for moving averages: options
 
# that are quoted are:
 
#  - rollmean() in the "zoo" package http://rss.acs.unt.edu/Rdoc/library/zoo/html/rollmean.html
 
#  - MovingAverages() in "TTR"  http://rss.acs.unt.edu/Rdoc/library/TTR/html/MovingAverages.html
 
#  - ma() in "forecast"  http://robjhyndman.com/software/forecast/
 
# But since this is easy to code, we shall implement it ourselves.
 
  
PSma <- vector()          # will hold the averages
 
winS <- floor(window/2)    # span of elements above/below the centre
 
winC <- winS+1            # centre of the window
 
  
# extend the vector PS with zeros (virtual observations) above and below
 
PS <- c(rep(0, winS), PS , rep(0, winS))
 
  
# initialize the window score for the first position
 
winScore <- sum(PS[1:window])
 
  
# write the first score to PSma
 
PSma[1] <- winScore
 
  
# Slide the window along the sequence, and recalculate sum()
+
==PSI BLAST==
# Loop from the next position, to the last position that does not exceed the vector...
 
for (i in (winC + 1):(lSeq + winS)) {
 
  # subtract the value that has just dropped out of the window
 
  winScore <- winScore - PS[(i-winS-1)]
 
  # add the value that has just entered the window
 
  winScore <- winScore + PS[(i+winS)] 
 
  # put score into PSma
 
  PSma[i-winS] <- winScore
 
}
 
  
# convert the sums to averages
+
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
PSma <- PSma / window
 
  
# have a quick look at the score distributions
 
  
boxplot(PSma)
+
&nbsp;<br>
hist(PSma)
 
  
# ================================================
+
;Take care of things, and they will take care of you.
#    Plot the alignment scores
+
:''Shunryu Suzuki''
# ================================================
+
</div>
  
# normalize the scores
 
PSma <- (PSma-min(PSma))/(max(PSma) - min(PSma) + 0.0001)
 
# spread the normalized values to a desired range, n
 
nCol <- 10
 
PSma <- floor(PSma * nCol) + 1
 
  
# Assign a colorspectrum to a vector (with a bit of colormagic,
+
Anyone can click buttons on a Web page, but to use the powerful sequence database search tools ''right'' often takes considerably more care, caution and consideration.
# don't worry about that for now). Dark colors are poor scores,
 
# "hot" colors are high scores
 
spect <- colorRampPalette(c("black", "red", "yellow", "white"), bias=0.4)(nCol)
 
  
# Color is an often abused aspect of plotting. One can use color to label
+
Much of what we know about a protein's physiological function is based on the '''conservation''' of that function as the species evolves. We assess conservation by comparing sequences between related proteins. Conservation - or its opposite: ''variation'' - is a consequence of '''selection under constraints''': protein sequences change as a consequence of DNA mutations, this changes the protein's structure, this in turn changes functions, and that has multiple effects on a species' fitness function. Detrimental variants may be removed. Variation that is tolerated is largely neutral and therefore found only in positions that are neither structurally nor functionally critical. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, or amino acid propensities as predictors for protein engineering and design tasks.
# *quantities* or *qualities*. For the most part, our pairscores measure amino
 
# acid similarity. That is a quantity and with the spectrum that we just defined
 
# we associate the measured quantities with the color of a glowing piece
 
# of metal: we start with black #000000, then first we ramp up the red
 
# (i.e. low-energy) part of the visible spectrum to red #FF0000, then we
 
# add and ramp up the green spectrum giving us yellow #FFFF00 and finally we
 
# add blue, giving us white #FFFFFF. Let's have a look at the spectrum:
 
  
s <- rep(1, nCol)
+
Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment ('''MSA''') is a cornerstone for the annotation of the essential properties of a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for
barplot(s, col=spect, axes=FALSE, main="Color spectrum")
+
* functional annotation;
 +
* protein homology modeling;
 +
* phylogenetic analyses, and
 +
* sensitive homology searches in databases.
  
# But one aspect of our data is not quantitatively different: indels.
+
In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is where the trouble begins. All interpretation of MSA results depends '''absolutely''' on how the input sequences were chosen. Should we include only orthologs, or paralogs as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of ''representative'' sequences? All of these choices influence our interpretation:
# We valued indels with pairscores of -4. But indels are not simply poor alignment,  
+
*orthologs are expected to be functionally and structurally conserved;
# rather they are non-alignment. This means stretches of -4 values are really
+
*paralogs may have divergent function but have similar structure;
# *qualitatively* different. Let's color them differently by changing the lowest
+
*missing genes may make paralogs look like orthologs; and
# level of the spectrum to grey.
+
*selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.
  
spect[1] <- "#CCCCCC"
 
barplot(s, col=spect, axes=FALSE, main="Color spectrum")
 
  
# Now we can display our alignment score vector with colored rectangles.
+
In this assignment, we will set ourselves the task to use PSI-BLAST and '''find all orthologs and paralogs of the APSES domain containing transcription factors in YFO'''. We will use these sequences later for multiple alignments, calculation of conservation ''etc''. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search '''in one organism'''. There is an issue to consider:
 +
* If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are '''in''' YFO is too small. Thus the search will not become very sensitive.
 +
* If we don't restrict our search, but search in all species, the number of hits may become too large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage. And how will we evaluate the fringe cases of marginal E-value, where we need to decide whether to include a new sequence in the profile, or whether to hold off on it for one or two iterations to see whether the E-value drops significantly? Profile corruption would make the search useless. This is maybe still manageable if we restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added.
  
# Convert the integers in PSma to color values from spect
+
Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile.
PScol <- vector()
 
for (i in 1:length(PSma)) {
 
PScol[i] <- spect[ PSma[i] ]  # this is how a value from PSma is used as an index of spect
 
}
 
  
# Plot the scores. The code is similar to the last assignment.
 
# Create an empty plot window of appropriate size
 
plot(1,1, xlim=c(-100, lSeq), ylim=c(0, 2) , type="n", yaxt="n", bty="n", xlab="position in alignment", ylab="")
 
  
# Add a label to the left
+
Thus in practice, a sensitive PSI-BLAST search needs to address two issues before we begin:
text (-30, 1, adj=1, labels=c(paste("Mbp1:\n", code1, "\nvs.\n", code2)), cex=0.9 )
+
# We need to define the sequence we are searching with; and
 +
# We need to define the dataset we are searching in.
  
# Loop over the vector and draw boxes  without border, filled with color.
 
for (i in 1:lSeq) {
 
  rect(i, 0.9, i+1, 1.1, border=NA, col=PScol[i])
 
}
 
  
# Note that the numbers along the X-axis are not sequence numbers, but numbers
 
# of the alignment, i.e. sequence number + indel length. That is important to
 
# realize: if you would like to add the annotations from the last assignment
 
# which I will leave as an exercise, you need to map your sequence numbering
 
# into alignment numbering. Let me know in case you try that but need some help.
 
  
</source>
 
}}
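The script above codes the sliding window average by hand. As a cross-check, base '''R''' also provides <code>stats::filter()</code>, which can compute a centred moving average directly. The following sketch compares the two approaches on made-up toy scores. All variable names are hypothetical, and the zero-padding mirrors what the script does with <code>PS</code>.

```r
# Toy pairscores (made-up values, for illustration only).
toyPS <- c(4, -1, 5, 0, -4, -4, 6, 2, 1, -2, 3)
win   <- 5                     # window size (odd integer)
half  <- floor(win / 2)        # span of elements above/below the centre

# Zero-padded sliding window mean, written out explicitly.
padded <- c(rep(0, half), toyPS, rep(0, half))
byHand <- sapply(seq_along(toyPS),
                 function(i) mean(padded[i:(i + win - 1)]))

# The same computation with stats::filter(), used as a centred
# moving average (sides = 2) with equal weights 1/win.
byFilter <- as.numeric(stats::filter(padded, rep(1 / win, win), sides = 2))
byFilter <- byFilter[(half + 1):(half + length(toyPS))]  # drop the padded ends

all.equal(byHand, byFilter)    # both approaches should agree
```

Whether to pad with zeros is a design choice: padding keeps the output the same length as the input, but it pulls the averages at both ends of the alignment toward zero.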
 
  
  
;That is all.
+
===Defining the sequence to search with===
  
  
&nbsp;
+
Consider again the task we set out from: '''find all orthologs and paralogs of the APSES domain containing transcription factors in YFO'''.
  
&nbsp;
 
  
==Choosing the Sequence (formerly A3)==
+
{{task|1=
 +
What query sequence should you use? Should you ...
  
  
One of the foundations of bioinformatics is the empirical observation that related sequences conserve structure, and often function. This is the basis on which we can make inferences from well-studied model organisms in species that have not been studied as deeply. The model case for our assignments is to take annotations from baker's yeast, ''Saccharomyces cerevisiae'' and apply them to YFO.
+
# Search with the full-length Mbp1 sequence from ''Saccharomyces cerevisiae''?
 +
# Search with the full-length Mbp1 homolog that you found in YFO?
 +
# Search with the structurally defined ''S. cerevisiae'' APSES domain sequence?
 +
# Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
 +
# Search with the KilA-N domain sequence?
  
Therefore, in this assignment we will
 
* use the sequence search program BLAST to retrieve a sequence similar to yeast Mbp1 in YFO;
 
* use a number of tools to annotate the sequence.
 
  
Keeping with our theme of sequence analysis, we will
+
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">Reflect on this (pretend this is a quiz question) and come up with a reasoned answer. Then click on "Expand" to read my opinion on this question.
* explore EMBOSS tools;
+
<div  class="mw-collapsible-content">
* compute and plot relative amino acid frequencies in '''R''';
+
;The full-length Mbp1 sequence from ''Saccharomyces cerevisiae''
* and (optionally) use Chimera to explore H-bond patterns in the Mbp1 APSES domain structure.
+
:Since this sequence contains multiple domains (in particular the ubiquitous Ankyrin domains) it is not suitable for BLAST database searches. You must restrict your search to the domain of greatest interest for your question. That would be the APSES domain.
  
&nbsp;
+
;The full-length Mbp1 homolog that you found in YFO
 +
:What organism the search sequence comes from does not make a difference. Since you aim to find '''all''' homologs in YFO, it is not necessary to have your search sequence more or less similar to '''any particular''' homologs. In fact '''any''' APSES sequence should give you the same result, since they are '''all''' homologous. But the full-length sequence in YFO has the same problem as the ''Saccharomyces'' sequence.
  
==Retrieve==
+
;The structurally defined ''S. cerevisiae'' APSES domain sequence?
 +
:That would be my first choice, just because it is structurally well defined as a complete domain, and the sequence is easy to obtain from the <code>1BM8</code> PDB entry. (<code>1MB1</code> would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)
  
 +
;The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
 +
:As argued above: since they are all homologs, any of them should lead to the same set of results.
 +
 +
;The KilA-N domain sequence?
 +
:This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search '''more sensitive'''. The fact that it is shorter might make the search '''less specific'''. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be '''whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.'''
  
In [[BIO_Assignment_Week_2#Protein|Assignment 2]] you looked at sequences in YFO that are [http://www.ncbi.nlm.nih.gov/protein?LinkName=protein_protein&from_uid=6320147 related to yeast Mbp1], by following a link from the RefSeq record. I mentioned that there are more principled ways to find related proteins: that principle is to search for similar sequences. Exactly how this works will be the subject of later lectures, but the tool that is most commonly used for this task is called '''BLAST''' (Basic Local Alignment Search Tool). The task of this assignment is to perform a number of sequence annotations to the sequence from YFO that is '''most similar''' to Mbp1, or, more precisely, that contains an APSES domain that is most similar<ref>As you will see later on in the assignment, Mbp1-related proteins contain "Ankyrin" domains, a very widely distributed protein-protein interaction motif that may give rise to false-positive similarities for full-length sequence searches. Therefore, we search only with the DNA binding domain sequence, since this is the functionality that best characterizes the "function" of the protein we are interested in.</ref>.
+
 
 +
So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the [http://www.pdb.org/pdb/explore/explore.do?structureId=1BM8 '''PDB'''] (i.e. [http://www.pdb.org/pdb/files/fasta.txt?structureIdList=1BM8 here]), or from [[BIO_Assignment_Week_3#Search input|Assignment 3]].
 +
 
 +
</div>
 +
</div>
 +
}}
  
 
&nbsp;
 
&nbsp;
===Search input===
 
  
 +
===Selecting species for a PSI-BLAST search===
  
First, we need to '''define the sequence''' we will search with, as the search input.
 
  
 +
As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on '''BLAST''' to find all of them; we need to use the much more sensitive search of '''PSI-BLAST''' instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale why we include sequences in the search that we are not actually interested in.
  
====Defining the sequence to search with====
 
  
I have highlighted the extent of the APSES domain sequence in the previous assignment, but when you explored the corresponding structure in Chimera, you saw that the structured protein domain is larger and the additional secondary structure elements are in fact well integrated into the overall domain. This is not surprising: canonical domain definitions are compiled from many species and examples, and they generally comprise only the common core. Looking up the source of the domain annotations for Mbp1 is very easy:
+
But you can also search with '''too many''' species: if the number of species is large and PSI-BLAST finds a large number of results:
 +
# it becomes unwieldy to check the newly included sequences at each iteration; false-positive hits may then be included, which corrupts the profile and causes a loss of specificity. The search will fail.
 +
# since genomes from some parts of the Tree Of Life are overrepresented, the inclusion of all sequences leads to selection bias and loss of sensitivity.
  
  
{{task|1=
+
We should therefore try to find a subset of species
<ol>
+
# that represent as large a '''range''' as possible on the evolutionary tree;
<li> Access the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1].</li>
+
# that are as well '''distributed''' as possible on the tree; and
<li> While you are here, download a FASTA formatted version of the sequence to your '''R''' working directory and give it a filename of <code>mbp1-sacce.fa</code>. We will need it later. <small>It should be straightforward from the NCBI page how to achieve that. As a hint, you need to use the '''Send to...''' link to actually download the file.</small></li>
+
# whose '''genomes''' are fully sequenced.
<li>  On the RefSeq page, look for the link '''Related Information''' &rarr; '''CDD Search Results''' and  follow it.</li>
 
</ol>
 
  
 +
These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.
  
This is a domain annotation: CDD is the NCBI's '''C'''onserved '''D'''omain '''D'''atabase and the annotation was done by a tool that scanned the sequence of Mbp1 for segments that are similar to any of the domain definitions stored in the CDD. We will return to CDD in the next assignment.
+
How can we '''define''' a list of such species, and how can we '''use''' the list?
  
<ol start="4">
+
The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes".  Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. Regrettably you can't combine such information easily with the resources available from the NCBI.
<li>Click on the blue box labeled KilA-N in the graph to access the CDD entry for this domain.</li>
 
<li>Read the abstract. You should understand the relationship between KilA-N and APSES domains. One is a subfamily of the other.</li>
 
<li>Confirm that the domain definition &ndash; as applied to the Mbp1 sequence (which is labeled as "query") &ndash; corresponds to the region we highlighted in the last assignment.</li>
 
</ol>
 
  
 +
We will use an approach that is conceptually similar: selecting a set of species according to their shared taxonomic rank in the tree of life. {{WP|Biological classification|'''Biological classification'''}} provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so-called {{WP|Taxonomic rank|'''taxonomic ranks'''}}. These ranks are defined in ''Codes of Nomenclature'' that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired&ndash;but not mandated&ndash;that ranks represent ''clades'' (a group of related species, or a "branch" of a phylogeny), and it is desired&ndash;but not mandated&ndash;that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say, it is in flux.
  
What precisely constitutes an APSES domain however is a matter of definition, as you can explore in the following (optional) task.
+
If we follow a link to an entry in the NCBI's Taxonomy database, e.g. [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 ''Saccharomyces cerevisiae S288c''], the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
  
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">Optional: Load the structure in Chimera, like you did in the last assignment and switch on stereo viewing ... (more) <div  class="mw-collapsible-content">
+
<source lang="text">
<ol start="7">
+
cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya;  
<li>Display the protein in ribbon style, e.g. with the '''Interactive 1''' preset.
+
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes;
<li>Access the '''Interpro''' information page for Mbp1 at the EBI: http://www.ebi.ac.uk/interpro/protein/P39678
+
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae
<li>In the section '''Domains and repeats''', mouse over the red annotations and note down the residue numbers for the annotated domains. Also follow the links to the respective Interpro domain definition pages.
+
</source>
</ol>
 
  
At this point we have definitions for the following regions on the Mbp1 protein ...
 
*The KilA-N (pfam 04383) domain definition as applied to the Mbp1 protein sequence by CDD;
 
*The InterPro ''KilA, N-terminal/APSES-type HTH, DNA-binding (IPR018004)'' definition annotated on the Mbp1 sequence;
 
*The InterPro ''Transcription regulator HTH, APSES-type DNA-binding domain (IPR003163)'' definition annotated on the Mbp1 sequence;
 
*<small>(... in addition &ndash; without following the source here &ndash; the UniProt record for Mbp1 annotates a "HTH APSES-type" domain from residues 5-111)</small>
 
  
... each with its distinct and partially overlapping sequence range. Back to Chimera:
+
These names can be mapped into taxonomic ranks, since the suffixes of these names, e.g. ''-mycotina'', ''-mycetaceae'', are specific to defined ranks. (NCBI does not provide this mapping, but {{WP|Taxonomic rank|Wikipedia}} is helpful here.)
  
<!-- For reference:
+
<table>
1MB1: 3-100
 
1BM8: 4-102
 
CDD KilA-N: 19-93
 
InterPro KilA-N: 23-88
 
InterPro APSES: 3-133
 
Uniprot HTH/APSES: 5-111
 
-->
 
  
<ol start="10">
+
<tr class="sh">
<li>In the sequence window, select the sequence corresponding to the '''Interpro KilA-N''' annotation and colour this fragment red. <small>Remember that you can get the sequence numbers of a residue in the sequence window when you hover the pointer over it - but do confirm that the sequence numbering that Chimera displays matches the numbering of the Interpro domain definition.</small></li>
+
<td>Rank</td>
 +
<td>Suffix</td>
 +
<td>Example</td>
 +
</tr>
  
<li>Then select the residue range(s) by which the '''CDD KilA-N''' definition is larger, and colour that fragment orange.</li>
+
<tr class="s1">
 +
<td>Domain</td>
 +
<td></td>
 +
<td>Eukaryota (Eukarya)</td>
 +
</tr>
  
<li>Then select the residue range(s) by which the '''InterPro APSES domain''' definition is larger, and colour that fragment yellow.</li>
+
<tr class="s2">
 +
<td>&nbsp;&nbsp;Subdomain</td>
 +
<td>&nbsp;</td>
 +
<td>Opisthokonta</td>
 +
</tr>
  
<li>If the structure contains residues outside these ranges, colour these white.</li>
+
<tr class="s1">
 +
<td>'''Kingdom'''</td>
 +
<td>&nbsp;</td>
 +
<td>Fungi</td>
 +
</tr>
  
<li>Study this in a side-by-side stereo view and get a sense for how the ''extra'' sequence beyond the KilA-N domain(s) is part of the structure, and how the integrity of the folded structure would be affected if these fragments were missing.</li>
+
<tr class="s2">
 +
<td>&nbsp;&nbsp;Subkingdom</td>
 +
<td>&nbsp;</td>
 +
<td>Dikarya</td>
 +
</tr>
  
<li>Display Hydrogen bonds, to get a sense of interactions between residues from the differently colored parts. First show the protein as a stick model, with sticks that are thicker than the default to give a better sense of sidechain packing:<br />
+
<tr class="s1">
::(i) '''Select''' &rarr; '''Select all''' <br />
+
<td>'''Phylum'''</td>
::(ii) '''Actions''' &rarr; '''Ribbon''' &rarr; '''hide''' <br />
+
<td>&nbsp;</td>
::(iii) '''Select''' &rarr; '''Structure''' &rarr; '''protein''' <br />
+
<td>Ascomycota</td>
::(iv) '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''show''' <br />
+
</tr>
::(v)  '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''stick''' <br />
 
::(vi) click on the looking glass icon at the bottom right of the graphics window to bring up the inspector window and choose '''Inspect ... Bond'''. Change the radius to 0.4.<br />
 
</li>
 
  
<li>Then calculate and display the hydrogen bonds:<br />
+
<tr class="s2">
::(vii) '''Tools''' &rarr; '''Surface/Binding Analysis''' &rarr; '''FindHbond''' <br />
+
<td>&nbsp;&nbsp;''rankless taxon''<ref>The -myceta are well supported groups above the Class rank. See {{WP|Leotiomyceta|''Leotiomyceta''}} for details and references.</ref></td>
::(viii) Set the '''Line width''' to 3.0, leave all other parameters with their default values and click '''Apply'''<br />
+
<td>-myceta</td>
:: Clear the selection.<br />
+
<td>Saccharomyceta</td>
Study this view, especially regarding side chain H-bonds. Are there many? Do side chains interact more with other sidechains, or with the backbone?
+
</tr>
</li>
 
  
<li>Let's now simplify the scene a bit and focus on backbone/backbone H-bonds:<br />
+
<tr class="s1">
::(ix) '''Select''' &rarr; '''Structure''' &rarr; '''Backbone''' &rarr; '''full'''<br />
+
<td>&nbsp;&nbsp;Subphylum</td>
::(x)  '''Actions''' &rarr; '''Atoms/Bonds''' &rarr; '''show only'''<br /><br />
+
<td>-mycotina</td>
:: Clear the selection.<br />
+
<td>Saccharomycotina</td>
In this way you can appreciate how H-bonds build secondary structure - &alpha;-helices and &beta;-sheets - and how these interact with each other ... in part '''across the KilA-N boundary'''.
+
</tr>
</li>
 
  
 +
<tr class="s2">
 +
<td>'''Class'''</td>
 +
<td>-mycetes</td>
 +
<td>Saccharomycetes</td>
 +
</tr>
  
<li>Save the resulting image as a jpeg no larger than 600px across and upload it to your Lab notebook on the Wiki.</li>
+
<tr class="s1">
<li>When you are done, congratulate yourself on having earned a bonus of 10% on the next quiz.</li>
+
<td>&nbsp;&nbsp;Subclass</td>
</ol>
+
<td>-mycetidae</td>
 +
<td>&nbsp;</td>
 +
</tr>
  
</div>
+
<tr class="s2">
</div>
+
<td>'''Order'''</td>
 +
<td>-ales</td>
 +
<td>Saccharomycetales</td>
 +
</tr>
  
 +
<tr class="s1">
 +
<td>'''Family'''</td>
 +
<td>-aceae</td>
 +
<td>Saccharomycetaceae</td>
 +
</tr>
  
There is a rather important lesson in this: domain definitions may be fluid, and their boundaries may be computationally derived from sequence comparisons across many families, and do not necessarily correspond to individual structures. Make sure you understand this well.
+
<tr class="s2">
}}
+
<td>&nbsp;&nbsp;Subfamily</td>
 +
<td>-oideae</td>
 +
<td>&nbsp;</td>
 +
</tr>
  
 +
<tr class="s1">
 +
<td>&nbsp;&nbsp;Tribe</td>
 +
<td>-eae</td>
 +
<td>&nbsp;</td>
 +
</tr>
  
Given this, it seems appropriate to search the sequence database with the sequence of an Mbp1 structure&ndash;this being a structured, stable, subdomain of the whole that presumably contains the protein's most unique and specific function. Let us retrieve this sequence. All PDB structures have their sequences stored in the NCBI protein database. They can be accessed simply via the PDB-ID, which serves as an identifier both for the NCBI and the PDB databases. However there is a small catch (isn't there always?). PDB files can contain more than one protein, e.g. if the crystal structure contains a complex<ref>Think of the [http://www.pdb.org/pdb/101/motm.do?momID=121 ribosome] or [http://www.pdb.org/pdb/101/motm.do?momID=3 DNA-polymerase] as extreme examples.</ref>. Each of the individual proteins gets a so-called '''chain ID'''&ndash;a one letter identifier&ndash; to identify them uniquely. To find their unique sequence in the database, you need to know the PDB ID as well as the chain ID. If the file contains only a single protein (as in our case), the chain ID is always '''<code>A</code>'''<ref>Otherwise, you need to study the PDB Web page for the structure, or the text in the PDB file itself, to identify which part of the complex is labeled with which chain ID. For example, immunoglobulin structures sometimes label the ''light-'' and ''heavy chain'' fragments as "L" and "H", and sometimes as "A" and "B"&ndash;there are no fixed rules. You can also load the structure in VMD, color "by chain" and use the mouse to click on residues in each chain to identify it.</ref>. Make sure you understand the concept of protein chains, and chain IDs.
+
<tr class="s2">
 
+
<td>&nbsp;&nbsp;Subtribe</td>
 +
<td>-ineae</td>
 +
<td>&nbsp;</td>
 +
</tr>
  
{{task|1=
+
<tr class="s1">
<ol>
+
<td>'''Genus'''</td>
<li> Back at the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1], enter the '''PDB-ID''', an underscore, and the '''chain ID''' for one of the crystal structures into the search field. You can use <code>1MB1_A</code> or <code>1BM8_A</code>, but don't use <code>1L3G</code>: this NMR structure includes a large stretch of unstructured residues.</li>
+
<td>&nbsp;</td>
<li> Click on '''Display settings''' and choose '''FASTA (text)'''. You should get something like:
+
<td>Saccharomyces</td>
<source lang="text">
+
</tr>
>gi|157830387|pdb|1BM8|A Chain A, Dna-Binding Domain Of Mbp1
 
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
 
QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
 
</source></li>
 
<li> Save this sequence in your notebook, in case we need it later.</li>
 
</ol>
 
}}
 
  
 +
<tr class="s2">
 +
<td>'''Species'''</td>
 +
<td>&nbsp;</td>
 +
<td>''Saccharomyces cerevisiae''</td>
 +
</tr>
  
Next, we use this sequence to find its most similar relative in YFO using BLAST.
+
</table>
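
The suffix-to-rank mapping in the table above can be applied mechanically. The following is only a minimal sketch in '''R''': the lineage string is the one quoted from the NCBI Taxonomy entry, the suffixes are the ones listed in the table, and the names <code>suffixRank</code> and <code>rankFromSuffix()</code> are made up for this illustration. Taxa without a listed suffix (e.g. Fungi, Dikarya, the genus and species) simply get <code>NA</code>.

```r
# Lineage as quoted above from the NCBI Taxonomy entry for S. cerevisiae.
lineage <- paste("cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya;",
                 "Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes;",
                 "Saccharomycetales; Saccharomycetaceae; Saccharomyces;",
                 "Saccharomyces cerevisiae")
taxa <- trimws(strsplit(lineage, ";")[[1]])

# Suffix -> rank lookup copied from the table (hypothetical helper vector).
# Order matters: longer, more specific suffixes must be tested before
# shorter ones, e.g. "aceae" before "eae".
suffixRank <- c(mycotina  = "Subphylum",
                mycetidae = "Subclass",
                mycetes   = "Class",
                myceta    = "rankless taxon",
                oideae    = "Subfamily",
                ineae     = "Subtribe",
                aceae     = "Family",
                ales      = "Order",
                eae       = "Tribe")

# Return the rank implied by the first matching suffix, or NA if none matches.
rankFromSuffix <- function(name) {
    for (suffix in names(suffixRank)) {
        if (grepl(paste0(suffix, "$"), name)) {
            return(unname(suffixRank[suffix]))
        }
    }
    return(NA_character_)
}

ranks <- vapply(taxa, rankFromSuffix, character(1))
print(data.frame(taxon = taxa, rank = unname(ranks)))
```

Note that this only recovers ranks that are encoded in a suffix; the principal ranks without a suffix in the table still have to be looked up by hand.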
  
  
&nbsp;
+
You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record. But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study: you should choose one from each of the major '''domains''' of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=40674&lvl=4 '''Mammalia'''] (a {{WP|Mammal_classification|'''class rank'''}}), and depending on how many species you want to include, use the
 +
subclass-, order-, or family rank (hover over the names to see their taxonomic rank.)
  
====BLAST search====
+
There will still be quite a bit of manual work involved and an exploration of different options on the Web may be useful. For our purposes here we can retrieve a good set of organisms from the [http://fungi.ensembl.org/info/website/species.html '''Ensembl fungal genomes page'''] - maintained by the EBI's genome annotation group - that lists species grouped by taxonomic ''order''. All of these organisms are genome-sequenced, so we can pick a set of representatives:
  
 +
# Capnodiales&nbsp;&nbsp;&nbsp;''Zymoseptoria tritici''
 +
# Erysiphales&nbsp;&nbsp;&nbsp;''Blumeria graminis''
 +
# Eurotiales&nbsp;&nbsp;&nbsp;''Aspergillus nidulans''
 +
# Glomerellales&nbsp;&nbsp;&nbsp;''Glomerella graminicola''
 +
# Hypocreales&nbsp;&nbsp;&nbsp;''Trichoderma reesei''
 +
# Magnaporthales&nbsp;&nbsp;&nbsp;''Magnaporthe oryzae''
 +
# Microbotryales&nbsp;&nbsp;&nbsp;''Microbotryum violaceum''
 +
# Pezizales&nbsp;&nbsp;&nbsp;''Tuber melanosporum''
 +
# Pleosporales&nbsp;&nbsp;&nbsp;''Phaeosphaeria nodorum''
 +
# Pucciniales&nbsp;&nbsp;&nbsp;''Puccinia graminis''
 +
# Saccharomycetales&nbsp;&nbsp;&nbsp;''Saccharomyces cerevisiae''
 +
# Schizosaccharomycetales&nbsp;&nbsp;&nbsp;''Schizosaccharomyces pombe''
 +
# Sclerotiniaceae&nbsp;&nbsp;&nbsp;''Sclerotinia sclerotiorum''
 +
# Sordariales&nbsp;&nbsp;&nbsp;''Neurospora crassa''
 +
# Tremellales&nbsp;&nbsp;&nbsp;''Cryptococcus neoformans''
 +
# Ustilaginales&nbsp;&nbsp;&nbsp;''Ustilago maydis''
  
{{task|1=
+
This set of organisms thus can be used to generate a PSI-BLAST search in a well-distributed set of species. Of course '''you must also include YFO''' (<small>if YFO is not in this list already</small>).
# Navigate to the [http://www.ncbi.nlm.nih.gov/blast '''BLAST''' entry page at the NCBI].
 
# Click on '''protein blast''' as the BLAST program to run.
 
# Paste the sequence of the yeast Mbp1 DNA-binding domain into the search field.
 
# Set the following parameters:
 
## As '''Database''' option choose '''Reference proteins (refseq_protein)'''
 
## As '''Organism''' enter the binomial name of YFO. Make sure you spell it right, the page will try to autocomplete your entry. Species level is detailed enough, you don't have to specify the strain (e.g. I would specify "''Ustilago maydis''" '''not''' "''Ustilago maydis'' 521").
 
# Then click on the '''BLAST''' button and wait for the result to appear. You will first see a graph of any conserved domains in your query sequence, this is not yet what you are waiting for...
 
# Patience.
 
# Patience. The database is large.
 
# Patience. Execution times vary greatly by time of day.
 
# The top "hit" on the results page is what you are looking for. Its alignment and alignment score are shown in the '''Alignments''' section a bit further down the page. Your hit should have more than about 40% identities to the query and match at least 80 residues or so. <small>If your match is shorter or weaker than that, please email me to troubleshoot.</small>
 
# The first item for each hit is a link to its database entry, right next to the checkbox.  It says something like <code>ref&#124;NP_123456789</code> or <code>ref&#124;XP_123456789</code> ... follow that link.
 
# Note the RefSeq ID, and save the sequence in FASTA format into your '''R''' working directory, as you did for Mbp1 at the beginning of the assignment. Give this a filename of <code>mbp1-xxxxx.fa</code>, but replace <code>xxxxx</code> with its short species label for YFO. For simplicity I will refer to this sequence as "''YFO'' Mbp1" in the future.
 
}}
 
 
 
  
&nbsp;
+
To enter these 16 species as an Entrez restriction, they need to be formatted as below. (<small>One could also enter species one by one, by pressing the '''(+)''' button after the organism list</small>)
  
  
 +
<source lang="text">
 +
Aspergillus nidulans[orgn]
 +
OR Blumeria graminis[orgn]
 +
OR Cryptococcus neoformans[orgn]
 +
OR Glomerella graminicola[orgn]
 +
OR Magnaporthe oryzae[orgn]
 +
OR Microbotryum violaceum[orgn]
 +
OR Neurospora crassa[orgn]
 +
OR Phaeosphaeria nodorum[orgn]
 +
OR Puccinia graminis[orgn]
 +
OR Sclerotinia sclerotiorum[orgn]
 +
OR Trichoderma reesei[orgn]
 +
OR Tuber melanosporum[orgn]
 +
OR Saccharomyces cerevisiae[orgn]
 +
OR Schizosaccharomyces pombe[orgn]
 +
OR Ustilago maydis[orgn]
 +
OR Zymoseptoria tritici[orgn]
  
 +
</source>
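
If you keep your species in an '''R''' vector, the restriction string does not need to be typed by hand. This is just a sketch: the species names are the ones listed above, <code>YFO</code> is a placeholder variable for your own organism (''Ustilago maydis'' is used here only as an example), and <code>entrezQuery</code> is a name made up for this illustration.

```r
# The 16 representative species from the list above.
species <- c("Aspergillus nidulans", "Blumeria graminis",
             "Cryptococcus neoformans", "Glomerella graminicola",
             "Magnaporthe oryzae", "Microbotryum violaceum",
             "Neurospora crassa", "Phaeosphaeria nodorum",
             "Puccinia graminis", "Sclerotinia sclerotiorum",
             "Trichoderma reesei", "Tuber melanosporum",
             "Saccharomyces cerevisiae", "Schizosaccharomyces pombe",
             "Ustilago maydis", "Zymoseptoria tritici")

# Placeholder: replace with the binomial name of your own YFO.
YFO <- "Ustilago maydis"

# union() appends YFO only if it is not in the list already.
species <- union(species, YFO)

# Tag each name with [orgn] and join the terms with OR.
entrezQuery <- paste0(species, "[orgn]", collapse = " OR ")

# Print one term per line, in the same layout as the block above.
cat(gsub(" OR ", "\nOR ", entrezQuery, fixed = TRUE), "\n")
```

Using <code>union()</code> rather than <code>c()</code> avoids a duplicated term if YFO is one of the 16 species.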
  
  
  
 +
&nbsp;
  
==MSA (formerly A5)==
+
===Executing the PSI-BLAST search===
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
We have a list of species. Good. Next up: how do we '''use''' it?
  
 +
{{task|1=
  
&nbsp;<br>
 
  
;Take care of things, and they will take care of you.
+
# Navigate to the BLAST homepage.
:''Shunryu Suzuki''
+
# Select '''protein BLAST'''.
</div>
+
# Paste the APSES domain sequence into the search field.
 +
# Select '''refseq''' as the database.
 +
# Copy the organism restriction list from above '''and enter the correct name for YFO''' into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the '''Entrez Query''' field.
 +
# In the '''Algorithm''' section, select PSI-BLAST.
 +
#Click on '''BLAST'''.
 +
}}
  
  
Anyone can click buttons on a Web page, but to use the powerful sequence database search tools ''right'' often takes considerably more care, caution and consideration.
+
Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an '''E-value''' of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits&ndash; in the '''Sequences producing significant alignments...''' section&ndash; there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, they will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.
  
Much of what we know about a protein's physiological function is based on the '''conservation''' of that function as the species evolves. We assess conservation by comparing sequences between related proteins. Conservation - or its opposite: ''variation'' - is a consequence of '''selection under constraints''': protein sequences change as a consequence of DNA mutations, this changes the protein's structure, this in turn changes functions and that has multiple effects on a species' fitness function. Detrimental variants may be removed. Variation that is tolerated is largely neutral and therefore found only in positions that are neither structurally nor functionally critical. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, or amino acid propensities as predictors for protein engineering and design tasks.
 
  
Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment ('''MSA''') is a cornerstone for the annotation of the essential properties of a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for
+
{{task|1=
* functional annotation;
+
#In the header section, click on '''Formatting options''' and in the line "Format for..." set the '''with inclusion threshold''' to <code>0.001</code> (This means E-values can't be above 10<sup>-03</sup> for the sequence to be included.)
* protein homology modeling;
+
# Click on the '''Reformat''' button (top right).
* phylogenetic analyses, and
+
# In the table of sequence descriptions (not alignments!), click on the '''Query coverage''' to sort the table by coverage, not by score.
* sensitive homology searches in databases.
+
# Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
 
+
# '''Deselect''' the check mark next to these sequences in the right-hand column '''Select for PSI blast'''. (For me these are six sequences, but with YFO included that may be a bit different.)
In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is where the trouble begins. All interpretation of MSA results depends '''absolutely''' on how the input sequences were chosen. Should we include only orthologs, or paralogs as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of ''representative'' sequences? All of these choices influence our interpretation:
+
# Then scroll to '''Run PSI-BLAST iteration 2 ...''' and click on '''<code>Go</code>'''.
*orthologs are expected to be functionally and structurally conserved;
+
}}
*paralogs may have divergent function but have similar structure;
 
*missing genes may make paralogs look like orthologs; and
 
*selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.
 
  
  
In this assignment, we will set ourselves the task to use PSI-BLAST and '''find all orthologs and paralogs of the APSES domain containing transcription factors in YFO'''. We will use these sequences later for multiple alignments, calculation of conservation ''etc''. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search '''in one organism'''. There is an issue to consider:
+
This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the '''profile''', not with any individual sequence. Note that we are controlling what goes into the profile in two ways:
* If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are '''in''' YFO is too small. Thus the search will not become very sensitive.
+
# we are explicitly removing sequences with poor coverage; and
* If we don't restrict our search, but search in all species, the number of hits may become too large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage, and to evaluate the fringe cases of marginal E-value, where we need to decide whether to include a new sequence in the profile or to hold off on it for one or two iterations to see whether the E-value drops significantly. Profile corruption would make the search useless. This is maybe still manageable if we restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added.
+
# we are requiring a more stringent minimum E-value for each sequence.
  
Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile.
+
 
 +
{{task|1=
 +
#Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%.
 +
# Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
 +
# Iterate the search in this way until no more "New" sequences are added to the profile. Then scan the list of excluded hits ... are there any from YFO that seem like they could potentially make the list? Marginal E-value perhaps, or reasonable E-value but less coverage? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
 +
}}
  
  
Thus in practice, a sensitive PSI-BLAST search needs to address two issues before we begin:
+
Once no "new" sequences have been added, if we were to repeat the process again and again, we would always get the same result because the profile stays the same. We say that the search has '''converged'''. Good. Time to harvest.
# We need to define the sequence we are searching with; and
 
# We need to define the dataset we are searching in.
 
  
  
 +
{{task|1=
 +
# At the header, click on '''Taxonomy reports''' and find YFO in the '''Organism Report''' section. These are your APSES domain homologs. All of them. Actually, perhaps more than all: the report may also include sequences with E-values above the inclusion threshold.
 +
# From the report copy the sequence identifiers
 +
## from YFO,
 +
## with E-values below your defined threshold.
 +
}}
  
 +
For example, the list of ''Saccharomyces'' genes is the following:
  
 +
<code>
 +
<b>[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 Saccharomyces cerevisiae S288c]</b> [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4890 [ascomycetes]] taxid 559292<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320147&dopt=GenPept ref|NP_010227.1|] Mbp1p [Saccharomyces cerevisiae S288c]          [ 131]  1e-38<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320957&dopt=GenPept ref|NP_011036.1|] Swi4p [Saccharomyces cerevisiae S288c]          [ 123]  1e-35<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322808&dopt=GenPept ref|NP_012881.1|] Phd1p [Saccharomyces cerevisiae S288c]          [  91]  1e-25<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6323658&dopt=GenPept ref|NP_013729.1|] Sok2p [Saccharomyces cerevisiae S288c]          [  93]  3e-25<br \>
 +
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322090&dopt=GenPept ref|NP_012165.1|] Xbp1p [Saccharomyces cerevisiae S288c]          [  40]  5e-07<br \>
 +
</code>
  
==Defining the sequence to search with==
+
[[Saccharomyces cerevisiae Xbp1|Xbp1]] is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.
  
  
Consider again the task we set out from: '''find all orthologs and paralogs of the APSES domain containing transcription factors in YFO'''.
+
Next we need to retrieve the sequences. Tedious to retrieve them one by one, but we can get them all at the same time:
  
  
 
{{task|1=
 
{{task|1=
What query sequence should you use? Should you ...
 
  
 +
# Return to the BLAST results page and again open the '''Formatting options'''.
 +
# Find the '''Limit results''' section and enter YFO's name into the field. For example <code>Saccharomyces cerevisiae [ORGN]</code>
 +
# Click on '''Reformat'''
 +
# Scroll to the '''Descriptions''' section, check the box at the left-hand margin, next to each sequence you want to keep. Then click on '''Download &rarr; FASTA complete sequence &rarr; Continue'''.
  
# Search with the full-length Mbp1 sequence from ''Saccharomyces cerevisiae''?
 
# Search with the full-length Mbp1 homolog that you found in YFO?
 
# Search with the structurally defined ''S. cerevisiae'' APSES domain sequence?
 
# Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
 
# Search with the KilA-N domain sequence?
 
  
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">Reflect on this (pretend this is a quiz question) and come up with a reasoned answer. Then click on "Expand" to read my opinion on this question.
+
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">There are actually several ways to download lists of sequences. Using the results page utility is only one. But if you know the GIs of the sequences you need, you can get them more directly by putting them into the URL...
 
<div  class="mw-collapsible-content">
 
<div  class="mw-collapsible-content">
;The full-length Mbp1 sequence from ''Saccharomyces cerevisiae''
 
:Since this sequence contains multiple domains (in particular the ubiquitous Ankyrin domains) it is not suitable for BLAST database searches. You must restrict your search to the domain of greatest interest for your question. That would be the APSES domain.
 
  
;The full-length Mbp1 homolog that you found in YFO
+
* http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=docsum  - The default report
:What organism the search sequence comes from does not make a difference. Since you aim to find '''all''' homologs in YFO, it is not necessary to have your search sequence more or less similar to '''any particular''' homologs. In fact '''any''' APSES sequence should give you the same result, since they are '''all''' homologous. But the full-length sequence in YFO has the same problem as the ''Saccharomyces'' sequence.
+
* http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=fasta - FASTA sequences with NCBI HTML markup
  
;The structurally defined ''S. cerevisiae'' APSES domain sequence?
+
Even more flexible is the [http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records '''eUtils'''] interface to the NCBI databases. For example you can download the dataset in text format by clicking below.
:That would be my first choice, just because it is structurally well defined as a complete domain, and the sequence is easy to obtain from the <code>1BM8</code> PDB entry. (<code>1MB1</code> would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)
 
  
;The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
+
* http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=6320147,6320957,6322808,6323658,6322090&rettype=fasta&retmode=text
:As argued above: since they are all homologs, any of them should lead to the same set of results.
 
 
 
;The KilA-N domain sequence?
 
:This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search '''more sensitive'''. The fact that it is shorter might make the search '''less specific'''. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be '''whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.'''
 
 
 
 
 
So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the [http://www.pdb.org/pdb/explore/explore.do?structureId=1BM8 '''PDB'''] (i.e. [http://www.pdb.org/pdb/files/fasta.txt?structureIdList=1BM8 here]), or from [[BIO_Assignment_Week_3#Search input|Assignment 3]].
 
  
 +
Note that this utility does not '''show''' anything, but downloads the (multi) fasta file to your default download directory.
 +
 
</div>
 
</div>
 
</div>
 
</div>
 
}}
 
}}
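The eUtils URL shown above can also be assembled programmatically, which is convenient when the list of identifiers changes. The following is only a minimal sketch: the function name <code>efetch_url</code> is made up for this illustration, and the GI numbers are the five yeast sequences from the example list. The sketch only builds the URL; it does not contact the NCBI.

```python
from urllib.parse import urlencode

# Base address of the efetch utility, as used in the example link above.
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="protein", rettype="fasta", retmode="text"):
    """Return an efetch URL for a list of sequence identifiers.

    efetch_url is a hypothetical helper for this illustration,
    not part of any NCBI library.
    """
    params = {
        "db": db,
        "id": ",".join(str(i) for i in ids),
        "rettype": rettype,
        "retmode": retmode,
    }
    # safe="," keeps the commas of the id list unescaped and readable.
    return EFETCH + "?" + urlencode(params, safe=",")


# GI numbers of the five Saccharomyces sequences listed above.
yeast_gis = [6320147, 6320957, 6322808, 6323658, 6322090]
url = efetch_url(yeast_gis)
print(url)
```

Pasting the printed URL into a browser should behave like the link above: nothing is shown, but the (multi) FASTA file is downloaded.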
 +
<!--
  
&nbsp;
+
Add to this assignment:
 +
- study the BLAST output format, links, tools, scores ...
 +
- compare the improvement in E-values to standard BLAST
 +
- examine this in terms of sensitivity and specificity
  
==Selecting species for a PSI-BLAST search==
+
-->
  
  
As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on '''BLAST''' to find all of them; we need to use the much more sensitive search of '''PSI-BLAST''' instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale why we include sequences in the search that we are not actually interested in.
 
  
 +
==Multiple Sequence Alignment==
  
But you can also search with '''too many''' species: if the number of species is large and PSI-BLAST finds a large number of results:
 
# it becomes unwieldy to check the newly included sequences at each iteration; false-positive hits may then be included, leading to profile corruption and loss of specificity. The search will fail.
 
# since genomes from some parts of the Tree Of Life are overrepresented, the inclusion of all sequences leads to selection bias and loss of sensitivity.
 
  
 +
&nbsp;<br>
  
We should therefore try to find a subset of species
 
# that represent as large a '''range''' as possible on the evolutionary tree;
 
# that are as well '''distributed''' as possible on the tree; and
 
# whose '''genomes''' are fully sequenced.
 
  
These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.
+
===Review of domain annotations===
  
How can we '''define''' a list of such species, and how can we '''use''' the list?
 
  
The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes". Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. Regrettably you can't combine such information easily with the resources available from the NCBI.
+
APSES domains are relatively easy to identify and annotate but we have had problems with the ankyrin domains in Mbp1 homologues. Both CDD as well as SMART have identified such domains, but while the domain model was based on the same Pfam profile for both, and both annotated approximately the same regions, the details of the alignments and the extent of the predicted region was different.
  
We will use an approach that is conceptually similar: selecting a set of species according to their shared taxonomic rank in the tree of life. {{WP|Biological classification|'''Biological classification'''}} provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so-called {{WP|Taxonomic rank|'''taxonomic ranks'''}}. These ranks are defined in ''Codes of Nomenclature'' that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired&ndash;but not mandated&ndash;that ranks represent ''clades'' (a group of related species, or a "branch" of a phylogeny), and it is desired&ndash;but not mandated&ndash;that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say, it is in flux.
+
[http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=mbp1 Mbp1] forms heterodimeric complexes with a homologue, [http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=swi6 Swi6]. Swi6 does not have an APSES domain, thus it does not bind DNA. But it is similar to Mbp1 in the region spanning the ankyrin domains and in 1999 [http://www.ncbi.nlm.nih.gov/pubmed/10048928 Foord ''et al.''] published its crystal structure ([http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1SW6 1SW6]). This structure is a good model for Ankyrin repeats in Mbp1. For details, please refer to the consolidated [[Reference annotation yeast Mbp1|Mbp1 annotation page]] I have prepared.
  
If we follow a link to an entry in the NCBI's Taxonomy database, e.g. [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 ''Saccharomyces cerevisiae S288c''], the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:
+
In what follows, we will use the program JALVIEW - a Java based multiple sequence alignment editor to load and align sequences and to consider structural similarity between yeast Mbp1 and its closest homologue in your organism.
  
 +
In this part of the assignment,
  
<source lang="text">
+
#You will load sequences that are most similar to Mbp1 into an MSA editor;
cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya;  
+
#You will add sequences of ankyrin domain models;
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes;  
+
#You will perform a multiple sequence alignment;
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae
+
#You will try to improve the alignment manually;
</source>
+
<!-- Finally you will consider if the Mbp1 APSES domains could extend beyond the section of homology with Swi6 -->
  
  
These names can be mapped into taxonomic ranks, since the suffixes of these names, e.g. ''-mycotina'', ''-mycetaceae'', are specific to defined ranks. (NCBI does not provide this mapping, but {{WP|Taxonomic rank|Wikipedia}} is helpful here.)
+
===Jalview, loading sequences===
  
<table>
 
  
<tr class="sh">
+
Geoff Barton's lab in Dundee has developed an integrated MSA editor and sequence annotation workbench with a number of very useful functions. It is written in Java and should run on Mac, Linux and Windows platforms without modifications.
<td>Rank</td>
 
<td>Suffix</td>
 
<td>Example</td>
 
</tr>
 
  
<tr class="s1">
 
<td>Domain</td>
 
<td></td>
 
<td>Eukaryota (Eukarya)</td>
 
</tr>
 
  
<tr class="s2">
+
{{#pmid: 19151095}}
<td>&nbsp;&nbsp;Subdomain</td>
 
<td>&nbsp;</td>
 
<td>Opisthokonta</td>
 
</tr>
 
  
<tr class="s1">
 
<td>'''Kingdom'''</td>
 
<td>&nbsp;</td>
 
<td>Fungi</td>
 
</tr>
 
  
<tr class="s2">
+
We will use this tool for this assignment and explore its features as we go along.
<td>&nbsp;&nbsp;Subkingdom</td>
 
<td>&nbsp;</td>
 
<td>Dikarya</td>
 
</tr>
 
  
<tr class="s1">
+
{{task|1=
<td>'''Phylum'''</td>
+
#Navigate to the [http://www.jalview.org/ Jalview homepage], click on '''Download''', install Jalview on your computer and start it. A number of windows that showcase the program's abilities will load; you can close these.
<td>&nbsp;</td>
+
#Prepare homologous Mbp1 sequences for alignment:
<td>Ascomycota</td>
+
##Open the '''[[Reference Mbp1 orthologues (all fungi)]]''' page. (This is the list of Mbp1 orthologs I mentioned above.)
</tr>
+
##Copy the FASTA sequences of the reference proteins, paste them into a text file (TextEdit on the Mac, Notepad on Windows) and save the file; you could give it an extension of <code>.fa</code>&ndash;but you don't have to.
 +
##Check whether the sequence for YFO is included in the list. If it is, fine. If it is not, retrieve it from NCBI, paste it into the file and edit the header like the other sequences. If the wrong sequence from YFO is included, replace it and let me know.
 +
#Return to Jalview and select File &rarr; Input Alignment &rarr; from File and open your file. A window with sequences should appear.
 +
#Copy the sequences for ankyrin domain models (below), click on the Jalview window, select File &rarr; Add sequences &rarr; from Textbox and paste them into the Jalview textbox. Paste two separate copies of the CD00204 consensus sequence and one copy of 1SW6.
 +
##When all the sequences are present, click on '''Add'''.
  
<tr class="s2">
+
Jalview now displays all the sequences, but of course this is not yet an alignment.
<td>&nbsp;&nbsp;''rankless taxon''<ref>The -myceta are well supported groups above the Class rank. See {{WP|Leotiomyceta|''Leotiomyceta''}} for details and references.</ref></td>
 
<td>-myceta</td>
 
<td>Saccharomyceta</td>
 
</tr>
 
  
<tr class="s1">
+
}}
<td>&nbsp;&nbsp;Subphylum</td>
 
<td>-mycotina</td>
 
<td>Saccharomycotina</td>
 
</tr>
 
  
<tr class="s2">
+
;Ankyrin domain models
<td>'''Class'''</td>
+
>CD00204 ankyrin repeat consensus sequence from CDD
<td>-mycetes</td>
+
NARDEDGRTPLHLAASNGHLEVVKLLLENGADVNAKDNDGRTPLHLAAKNGHLEIVKLLL
<td>Saccharomycetes</td>
+
EKGADVNARDKDGNTPLHLAARNGNLDVVKLLLKHGADVNARDKDGRTPLHLAAKNGHL
</tr>
 
  
<tr class="s1">
+
>1SW6 from PDB - unstructured loops replaced with xxxx
<td>&nbsp;&nbsp;Subclass</td>
+
GPIITFTHDLTSDFLSSPLKIMKALPSPVVNDNEQKMKLEAFLQRLLFxxxxSFDSLLQE
<td>-mycetidae</td>
+
VNDAFPNTQLNLNIPVDEHGNTPLHWLTSIANLELVKHLVKHGSNRLYGDNMGESCLVKA
<td>&nbsp;</td>
+
VKSVNNYDSGTFEALLDYLYPCLILEDSMNRTILHHIIITSGMTGCSAAAKYYLDILMGW
</tr>
+
IVKKQNRPIQSGxxxxDSILENLDLKWIIANMLNAQDSNGDTCLNIAARLGNISIVDALL
 +
DYGADPFIANKSGLRPVDFGAG
  
<tr class="s2">
+
===Computing alignments===
<td>'''Order'''</td>
 
<td>-ales</td>
 
<td>Saccharomycetales</td>
 
</tr>
 
  
<tr class="s1">
+
The EBI has a very convenient [http://www.ebi.ac.uk/Tools/msa/ page to access a number of MSA algorithms]. This is especially convenient when you want to compare, e.g. T-Coffee and Muscle and MAFFT results to see which regions of your alignment are robust. You could use any of these tools, just paste your sequences into a Webform, download the results and load into Jalview. Easy.
<td>'''Family'''</td>
 
<td>-aceae</td>
 
<td>Saccharomycetaceae</td>
 
</tr>
 
  
<tr class="s2">
+
But even easier is to calculate the alignments directly from Jalview, if its Web service is available. (Not today. <small>Bummer.</small>)
<td>&nbsp;&nbsp;Subfamily</td>
 
<td>-oideae</td>
 
<td>&nbsp;</td>
 
</tr>
 
  
<tr class="s1">
+
;Calculate a MAFFT alignment using the Jalview Web service option:
<td>&nbsp;&nbsp;Tribe</td>
 
<td>-eae</td>
 
<td>&nbsp;</td>
 
</tr>
 
  
<tr class="s2">
+
{{task|1=
<td>&nbsp;&nbsp;Subtribe</td>
+
#In Jalview, select '''Web Service &rarr; Alignment &rarr; MAFFT with defaults...'''. The alignment is calculated in a few minutes and displayed in a new window.
<td>-ineae</td>
+
}}
<td>&nbsp;</td>
 
</tr>
 
  
<tr class="s1">
+
;Calculate a MAFFT alignment when the Jalview Web service is NOT available:
<td>'''Genus'''</td>
+
 
<td>&nbsp;</td>
+
{{task|1=
<td>Saccharomyces</td>
+
#In Jalview, select '''File &rarr; Output to Textbox &rarr; FASTA'''
</tr>
+
#Copy the sequences.
 +
#Navigate to the [http://www.ebi.ac.uk/Tools/msa/mafft/ '''MAFFT Input form'''] at the EBI.
 +
#Paste your sequences into the form.
 +
#Click on '''Submit'''.
 +
#Close the Jalview sequence window and either save your MAFFT alignment to file and load it in Jalview, or simply select '''File &rarr; Input Alignment &rarr; from Textbox''', paste and click '''New Window'''.
 +
}}
 +
 
 +
 
 +
In any case, you should now have an alignment.
 +
 
 +
{{task|1=
 +
#Choose '''Colour &rarr; Hydrophobicity''' and '''&rarr; by Conservation'''. Then adjust the slider left or right to see which columns are highly conserved. You will notice that the Swi6 sequence that was supposed to align only to the ankyrin domains was in fact aligned to other parts of the sequence as well. This is one part of the MSA that we will have to correct manually and a common problem when aligning sequences of different lengths.
 +
}}
  
<tr class="s2">
 
<td>'''Species'''</td>
 
<td>&nbsp;</td>
 
<td>''Saccharomyces cerevisiae''</td>
 
</tr>
 
 
</table>
 
  
  
You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record. But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study, you should choose one from each of the major '''domains''' of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=40674&lvl=4 '''Mammalia'''] (a {{WP|Mammal_classification|'''class rank'''}}), and depending on how many species you want to include, use the
+
&nbsp;
subclass-, order-, or family rank (hover over the names to see their taxonomic rank.)
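The mapping from name suffix to rank can be written down in a few lines. This is a sketch only: the suffix table is copied from the table above and applies to fungal names, the function name <code>rank_from_suffix</code> is invented for this illustration, and names without a listed suffix (e.g. ''Fungi'', ''Ascomycota'') are simply reported as unknown.

```python
# Suffix -> rank, copied from the table above (fungal nomenclature only).
SUFFIX_RANK = {
    "myceta": "rankless taxon",
    "mycotina": "Subphylum",
    "mycetes": "Class",
    "mycetidae": "Subclass",
    "ales": "Order",
    "aceae": "Family",
    "oideae": "Subfamily",
    "eae": "Tribe",
    "ineae": "Subtribe",
}


def rank_from_suffix(name):
    """Return the rank implied by the suffix of a taxon name, or None.

    Hypothetical helper for illustration. Longer suffixes are tried
    first, so that "-aceae" is not mistaken for the shorter "-eae".
    """
    for suffix in sorted(SUFFIX_RANK, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_RANK[suffix]
    return None


# The lineage of Saccharomyces cerevisiae S288c as given by the NCBI.
lineage = (
    "cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; "
    "Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes; "
    "Saccharomycetales; Saccharomycetaceae; Saccharomyces; "
    "Saccharomyces cerevisiae"
)

for taxon in lineage.split("; "):
    print(f"{taxon:28s} {rank_from_suffix(taxon) or '(no suffix rule)'}")
```

Trying the longest suffix first is the one design decision that matters here; without it, every family name would be reported as a tribe.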
 
  
There will still be quite a bit of manual work involved and an exploration of different options on the Web may be useful. For our purposes here we can retrieve a good set of organisms from the [http://fungi.ensembl.org/info/website/species.html '''ensembl fungal genomes page'''] - maintained by the EBI's genome annotation group - that lists species grouped by taxonomic ''order''. All of these organisms are genome-sequenced, we can pick a set of representatives:
+
===Editing ankyrin domain alignments===
  
# Capnodiales&nbsp;&nbsp;&nbsp;''Zymoseptoria tritici''
 
# Erysiphales&nbsp;&nbsp;&nbsp;''Blumeria graminis''
 
# Eurotiales&nbsp;&nbsp;&nbsp;''Aspergillus nidulans''
 
# Glomerellales&nbsp;&nbsp;&nbsp;''Glomerella graminicola''
 
# Hypocreales&nbsp;&nbsp;&nbsp;''Trichoderma reesei''
 
# Magnaporthales&nbsp;&nbsp;&nbsp;''Magnaporthe oryzae''
 
# Microbotryales&nbsp;&nbsp;&nbsp;''Microbotryum violaceum''
 
# Pezizales&nbsp;&nbsp;&nbsp;''Tuber melanosporum''
 
# Pleosporales&nbsp;&nbsp;&nbsp;''Phaeosphaeria nodorum''
 
# Pucciniales&nbsp;&nbsp;&nbsp;''Puccinia graminis''
 
# Saccharomycetales&nbsp;&nbsp;&nbsp;''Saccharomyces cerevisiae''
 
# Schizosaccharomycetales&nbsp;&nbsp;&nbsp;''Schizosaccharomyces pombe''
 
# Sclerotiniaceae&nbsp;&nbsp;&nbsp;''Sclerotinia sclerotiorum''
 
# Sordariales&nbsp;&nbsp;&nbsp;''Neurospora crassa''
 
# Tremellales&nbsp;&nbsp;&nbsp;''Cryptococcus neoformans''
 
# Ustilaginales&nbsp;&nbsp;&nbsp;''Ustilago maydis''
 
  
This set of organisms thus can be used to generate a PSI-BLAST search in a well-distributed set of species. Of course '''you must also include YFO''' (<small>if YFO is not in this list already</small>).
+
A '''good''' MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since the alignment reflects the result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. The contiguous features annotated for Mbp1 are expected to be left intact by a good alignment.
  
To enter these 16 species as an Entrez restriction, they need to be formatted as below. (<small>One could also enter species one by one, by pressing the '''(+)''' button after the organism list</small>)
+
A '''poor''' MSA has many errors in its columns; these contain residues that actually have different functions or structural roles, even though they may look similar according to a (pairwise!) scoring matrix. A poor MSA also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted in a poor alignment and residues that are conserved may be placed into different columns.
  
 +
Often errors or inconsistencies are easy to spot, and manually editing an MSA is not generally frowned upon, even though this is not a strictly objective procedure. The main goal of manual editing is to make an alignment biologically more plausible. Most commonly this means to minimize the number of rare evolutionary events that the alignment suggests and/or to emphasize conservation of known functional motifs. Here are some examples of what one might aim for in manually editing an alignment:
  
<source lang="text">
+
;Reduce number of indels
Aspergillus nidulans[orgn]
+
From a Probcons alignment:
OR Blumeria graminis[orgn]
+
0447_DEBHA    ILKTE-K<span style="color: rgb(255, 0, 0);">-</span>T<span style="color: rgb(255, 0, 0);">---</span>K--SVVK      ILKTE----KTK---SVVK
OR Cryptococcus neoformans[orgn]
+
9978_GIBZE    MLGLN<span style="color: rgb(255, 0, 0);">-</span>PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
OR Glomerella graminicola[orgn]
+
1513_CANAL    ILKTE-K<span style="color: rgb(255, 0, 0);">-</span>I<span style="color: rgb(255, 0, 0);">---</span>K--NVVK      ILKTE----KIK---NVVK
OR Magnaporthe oryzae[orgn]
+
6132_SCHPO    ELDDI-I<span style="color: rgb(255, 0, 0);">-</span>ESGDY--ENVD      ELDDI-IESGDY---ENVD
OR Microbotryum violaceum[orgn]
+
1244_ASPFU    ----N<span style="color: rgb(255, 0, 0);">-</span>PGLREIC--HSIT  -&gt;  ----NPGLREIC---HSIT
OR Neurospora crassa[orgn]
+
0925_USTMA    LVKTC<span style="color: rgb(255, 0, 0);">-</span>PALDPHI--TKLK      LVKTCPALDPHI---TKLK
OR Phaeosphaeria nodorum[orgn]
+
2599_ASPTE    VLDAN<span style="color: rgb(255, 0, 0);">-</span>PGLREIS--HSIT      VLDANPGLREIS---HSIT
OR Puccinia graminis[orgn]
+
9773_DEBHA    LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
OR Sclerotinia sclerotiorum[orgn]
+
0918_CANAL    LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR
OR Trichoderma reesei[orgn]
 
OR Tuber melanosporum[orgn]
 
OR Saccharomyces cerevisiae[orgn]
 
OR Schizosaccharomyces pombe[orgn]
 
OR Ustilago maydis[orgn]
 
OR Zymoseptoria tritici[orgn]
 
  
</source>
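A restriction list in this format does not have to be typed by hand. The sketch below, under the assumption that you keep your species as a plain list of names, joins them into the <code>[orgn]</code> syntax shown above and appends YFO if it is missing. The function name <code>entrez_organism_query</code> and the short species list are illustrative only.

```python
def entrez_organism_query(species, yfo=None):
    """Format species names as an Entrez organism restriction.

    Hypothetical helper for illustration. Duplicates are dropped
    (keeping the first occurrence), so adding YFO twice does no harm.
    """
    names = list(species)
    if yfo is not None:
        names.append(yfo)
    unique = list(dict.fromkeys(n.strip() for n in names))
    return "\nOR ".join(f"{n}[orgn]" for n in unique)


# A shortened example list; use the full list of representatives in practice.
representatives = [
    "Aspergillus nidulans",
    "Saccharomyces cerevisiae",
    "Ustilago maydis",
]

print(entrez_organism_query(representatives, yfo="Ustilago maydis"))
```

The printed text can be pasted directly into the '''Entrez Query''' field.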
+
<small>Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably, however the total number of indels in this excerpt is reduced to 13 from the original 22</small>
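The indel counts quoted for this example can be checked with a short script. As a sketch, it assumes that an "indel" is counted as one contiguous run of gap characters within a row (including a leading gap); the function name <code>count_indels</code> is made up here, and the rows are the excerpt above with the colour markup removed.

```python
import re


def count_indels(rows):
    """Count contiguous runs of '-' over all rows of an alignment excerpt."""
    return sum(len(re.findall(r"-+", row)) for row in rows)


# Excerpt before manual editing (Probcons alignment).
before = [
    "ILKTE-K-T---K--SVVK",
    "MLGLN-PGLKEIT--HSIT",
    "ILKTE-K-I---K--NVVK",
    "ELDDI-I-ESGDY--ENVD",
    "----N-PGLREIC--HSIT",
    "LVKTC-PALDPHI--TKLK",
    "VLDAN-PGLREIS--HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]

# The same excerpt after moving the gaps.
after = [
    "ILKTE----KTK---SVVK",
    "MLGLNPGLKEIT---HSIT",
    "ILKTE----KIK---NVVK",
    "ELDDI-IESGDY---ENVD",
    "----NPGLREIC---HSIT",
    "LVKTCPALDPHI---TKLK",
    "VLDANPGLREIS---HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]

print(count_indels(before), count_indels(after))
```

Under this definition the script reproduces the 22 and 13 indels mentioned above. Note that it says nothing about whether the edited alignment is biologically more plausible; that judgement remains yours.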
  
  
 +
;Move indels to more plausible position
 +
From a CLUSTAL alignment:
 +
4966_CANGL    MKHEKVQ------GGYGRFQ---GTW      MKHEKV<span style="color: rgb(0, 170, 0);">Q</span>------GGYGRFQ---GTW
 +
1513_CANAL    KIKNVVK------VGSMNLK---GVW      KIKNVV<span style="color: rgb(0, 170, 0);">K</span>------VGSMNLK---GVW
 +
6132_SCHPO    VDSKHP<span style="color: rgb(255, 0, 0);">-</span>----------<span style="color: rgb(255, 0, 0);">Q</span>ID---GVW  -&gt;  VDSKHP<span style="color: rgb(0, 170, 0);">Q</span>-----------ID---GVW
 +
1244_ASPFU    EICHSIT------GGALAAQ---GYW      EICHSI<span style="color: rgb(0, 170, 0);">T</span>------GGALAAQ---GYW
  
&nbsp;
+
<small>The two characters marked in red were swapped. This does not change the number of indels but places the "Q" into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.</small>
  
==Executing the PSI-BLAST search==
+
;Conserve motifs
 +
From a CLUSTAL alignment:
 +
6166_SCHPO      --DKR<span style="color: rgb(255, 0, 0);">V</span>A---<span style="color: rgb(255, 0, 0);">G</span>LWVPP      --DKR<span style="color: rgb(0, 255, 0);">V</span>A--<span style="color: rgb(0, 255, 0);">G</span>-LWVPP
 +
XBP1_SACCE      GGYIK<span style="color: rgb(255, 0, 0);">I</span>Q---<span style="color: rgb(255, 0, 0);">G</span>TWLPM      GGYIK<span style="color: rgb(0, 255, 0);">I</span>Q--<span style="color: rgb(0, 255, 0);">G</span>-TWLPM
 +
6355_ASPTE      --DE<span style="color: rgb(255, 0, 0);">I</span>A<span style="color: rgb(255, 0, 0);">G</span>---NVWISP  -&gt;  ---DE<span style="color: rgb(0, 255, 0);">I</span>A--<span style="color: rgb(0, 255, 0);">G</span>NVWISP
 +
5262_KLULA      GGYIK<span style="color: rgb(255, 0, 0);">I</span>Q---<span style="color: rgb(255, 0, 0);">G</span>TWLPY      GGYIK<span style="color: rgb(0, 255, 0);">I</span>Q--<span style="color: rgb(0, 255, 0);">G</span>-TWLPY
  
We have a list of species. Good. Next up: how do we '''use''' it.
+
<small>The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.</small>
  
{{task|1=
 
  
 +
The Ankyrin domains are quite highly diverged, the boundaries not well defined and not even CDD, SMART and SAS agree on the precise annotations. We expect there to be alignment errors in this region. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required <i>indels</i> would be placed between the secondary structure elements, not in their middle. But from the sequence alignment alone, we cannot judge where the secondary structure elements ought to be. You should therefore add the following "sequence" to the alignment; it contains exactly as many characters as the Swi6 sequence above and annotates the secondary structure elements. I have derived it from the 1SW6 structure:
  
# Navigate to the BLAST homepage.
+
>SecStruc 1SW6 E: strand  t: turn  H: helix  _: irregular
# Select '''protein BLAST'''.
+
_EEE__tt___ttt______EE_____t___HHHHHHHHHHHHHHHH_xxxx_HHHHHHH
# Paste the APSES domain sequence into the search field.
+
HHHH_t_____t_____t____HHHHHHH__tHHHHHHHHH____t___tt____HHHHH
# Select '''refseq''' as the database.
+
HH__HHHH___HHHHHHHHHHHHHEE_t____HHHHHHHHH__t__HHHHHHHHHHHHHH
# Copy the organism restriction list from above '''and enter the correct name for YFO''' into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the '''Entrez Query''' field.
+
HHHHHH__EEE_xxxx_HHHHHt_HHHHHHH______t____HHHHHHHH__HHHHHHHH
# In the '''Algorithm''' section, select PSI-BLAST.
+
H____t____t____HHHH___
#Click on '''BLAST'''.
+
 
}}
+
<div class="reference-box">[http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=1sw6&template=protein.html&r=wiring&l=1&chain=A '''1SW6_A''' at the PDBSum database of structure annotations] You can compare the diagram there with this text string.</div>
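Before you start editing, it can be reassuring to confirm that the annotation string really matches the sequence character by character. The following sketch checks three things: equal length, only the expected annotation characters, and coinciding <code>x</code> placeholders. The function name <code>check_annotation</code> is invented for this illustration, and only the first line of each string is used here; paste the complete strings to check all of it.

```python
# Characters used in the annotation: strand, turn, helix, irregular,
# plus "x" for the unstructured loops that were replaced in the sequence.
ALLOWED = set("EtH_x")


def check_annotation(sequence, annotation):
    """Return a list of problems; an empty list means the strings match.

    Hypothetical helper for illustration only.
    """
    problems = []
    if len(sequence) != len(annotation):
        problems.append(
            f"length mismatch: {len(sequence)} residues, "
            f"{len(annotation)} annotation characters"
        )
    unexpected = sorted(set(annotation) - ALLOWED)
    if unexpected:
        problems.append("unexpected characters: " + "".join(unexpected))
    # zip() stops at the shorter string, so this is safe for unequal lengths.
    for pos, (res, state) in enumerate(zip(sequence, annotation), start=1):
        if (res == "x") != (state == "x"):
            problems.append(f"position {pos}: 'x' placeholders do not coincide")
    return problems


# First line of the 1SW6 sequence and of its annotation, as given above.
seq_line = "GPIITFTHDLTSDFLSSPLKIMKALPSPVVNDNEQKMKLEAFLQRLLFxxxxSFDSLLQE"
ss_line = "_EEE__tt___ttt______EE_____t___HHHHHHHHHHHHHHHH_xxxx_HHHHHHH"

print(check_annotation(seq_line, ss_line) or "no problems found")
```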
  
  
Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an '''E-value''' of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits&ndash;in the '''Sequences producing significant alignments...''' section&ndash;there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, they will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.
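The selection rule described here (keep a hit for the profile only if its coverage is high enough and its E-value is low enough) can be sketched as follows. Everything in the example is an assumption made for illustration: the <code>Hit</code> record, the function name <code>select_for_profile</code>, the stricter E-value cutoff of 0.001, and in particular the three hits, which are invented values and not real BLAST output.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One row of the hit table (hypothetical record for illustration)."""
    accession: str
    coverage: float  # query coverage in percent
    evalue: float


def select_for_profile(hits, min_coverage=80.0, max_evalue=0.001):
    """Split hits into those to include in the profile and those to hold back."""
    include, hold_back = [], []
    for hit in hits:
        if hit.coverage >= min_coverage and hit.evalue <= max_evalue:
            include.append(hit)
        else:
            hold_back.append(hit)
    return include, hold_back


# Invented example values, NOT real search results.
hits = [
    Hit("hit_A", coverage=98.0, evalue=1e-38),
    Hit("hit_B", coverage=62.0, evalue=1e-12),  # good E-value, poor coverage
    Hit("hit_C", coverage=91.0, evalue=0.004),  # good coverage, marginal E-value
]

include, hold_back = select_for_profile(hits)
print("include:  ", [h.accession for h in include])
print("hold back:", [h.accession for h in hold_back])
```

Keeping the held-back hits in a separate list mirrors the advice above: you want to be able to compare what happens to them in the next iteration.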
+
To proceed:
 +
#Manually align the Swi6 sequence with yeast Mbp1
 +
#Bring the Secondary structure annotation into its correct alignment with Swi6
 +
#Bring both CDD ankyrin profiles into the correct alignment with yeast Mbp1
  
 +
Proceed along the following steps:
  
 
{{task|1=
 
{{task|1=
#In the header section, click on '''Formatting options''' and in the line "Format for..." set the '''with inclusion threshold''' to <code>0.001</code> (This means E-values can't be above 10<sup>-03</sup> for the sequence to be included.)
+
#Add the secondary structure annotation to the sequence alignment in Jalview. Copy the annotation, select File &rarr; Add sequences &rarr; from Textbox and paste the sequence.
# Click on the '''Reformat''' button (top right).
+
#Select Help &rarr; Documentation and read about '''Editing Alignments''', '''Cursor Mode''' and '''Key strokes'''.
# In the table of sequence descriptions (not alignments!), click on the '''Query coverage''' to sort the table by coverage, not by score.  
+
#Click on the yeast Mbp1 sequence '''row''' to select the entire row. Then use the cursor key to move that sequence down, so it is directly above the 1SW6 sequence. Select the row of 1SW6 and use shift/mouse to move the sequence elements and edit the alignment to match yeast Mbp1. Refer to the alignment given in the [[Reference annotation yeast Mbp1|Mbp1 annotation page]] for the correct alignment.
# Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
+
#Align the secondary structure elements with the 1SW6 sequence: Every character of 1SW6 should be matched with either E, t, H, or _. The result should be similar to the [[Reference annotation yeast Mbp1|Mbp1 annotation page]]. If you need to insert gaps into all sequences in the alignment, simply drag your mouse over all row headers - movement of sequences is constrained to selected regions, the rest is locked into place to prevent inadvertent misalignments. Remember to save your project from time to time: '''File &rarr; save''' so you can reload a previous state if anything goes wrong and can't be fixed with '''Edit &rarr; Undo'''.
# '''Deselect''' the check mark next to these sequences in the right-hand column '''Select for PSI blast'''. (For me these are six sequences, but with YFO included that may be a bit different.)
+
#Finally align the two CD00204 consensus sequences to their correct positions (again, refer to the [[Reference annotation yeast Mbp1|Mbp1 annotation page]]).
# Then scroll to '''Run PSI-BLAST iteration 2 ...''' and click on '''<code>Go</code>'''.
+
#You can now consider the principles stated above and see if you can improve the alignment, for example by moving indels out of regions of secondary structure if that is possible without changing the character of the aligned columns significantly. Select blocks within which to work to leave the remaining alignment unchanged. So that this does not become tedious, you can restrict your editing to one Ankyrin repeat that is structurally defined in Swi6. You may want to open the 1SW6 structure in VMD to define the boundaries of one such repeat. You can copy and paste sections from Jalview into your assignment for documentation or export sections of the alignment to HTML (see the example below).  
 
}}
 
}}
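The selection rule used in these steps (keep a hit only if its query coverage is at least 80% and its E-value does not exceed 0.001) can be written down in a few lines. This is a hedged Python sketch, not part of the BLAST interface: the <code>Hit</code> record, its field names and the example accessions are hypothetical, and in practice you make this selection with the check boxes on the results page.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str   # hypothetical identifier of the hit
    coverage: float  # query coverage in percent
    evalue: float    # E-value reported for the hit


def select_for_profile(hits, min_coverage=80.0, max_evalue=0.001):
    """Split hits into those kept for the next profile and those excluded."""
    keep, drop = [], []
    for hit in hits:
        if hit.coverage >= min_coverage and hit.evalue <= max_evalue:
            keep.append(hit)
        else:
            drop.append(hit)
    return keep, drop


if __name__ == "__main__":
    hits = [
        Hit("hypothetical_A", 99.0, 1e-38),  # full length, highly significant
        Hit("hypothetical_B", 62.0, 1e-20),  # significant, but low coverage
        Hit("hypothetical_C", 95.0, 0.004),  # good coverage, E-value too high
    ]
    keep, drop = select_for_profile(hits)
    print([h.accession for h in keep])
    print([h.accession for h in drop])
```

Keeping the excluded list, rather than discarding it, mirrors the advice to paste the low-coverage rows into a text editor so you can follow them in the next iteration.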
  
 +
=== Editing ankyrin domain alignments - Sample===
 +
 +
This sample was created by
  
This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the '''profile''', not with any individual sequence. Note that we are controlling what goes into the profile in two ways:
+
# Editing the alignments as described above;
# we are explicitly removing sequences with poor coverage; and
+
# Copying a block of aligned sequence;
# we are requiring a more stringent E-value threshold for each sequence.
+
# Pasting it To New Alignment;
 +
# Colouring the residues by Hydrophobicity and setting the colour saturation according to Conservation;
 +
# Choosing File &rarr; Export Image &rarr; HTML and pasting the resulting HTML source into this Wikipage.  
  
  
{{task|1=
+
<table border="1"><tr><td>
#Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%.
+
<table border="0" cellpadding="0" cellspacing="0">
# Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
 
# Iterate the search in this way until no more "New" sequences are added to the profile. Then scan the list of excluded hits ... are there any from YFO that seem like they could potentially make the list? Marginal E-value perhaps, or reasonable E-value but less coverage? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...
 
}}
 
  
 +
<tr><td colspan="6"></td>
 +
<td colspan="9">10<br>|</td><td></td>
 +
<td colspan="9">20<br>|</td><td></td>
 +
<td colspan="9">30<br>|</td><td></td>
 +
<td colspan="3"></td><td colspan="3">40<br>|</td>
  
Once no "new" sequences have been added, if we were to repeat the process again and again, we would always get the same result because the profile stays the same. We say that the search has '''converged'''. Good. Time to harvest.
+
</tr>
 +
<tr><td nowrap="nowrap">MBP1_USTMA/341-368&nbsp;&nbsp;</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f3eef9">Y</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#fdeeef">L</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
{{task|1=
+
<td>-</td>
# At the header, click on '''Taxonomy reports''' and find YFO in the '''Organism Report''' section. These are your APSES domain homologs. All of them. Actually, perhaps more than all: the report may also include sequences with E-values above the inclusion threshold.
+
<td>-</td>
# From the report copy the sequence identifiers
+
<td>-</td>
## from YFO,
+
<td>-</td>
## with E-values that meet your defined threshold (i.e. that are no larger than it).
+
<td>-</td>
}}
+
<td>-</td>
 +
<td bgcolor="#ffd8d8">I</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
  
For example, the list of ''Saccharomyces'' genes is the following:
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#fbeef1">F</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#eeeefe">E</td>
  
<code>
+
<td bgcolor="#cfaddc">G</td>
<b>[http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=559292 Saccharomyces cerevisiae S288c]</b> [http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=4890 [ascomycetes]] taxid 559292<br \>
+
<td bgcolor="#dad8fd">E</td>
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320147&dopt=GenPept ref|NP_010227.1|] Mbp1p [Saccharomyces cerevisiae S288c]          [ 131]  1e-38<br \>
+
<td bgcolor="#d9c2e7">T</td>
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320957&dopt=GenPept ref|NP_011036.1|] Swi4p [Saccharomyces cerevisiae S288c]          [ 123]  1e-35<br \>
+
<td bgcolor="#d3c2ee">P</td>
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322808&dopt=GenPept ref|NP_012881.1|] Phd1p [Saccharomyces cerevisiae S288c]          [  91]  1e-25<br \>
+
<td bgcolor="#f7adb3">L</td>
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6323658&dopt=GenPept ref|NP_013729.1|] Sok2p [Saccharomyces cerevisiae S288c]          [  93]  3e-25<br \>
+
<td bgcolor="#ccaddf">T</td>
[http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6322090&dopt=GenPept ref|NP_012165.1|] Xbp1p [Saccharomyces cerevisiae S288c]          [  40]  5e-07<br \>
+
<td bgcolor="#ecc2d5">M</td>
</code>
+
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
  
[[Saccharomyces cerevisiae Xbp1|Xbp1]] is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.
+
<td bgcolor="#adadff">R</td>
 
+
<td bgcolor="#ebc2d5">A</td>
 
+
<td bgcolor="#eeeeff">R</td>
Next we need to retrieve the sequences. Tedious to retrieve them one by one, but we can get them all at the same time:
+
<td bgcolor="#f4eef8">S</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1B_SCHCO/470-498&nbsp;&nbsp;</td>
 +
<td>-</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#eeeefe">E</td>
  
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#f3eef9">Y</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#eeeeff">K</td>
 +
<td bgcolor="#f4eef8">S</td>
  
{{task|1=
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
# Return to the BLAST results page and again open the '''Formatting options'''.
+
<td>-</td>
# Find the '''Limit results''' section and enter YFO's name into the field. For example <code>Saccharomyces cerevisiae [ORGN]</code>
+
<td bgcolor="#f7d8e0">F</td>
# Click on '''Reformat'''
+
<td bgcolor="#fbd8db">L</td>
# Scroll to the '''Descriptions''' section, check the box at the left-hand margin, next to each sequence you want to keep. Then click on '''Download &rarr; FASTA complete sequence &rarr; Continue'''.
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#fdeeef">L</td>
  
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#c5c2fb">E</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#f7adb3">L</td>
  
 +
<td bgcolor="#b0adfa">N</td>
 +
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#fcc2c4">V</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
</tr>
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand" data-collapsetext="Collapse" style="border:#000000 solid 1px; padding: 10px; margin-left:25px; margin-right:25px;">There are actually several ways to download lists of sequences. Using the results page utility is only one. But if you know the GIs of the sequences you need, you can get them more directly by putting them into the URL...
+
<tr><td nowrap="nowrap">MBP1_ASHGO/465-494&nbsp;&nbsp;</td>
<div  class="mw-collapsible-content">
+
<td>F</td>
 
+
<td bgcolor="#f4eef8">S</td>
* http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=docsum  - The default report
+
<td bgcolor="#f2eefa">P</td>
* http://www.ncbi.nlm.nih.gov/protein/6320147,6320957,6322808,6323658,6322090?report=fasta - FASTA sequences with NCBI HTML markup
+
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#f3eef9">Y</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#ffeeee">I</td>
 +
<td>-</td>
  
Even more flexible is the [http://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Downloading_Full_Records '''eUtils'''] interface to the NCBI databases. For example you can download the dataset in text format by clicking below.
+
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#f4eef8">T</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
* http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=6320147,6320957,6322808,6323658,6322090&rettype=fasta&retmode=text
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#ffd8d8">I</td>
 +
<td>-</td>
 +
<td>-</td>
  
Note that this utility does not '''show''' anything, but downloads the (multi) fasta file to your default download directory.
+
<td>-</td>
+
<td>-</td>
</div>
+
<td bgcolor="#dad8fd">N</td>
</div>
+
<td bgcolor="#f9eef3">A</td>
}}
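The eUtils URL shown in this task follows a simple pattern, so it can be assembled from a list of GIs rather than typed by hand. The following Python sketch only builds the URL string and makes no network request; the function name <code>build_efetch_url</code> is hypothetical, while the base address and the parameters <code>db</code>, <code>id</code>, <code>rettype</code> and <code>retmode</code> are taken from the link above.

```python
from urllib.parse import urlencode

# Base address as it appears in the efetch link in the text.
EFETCH_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(gi_list, db="protein", rettype="fasta", retmode="text"):
    """Assemble an efetch URL for a list of GI numbers (no request is sent)."""
    params = [
        ("db", db),
        ("id", ",".join(str(gi) for gi in gi_list)),
        ("rettype", rettype),
        ("retmode", retmode),
    ]
    # safe="," keeps the commas of the id list unescaped, as in the link above.
    return EFETCH_BASE + "?" + urlencode(params, safe=",")


if __name__ == "__main__":
    yeast_gis = [6320147, 6320957, 6322808, 6323658, 6322090]
    print(build_efetch_url(yeast_gis))
```

Pasting the printed URL into a browser should behave like clicking the link in the text.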
+
<td bgcolor="#eeeefe">Q</td>
<!--
+
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#efc2d0">C</td>
 +
<td bgcolor="#eeeeff">K</td>
 +
<td bgcolor="#cfaddc">G</td>
  
Add to this assignment:
+
<td bgcolor="#e6d8f0">S</td>
- study the BLAST output format, links, tools, scores ...
+
<td bgcolor="#d9c2e7">T</td>
- compare the improvement in E-values to standard BLAST
+
<td bgcolor="#d3c2ee">P</td>
- examine this in terms of sensitivity and specificity
+
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e5adc6">M</td>
  
-->
+
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_CLALU/550-586&nbsp;&nbsp;</td>
 +
<td>G</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#eeeefe">N</td>
  
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
<td>N</td>
 +
<td>D</td>
 +
<td>K</td>
 +
<td bgcolor="#eeeeff">K</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td>-</td>
  
==Introduction==
+
<td>-</td>
 
+
<td>-</td>
In the last assignment we discovered homologs to ''S. cerevisiae'' Mbp1 in YFO. Some of these will be orthologs to Mbp1, some will be paralogs. Some will have similar function, some will not. We discussed previously that genes that evolve under continuously similar evolutionary pressure should be most similar in sequence, and should have the most similar "function".
+
<td>-</td>
 
+
<td>-</td>
In this assignment we will define the YFO gene that is the most similar ortholog to ''S. cerevisiae'' Mbp1, and perform a multiple sequence alignment with it.
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
Let us briefly review the basic concepts.
+
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#ffd8d8">I</td>
 +
<td>S</td>
 +
<td>K</td>
 +
<td>F</td>
 +
<td>L</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#eeeefe">Q</td>
  
==Orthologs and Paralogs revisited==
+
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#edadbd">F</td>
 +
<td bgcolor="#b3adf7">H</td>
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#c6ade5">Y</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#f9eef3">M</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
</tr>
  
&nbsp;<br>
+
<tr><td nowrap="nowrap">MBPA_COPCI/514-542&nbsp;&nbsp;</td>
;All related genes are homologs.
 
</div>
 
  
 +
<td>-</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#fbeef1">F</td>
 +
<td>-</td>
 +
<td>-</td>
  
Two central definitions about the mutual relationships between related genes go back to Walter Fitch who stated them in the 1970s:
+
<td>-</td>
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
<td bgcolor="#eeeeff">R</td>
 
+
<td bgcolor="#f4eef8">S</td>
&nbsp;<br>
+
<td>-</td>
;Orthologs have diverged after speciation.
+
<td>-</td>
 
+
<td>-</td>
;Paralogs have diverged after duplication.
+
<td>-</td>
</div>
+
<td>-</td>
 +
<td>-</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fdd8da">V</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
&nbsp;
+
<td>-</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#fdeeef">L</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#c5c2fb">E</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
  
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#ffadad">I</td>
 +
<td bgcolor="#b0adfa">N</td>
 +
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#fcc2c4">V</td>
  
[[Image:OrthologParalog.jpg|frame|none|'''Hypothetical evolutionary tree.''' A single gene evolves through two speciation events and one duplication event. A duplication occurs during the evolution from reptilian to synapsid. It is easy to see how this pair of genes (paralogs) in the ancestral synapsid gives rise to two pairs of genes in pig and elephant, respectively. All ''circle'' genes are mutual orthologs, they form a "cluster of orthologs". All genes within one species are mutual paralogs&ndash;they are so-called ''in-paralogs''. The ''circle'' gene in pig and the ''triangle'' gene in the elephant are so-called ''out-paralogs''. Somewhat counterintuitively, the ''triangle'' gene in the pig and the ''circle'' gene in the raven are also orthologs - but this has to be, since the last common ancestor diverged by '''speciation'''.
+
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_DEBHA/507-550&nbsp;&nbsp;</td>
 +
<td>I</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
<td bgcolor="#eeeefe">Q</td>
  
The "phylogram" on the right symbolizes the amount of evolutionary change as proportional to height difference to the "root". It is easy to see how a bidirectional BLAST search will only find pairs of most similar orthologs. If applied to a group of species, bidirectional BLAST searches will find clusters of orthologs only (except if genes were lost, or there are  anomalies in the evolutionary rate.)]]
+
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#ffeeee">I</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td>K</td>
 +
<td>K</td>
  
 +
<td>L</td>
 +
<td>S</td>
 +
<td>L</td>
 +
<td>S</td>
 +
<td>D</td>
 +
<td>K</td>
 +
<td>K</td>
 +
<td>E</td>
 +
<td bgcolor="#fbd8db">L</td>
  
==Defining orthologs==
+
<td bgcolor="#ffd8d8">I</td>
 +
<td>A</td>
 +
<td>K</td>
 +
<td>F</td>
 +
<td>I</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
  
To be reasonably certain about orthology relationships, we would need to construct and analyze detailed evolutionary trees. This is computationally expensive and the results are not always unambiguous either, as we will see in a later assignment. But a number of different strategies are available that use precomputed results to define orthologs. These are especially useful for large, cross genome surveys. They are less useful for detailed analysis of individual genes. Pay the sites a visit and try a search.
+
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#edadbd">F</td>
 +
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#ffc2c2">I</td>
  
 +
<td bgcolor="#fbadaf">V</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#c6ade5">Y</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#fdeeef">L</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1A_SCHCO/388-415&nbsp;&nbsp;</td>
 +
<td>-</td>
  
;Orthologs by eggNOG
+
<td>-</td>
:The [http://eggnog.embl.de/ '''eggNOG'''] (evolutionary genealogy of genes: Non-supervised Orthologous Groups) database contains orthologous groups of genes at the EMBL. It seems to be continuously updated, the search functionality is reasonable and the results for yeast Mbp1 show many genes from several fungi. Importantly, there is only one gene annotated for each species. Alignments and trees are also available, as are database downloads for algorithmic analysis.
+
<td bgcolor="#f3eef9">Y</td>
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
+
<td bgcolor="#f2eefa">P</td>
&nbsp;
+
<td bgcolor="#eeeeff">K</td>
<div class="mw-collapsible-content">
+
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#fdeeef">L</td>
{{#pmid: 24297252}}
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
</div>
+
<td bgcolor="#f9eef3">A</td>
</div>
+
<td bgcolor="#eeeefe">D</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fdd8da">V</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
;Orthologs at OrthoDB
+
<td bgcolor="#dad8fd">N</td>
:[http://www.orthodb.org/ '''OrthoDB'''] includes a large number of species, among them all of our protein-sequenced fungi. However, the search function (by keyword) retrieves many paralogs together with the orthologs: for example, the yeast Sok2 and Phd1 proteins are found in the same orthologous group, although these two are clearly paralogs.
+
<td bgcolor="#fbeef1">F</td>
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
+
<td bgcolor="#eeeefe">Q</td>
&nbsp;
+
<td bgcolor="#c5c2fb">D</td>
<div class="mw-collapsible-content">
+
<td bgcolor="#c5c2fb">E</td>
+
<td bgcolor="#eeeefe">D</td>
{{#pmid: 23180791}}
+
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">E</td>
 +
<td bgcolor="#d9c2e7">T</td>
  
</div>
+
<td bgcolor="#ebc2d5">A</td>
</div>
+
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#ccaddf">T</td>
 +
<td bgcolor="#ecc2d5">M</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#efc2d0">C</td>
 +
<td bgcolor="#eeeeff">R</td>
  
 +
<td bgcolor="#f4eef8">S</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_AJECA/374-403&nbsp;&nbsp;</td>
 +
<td>T</td>
 +
<td bgcolor="#fdeeef">L</td>
 +
<td bgcolor="#f2eefa">P</td>
 +
<td bgcolor="#f2eefa">P</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#eeeefe">Q</td>
  
;Orthologs at OMA
+
<td bgcolor="#ffeeee">I</td>
:[http://omabrowser.org/ '''OMA'''] (the Orthologous Matrix) maintained at the Swiss Federal Institute of Technology contains a large number of orthologs from sequenced genomes. Searching with <code>MBP1_YEAST</code> (this is the Swissprot ID) as a "Group" search finds the correct gene in EREGO, KLULA, CANGL and SACCE. But searching with the sequence of the ''Ustilago maydis'' ortholog does not find the yeast protein, but the orthologs in YARLI, SCHPO, LACCBI, CRYNE and USTMA. Apparently the orthologous group has been split into several subgroups across the fungi. However as a whole the database is carefully constructed and available for download and API access; a large and useful resource.
+
<td>-</td>
<div class="mw-collapsible mw-collapsed" data-expandtext="more..." data-collapsetext="less" style="width:800px">
+
<td>-</td>
&nbsp;
+
<td>-</td>
<div class="mw-collapsible-content">
+
<td bgcolor="#f4eef8">S</td>
+
<td bgcolor="#f9eef3">M</td>
{{#pmid: 21113020}}
+
<td>-</td>
 
+
<td>-</td>
... see also the related articles, much innovative and carefully done work on automated orthologue definition by the Dessimoz group.
+
<td>-</td>
</div>
 
</div>
 
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fbd8db">L</td>
  
;Orthologs by syntenic gene order conservation
+
<td>-</td>
:We will revisit this when we explore the UCSC genome browser.
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#e6d8f0">S</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#d8c2e8">S</td>
  
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#e4adc7">A</td>
  
;Orthologs by RBM
+
<td bgcolor="#e4adc7">A</td>
:Defining it yourself. RBM (or: Reciprocal Best Match) is easy to compute and half of the work you have already done in [[BIO_Assignment_Week_3|Assignment 3]]. Get the ID for the gene which you have identified and annotated as the best BLAST match for Mbp1 in YFO and confirm that this gene has Mbp1 as the most significant hit in the yeast proteome. <small>The results are unambiguous, but there may be residual doubt whether these two best-matching sequences are actually the most similar orthologs.</small>  
+
<td bgcolor="#adadff">K</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#faeef2">C</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_PARBR/380-409&nbsp;&nbsp;</td>
 +
<td>I</td>
 +
<td bgcolor="#fdeeef">L</td>
  
{{task|1=
+
<td bgcolor="#f2eefa">P</td>
# Navigate to the BLAST homepage.
+
<td bgcolor="#f2eefa">P</td>
# Paste the YFO RefSeq sequence identifier into the search field. (You don't have to search with sequences&ndash;you can search directly with an NCBI identifier '''IF''' you want to search with the full-length sequence.)
+
<td bgcolor="#efeefd">H</td>
# Set the database to refseq, and restrict the species to ''Saccharomyces cerevisiae''.
+
<td bgcolor="#eeeefe">Q</td>
# Run BLAST.
+
<td bgcolor="#ffeeee">I</td>
# Keep the window open for the next task.
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f4eef8">S</td>
  
The top hit should be yeast Mbp1 (NP_010227). Email me your sequence identifiers if it is not.
+
<td bgcolor="#fdeeef">L</td>
If it is, you have confirmed the '''RBM''' or '''BBH''' criterion (Reciprocal Best Match or Bidirectional Best Hit, respectively).
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
<small>Technically, this is not perfectly true since you have searched with the APSES domain in one direction, with the full-length sequence in the other. For this task I wanted you to try the ''search-with-accession-number''. Therefore the procedural laxness, I hope it is permissible. In fact, performing the reverse search with the YFO APSES domain should actually be more stringent, i.e. if you find the right gene with the longer sequence, you are even more likely to find the right gene with the shorter one.</small>  
+
<td>-</td>
}}
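The RBM criterion itself is a small piece of logic: take the best hit of the forward search, then check whether the best hit of the reverse search, started from that hit, is the original query. Here is a minimal Python sketch under stated assumptions: hits are given as hypothetical <code>(identifier, E-value)</code> pairs copied by hand from the two BLAST result tables, "best" means the smallest E-value, and the identifier <code>YFO_protein_1</code> and all E-values are made up for illustration.

```python
def best_hit(hits):
    """Return the identifier with the smallest E-value, or None if no hits."""
    if not hits:
        return None
    return min(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_match(query_id, forward_hits, reverse_hits_by_subject):
    """Check the RBM criterion.

    forward_hits: (subject_id, evalue) pairs from searching query_id
        against the other proteome.
    reverse_hits_by_subject: dict mapping a subject_id to the
        (hit_id, evalue) pairs obtained by searching it back against
        the query's proteome.
    """
    forward_best = best_hit(forward_hits)
    if forward_best is None:
        return False
    reverse_best = best_hit(reverse_hits_by_subject.get(forward_best, []))
    return reverse_best == query_id


if __name__ == "__main__":
    # Hypothetical values: yeast Mbp1 searched against YFO, and back.
    forward = [("YFO_protein_1", 1e-40), ("YFO_protein_2", 1e-25)]
    reverse = {"YFO_protein_1": [("NP_010227", 1e-38), ("NP_011036", 1e-30)]}
    print(is_reciprocal_best_match("NP_010227", forward, reverse))
```

Note that this sketch, like the task above, says nothing about whether both searches used the same query region; that caveat from the small print still applies.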
+
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#e6d8f0">S</td>
  
 +
<td bgcolor="#f4eef8">S</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#d8c2e8">S</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
  
;Orthology by annotation
+
<td bgcolor="#e4adc7">A</td>
:The NCBI precomputes BLAST results and makes them available at the RefSeq database entry for your protein.
+
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">K</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#faeef2">C</td>
  
{{task|1=
+
</tr>
# In your BLAST result page, click on the RefSeq link for your query to navigate to the RefSeq database entry for your protein.
+
<tr><td nowrap="nowrap">MBP1_NEOFI/363-392&nbsp;&nbsp;</td>
# Follow the '''Blink''' link in the right-hand column under '''Related information'''.
+
<td>T</td>
# Restrict the view to RefSeq under "Display options", and to Fungi.
+
<td bgcolor="#faeef2">C</td>
 
+
<td bgcolor="#f4eef8">S</td>
You should see a number of genes with low E-values and high coverage in other fungi - however this search is problematic since the full length gene across the database finds mostly Ankyrin domains.
+
<td bgcolor="#eeeefe">Q</td>
}}
+
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#ffeeee">I</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#fdeeef">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
You will find that '''all''' of these approaches yield '''some''' of the orthologs. But none finds them all. The take home message is: precomputed results are good for large-scale survey-type investigations, where you can't humanly process the information by hand. But for more detailed questions, careful manual searches are still indispensable.
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand for crowdsourcing" data-collapsetext="Collapse">
+
<td>-</td>
;Orthology by crowdsourcing
+
<td>-</td>
:Luckily a crowd of willing hands has prepared the necessary sequences for you: in the section below you will find a link to the annotated and verified Mbp1 orthologs from last year's course  :-)
+
<td>-</td>
 +
<td bgcolor="#e6d8f0">S</td>
 +
<td bgcolor="#faeef2">C</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#d8c2e8">S</td>
 +
<td bgcolor="#eeeefe">N</td>
  
<div class="mw-collapsible-content">
+
<td bgcolor="#cfaddc">G</td>
We could call this annotation by many hands {{WP|Crowdsourcing|"crowdsourcing"}} - handing out small parcels of work to many workers, who would typically allocate only a small share of their time, but here the strength is in numbers and especially projects that organize via the Internet can tally up very impressive manpower, for free, or as {{WP|Microwork}}. These developments have some interest for bioinformatics: many of our more difficult tasks cannot easily be built into an algorithm; language-related tasks such as text-mining, or pattern-matching tasks, come to mind. Allocating this to a large number of human contributors may be a viable alternative to computation. A marketplace where this kind of work is already a reality is {{WP|Amazon Mechanical Turk|Amazon's "Mechanical Turk" Marketplace}}: programmers&ndash;"requesters"&ndash; use an open interface to post tasks for payment, "providers" from all over the world can engage in these. Tasks may include matching of pictures, or evaluating the aesthetics of competing designs. A quirky example I came across recently was when information designer David McCandless had 200 "Mechanical Turks" draw a small picture of their soul for his collection.
+
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#fcc2c4">V</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
  
The name {{WP|The Turk|"Mechanical Turk"}} by the way relates to a famous ruse, when a Hungarian inventor and adventurer toured the imperial courts of 18<sup>th</sup> century Europe with an automaton, dressed in Turkish robes and turban, that played chess at the grandmaster level against opponents that included Napoleon Bonaparte and Benjamin Franklin. No small mechanical feat in any case, it was only in the 19<sup>th</sup> century that it was revealed that the computational power was actually provided by a concealed human.
+
<td bgcolor="#adadff">R</td>
 
+
<td bgcolor="#c5c2fb">N</td>
Are you up for some "Turking"? Before the next quiz, edit [http://biochemistry.utoronto.ca/steipe/abc/students/index.php/BCH441_2014_Assignment_7_RBM '''the Mbp1 RBM page on the Student Wiki] and include the RBM for Mbp1, for a 10% bonus on the next quiz.
+
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_ASPNI/365-394&nbsp;&nbsp;</td>
 +
<td>T</td>
 +
<td bgcolor="#fbeef1">F</td>
 +
<td bgcolor="#f4eef8">S</td>
  
</div>
+
<td bgcolor="#f2eefa">P</td>
</div>
+
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#fdeeee">V</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#fdeeef">L</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
&nbsp;
+
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#e6d8f0">S</td>
 +
<td bgcolor="#faeef2">C</td>
  
==Align and Annotate==
+
<td bgcolor="#eeeefe">Q</td>
 
+
<td bgcolor="#c5c2fb">D</td>
 
+
<td bgcolor="#d8c2e8">S</td>
&nbsp;<br>
+
<td bgcolor="#fdeeee">V</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#fbadaf">V</td>
  
 +
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#fcc2c4">V</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#fdeeee">V</td>
 +
</tr>
  
===Review of domain annotations===
+
<tr><td nowrap="nowrap">MBP1_UNCRE/377-406&nbsp;&nbsp;</td>
 +
<td>M</td>
 +
<td bgcolor="#f3eef9">Y</td>
 +
<td bgcolor="#f2eefa">P</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#fdeeee">V</td>
 +
<td>-</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#fdeeef">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
APSES domains are relatively easy to identify and annotate, but we have had problems with the ankyrin domains in Mbp1 homologues. Both CDD and SMART have identified such domains, but while the domain model was based on the same Pfam profile for both, and both annotated approximately the same regions, the details of the alignments and the extent of the predicted region were different.
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
 +
<td>-</td>
  
[http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=mbp1 Mbp1] forms heterodimeric complexes with a homologue, [http://www.yeastgenome.org/cgi-bin/locus.fpl?locus=swi6 Swi6]. Swi6 does not have an APSES domain, thus it does not bind DNA. But it is similar to Mbp1 in the region spanning the ankyrin domains and in 1999 [http://www.ncbi.nlm.nih.gov/pubmed/10048928 Foord ''et al.''] published its crystal structure ([http://www.rcsb.org/pdb/cgi/explore.cgi?pdbId=1SW6 1SW6]). This structure is a good model for Ankyrin repeats in Mbp1. For details, please refer to the consolidated [[Reference annotation yeast Mbp1|Mbp1 annotation page]] I have prepared.
+
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f2d8e5">A</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#d8c2e8">S</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#cfaddc">G</td>
  
In what follows, we will use the program Jalview, a Java-based multiple sequence alignment editor, to load and align sequences and to consider structural similarity between yeast Mbp1 and its closest homologue in your organism.
+
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">K</td>
  
In this part of the assignment,
+
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#faeef2">C</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_PENCH/439-468&nbsp;&nbsp;</td>
 +
<td>T</td>
 +
<td bgcolor="#faeef2">C</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
<td bgcolor="#eeeefe">Q</td>
  
#You will load sequences that are most similar to Mbp1 into an MSA editor;
+
<td bgcolor="#eeeefe">D</td>
#You will add sequences of ankyrin domain models;
+
<td bgcolor="#eeeefe">E</td>
#You will perform a multiple sequence alignment;
+
<td bgcolor="#ffeeee">I</td>
#You will try to improve the alignment manually;
+
<td>-</td>
<!-- Finally you will consider if the Mbp1 APSES domains could extend beyond the section of homology with Swi6 -->
+
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#f9eef3">M</td>
 +
<td>-</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
===Jalview, loading sequences===
+
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#e6d8f0">S</td>
 +
<td bgcolor="#faeef2">C</td>
 +
<td bgcolor="#eeeefe">Q</td>
  
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#c5c2fb">Q</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#fbadaf">V</td>
 +
<td bgcolor="#f7adb3">L</td>
  
Geoff Barton's lab in Dundee has developed an integrated MSA editor and sequence annotation workbench with a number of very useful functions. It is written in Java and should run on Mac, Linux and Windows platforms without modifications.
+
<td bgcolor="#fcc2c4">V</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
</tr>
  
 +
<tr><td nowrap="nowrap">MBPA_TRIVE/407-436&nbsp;&nbsp;</td>
  
{{#pmid: 19151095}}
+
<td>V</td>
 
+
<td bgcolor="#fbeef1">F</td>
 
+
<td bgcolor="#f2eefa">P</td>
We will use this tool for this assignment and explore its features as we go along.
+
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#ffeeee">I</td>
 +
<td>-</td>
 +
<td>-</td>
  
{{task|1=
+
<td>-</td>
#Navigate to the [http://www.jalview.org/ Jalview homepage], click on '''Download''', install Jalview on your computer and start it. A number of windows that showcase the program's abilities will load; you can close these.
+
<td bgcolor="#f4eef8">S</td>
#Prepare homologous Mbp1 sequences for alignment:
+
<td bgcolor="#fdeeef">L</td>
##Open the '''[[Reference Mbp1 orthologues (all fungi)]]''' page. (This is the list of Mbp1 orthologs I mentioned above.)
+
<td>-</td>
##Copy the FASTA sequences of the reference proteins, paste them into a text file (TextEdit on the Mac, Notepad on Windows) and save the file; you could give it an extension of <code>.fa</code>&ndash;but you don't have to.
+
<td>-</td>
##Check whether the sequence for YFO is included in the list. If it is, fine. If it is not, retrieve it from NCBI, paste it into the file and edit the header like the other sequences. If the wrong sequence from YFO is included, replace it and let me know.
+
<td>-</td>
#Return to Jalview and select File &rarr; Input Alignment &rarr; from File and open your file. A window with sequences should appear.
+
<td>-</td>
#Copy the sequences for ankyrin domain models (below), click on the Jalview window, select File &rarr; Add sequences &rarr; from Textbox and paste them into the Jalview textbox. Paste two separate copies of the CD00204 consensus sequence and one copy of 1SW6.
+
<td>-</td>
##When all the sequences are present, click on '''Add'''.
+
<td>-</td>
  
Jalview now displays all the sequences, but of course this is not yet an alignment.
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
}}
+
<td>-</td>
 +
<td bgcolor="#e6d8f0">S</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
  
;Ankyrin domain models
+
<td bgcolor="#d9c2e7">T</td>
>CD00204 ankyrin repeat consensus sequence from CDD
+
<td bgcolor="#ebc2d5">A</td>
NARDEDGRTPLHLAASNGHLEVVKLLLENGADVNAKDNDGRTPLHLAAKNGHLEIVKLLL
+
<td bgcolor="#e4adc7">A</td>
EKGADVNARDKDGNTPLHLAARNGNLDVVKLLLKHGADVNARDKDGRTPLHLAAKNGHL
+
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">K</td>
 +
<td bgcolor="#c5c2fb">N</td>
  
>1SW6 from PDB - unstructured loops replaced with xxxx
+
<td bgcolor="#f4eef7">G</td>
GPIITFTHDLTSDFLSSPLKIMKALPSPVVNDNEQKMKLEAFLQRLLFxxxxSFDSLLQE
+
<td bgcolor="#faeef2">C</td>
VNDAFPNTQLNLNIPVDEHGNTPLHWLTSIANLELVKHLVKHGSNRLYGDNMGESCLVKA
+
</tr>
VKSVNNYDSGTFEALLDYLYPCLILEDSMNRTILHHIIITSGMTGCSAAAKYYLDILMGW
+
<tr><td nowrap="nowrap">MBP1_PHANO/400-429&nbsp;&nbsp;</td>
IVKKQNRPIQSGxxxxDSILENLDLKWIIANMLNAQDSNGDTCLNIAARLGNISIVDALL
+
<td>T</td>
DYGADPFIANKSGLRPVDFGAG
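
Before loading a hand-assembled FASTA file into Jalview, it can help to confirm that every record has a header and a non-empty sequence. The following is a minimal sketch in Python, not part of the assignment; the function name <code>parse_fasta</code> is hypothetical. It returns a list rather than a dictionary because the task asks for two copies of the CD00204 consensus, so headers may legitimately repeat.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples.

    Headers are returned without the leading '>'. Sequence lines that belong
    to the same record are joined. Blank lines are ignored.
    """
    records = []          # list of [header, list-of-sequence-lines]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), []])
        elif not records:
            raise ValueError("sequence data found before the first header")
        else:
            records[-1][1].append(line)
    return [(header, "".join(parts)) for header, parts in records]


if __name__ == "__main__":
    demo = ">toy record 1\nACDE\nFGHI\n\n>toy record 2\nKLMN\n"
    for header, sequence in parse_fasta(demo):
        print(header, len(sequence))
```

Pasting the contents of your own file into <code>demo</code> lists each header with its sequence length, which makes truncated or missing records easy to spot.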
+
<td bgcolor="#f4eef9">W</td>
 +
<td bgcolor="#ffeeee">I</td>
 +
<td bgcolor="#f2eefa">P</td>
 +
<td bgcolor="#eeeefe">E</td>
  
===Computing alignments===
+
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#fdeeee">V</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f4eef8">T</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td>-</td>
 +
<td>-</td>
  
The EBI has a very convenient [http://www.ebi.ac.uk/Tools/msa/ page to access a number of MSA algorithms]. This is especially convenient when you want to compare, e.g. T-Coffee and Muscle and MAFFT results to see which regions of your alignment are robust. You could use any of these tools, just paste your sequences into a Webform, download the results and load into Jalview. Easy.
+
<td>-</td>
 
+
<td>-</td>
But even easier is to calculate the alignments directly from Jalview, when its Web services are available. (Not today. <small>Bummer.</small>)
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
  
;Calculate a MAFFT alignment using the Jalview Web service option:
+
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">D</td>
  
{{task|1=
+
<td bgcolor="#c5c2fb">Q</td>
#In Jalview, select '''Web Service &rarr; Alignment &rarr; MAFFT with defaults...'''. The alignment is calculated in a few minutes and displayed in a new window.
+
<td bgcolor="#eeeefe">N</td>
}}
+
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#ffadad">I</td>
 +
<td bgcolor="#e5adc6">M</td>
 +
<td bgcolor="#ffc2c2">I</td>
  
;Calculate a MAFFT alignment when the Jalview Web service is NOT available:
+
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBPA_SCLSC/294-313&nbsp;&nbsp;</td>
 +
<td>-</td>
  
{{task|1=
+
<td>-</td>
#In Jalview, select '''File &rarr; Output to Textbox &rarr; FASTA'''
+
<td>-</td>
#Copy the sequences.
+
<td>-</td>
#Navigate to the [http://www.ebi.ac.uk/Tools/msa/mafft/ '''MAFFT Input form'''] at the EBI.
+
<td>-</td>
#Paste your sequences into the form.
+
<td>-</td>
#Click on '''Submit'''.
+
<td>-</td>
#Close the Jalview sequence window and either save your MAFFT alignment to file and load it in Jalview, or simply select '''File &rarr; Input Alignment &rarr; from Textbox''', paste and click '''New Window'''.
+
<td>-</td>
}}
+
<td>-</td>
 +
<td>-</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
In any case, you should now have an alignment.
+
<td>-</td>
 
+
<td>-</td>
{{task|1=
+
<td>-</td>
#Choose '''Colour &rarr; Hydrophobicity''' and '''&rarr; by Conservation'''. Then adjust the slider left or right to see which columns are highly conserved. You will notice that the Swi6 sequence that was supposed to align only to the ankyrin domains was in fact aligned to other parts of the sequence as well. This is one part of the MSA that we will have to correct manually and a common problem when aligning sequences of different lengths.
+
<td>-</td>
}}
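
To make the idea of "highly conserved columns" concrete, here is a minimal Python sketch that scores each alignment column by the fraction of sequences sharing its most common residue. This is an assumption-laden simplification for illustration only: it is not the conservation score that Jalview computes, which also takes physico-chemical properties into account. The function name <code>column_conservation</code> is hypothetical.

```python
from collections import Counter


def column_conservation(rows, gap="-"):
    """Return one score per alignment column.

    The score is the count of the most common non-gap residue divided by the
    number of sequences, so a fully conserved, gap-free column scores 1.0 and
    an all-gap column scores 0.0.
    """
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all rows must be present and have the same length")
    scores = []
    for column in zip(*rows):
        residues = [c for c in column if c != gap]
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(rows))
    return scores


if __name__ == "__main__":
    toy_alignment = ["AC-", "ACD", "GC-"]   # toy data, not from the Mbp1 set
    print(column_conservation(toy_alignment))
```

Moving the Jalview slider corresponds loosely to applying a threshold to scores of this kind: only columns above the threshold keep their colour.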
+
<td bgcolor="#fbd8db">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#d9c2e7">T</td>
  
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#ffadad">I</td>
 +
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">K</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#eeeeff">K</td>
  
&nbsp;
+
<td bgcolor="#f9eef3">A</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBPA_PYRIS/363-392&nbsp;&nbsp;</td>
 +
<td>T</td>
 +
<td bgcolor="#f4eef9">W</td>
 +
<td bgcolor="#ffeeee">I</td>
 +
<td bgcolor="#f2eefa">P</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#eeeefe">E</td>
  
===Editing ankyrin domain alignments===
+
<td bgcolor="#fdeeee">V</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f4eef8">T</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#fbd8db">L</td>
  
A '''good''' MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since the alignment reflects the result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. The contiguous features annotated for Mbp1 are expected to be left intact by a good alignment.
+
<td>-</td>
 
+
<td>-</td>
A '''poor''' MSA has many errors in its columns; these contain residues that actually have different functions or structural roles, even though they may look similar according to a (pairwise!) scoring matrix. A poor MSA also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted in a poor alignment and residues that are conserved may be placed into different columns.
+
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#c5c2fb">Q</td>
  
Often errors or inconsistencies are easy to spot, and manually editing an MSA is not generally frowned upon, even though this is not a strictly objective procedure. The main goal of manual editing is to make an alignment biologically more plausible. Most commonly this means minimizing the number of rare evolutionary events that the alignment suggests and/or emphasizing conservation of known functional motifs. Here are some examples of what one might aim for in manually editing an alignment:
+
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#ffadad">I</td>
 +
<td bgcolor="#e5adc6">M</td>
 +
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#e4adc7">A</td>
  
;Reduce number of indels
+
<td bgcolor="#e4adc7">A</td>
From a Probcons alignment:
+
<td bgcolor="#adadff">R</td>
0447_DEBHA    ILKTE-K<span style="color: rgb(255, 0, 0);">-</span>T<span style="color: rgb(255, 0, 0);">---</span>K--SVVK      ILKTE----KTK---SVVK
+
<td bgcolor="#c5c2fb">N</td>
9978_GIBZE    MLGLN<span style="color: rgb(255, 0, 0);">-</span>PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
+
<td bgcolor="#f4eef7">G</td>
1513_CANAL    ILKTE-K<span style="color: rgb(255, 0, 0);">-</span>I<span style="color: rgb(255, 0, 0);">---</span>K--NVVK      ILKTE----KIK---NVVK
+
<td bgcolor="#f9eef3">A</td>
6132_SCHPO    ELDDI-I<span style="color: rgb(255, 0, 0);">-</span>ESGDY--ENVD      ELDDI-IESGDY---ENVD
+
</tr>
1244_ASPFU    ----N<span style="color: rgb(255, 0, 0);">-</span>PGLREIC--HSIT  -&gt;  ----NPGLREIC---HSIT
+
<tr><td nowrap="nowrap">MBP1_/361-390&nbsp;&nbsp;</td>
0925_USTMA    LVKTC<span style="color: rgb(255, 0, 0);">-</span>PALDPHI--TKLK      LVKTCPALDPHI---TKLK
+
<td>-</td>
2599_ASPTE    VLDAN<span style="color: rgb(255, 0, 0);">-</span>PGLREIS--HSIT      VLDANPGLREIS---HSIT
+
<td>-</td>
9773_DEBHA    LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
 
0918_CANAL    LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR
 
  
<small>Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably, however the total number of indels in this excerpt is reduced to 13 from the original 22</small>
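
The indel counts quoted for this excerpt can be checked with a few lines of Python. This is a minimal sketch that assumes one indel corresponds to one contiguous run of gap characters within a row; the function name <code>count_indels</code> is hypothetical, and the rows are the excerpt above with the colour markup removed.

```python
import re


def count_indels(rows):
    """Count contiguous runs of '-' across all rows of an alignment."""
    return sum(len(re.findall(r"-+", row)) for row in rows)


# Excerpt before manual editing (Probcons output).
before = [
    "ILKTE-K-T---K--SVVK",
    "MLGLN-PGLKEIT--HSIT",
    "ILKTE-K-I---K--NVVK",
    "ELDDI-I-ESGDY--ENVD",
    "----N-PGLREIC--HSIT",
    "LVKTC-PALDPHI--TKLK",
    "VLDAN-PGLREIS--HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]

# The same excerpt after the gaps were moved.
after = [
    "ILKTE----KTK---SVVK",
    "MLGLNPGLKEIT---HSIT",
    "ILKTE----KIK---NVVK",
    "ELDDI-IESGDY---ENVD",
    "----NPGLREIC---HSIT",
    "LVKTCPALDPHI---TKLK",
    "VLDANPGLREIS---HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]

if __name__ == "__main__":
    print(count_indels(before), count_indels(after))
```

Under this definition the script reports 22 gap runs before editing and 13 after, matching the numbers given above.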
+
<td>-</td>
 
+
<td bgcolor="#eeeefe">N</td>
 
+
<td bgcolor="#efeefd">H</td>
;Move indels to more plausible position
+
<td bgcolor="#f4eef8">S</td>
From a CLUSTAL alignment:
+
<td bgcolor="#fdeeef">L</td>
4966_CANGL    MKHEKVQ------GGYGRFQ---GTW      MKHEKV<span style="color: rgb(0, 170, 0);">Q</span>------GGYGRFQ---GTW
+
<td>G</td>
1513_CANAL    KIKNVVK------VGSMNLK---GVW      KIKNVV<span style="color: rgb(0, 170, 0);">K</span>------VGSMNLK---GVW
+
<td>V</td>
6132_SCHPO    VDSKHP<span style="color: rgb(255, 0, 0);">-</span>----------<span style="color: rgb(255, 0, 0);">Q</span>ID---GVW  -&gt;  VDSKHP<span style="color: rgb(0, 170, 0);">Q</span>-----------ID---GVW
+
<td>L</td>
1244_ASPFU    EICHSIT------GGALAAQ---GYW      EICHSI<span style="color: rgb(0, 170, 0);">T</span>------GGALAAQ---GYW
+
<td bgcolor="#f4eef8">S</td>
  
<small>The two characters marked in red were swapped. This does not change the number of indels but places the "Q" into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.</small>
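
A manual edit of this kind should only move gaps; it must never add, delete or reorder residues. A quick safeguard is to compare the row before and after editing with the gaps removed. The sketch below is a hypothetical helper, not a Jalview feature, and the example rows are modelled on the 6132_SCHPO row above.

```python
def degap(row, gap="-"):
    """Return the row with all gap characters removed."""
    return row.replace(gap, "")


def edit_is_safe(before, after, gap="-"):
    """True if the edit kept the alignment width and the residue order.

    In other words: the two rows differ only in where their gaps are placed.
    """
    return len(before) == len(after) and degap(before, gap) == degap(after, gap)


if __name__ == "__main__":
    # Rows modelled on the 6132_SCHPO example: the Q moves next to VDSKHP.
    row_before = "VDSKHP" + "-" * 11 + "QID---GVW"
    row_after = "VDSKHPQ" + "-" * 11 + "ID---GVW"
    print(edit_is_safe(row_before, row_after))
```

Running such a check on every edited row before saving catches the most common slip in manual editing, which is accidentally typing over or deleting a residue.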
+
<td bgcolor="#eeeefe">Q</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
;Conserve motifs
+
<td>-</td>
From a CLUSTAL alignment:
+
<td>-</td>
6166_SCHPO      --DKR<span style="color: rgb(255, 0, 0);">V</span>A---<span style="color: rgb(255, 0, 0);">G</span>LWVPP      --DKR<span style="color: rgb(0, 255, 0);">V</span>A--<span style="color: rgb(0, 255, 0);">G</span>-LWVPP
+
<td bgcolor="#f7d8e0">F</td>
XBP1_SACCE      GGYIK<span style="color: rgb(255, 0, 0);">I</span>Q---<span style="color: rgb(255, 0, 0);">G</span>TWLPM      GGYIK<span style="color: rgb(0, 255, 0);">I</span>Q--<span style="color: rgb(0, 255, 0);">G</span>-TWLPM
+
<td bgcolor="#f3d8e4">M</td>
6355_ASPTE      --DE<span style="color: rgb(255, 0, 0);">I</span>A<span style="color: rgb(255, 0, 0);">G</span>---NVWISP  -&gt;  ---DE<span style="color: rgb(0, 255, 0);">I</span>A--<span style="color: rgb(0, 255, 0);">G</span>NVWISP
+
<td>-</td>
5262_KLULA      GGYIK<span style="color: rgb(255, 0, 0);">I</span>Q---<span style="color: rgb(255, 0, 0);">G</span>TWLPY      GGYIK<span style="color: rgb(0, 255, 0);">I</span>Q--<span style="color: rgb(0, 255, 0);">G</span>-TWLPY
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#dad8fd">D</td>
  
<small>The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.</small>
+
<td bgcolor="#f4eef8">T</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#cfaddc">G</td>
 +
<td bgcolor="#dad8fd">D</td>
 +
<td bgcolor="#d9c2e7">T</td>
 +
<td bgcolor="#ebc2d5">A</td>
  
 +
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#ffc2c2">I</td>
 +
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#d8c2e8">S</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#f9eef3">A</td>
  
The Ankyrin domains are quite highly diverged, their boundaries are not well defined, and not even CDD, SMART and SAS agree on the precise annotations. We expect there to be alignment errors in this region. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required <i>indels</i> would be placed between the secondary structure elements, not in their middle. But from the sequence alignment alone, we cannot judge where the secondary structure elements ought to be. You should therefore add the following "sequence" to the alignment; it contains exactly as many characters as the Swi6 sequence above and annotates the secondary structure elements. I have derived it from the 1SW6 structure.
+
</tr>
 +
<tr><td nowrap="nowrap">MBP1_ASPFL/328-364&nbsp;&nbsp;</td>
 +
<td>T</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#f2eefa">P</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#fdeeee">V</td>
  
>SecStruc 1SW6 E: strand  t: turn  H: helix  _: irregular
+
<td>I</td>
_EEE__tt___ttt______EE_____t___HHHHHHHHHHHHHHHH_xxxx_HHHHHHH
+
<td>T</td>
HHHH_t_____t_____t____HHHHHHH__tHHHHHHHHH____t___tt____HHHHH
+
<td>L</td>
HH__HHHH___HHHHHHHHHHHHHEE_t____HHHHHHHHH__t__HHHHHHHHHHHHHH
+
<td bgcolor="#f4eef7">G</td>
HHHHHH__EEE_xxxx_HHHHHt_HHHHHHH______t____HHHHHHHH__HHHHHHHH
+
<td bgcolor="#eeeeff">R</td>
H____t____t____HHHH___
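
Since the annotation string is only useful if it stays in register with the 1SW6 sequence, a small consistency check can be worthwhile after copying and pasting. The following Python sketch is an illustration under stated assumptions: the function name <code>check_annotation</code> is hypothetical, the allowed characters are taken from the header line above (E, t, H, _) plus the <code>x</code> placeholder, and it assumes that <code>x</code> placeholders occupy the same positions in both strings.

```python
def check_annotation(sequence, annotation, allowed="EtH_x"):
    """Return a list of human-readable problems; an empty list means OK.

    Checks that both strings have the same length, that the annotation uses
    only the allowed characters, and that 'x' placeholders coincide.
    """
    problems = []
    if len(sequence) != len(annotation):
        problems.append(
            f"length mismatch: {len(sequence)} residues, "
            f"{len(annotation)} annotation characters"
        )
    for i, (residue, state) in enumerate(zip(sequence, annotation), start=1):
        if state not in allowed:
            problems.append(f"position {i}: unexpected character {state!r}")
        elif (residue == "x") != (state == "x"):
            problems.append(f"position {i}: 'x' placeholders do not coincide")
    return problems


if __name__ == "__main__":
    # Toy strings for illustration; paste the joined 1SW6 and SecStruc
    # lines here to check the real data.
    print(check_annotation("ACxxDE", "H_xxEE"))
```

Joining the five lines of each record into one string first (without line breaks) is necessary for the positions to line up.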
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
<div class="reference-box">[http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetPage.pl?pdbcode=1sw6&template=protein.html&r=wiring&l=1&chain=A '''1SW6_A''' at the PDBSum database of structure annotations] You can compare the diagram there with this text string.</div>
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f7d8e0">F</td>
 +
<td bgcolor="#ffd8d8">I</td>
 +
<td>S</td>
  
 +
<td>E</td>
 +
<td>I</td>
 +
<td>V</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#fdeeef">L</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#c5c2fb">D</td>
 +
<td bgcolor="#eeeefe">Q</td>
  
To proceed:
+
<td bgcolor="#cfaddc">G</td>
#Manually align the Swi6 sequence with yeast Mbp1
+
<td bgcolor="#dad8fd">D</td>
#Bring the Secondary structure annotation into its correct alignment with Swi6
+
<td bgcolor="#d9c2e7">T</td>
#Bring both CDD ankyrin profiles into the correct alignment with yeast Mbp1
+
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#f7adb3">L</td>
 +
<td bgcolor="#b0adfa">N</td>
 +
<td bgcolor="#f9c2c7">L</td>
 +
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#cfaddc">G</td>
  
Proceed along the following steps:
+
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#ebc2d5">A</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBPA_MAGOR/375-404&nbsp;&nbsp;</td>
 +
<td>Q</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#eeeefe">D</td>
  
{{task|1=
+
<td bgcolor="#f2eefa">P</td>
#Add the secondary structure annotation to the sequence alignment in Jalview. Copy the annotation, select File &rarr; Add sequences &rarr; from Textbox and paste the sequence.
+
<td bgcolor="#eeeefe">N</td>
#Select Help &rarr; Documentation and read about '''Editing Alignments''', '''Cursor Mode''' and '''Key strokes'''.
+
<td bgcolor="#fbeef1">F</td>
#Click on the yeast Mbp1 sequence '''row''' to select the entire row. Then use the cursor key to move that sequence down, so it is directly above the 1SW6 sequence. Select the row of 1SW6 and use shift/mouse to move the sequence elements and edit the alignment to match yeast Mbp1. Refer to the alignment given in the [[Reference annotation yeast Mbp1|Mbp1 annotation page]] for the correct alignment.
+
<td bgcolor="#fdeeee">V</td>
#Align the secondary structure elements with the 1SW6 sequence: Every character of 1SW6 should be matched with either E, t, H, or _. The result should be similar to the [[Reference annotation yeast Mbp1|Mbp1 annotation page]]. If you need to insert gaps into all sequences in the alignment, simply drag your mouse over all row headers - movement of sequences is constrained to selected regions, the rest is locked into place to prevent inadvertent misalignments. Remember to save your project from time to time: '''File &rarr; save''' so you can reload a previous state if anything goes wrong and can't be fixed with '''Edit &rarr; Undo'''.
+
<td>-</td>
#Finally align the two CD00204 consensus sequences to their correct positions (again, refer to the [[Reference annotation yeast Mbp1|Mbp1 annotation page]]).
+
<td>-</td>
#You can now consider the principles stated above and see if you can improve the alignment, for example by moving indels out of regions of secondary structure if that is possible without changing the character of the aligned columns significantly. Select blocks within which to work to leave the remaining alignment unchanged. So that this does not become tedious, you can restrict your editing to one Ankyrin repeat that is structurally defined in Swi6. You may want to open the 1SW6 structure in VMD to define the boundaries of one such repeat. You can copy and paste sections from Jalview into your assignment for documentation or export sections of the alignment to HTML (see the example below).
 
}}
 
 
 
=== Editing ankyrin domain alignments - Sample===
 
 
 
This sample was created by
 
 
 
# Editing the alignments as described above;
 
# Copying a block of aligned sequence;
 
# Pasting it To New Alignment;
 
# Colouring the residues by Hydrophobicity and setting the colour saturation according to Conservation;
 
# Choosing File &rarr; Export Image &rarr; HTML and pasting the resulting HTML source into this Wikipage.
 
 
 
 
 
<table border="1"><tr><td>
 
<table border="0" cellpadding="0" cellspacing="0">
 
 
 
<tr><td colspan="6"></td>
 
<td colspan="9">10<br>|</td><td></td>
 
<td colspan="9">20<br>|</td><td></td>
 
<td colspan="9">30<br>|</td><td></td>
 
<td colspan="3"></td><td colspan="3">40<br>|</td>
 
 
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_USTMA/341-368&nbsp;&nbsp;</td>
 
 
<td>-</td>
 
<td>-</td>
<td>-</td>
 
<td bgcolor="#f3eef9">Y</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#eeeefe">D</td>
 
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#eeeefe">Q</td>
  
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#eeeefe">D</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,334: Line 1,934:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#ffd8d8">I</td>
+
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#dad8fd">D</td>
<td bgcolor="#fbeef1">F</td>
+
<td bgcolor="#f9eef3">A</td>
 +
 
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#c5c2fb">N</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#eeeefe">D</td>
 
 
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
<td bgcolor="#dad8fd">E</td>
+
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
<td bgcolor="#d3c2ee">P</td>
+
<td bgcolor="#ebc2d5">A</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#fbadaf">V</td>
<td bgcolor="#ccaddf">T</td>
+
 
<td bgcolor="#ecc2d5">M</td>
+
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#f9c2c7">L</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#b0adfa">Q</td>
 +
<td bgcolor="#c2c2ff">R</td>
 +
<td bgcolor="#f4eef7">G</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
</tr>
  
<td bgcolor="#adadff">R</td>
+
<tr><td nowrap="nowrap">MBP1_CHAGL/361-390&nbsp;&nbsp;</td>
<td bgcolor="#ebc2d5">A</td>
+
<td>S</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#f4eef8">S</td>
 
<td bgcolor="#f4eef8">S</td>
</tr>
+
<td bgcolor="#f9eef3">A</td>
<tr><td nowrap="nowrap">MBP1B_SCHCO/470-498&nbsp;&nbsp;</td>
+
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#fdeeef">L</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#eeeefe">E</td>
 
  
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#f3eef9">Y</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#eeeefe">Q</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeeff">K</td>
 
<td bgcolor="#f4eef8">S</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,388: Line 1,990:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
+
<td bgcolor="#fbd8db">L</td>
<td>-</td>
 
<td bgcolor="#f7d8e0">F</td>
 
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#dad8fd">D</td>
 
<td bgcolor="#dad8fd">D</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#f4eef8">S</td>
 
 
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
<td bgcolor="#c5c2fb">E</td>
+
<td bgcolor="#c5c2fb">N</td>
<td bgcolor="#efeefd">H</td>
+
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
<td bgcolor="#dad8fd">D</td>
+
 
 +
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#ebc2d5">A</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#fbadaf">V</td>
 
+
<td bgcolor="#b3adf7">H</td>
<td bgcolor="#b0adfa">N</td>
+
<td bgcolor="#f9c2c7">L</td>
<td bgcolor="#ffc2c2">I</td>
 
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
<td bgcolor="#adadff">R</td>
+
<td bgcolor="#e5adc6">M</td>
<td bgcolor="#fcc2c4">V</td>
+
 
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#c2c2ff">R</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#f9eef3">A</td>
 
</tr>
 
</tr>
 +
<tr><td nowrap="nowrap">MBP1_PODAN/372-401&nbsp;&nbsp;</td>
 +
<td>V</td>
 +
<td bgcolor="#eeeeff">R</td>
 +
<td bgcolor="#eeeefe">Q</td>
 +
<td bgcolor="#f2eefa">P</td>
  
<tr><td nowrap="nowrap">MBP1_ASHGO/465-494&nbsp;&nbsp;</td>
+
<td bgcolor="#eeeefe">E</td>
<td>F</td>
+
<td bgcolor="#eeeefe">E</td>
<td bgcolor="#f4eef8">S</td>
+
<td bgcolor="#fdeeee">V</td>
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#f3eef9">Y</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#ffeeee">I</td>
 
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#eeeefe">Q</td>
<td bgcolor="#f4eef8">T</td>
+
<td bgcolor="#f9eef3">A</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,444: Line 2,044:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 +
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
<td bgcolor="#ffd8d8">I</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#dad8fd">D</td>
 
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 +
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
<td bgcolor="#efc2d0">C</td>
+
<td bgcolor="#c5c2fb">E</td>
<td bgcolor="#eeeeff">K</td>
+
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
 
+
<td bgcolor="#dad8fd">N</td>
<td bgcolor="#e6d8f0">S</td>
 
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
<td bgcolor="#d3c2ee">P</td>
+
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#b3adf7">H</td>
<td bgcolor="#ffc2c2">I</td>
+
 
 +
<td bgcolor="#f9c2c7">L</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
<td bgcolor="#e5adc6">M</td>
+
<td bgcolor="#adadff">R</td>
 +
<td bgcolor="#fcc2c4">V</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
</tr>
 +
 
 +
<tr><td nowrap="nowrap">MBP1_LACTH/458-487&nbsp;&nbsp;</td>
  
<td bgcolor="#c5c2fb">N</td>
+
<td>F</td>
 +
<td bgcolor="#f4eef8">S</td>
 +
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#eeeeff">R</td>
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#f3eef9">Y</td>
</tr>
+
<td bgcolor="#eeeeff">R</td>
<tr><td nowrap="nowrap">MBP1_CLALU/550-586&nbsp;&nbsp;</td>
+
<td bgcolor="#ffeeee">I</td>
<td>G</td>
+
<td>-</td>
<td bgcolor="#eeeefe">N</td>
+
<td>-</td>
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">N</td>
 
  
<td bgcolor="#f4eef7">G</td>
+
<td>-</td>
 +
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#eeeefe">N</td>
 
<td bgcolor="#eeeefe">N</td>
<td bgcolor="#f4eef8">S</td>
 
<td>N</td>
 
<td>D</td>
 
<td>K</td>
 
<td bgcolor="#eeeeff">K</td>
 
<td bgcolor="#eeeefe">E</td>
 
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,494: Line 2,095:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#ffd8d8">I</td>
 
<td bgcolor="#ffd8d8">I</td>
<td>S</td>
+
<td>-</td>
<td>K</td>
+
<td>-</td>
<td>F</td>
+
<td>-</td>
<td>L</td>
+
 
 +
<td>-</td>
 
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#dad8fd">N</td>
<td bgcolor="#efeefd">H</td>
+
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
<td bgcolor="#c5c2fb">N</td>
+
<td bgcolor="#c5c2fb">Q</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#eeeefe">N</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#dad8fd">D</td>
 +
 
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#ebc2d5">A</td>
<td bgcolor="#edadbd">F</td>
+
<td bgcolor="#fbadaf">V</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#b3adf7">H</td>
 
+
<td bgcolor="#f9c2c7">L</td>
<td bgcolor="#ffc2c2">I</td>
 
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
<td bgcolor="#c6ade5">Y</td>
+
<td bgcolor="#b0adfa">Q</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#c5c2fb">N</td>
<td bgcolor="#f9eef3">M</td>
 
<td bgcolor="#f4eef8">S</td>
 
</tr>
 
  
<tr><td nowrap="nowrap">MBPA_COPCI/514-542&nbsp;&nbsp;</td>
 
 
<td>-</td>
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#f4eef7">G</td>
 
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#eeeefe">D</td>
<td bgcolor="#fbeef1">F</td>
+
</tr>
 +
<tr><td nowrap="nowrap">MBP1_FILNE/433-460&nbsp;&nbsp;</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td bgcolor="#f3eef9">Y</td>
 +
<td bgcolor="#f2eefa">P</td>
 +
<td bgcolor="#eeeefe">Q</td>
  
 +
<td bgcolor="#eeeefe">E</td>
 +
<td bgcolor="#fdeeef">L</td>
 +
<td>-</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#f9eef3">A</td>
<td bgcolor="#f4eef8">S</td>
+
<td bgcolor="#eeeefe">D</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
 
 
<td bgcolor="#fdd8da">V</td>
 
<td bgcolor="#fdd8da">V</td>
 +
 +
<td bgcolor="#ffd8d8">I</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#dad8fd">N</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#fbeef1">F</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
 +
 
<td bgcolor="#c5c2fb">E</td>
 
<td bgcolor="#c5c2fb">E</td>
<td bgcolor="#efeefd">H</td>
+
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#dad8fd">E</td>
 
 
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#ebc2d5">A</td>
<td bgcolor="#ffadad">I</td>
+
<td bgcolor="#f7adb3">L</td>
<td bgcolor="#b0adfa">N</td>
+
<td bgcolor="#ccaddf">T</td>
 
<td bgcolor="#ffc2c2">I</td>
 
<td bgcolor="#ffc2c2">I</td>
 +
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#adadff">R</td>
 
<td bgcolor="#adadff">R</td>
<td bgcolor="#fcc2c4">V</td>
+
<td bgcolor="#ebc2d5">A</td>
 
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#eeeefe">N</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_DEBHA/507-550&nbsp;&nbsp;</td>
 
<td>I</td>
 
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#eeeeff">R</td>
<td bgcolor="#eeeefe">D</td>
 
 
<td bgcolor="#f4eef8">S</td>
 
<td bgcolor="#f4eef8">S</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_KLULA/477-506&nbsp;&nbsp;</td>
 +
<td>F</td>
 +
 +
<td bgcolor="#f4eef8">T</td>
 +
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
+
<td bgcolor="#f3eef9">Y</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#ffeeee">I</td>
 
<td bgcolor="#ffeeee">I</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#eeeefe">N</td>
 
<td>K</td>
 
<td>K</td>
 
  
<td>L</td>
+
<td bgcolor="#eeeefe">D</td>
<td>S</td>
+
<td bgcolor="#fdeeee">V</td>
<td>L</td>
+
<td>-</td>
<td>S</td>
+
<td>-</td>
<td>D</td>
+
<td>-</td>
<td>K</td>
+
<td>-</td>
<td>K</td>
+
<td>-</td>
<td>E</td>
+
<td>-</td>
 +
<td>-</td>
 +
 
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 +
<td bgcolor="#ffd8d8">I</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
<td bgcolor="#ffd8d8">I</td>
 
<td>A</td>
 
<td>K</td>
 
<td>F</td>
 
<td>I</td>
 
 
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#dad8fd">N</td>
<td bgcolor="#efeefd">H</td>
+
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
 
+
<td bgcolor="#c5c2fb">N</td>
<td bgcolor="#ffc2c2">I</td>
 
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#dad8fd">N</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#d8c2e8">S</td>
<td bgcolor="#ebc2d5">A</td>
+
 
<td bgcolor="#edadbd">F</td>
+
<td bgcolor="#d3c2ee">P</td>
 +
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#b3adf7">H</td>
<td bgcolor="#ffc2c2">I</td>
+
<td bgcolor="#d5c2ec">Y</td>
 
+
<td bgcolor="#e4adc7">A</td>
<td bgcolor="#fbadaf">V</td>
 
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
<td bgcolor="#c6ade5">Y</td>
+
<td bgcolor="#ccaddf">T</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#c5c2fb">N</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#eeeeff">K</td>
<td bgcolor="#eeeefe">N</td>
+
 
 +
<td bgcolor="#eeeefe">D</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1A_SCHCO/388-415&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">MBP1_SCHST/468-501&nbsp;&nbsp;</td>
<td>-</td>
+
<td>A</td>
 
+
<td bgcolor="#eeeeff">K</td>
<td>-</td>
+
<td bgcolor="#eeeefe">D</td>
<td bgcolor="#f3eef9">Y</td>
 
 
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#f2eefa">P</td>
 +
<td bgcolor="#eeeefe">D</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
 
<td bgcolor="#eeeeff">K</td>
 
<td bgcolor="#eeeeff">K</td>
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#fdeeef">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
+
<td bgcolor="#eeeeff">K</td>
<td bgcolor="#f9eef3">A</td>
 
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fdd8da">V</td>
 
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
<td>-</td>
+
<td bgcolor="#ffd8d8">I</td>
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
 +
<td>A</td>
 +
<td>K</td>
 +
<td>F</td>
 +
<td>I</td>
 
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#dad8fd">N</td>
<td bgcolor="#fbeef1">F</td>
+
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
<td bgcolor="#c5c2fb">E</td>
+
<td bgcolor="#d8c2e8">S</td>
 +
 
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
<td bgcolor="#dad8fd">E</td>
+
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
 
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#ebc2d5">A</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#edadbd">F</td>
<td bgcolor="#ccaddf">T</td>
+
<td bgcolor="#b3adf7">H</td>
<td bgcolor="#ecc2d5">M</td>
+
<td bgcolor="#ffc2c2">I</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#eaadc0">C</td>
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#adadff">R</td>
 
<td bgcolor="#efc2d0">C</td>
 
<td bgcolor="#eeeeff">R</td>
 
  
 +
<td bgcolor="#caade0">S</td>
 +
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#c5c2fb">N</td>
 +
<td bgcolor="#fdeeef">L</td>
 +
<td bgcolor="#eeeefe">N</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_SACCE/496-525&nbsp;&nbsp;</td>
 +
<td>F</td>
 
<td bgcolor="#f4eef8">S</td>
 
<td bgcolor="#f4eef8">S</td>
</tr>
+
 
<tr><td nowrap="nowrap">MBP1_AJECA/374-403&nbsp;&nbsp;</td>
 
<td>T</td>
 
<td bgcolor="#fdeeef">L</td>
 
 
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#f2eefa">P</td>
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#efeefd">H</td>
 
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
+
<td bgcolor="#f3eef9">Y</td>
 +
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#ffeeee">I</td>
 
<td bgcolor="#ffeeee">I</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f4eef8">S</td>
+
<td bgcolor="#eeeefe">E</td>
<td bgcolor="#f9eef3">M</td>
+
 
 +
<td bgcolor="#fdeeef">L</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,715: Line 2,315:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#e6d8f0">S</td>
+
<td bgcolor="#dad8fd">N</td>
<td bgcolor="#f4eef8">S</td>
+
 
 +
<td bgcolor="#f4eef8">T</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
<td bgcolor="#d8c2e8">S</td>
+
<td bgcolor="#c2c2ff">K</td>
 
 
 
<td bgcolor="#eeeefe">N</td>
 
<td bgcolor="#eeeefe">N</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
Line 1,735: Line 2,335:
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#ebc2d5">A</td>
<td bgcolor="#e4adc7">A</td>
+
 
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#f7adb3">L</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#b3adf7">H</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#ffc2c2">I</td>
 
 
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 +
<td bgcolor="#caade0">S</td>
 
<td bgcolor="#adadff">K</td>
 
<td bgcolor="#adadff">K</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#f4eef7">G</td>
<td bgcolor="#faeef2">C</td>
+
<td bgcolor="#eeeefe">D</td>
 +
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_PARBR/380-409&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">CD00204/1-19&nbsp;&nbsp;</td>
<td>I</td>
+
<td>-</td>
<td bgcolor="#fdeeef">L</td>
+
<td>-</td>
 
+
<td>-</td>
<td bgcolor="#f2eefa">P</td>
+
<td>-</td>
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#ffeeee">I</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f4eef8">S</td>
 
  
<td bgcolor="#fdeeef">L</td>
+
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,772: Line 2,368:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#e6d8f0">S</td>
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
<td bgcolor="#f4eef8">S</td>
+
<td>-</td>
<td bgcolor="#eeeefe">Q</td>
+
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
<td bgcolor="#d8c2e8">S</td>
+
<td bgcolor="#c5c2fb">E</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#eeeefe">D</td>
 +
 
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#d8d8ff">R</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#d3c2ee">P</td>
 
 
<td bgcolor="#e4adc7">A</td>
 
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#f7adb3">L</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#f9c2c7">L</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
<td bgcolor="#adadff">K</td>
+
 
 +
<td bgcolor="#caade0">S</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#f4eef7">G</td>
<td bgcolor="#faeef2">C</td>
+
<td bgcolor="#efeefd">H</td>
 
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_NEOFI/363-392&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">CD00204/99-118&nbsp;&nbsp;</td>
<td>T</td>
+
<td>-</td>
<td bgcolor="#faeef2">C</td>
+
<td>-</td>
<td bgcolor="#f4eef8">S</td>
+
<td>-</td>
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#ffeeee">I</td>
 
  
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeefe">D</td>
+
<td>-</td>
<td bgcolor="#fdeeef">L</td>
+
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,826: Line 2,422:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
+
<td>-</td>
<td bgcolor="#fbd8db">L</td>
+
<td>-</td>
 
<td>-</td>
 
<td>-</td>
  
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td bgcolor="#fdd8da">V</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#e6d8f0">S</td>
+
<td>-</td>
<td bgcolor="#faeef2">C</td>
+
<td>-</td>
<td bgcolor="#eeeefe">Q</td>
+
<td>-</td>
 +
<td bgcolor="#dad8fd">N</td>
 +
<td bgcolor="#f9eef3">A</td>
 +
 
 +
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
<td bgcolor="#d8c2e8">S</td>
+
<td bgcolor="#c2c2ff">K</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#eeeefe">D</td>
 
 
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#d8d8ff">R</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#d3c2ee">P</td>
<td bgcolor="#e4adc7">A</td>
 
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#f7adb3">L</td>
<td bgcolor="#fcc2c4">V</td>
+
 
 +
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#f9c2c7">L</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
+
<td bgcolor="#adadff">K</td>
<td bgcolor="#adadff">R</td>
 
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#f4eef7">G</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#efeefd">H</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_ASPNI/365-394&nbsp;&nbsp;</td>
 
<td>T</td>
 
<td bgcolor="#fbeef1">F</td>
 
<td bgcolor="#f4eef8">S</td>
 
  
<td bgcolor="#f2eefa">P</td>
+
<tr><td nowrap="nowrap">1SW6/203-232&nbsp;&nbsp;</td>
<td bgcolor="#eeeefe">E</td>
+
<td>L</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#eeeefe">D</td>
<td bgcolor="#fdeeee">V</td>
+
<td bgcolor="#fdeeef">L</td>
 +
<td bgcolor="#eeeeff">K</td>
 +
<td bgcolor="#f4eef9">W</td>
 +
<td bgcolor="#ffeeee">I</td>
 +
<td bgcolor="#ffeeee">I</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#f9eef3">A</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#eeeefe">N</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,875: Line 2,475:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#f3d8e4">M</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#e6d8f0">S</td>
+
<td bgcolor="#dad8fd">N</td>
<td bgcolor="#faeef2">C</td>
+
<td bgcolor="#f9eef3">A</td>
 
 
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#d8c2e8">S</td>
 
<td bgcolor="#d8c2e8">S</td>
<td bgcolor="#fdeeee">V</td>
+
<td bgcolor="#eeeefe">N</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#cfaddc">G</td>
 +
 
<td bgcolor="#dad8fd">D</td>
 
<td bgcolor="#dad8fd">D</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#d9c2e7">T</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#efc2d0">C</td>
<td bgcolor="#fbadaf">V</td>
 
 
 
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#f7adb3">L</td>
<td bgcolor="#fcc2c4">V</td>
+
<td bgcolor="#b0adfa">N</td>
 +
<td bgcolor="#ffc2c2">I</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#adadff">R</td>
 
<td bgcolor="#adadff">R</td>
<td bgcolor="#c5c2fb">N</td>
+
 
 +
<td bgcolor="#f9c2c7">L</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#f4eef7">G</td>
<td bgcolor="#fdeeee">V</td>
+
<td bgcolor="#eeeefe">N</td>
 
</tr>
 
</tr>
 +
<tr><td nowrap="nowrap">SecStruc/203-232&nbsp;&nbsp;</td>
 +
<td>t</td>
 +
<td bgcolor="#f5eef6">_</td>
 +
<td bgcolor="#efeefd">H</td>
 +
<td bgcolor="#efeefd">H</td>
  
<tr><td nowrap="nowrap">MBP1_UNCRE/377-406&nbsp;&nbsp;</td>
 
<td>M</td>
 
<td bgcolor="#f3eef9">Y</td>
 
<td bgcolor="#f2eefa">P</td>
 
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#efeefd">H</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#efeefd">H</td>
<td bgcolor="#fdeeee">V</td>
 
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#efeefd">H</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#efeefd">H</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,935: Line 2,535:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
+
 
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#ead8ed">_</td>
 +
<td bgcolor="#ead8ed">_</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f2d8e5">A</td>
+
<td bgcolor="#ead8ed">_</td>
<td bgcolor="#f4eef8">S</td>
+
<td bgcolor="#f5eef6">_</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#f5eef6">_</td>
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#d8c2e8">S</td>
 
<td bgcolor="#eeeefe">N</td>
 
<td bgcolor="#cfaddc">G</td>
 
  
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#dec2e3">_</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#d9c2e7">t</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#f5eef6">_</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#d2add8">_</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#ead8ed">_</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#dec2e3">_</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#c7c2f9">H</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#b3adf7">H</td>
<td bgcolor="#adadff">K</td>
+
<td bgcolor="#b3adf7">H</td>
  
<td bgcolor="#c5c2fb">N</td>
+
<td bgcolor="#c7c2f9">H</td>
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#b3adf7">H</td>
<td bgcolor="#faeef2">C</td>
+
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#b3adf7">H</td>
 +
<td bgcolor="#c7c2f9">H</td>
 +
<td bgcolor="#f5eef6">_</td>
 +
<td bgcolor="#f5eef6">_</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_PENCH/439-468&nbsp;&nbsp;</td>
+
</table>
<td>T</td>
+
</td></tr>
<td bgcolor="#faeef2">C</td>
 
<td bgcolor="#f4eef8">S</td>
 
<td bgcolor="#eeeefe">Q</td>
 
  
<td bgcolor="#eeeefe">D</td>
+
</table>
<td bgcolor="#eeeefe">E</td>
+
;Aligned sequences before editing. The algorithm has placed gaps into the Swi6 helix <code>LKWIIAN</code>, and the four-residue gaps before the block of well-aligned sequence on the right are poorly supported.
<td bgcolor="#ffeeee">I</td>
+
 
 +
 
 +
<table border="1"><tr><td>
 +
<table border="0" cellpadding="0" cellspacing="0">
 +
 
 +
<tr><td colspan="6"></td>
 +
<td colspan="9">10<br>|</td><td></td>
 +
<td colspan="9">20<br>|</td><td></td>
 +
 
 +
<td colspan="9">30<br>|</td><td></td>
 +
<td colspan="3"></td><td colspan="3">40<br>|</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_USTMA/341-368&nbsp;&nbsp;</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td>-</td>
+
<td bgcolor="#dfd2f0">Y</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#f9eef3">M</td>
 
<td>-</td>
 
  
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#fbd2d5">L</td>
 +
<td bgcolor="#f0d2e0">A</td>
 +
<td bgcolor="#d4d2fc">D</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 1,989: Line 2,602:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#e6d8f0">S</td>
 
<td bgcolor="#faeef2">C</td>
 
<td bgcolor="#eeeefe">Q</td>
 
  
<td bgcolor="#c5c2fb">D</td>
+
<td>-</td>
<td bgcolor="#c5c2fb">Q</td>
+
<td bgcolor="#ffbfbf">I</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#f5d2db">F</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#fbadaf">V</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#d4d2fc">E</td>
 +
 
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">E</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#c2abe8">P</td>
 +
<td bgcolor="#f699a1">L</td>
 +
<td bgcolor="#bf99d7">T</td>
 +
<td bgcolor="#e5abc5">M</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#dd99b9">A</td>
  
<td bgcolor="#fcc2c4">V</td>
+
<td bgcolor="#9999ff">R</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#d2d2ff">R</td>
<td bgcolor="#adadff">R</td>
+
<td bgcolor="#e2d2ee">S</td>
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#f9eef3">A</td>
 
 
</tr>
 
</tr>
 +
<tr><td nowrap="nowrap">MBP1B_SCHCO/470-498&nbsp;&nbsp;</td>
 +
<td>-</td>
 +
<td bgcolor="#d2d2ff">R</td>
 +
<td bgcolor="#d4d2fc">E</td>
  
<tr><td nowrap="nowrap">MBPA_TRIVE/407-436&nbsp;&nbsp;</td>
+
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#dfd2f0">Y</td>
 +
<td bgcolor="#d2d2ff">K</td>
 +
<td bgcolor="#e2d2ee">S</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
<td>V</td>
 
<td bgcolor="#fbeef1">F</td>
 
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#ffeeee">I</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f4eef8">S</td>
 
<td bgcolor="#fdeeef">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,043: Line 2,659:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td>-</td>
+
<td bgcolor="#f2bfcc">F</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#c2bffc">D</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#fbd2d5">L</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td>-</td>
+
<td bgcolor="#afabfa">D</td>
<td>-</td>
+
<td bgcolor="#afabfa">E</td>
  
<td>-</td>
+
<td bgcolor="#d5d2fb">H</td>
<td bgcolor="#e6d8f0">S</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#f4eef8">S</td>
+
<td bgcolor="#c2bffc">D</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#f699a1">L</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#9d99f9">N</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#ffabab">I</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#dd99b9">A</td>
  
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#9999ff">R</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#fcabae">V</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#d4d2fc">N</td>
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#adadff">K</td>
 
<td bgcolor="#c5c2fb">N</td>
 
 
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#faeef2">C</td>
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_PHANO/400-429&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">MBP1_ASHGO/465-494&nbsp;&nbsp;</td>
<td>T</td>
+
<td>F</td>
<td bgcolor="#f4eef9">W</td>
+
<td bgcolor="#e2d2ee">S</td>
<td bgcolor="#ffeeee">I</td>
 
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#eeeefe">E</td>
 
  
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#ded2f2">P</td>
<td bgcolor="#fdeeee">V</td>
+
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#dfd2f0">Y</td>
 +
<td bgcolor="#d2d2ff">R</td>
 +
<td bgcolor="#ffd2d2">I</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#e2d2ed">T</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f4eef8">T</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td>-</td>
 
<td>-</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,099: Line 2,706:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
 
  
<td bgcolor="#fbd8db">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#ffbfbf">I</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#afabfa">D</td>
  
<td bgcolor="#c5c2fb">Q</td>
+
<td bgcolor="#eaabbf">C</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#d2d2ff">K</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#d6bfe7">S</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#c2abe8">P</td>
<td bgcolor="#ffadad">I</td>
+
<td bgcolor="#f699a1">L</td>
<td bgcolor="#e5adc6">M</td>
+
<td bgcolor="#a199f6">H</td>
<td bgcolor="#ffc2c2">I</td>
+
<td bgcolor="#ffabab">I</td>
  
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#adadff">R</td>
+
<td bgcolor="#df99b8">M</td>
<td bgcolor="#c5c2fb">N</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#d2d2ff">R</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#d4d2fc">D</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBPA_SCLSC/294-313&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">MBP1_CLALU/550-586&nbsp;&nbsp;</td>
<td>-</td>
+
<td>G</td>
  
<td>-</td>
+
<td bgcolor="#d4d2fc">N</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">N</td>
<td>-</td>
+
<td bgcolor="#e4d2ec">G</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">N</td>
<td>-</td>
+
<td bgcolor="#e2d2ee">S</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">N</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">D</td>
<td>-</td>
+
<td>K</td>
  
<td>-</td>
+
<td>K</td>
<td>-</td>
+
<td>E</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,151: Line 2,757:
 
<td>-</td>
 
<td>-</td>
  
<td>-</td>
+
<td>L</td>
<td>-</td>
+
<td>I</td>
<td>-</td>
+
<td>S</td>
<td>-</td>
+
<td>K</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#f2bfcc">F</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td>-</td>
+
<td bgcolor="#c2bffc">N</td>
<td>-</td>
+
<td bgcolor="#d5d2fb">H</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">Q</td>
  
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#d4d2fc">E</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#ffc2c2">I</td>
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#e999ad">F</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#a199f6">H</td>
  
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#ffabab">I</td>
<td bgcolor="#ffadad">I</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#b3adf7">H</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#ffc2c2">I</td>
+
<td bgcolor="#b899df">Y</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#f0d2df">M</td>
<td bgcolor="#adadff">K</td>
+
<td bgcolor="#e2d2ee">S</td>
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#eeeeff">K</td>
 
 
 
<td bgcolor="#f9eef3">A</td>
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBPA_PYRIS/363-392&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">MBPA_COPCI/514-542&nbsp;&nbsp;</td>
<td>T</td>
 
<td bgcolor="#f4eef9">W</td>
 
<td bgcolor="#ffeeee">I</td>
 
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#eeeefe">E</td>
 
  
<td bgcolor="#fdeeee">V</td>
 
 
<td>-</td>
 
<td>-</td>
<td>-</td>
+
<td bgcolor="#d5d2fb">H</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">E</td>
<td bgcolor="#f4eef8">T</td>
+
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#f5d2db">F</td>
 +
<td bgcolor="#d2d2ff">R</td>
 +
<td bgcolor="#e2d2ee">S</td>
 +
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,207: Line 2,806:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#fcbfc1">V</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#c2bffc">D</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#fbd2d5">L</td>
<td bgcolor="#c5c2fb">Q</td>
 
  
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#afabfa">E</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#d5d2fb">H</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#ffadad">I</td>
+
<td bgcolor="#c2bffc">D</td>
<td bgcolor="#e5adc6">M</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#ffc2c2">I</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#ff9999">I</td>
  
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#9d99f9">N</td>
<td bgcolor="#adadff">R</td>
+
<td bgcolor="#ffabab">I</td>
<td bgcolor="#c5c2fb">N</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#9999ff">R</td>
 +
<td bgcolor="#fcabae">V</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#d4d2fc">N</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_/361-390&nbsp;&nbsp;</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
<tr><td nowrap="nowrap">MBP1_DEBHA/507-550&nbsp;&nbsp;</td>
<td bgcolor="#eeeefe">N</td>
+
<td>I</td>
<td bgcolor="#efeefd">H</td>
+
<td bgcolor="#d2d2ff">R</td>
<td bgcolor="#f4eef8">S</td>
+
<td bgcolor="#d4d2fc">D</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#e2d2ee">S</td>
<td>G</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td>V</td>
+
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#ffd2d2">I</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
 
 +
<td bgcolor="#d4d2fc">N</td>
 +
<td>K</td>
 +
<td>K</td>
 +
<td>L</td>
 +
<td>S</td>
 
<td>L</td>
 
<td>L</td>
<td bgcolor="#f4eef8">S</td>
+
<td>S</td>
 +
<td>D</td>
 +
<td>K</td>
  
<td bgcolor="#eeeefe">Q</td>
+
<td>K</td>
<td>-</td>
+
<td>E</td>
<td>-</td>
+
<td>L</td>
 +
<td>I</td>
 +
<td>A</td>
 +
<td>K</td>
 +
<td bgcolor="#f2bfcc">F</td>
 +
<td bgcolor="#ffbfbf">I</td>
 +
<td bgcolor="#c2bffc">N</td>
 +
 
 +
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#afabfa">D</td>
 +
<td bgcolor="#ffabab">I</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">N</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
 
 +
<td bgcolor="#e999ad">F</td>
 +
<td bgcolor="#a199f6">H</td>
 +
<td bgcolor="#ffabab">I</td>
 +
<td bgcolor="#fb999c">V</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#b899df">Y</td>
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#fbd2d5">L</td>
 +
<td bgcolor="#d4d2fc">N</td>
 +
 
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1A_SCHCO/388-415&nbsp;&nbsp;</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#dfd2f0">Y</td>
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d2d2ff">K</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#fbd2d5">L</td>
 +
 
 +
<td bgcolor="#f0d2e0">A</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,263: Line 2,909:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f7d8e0">F</td>
 
<td bgcolor="#f3d8e4">M</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">D</td>
+
<td>-</td>
 +
<td bgcolor="#fcbfc1">V</td>
 +
<td bgcolor="#f9bfc4">L</td>
  
<td bgcolor="#f4eef8">T</td>
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#f5d2db">F</td>
<td bgcolor="#c5c2fb">N</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#c5c2fb">N</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#afabfa">E</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#d4d2fc">D</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#c2bffc">E</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#cbabdf">T</td>
  
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#b3adf7">H</td>
+
<td bgcolor="#f699a1">L</td>
<td bgcolor="#ffc2c2">I</td>
+
<td bgcolor="#bf99d7">T</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#e5abc5">M</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#adadff">R</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#d8c2e8">S</td>
+
<td bgcolor="#9999ff">R</td>
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#eaabbf">C</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#d2d2ff">R</td>
  
 +
<td bgcolor="#e2d2ee">S</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_ASPFL/328-364&nbsp;&nbsp;</td>
+
 
 +
<tr><td nowrap="nowrap">MBP1_AJECA/374-403&nbsp;&nbsp;</td>
 
<td>T</td>
 
<td>T</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#fbd2d5">L</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#ded2f2">P</td>
<td bgcolor="#f2eefa">P</td>
+
<td bgcolor="#ded2f2">P</td>
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#d5d2fb">H</td>
<td bgcolor="#eeeefe">E</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#fdeeee">V</td>
 
  
<td>I</td>
+
<td bgcolor="#ffd2d2">I</td>
<td>T</td>
+
<td bgcolor="#e2d2ee">S</td>
<td>L</td>
+
<td bgcolor="#f0d2df">M</td>
<td bgcolor="#f4eef7">G</td>
+
<td>-</td>
<td bgcolor="#eeeeff">R</td>
+
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,317: Line 2,964:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f7d8e0">F</td>
+
<td>-</td>
<td bgcolor="#ffd8d8">I</td>
+
<td>-</td>
<td>S</td>
+
<td bgcolor="#f9bfc4">L</td>
  
<td>E</td>
+
<td bgcolor="#f9bfc4">L</td>
<td>I</td>
+
<td bgcolor="#d6bfe7">S</td>
<td>V</td>
+
<td bgcolor="#e2d2ee">S</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#caabe0">S</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#d4d2fc">N</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#c2bffc">D</td>
  
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#f699a1">L</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#b0adfa">N</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#f9c2c7">L</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#9999ff">K</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#afabfa">N</td>
  
<td bgcolor="#adadff">R</td>
+
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#f4d2dc">C</td>
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#f4eef8">S</td>
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBPA_MAGOR/375-404&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">MBP1_PARBR/380-409&nbsp;&nbsp;</td>
<td>Q</td>
+
<td>I</td>
<td bgcolor="#efeefd">H</td>
+
<td bgcolor="#fbd2d5">L</td>
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d5d2fb">H</td>
  
<td bgcolor="#f2eefa">P</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#ffd2d2">I</td>
<td bgcolor="#fbeef1">F</td>
+
<td bgcolor="#e2d2ee">S</td>
<td bgcolor="#fdeeee">V</td>
+
<td bgcolor="#fbd2d5">L</td>
 +
<td>-</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
  
 
<td>-</td>
 
<td>-</td>
Line 2,371: Line 3,018:
 
<td>-</td>
 
<td>-</td>
  
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#d6bfe7">S</td>
 +
<td bgcolor="#e2d2ee">S</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#afabfa">D</td>
 +
<td bgcolor="#caabe0">S</td>
 +
<td bgcolor="#d4d2fc">N</td>
 +
<td bgcolor="#c399d4">G</td>
 +
 +
<td bgcolor="#c2bffc">D</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#f699a1">L</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#9999ff">K</td>
 +
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#f4d2dc">C</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_NEOFI/363-392&nbsp;&nbsp;</td>
 +
<td>T</td>
 +
<td bgcolor="#f4d2dc">C</td>
 +
<td bgcolor="#e2d2ee">S</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 +
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#ffd2d2">I</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#fbd2d5">L</td>
 +
<td>-</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
+
<td>-</td>
<td bgcolor="#fbd8db">L</td>
+
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">D</td>
 
<td bgcolor="#f9eef3">A</td>
 
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#fbadaf">V</td>
 
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#f9c2c7">L</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#b0adfa">Q</td>
 
<td bgcolor="#c2c2ff">R</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#f4eef8">S</td>
 
</tr>
 
 
<tr><td nowrap="nowrap">MBP1_CHAGL/361-390&nbsp;&nbsp;</td>
 
<td>S</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#f4eef8">S</td>
 
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#fdeeef">L</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#eeeefe">Q</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,422: Line 3,069:
  
 
<td>-</td>
 
<td>-</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td>-</td>
+
<td bgcolor="#d6bfe7">S</td>
<td>-</td>
+
<td bgcolor="#f4d2dc">C</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#afabfa">D</td>
 +
<td bgcolor="#caabe0">S</td>
 +
<td bgcolor="#d4d2fc">N</td>
 +
 
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">D</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#f699a1">L</td>
 +
<td bgcolor="#fcabae">V</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
 
 +
<td bgcolor="#9999ff">R</td>
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#f0d2e0">A</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_ASPNI/365-394&nbsp;&nbsp;</td>
 +
<td>T</td>
 +
<td bgcolor="#f5d2db">F</td>
 +
<td bgcolor="#e2d2ee">S</td>
 +
 
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#fcd2d3">V</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#fbd2d5">L</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,433: Line 3,110:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">D</td>
+
<td>-</td>
<td bgcolor="#f4eef8">S</td>
+
<td>-</td>
<td bgcolor="#eeeefe">Q</td>
+
<td>-</td>
<td bgcolor="#c5c2fb">D</td>
+
<td>-</td>
<td bgcolor="#c5c2fb">N</td>
+
<td>-</td>
<td bgcolor="#eeeefe">E</td>
+
<td>-</td>
<td bgcolor="#cfaddc">G</td>
 
 
 
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#fbadaf">V</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#f9c2c7">L</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e5adc6">M</td>
 
 
 
<td bgcolor="#c2c2ff">R</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#f9eef3">A</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_PODAN/372-401&nbsp;&nbsp;</td>
 
<td>V</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#f2eefa">P</td>
 
 
 
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#fdeeee">V</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#f9eef3">A</td>
 
 
<td>-</td>
 
<td>-</td>
  
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td>-</td>
+
<td bgcolor="#d6bfe7">S</td>
<td>-</td>
+
<td bgcolor="#f4d2dc">C</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td>-</td>
+
<td bgcolor="#afabfa">D</td>
<td>-</td>
+
<td bgcolor="#caabe0">S</td>
  
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#fcd2d3">V</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">D</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#fb999c">V</td>
 +
<td bgcolor="#f699a1">L</td>
 +
<td bgcolor="#fcabae">V</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
 
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#9999ff">R</td>
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#fcd2d3">V</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_UNCRE/377-406&nbsp;&nbsp;</td>
 +
<td>M</td>
 +
<td bgcolor="#dfd2f0">Y</td>
 +
 
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#fcd2d3">V</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#fbd2d5">L</td>
 +
<td>-</td>
 +
<td>-</td>
 +
 
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">D</td>
 
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#eeeefe">Q</td>
 
  
<td bgcolor="#c5c2fb">D</td>
+
<td>-</td>
<td bgcolor="#c5c2fb">E</td>
+
<td>-</td>
<td bgcolor="#eeeefe">E</td>
+
<td>-</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#eabfd3">A</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#e2d2ee">S</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#b3adf7">H</td>
+
<td bgcolor="#afabfa">D</td>
 +
 
 +
<td bgcolor="#caabe0">S</td>
 +
<td bgcolor="#d4d2fc">N</td>
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">D</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#f699a1">L</td>
 +
<td bgcolor="#cbabdf">T</td>
  
<td bgcolor="#f9c2c7">L</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#9999ff">K</td>
<td bgcolor="#adadff">R</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#fcc2c4">V</td>
+
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#f4d2dc">C</td>
<td bgcolor="#f9eef3">A</td>
 
 
</tr>
 
</tr>
 +
<tr><td nowrap="nowrap">MBP1_PENCH/439-468&nbsp;&nbsp;</td>
 +
<td>T</td>
  
<tr><td nowrap="nowrap">MBP1_LACTH/458-487&nbsp;&nbsp;</td>
+
<td bgcolor="#f4d2dc">C</td>
 
+
<td bgcolor="#e2d2ee">S</td>
<td>F</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#f4eef8">S</td>
+
<td bgcolor="#d4d2fc">D</td>
<td bgcolor="#f2eefa">P</td>
+
<td bgcolor="#d4d2fc">E</td>
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#ffd2d2">I</td>
<td bgcolor="#f3eef9">Y</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#f0d2df">M</td>
<td bgcolor="#ffeeee">I</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
  
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#eeeefe">N</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,531: Line 3,215:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
<td>-</td>
 
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#ffd8d8">I</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#d6bfe7">S</td>
 +
<td bgcolor="#f4d2dc">C</td>
 +
<td bgcolor="#d4d2fc">Q</td>
  
<td>-</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#afabfa">Q</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#d4d2fc">N</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#c2bffc">D</td>
<td bgcolor="#c5c2fb">Q</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#fb999c">V</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#f699a1">L</td>
  
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#fcabae">V</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#fbadaf">V</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#b3adf7">H</td>
+
<td bgcolor="#9999ff">R</td>
<td bgcolor="#f9c2c7">L</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#b0adfa">Q</td>
+
</tr>
<td bgcolor="#c5c2fb">N</td>
+
<tr><td nowrap="nowrap">MBPA_TRIVE/407-436&nbsp;&nbsp;</td>
  
<td bgcolor="#f4eef7">G</td>
+
<td>V</td>
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#f5d2db">F</td>
</tr>
+
<td bgcolor="#ded2f2">P</td>
<tr><td nowrap="nowrap">MBP1_FILNE/433-460&nbsp;&nbsp;</td>
+
<td bgcolor="#d2d2ff">R</td>
<td>-</td>
+
<td bgcolor="#d5d2fb">H</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">E</td>
<td bgcolor="#f3eef9">Y</td>
+
<td bgcolor="#ffd2d2">I</td>
<td bgcolor="#f2eefa">P</td>
+
<td bgcolor="#e2d2ee">S</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#fbd2d5">L</td>
  
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#fdeeef">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#eeeefe">D</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fdd8da">V</td>
+
<td>-</td>
 
+
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#ffd8d8">I</td>
+
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#d6bfe7">S</td>
 +
<td bgcolor="#e2d2ee">S</td>
 +
 
 +
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#afabfa">D</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#d4d2fc">N</td>
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">D</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
 
 +
<td bgcolor="#f699a1">L</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#9999ff">K</td>
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#f4d2dc">C</td>
 +
</tr>
 +
 
 +
<tr><td nowrap="nowrap">MBP1_PHANO/400-429&nbsp;&nbsp;</td>
 +
<td>T</td>
 +
<td bgcolor="#e2d2ef">W</td>
 +
<td bgcolor="#ffd2d2">I</td>
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#fcd2d3">V</td>
 +
<td bgcolor="#e2d2ed">T</td>
 +
 
 +
<td bgcolor="#d2d2ff">R</td>
 +
<td>-</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#fbeef1">F</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
 
<td bgcolor="#c5c2fb">E</td>
 
<td bgcolor="#eeeefe">E</td>
 
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#dad8fd">E</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#ccaddf">T</td>
 
<td bgcolor="#ffc2c2">I</td>
 
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#adadff">R</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#f4eef8">S</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_KLULA/477-506&nbsp;&nbsp;</td>
 
<td>F</td>
 
 
<td bgcolor="#f4eef8">T</td>
 
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#f3eef9">Y</td>
 
<td bgcolor="#eeeeff">R</td>
 
<td bgcolor="#ffeeee">I</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
  
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#fdeeee">V</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,641: Line 3,324:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#c2bffc">N</td>
  
<td>-</td>
+
<td bgcolor="#f0d2e0">A</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td>-</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#afabfa">Q</td>
<td bgcolor="#ffd8d8">I</td>
+
<td bgcolor="#d4d2fc">N</td>
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">D</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
 
 +
<td bgcolor="#ff9999">I</td>
 +
<td bgcolor="#df99b8">M</td>
 +
<td bgcolor="#ffabab">I</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#9999ff">R</td>
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#f0d2e0">A</td>
 +
 
 +
</tr>
 +
<tr><td nowrap="nowrap">MBPA_SCLSC/294-313&nbsp;&nbsp;</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,652: Line 3,358:
 
<td>-</td>
 
<td>-</td>
  
<td bgcolor="#dad8fd">N</td>
+
<td>-</td>
<td bgcolor="#eeeefe">Q</td>
+
<td>-</td>
<td bgcolor="#eeeefe">Q</td>
+
<td>-</td>
<td bgcolor="#c5c2fb">D</td>
+
<td>-</td>
<td bgcolor="#c5c2fb">N</td>
+
<td>-</td>
<td bgcolor="#eeeefe">D</td>
+
<td>-</td>
<td bgcolor="#cfaddc">G</td>
+
<td>-</td>
<td bgcolor="#dad8fd">N</td>
+
<td>-</td>
<td bgcolor="#d8c2e8">S</td>
+
<td>-</td>
  
<td bgcolor="#d3c2ee">P</td>
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#d5c2ec">Y</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#ccaddf">T</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#eeeeff">K</td>
 
 
<td bgcolor="#eeeefe">D</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_SCHST/468-501&nbsp;&nbsp;</td>
 
<td>A</td>
 
<td bgcolor="#eeeeff">K</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#f2eefa">P</td>
 
<td bgcolor="#eeeefe">D</td>
 
<td bgcolor="#eeeefe">N</td>
 
 
<td bgcolor="#eeeeff">K</td>
 
 
<td>-</td>
 
<td>-</td>
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#eeeeff">K</td>
 
<td bgcolor="#eeeefe">D</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f9bfc4">L</td>
  
<td>-</td>
+
<td bgcolor="#c2bffc">D</td>
<td>-</td>
+
<td bgcolor="#f0d2e0">A</td>
<td>-</td>
+
<td bgcolor="#d2d2ff">R</td>
<td>-</td>
+
<td bgcolor="#afabfa">D</td>
<td>-</td>
+
<td bgcolor="#ffabab">I</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">N</td>
<td>-</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#fbd8db">L</td>
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#ffd8d8">I</td>
+
<td bgcolor="#cbabdf">T</td>
  
<td>A</td>
+
<td bgcolor="#e3abc6">A</td>
<td>K</td>
+
<td bgcolor="#ff9999">I</td>
<td>F</td>
+
<td bgcolor="#a199f6">H</td>
<td>I</td>
+
<td bgcolor="#ffabab">I</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#efeefd">H</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#9999ff">K</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#d8c2e8">S</td>
+
<td bgcolor="#d2d2ff">K</td>
  
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#cfaddc">G</td>
 
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#ebc2d5">A</td>
 
<td bgcolor="#edadbd">F</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#ffc2c2">I</td>
 
<td bgcolor="#eaadc0">C</td>
 
 
 
<td bgcolor="#caade0">S</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#fdeeef">L</td>
 
<td bgcolor="#eeeefe">N</td>
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_SACCE/496-525&nbsp;&nbsp;</td>
 
<td>F</td>
 
<td bgcolor="#f4eef8">S</td>
 
  
<td bgcolor="#f2eefa">P</td>
+
<tr><td nowrap="nowrap">MBPA_PYRIS/363-392&nbsp;&nbsp;</td>
<td bgcolor="#eeeefe">Q</td>
+
<td>T</td>
<td bgcolor="#f3eef9">Y</td>
+
<td bgcolor="#e2d2ef">W</td>
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#ffd2d2">I</td>
<td bgcolor="#ffeeee">I</td>
+
<td bgcolor="#ded2f2">P</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">E</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">E</td>
<td>-</td>
 
<td bgcolor="#eeeefe">E</td>
 
  
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#fcd2d3">V</td>
 +
<td bgcolor="#e2d2ed">T</td>
 +
<td bgcolor="#d2d2ff">R</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,749: Line 3,418:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#fbd8db">L</td>
 
<td bgcolor="#fbd8db">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#f9bfc4">L</td>
  
<td bgcolor="#f4eef8">T</td>
+
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#eeeefe">Q</td>
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#c2c2ff">K</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#eeeefe">N</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#afabfa">Q</td>
<td bgcolor="#dad8fd">D</td>
+
<td bgcolor="#d4d2fc">N</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#ebc2d5">A</td>
+
<td bgcolor="#c2bffc">D</td>
  
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#b3adf7">H</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#ffc2c2">I</td>
+
<td bgcolor="#ff9999">I</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#df99b8">M</td>
<td bgcolor="#caade0">S</td>
+
<td bgcolor="#ffabab">I</td>
<td bgcolor="#adadff">K</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#c5c2fb">N</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#9999ff">R</td>
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#afabfa">N</td>
  
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#f0d2e0">A</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">CD00204/1-19&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">MBP1_/361-390&nbsp;&nbsp;</td>
<td>-</td>
+
<td>N</td>
<td>-</td>
+
<td bgcolor="#d5d2fb">H</td>
<td>-</td>
+
<td bgcolor="#e2d2ee">S</td>
<td>-</td>
+
<td bgcolor="#fbd2d5">L</td>
<td>-</td>
+
<td bgcolor="#e4d2ec">G</td>
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
<td bgcolor="#fcd2d3">V</td>
<td>-</td>
+
<td bgcolor="#fbd2d5">L</td>
<td>-</td>
+
<td bgcolor="#e2d2ee">S</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">Q</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,812: Line 3,479:
 
<td>-</td>
 
<td>-</td>
  
<td>-</td>
+
<td bgcolor="#f2bfcc">F</td>
<td>-</td>
+
<td bgcolor="#ebbfd3">M</td>
<td>-</td>
+
<td bgcolor="#c2bffc">D</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#e2d2ed">T</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#c5c2fb">E</td>
+
<td bgcolor="#d4d2fc">E</td>
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#c399d4">G</td>
  
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#c2bffc">D</td>
<td bgcolor="#d8d8ff">R</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#d3c2ee">P</td>
+
<td bgcolor="#f699a1">L</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#a199f6">H</td>
<td bgcolor="#b3adf7">H</td>
+
<td bgcolor="#ffabab">I</td>
<td bgcolor="#f9c2c7">L</td>
+
<td bgcolor="#f699a1">L</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#9999ff">R</td>
  
<td bgcolor="#caade0">S</td>
+
<td bgcolor="#caabe0">S</td>
<td bgcolor="#c5c2fb">N</td>
+
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#f4eef7">G</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#efeefd">H</td>
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">CD00204/99-118&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">MBP1_ASPFL/328-364&nbsp;&nbsp;</td>
<td>-</td>
+
<td>T</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">E</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#ded2f2">P</td>
  
<td>-</td>
+
<td bgcolor="#e4d2ec">G</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">E</td>
<td>-</td>
+
<td bgcolor="#fcd2d3">V</td>
<td>-</td>
+
<td bgcolor="#ffd2d2">I</td>
<td>-</td>
+
<td bgcolor="#e2d2ed">T</td>
<td>-</td>
+
<td>L</td>
<td>-</td>
+
<td>G</td>
<td>-</td>
+
<td>R</td>
<td>-</td>
+
<td>F</td>
  
<td>-</td>
+
<td>I</td>
<td>-</td>
+
<td>S</td>
<td>-</td>
+
<td>E</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,863: Line 3,530:
  
 
<td>-</td>
 
<td>-</td>
<td>-</td>
+
<td bgcolor="#ffbfbf">I</td>
<td bgcolor="#fdd8da">V</td>
+
<td bgcolor="#fcbfc1">V</td>
<td>-</td>
+
<td bgcolor="#c2bffc">N</td>
<td>-</td>
+
<td bgcolor="#fbd2d5">L</td>
<td>-</td>
+
<td bgcolor="#d2d2ff">R</td>
<td>-</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#dad8fd">N</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#f9eef3">A</td>
+
<td bgcolor="#d4d2fc">Q</td>
  
<td bgcolor="#eeeeff">R</td>
+
<td bgcolor="#c399d4">G</td>
<td bgcolor="#c5c2fb">D</td>
+
<td bgcolor="#c2bffc">D</td>
<td bgcolor="#c2c2ff">K</td>
+
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#cfaddc">G</td>
+
<td bgcolor="#f699a1">L</td>
<td bgcolor="#d8d8ff">R</td>
+
<td bgcolor="#9d99f9">N</td>
<td bgcolor="#d9c2e7">T</td>
+
<td bgcolor="#f7abb2">L</td>
<td bgcolor="#d3c2ee">P</td>
+
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#f7adb3">L</td>
+
<td bgcolor="#c399d4">G</td>
  
<td bgcolor="#b3adf7">H</td>
+
<td bgcolor="#9999ff">R</td>
<td bgcolor="#f9c2c7">L</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#d2d2ff">R</td>
<td bgcolor="#e4adc7">A</td>
+
<td bgcolor="#e2d2ee">S</td>
<td bgcolor="#adadff">K</td>
 
<td bgcolor="#c5c2fb">N</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#efeefd">H</td>
 
 
</tr>
 
</tr>
 +
<tr><td nowrap="nowrap">MBPA_MAGOR/375-404&nbsp;&nbsp;</td>
 +
<td>Q</td>
 +
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d4d2fc">D</td>
  
<tr><td nowrap="nowrap">1SW6/203-232&nbsp;&nbsp;</td>
+
<td bgcolor="#ded2f2">P</td>
<td>L</td>
+
<td bgcolor="#d4d2fc">N</td>
<td bgcolor="#eeeefe">D</td>
+
<td bgcolor="#f5d2db">F</td>
<td bgcolor="#fdeeef">L</td>
+
<td bgcolor="#fcd2d3">V</td>
<td bgcolor="#eeeeff">K</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#f4eef9">W</td>
+
<td bgcolor="#d4d2fc">Q</td>
<td bgcolor="#ffeeee">I</td>
+
<td>-</td>
<td bgcolor="#ffeeee">I</td>
+
<td>-</td>
 
<td>-</td>
 
<td>-</td>
  
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#eeeefe">N</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 2,911: Line 3,576:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
 +
<td>-</td>
 +
<td>-</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#c2bffc">D</td>
 +
<td bgcolor="#f0d2e0">A</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#afabfa">D</td>
 +
<td bgcolor="#afabfa">N</td>
 +
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">N</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#fb999c">V</td>
 +
<td bgcolor="#a199f6">H</td>
 +
<td bgcolor="#f7abb2">L</td>
 +
<td bgcolor="#dd99b9">A</td>
  
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#9d99f9">Q</td>
 +
<td bgcolor="#ababff">R</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#e2d2ee">S</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_CHAGL/361-390&nbsp;&nbsp;</td>
 +
<td>S</td>
 +
<td bgcolor="#d2d2ff">R</td>
 +
 +
<td bgcolor="#e2d2ee">S</td>
 +
<td bgcolor="#f0d2e0">A</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#fbd2d5">L</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f3d8e4">M</td>
 
<td bgcolor="#fbd8db">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dad8fd">N</td>
 
<td bgcolor="#f9eef3">A</td>
 
<td bgcolor="#eeeefe">Q</td>
 
<td bgcolor="#c5c2fb">D</td>
 
<td bgcolor="#d8c2e8">S</td>
 
<td bgcolor="#eeeefe">N</td>
 
<td bgcolor="#cfaddc">G</td>
 
 
<td bgcolor="#dad8fd">D</td>
 
<td bgcolor="#d9c2e7">T</td>
 
<td bgcolor="#efc2d0">C</td>
 
<td bgcolor="#f7adb3">L</td>
 
<td bgcolor="#b0adfa">N</td>
 
<td bgcolor="#ffc2c2">I</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#e4adc7">A</td>
 
<td bgcolor="#adadff">R</td>
 
 
<td bgcolor="#f9c2c7">L</td>
 
<td bgcolor="#f4eef7">G</td>
 
<td bgcolor="#eeeefe">N</td>
 
</tr>
 
<tr><td nowrap="nowrap">SecStruc/203-232&nbsp;&nbsp;</td>
 
<td>t</td>
 
<td bgcolor="#f5eef6">_</td>
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#efeefd">H</td>
 
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#efeefd">H</td>
 
 
<td>-</td>
 
<td>-</td>
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#efeefd">H</td>
 
<td bgcolor="#efeefd">H</td>
 
 
<td>-</td>
 
<td>-</td>
  
Line 2,965: Line 3,632:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td>-</td>
+
<td bgcolor="#f9bfc4">L</td>
<td>-</td>
+
<td bgcolor="#c2bffc">D</td>
<td>-</td>
+
<td bgcolor="#e2d2ee">S</td>
<td>-</td>
+
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#afabfa">D</td>
 +
 
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">N</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#fb999c">V</td>
 +
<td bgcolor="#a199f6">H</td>
 +
<td bgcolor="#f7abb2">L</td>
 +
 
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#df99b8">M</td>
 +
<td bgcolor="#ababff">R</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#f0d2e0">A</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_PODAN/372-401&nbsp;&nbsp;</td>
 +
<td>V</td>
 +
 
 +
<td bgcolor="#d2d2ff">R</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#fcd2d3">V</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#f0d2e0">A</td>
 
<td>-</td>
 
<td>-</td>
  
<td bgcolor="#ead8ed">_</td>
 
<td bgcolor="#ead8ed">_</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#ead8ed">_</td>
+
<td>-</td>
<td bgcolor="#f5eef6">_</td>
+
<td>-</td>
<td bgcolor="#f5eef6">_</td>
+
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
<td bgcolor="#dec2e3">_</td>
 
<td bgcolor="#d9c2e7">t</td>
 
<td bgcolor="#f5eef6">_</td>
 
<td bgcolor="#d2add8">_</td>
 
<td bgcolor="#ead8ed">_</td>
 
<td bgcolor="#dec2e3">_</td>
 
<td bgcolor="#c7c2f9">H</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#b3adf7">H</td>
 
 
<td bgcolor="#c7c2f9">H</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#b3adf7">H</td>
 
<td bgcolor="#c7c2f9">H</td>
 
<td bgcolor="#f5eef6">_</td>
 
<td bgcolor="#f5eef6">_</td>
 
</tr>
 
</table>
 
</td></tr>
 
 
</table>
 
;Aligned sequences before editing. The algorithm has placed gaps into the Swi6 helix <code>LKWIIAN</code>, and the four-residue gaps before the block of well-aligned sequence on the right are poorly supported.
 
 
 
<table border="1"><tr><td>
 
<table border="0" cellpadding="0" cellspacing="0">
 
 
<tr><td colspan="6"></td>
 
<td colspan="9">10<br>|</td><td></td>
 
<td colspan="9">20<br>|</td><td></td>
 
 
<td colspan="9">30<br>|</td><td></td>
 
<td colspan="3"></td><td colspan="3">40<br>|</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_USTMA/341-368&nbsp;&nbsp;</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dfd2f0">Y</td>
+
<td>-</td>
<td bgcolor="#e4d2ec">G</td>
+
<td>-</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#f9bfc4">L</td>
 +
<td bgcolor="#c2bffc">D</td>
 +
<td bgcolor="#f0d2e0">A</td>
 +
<td bgcolor="#d4d2fc">Q</td>
  
<td bgcolor="#d4d2fc">D</td>
+
<td bgcolor="#afabfa">D</td>
<td bgcolor="#d4d2fc">Q</td>
+
<td bgcolor="#afabfa">E</td>
<td bgcolor="#fbd2d5">L</td>
+
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">N</td>
 +
<td bgcolor="#cbabdf">T</td>
 +
<td bgcolor="#e3abc6">A</td>
 +
<td bgcolor="#f699a1">L</td>
 +
<td bgcolor="#a199f6">H</td>
 +
 
 +
<td bgcolor="#f7abb2">L</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#9999ff">R</td>
 +
<td bgcolor="#fcabae">V</td>
 +
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#d4d2fc">D</td>
+
</tr>
 +
<tr><td nowrap="nowrap">MBP1_LACTH/458-487&nbsp;&nbsp;</td>
 +
 
 +
<td>F</td>
 +
<td bgcolor="#e2d2ee">S</td>
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d2d2ff">R</td>
 +
<td bgcolor="#dfd2f0">Y</td>
 +
<td bgcolor="#d2d2ff">R</td>
 +
<td bgcolor="#ffd2d2">I</td>
 +
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#d4d2fc">N</td>
 +
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,038: Line 3,728:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 +
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#ffbfbf">I</td>
 
<td bgcolor="#ffbfbf">I</td>
<td bgcolor="#f9bfc4">L</td>
 
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#f5d2db">F</td>
+
<td bgcolor="#f0d2e0">A</td>
 +
 
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
<td bgcolor="#afabfa">D</td>
+
<td bgcolor="#afabfa">Q</td>
<td bgcolor="#d4d2fc">E</td>
+
<td bgcolor="#d4d2fc">N</td>
 
 
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c399d4">G</td>
<td bgcolor="#c2bffc">E</td>
+
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#c2abe8">P</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#f699a1">L</td>
+
<td bgcolor="#fb999c">V</td>
<td bgcolor="#bf99d7">T</td>
+
 
<td bgcolor="#e5abc5">M</td>
+
<td bgcolor="#a199f6">H</td>
 +
<td bgcolor="#f7abb2">L</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 +
<td bgcolor="#9d99f9">Q</td>
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#e4d2ec">G</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
</tr>
  
<td bgcolor="#9999ff">R</td>
+
<tr><td nowrap="nowrap">MBP1_FILNE/433-460&nbsp;&nbsp;</td>
<td bgcolor="#e3abc6">A</td>
+
<td>-</td>
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#e2d2ee">S</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1B_SCHCO/470-498&nbsp;&nbsp;</td>
 
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#d2d2ff">R</td>
+
<td bgcolor="#dfd2f0">Y</td>
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#d4d2fc">E</td>
 +
<td bgcolor="#fbd2d5">L</td>
 +
<td bgcolor="#f0d2e0">A</td>
  
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">D</td>
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#dfd2f0">Y</td>
 
<td bgcolor="#d2d2ff">K</td>
 
<td bgcolor="#e2d2ee">S</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,088: Line 3,778:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#f2bfcc">F</td>
+
<td bgcolor="#fcbfc1">V</td>
<td bgcolor="#f9bfc4">L</td>
+
<td bgcolor="#ffbfbf">I</td>
<td bgcolor="#c2bffc">D</td>
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#fbd2d5">L</td>
+
 
 +
<td bgcolor="#f5d2db">F</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">E</td>
 
<td bgcolor="#afabfa">E</td>
 
+
<td bgcolor="#d4d2fc">E</td>
<td bgcolor="#d5d2fb">H</td>
 
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c399d4">G</td>
<td bgcolor="#c2bffc">D</td>
+
<td bgcolor="#c2bffc">E</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#e3abc6">A</td>
 +
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#f699a1">L</td>
<td bgcolor="#9d99f9">N</td>
+
<td bgcolor="#bf99d7">T</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">R</td>
 
<td bgcolor="#9999ff">R</td>
<td bgcolor="#fcabae">V</td>
+
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#e4d2ec">G</td>
+
<td bgcolor="#d2d2ff">R</td>
<td bgcolor="#d4d2fc">N</td>
+
<td bgcolor="#e2d2ee">S</td>
 +
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_ASHGO/465-494&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">MBP1_KLULA/477-506&nbsp;&nbsp;</td>
 
<td>F</td>
 
<td>F</td>
<td bgcolor="#e2d2ee">S</td>
+
<td bgcolor="#e2d2ed">T</td>
 
 
 
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
Line 3,128: Line 3,818:
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#ffd2d2">I</td>
<td bgcolor="#d4d2fc">E</td>
+
 
<td bgcolor="#e2d2ed">T</td>
+
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#fcd2d3">V</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,138: Line 3,828:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,148: Line 3,838:
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#ffbfbf">I</td>
 
<td bgcolor="#ffbfbf">I</td>
 +
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#f0d2e0">A</td>
+
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
 +
<td bgcolor="#afabfa">N</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#c399d4">G</td>
 +
<td bgcolor="#c2bffc">N</td>
 +
<td bgcolor="#caabe0">S</td>
  
<td bgcolor="#eaabbf">C</td>
 
<td bgcolor="#d2d2ff">K</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#d6bfe7">S</td>
 
<td bgcolor="#cbabdf">T</td>
 
 
<td bgcolor="#c2abe8">P</td>
 
<td bgcolor="#c2abe8">P</td>
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#a199f6">H</td>
<td bgcolor="#ffabab">I</td>
+
<td bgcolor="#c5abe5">Y</td>
 
 
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#df99b8">M</td>
+
<td bgcolor="#bf99d7">T</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#afabfa">N</td>
<td bgcolor="#d2d2ff">R</td>
+
<td bgcolor="#d2d2ff">K</td>
 +
 
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">D</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_CLALU/550-586&nbsp;&nbsp;</td>
 
<td>G</td>
 
  
 +
<tr><td nowrap="nowrap">MBP1_SCHST/468-501&nbsp;&nbsp;</td>
 +
<td>A</td>
 +
<td bgcolor="#d2d2ff">K</td>
 +
<td bgcolor="#d4d2fc">D</td>
 +
<td bgcolor="#ded2f2">P</td>
 +
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#d4d2fc">N</td>
<td bgcolor="#d4d2fc">Q</td>
+
 
<td bgcolor="#d4d2fc">N</td>
+
<td bgcolor="#d2d2ff">K</td>
<td bgcolor="#e4d2ec">G</td>
+
<td bgcolor="#d2d2ff">K</td>
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#d4d2fc">N</td>
 
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">D</td>
<td>K</td>
 
 
<td>K</td>
 
<td>E</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>L</td>
 
<td>L</td>
 
<td>I</td>
 
<td>I</td>
<td>S</td>
+
<td>A</td>
 
<td>K</td>
 
<td>K</td>
 
<td bgcolor="#f2bfcc">F</td>
 
<td bgcolor="#f2bfcc">F</td>
<td bgcolor="#f9bfc4">L</td>
+
 
 +
<td bgcolor="#ffbfbf">I</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
<td bgcolor="#afabfa">N</td>
+
<td bgcolor="#caabe0">S</td>
<td bgcolor="#d4d2fc">E</td>
+
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#c2bffc">N</td>
 +
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#e999ad">F</td>
 
<td bgcolor="#e999ad">F</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#a199f6">H</td>
 
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#ffabab">I</td>
<td bgcolor="#dd99b9">A</td>
+
<td bgcolor="#e699b1">C</td>
<td bgcolor="#dd99b9">A</td>
+
<td bgcolor="#be99d9">S</td>
<td bgcolor="#b899df">Y</td>
+
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#afabfa">N</td>
<td bgcolor="#f0d2df">M</td>
+
 
 +
<td bgcolor="#fbd2d5">L</td>
 +
<td bgcolor="#d4d2fc">N</td>
 +
</tr>
 +
<tr><td nowrap="nowrap">MBP1_SACCE/496-525&nbsp;&nbsp;</td>
 +
<td>F</td>
 
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#e2d2ee">S</td>
</tr>
+
<td bgcolor="#ded2f2">P</td>
<tr><td nowrap="nowrap">MBPA_COPCI/514-542&nbsp;&nbsp;</td>
+
<td bgcolor="#d4d2fc">Q</td>
 +
<td bgcolor="#dfd2f0">Y</td>
  
<td>-</td>
+
<td bgcolor="#d2d2ff">R</td>
<td bgcolor="#d5d2fb">H</td>
+
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#d4d2fc">E</td>
<td bgcolor="#e4d2ec">G</td>
+
<td bgcolor="#fbd2d5">L</td>
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#f5d2db">F</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#e2d2ee">S</td>
 
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,238: Line 3,929:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,248: Line 3,939:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 +
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
<td bgcolor="#fcbfc1">V</td>
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#c2bffc">D</td>
+
<td bgcolor="#e2d2ed">T</td>
<td bgcolor="#fbd2d5">L</td>
 
 
 
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
<td bgcolor="#afabfa">E</td>
+
<td bgcolor="#ababff">K</td>
<td bgcolor="#d5d2fb">H</td>
+
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c399d4">G</td>
 +
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#e3abc6">A</td>
<td bgcolor="#ff9999">I</td>
+
<td bgcolor="#f699a1">L</td>
 
+
<td bgcolor="#a199f6">H</td>
<td bgcolor="#9d99f9">N</td>
 
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#dd99b9">A</td>
+
<td bgcolor="#be99d9">S</td>
<td bgcolor="#9999ff">R</td>
+
<td bgcolor="#9999ff">K</td>
<td bgcolor="#fcabae">V</td>
+
 
 +
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#d4d2fc">N</td>
+
<td bgcolor="#d4d2fc">D</td>
 
</tr>
 
</tr>
 +
<tr><td nowrap="nowrap">CD00204/1-19&nbsp;&nbsp;</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
  
<tr><td nowrap="nowrap">MBP1_DEBHA/507-550&nbsp;&nbsp;</td>
+
<td>-</td>
<td>I</td>
+
<td>-</td>
<td bgcolor="#d2d2ff">R</td>
+
<td>-</td>
<td bgcolor="#d4d2fc">D</td>
+
<td>-</td>
<td bgcolor="#e2d2ee">S</td>
+
<td>-</td>
<td bgcolor="#d4d2fc">Q</td>
+
<td>-</td>
<td bgcolor="#d4d2fc">E</td>
+
<td>-</td>
<td bgcolor="#ffd2d2">I</td>
+
<td>-</td>
<td bgcolor="#d4d2fc">E</td>
+
<td>-</td>
  
<td bgcolor="#d4d2fc">N</td>
+
<td>-</td>
<td>K</td>
+
<td>-</td>
<td>K</td>
+
<td>-</td>
<td>L</td>
+
<td>-</td>
<td>S</td>
+
<td>-</td>
<td>L</td>
+
<td>-</td>
<td>S</td>
+
<td>-</td>
<td>D</td>
+
<td>-</td>
<td>K</td>
+
<td>-</td>
  
<td>K</td>
+
<td>-</td>
<td>E</td>
+
<td>-</td>
<td>L</td>
+
<td>-</td>
<td>I</td>
 
<td>A</td>
 
<td>K</td>
 
<td bgcolor="#f2bfcc">F</td>
 
<td bgcolor="#ffbfbf">I</td>
 
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#c2bffc">N</td>
 
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#d5d2fb">H</td>
+
<td bgcolor="#d2d2ff">R</td>
<td bgcolor="#d4d2fc">Q</td>
 
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
<td bgcolor="#ffabab">I</td>
+
<td bgcolor="#afabfa">E</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">D</td>
 +
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c399d4">G</td>
<td bgcolor="#c2bffc">N</td>
+
<td bgcolor="#bfbfff">R</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#e3abc6">A</td>
+
<td bgcolor="#c2abe8">P</td>
 
+
<td bgcolor="#f699a1">L</td>
<td bgcolor="#e999ad">F</td>
 
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#a199f6">H</td>
<td bgcolor="#ffabab">I</td>
+
<td bgcolor="#f7abb2">L</td>
<td bgcolor="#fb999c">V</td>
+
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#b899df">Y</td>
+
 
 +
<td bgcolor="#be99d9">S</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#afabfa">N</td>
<td bgcolor="#fbd2d5">L</td>
+
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#d4d2fc">N</td>
+
<td bgcolor="#d5d2fb">H</td>
 
 
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1A_SCHCO/388-415&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">CD00204/99-118&nbsp;&nbsp;</td>
 +
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td bgcolor="#dfd2f0">Y</td>
 
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#d2d2ff">K</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#fbd2d5">L</td>
 
  
<td bgcolor="#f0d2e0">A</td>
+
<td>-</td>
<td bgcolor="#d4d2fc">D</td>
+
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,347: Line 4,034:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#fcbfc1">V</td>
 
<td bgcolor="#fcbfc1">V</td>
<td bgcolor="#f9bfc4">L</td>
 
 
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#f5d2db">F</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#d4d2fc">Q</td>
+
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
<td bgcolor="#afabfa">E</td>
+
<td bgcolor="#ababff">K</td>
 +
 
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c399d4">G</td>
<td bgcolor="#c2bffc">E</td>
+
<td bgcolor="#bfbfff">R</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#cbabdf">T</td>
 
+
<td bgcolor="#c2abe8">P</td>
<td bgcolor="#e3abc6">A</td>
 
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#f699a1">L</td>
<td bgcolor="#bf99d7">T</td>
+
<td bgcolor="#a199f6">H</td>
<td bgcolor="#e5abc5">M</td>
+
<td bgcolor="#f7abb2">L</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 +
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#9999ff">R</td>
+
<td bgcolor="#9999ff">K</td>
<td bgcolor="#eaabbf">C</td>
+
<td bgcolor="#afabfa">N</td>
<td bgcolor="#d2d2ff">R</td>
+
<td bgcolor="#e4d2ec">G</td>
 
+
<td bgcolor="#d5d2fb">H</td>
<td bgcolor="#e2d2ee">S</td>
 
 
</tr>
 
</tr>
 +
<tr><td nowrap="nowrap">1SW6/203-232&nbsp;&nbsp;</td>
 +
<td>L</td>
 +
<td bgcolor="#d4d2fc">D</td>
  
<tr><td nowrap="nowrap">MBP1_AJECA/374-403&nbsp;&nbsp;</td>
 
<td>T</td>
 
 
<td bgcolor="#fbd2d5">L</td>
 
<td bgcolor="#fbd2d5">L</td>
<td bgcolor="#ded2f2">P</td>
+
<td bgcolor="#d2d2ff">K</td>
<td bgcolor="#ded2f2">P</td>
+
<td bgcolor="#e2d2ef">W</td>
<td bgcolor="#d5d2fb">H</td>
+
<td bgcolor="#ffd2d2">I</td>
<td bgcolor="#d4d2fc">Q</td>
 
 
 
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#ffd2d2">I</td>
<td bgcolor="#e2d2ee">S</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#f0d2df">M</td>
+
<td bgcolor="#d4d2fc">N</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,399: Line 4,089:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td bgcolor="#ebbfd3">M</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
+
<td bgcolor="#c2bffc">N</td>
<td bgcolor="#f9bfc4">L</td>
+
<td bgcolor="#f0d2e0">A</td>
<td bgcolor="#d6bfe7">S</td>
 
<td bgcolor="#e2d2ee">S</td>
 
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
 +
 
<td bgcolor="#caabe0">S</td>
 
<td bgcolor="#caabe0">S</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#c2bffc">D</td>
 
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#cbabdf">T</td>
<td bgcolor="#e3abc6">A</td>
+
<td bgcolor="#eaabbf">C</td>
<td bgcolor="#dd99b9">A</td>
 
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#f699a1">L</td>
<td bgcolor="#e3abc6">A</td>
+
<td bgcolor="#9d99f9">N</td>
 +
<td bgcolor="#ffabab">I</td>
 +
 
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
<td bgcolor="#9999ff">K</td>
+
<td bgcolor="#9999ff">R</td>
<td bgcolor="#afabfa">N</td>
+
<td bgcolor="#f7abb2">L</td>
 
 
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#e4d2ec">G</td>
<td bgcolor="#f4d2dc">C</td>
+
<td bgcolor="#d4d2fc">N</td>
 
</tr>
 
</tr>
<tr><td nowrap="nowrap">MBP1_PARBR/380-409&nbsp;&nbsp;</td>
+
<tr><td nowrap="nowrap">SecStruc/203-232&nbsp;&nbsp;</td>
<td>I</td>
+
<td>t</td>
<td bgcolor="#fbd2d5">L</td>
+
 
<td bgcolor="#ded2f2">P</td>
+
<td bgcolor="#e6d2e9">_</td>
<td bgcolor="#ded2f2">P</td>
+
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d5d2fb">H</td>
 +
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 +
<td>-</td>
  
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#fbd2d5">L</td>
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
<td>-</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
Line 3,449: Line 4,139:
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 +
<td bgcolor="#dcbfe1">_</td>
 +
<td bgcolor="#dcbfe1">_</td>
 +
<td bgcolor="#dcbfe1">_</td>
 +
<td bgcolor="#e6d2e9">_</td>
 +
<td bgcolor="#e6d2e9">_</td>
  
<td bgcolor="#f9bfc4">L</td>
+
<td bgcolor="#d2abd8">_</td>
<td bgcolor="#f9bfc4">L</td>
+
<td bgcolor="#cbabdf">t</td>
<td bgcolor="#d6bfe7">S</td>
+
<td bgcolor="#e6d2e9">_</td>
<td bgcolor="#e2d2ee">S</td>
+
<td bgcolor="#c799cf">_</td>
<td bgcolor="#d4d2fc">Q</td>
+
<td bgcolor="#dcbfe1">_</td>
<td bgcolor="#afabfa">D</td>
+
<td bgcolor="#d2abd8">_</td>
<td bgcolor="#caabe0">S</td>
+
<td bgcolor="#b2abf7">H</td>
<td bgcolor="#d4d2fc">N</td>
+
<td bgcolor="#a199f6">H</td>
<td bgcolor="#c399d4">G</td>
+
<td bgcolor="#a199f6">H</td>
  
<td bgcolor="#c2bffc">D</td>
+
<td bgcolor="#b2abf7">H</td>
<td bgcolor="#cbabdf">T</td>
+
<td bgcolor="#a199f6">H</td>
<td bgcolor="#e3abc6">A</td>
+
<td bgcolor="#a199f6">H</td>
<td bgcolor="#dd99b9">A</td>
+
<td bgcolor="#a199f6">H</td>
<td bgcolor="#f699a1">L</td>
+
<td bgcolor="#b2abf7">H</td>
<td bgcolor="#e3abc6">A</td>
+
<td bgcolor="#e6d2e9">_</td>
<td bgcolor="#dd99b9">A</td>
+
<td bgcolor="#e6d2e9">_</td>
<td bgcolor="#dd99b9">A</td>
+
</tr>
<td bgcolor="#9999ff">K</td>
+
</table>
 +
</td></tr>
  
<td bgcolor="#afabfa">N</td>
+
</table>
<td bgcolor="#e4d2ec">G</td>
+
;Aligned sequence after editing. A significant cleanup of the frayed region is possible. Now there is only one insertion event, and it is placed into the loop that connects two helices of the 1SW6 structure.
<td bgcolor="#f4d2dc">C</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_NEOFI/363-392&nbsp;&nbsp;</td>
 
<td>T</td>
 
<td bgcolor="#f4d2dc">C</td>
 
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
  
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#fbd2d5">L</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
===Final analysis===
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#d6bfe7">S</td>
 
<td bgcolor="#f4d2dc">C</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#caabe0">S</td>
 
<td bgcolor="#d4d2fc">N</td>
 
  
<td bgcolor="#c399d4">G</td>
+
{{task|1=
<td bgcolor="#c2bffc">D</td>
+
* Compare the distribution of indels in the ankyrin repeat regions of your alignments.
<td bgcolor="#cbabdf">T</td>
+
**'''Review''' whether the indels in this region are concentrated in segments that connect the helices, or if they are more or less evenly distributed along the entire region of similarity.
<td bgcolor="#e3abc6">A</td>
+
**Think about whether the assertion that ''indels should not be placed in elements of secondary structure'' has merit in your alignment.
<td bgcolor="#dd99b9">A</td>
+
**Recognize that an indel in an element of secondary structure could be interpreted in a number of different ways:
<td bgcolor="#f699a1">L</td>
+
*** The alignment is correct, the annotation is correct too: the indel is tolerated in that particular case, for example by extending the length of an &alpha;-helix or &beta;-strand;
<td bgcolor="#fcabae">V</td>
+
*** The alignment algorithm has made an error, the structural annotation is correct: the indel should be moved a few residues;
<td bgcolor="#dd99b9">A</td>
+
*** The alignment is correct, the structural annotation is wrong, this is not a secondary structure element after all;
<td bgcolor="#dd99b9">A</td>
+
*** Both the algorithm and the annotation are probably wrong, but we have no data to improve the situation.
  
<td bgcolor="#9999ff">R</td>
+
(<small>NB: remember that the structural annotations have been made for the yeast protein and might have turned out differently for the other proteins...</small>)
<td bgcolor="#afabfa">N</td>
+
 
<td bgcolor="#e4d2ec">G</td>
+
You should be able to analyse discrepancies between annotation and expectation in a structured and systematic way. In particular, if you notice indels that have been placed into an '''annotated''' region of secondary structure, you should be able to comment on whether the location of the indel has strong support from aligned sequence motifs, or whether the indel could be moved to a different location without much loss in alignment quality.
<td bgcolor="#f0d2e0">A</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_ASPNI/365-394&nbsp;&nbsp;</td>
 
<td>T</td>
 
<td bgcolor="#f5d2db">F</td>
 
<td bgcolor="#e2d2ee">S</td>
 
  
<td bgcolor="#ded2f2">P</td>
+
*Considering the whole alignment and your experience with editing, you should be able to state whether the position of indels relative to structural features of the ankyrin domains in your organism's Mbp1 protein is reliable. That would be the result of this task, in which you combine multiple sequence and structural information.
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#fcd2d3">V</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#fbd2d5">L</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
*You can also critically evaluate database information that you have encountered:
<td>-</td>
+
# Navigate to the [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi?INPUT_TYPE=precalc&SEQUENCE=6320147 '''CDD annotation'''] for yeast Mbp1.
<td>-</td>
+
# You can check the precise alignment boundaries of the ankyrin domains by clicking on the (+) icon to the left of the matching domain definition.
<td>-</td>
+
# Confirm that CDD extends the ankyrin domain annotation beyond the 1SW6 domain boundaries. Given your assessment of conservation in the region beyond the structural annotation: do you think that extending the annotation is also reasonable in YFO's protein? Is there evidence for this in the alignment of the CD00204 consensus with well-aligned blocks of sequence beyond the positions that match Swi6?
<td>-</td>
+
}}
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#d6bfe7">S</td>
 
<td bgcolor="#f4d2dc">C</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#caabe0">S</td>
 
  
<td bgcolor="#fcd2d3">V</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#fb999c">V</td>
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#fcabae">V</td>
 
<td bgcolor="#dd99b9">A</td>
 
  
<td bgcolor="#dd99b9">A</td>
+
==R code: load alignment and compute information scores==
<td bgcolor="#9999ff">R</td>
+
<!-- Add sequence weighting and sampling bias correction ? -->
<td bgcolor="#afabfa">N</td>
+
 
<td bgcolor="#e4d2ec">G</td>
+
As discussed in the lecture, Shannon information is calculated as the difference between expected and observed entropy, where entropy is the negative sum over probabilities times the log of those probabilities:
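
Written out as a sketch in the notation of the '''R''' script below, and assuming base-2 logarithms so that all values are in bits, the entropy ''H'' of a distribution with probabilities ''p<sub>i</sub>'' and the information ''I'' of an alignment column are:

:<math>H = -\sum_{i} p_i \log_2 p_i, \qquad I = H_\mathrm{ref} - H_\mathrm{obs}</math>

Here ''H''<sub>ref</sub> is the entropy of the reference (expected) amino acid distribution and ''H''<sub>obs</sub> is the entropy of the frequencies observed in one column. By convention, terms with ''p<sub>i</sub>'' = 0 contribute zero. For 20 equiprobable amino acids ''H'' would be log<sub>2</sub>(20), about 4.32 bits, which is the largest value the entropy can take.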
<td bgcolor="#fcd2d3">V</td>
+
 
</tr>
+
 
<tr><td nowrap="nowrap">MBP1_UNCRE/377-406&nbsp;&nbsp;</td>
 
<td>M</td>
 
<td bgcolor="#dfd2f0">Y</td>
 
  
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#fcd2d3">V</td>
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#fbd2d5">L</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
Here we compute Shannon information scores for aligned positions of the APSES domain, and plot the values in '''R'''. You can try this with any part of your alignment, but I have used only the aligned residues for the APSES domain for my example. This is a good choice for a first try, since there are (almost) no gaps.
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#eabfd3">A</td>
 
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
  
<td bgcolor="#caabe0">S</td>
+
{{task|1=
<td bgcolor="#d4d2fc">N</td>
+
# Export only the sequences of the aligned APSES domains to a file on your computer, in FASTA format as explained below. You could call this: <code>Mbp1_All_APSES.fa</code>.
<td bgcolor="#c399d4">G</td>
+
##Use your mouse to click and drag to ''select'' the aligned APSES domains in the alignment window.
<td bgcolor="#c2bffc">D</td>
+
##Copy your selection to the clipboard.
<td bgcolor="#cbabdf">T</td>
+
##Use the main menu (not the menu of your alignment window) and select '''File &rarr; Input alignment &rarr; from Textbox'''; paste the selection into the textbox and click '''New Window'''.
<td bgcolor="#e3abc6">A</td>
+
##Use '''File &rarr; save as''' to save the aligned sequences in multi-FASTA format under the filename you want in your '''R''' project directory.
<td bgcolor="#dd99b9">A</td>
+
 
<td bgcolor="#f699a1">L</td>
+
# Explore the R-code below. Be sure that you understand it correctly. Note that this code does not implement any sampling bias correction, so positions with large numbers of gaps will receive artificially high scores (with only a few residues observed, a column looks more conserved than it is).
<td bgcolor="#cbabdf">T</td>
 
  
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">K</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#f4d2dc">C</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_PENCH/439-468&nbsp;&nbsp;</td>
 
<td>T</td>
 
  
<td bgcolor="#f4d2dc">C</td>
+
<source lang="rsplus">
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#f0d2df">M</td>
 
<td>-</td>
 
  
<td>-</td>
+
# CalculateInformation.R
<td>-</td>
+
# Calculate Shannon information for positions in a multiple sequence alignment.
<td>-</td>
+
# Requires: an MSA in multi FASTA format
<td>-</td>
+
<td>-</td>
+
# It is good practice to set variables you might want to change
<td>-</td>
+
# in a header block so you don't need to hunt all over the code
<td>-</td>
+
# for strings you need to update.
<td>-</td>
+
#
<td>-</td>
+
setwd("/your/R/working/directory")
 +
mfa      <- "Mbp1_All_APSES.fa"
 +
 +
# ================================================
 +
#    Read sequence alignment fasta file
 +
# ================================================
 +
 +
# read MFA datafile using seqinr function read.alignment()
 +
library(seqinr)
 +
tmp  <- read.alignment(mfa, format="fasta")
 +
MSA  <- as.matrix(tmp)  # convert the list into a characterwise matrix
 +
                        # with appropriate row and column names using
 +
                        # the seqinr function as.matrix.alignment()
 +
                        # You could have a look under the hood of this
 +
                        # function to understand better how to convert a
 +
                        # list into something else ... simply type
 +
                        # "as.matrix.alignment" - without the parentheses
 +
                        # to retrieve the function source code (as for any
 +
                        # function btw).
  
<td>-</td>
+
### Explore contents of and access to the matrix of sequences
<td>-</td>
+
MSA
<td>-</td>
+
MSA[1,]
<td>-</td>
+
MSA[,1]
<td bgcolor="#f9bfc4">L</td>
+
length(MSA[,1])
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#d6bfe7">S</td>
 
<td bgcolor="#f4d2dc">C</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
  
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">Q</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#fb999c">V</td>
 
<td bgcolor="#f699a1">L</td>
 
  
<td bgcolor="#fcabae">V</td>
+
# ================================================
<td bgcolor="#dd99b9">A</td>
+
#   define function to calculate entropy
<td bgcolor="#dd99b9">A</td>
+
# ================================================
<td bgcolor="#9999ff">R</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#f0d2e0">A</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBPA_TRIVE/407-436&nbsp;&nbsp;</td>
 
  
<td>V</td>
+
entropy <- function(v) { # calculate shannon entropy for the aa vector v
<td bgcolor="#f5d2db">F</td>
+
                    # Note: we are not correcting for small sample sizes
<td bgcolor="#ded2f2">P</td>
+
                    # here. Thus if there are a large number of gaps in
<td bgcolor="#d2d2ff">R</td>
+
                    # the alignment, this will look like small entropy
<td bgcolor="#d5d2fb">H</td>
+
                    # since only a few amino acids are present. In the
<td bgcolor="#d4d2fc">E</td>
+
                    # extreme case: if a position is only present in
<td bgcolor="#ffd2d2">I</td>
+
                    # one sequence, that one amino acid will be treated
<td bgcolor="#e2d2ee">S</td>
+
                    # as 100% conserved - zero entropy. Sampling error
<td bgcolor="#fbd2d5">L</td>
+
                    # corrections are discussed e.g. in Schneider et al.
 +
                    # (1986) JMB 188:414
 +
l <- length(v)
 +
a <- rep(0, 21)      # initialize a vector with 21 elements (20 aa plus gap)
 +
                    # then set the name of each row to the one letter
 +
                    # code. Through this, we can access a row by its
 +
                    # one letter code.
 +
names(a)  <- unlist(strsplit("acdefghiklmnpqrstvwy-", ""))
 +
 +
for (i in 1:l) {      # for the whole vector of amino acids
 +
c <- v[i]          # retrieve the character
 +
a[c] <- a[c] + 1  # increment its count by one
 +
} # note: we could also have used the table() function for this
 +
 +
tot <- sum(a) - a["-"] # calculate number of observed amino acids
 +
                      # i.e. subtract gaps
 +
a <- a/tot            # frequency is observations of one amino acid
 +
                      # divided by all observations. We assume that
 +
                      # frequency equals probability.
 +
a["-"] <- 0                             
 +
for (i in 1:length(a)) {
 +
if (a[i] != 0) { # if a[i] is not zero, otherwise leave as is.
 +
            # By definition, 0*log(0) = 0  but R calculates
 +
            # this in parts and returns NaN for log(0).
 +
a[i] <- a[i] * (log(a[i])/log(2)) # replace a[i] with
 +
                                  # p(i) log_2(p(i))
 +
}
 +
}
 +
return(-sum(a)) # return Shannon entropy
 +
}
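
# A minimal sketch (not part of the original script) of the table()-based
# alternative mentioned in the note above. entropyTable() is a hypothetical
# helper name. Like entropy(), it ignores gaps and applies no small-sample
# correction; it assumes the column holds only amino acid codes and "-".
entropyTable <- function(v) {
    v <- v[v != "-"]              # drop gap characters
    p <- table(v) / length(v)     # observed frequencies (all greater than 0)
    return(-sum(p * log2(p)))     # Shannon entropy in bits
}
entropyTable(c("a", "a", "c", "c", "-"))  # two equally frequent residues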
  
<td>-</td>
+
# ================================================
<td>-</td>
+
#    calculate entropy for reference distribution
<td>-</td>
+
#    (from UniProt, c.f. Assignment 2)
<td>-</td>
+
# ================================================
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
refData <- c(
<td>-</td>
+
    "A"=8.26,
<td>-</td>
+
    "Q"=3.93,
<td>-</td>
+
    "L"=9.66,
<td>-</td>
+
    "S"=6.56,
<td bgcolor="#f9bfc4">L</td>
+
    "R"=5.53,
<td bgcolor="#f9bfc4">L</td>
+
    "E"=6.75,
<td bgcolor="#d6bfe7">S</td>
+
    "K"=5.84,
<td bgcolor="#e2d2ee">S</td>
+
    "T"=5.34,
 +
    "N"=4.06,
 +
    "G"=7.08,
 +
    "M"=2.42,
 +
    "W"=1.08,
 +
    "D"=5.45,
 +
    "H"=2.27,
 +
    "F"=3.86,
 +
    "Y"=2.92,
 +
    "C"=1.37,
 +
    "I"=5.96,
 +
    "P"=4.70,
 +
    "V"=6.87
 +
    )
  
<td bgcolor="#d4d2fc">Q</td>
+
### Calculate the entropy of this distribution
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
  
<td bgcolor="#f699a1">L</td>
+
H.ref <- 0
<td bgcolor="#cbabdf">T</td>
+
for (i in 1:length(refData)) {
<td bgcolor="#dd99b9">A</td>
+
p <- refData[i]/sum(refData) # convert % to probabilities
<td bgcolor="#dd99b9">A</td>
+
    H.ref <- H.ref - (p * (log(p)/log(2)))
<td bgcolor="#9999ff">K</td>
+
}
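
# A hedged aside, not part of the original script: the same kind of sum can
# be written in vectorized form. As a self-contained illustration we compute
# the entropy of 20 equiprobable amino acids. The names pUniform and H.max
# are made up for this sketch. This value is the upper bound for the entropy
# of any distribution over 20 letters, so H.ref cannot exceed it.
pUniform <- rep(1/20, 20)                     # uniform distribution
H.max    <- -sum(pUniform * log2(pUniform))   # equals log2(20)
H.max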
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#f4d2dc">C</td>
 
</tr>
 
  
<tr><td nowrap="nowrap">MBP1_PHANO/400-429&nbsp;&nbsp;</td>
+
# ================================================
<td>T</td>
+
#    calculate information for each position of
<td bgcolor="#e2d2ef">W</td>
+
#   multiple sequence alignment
<td bgcolor="#ffd2d2">I</td>
+
# ================================================
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#fcd2d3">V</td>
 
<td bgcolor="#e2d2ed">T</td>
 
  
<td bgcolor="#d2d2ff">R</td>
+
lAli <- dim(MSA)[2] # length of row in matrix is second element of dim(<matrix>).
<td>-</td>
+
I <- rep(0, lAli)  # initialize result vector
<td>-</td>
+
for (i in 1:lAli) {
<td>-</td>
+
I[i] = H.ref - entropy(MSA[,i])  # I = H_ref - H_obs
<td>-</td>
+
}
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
### evaluate I
<td>-</td>
+
I
<td>-</td>
+
quantile(I)
<td>-</td>
+
hist(I)
<td>-</td>
+
plot(I)
<td>-</td>
+
 
<td bgcolor="#f9bfc4">L</td>
+
# you can see that we have quite a large number of columns with the same,
<td bgcolor="#f9bfc4">L</td>
+
# high value ... what are these?
<td bgcolor="#c2bffc">N</td>
 
  
<td bgcolor="#f0d2e0">A</td>
+
which(I > 4)
<td bgcolor="#d4d2fc">Q</td>
+
MSA[,which(I > 4)]
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">Q</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
  
<td bgcolor="#ff9999">I</td>
+
# And what is in the columns with low values?
<td bgcolor="#df99b8">M</td>
+
MSA[,which(I < 1.5)]
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">R</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#f0d2e0">A</td>
 
  
</tr>
 
<tr><td nowrap="nowrap">MBPA_SCLSC/294-313&nbsp;&nbsp;</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# ===================================================
<td>-</td>
+
#    plot the information
<td>-</td>
+
#    (c.f. Assignment 5, see there for explanations)
<td>-</td>
+
# ===================================================
<td>-</td>
+
 
<td>-</td>
+
IP <- (I-min(I))/(max(I) - min(I) + 0.0001)
<td>-</td>
+
nCol <- 15
<td>-</td>
+
IP <- floor(IP * nCol) + 1
<td>-</td>
+
spect <- colorRampPalette(c("#DD0033", "#00BB66", "#3300DD"), bias=0.6)(nCol)
 +
# Let's set the information scores from single observations to grey. We
 +
# change the highest level of the spectrum to grey.
 +
#spect[nCol] <- "#CCCCCC"
 +
Icol <- vector()
 +
for (i in 1:length(I)) {
 +
Icol[i] <- spect[ IP[i] ]
 +
}
 +
 +
plot(1,1, xlim=c(0, lAli), ylim=c(-0.5, 5) ,
 +
    type="n", bty="n", xlab="position in alignment", ylab="Information (bits)")
  
<td>-</td>
+
# plot as rectangles: height is information and color is coded to information
<td>-</td>
+
for (i in 1:lAli) {
<td>-</td>
+
  rect(i, 0, i+1, I[i], border=NA, col=Icol[i])
<td>-</td>
+
}
<td>-</td>
+
 
<td>-</td>
+
# As you can see, some of the columns reach very high values, but they are not
<td>-</td>
+
# contiguous in sequence. Are they contiguous in structure? We will find out in
<td>-</td>
+
# a later assignment, when we map computed values to structure.
<td bgcolor="#f9bfc4">L</td>
+
 
 +
</source>
 +
}}
 +
 
 +
 
 +
[[Image:InformationPlot.jpg|frame|none|Plot of information vs. sequence position produced by the '''R''' script above, for an alignment of Mbp1 ortholog APSES domains.]]
  
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#cbabdf">T</td>
 
  
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#ff9999">I</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">K</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#d2d2ff">K</td>
 
  
<td bgcolor="#f0d2e0">A</td>
 
</tr>
 
  
<tr><td nowrap="nowrap">MBPA_PYRIS/363-392&nbsp;&nbsp;</td>
 
<td>T</td>
 
<td bgcolor="#e2d2ef">W</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#d4d2fc">E</td>
 
  
<td bgcolor="#fcd2d3">V</td>
+
== Calculating conservation scores ==
<td bgcolor="#e2d2ed">T</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#f9bfc4">L</td>
 
  
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">Q</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">D</td>
 
  
<td bgcolor="#cbabdf">T</td>
+
{{task|1=
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#ff9999">I</td>
 
<td bgcolor="#df99b8">M</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">R</td>
 
<td bgcolor="#afabfa">N</td>
 
  
<td bgcolor="#e4d2ec">G</td>
+
* Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
<td bgcolor="#f0d2e0">A</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_/361-390&nbsp;&nbsp;</td>
 
<td>N</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#fbd2d5">L</td>
 
<td bgcolor="#e4d2ec">G</td>
 
  
<td bgcolor="#fcd2d3">V</td>
+
<source lang="R">
<td bgcolor="#fbd2d5">L</td>
+
# BiostringsExample.R
<td bgcolor="#e2d2ee">S</td>
+
# Short tutorial on sequence alignment with the Biostrings package.
<td bgcolor="#d4d2fc">Q</td>
+
# Boris Steipe for BCH441, 2013 - 2014
<td>-</td>
+
#
<td>-</td>
+
setwd("~/path/to/your/R_files/")
<td>-</td>
+
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# Biostrings is a package within the bioconductor project.
<td>-</td>
+
# Bioconductor packages have their own installation system,
<td>-</td>
+
# they are normally not installed via CRAN.
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td bgcolor="#f2bfcc">F</td>
+
# First, you install the Bioconductor installer package (once) ...
<td bgcolor="#ebbfd3">M</td>
+
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#e2d2ed">T</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#c399d4">G</td>
 
  
<td bgcolor="#c2bffc">D</td>
+
# Then you can install the Biostrings package and all of its dependencies.
<td bgcolor="#cbabdf">T</td>
+
BiocManager::install("Biostrings")
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">R</td>
 
  
<td bgcolor="#caabe0">S</td>
+
# ... and load the library.
<td bgcolor="#e4d2ec">G</td>
+
library(Biostrings)
<td bgcolor="#f0d2e0">A</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_ASPFL/328-364&nbsp;&nbsp;</td>
 
<td>T</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#ded2f2">P</td>
 
  
<td bgcolor="#e4d2ec">G</td>
+
# Some basic (technical) information is available ...
<td bgcolor="#d4d2fc">E</td>
+
library(help=Biostrings)
<td bgcolor="#fcd2d3">V</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#e2d2ed">T</td>
 
<td>L</td>
 
<td>G</td>
 
<td>R</td>
 
<td>F</td>
 
  
<td>I</td>
+
# ... but for more in depth documentation, use the
<td>S</td>
+
# so called "vignettes" that are provided with every R package.
<td>E</td>
+
browseVignettes("Biostrings")
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# In this code, we mostly use functions that are discussed in the
<td bgcolor="#ffbfbf">I</td>
+
# pairwise alignment vignette.
<td bgcolor="#fcbfc1">V</td>
+
# Read in two fasta files - you will need to edit this for YFO
<td bgcolor="#c2bffc">N</td>
+
sacce <- readAAStringSet("mbp1-sacce.fa", format="fasta")
<td bgcolor="#fbd2d5">L</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
  
<td bgcolor="#c399d4">G</td>
+
# "USTMA" is used only as an example here - modify for YFO  :-)
<td bgcolor="#c2bffc">D</td>
+
ustma <- readAAStringSet("mbp1-ustma.fa", format="fasta")
<td bgcolor="#cbabdf">T</td>
+
 
<td bgcolor="#e3abc6">A</td>
+
sacce
<td bgcolor="#f699a1">L</td>
+
names(sacce)
<td bgcolor="#9d99f9">N</td>
+
names(sacce) <- "Mbp1 SACCE"
<td bgcolor="#f7abb2">L</td>
+
names(ustma) <- "Mbp1 USTMA" # Example only ... modify for YFO
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#c399d4">G</td>
 
  
<td bgcolor="#9999ff">R</td>
+
width(sacce)
<td bgcolor="#e3abc6">A</td>
+
as.character(sacce)
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#e2d2ee">S</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBPA_MAGOR/375-404&nbsp;&nbsp;</td>
 
<td>Q</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d4d2fc">D</td>
 
  
<td bgcolor="#ded2f2">P</td>
+
# Biostrings takes a sophisticated approach to sequence alignment ...
<td bgcolor="#d4d2fc">N</td>
+
?pairwiseAlignment
<td bgcolor="#f5d2db">F</td>
 
<td bgcolor="#fcd2d3">V</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# ... but the use in practice is quite simple:
<td>-</td>
+
ali <- pairwiseAlignment(sacce, ustma, substitutionMatrix = "BLOSUM50")
<td>-</td>
+
ali
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
pattern(ali)
<td>-</td>
+
subject(ali)
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">N</td>
 
  
<td bgcolor="#d4d2fc">D</td>
+
writePairwiseAlignments(ali)
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#fb999c">V</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#f7abb2">L</td>
 
<td bgcolor="#dd99b9">A</td>
 
  
<td bgcolor="#dd99b9">A</td>
+
p <- aligned(pattern(ali))
<td bgcolor="#9d99f9">Q</td>
+
names(p) <- "Mbp1 SACCE aligned"
<td bgcolor="#ababff">R</td>
+
s <- aligned(subject(ali))
<td bgcolor="#e4d2ec">G</td>
+
names(s) <- "Mbp1 USTMA aligned"
<td bgcolor="#e2d2ee">S</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_CHAGL/361-390&nbsp;&nbsp;</td>
 
<td>S</td>
 
<td bgcolor="#d2d2ff">R</td>
 
  
<td bgcolor="#e2d2ee">S</td>
+
# don't overwrite your EMBOSS .fal files
<td bgcolor="#f0d2e0">A</td>
+
writeXStringSet(p, "mbp1-sacce.R.fal", append=FALSE, format="fasta")
<td bgcolor="#d4d2fc">D</td>
+
writeXStringSet(s, "mbp1-ustma.R.fal", append=FALSE, format="fasta")
<td bgcolor="#d4d2fc">E</td>
+
 
<td bgcolor="#fbd2d5">L</td>
+
# Done.
<td bgcolor="#d4d2fc">Q</td>
+
 
<td bgcolor="#d4d2fc">Q</td>
+
</source>
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
* Compare the alignment you received from the EMBOSS server with the one you computed using '''R'''. Are they approximately the same? Exactly the same? You used different matrices and gap parameters, so minor differences are to be expected, but by and large you should get the same alignments.
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
}}
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
  
<td bgcolor="#afabfa">N</td>
+
We will now use the aligned sequences to compute a graphical display of alignment quality.
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#fb999c">V</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#f7abb2">L</td>
 
  
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#df99b8">M</td>
 
<td bgcolor="#ababff">R</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#f0d2e0">A</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_PODAN/372-401&nbsp;&nbsp;</td>
 
<td>V</td>
 
  
<td bgcolor="#d2d2ff">R</td>
+
{{task|1=
<td bgcolor="#d4d2fc">Q</td>
+
 
<td bgcolor="#ded2f2">P</td>
+
* Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#fcd2d3">V</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td>-</td>
 
  
<td>-</td>
+
<source lang="R">
<td>-</td>
+
# aliScore.R
<td>-</td>
+
# Evaluating an alignment with a sliding window score
<td>-</td>
+
# Boris Steipe, October 2012. Update October 2013
<td>-</td>
+
setwd("~/path/to/your/R_files/")
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# Scoring matrices can be found at the NCBI.
<td>-</td>
+
# ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
  
<td bgcolor="#afabfa">D</td>
+
# It is good practice to set variables you might want to change
<td bgcolor="#afabfa">E</td>
+
# in a header block so you don't need to hunt all over the code
<td bgcolor="#d4d2fc">E</td>
+
# for strings you need to update.
<td bgcolor="#c399d4">G</td>
+
#
<td bgcolor="#c2bffc">N</td>
+
fa1      <- "mbp1-sacce.R.fal"
<td bgcolor="#cbabdf">T</td>
+
fa2      <- "mbp1-ustma.R.fal"
<td bgcolor="#e3abc6">A</td>
+
code1    <- "SACCE"
<td bgcolor="#f699a1">L</td>
+
code2    <- "USTMA"
<td bgcolor="#a199f6">H</td>
+
mdmFile  <- "BLOSUM62.mdm"
 +
window  <- 9  # window-size (should be an odd integer)
  
<td bgcolor="#f7abb2">L</td>
+
# ================================================
<td bgcolor="#dd99b9">A</td>
+
#   Read data files
<td bgcolor="#dd99b9">A</td>
+
# ================================================
<td bgcolor="#9999ff">R</td>
 
<td bgcolor="#fcabae">V</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#f0d2e0">A</td>
 
</tr>
 
<tr><td nowrap="nowrap">MBP1_LACTH/458-487&nbsp;&nbsp;</td>
 
  
<td>F</td>
+
# read fasta datafiles using seqinr function read.fasta()
<td bgcolor="#e2d2ee">S</td>
+
install.packages("seqinr")
<td bgcolor="#ded2f2">P</td>
+
library(seqinr)
<td bgcolor="#d2d2ff">R</td>
+
tmp  <- unlist(read.fasta(fa1, seqtype="AA", as.string=FALSE, seqonly=TRUE))
<td bgcolor="#dfd2f0">Y</td>
+
seq1 <- unlist(strsplit(as.character(tmp), split=""))
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#d4d2fc">N</td>
 
  
<td>-</td>
+
tmp  <- unlist(read.fasta(fa2, seqtype="AA", as.string=FALSE, seqonly=TRUE))
<td>-</td>
+
seq2 <- unlist(strsplit(as.character(tmp), split=""))
<td>-</td>
+
 
<td>-</td>
+
if (length(seq1) != length(seq2)) {
<td>-</td>
+
    stop("Sequences have unequal length!")
<td>-</td>
+
}
<td>-</td>
+
<td>-</td>
+
lSeq <- length(seq1)
<td>-</td>
+
 
 +
# ================================================
 +
#    Read scoring matrix
 +
# ================================================
  
<td>-</td>
+
MDM <- read.table(mdmFile, skip=6)
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#ffbfbf">I</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#f0d2e0">A</td>
 
  
<td bgcolor="#d4d2fc">Q</td>
+
# This is a dataframe. Study how it can be accessed:
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">Q</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#fb999c">V</td>
 
  
<td bgcolor="#a199f6">H</td>
+
MDM
<td bgcolor="#f7abb2">L</td>
+
MDM[1,]
<td bgcolor="#dd99b9">A</td>
+
MDM[,1]
<td bgcolor="#dd99b9">A</td>
+
MDM[5,5]  # Cys-Cys
<td bgcolor="#9d99f9">Q</td>
+
MDM[20,20] # Val-Val
<td bgcolor="#afabfa">N</td>
+
MDM[,"W"]   # the tryptophan column
<td bgcolor="#e4d2ec">G</td>
+
MDM["R","W"]   # Arg-Trp pairscore
<td bgcolor="#d4d2fc">D</td>
+
MDM["W","R"]   # Trp-Arg pairscore: pairscores are symmetric
</tr>
 
  
<tr><td nowrap="nowrap">MBP1_FILNE/433-460&nbsp;&nbsp;</td>
+
colnames(MDM)  # names of columns
<td>-</td>
+
rownames(MDM)  # names of rows
<td>-</td>
+
colnames(MDM)[3]  # third column
<td bgcolor="#dfd2f0">Y</td>
+
rownames(MDM)[12]  # twelfth row
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#fbd2d5">L</td>
 
<td bgcolor="#f0d2e0">A</td>
 
  
<td bgcolor="#d4d2fc">D</td>
+
# change the two "*" names to "-" so we can use them to score
<td>-</td>
+
# indels of the alignment. This is a bit of a hack, since this
<td>-</td>
+
# does not reflect the actual indel penalty (which is, as you
<td>-</td>
+
# remember from your lectures, calculated as a gap opening
<td>-</td>
+
# + gap extension penalty; it can't be calculated in a pairwise
<td>-</td>
+
# manner). EMBOSS defaults for BLOSUM62 are opening -10 and
<td>-</td>
+
# extension -0.5 i.e. a gap of size 3 (-11.5) has approximately
<td>-</td>
+
# the same penalty as a 3-character score of "-" matches (-12)
<td>-</td>
+
# so a pairscore of -4 is not entirely unreasonable.
  
<td>-</td>
+
colnames(MDM)[24]
<td>-</td>
+
rownames(MDM)[24]
<td>-</td>
+
colnames(MDM)[24] <- "-"
<td>-</td>
+
rownames(MDM)[24] <- "-"
<td>-</td>
+
colnames(MDM)[24]
<td>-</td>
+
rownames(MDM)[24]
<td bgcolor="#fcbfc1">V</td>
+
MDM["Q", "-"]
<td bgcolor="#ffbfbf">I</td>
+
MDM["-", "D"]
<td bgcolor="#c2bffc">N</td>
+
# so far so good.
 +
 
 +
# ================================================
 +
#   Tabulate pairscores for alignment
 +
# ================================================
  
<td bgcolor="#f5d2db">F</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">E</td>
 
<td bgcolor="#d4d2fc">E</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">E</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#e3abc6">A</td>
 
  
<td bgcolor="#f699a1">L</td>
+
# It is trivial to create a pairscore vector along the
<td bgcolor="#bf99d7">T</td>
+
# length of the aligned sequences.
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">R</td>
 
<td bgcolor="#e3abc6">A</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#e2d2ee">S</td>
 
  
</tr>
+
PS <- vector()
<tr><td nowrap="nowrap">MBP1_KLULA/477-506&nbsp;&nbsp;</td>
+
for (i in 1:lSeq) {
<td>F</td>
+
  aa1 <- seq1[i]
<td bgcolor="#e2d2ed">T</td>
+
  aa2 <- seq2[i]
<td bgcolor="#ded2f2">P</td>
+
  PS[i] = MDM[aa1, aa2]
<td bgcolor="#d4d2fc">Q</td>
+
}
<td bgcolor="#dfd2f0">Y</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#ffd2d2">I</td>
 
  
<td bgcolor="#d4d2fc">D</td>
+
PS
<td bgcolor="#fcd2d3">V</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#ffbfbf">I</td>
 
  
<td bgcolor="#c2bffc">N</td>
+
# The same vector could be created - albeit perhaps not so
<td bgcolor="#d4d2fc">Q</td>
+
# easy to understand - with the expression ...
<td bgcolor="#d4d2fc">Q</td>
+
MDM[cbind(seq1,seq2)]
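
# A small, self-contained illustration of the indexing idiom above, using a
# toy 2 x 2 score matrix (the names toy, s1 and s2 are made up for this
# sketch). Indexing with a two-column character matrix returns one element
# per row of that matrix, i.e. toy[s1[k], s2[k]] for each k.
toy <- matrix(c(4, -1, -1, 5), nrow = 2,
              dimnames = list(c("A", "R"), c("A", "R")))
s1  <- c("A", "A", "R")
s2  <- c("A", "R", "R")
toy[cbind(s1, s2)]   # scores for the pairs A-A, A-R and R-R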
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#caabe0">S</td>
 
  
<td bgcolor="#c2abe8">P</td>
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#c5abe5">Y</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#bf99d7">T</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#d2d2ff">K</td>
 
  
<td bgcolor="#d4d2fc">D</td>
 
</tr>
 
  
<tr><td nowrap="nowrap">MBP1_SCHST/468-501&nbsp;&nbsp;</td>
+
# ================================================
<td>A</td>
+
#    Calculate moving averages
<td bgcolor="#d2d2ff">K</td>
+
# ================================================
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#d4d2fc">N</td>
 
  
<td bgcolor="#d2d2ff">K</td>
+
# In order to evaluate the alignment, we will calculate a
<td bgcolor="#d2d2ff">K</td>
+
# sliding window average over the pairscores. Somewhat surprisingly
<td bgcolor="#d4d2fc">D</td>
+
# R doesn't (yet) have a native function for moving averages: options
<td>-</td>
+
# that are quoted are:
<td>-</td>
+
#  - rollmean() in the "zoo" package http://rss.acs.unt.edu/Rdoc/library/zoo/html/rollmean.html
<td>-</td>
+
#  - MovingAverages() in "TTR" http://rss.acs.unt.edu/Rdoc/library/TTR/html/MovingAverages.html
<td>-</td>
+
#  - ma() in "forecast"  http://robjhyndman.com/software/forecast/
<td>-</td>
+
# But since this is easy to code, we shall implement it ourselves.
<td>-</td>
 
  
<td>-</td>
+
PSma <- vector()          # will hold the averages
<td>-</td>
+
winS <- floor(window/2)    # span of elements above/below the centre
<td>-</td>
+
winC <- winS+1            # centre of the window
<td>-</td>
 
<td>L</td>
 
<td>I</td>
 
<td>A</td>
 
<td>K</td>
 
<td bgcolor="#f2bfcc">F</td>
 
  
<td bgcolor="#ffbfbf">I</td>
+
# extend the vector PS with zeros (virtual observations) above and below
<td bgcolor="#c2bffc">N</td>
+
PS <- c(rep(0, winS), PS , rep(0, winS))
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#caabe0">S</td>
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">N</td>
 
  
<td bgcolor="#cbabdf">T</td>
+
# initialize the window score for the first position
<td bgcolor="#e3abc6">A</td>
+
winScore <- sum(PS[1:window])
<td bgcolor="#e999ad">F</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#e699b1">C</td>
 
<td bgcolor="#be99d9">S</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#afabfa">N</td>
 
  
<td bgcolor="#fbd2d5">L</td>
+
# write the first score to PSma
<td bgcolor="#d4d2fc">N</td>
+
PSma[1] <- winScore
</tr>
 
<tr><td nowrap="nowrap">MBP1_SACCE/496-525&nbsp;&nbsp;</td>
 
<td>F</td>
 
<td bgcolor="#e2d2ee">S</td>
 
<td bgcolor="#ded2f2">P</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#dfd2f0">Y</td>
 
  
<td bgcolor="#d2d2ff">R</td>
+
# Slide the window along the sequence, and recalculate sum()
<td bgcolor="#ffd2d2">I</td>
+
# Loop from the next position, to the last position that does not exceed the vector...
<td bgcolor="#d4d2fc">E</td>
+
for (i in (winC + 1):(lSeq + winS)) {
<td bgcolor="#fbd2d5">L</td>
+
  # subtract the value that has just dropped out of the window
<td>-</td>
+
  winScore <- winScore - PS[(i-winS-1)]
<td>-</td>
+
  # add the value that has just entered the window
<td>-</td>
+
  winScore <- winScore + PS[(i+winS)] 
<td>-</td>
+
  # put score into PSma
<td>-</td>
+
  PSma[i-winS] <- winScore
 +
}
 +
 
 +
# convert the sums to averages
 +
PSma <- PSma / window
  
<td>-</td>
+
# have a quick look at the score distributions
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td bgcolor="#f9bfc4">L</td>
+
boxplot(PSma)
<td bgcolor="#f9bfc4">L</td>
+
hist(PSma)
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#e2d2ed">T</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#ababff">K</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
  
<td bgcolor="#c2bffc">D</td>
+
# ================================================
<td bgcolor="#cbabdf">T</td>
+
#   Plot the alignment scores
<td bgcolor="#e3abc6">A</td>
+
# ================================================
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#ffabab">I</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#be99d9">S</td>
 
<td bgcolor="#9999ff">K</td>
 
  
<td bgcolor="#afabfa">N</td>
+
# normalize the scores
<td bgcolor="#e4d2ec">G</td>
+
PSma <- (PSma-min(PSma))/(max(PSma) - min(PSma) + 0.0001)
<td bgcolor="#d4d2fc">D</td>
+
# spread the normalized values to a desired range, n
</tr>
+
nCol <- 10
<tr><td nowrap="nowrap">CD00204/1-19&nbsp;&nbsp;</td>
+
PSma <- floor(PSma * nCol) + 1
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# Assign a colorspectrum to a vector (with a bit of colormagic,
<td>-</td>
+
# don't worry about that for now). Dark colors are poor scores,
<td>-</td>
+
# "hot" colors are high scores
<td>-</td>
+
spect <- colorRampPalette(c("black", "red", "yellow", "white"), bias=0.4)(nCol)
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# Color is an often abused aspect of plotting. One can use color to label
<td>-</td>
+
# *quantities* or *qualities*. For the most part, our pairscores measure amino
<td>-</td>
+
# acid similarity. That is a quantity and with the spectrum that we just defined
<td>-</td>
+
# we associate the measured quantities with the color of a glowing piece
<td>-</td>
+
# of metal: we start with black #000000, then first we ramp up the red
<td>-</td>
+
# (i.e. low-energy) part of the visible spectrum to red #FF0000, then we
<td>-</td>
+
# add and ramp up the green spectrum giving us yellow #FFFF00 and finally we
<td>-</td>
+
# add blue, giving us white #FFFFFF. Let's have a look at the spectrum:
<td>-</td>
 
  
<td>-</td>
+
s <- rep(1, nCol)
<td>-</td>
+
barplot(s, col=spect, axes=F, main="Color spectrum")
<td>-</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#afabfa">E</td>
 
<td bgcolor="#d4d2fc">D</td>
 
  
<td bgcolor="#c399d4">G</td>
+
# But one aspect of our data is not quantitatively different: indels.
<td bgcolor="#bfbfff">R</td>
+
# We valued indels with pairscores of -4. But indels are not simply poor alignment,
<td bgcolor="#cbabdf">T</td>
+
# rather they are non-alignment. This means stretches of -4 values are really
<td bgcolor="#c2abe8">P</td>
+
# *qualitatively* different. Let's color them differently by changing the lowest
<td bgcolor="#f699a1">L</td>
+
# level of the spectrum to grey.
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#f7abb2">L</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
  
<td bgcolor="#be99d9">S</td>
+
spect[1] <- "#CCCCCC"
<td bgcolor="#afabfa">N</td>
+
barplot(s, col=spect, axes=F, main="Color spectrum")
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#d5d2fb">H</td>
 
</tr>
 
<tr><td nowrap="nowrap">CD00204/99-118&nbsp;&nbsp;</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# Now we can display our alignment score vector with colored rectangles.
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# Convert the integers in PSma to color values from spect
<td>-</td>
+
PScol <- vector()
<td>-</td>
+
for (i in 1:length(PSma)) {
<td>-</td>
+
PScol[i] <- spect[ PSma[i] ]  # this is how a value from PSma is used as an index of spect
<td>-</td>
+
}
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
  
<td>-</td>
+
# Plot the scores. The code is similar to the last assignment.
<td>-</td>
+
# Create an empty plot window of appropriate size
<td>-</td>
+
plot(1,1, xlim=c(-100, lSeq), ylim=c(0, 2) , type="n", yaxt="n", bty="n", xlab="position in alignment", ylab="")
<td bgcolor="#fcbfc1">V</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#d2d2ff">R</td>
 
<td bgcolor="#afabfa">D</td>
 
<td bgcolor="#ababff">K</td>
 
 
 
<td bgcolor="#d4d2fc">D</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#bfbfff">R</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#c2abe8">P</td>
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#f7abb2">L</td>
 
<td bgcolor="#dd99b9">A</td>
 
 
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">K</td>
 
<td bgcolor="#afabfa">N</td>
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#d5d2fb">H</td>
 
</tr>
 
<tr><td nowrap="nowrap">1SW6/203-232&nbsp;&nbsp;</td>
 
<td>L</td>
 
<td bgcolor="#d4d2fc">D</td>
 
 
 
<td bgcolor="#fbd2d5">L</td>
 
<td bgcolor="#d2d2ff">K</td>
 
<td bgcolor="#e2d2ef">W</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#ffd2d2">I</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td>-</td>
 
<td>-</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#ebbfd3">M</td>
 
<td bgcolor="#f9bfc4">L</td>
 
<td bgcolor="#c2bffc">N</td>
 
<td bgcolor="#f0d2e0">A</td>
 
<td bgcolor="#d4d2fc">Q</td>
 
<td bgcolor="#afabfa">D</td>
 
 
 
<td bgcolor="#caabe0">S</td>
 
<td bgcolor="#d4d2fc">N</td>
 
<td bgcolor="#c399d4">G</td>
 
<td bgcolor="#c2bffc">D</td>
 
<td bgcolor="#cbabdf">T</td>
 
<td bgcolor="#eaabbf">C</td>
 
<td bgcolor="#f699a1">L</td>
 
<td bgcolor="#9d99f9">N</td>
 
<td bgcolor="#ffabab">I</td>
 
 
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#dd99b9">A</td>
 
<td bgcolor="#9999ff">R</td>
 
<td bgcolor="#f7abb2">L</td>
 
<td bgcolor="#e4d2ec">G</td>
 
<td bgcolor="#d4d2fc">N</td>
 
</tr>
 
<tr><td nowrap="nowrap">SecStruc/203-232&nbsp;&nbsp;</td>
 
<td>t</td>
 
 
 
<td bgcolor="#e6d2e9">_</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td bgcolor="#d5d2fb">H</td>
 
<td>-</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
 
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td>-</td>
 
<td bgcolor="#dcbfe1">_</td>
 
<td bgcolor="#dcbfe1">_</td>
 
<td bgcolor="#dcbfe1">_</td>
 
<td bgcolor="#e6d2e9">_</td>
 
<td bgcolor="#e6d2e9">_</td>
 
 
 
<td bgcolor="#d2abd8">_</td>
 
<td bgcolor="#cbabdf">t</td>
 
<td bgcolor="#e6d2e9">_</td>
 
<td bgcolor="#c799cf">_</td>
 
<td bgcolor="#dcbfe1">_</td>
 
<td bgcolor="#d2abd8">_</td>
 
<td bgcolor="#b2abf7">H</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#a199f6">H</td>
 
 
 
<td bgcolor="#b2abf7">H</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#a199f6">H</td>
 
<td bgcolor="#b2abf7">H</td>
 
<td bgcolor="#e6d2e9">_</td>
 
<td bgcolor="#e6d2e9">_</td>
 
</tr>
 
</table>
 
</td></tr>
 
 
 
</table>
 
;Aligned sequence after editing. A significant cleanup of the frayed region is possible. Now there is only one insertion event, and it is placed into the loop that connects two helices of the 1SW6 structure.
 
 
 
 
 
===Final analysis===
 
 
 
 
 
{{task|1=
 
* Compare the distribution of indels in the ankyrin repeat regions of your alignments.
 
**'''Review''' whether the indels in this region are concentrated in segments that connect the helices, or if they are more or less evenly distributed along the entire region of similarity.
 
**Think about whether the assertion that ''indels should not be placed in elements of secondary structure'' has merit in your alignment.
 
**Recognize that an indel in an element of secondary structure could be interpreted in a number of different ways:
 
*** The alignment is correct, the annotation is correct too: the indel is tolerated in that particular case, for example by extending the length of an &alpha;-helix or &beta;-strand;
 
*** The alignment algorithm has made an error, the structural annotation is correct: the indel should be moved a few residues;
 
*** The alignment is correct, the structural annotation is wrong, this is not a secondary structure element after all;
 
*** Both the algorithm and the annotation are probably wrong, but we have no data to improve the situation.  
 
 
 
(<small>NB: remember that the structural annotations have been made for the yeast protein and might have turned out differently for the other proteins...</small>)
 
 
 
You should be able to analyse discrepancies between annotation and expectation in a structured and systematic way. In particular if you notice indels that have been placed into an '''annotated''' region of secondary structure, you should be able to comment on whether the location of the indel has strong support from aligned sequence motifs, or whether the indel could possibly be moved into a different location without much loss in alignment quality.
 
 
 
*Considering the whole alignment and your experience with editing, you should be able to state whether the position of indels relative to structural features of the ankyrin domains in your organism's Mbp1 protein is reliable. That would be the result of this task, in which you combine multiple sequence and structural information.
 
 
 
*You can also critically evaluate database information that you have encountered:
 
# Navigate to the [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi?INPUT_TYPE=precalc&SEQUENCE=6320147 '''CDD annotation'''] for yeast Mbp1.
 
# You can check the precise alignment boundaries of the ankyrin domains by clicking on the (+) icon to the left of the matching domain definition.
 
# Confirm that CDD extends the ankyrin domain annotation beyond the 1SW6 domain boundaries. Given your assessment of conservation in the region beyond the structural annotation:  do you think that extending the annotation is reasonable also in YFO's protein? Is there evidence for this in the alignment of the CD00204 consensus with well aligned blocks of sequence beyond the positions that match Swi6?
 
}}
 
 
 
==R code: load alignment and compute information scores==
 
<!-- Add sequence weighting and sampling bias correction ? -->
 
 
 
As discussed in the lecture, Shannon information is calculated as the difference between expected and observed entropy, where entropy is the negative sum over probabilities times the log of those probabilities:
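Written out, with <math>p_i</math> the frequency of amino acid ''i'' in one alignment column, these definitions can be sketched as follows. This is a reconstruction from the sentence above and from the R code below, which uses logarithms to base 2 and labels the two entropies H_ref and H_obs; treat the notation as an assumption rather than the lecture's exact form.

:<math>H = -\sum_{i} p_i \log_2 p_i</math>

:<math>I = H_{ref} - H_{obs}</math>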
 
 
 
 
 
 
 
 
 
 
 
Here we compute Shannon information scores for aligned positions of the APSES domain, and plot the values in '''R'''. You can try this with any part of your alignment, but I have used only the aligned residues for the APSES domain for my example. This is a good choice for a first try, since there are (almost) no gaps.
 
 
 
{{task|1=
 
# Export only the sequences of the aligned APSES domains to a file on your computer, in FASTA format as explained below. You could call this: <code>Mbp1_All_APSES.fa</code>.
 
##Use your mouse to click and drag to ''select'' the aligned APSES domains in the alignment window.
 
##Copy your selection to the clipboard.
 
##Use the main menu (not the menu of your alignment window) and select '''File &rarr; Input alignment &rarr; from Textbox'''; paste the selection into the textbox and click '''New Window'''.
 
##Use '''File &rarr; save as''' to save the aligned sequences in multi-FASTA format under the filename you want in your '''R''' project directory.
 
 
 
# Explore the R-code below. Be sure that you understand it correctly. Note that this code does not implement any sampling bias correction, so positions with large numbers of gaps will receive artificially high scores (the alignment looks as if the gap character were a conserved character).
 
 
 
 
 
<source lang="rsplus">
 
 
 
# CalculateInformation.R
 
# Calculate Shannon information for positions in a multiple sequence alignment.
 
# Requires: an MSA in multi FASTA format
 
 
# It is good practice to set variables you might want to change
 
# in a header block so you don't need to hunt all over the code
 
# for strings you need to update.
 
#
 
setwd("/your/R/working/directory")
 
mfa      <- "Mbp1_All_APSES.fa"
 
 
# ================================================
 
#    Read sequence alignment fasta file
 
# ================================================
 
 
# read MFA datafile using seqinr function read.alignment()
 
library(seqinr)
 
tmp  <- read.alignment(mfa, format="fasta")
 
MSA  <- as.matrix(tmp)  # convert the list into a characterwise matrix
 
                        # with appropriate row and column names using
 
                        # the seqinr function as.matrix.alignment()
 
                        # You could have a look under the hood of this
 
                        # function to understand better how to convert a
 
                        # list into something else ... simply type
 
                        # "as.matrix.alignment" - without the parentheses
 
                        # to retrieve the function source code (as for any
 
                        # function btw).
 
 
 
### Explore contents of and access to the matrix of sequences
 
MSA
 
MSA[1,]
 
MSA[,1]
 
length(MSA[,1])
 
 
 
 
 
# ================================================
 
#    define function to calculate entropy
 
# ================================================
 
 
 
entropy <- function(v) { # calculate shannon entropy for the aa vector v
 
                    # Note: we are not correcting for small sample sizes
 
                    # here. Thus if there are a large number of gaps in
 
                    # the alignment, this will look like small entropy
 
                    # since only a few amino acids are present. In the
 
                    # extreme case: if a position is only present in
 
                    # one sequence, that one amino acid will be treated
 
                    # as 100% conserved - zero entropy. Sampling error
 
                    # corrections are discussed eg. in Schneider et al.
 
                    # (1986) JMB 188:414
 
l <- length(v)
 
a <- rep(0, 21)      # initialize a vector with 21 elements (20 aa plus gap)
 
                    # then set the name of each row to the one letter
 
                    # code. Through this, we can access a row by its
 
                    # one letter code.
 
names(a)  <- unlist(strsplit("acdefghiklmnpqrstvwy-", ""))
 
 
for (i in 1:l) {      # for the whole vector of amino acids
 
c <- v[i]          # retrieve the character
 
a[c] <- a[c] + 1  # increment its count by one
 
} # note: we could also have used the table() function for this
 
 
tot <- sum(a) - a["-"] # calculate number of observed amino acids
 
                      # i.e. subtract gaps
 
a <- a/tot            # frequency is observations of one amino acid
 
                      # divided by all observations. We assume that
 
                      # frequency equals probability.
 
a["-"] <- 0                             
 
for (i in 1:length(a)) {
 
if (a[i] != 0) { # if a[i] is not zero, otherwise leave as is.
 
            # By definition, 0*log(0) = 0  but R calculates
 
            # this in parts and returns NaN for log(0).
 
a[i] <- a[i] * (log(a[i])/log(2)) # replace a[i] with
 
                                  # p(i) log_2(p(i))
 
}
 
}
 
return(-sum(a)) # return Shannon entropy
 
}
 
 
 
# ================================================
 
#    calculate entropy for reference distribution
 
#    (from UniProt, c.f. Assignment 2)
 
# ================================================
 
 
 
refData <- c(
 
    "A"=8.26,
 
    "Q"=3.93,
 
    "L"=9.66,
 
    "S"=6.56,
 
    "R"=5.53,
 
    "E"=6.75,
 
    "K"=5.84,
 
    "T"=5.34,
 
    "N"=4.06,
 
    "G"=7.08,
 
    "M"=2.42,
 
    "W"=1.08,
 
    "D"=5.45,
 
    "H"=2.27,
 
    "F"=3.86,
 
    "Y"=2.92,
 
    "C"=1.37,
 
    "I"=5.96,
 
    "P"=4.70,
 
    "V"=6.87
 
    )
 
  
### Calculate the entropy of this distribution
+
# Add a label to the left
 +
text (-30, 1, adj=1, labels=c(paste("Mbp1:\n", code1, "\nvs.\n", code2)), cex=0.9 )
  
H.ref <- 0
+
# Loop over the vector and draw boxes  without border, filled with color.
for (i in 1:length(refData)) {
+
for (i in 1:lSeq) {
p <- refData[i]/sum(refData) # convert % to probabilities
+
  rect(i, 0.9, i+1, 1.1, border=NA, col=PScol[i])
    H.ref <- H.ref - (p * (log(p)/log(2)))
 
 
}
 
}
  
# ================================================
+
# Note that the numbers along the X-axis are not sequence numbers, but numbers
#    calculate information for each position of
+
# of the alignment, i.e. sequence number + indel length. That is important to
#    multiple sequence alignment
+
# realize: if you would like to add the annotations from the last assignment
# ================================================
+
# which I will leave as an exercise, you need to map your sequence numbering
 
+
# into alignment numbering. Let me know in case you try that but need some help.
lAli <- dim(MSA)[2] # length of row in matrix is second element of dim(<matrix>).
 
I <- rep(0, lAli)  # initialize result vector
 
for (i in 1:lAli) {
 
I[i] = H.ref - entropy(MSA[,i])  # I = H_ref - H_obs
 
}
 
 
 
### evaluate I
 
I
 
quantile(I)
 
hist(I)
 
plot(I)
 
 
 
# you can see that we have quite a large number of columns with the same,
 
# high value ... what are these?
 
 
 
which(I > 4)
 
MSA[,which(I > 4)]
 
 
 
# And what is in the columns with low values?
 
MSA[,which(I < 1.5)]
 
 
 
 
 
# ===================================================
 
#    plot the information
 
#    (c.f. Assignment 5, see there for explanations)
 
# ===================================================
 
 
 
IP <- (I-min(I))/(max(I) - min(I) + 0.0001)
 
nCol <- 15
 
IP <- floor(IP * nCol) + 1
 
spect <- colorRampPalette(c("#DD0033", "#00BB66", "#3300DD"), bias=0.6)(nCol)
 
# lets set the information scores from single informations to grey. We 
 
# change the highest level of the spectrum to grey.
 
#spect[nCol] <- "#CCCCCC"
 
Icol <- vector()
 
for (i in 1:length(I)) {
 
Icol[i] <- spect[ IP[i] ]
 
}
 
 
plot(1,1, xlim=c(0, lAli), ylim=c(-0.5, 5) ,
 
    type="n", bty="n", xlab="position in alignment", ylab="Information (bits)")
 
 
 
# plot as rectangles: height is information and color is coded to information
 
for (i in 1:lAli) {
 
  rect(i, 0, i+1, I[i], border=NA, col=Icol[i])
 
}
 
 
 
# As you can see, some of the columns reach very high values, but they are not
 
# contiguous in sequence. Are they contiguous in structure? We will find out in
 
# a later assignment, when we map computed values to structure.
 
  
 
</source>
 
</source>
 
}}
 
}}
 
 
[[Image:InformationPlot.jpg|frame|none|Plot of information vs. sequence position produced by the '''R''' script above, for an alignment of Mbp1 ortholog APSES domains.]]
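The comment in <code>entropy()</code> notes that <code>table()</code> could do the counting. A minimal sketch of that variant follows; the function name <code>entropyTable</code> and the toy columns are invented for illustration, and like the original it applies no small-sample correction. It drops gap characters first and then computes the entropy in bits from the observed frequencies only, so the 0*log(0) case never arises.

```r
# Sketch only: a compact alternative to entropy() that uses table().
# "entropyTable" is a hypothetical name, not part of the assignment code.
entropyTable <- function(v) {
  v <- v[v != "-"]                        # ignore gap characters
  p <- as.numeric(table(v)) / length(v)   # observed frequencies (all > 0)
  -sum(p * log2(p))                       # Shannon entropy in bits
}

# Toy alignment columns (invented data)
entropyTable(c("a", "a", "a", "a"))        # a fully conserved column
entropyTable(c("a", "c", "a", "c", "-"))   # two residues, equally frequent
```

A fully conserved column has zero entropy and therefore the highest possible information score, which is why many columns share the same high value in the plot.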
 
 
  
  

Revision as of 17:46, 4 October 2015

Assignment for Week 4
Sequence alignment


Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.


 

Introduction

In this assignment we will perform an optimal global and local sequence alignment, and use R to plot the alignment quality as a colored bar-graph.


Optimal sequence alignments

Online programs for optimal sequence alignment are part of the EMBOSS tools. The programs take FASTA files as input.

Local optimal SEQUENCE alignment "water"

Task:

  1. Retrieve the FASTA file for the YFO Mbp1 protein and for Saccharomyces cerevisiae.
  2. Save the files as text files to your computer, (if you haven't done so already). You could give them an extension of .fa.
  3. Access the EMBOSS Explorer site (if you haven't done so yet, you might want to bookmark it.)
  4. Look for ALIGNMENT LOCAL, click on water, paste your FASTA sequences and run the program with default parameters.
  5. Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
  6. Considering the sequence identity cutoff we discussed in class (25% over the length of a domain), do you believe that the APSES domains are homologous?
  7. Change the Gap opening and Gap extension parameters to high values (e.g. 30 and 5). Then run the alignment again.
  8. Note what is different.
  9. You could try getting only an alignment for the ankyrin domains that you have found in the last assignment, by deleting the approximate region of the APSES domains from your input.
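To check an alignment against the 25% identity cutoff from step 6, the percentage can be computed directly from two aligned sequences. The following R sketch assumes the two sequences are available as gapped strings of equal length; the function name percentIdentity and the toy sequences are invented for illustration. It counts identities over columns in which neither sequence has a gap, which is only one of several common conventions for the denominator.

```r
# Sketch only: percent identity of two aligned (gapped) sequences.
# "percentIdentity" is a hypothetical helper, not an EMBOSS or seqinr function.
percentIdentity <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  stopifnot(length(a) == length(b))     # aligned sequences have equal length
  aligned <- a != "-" & b != "-"        # columns without a gap in either row
  100 * sum(a[aligned] == b[aligned]) / sum(aligned)
}

# Toy example (invented sequences): 5 gap-free columns, 4 of them identical
percentIdentity("MKT-AYI", "MRTWA-I")
```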


Global optimal SEQUENCE alignment "needle"

Task:

  1. Look for ALIGNMENT GLOBAL, click on needle, paste your FASTA sequences and run the program with default parameters.
  2. Study the results. You will find that the alignment extends over the entire protein, likely with long indels at the termini.
  3. Change the Output alignment format to FASTA pairwise simple, to retrieve the aligned FASTA files with indels.
  4. Copy the aligned sequences (with indels) and save them to your computer. You could give them an extension of .fal to remind you that they are aligned FASTA sequences.


 

The Mutation Data Matrix

The NCBI makes its alignment matrices available by ftp. They are located at ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the BLOSUM62 matrix[1]. Access that site and download the BLOSUM62 matrix to your computer. You could give it a filename of BLOSUM62.mdm.

It should look like this.

#  Matrix made by matblas from blosum62.iij
#  * column uses minimum score
#  BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
#  Blocks Database = /data/blocks_5.0/blocks.dat
#  Cluster Percentage: >= 62
#  Entropy =   0.6979, Expected =  -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4 
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4 
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4 
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4 
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4 
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4 
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4 
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4 
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4 
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4 
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4 
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4 
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4 
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4 
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4 
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4 
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4 
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4 
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4 
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4 
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4 
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1


Task:

  • Study this and make sure you understand what this table is, how it can be used, and what a reasonable range of values for identities and pairscores for non-identical, similar and dissimilar residues is. Ask on the mailing list in case you have questions.
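One way to explore the table is to load it into R as a matrix and look up pairscores by residue letter. The sketch below reads a four-residue excerpt of the matrix shown above from a text string so that it runs on its own; the variable names are invented for illustration. For the downloaded file, the same read.table() call with the filename (e.g. "BLOSUM62.mdm", if you followed the suggestion above) in place of the text argument should behave the same way, since lines starting with # are skipped as comments.

```r
# Sketch only: read a scoring matrix and look up pairscores.
# The excerpt below is copied from the BLOSUM62 table above (A, R, N, D only).
mdmText <- "
#  excerpt of BLOSUM62
   A  R  N  D
A  4 -1 -2 -2
R -1  5  0 -2
N -2  0  6  1
D -2 -2  1  6
"

# The header row has one field fewer than the data rows, so read.table()
# uses the first column as row names. check.names = FALSE keeps column
# names such as "*" unchanged when reading the full file.
MDM <- as.matrix(read.table(text = mdmText, header = TRUE, check.names = FALSE))

MDM["A", "A"]               # score for an identity
MDM["N", "D"]               # score for a similar pair
MDM["A", "D"]               # score for a dissimilar pair
range(diag(MDM))            # range of identity scores in this excerpt
```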


 

The DNA binding site

Now that you know how YFO Mbp1 aligns with yeast Mbp1, you can evaluate functional conservation in these homologous proteins. You probably already downloaded the two Biochemistry papers by Taylor et al. (2000) and by Deleeuw et al. (2008) that we encountered in Assignment 2. These discuss the residues involved in DNA binding[2]. In particular, residues 50-74 have been proposed to comprise the DNA recognition domain.

Task:

  1. Using the APSES domain alignment you have just constructed, find the YFO Mbp1 residues that correspond to the range 50-74 in yeast.
  2. Note whether the sequences are especially highly conserved in this region.
  3. Using Chimera, look at the region. Use the sequence window to make sure that the sequence numbering between the paper and the PDB file are the same (they are often not identical!). Then select the residues - the proposed recognition domain - and color them differently for emphasis. Study this in stereo to get a sense of the spatial relationships. Check where the conserved residues are.
  4. A good representation is stick - but other representations that include sidechains will also serve well.
  5. Calculate a solvent accessible surface of the protein in a separate representation and make it transparent.
  6. You could combine three representations: (1) the backbone (in ribbon view), (2) the sidechains of residues that presumably contact DNA, distinctly colored, and (3) a transparent surface of the entire protein. This image should show whether residues annotated as DNA binding form a contiguous binding interface.


DNA binding interfaces are expected to comprise a number of positively charged amino acids that might form salt bridges with the phosphate backbone.


Task:

  • Study and consider whether this is the case here and which residues might be included.
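A quick way to list candidate residues is to scan the region for lysine and arginine. The R sketch below assumes the protein sequence is available as a single string; the function name basicPositions and the toy sequence are invented for illustration, and the real range (50-74 in yeast Mbp1) would have to be applied to the actual sequence. Histidine can be added to the residue set if you want to consider it as well.

```r
# Sketch only: positions of positively charged residues within a range.
# "basicPositions" is a hypothetical helper; the sequence below is invented.
basicPositions <- function(sequence, from, to, residues = c("K", "R")) {
  aa  <- strsplit(toupper(sequence), "")[[1]]
  idx <- from:to                        # sequence positions to inspect
  idx[aa[idx] %in% residues]            # keep positions holding K or R
}

toySeq <- "AKRDEGKLLR"
basicPositions(toySeq, 2, 8)            # K and R within positions 2 to 8
```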


 

 

BLAST

One of the foundations of bioinformatics is the empirical observation that related sequences conserve structure, and often function. This is the basis on which we can make inferences from well-studied model organisms about species that have not been studied as deeply. The model case for our assignments is to take annotations from baker's yeast, Saccharomyces cerevisiae, and apply them to YFO.

Therefore, in this assignment we will

  • use the sequence search program BLAST to retrieve a sequence similar to yeast Mbp1 in YFO;
  • use a number of tools to annotate the sequence.

Keeping with our theme of sequence analysis, we will

  • explore EMBOSS tools;
  • compute and plot relative amino acid frequencies in R;
  • and (optionally) use Chimera to explore H-bond patterns in the Mbp1 APSES domain structure.

 

Retrieve

In Assignment 2 you looked at sequences in YFO that are related to yeast Mbp1, by following a link from the RefSeq record. I mentioned that there are more principled ways to find related proteins: that principle is to search for similar sequences. Exactly how this works will be the subject of later lectures, but the tool that is most commonly used for this task is called BLAST (Basic Local Alignment Search Tool). The task of this assignment is to perform a number of sequence annotations on the sequence from YFO that is most similar to Mbp1, or, more precisely, that contains an APSES domain that is most similar[3].

 

Search input

First, we need to define the sequence we will search with, as the search input.


Defining the sequence to search with

I have highlighted the extent of the APSES domain sequence in the previous assignment, but when you explored the corresponding structure in Chimera, you saw that the structured protein domain is larger and the additional secondary structure elements are in fact well integrated into the overall domain. This is not surprising: canonical domain definitions are compiled from many species and examples, and they generally comprise only the common core. Looking up the source of the domain annotations for Mbp1 is very easy:


Task:

  1. Access the RefSeq record for yeast Mbp1.
  2. While you are here, download a FASTA formatted version of the sequence to your R working directory and give it a filename of mbp1-sacce.fa. We will need it later. It should be straightforward from the NCBI page how to achieve that. As a hint, you need to use the Send to... link to actually download the file.
  3. On the RefSeq page, look for the link Related Information → CDD Search Results and follow it.


This is a domain annotation: CDD is the NCBI's Conserved Domain Database and the annotation was done by a tool that scanned the sequence of Mbp1 for segments that are similar to any of the domain definitions stored in the CDD. We will return to CDD in the next assignment.

  1. Click on the blue box labeled Kila-N in the graph to access the CDD entry for this domain.
  2. Read the abstract. You should understand the relationship between Kila-N and APSES domains. One is a subfamily of the other.
  3. Confirm that the domain definition – as applied to the Mbp1 sequence (which is labeled as "query") – corresponds to the region we highlighted in the last assignment.


What precisely constitutes an APSES domain however is a matter of definition, as you can explore in the following (optional) task.


Optional: Load the structure in Chimera, like you did in the last assignment and switch on stereo viewing ... (more)
  1. Display the protein in ribbon style, e.g. with the Interactive 1 preset.
  2. Access the Interpro information page for Mbp1 at the EBI: http://www.ebi.ac.uk/interpro/protein/P39678
  3. In the section Domains and repeats, mouse over the red annotations and note down the residue numbers for the annotated domains. Also follow the links to the respective Interpro domain definition pages.

At this point we have definitions for the following regions on the Mbp1 protein ...

  • The KilA-N (pfam 04383) domain definition as applied to the Mbp1 protein sequence by CDD;
  • The InterPro KilA, N-terminal/APSES-type HTH, DNA-binding (IPR018004) definition annotated on the Mbp1 sequence;
  • The InterPro Transcription regulator HTH, APSES-type DNA-binding domain (IPR003163) definition annotated on the Mbp1 sequence;
  • (... in addition – without following the source here – the UniProt record for Mbp1 annotates a "HTH APSES-type" domain from residues 5-111)

... each with its distinct and partially overlapping sequence range. Back to Chimera:


  1. In the sequence window, select the sequence corresponding to the Interpro KilA-N annotation and colour this fragment red. Remember that you can get the sequence numbers of a residue in the sequence window when you hover the pointer over it - but do confirm that the sequence numbering that Chimera displays matches the numbering of the Interpro domain definition.
  2. Then select the residue range(s) by which the CDD KilA-N definition is larger, and colour that fragment orange.
  3. Then select the residue range(s) by which the InterPro APSES domain definition is larger, and colour that fragment yellow.
  4. If the structure contains residues outside these ranges, colour these white.
  5. Study this in a side-by-side stereo view and get a sense for how the extra sequence beyond the KilA-N domain(s) is part of the structure, and how the integrity of the folded structure would be affected if these fragments were missing.
  6. Display Hydrogen bonds, to get a sense of interactions between residues from the differently colored parts. First show the protein as a stick model, with sticks that are thicker than the default to give a better sense of sidechain packing:
    (i) Select → Select all
    (ii) Actions → Ribbon → hide
    (iii) Select → Structure → protein
    (iv) Actions → Atoms/Bonds → show
    (v) Actions → Atoms/Bonds → stick
    (vi) click on the looking glass icon at the bottom right of the graphics window to bring up the inspector window and choose Inspect ... Bond. Change the radius to 0.4.
  7. Then calculate and display the hydrogen bonds:
    (vii) Tools → Surface/Binding Analysis → FindHbond
    (viii) Set the Line width to 3.0, leave all other parameters at their default values and click Apply
    Clear the selection.
    Study this view, especially regarding side chain H-bonds. Are there many? Do side chains interact more with other sidechains, or with the backbone?
  8. Let's now simplify the scene a bit and focus on backbone/backbone H-bonds:
    (ix) Select → Structure → Backbone → full
    (x) Actions → Atoms/Bonds → show only

    Clear the selection.
    In this way you can appreciate how H-bonds build secondary structure - α-helices and β-sheets - and how these interact with each other ... in part across the KilA-N boundary.
  9. Save the resulting image as a jpeg no larger than 600px across and upload it to your Lab notebook on the Wiki.
  10. When you are done, congratulate yourself on having earned a bonus of 10% on the next quiz.


There is a rather important lesson in this: domain definitions may be fluid, and their boundaries may be computationally derived from sequence comparisons across many families, and do not necessarily correspond to individual structures. Make sure you understand this well.


Given this, it seems appropriate to search the sequence database with the sequence of an Mbp1 structure – this being a structured, stable subdomain of the whole that presumably contains the protein's most unique and specific function. Let us retrieve this sequence. All PDB structures have their sequences stored in the NCBI protein database. They can be accessed simply via the PDB-ID, which serves as an identifier both for the NCBI and the PDB databases. However, there is a small catch (isn't there always?). PDB files can contain more than one protein, e.g. if the crystal structure contains a complex[4]. Each of the individual proteins gets a so-called chain ID – a one-letter identifier – to identify it uniquely. To find a protein's sequence in the database, you need to know the PDB ID as well as the chain ID. If the file contains only a single protein (as in our case), the chain ID is always A[5]. Make sure you understand the concept of protein chains, and chain IDs.


Task:

  1. Back at the RefSeq record for yeast Mbp1, enter the PDB-ID, an underscore, and the chain ID for one of the crystal structures into the search field. You can use 1MB1_A or 1BM8_A, but don't use 1L3G: this NMR structure includes a large stretch of unstructured residues.
  2. Click on Display settings and choose FASTA (text). You should get something like:
    >gi|157830387|pdb|1BM8|A Chain A, Dna-Binding Domain Of Mbp1
    QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
    QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
  3. Save this sequence in your notebook, in case we need it later.
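If you would like to check the saved record programmatically rather than by eye, here is a minimal sketch in Python (not part of the R workflow of this course). The function name parse_fasta is made up for illustration; the record is the one shown above.

```python
# Minimal FASTA reader: maps each header line (without ">") to its sequence.
# parse_fasta is a hypothetical helper name, not an NCBI or course function.
def parse_fasta(text):
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # start a new record
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence lines may be wrapped
    return {h: "".join(parts) for h, parts in records.items()}


fasta_text = """\
>gi|157830387|pdb|1BM8|A Chain A, Dna-Binding Domain Of Mbp1
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKY
QGTWVPLNIAKQLAEKFSVYDQLKPLFDF
"""

records = parse_fasta(fasta_text)
for header, sequence in records.items():
    print(header.split()[0], len(sequence), "residues")
```

Joining the wrapped lines into one string is the important step: the line breaks in a FASTA file carry no meaning.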


Next, we use this sequence to find its most similar relative in YFO using BLAST.


 

BLAST search

Task:

  1. Navigate to the BLAST entry page at the NCBI.
  2. Click on protein blast as the BLAST program to run.
  3. Paste the sequence of the yeast Mbp1 DNA-binding domain into the search field.
  4. Set the following parameters:
    1. As Database option choose Reference proteins (refseq_protein)
    2. As Organism enter the binomial name of YFO. Make sure you spell it right; the page will try to autocomplete your entry. Species level is detailed enough, you don't have to specify the strain (e.g. I would specify "Ustilago maydis" not "Ustilago maydis 521").
  5. Then click on the BLAST button and wait for the result to appear. You will first see a graph of any conserved domains in your query sequence; this is not yet what you are waiting for...
  6. Patience.
  7. Patience. The database is large.
  8. Patience. Execution times vary greatly by time of day.
  9. The top "hit" on the results page is what you are looking for. Its alignment and alignment score are shown in the Alignments section a bit further down the page. Your hit should have more than about 40% identities to the query and match at least 80 residues or so. If your match is shorter or weaker than that, please email me to troubleshoot.
  10. The first item for each hit is a link to its database entry, right next to the checkbox. It says something like ref|NP_123456789 or ref|XP_123456789 ... follow that link.
  11. Note the RefSeq ID, and save the sequence in FASTA format into your R working directory, as you did for Mbp1 at the beginning of the assignment. Give this a filename of mbp1-xxxxx.fa, but replace xxxxx with the short species label for YFO. For simplicity I will refer to this sequence as "YFO Mbp1" in the future.
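To make the "40% identities over at least 80 residues" criterion concrete, here is a small Python sketch of how such numbers can be derived from a pairwise alignment. The two aligned strings are invented toy data, not a real BLAST result, and the function name is hypothetical. I am assuming the usual convention that the percentage is taken over the full alignment length, gap columns included.

```python
# Count identical positions in a gapped pairwise alignment.
# Assumption: percent identity = identical columns / alignment length,
# where alignment length includes gap columns.
def identity_stats(aligned_query, aligned_hit):
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(
        1
        for q, h in zip(aligned_query, aligned_hit)
        if q == h and q != "-"
    )
    length = len(aligned_query)
    return identical, length, 100.0 * identical / length


# Toy alignment (invented for illustration only).
query = "QIYSARYSGV-DVYEF"
hit   = "KIYSAVYSGVPDVYEL"

identical, length, percent = identity_stats(query, hit)
print(f"Identities = {identical}/{length} ({percent:.0f}%)")
```

A real hit would of course be judged on the full alignment that BLAST reports, not on a 16-column toy.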


 




PSI BLAST


 

Take care of things, and they will take care of you.
Shunryu Suzuki


Anyone can click buttons on a Web page, but using the powerful sequence database search tools correctly often takes considerably more care, caution and consideration.

Much of what we know about a protein's physiological function is based on the conservation of that function as the species evolves. We assess conservation by comparing sequences between related proteins. Conservation - or its opposite: variation - is a consequence of selection under constraints: protein sequences change as a consequence of DNA mutations, this changes the protein's structure, this in turn changes its function, and that has multiple effects on a species' fitness. Detrimental variants may be removed. Variation that is tolerated is largely neutral and therefore found only in positions that are neither structurally nor functionally critical. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, or amino acid propensities as predictors for protein engineering and design tasks.

Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of the essential properties of a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for

  • functional annotation;
  • protein homology modeling;
  • phylogenetic analyses, and
  • sensitive homology searches in databases.

In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is where the trouble begins. All interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologs, or paralogs as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:

  • orthologs are expected to be functionally and structurally conserved;
  • paralogs may have divergent function but have similar structure;
  • missing genes may make paralogs look like orthologs; and
  • selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.


In this assignment, we will set ourselves the task to use PSI-BLAST and find all orthologs and paralogs of the APSES domain containing transcription factors in YFO. We will use these sequences later for multiple alignments, calculation of conservation etc. The methodical problem we will address is: how do we perform a sensitive PSI-BLAST search in one organism? There is a trade-off to consider:

  • If we restrict the PSI-BLAST search to YFO, PSI-BLAST has little chance of building a meaningful profile - the number of homologues that actually are in YFO is too small. Thus the search will not become very sensitive.
  • If we don't restrict our search, but search in all species, the number of hits may become too large. It becomes increasingly difficult to closely check all hits as to whether they have good coverage. And how will we evaluate the fringe cases of marginal E-value, where we need to decide whether to include a new sequence in the profile, or whether to hold off on it for one or two iterations to see whether the E-value drops significantly? Profile corruption would make the search useless. This is maybe still manageable if we restrict our search to fungi, but imagine you are working with a bacterial protein, or a protein that is conserved across the entire tree of life: your search will find thousands of sequences. And by next year, thousands more will have been added.

Therefore we have to find a middle ground: add enough species (sequences) to compile a sensitive profile, but not so many that we can no longer individually assess the sequences that contribute to the profile.


Thus in practice, a sensitive PSI-BLAST search needs to address two issues before we begin:

  1. We need to define the sequence we are searching with; and
  2. We need to define the dataset we are searching in.



Defining the sequence to search with

Consider again the task we set out from: find all orthologs and paralogs of the APSES domain containing transcription factors in YFO.


Task:
What query sequence should you use? Should you ...


  1. Search with the full-length Mbp1 sequence from Saccharomyces cerevisiae?
  2. Search with the full-length Mbp1 homolog that you found in YFO?
  3. Search with the structurally defined S. cerevisiae APSES domain sequence?
  4. Search with the APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
  5. Search with the KilA-N domain sequence?


Reflect on this (pretend this is a quiz question) and come up with a reasoned answer. Then click on "Expand" to read my opinion on this question.
The full-length Mbp1 sequence from Saccharomyces cerevisiae
Since this sequence contains multiple domains (in particular the ubiquitous Ankyrin domains) it is not suitable for BLAST database searches. You must restrict your search to the domain of greatest interest for your question. That would be the APSES domain.
The full-length Mbp1 homolog that you found in YFO
What organism the search sequence comes from does not make a difference. Since you aim to find all homologs in YFO, it is not necessary to have your search sequence more or less similar to any particular homologs. In fact any APSES sequence should give you the same result, since they are all homologous. But the full-length sequence in YFO has the same problem as the Saccharomyces sequence.
The structurally defined S. cerevisiae APSES domain sequence?
That would be my first choice, just because it is structurally well defined as a complete domain, and the sequence is easy to obtain from the 1BM8 PDB entry. (1MB1 would also work, but you would need to edit out the penta-Histidine tag at the C-terminus that was engineered into the sequence to help purify the recombinantly expressed protein.)
The APSES domain sequence from the YFO homolog, that you have defined by sequence alignment with the yeast protein?
As argued above: since they are all homologs, any of them should lead to the same set of results.
The KilA-N domain sequence?
This is a shorter sequence and a more distant homolog to the domain we are interested in. It would not be my first choice: the fact that it is more distantly related might make the search more sensitive. The fact that it is shorter might make the search less specific. The effect of this tradeoff would need to be compared and considered. By the way: the same holds for the even shorter subdomain 50-74 we discussed in the last assignment. However: one of the results of our analysis will be whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment.


So in my opinion, you should search with the yeast Mbp1 APSES domain, i.e. the sequence which you have previously studied in the crystal structure. Where is that? Well, you might have saved it in your journal, or you can get it again from the PDB (i.e. here, or from Assignment 3).

 

Selecting species for a PSI-BLAST search

As discussed in the introduction, in order to use our sequence set for studying structural and functional features and conservation patterns of our APSES domain proteins, we should start with a well-selected dataset of APSES domain containing homologs in YFO. Since these may be quite divergent, we can't rely on BLAST to find all of them; we need to use the much more sensitive search of PSI-BLAST instead. But even though you are interested only in YFO's genes, it would be a mistake to restrict the PSI-BLAST search to YFO. PSI-BLAST becomes more sensitive if the profile represents more diverged homologs. Therefore we should always search with a broadly representative set of species, even if we are interested only in the results for one of the species. This is important. Please reflect on this for a bit and make sure you understand the rationale why we include sequences in the search that we are not actually interested in.


But you can also search with too many species. If the number of species is large and PSI-BLAST finds a large number of results:

  1. it becomes unwieldy to check the newly included sequences at each iteration; false-positive hits may then be included, resulting in profile corruption and loss of specificity. The search will fail.
  2. since genomes from some parts of the Tree Of Life are over-represented, the inclusion of all sequences leads to selection bias and loss of sensitivity.


We should therefore try to find a subset of species

  1. that represent as large a range as possible on the evolutionary tree;
  2. that are as well distributed as possible on the tree; and
  3. whose genomes are fully sequenced.

These criteria are important. Again, reflect on them and understand their justification. Choosing your species well for a PSI-BLAST search can be crucial to obtain results that are robust and meaningful.

How can we define a list of such species, and how can we use the list?

The definition is a rather typical bioinformatics task for integrating datasources: "retrieve a list of representative fungi with fully sequenced genomes". Unfortunately, to do this in a principled way requires tools that you can't (yet) program: we would need to use a list of genome sequenced fungi, estimate their evolutionary distance and select a well-distributed sample. Regrettably you can't combine such information easily with the resources available from the NCBI.

We will use an approach that is conceptually similar: selecting a set of species according to their shared taxonomic rank in the tree of life. Biological classification provides a hierarchical system that describes evolutionary relatedness for all living entities. The levels of this hierarchy are so-called taxonomic ranks. These ranks are defined in Codes of Nomenclature that are curated by the self-governed international associations of scientists working in the field. The number of ranks is not specified: there is a general consensus on seven principal ranks (see below, in bold) but many subcategories exist and may be newly introduced. It is desired – but not mandated – that ranks represent clades (a group of related species, or a "branch" of a phylogeny), and it is desired – but not mandated – that the rank is sharply defined. The system is based on subjective dissimilarity. Needless to say, it is in flux.

If we follow a link to an entry in the NCBI's Taxonomy database, e.g. Saccharomyces cerevisiae S288c, the strain from which the original "yeast genome" was sequenced in the late 1990s, we see the following specification of its taxonomic lineage:


cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; 
Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes; 
Saccharomycetales; Saccharomycetaceae; Saccharomyces; Saccharomyces cerevisiae


These names can be mapped into taxonomic ranks, since the suffixes of these names, e.g. -mycotina or -mycetaceae, are specific to defined ranks. (NCBI does not provide this mapping, but Wikipedia is helpful here.)

Rank                 Suffix       Example
Domain                            Eukaryota (Eukarya)
  Subdomain                       Opisthokonta
Kingdom                           Fungi
  Subkingdom                      Dikarya
Phylum                            Ascomycota
  rankless taxon[6]  -myceta      Saccharomyceta
  Subphylum          -mycotina    Saccharomycotina
Class                -mycetes     Saccharomycetes
  Subclass           -mycetidae
Order                -ales        Saccharomycetales
Family               -aceae       Saccharomycetaceae
  Subfamily          -oideae
  Tribe              -eae
  Subtribe           -ineae
Genus                             Saccharomyces
Species                           Saccharomyces cerevisiae

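
The suffix column of the table can be applied mechanically. The following Python sketch is only an illustration: the dictionary contains exactly the suffixes listed in the table and nothing more, the function name is made up, and names without a listed suffix simply return None. Note that longer suffixes must be tested first, since -aceae, -oideae and -ineae all end in -eae.

```python
# Suffix-to-rank mapping, restricted to the suffixes listed in the table above.
RANK_BY_SUFFIX = {
    "myceta": "rankless taxon",
    "mycotina": "subphylum",
    "mycetes": "class",
    "mycetidae": "subclass",
    "ales": "order",
    "aceae": "family",
    "oideae": "subfamily",
    "eae": "tribe",
    "ineae": "subtribe",
}


def rank_from_suffix(name):
    """Return the rank implied by a taxon name's suffix, or None."""
    if " " in name:
        return None  # binomials and phrases such as "cellular organisms"
    lowered = name.lower()
    # Test the longest suffixes first so that "-aceae" wins over "-eae".
    for suffix in sorted(RANK_BY_SUFFIX, key=len, reverse=True):
        if lowered.endswith(suffix):
            return RANK_BY_SUFFIX[suffix]
    return None


lineage = (
    "cellular organisms; Eukaryota; Opisthokonta; Fungi; Dikarya; "
    "Ascomycota; Saccharomyceta; Saccharomycotina; Saccharomycetes; "
    "Saccharomycetales; Saccharomycetaceae; Saccharomyces; "
    "Saccharomyces cerevisiae"
)

for taxon in lineage.split("; "):
    print(f"{taxon:28s} {rank_from_suffix(taxon)}")
```
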
You can see that there is not a common mapping between the yeast lineage and the commonly recognized categories - not all ranks are represented. Nor is this consistent across species in the taxonomic database: some have subfamily ranks and some don't. And the tree is in no way normalized - some of the ranks have thousands of members, and for some, only a single extant member may be known, or it may be a rank that only relates to the fossil record. But the ranks do provide some guidance to evolutionary divergence. Say you want to choose four species across the tree of life for a study: you should choose one from each of the major domains of life: Eubacteria, Euryarchaeota, Crenarchaeota-Eocytes, and Eukaryotes. Or say you want to study a gene that is specific to mammals. Then you could choose from the clades listed in the NCBI taxonomy database under Mammalia (a class rank) and, depending on how many species you want to include, use the subclass, order, or family rank (hover over the names to see their taxonomic rank). There will still be quite a bit of manual work involved, and an exploration of different options on the Web may be useful. For our purposes here we can retrieve a good set of organisms from the Ensembl fungal genomes page - maintained by the EBI's genome annotation group - that lists species grouped by taxonomic order. All of these organisms are genome-sequenced, so we can pick a set of representatives:
  1. Capnodiales   Zymoseptoria tritici
  2. Erysiphales   Blumeria graminis
  3. Eurotiales   Aspergillus nidulans
  4. Glomerellales   Glomerella graminicola
  5. Hypocreales   Trichoderma reesei
  6. Magnaporthales   Magnaporthe oryzae
  7. Microbotryales   Microbotryum violaceum
  8. Pezizales   Tuber melanosporum
  9. Pleosporales   Phaeosphaeria nodorum
  10. Pucciniales   Puccinia graminis
  11. Saccharomycetales   Saccharomyces cerevisiae
  12. Schizosaccharomycetales   Schizosaccharomyces pombe
  13. Sclerotiniaceae   Sclerotinia sclerotiorum
  14. Sordariales   Neurospora crassa
  15. Tremellales   Cryptococcus neoformans
  16. Ustilaginales   Ustilago maydis
This set of organisms thus can be used to generate a PSI-BLAST search in a well-distributed set of species. Of course you must also include YFO (if YFO is not in this list already). To enter these 16 species as an Entrez restriction, they need to be formatted as below. (One could also enter species one by one, by pressing the (+) button after the organism list)
Aspergillus nidulans[orgn]
OR Blumeria graminis[orgn]
OR Cryptococcus neoformans[orgn]
OR Glomerella graminicola[orgn]
OR Magnaporthe oryzae[orgn]
OR Microbotryum violaceum[orgn] 
OR Neurospora crassa[orgn]
OR Phaeosphaeria nodorum[orgn]
OR Puccinia graminis[orgn]
OR Sclerotinia sclerotiorum[orgn]
OR Trichoderma reesei[orgn]
OR Tuber melanosporum[orgn]
OR Saccharomyces cerevisiae[orgn]
OR Schizosaccharomyces pombe[orgn]
OR Ustilago maydis[orgn]
OR Zymoseptoria tritici[orgn]
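
Typing such a list by hand invites spelling errors. A short Python sketch can generate the same restriction from a list of names. The function name is made up for illustration, and "Candida albicans" merely stands in for YFO here; substitute your own organism.

```python
# The 16 representative species from the list above, in the same order
# as in the Entrez restriction.
REPRESENTATIVE_FUNGI = [
    "Aspergillus nidulans",
    "Blumeria graminis",
    "Cryptococcus neoformans",
    "Glomerella graminicola",
    "Magnaporthe oryzae",
    "Microbotryum violaceum",
    "Neurospora crassa",
    "Phaeosphaeria nodorum",
    "Puccinia graminis",
    "Sclerotinia sclerotiorum",
    "Trichoderma reesei",
    "Tuber melanosporum",
    "Saccharomyces cerevisiae",
    "Schizosaccharomyces pombe",
    "Ustilago maydis",
    "Zymoseptoria tritici",
]


def entrez_organism_query(species, extra=None):
    """Join species names into an '[orgn]' restriction, one per line."""
    names = list(species)
    if extra and extra not in names:
        names.append(extra)  # add YFO only if it is not already listed
    return "\nOR ".join(f"{name}[orgn]" for name in names)


# "Candida albicans" is a placeholder for YFO, chosen only as an example.
query = entrez_organism_query(REPRESENTATIVE_FUNGI, extra="Candida albicans")
print(query)
```

The printed text can be pasted into the Entrez Query field as it is.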


 

Executing the PSI-BLAST search

We have a list of species. Good. Next up: how do we use it?

Task:

  1. Navigate to the BLAST homepage.
  2. Select protein BLAST.
  3. Paste the APSES domain sequence into the search field.
  4. Select refseq as the database.
  5. Copy the organism restriction list from above and enter the correct name for YFO into the list if it is not there already. Obviously, you can't find sequences in YFO if YFO is not included in your search space. Paste the list into the Entrez Query field.
  6. In the Algorithm section, select PSI-BLAST.
  7. Click on BLAST.


Evaluate the results carefully. Since we used default parameters, the threshold for inclusion was set at an E-value of 0.005 by default, and that may be a bit too lenient. If you look at the table of your hits, in the Sequences producing significant alignments... section, there may also be a few sequences that have a low query coverage of less than 80%. Let's exclude these from the profile initially: not to worry, if they are true positives, they will come back with improved E-values and greater coverage in subsequent iterations. But if they were false positives, their E-values will rise and they should drop out of the profile and not contaminate it.


Task:

  1. In the header section, click on Formatting options and in the line "Format for..." set the "with inclusion threshold" field to 0.001. (This means E-values can't be above 1e-03 for the sequence to be included.)
  2. Click on the Reformat button (top right).
  3. In the table of sequence descriptions (not alignments!), click on the Query coverage to sort the table by coverage, not by score.
  4. Copy the rows with a coverage of less than 80% and paste them into some text editor so you can compare what happens with these sequences in the next iteration.
  5. Deselect the check mark next to these sequences in the right-hand column Select for PSI blast. (For me these are six sequences, but with YFO included that may be a bit different.)
  6. Then scroll to Run PSI-BLAST iteration 2 ... and click on Go.


This is now the "real" PSI-BLAST at work: it constructs a profile from all the full-length sequences and searches with the profile, not with any individual sequence. Note that we are controlling what goes into the profile in two ways:

  1. we are explicitly removing sequences with poor coverage; and
  2. we are requiring a more stringent E-value threshold for each sequence.
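
The two controls amount to a simple filter on each hit. Here is a sketch of that logic in Python; it only mirrors what you do by hand on the results page and does not talk to the NCBI. The accessions and numbers are invented placeholders, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str
    evalue: float
    coverage: float  # query coverage in percent


def select_for_profile(hits, max_evalue=0.001, min_coverage=80.0):
    """Keep hits that pass both the E-value and the coverage criterion."""
    return [
        hit
        for hit in hits
        if hit.evalue <= max_evalue and hit.coverage >= min_coverage
    ]


# Invented example hits; "HYPO_..." are placeholders, not real accessions.
hits = [
    Hit("HYPO_0001", 1e-38, 98.0),  # passes both criteria
    Hit("HYPO_0002", 3e-04, 62.0),  # good E-value, but coverage below 80%
    Hit("HYPO_0003", 4e-03, 91.0),  # good coverage, but E-value above 0.001
]

selected = select_for_profile(hits)
print([hit.accession for hit in selected])
```

A hit that fails here is not lost: it is only kept out of the profile for this iteration and may return in a later one.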


Task:

  1. Again, study the table of hits. Sequences highlighted in yellow have met the search criteria in the second iteration. Note that the coverage of (some) of the previously excluded sequences is now above 80%.
  2. Let's exclude partial matches one more time. Again, deselect all sequences with less than 80% coverage. Then run the third iteration.
  3. Iterate the search in this way until no more "New" sequences are added to the profile. Then scan the list of excluded hits ... are there any from YFO that seem like they could potentially make the list? Marginal E-value perhaps, or reasonable E-value but less coverage? If that's the case, try returning the E-value threshold to the default 0.005 and see what happens...


Once no "new" sequences have been added, if we were to repeat the process again and again, we would always get the same result because the profile stays the same. We say that the search has converged. Good. Time to harvest.


Task:

  1. At the header, click on Taxonomy reports and find YFO in the Organism Report section. These are your APSES domain homologs. All of them. Actually, perhaps more than all: the report may also include sequences with E-values above the inclusion threshold.
  2. From the report copy the sequence identifiers
    1. from YFO,
    2. with E-values that meet your defined threshold (i.e. that are no larger than it).

For example, the list of Saccharomyces genes is the following:

Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292
ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] [ 131] 1e-38
ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c] [ 123] 1e-35
ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c] [ 91] 1e-25
ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c] [ 93] 3e-25
ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c] [ 40] 5e-07

Xbp1 is a special case. It has only very low coverage, but that is because it has a long domain insertion and the N-terminal match often is not recognized by alignment because the gap scores for long indels are unrealistically large. For now, I keep that sequence with the others.
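
If the report is long, the identifiers can also be pulled out with a few lines of code. The Python sketch below assumes that report lines look exactly like the Saccharomyces example above; the regular expression and the function name are mine, not an NCBI format specification, so check the output against the page.

```python
import re

# Pattern for lines such as:
# ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] [ 131] 1e-38
REPORT_LINE = re.compile(
    r"^ref\|(?P<accession>[^|]+)\|\s+(?P<name>\S+)\s+"
    r"\[(?P<organism>[^\]]+)\]\s+\[\s*(?P<score>\d+)\]\s+(?P<evalue>\S+)\s*$"
)

report = """\
Saccharomyces cerevisiae S288c [ascomycetes] taxid 559292
ref|NP_010227.1| Mbp1p [Saccharomyces cerevisiae S288c] [ 131] 1e-38
ref|NP_011036.1| Swi4p [Saccharomyces cerevisiae S288c] [ 123] 1e-35
ref|NP_012881.1| Phd1p [Saccharomyces cerevisiae S288c] [ 91] 1e-25
ref|NP_013729.1| Sok2p [Saccharomyces cerevisiae S288c] [ 93] 3e-25
ref|NP_012165.1| Xbp1p [Saccharomyces cerevisiae S288c] [ 40] 5e-07
"""


def parse_report(text, max_evalue=0.001):
    """Return (accession, name, E-value) for lines that pass the threshold."""
    kept = []
    for line in text.splitlines():
        match = REPORT_LINE.match(line.strip())
        if match is None:
            continue  # header lines and anything unexpected are skipped
        evalue = float(match.group("evalue"))
        if evalue <= max_evalue:
            kept.append((match.group("accession"), match.group("name"), evalue))
    return kept


report_hits = parse_report(report)
for accession, name, evalue in report_hits:
    print(accession, name, evalue)
```
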


Next we need to retrieve the sequences. Tedious to retrieve them one by one, but we can get them all at the same time:


Task:

  1. Return to the BLAST results page and again open the Formatting options.
  2. Find the Limit results section and enter YFO's name into the field. For example Saccharomyces cerevisiae [ORGN]
  3. Click on Reformat
  4. Scroll to the Descriptions section, check the box at the left-hand margin, next to each sequence you want to keep. Then click on Download → FASTA complete sequence → Continue.


There are actually several ways to download lists of sequences. Using the results page utility is only one. But if you know the GIs of the sequences you need, you can get them more directly by putting them into the URL...

Even more flexible is the eUtils interface to the NCBI databases. For example you can download the dataset in text format by clicking below.

Note that this utility does not show anything, but downloads the (multi) fasta file to your default download directory.
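
As an illustration of the eUtils route, the following Python sketch assembles an EFetch URL for a list of protein identifiers. It only builds and prints the URL; it does not contact the NCBI. I use the RefSeq accessions of the yeast example rather than GI numbers, which is an assumption on my part, and the function name is made up.

```python
from urllib.parse import urlencode

# Base URL of the NCBI EFetch utility.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(ids):
    """Build an EFetch URL that requests protein records as FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(ids),  # several IDs go into one comma-separated list
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


yeast_ids = [
    "NP_010227.1",
    "NP_011036.1",
    "NP_012881.1",
    "NP_013729.1",
    "NP_012165.1",
]

url = efetch_fasta_url(yeast_ids)
print(url)
```

Opening the printed URL in a browser should return the sequences as a multi-FASTA file.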


Multiple Sequence Alignment

 


Review of domain annotations

APSES domains are relatively easy to identify and annotate, but we have had problems with the ankyrin domains in Mbp1 homologues. Both CDD and SMART have identified such domains, but while the domain model was based on the same Pfam profile for both, and both annotated approximately the same regions, the details of the alignments and the extent of the predicted regions were different.

Mbp1 forms heterodimeric complexes with a homologue, Swi6. Swi6 does not have an APSES domain, thus it does not bind DNA. But it is similar to Mbp1 in the region spanning the ankyrin domains and in 1999 Foord et al. published its crystal structure (1SW6). This structure is a good model for Ankyrin repeats in Mbp1. For details, please refer to the consolidated Mbp1 annotation page I have prepared.

In what follows, we will use the program Jalview, a Java-based multiple sequence alignment editor, to load and align sequences and to consider structural similarity between yeast Mbp1 and its closest homologue in your organism.

In this part of the assignment,

  1. You will load sequences that are most similar to Mbp1 into an MSA editor;
  2. You will add sequences of ankyrin domain models;
  3. You will perform a multiple sequence alignment;
  4. You will try to improve the alignment manually.


Jalview, loading sequences

Geoff Barton's lab in Dundee has developed an integrated MSA editor and sequence annotation workbench with a number of very useful functions. It is written in Java and should run on Mac, Linux and Windows platforms without modifications.


Waterhouse et al. (2009) Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics 25:1189-91. (pmid: 19151095)

Jalview Version 2 is a system for interactive WYSIWYG editing, analysis and annotation of multiple sequence alignments. Core features include keyboard and mouse-based editing, multiple views and alignment overviews, and linked structure display with Jmol. Jalview 2 is available in two forms: a lightweight Java applet for use in web applications, and a powerful desktop application that employs web services for sequence alignment, secondary structure prediction and the retrieval of alignments, sequences, annotation and structures from public databases and any DAS 1.53 compliant sequence or annotation server. AVAILABILITY: The Jalview 2 Desktop application and JalviewLite applet are made freely available under the GPL, and can be downloaded from www.jalview.org.


We will use this tool for this assignment and explore its features as we go along.

Task:

  1. Navigate to the Jalview homepage, click on Download, install Jalview on your computer and start it. A number of windows that showcase the program's abilities will load; you can close these.
  2. Prepare homologous Mbp1 sequences for alignment:
    1. Open the Reference Mbp1 orthologues (all fungi) page. (This is the list of Mbp1 orthologs I mentioned above.)
    2. Copy the FASTA sequences of the reference proteins, paste them into a text file (TextEdit on the Mac, Notepad on Windows) and save the file; you could give it an extension of .fa–but you don't have to.
    3. Check whether the sequence for YFO is included in the list. If it is, fine. If it is not, retrieve it from NCBI, paste it into the file and edit the header like the other sequences. If the wrong sequence from YFO is included, replace it and let me know.
  3. Return to Jalview and select File → Input Alignment → from File and open your file. A window with sequences should appear.
  4. Copy the sequences for ankyrin domain models (below), click on the Jalview window, select File → Add sequences → from Textbox and paste them into the Jalview textbox. Paste two separate copies of the CD00204 consensus sequence and one copy of 1SW6.
    1. When all the sequences are present, click on Add.

Jalview now displays all the sequences, but of course this is not yet an alignment.

Ankyrin domain models
>CD00204 ankyrin repeat consensus sequence from CDD
NARDEDGRTPLHLAASNGHLEVVKLLLENGADVNAKDNDGRTPLHLAAKNGHLEIVKLLL
EKGADVNARDKDGNTPLHLAARNGNLDVVKLLLKHGADVNARDKDGRTPLHLAAKNGHL
>1SW6 from PDB - unstructured loops replaced with xxxx
GPIITFTHDLTSDFLSSPLKIMKALPSPVVNDNEQKMKLEAFLQRLLFxxxxSFDSLLQE
VNDAFPNTQLNLNIPVDEHGNTPLHWLTSIANLELVKHLVKHGSNRLYGDNMGESCLVKA
VKSVNNYDSGTFEALLDYLYPCLILEDSMNRTILHHIIITSGMTGCSAAAKYYLDILMGW
IVKKQNRPIQSGxxxxDSILENLDLKWIIANMLNAQDSNGDTCLNIAARLGNISIVDALL
DYGADPFIANKSGLRPVDFGAG

Computing alignments

The EBI has a very convenient page to access a number of MSA algorithms. This is especially convenient when you want to compare, e.g. T-Coffee and Muscle and MAFFT results to see which regions of your alignment are robust. You could use any of these tools, just paste your sequences into a Webform, download the results and load into Jalview. Easy.

But even easier is to calculate the alignments directly from Jalview, when the Web service is available. (Not today. Bummer.)

Calculate a MAFFT alignment using the Jalview Web service option

Task:

  1. In Jalview, select Web Service → Alignment → MAFFT with defaults.... The alignment is calculated in a few minutes and displayed in a new window.
Calculate a MAFFT alignment when the Jalview Web service is NOT available

Task:

  1. In Jalview, select File → Output to Textbox → FASTA
  2. Copy the sequences.
  3. Navigate to the MAFFT Input form at the EBI.
  4. Paste your sequences into the form.
  5. Click on Submit.
  6. Close the Jalview sequence window and either save your MAFFT alignment to file and load it in Jalview, or simply select File → Input Alignment → from Textbox, paste and click New Window.


In any case, you should now have an alignment.

Task:

  1. Choose Colour → Hydrophobicity and → by Conservation. Then adjust the slider left or right to see which columns are highly conserved. You will notice that the Swi6 sequence that was supposed to align only to the ankyrin domains was in fact aligned to other parts of the sequence as well. This is one part of the MSA that we will have to correct manually and a common problem when aligning sequences of different lengths.


 

Editing ankyrin domain alignments

A good MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since the alignment reflects the result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. The contiguous features annotated for Mbp1 are expected to be left intact by a good alignment.

A poor MSA has many errors in its columns; these contain residues that actually have different functions or structural roles, even though they may look similar according to a (pairwise!) scoring matrix. A poor MSA also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted in a poor alignment and residues that are conserved may be placed into different columns.

Often errors or inconsistencies are easy to spot, and manually editing an MSA is not generally frowned upon, even though this is not a strictly objective procedure. The main goal of manual editing is to make an alignment biologically more plausible. Most commonly this means minimizing the number of rare evolutionary events that the alignment suggests and/or emphasizing conservation of known functional motifs. Here are some examples of what one might aim for in manually editing an alignment:

Reduce number of indels
From a Probcons alignment:
0447_DEBHA    ILKTE-K-T---K--SVVK      ILKTE----KTK---SVVK
9978_GIBZE    MLGLN-PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
1513_CANAL    ILKTE-K-I---K--NVVK      ILKTE----KIK---NVVK
6132_SCHPO    ELDDI-I-ESGDY--ENVD      ELDDI-IESGDY---ENVD
1244_ASPFU    ----N-PGLREIC--HSIT  ->  ----NPGLREIC---HSIT
0925_USTMA    LVKTC-PALDPHI--TKLK      LVKTCPALDPHI---TKLK
2599_ASPTE    VLDAN-PGLREIS--HSIT      VLDANPGLREIS---HSIT
9773_DEBHA    LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
0918_CANAL    LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR

Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably; however, the total number of indels in this excerpt is reduced to 13 from the original 22.
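The indel counts quoted for this excerpt can be checked with a few lines of R. This is a minimal sketch that assumes each uninterrupted run of gap characters in a row counts as one indel event (leading gaps included); the helper name countIndels is hypothetical, and the rows are copied from the excerpt.

```r
# Count indel events: every uninterrupted run of "-" in a row counts once.
# countIndels() is an illustrative helper, not part of any package.
countIndels <- function(rows) {
  runs <- regmatches(rows, gregexpr("-+", rows))  # gap runs found in each row
  sum(lengths(runs))
}

before <- c("ILKTE-K-T---K--SVVK", "MLGLN-PGLKEIT--HSIT", "ILKTE-K-I---K--NVVK",
            "ELDDI-I-ESGDY--ENVD", "----N-PGLREIC--HSIT", "LVKTC-PALDPHI--TKLK",
            "VLDAN-PGLREIS--HSIT", "LLESTPKQYHQHI--KRIR", "LLESTPKEYQQYI--KRIR")
after  <- c("ILKTE----KTK---SVVK", "MLGLNPGLKEIT---HSIT", "ILKTE----KIK---NVVK",
            "ELDDI-IESGDY---ENVD", "----NPGLREIC---HSIT", "LVKTCPALDPHI---TKLK",
            "VLDANPGLREIS---HSIT", "LLESTPKQYHQHI--KRIR", "LLESTPKEYQQYI--KRIR")

countIndels(before)  # 22
countIndels(after)   # 13
```

Counting gap runs rather than gap characters is a design choice: it treats a long insertion as a single evolutionary event, which is the quantity manual editing tries to minimize.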


Move indels to more plausible position
From a CLUSTAL alignment:
4966_CANGL     MKHEKVQ------GGYGRFQ---GTW      MKHEKVQ------GGYGRFQ---GTW
1513_CANAL     KIKNVVK------VGSMNLK---GVW      KIKNVVK------VGSMNLK---GVW
6132_SCHPO     VDSKHP-----------QID---GVW  ->  VDSKHPQ-----------ID---GVW
1244_ASPFU     EICHSIT------GGALAAQ---GYW      EICHSIT------GGALAAQ---GYW

The two characters marked in red were swapped. This does not change the number of indels but places the "Q" into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.

Conserve motifs
From a CLUSTAL alignment:
6166_SCHPO      --DKRVA---GLWVPP      --DKRVA--G-LWVPP
XBP1_SACCE      GGYIKIQ---GTWLPM      GGYIKIQ--G-TWLPM
6355_ASPTE      --DEIAG---NVWISP  ->  ---DEIA--GNVWISP
5262_KLULA      GGYIKIQ---GTWLPY      GGYIKIQ--G-TWLPY

The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.


The ankyrin domains are quite highly diverged, their boundaries are not well defined, and not even CDD, SMART and SAS agree on the precise annotations. We expect there to be alignment errors in this region. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required indels would be placed between the secondary structure elements, not in their middle. But from the sequence alignment alone, we cannot judge where the secondary structure elements ought to be. You should therefore add the following "sequence" to the alignment; it contains exactly as many characters as the Swi6 sequence above and annotates the secondary structure elements. I have derived it from the 1SW6 structure.

>SecStruc 1SW6 E: strand   t: turn   H: helix   _: irregular
_EEE__tt___ttt______EE_____t___HHHHHHHHHHHHHHHH_xxxx_HHHHHHH
HHHH_t_____t_____t____HHHHHHH__tHHHHHHHHH____t___tt____HHHHH
HH__HHHH___HHHHHHHHHHHHHEE_t____HHHHHHHHH__t__HHHHHHHHHHHHHH
HHHHHH__EEE_xxxx_HHHHHt_HHHHHHH______t____HHHHHHHH__HHHHHHHH
H____t____t____HHHH___
1SW6_A at the PDBSum database of structure annotations. You can compare the diagram there with this text string.
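Before editing, it may help to confirm that the annotation string really is position-for-position compatible with the 1SW6 sequence. The following sketch uses only base R; the variable names swi6 and ss are arbitrary, and the strings are the two records given above with the line breaks removed.

```r
# 1SW6 sequence and its secondary structure annotation, as given above
swi6 <- paste0(
  "GPIITFTHDLTSDFLSSPLKIMKALPSPVVNDNEQKMKLEAFLQRLLFxxxxSFDSLLQE",
  "VNDAFPNTQLNLNIPVDEHGNTPLHWLTSIANLELVKHLVKHGSNRLYGDNMGESCLVKA",
  "VKSVNNYDSGTFEALLDYLYPCLILEDSMNRTILHHIIITSGMTGCSAAAKYYLDILMGW",
  "IVKKQNRPIQSGxxxxDSILENLDLKWIIANMLNAQDSNGDTCLNIAARLGNISIVDALL",
  "DYGADPFIANKSGLRPVDFGAG")
ss <- paste0(
  "_EEE__tt___ttt______EE_____t___HHHHHHHHHHHHHHHH_xxxx_HHHHHHH",
  "HHHH_t_____t_____t____HHHHHHH__tHHHHHHHHH____t___tt____HHHHH",
  "HH__HHHH___HHHHHHHHHHHHHEE_t____HHHHHHHHH__t__HHHHHHHHHHHHHH",
  "HHHHHH__EEE_xxxx_HHHHHt_HHHHHHH______t____HHHHHHHH__HHHHHHHH",
  "H____t____t____HHHH___")

# One annotation character per residue?
nchar(swi6) == nchar(ss)   # TRUE

# Do the "xxxx" loop placeholders fall on the same positions in both strings?
identical(which(strsplit(swi6, "")[[1]] == "x"),
          which(strsplit(ss,   "")[[1]] == "x"))   # TRUE

# Residues 203 to 232 and their annotation
substr(swi6, 203, 232)   # "LDLKWIIANMLNAQDSNGDTCLNIAARLGN"
substr(ss,   203, 232)   # "t_HHHHHHH______t____HHHHHHHH__"
```

If either check returned FALSE after a copy and paste, the annotation would drift out of register with the sequence as soon as gaps are inserted.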


To proceed:

  1. Manually align the Swi6 sequence with yeast Mbp1
  2. Bring the Secondary structure annotation into its correct alignment with Swi6
  3. Bring both CDD ankyrin profiles into the correct alignment with yeast Mbp1

Proceed along the following steps:

Task:

  1. Add the secondary structure annotation to the sequence alignment in Jalview. Copy the annotation, select File → Add sequences → from Textbox and paste the sequence.
  2. Select Help → Documentation and read about Editing Alignments, Cursor Mode and Key strokes.
  3. Click on the yeast Mbp1 sequence row to select the entire row. Then use the cursor key to move that sequence down, so it is directly above the 1SW6 sequence. Select the row of 1SW6 and use shift/mouse to move the sequence elements and edit the alignment to match yeast Mbp1. Refer to the alignment given in the Mbp1 annotation page for the correct alignment.
  4. Align the secondary structure elements with the 1SW6 sequence: Every character of 1SW6 should be matched with either E, t, H, or _. The result should be similar to the Mbp1 annotation page. If you need to insert gaps into all sequences in the alignment, simply drag your mouse over all row headers - movement of sequences is constrained to selected regions, the rest is locked into place to prevent inadvertent misalignments. Remember to save your project from time to time: File → save so you can reload a previous state if anything goes wrong and can't be fixed with Edit → Undo.
  5. Finally align the two CD00204 consensus sequences to their correct positions (again, refer to the Mbp1 annotation page).
  6. You can now consider the principles stated above and see if you can improve the alignment, for example by moving indels out of regions of secondary structure if that is possible without changing the character of the aligned columns significantly. Select blocks within which to work to leave the remaining alignment unchanged. So that this does not become tedious, you can restrict your editing to one Ankyrin repeat that is structurally defined in Swi6. You may want to open the 1SW6 structure in VMD to define the boundaries of one such repeat. You can copy and paste sections from Jalview into your assignment for documentation or export sections of the alignment to HTML (see the example below).

Editing ankyrin domain alignments - Sample

This sample was created by

  1. Editing the alignments as described above;
  2. Copying a block of aligned sequence;
  3. Pasting it To New Alignment;
  4. Colouring the residues by Hydrophobicity and setting the colour saturation according to Conservation;
  5. Choosing File → Export Image → HTML and pasting the resulting HTML source into this Wikipage.


Alignment column ruler: 10 | 20 | 30 | 40
MBP1_USTMA/341-368   - - Y G D Q L - - - A D - - - - - - - - - - I L - - - - N F Q D D E G E T P L T M A A R A R S
MBP1B_SCHCO/470-498   - R E D G D Y - - - K S - - - - - - - - - - F L - - - - D L Q D E H G D T A L N I A A R V G N
MBP1_ASHGO/465-494   F S P Q Y R I - - - E T - - - - - - - - - - L I - - - - N A Q D C K G S T P L H I A A M N R D
MBP1_CLALU/550-586   G N Q N G N S N D K K E - - - - - - - - - - L I S K F L N H Q D N E G N T A F H I A A Y N M S
MBPA_COPCI/514-542   - H E G G D F - - - R S - - - - - - - - - - L V - - - - D L Q D E H G D T A I N I A A R V G N
MBP1_DEBHA/507-550   I R D S Q E I - - - E N K K L S L S D K K E L I A K F I N H Q D I D G N T A F H I V A Y N L N
MBP1A_SCHCO/388-415   - - Y P K E L - - - A D - - - - - - - - - - V L - - - - N F Q D E D G E T A L T M A A R C R S
MBP1_AJECA/374-403   T L P P H Q I - - - S M - - - - - - - - - - L L - - - - S S Q D S N G D T A A L A A A K N G C
MBP1_PARBR/380-409   I L P P H Q I - - - S L - - - - - - - - - - L L - - - - S S Q D S N G D T A A L A A A K N G C
MBP1_NEOFI/363-392   T C S Q D E I - - - D L - - - - - - - - - - L L - - - - S C Q D S N G D T A A L V A A R N G A
MBP1_ASPNI/365-394   T F S P E E V - - - D L - - - - - - - - - - L L - - - - S C Q D S V G D T A V L V A A R N G V
MBP1_UNCRE/377-406   M Y P H H E V - - - G L - - - - - - - - - - L L - - - - A S Q D S N G D T A A L T A A K N G C
MBP1_PENCH/439-468   T C S Q D E I - - - Q M - - - - - - - - - - L L - - - - S C Q D Q N G D T A V L V A A R N G A
MBPA_TRIVE/407-436   V F P R H E I - - - S L - - - - - - - - - - L L - - - - S S Q D A N G D T A A L T A A K N G C
MBP1_PHANO/400-429   T W I P E E V - - - T R - - - - - - - - - - L L - - - - N A Q D Q N G D T A I M I A A R N G A
MBPA_SCLSC/294-313   - - - - - - - - - - - - - - - - - - - - - - - L - - - - D A R D I N G N T A I H I A A K N K A
MBPA_PYRIS/363-392   T W I P E E V - - - T R - - - - - - - - - - L L - - - - N A A D Q N G D T A I M I A A R N G A
MBP1_/361-390   - - - N H S L G V L S Q - - - - - - - - - - F M - - - - D T Q N N E G D T A L H I L A R S G A
MBP1_ASPFL/328-364   T E Q P G E V I T L G R - - - - - - - - - - F I S E I V N L R D D Q G D T A L N L A G R A R S
MBPA_MAGOR/375-404   Q H D P N F V - - - Q Q - - - - - - - - - - L L - - - - D A Q D N D G N T A V H L A A Q R G S
MBP1_CHAGL/361-390   S R S A D E L - - - Q Q - - - - - - - - - - L L - - - - D S Q D N E G N T A V H L A A M R D A
MBP1_PODAN/372-401   V R Q P E E V - - - Q A - - - - - - - - - - L L - - - - D A Q D E E G N T A L H L A A R V N A
MBP1_LACTH/458-487   F S P R Y R I - - - E N - - - - - - - - - - L I - - - - N A Q D Q N G D T A V H L A A Q N G D
MBP1_FILNE/433-460   - - Y P Q E L - - - A D - - - - - - - - - - V I - - - - N F Q D E E G E T A L T I A A R A R S
MBP1_KLULA/477-506   F T P Q Y R I - - - D V - - - - - - - - - - L I - - - - N Q Q D N D G N S P L H Y A A T N K D
MBP1_SCHST/468-501   A K D P D N K - - - K D - - - - - - - - - - L I A K F I N H Q D S D G N T A F H I C S H N L N
MBP1_SACCE/496-525   F S P Q Y R I - - - E L - - - - - - - - - - L L - - - - N T Q D K N G D T A L H I A S K N G D
CD00204/1-19   - - - - - - - - - - - - - - - - - - - - - - - - - - - - N A R D E D G R T P L H L A A S N G H
CD00204/99-118   - - - - - - - - - - - - - - - - - - - - - - - V - - - - N A R D K D G R T P L H L A A K N G H
1SW6/203-232   L D L K W I I - - - A N - - - - - - - - - - M L - - - - N A Q D S N G D T C L N I A A R L G N
SecStruc/203-232   t _ H H H H H - - - H H - - - - - - - - - - _ _ - - - - _ _ _ _ t _ _ _ _ H H H H H H H H _ _
Aligned sequences before editing. The algorithm has placed gaps into the Swi6 helix LKWIIAN and the four-residue gaps before the block of well aligned sequence on the right are poorly supported.


Alignment column ruler: 10 | 20 | 30 | 40
MBP1_USTMA/341-368   - - Y G D Q L A D - - - - - - - - - - - - - - I L N F Q D D E G E T P L T M A A R A R S
MBP1B_SCHCO/470-498   - R E D G D Y K S - - - - - - - - - - - - - - F L D L Q D E H G D T A L N I A A R V G N
MBP1_ASHGO/465-494   F S P Q Y R I E T - - - - - - - - - - - - - - L I N A Q D C K G S T P L H I A A M N R D
MBP1_CLALU/550-586   G N Q N G N S N D K K E - - - - - - - L I S K F L N H Q D N E G N T A F H I A A Y N M S
MBPA_COPCI/514-542   - H E G G D F R S - - - - - - - - - - - - - - L V D L Q D E H G D T A I N I A A R V G N
MBP1_DEBHA/507-550   I R D S Q E I E N K K L S L S D K K E L I A K F I N H Q D I D G N T A F H I V A Y N L N
MBP1A_SCHCO/388-415   - - Y P K E L A D - - - - - - - - - - - - - - V L N F Q D E D G E T A L T M A A R C R S
MBP1_AJECA/374-403   T L P P H Q I S M - - - - - - - - - - - - - - L L S S Q D S N G D T A A L A A A K N G C
MBP1_PARBR/380-409   I L P P H Q I S L - - - - - - - - - - - - - - L L S S Q D S N G D T A A L A A A K N G C
MBP1_NEOFI/363-392   T C S Q D E I D L - - - - - - - - - - - - - - L L S C Q D S N G D T A A L V A A R N G A
MBP1_ASPNI/365-394   T F S P E E V D L - - - - - - - - - - - - - - L L S C Q D S V G D T A V L V A A R N G V
MBP1_UNCRE/377-406   M Y P H H E V G L - - - - - - - - - - - - - - L L A S Q D S N G D T A A L T A A K N G C
MBP1_PENCH/439-468   T C S Q D E I Q M - - - - - - - - - - - - - - L L S C Q D Q N G D T A V L V A A R N G A
MBPA_TRIVE/407-436   V F P R H E I S L - - - - - - - - - - - - - - L L S S Q D A N G D T A A L T A A K N G C
MBP1_PHANO/400-429   T W I P E E V T R - - - - - - - - - - - - - - L L N A Q D Q N G D T A I M I A A R N G A
MBPA_SCLSC/294-313   - - - - - - - - - - - - - - - - - - - - - - - - L D A R D I N G N T A I H I A A K N K A
MBPA_PYRIS/363-392   T W I P E E V T R - - - - - - - - - - - - - - L L N A A D Q N G D T A I M I A A R N G A
MBP1_/361-390   N H S L G V L S Q - - - - - - - - - - - - - - F M D T Q N N E G D T A L H I L A R S G A
MBP1_ASPFL/328-364   T E Q P G E V I T L G R F I S E - - - - - - - I V N L R D D Q G D T A L N L A G R A R S
MBPA_MAGOR/375-404   Q H D P N F V Q Q - - - - - - - - - - - - - - L L D A Q D N D G N T A V H L A A Q R G S
MBP1_CHAGL/361-390   S R S A D E L Q Q - - - - - - - - - - - - - - L L D S Q D N E G N T A V H L A A M R D A
MBP1_PODAN/372-401   V R Q P E E V Q A - - - - - - - - - - - - - - L L D A Q D E E G N T A L H L A A R V N A
MBP1_LACTH/458-487   F S P R Y R I E N - - - - - - - - - - - - - - L I N A Q D Q N G D T A V H L A A Q N G D
MBP1_FILNE/433-460   - - Y P Q E L A D - - - - - - - - - - - - - - V I N F Q D E E G E T A L T I A A R A R S
MBP1_KLULA/477-506   F T P Q Y R I D V - - - - - - - - - - - - - - L I N Q Q D N D G N S P L H Y A A T N K D
MBP1_SCHST/468-501   A K D P D N K K D - - - - - - - - - - L I A K F I N H Q D S D G N T A F H I C S H N L N
MBP1_SACCE/496-525   F S P Q Y R I E L - - - - - - - - - - - - - - L L N T Q D K N G D T A L H I A S K N G D
CD00204/1-19   - - - - - - - - - - - - - - - - - - - - - - - - - N A R D E D G R T P L H L A A S N G H
CD00204/99-118   - - - - - - - - - - - - - - - - - - - - - - - - V N A R D K D G R T P L H L A A K N G H
1SW6/203-232   L D L K W I I A N - - - - - - - - - - - - - - M L N A Q D S N G D T C L N I A A R L G N
SecStruc/203-232   t _ H H H H H H H - - - - - - - - - - - - - - _ _ _ _ _ _ t _ _ _ _ H H H H H H H H _ _
Aligned sequence after editing. A significant cleanup of the frayed region is possible. Now there is only one insertion event, and it is placed into the loop that connects two helices of the 1SW6 structure.


Final analysis

Task:

  • Compare the distribution of indels in the ankyrin repeat regions of your alignments.
    • Review whether the indels in this region are concentrated in segments that connect the helices, or if they are more or less evenly distributed along the entire region of similarity.
    • Think about whether the assertion that indels should not be placed in elements of secondary structure has merit in your alignment.
    • Recognize that an indel in an element of secondary structure could be interpreted in a number of different ways:
      • The alignment is correct, the annotation is correct too: the indel is tolerated in that particular case, for example by extending the length of an α-helix or β-strand;
      • The alignment algorithm has made an error, the structural annotation is correct: the indel should be moved a few residues;
      • The alignment is correct, the structural annotation is wrong, this is not a secondary structure element after all;
      • Both the algorithm and the annotation are probably wrong, but we have no data to improve the situation.

(NB: remember that the structural annotations have been made for the yeast protein and might have turned out differently for the other proteins...)

You should be able to analyse discrepancies between annotation and expectation in a structured and systematic way. In particular if you notice indels that have been placed into an annotated region of secondary structure, you should be able to comment on whether the location of the indel has strong support from aligned sequence motifs, or whether the indel could possibly be moved into a different location without much loss in alignment quality.
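One way to make this check systematic is to list, for every alignment column that contains a gap, whether the annotation row marks it as part of a helix or strand. The sketch below is only an illustration under stated assumptions: the four-row toy alignment and all names in it are hypothetical, and the annotation uses the E/t/H/_ alphabet of the SecStruc row.

```r
# Hypothetical toy alignment: aligned strings of equal length,
# with the secondary structure annotation as one of the rows.
ali <- c(SecStruc = "HHHHHH____HHHH",
         seqA     = "LKWIIANMLNAQDS",
         seqB     = "LKW--ANMLNAQDS",
         seqC     = "LKWIIAN--NAQDS")

m    <- do.call(rbind, strsplit(ali, ""))   # character matrix, one row per sequence
ss   <- m["SecStruc", ]
gaps <- m[rownames(m) != "SecStruc", , drop = FALSE] == "-"

gapCols <- which(colSums(gaps) > 0)         # columns in which any sequence has a gap
inSS    <- ss %in% c("H", "E")              # columns annotated as helix or strand

data.frame(column    = gapCols,
           secStruc  = ss[gapCols],
           inElement = inSS[gapCols])
# columns 4 and 5 lie inside a helix; columns 8 and 9 lie in the connecting segment
```

Columns flagged with inElement = TRUE are the ones to inspect by eye: there you would decide which of the interpretations listed above is the most plausible.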

  • Considering the whole alignment and your experience with editing, you should be able to state whether the position of indels relative to structural features of the ankyrin domains in your organism's Mbp1 protein is reliable. That would be the result of this task, in which you combine multiple sequence and structural information.
  • You can also critically evaluate database information that you have encountered:
  1. Navigate to the CDD annotation for yeast Mbp1.
  2. You can check the precise alignment boundaries of the ankyrin domains by clicking on the (+) icon to the left of the matching domain definition.
  3. Confirm that CDD extends the ankyrin domain annotation beyond the 1SW6 domain boundaries. Given your assessment of conservation in the region beyond the structural annotation: do you think that extending the annotation is reasonable also in YFO's protein? Is there evidence for this in the alignment of the CD00204 consensus with well aligned blocks of sequence beyond the positions that match Swi6?


R code: load alignment and compute information scores

As discussed in the lecture, Shannon information is calculated as the difference between expected and observed entropy, where entropy is the negative sum over probabilities times the log of those probabilities:

H = -\sum_{i} p_i \log_2 p_i , \qquad I = H_{ref} - H_{obs}



Here we compute Shannon information scores for aligned positions of the APSES domain, and plot the values in R. You can try this with any part of your alignment, but I have used only the aligned residues for the APSES domain for my example. This is a good choice for a first try, since there are (almost) no gaps.
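To get a feeling for the numbers, here is a compact sketch of the entropy calculation for single alignment columns. It is not the script used in this assignment: the function name colEntropy and the two toy columns are hypothetical, and observed frequencies are simply taken as probabilities, without any small-sample correction.

```r
# Shannon entropy (in bits) of one alignment column, gaps ignored.
# colEntropy() is an illustrative helper, not part of seqinr.
colEntropy <- function(v) {
  v <- v[v != "-"]             # drop gap characters
  p <- table(v) / length(v)    # observed frequencies, taken as probabilities
  -sum(p * log2(p))            # H = -sum(p * log2(p))
}

colEntropy(c("l", "l", "l", "l"))   # 0 bits: fully conserved column
colEntropy(c("l", "i", "l", "i"))   # 1 bit: two equally frequent residues
```

Because table() only reports residues that actually occur, the 0 * log(0) case never arises here. The information of a column is then the reference entropy minus this value.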

Task:

  1. Export only the sequences of the aligned APSES domains to a file on your computer, in FASTA format as explained below. You could call this: MBP1_All_APSES.fa.
    1. Use your mouse and click and drag to select the aligned APSES domains in the alignment window.
    2. Copy your selection to the clipboard.
    3. Use the main menu (not the menu of your alignment window) and select File → Input alignment → from Textbox; paste the selection into the textbox and click New Window.
    4. Use File → save as to save the aligned sequences in multi-FASTA format under the filename you want in your R project directory.
  2. Explore the R-code below. Be sure that you understand it correctly. Note that this code does not implement any sampling bias correction, so positions with large numbers of gaps will receive artificially high scores (the alignment looks as if the gap character were a conserved character).


# CalculateInformation.R
# Calculate Shannon information for positions in a multiple sequence alignment.
# Requires: an MSA in multi FASTA format
 
# It is good practice to set variables you might want to change
# in a header block so you don't need to hunt all over the code
# for strings you need to update.
#
setwd("/your/R/working/directory")
mfa      <- "MBP1_All_APSES.fa"
 
# ================================================
#    Read sequence alignment fasta file
# ================================================
 
# read MFA datafile using seqinr function read.alignment()
library(seqinr)
tmp  <- read.alignment(mfa, format="fasta")
MSA  <- as.matrix(tmp)  # convert the list into a characterwise matrix
                        # with appropriate row and column names using
                        # the seqinr function as.matrix.alignment()
                        # You could have a look under the hood of this
                        # function to understand better how to convert a
                        # list into something else ... simply type
                        # "as.matrix.alignment" - without the parentheses
                        # to retrieve the function source code (as for any
                        # function btw).

### Explore contents of and access to the matrix of sequences
MSA
MSA[1,]
MSA[,1]
length(MSA[,1])


# ================================================
#    define function to calculate entropy
# ================================================

entropy <- function(v) { # calculate shannon entropy for the aa vector v
	                     # Note: we are not correcting for small sample sizes
	                     # here. Thus if there are a large number of gaps in
	                     # the alignment, this will look like small entropy
	                     # since only a few amino acids are present. In the 
	                     # extreme case: if a position is only present in 
	                     # one sequence, that one amino acid will be treated
	                     # as 100% conserved - zero entropy. Sampling error
	                     # corrections are discussed eg. in Schneider et al.
	                     # (1986) JMB 188:414
	l <- length(v)
	a <- rep(0, 21)      # initialize a vector with 21 elements (20 aa plus gap)
	                     # then set the name of each element to the one letter
	                     # code. Through this, we can access an element by its
	                     # one letter code.
	names(a)  <- unlist(strsplit("acdefghiklmnpqrstvwy-", ""))
	
	for (i in 1:l) {       # for the whole vector of amino acids
		c <- v[i]          # retrieve the character
		a[c] <- a[c] + 1   # increment its count by one
	} # note: we could also have used the table() function for this
	
	tot <- sum(a) - a["-"] # calculate number of observed amino acids
	                       # i.e. subtract gaps
	a <- a/tot             # frequency is observations of one amino acid
	                       # divided by all observations. We assume that
	                       # frequency equals probability.
	a["-"] <- 0       	                        
	for (i in 1:length(a)) {
		if (a[i] != 0) { # if a[i] is not zero, otherwise leave as is.
			             # By definition, 0*log(0) = 0  but R calculates
			             # this in parts and returns NaN for log(0).
			a[i] <- a[i] * (log(a[i])/log(2)) # replace a[i] with
			                                  # p(i) log_2(p(i))
		}
	}
	return(-sum(a)) # return Shannon entropy
}

# ================================================
#    calculate entropy for reference distribution
#    (from UniProt, c.f. Assignment 2)
# ================================================

refData <- c(
    "A"=8.26,
    "Q"=3.93,
    "L"=9.66,
    "S"=6.56,
    "R"=5.53,
    "E"=6.75,
    "K"=5.84,
    "T"=5.34,
    "N"=4.06,
    "G"=7.08,
    "M"=2.42,
    "W"=1.08,
    "D"=5.45,
    "H"=2.27,
    "F"=3.86,
    "Y"=2.92,
    "C"=1.37,
    "I"=5.96,
    "P"=4.70,
    "V"=6.87
    )

### Calculate the entropy of this distribution

H.ref <- 0
for (i in 1:length(refData)) {
	p <- refData[i]/sum(refData) # convert % to probabilities
    H.ref <- H.ref - (p * (log(p)/log(2)))
}

# ================================================
#    calculate information for each position of 
#    multiple sequence alignment
# ================================================

lAli <- dim(MSA)[2] # length of row in matrix is second element of dim(<matrix>).
I <- rep(0, lAli)   # initialize result vector
for (i in 1:lAli) { 
	I[i] = H.ref - entropy(MSA[,i])  # I = H_ref - H_obs
}

### evaluate I
I
quantile(I)
hist(I)
plot(I)

# you can see that we have quite a large number of columns with the same,
# high value ... what are these?

which(I > 4)
MSA[,which(I > 4)]

# And what is in the columns with low values?
MSA[,which(I < 1.5)]


# ===================================================
#    plot the information
#    (c.f. Assignment 5, see there for explanations)
# ===================================================

IP <- (I-min(I))/(max(I) - min(I) + 0.0001)
nCol <- 15
IP <- floor(IP * nCol) + 1
spect <- colorRampPalette(c("#DD0033", "#00BB66", "#3300DD"), bias=0.6)(nCol)
# Optionally, set the highest information scores to grey: uncomment the
# next line to change the highest level of the spectrum to grey.
#spect[nCol] <- "#CCCCCC"
Icol <- vector()
for (i in 1:length(I)) {
	Icol[i] <- spect[ IP[i] ] 
}
 
plot(1,1, xlim=c(0, lAli), ylim=c(-0.5, 5) ,
     type="n", bty="n", xlab="position in alignment", ylab="Information (bits)")

# plot as rectangles: height is information and color is coded to information
for (i in 1:lAli) {
   rect(i, 0, i+1, I[i], border=NA, col=Icol[i])
}

# As you can see, some of the columns reach very high values, but they are not
# contiguous in sequence. Are they contiguous in structure? We will find out in
# a later assignment, when we map computed values to structure.


Plot of information vs. sequence position produced by the R script above, for an alignment of Mbp1 ortholog APSES domains.



Calculating conservation scores

Task:

  • Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
# BiostringsExample.R
# Short tutorial on sequence alignment with the Biostrings package.
# Boris Steipe for BCH441, 2013 - 2014
#
setwd("~/path/to/your/R_files/")

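# Biostrings is a package within the Bioconductor project.
# Bioconductor packages have their own installation system,
# they are normally not installed via CRAN.

# First, you install the Bioconductor package manager. (Older R versions
# used source("http://bioconductor.org/biocLite.R") and biocLite()
# instead; current R versions use BiocManager.)
if (!requireNamespace("BiocManager", quietly=TRUE)) {
    install.packages("BiocManager")
}

# Then you can install the Biostrings package and all of its dependencies.
BiocManager::install("Biostrings")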

# ... and load the library.
library(Biostrings)

# Some basic (technical) information is available ...
library(help=Biostrings)

# ... but for more in depth documentation, use the
# so called "vignettes" that are provided with every R package.
browseVignettes("Biostrings")

# In this code, we mostly use functions that are discussed in the 
# pairwise alignment vignette.

# Read in two fasta files - you will need to edit this for YFO
sacce <- readAAStringSet("mbp1-sacce.fa", format="fasta")

# "USTMA" is used only as an example here - modify for YFO  :-)
ustma <- readAAStringSet("mbp1-ustma.fa", format="fasta")

sacce
names(sacce) 
names(sacce) <- "Mbp1 SACCE"
names(ustma) <- "Mbp1 USTMA" # Example only ... modify for YFO

width(sacce)
as.character(sacce)

# Biostrings takes a sophisticated approach to sequence alignment ...
?pairwiseAlignment

# ... but the use in practice is quite simple:
ali <- pairwiseAlignment(sacce, ustma, substitutionMatrix = "BLOSUM50")
ali

pattern(ali)
subject(ali)

writePairwiseAlignments(ali)

p <- aligned(pattern(ali))
names(p) <- "Mbp1 SACCE aligned"
s <- aligned(subject(ali))
names(s) <- "Mbp1 USTMA aligned"

# don't overwrite your EMBOSS .fal files
writeXStringSet(p, "mbp1-sacce.R.fal", append=FALSE, format="fasta")
writeXStringSet(s, "mbp1-ustma.R.fal", append=FALSE, format="fasta")

# Done.
  • Compare the alignments you received from the EMBOSS server, and that you computed using R. Are they approximately the same? Exactly? You did use different matrices and gap parameters, so minor differences are to be expected. But by and large you should get the same alignments.

We will now use the aligned sequences to compute a graphical display of alignment quality.


Task:

  • Study this code carefully, execute it, section by section and make sure you understand all of it. Ask on the list if anything is not clear.
# aliScore.R
# Evaluating an alignment with a sliding window score
# Boris Steipe, October 2012. Update October 2013
setwd("~/path/to/your/R_files/")

# Scoring matrices can be found at the NCBI. 
# ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62

# It is good practice to set variables you might want to change
# in a header block so you don't need to hunt all over the code
# for strings you need to update.
#
fa1      <- "mbp1-sacce.R.fal"
fa2      <- "mbp1-ustma.R.fal"
code1    <- "SACCE"
code2    <- "USTMA"
mdmFile  <- "BLOSUM62.mdm"
window   <- 9   # window-size (should be an odd integer)

# ================================================
#    Read data files
# ================================================

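# read fasta datafiles using seqinr function read.fasta()
# (install seqinr only if it is not yet available)
if (!requireNamespace("seqinr", quietly=TRUE)) {
	install.packages("seqinr")
}
library(seqinr)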
tmp  <- unlist(read.fasta(fa1, seqtype="AA", as.string=FALSE, seqonly=TRUE))
seq1 <- unlist(strsplit(as.character(tmp), split=""))

tmp  <- unlist(read.fasta(fa2, seqtype="AA", as.string=FALSE, seqonly=TRUE))
seq2 <- unlist(strsplit(as.character(tmp), split=""))

if (length(seq1) != length(seq2)) {
	stop("Sequences have unequal length!")  # halt: the scoring below needs equal lengths
}
	
lSeq <- length(seq1)

# ================================================
#    Read scoring matrix
# ================================================

MDM <- read.table(mdmFile, skip=6)

# This is a dataframe. Study how it can be accessed:

MDM
MDM[1,]
MDM[,1]
MDM[5,5]   # Cys-Cys
MDM[20,20] # Val-Val
MDM[,"W"]  # the tryptophan column
MDM["R","W"]  # Arg-Trp pairscore
MDM["W","R"]  # Trp-Arg pairscore: pairscores are symmetric

colnames(MDM)  # names of columns
rownames(MDM)  # names of rows
colnames(MDM)[3]   # third column
rownames(MDM)[12]  # twelfth row

# change the two "*" names to "-" so we can use them to score
# indels of the alignment. This is a bit of a hack, since this
# does not reflect the actual indel penalties (which are, as you
# remember from your lectures, calculated as a gap opening
# + gap extension penalty; they can't be calculated in a pairwise
# manner). EMBOSS defaults for BLOSUM62 are opening -10 and
# extension -0.5 i.e. a gap of size 3 (-11.5) has approximately
# the same penalty as a 3-character score of "-" matches (-12)
# so a pairscore of -4 is not entirely unreasonable.

colnames(MDM)[24] 
rownames(MDM)[24]
colnames(MDM)[24] <- "-"
rownames(MDM)[24] <- "-"
colnames(MDM)[24] 
rownames(MDM)[24]
MDM["Q", "-"]
MDM["-", "D"]
# so far so good.

# ================================================
#    Tabulate pairscores for alignment
# ================================================


# It is trivial to create a pairscore vector along the
# length of the aligned sequences.

PS <- vector()
for (i in 1:lSeq) {
   aa1 <- seq1[i] 
   aa2 <- seq2[i] 
   PS[i] = MDM[aa1, aa2]
}

PS


# The same vector could be created - albeit perhaps not so
# easy to understand - with the expression ...
MDM[cbind(seq1,seq2)]



# ================================================
#    Calculate moving averages
# ================================================

# In order to evaluate the alignment, we will calculate a 
# sliding window average over the pairscores. Base R has no function
# dedicated to moving averages (although stats::filter() with equal
# weights can be used for this); options in packages that are quoted are:
#   - rollmean() in the "zoo" package http://rss.acs.unt.edu/Rdoc/library/zoo/html/rollmean.html
#   - MovingAverages() in "TTR"  http://rss.acs.unt.edu/Rdoc/library/TTR/html/MovingAverages.html
#   - ma() in "forecast"  http://robjhyndman.com/software/forecast/
# But since this is easy to code, we shall implement it ourselves.

PSma <- vector()           # will hold the averages
winS <- floor(window/2)    # span of elements above/below the centre
winC <- winS+1             # centre of the window

# extend the vector PS with zeros (virtual observations) above and below
PS <- c(rep(0, winS), PS , rep(0, winS))

# initialize the window score for the first position
winScore <- sum(PS[1:window])

# write the first score to PSma
PSma[1] <- winScore

# Slide the window along the sequence, and recalculate sum()
# Loop from the next position, to the last position that does not exceed the vector...
for (i in (winC + 1):(lSeq + winS)) { 
   # subtract the value that has just dropped out of the window
   winScore <- winScore - PS[(i-winS-1)] 
   # add the value that has just entered the window
   winScore <- winScore + PS[(i+winS)]  
   # put score into PSma
   PSma[i-winS] <- winScore
}

# convert the sums to averages
PSma <- PSma / window
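
# A possible cross-check (a sketch, not part of the original workflow):
# filter() from the "stats" package computes a centred moving average
# when it is given equal weights. The names toyX, toyW, toyS, toyPad and
# toyMA are made up for this illustration, and the data are arbitrary.
toyX   <- c(4, -1, 0, 7, 2, -4, -4, 5)
toyW   <- 3                                      # odd window size
toyS   <- floor(toyW / 2)
toyPad <- c(rep(0, toyS), toyX, rep(0, toyS))    # zero-padding as above
toyMA  <- stats::filter(toyPad, rep(1 / toyW, toyW), sides = 2)
# drop the padded positions, keeping one average per original element
toyMA  <- as.numeric(toyMA)[(toyS + 1):(toyS + length(toyX))]
toyMA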

# have a quick look at the score distributions

boxplot(PSma)
hist(PSma)

# ================================================
#    Plot the alignment scores
# ================================================

# normalize the scores 
PSma <- (PSma-min(PSma))/(max(PSma) - min(PSma) + 0.0001)
# spread the normalized values over nCol integer levels (1 to nCol)
nCol <- 10
PSma <- floor(PSma * nCol) + 1
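
# To illustrate these two steps, here is a sketch on made-up values
# (toyScores, toyN, toyNorm and toyBins are hypothetical names). The small
# offset of 0.0001 in the denominator keeps the normalized maximum just
# below 1, so that the largest value falls into level toyN and not toyN + 1.
toyScores <- c(-4, -1, 0, 2.5, 6)
toyN      <- 10
toyNorm   <- (toyScores - min(toyScores)) /
             (max(toyScores) - min(toyScores) + 0.0001)
toyBins   <- floor(toyNorm * toyN) + 1
toyBins
range(toyBins)   # all levels lie between 1 and toyN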

# Assign a colorspectrum to a vector (with a bit of colormagic,
# don't worry about that for now). Dark colors are poor scores,
# "hot" colors are high scores
spect <- colorRampPalette(c("black", "red", "yellow", "white"), bias=0.4)(nCol)

# Color is an often abused aspect of plotting. One can use color to label
# *quantities* or *qualities*. For the most part, our pairscores measure amino
# acid similarity. That is a quantity and with the spectrum that we just defined
# we associate the measured quantities with the color of a glowing piece
# of metal: we start with black #000000, then first we ramp up the red
# (i.e. low-energy) part of the visible spectrum to red #FF0000, then we
# add and ramp up the green spectrum giving us yellow #FFFF00 and finally we 
# add blue, giving us white #FFFFFF. Let's have a look at the spectrum:

s <- rep(1, nCol)
barplot(s, col=spect, axes=FALSE, main="Color spectrum")

# But one aspect of our data is not quantitatively different: indels.
# We valued indels with pairscores of -4. But indels are not simply poor alignment, 
# rather they are non-alignment. This means stretches of -4 values are really 
# *qualitatively* different. Let's color them differently by changing the lowest 
# level of the spectrum to grey.

spect[1] <- "#CCCCCC"
barplot(s, col=spect, axes=FALSE, main="Color spectrum")

# Now we can display our alignment score vector with colored rectangles.

# Convert the integers in PSma to color values from spect
PScol <- vector()
for (i in 1:length(PSma)) {
   PScol[i] <- spect[ PSma[i] ]  # a value from PSma is used as an index into spect
}
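
# The loop above can also be written as a single vectorized expression,
# since R accepts a whole vector of indices at once. A sketch with made-up
# values (toySpect and toyIdx are hypothetical names):
toySpect <- c("#000000", "#FF0000", "#FFFF00")
toyIdx   <- c(1, 3, 3, 2)
toySpect[toyIdx]
# By the same logic, spect[PSma] should give the same result as the loop.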

# Plot the scores. The code is similar to the last assignment.
# Create an empty plot window of appropriate size
plot(1,1, xlim=c(-100, lSeq), ylim=c(0, 2) , type="n", yaxt="n", bty="n", xlab="position in alignment", ylab="")

# Add a label to the left
text(-30, 1, adj=1, labels=paste("Mbp1:\n", code1, "\nvs.\n", code2), cex=0.9)

# Loop over the vector and draw boxes  without border, filled with color.
for (i in 1:lSeq) {
   rect(i, 0.9, i+1, 1.1, border=NA, col=PScol[i])
}

# Note that the numbers along the X-axis are not sequence numbers, but positions
# in the alignment, i.e. sequence number + the length of all preceding indels.
# That is important to realize: if you would like to add the annotations from
# the last assignment (which I will leave as an exercise), you need to map your
# sequence numbering into alignment numbering. Let me know in case you try that
# but need some help.
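
# One possible way to set up such a mapping (a sketch with a made-up aligned
# sequence; toyAln and toyMap are hypothetical names): the alignment positions
# of all non-gap characters, taken in order, give the alignment position of
# residue 1, 2, 3 ...
toyAln <- c("M", "-", "-", "K", "L", "-", "V")
toyMap <- which(toyAln != "-")
toyMap        # alignment positions of residues 1 to 4
toyMap[3]     # residue 3 ("L") sits at alignment position 5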


 

That is all.


 

Links and resources

 


Footnotes and references

  1. The directory also contains source code to generate the PAM matrices. This may be of interest to you if you ever want to produce scoring matrices from your own datasets.
  2. (Taylor et al. (2000) Biochemistry 39: 3943-3954 and Deleeuw et al. (2008) Biochemistry. 47:6378-6385)
  3. As you will see later on in the assignment, Mbp1-related proteins contain "Ankyrin" domains, a very widely distributed protein-protein interaction motif that may give rise to false-positive similarities for full-length sequence searches. Therefore, we search only with the DNA binding domain sequence, since this is the functionality that best characterizes the "function" of the protein we are interested in.
  4. Think of the ribosome or DNA-polymerase as extreme examples.
  5. Otherwise, you need to study the PDB Web page for the structure, or the text in the PDB file itself, to identify which part of the complex is labeled with which chain ID. For example, immunoglobulin structures sometimes label the light- and heavy chain fragments as "L" and "H", and sometimes as "A" and "B"; there are no fixed rules. You can also load the structure in VMD, color "by chain" and use the mouse to click on residues in each chain to identify it.
  6. The -myceta are well supported groups above the Class rank. See Leotiomyceta for details and references.


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.


