Difference between revisions of "BIO Assignment 3 2011"

From "A B C"
Line 124: Line 124:
 
*Based on your results, comment briefly on whether your organism appears to have an orthologue of the entire Mbp1 gene and/or only of the Mbp1 APSES domain.
 
*Based on your results, comment briefly on whether your organism appears to have an orthologue of the entire Mbp1 gene and/or only of the Mbp1 APSES domain.
  
 +
</div>
  
  
</div>
+
Now compare the (empirical, local) BLAST alignment with a Needleman-Wunsch (optimal, global) sequence alignment. Use the correct algorithm from the set of [http://www.google.ca/search?hl=en&q=emboss+gui EMBOSS tools].
 
 
Now compare the (empirical, local) BLAST alignment with a Needleman-Wunsch (optimal, global) sequence alignment. Use the correct algorithm from the set of [http://www.google.ca/search?hl=en&q=emboss+gui EMBOSS tools] instead.
 
  
 
<div style="padding: 5px; background: #EEEEEE;">
 
<div style="padding: 5px; background: #EEEEEE;">
 
*Retrieve the full-length sequence of the orthologue to yeast Mbp1 in your organism, and use an online tool to generate an optimal global alignment between this and ''S. cerevisiae'' Mbp1. <small>You have to figure out where to find a Web service that does such alignments, what the name of the algorithm is that you should use and how to define reasonable parameters for the alignment.</small>
 
*Retrieve the full-length sequence of the orthologue to yeast Mbp1 in your organism, and use an online tool to generate an optimal global alignment between this and ''S. cerevisiae'' Mbp1. <small>You have to figure out where to find a Web service that does such alignments, what the name of the algorithm is that you should use and how to define reasonable parameters for the alignment.</small>
*Review if and how the alignments are different, or whether the two alignment algorithms have given essentially the same results.
+
*'''Review''' if and how the alignments are different, or whether the two alignment algorithms have given essentially the same results.
 +
 
 +
'''Note''': When I instruct you to '''review...''', I do not require you to include your conclusions in the submitted assignment. However I expect you to be familiar with the analysis and to be able to answer questions on the process and the conclusions.
 +
 
 
</div>
 
</div>
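To make the difference between a heuristic local search and an optimal global alignment concrete, here is a minimal sketch of the Needleman-Wunsch dynamic programming recursion. It is only an illustration under simplifying assumptions: it scores with a flat match/mismatch value and a linear gap penalty, whereas the alignments discussed in this assignment use a substitution matrix (e.g. BLOSUM) and separate gap opening and extension penalties. The function name and the default scores are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    Simplified scoring (assumption): flat match/mismatch scores and a
    linear gap penalty instead of a substitution matrix with affine gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a

    # Trace back from the bottom-right cell to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy example with made-up sequences (not Mbp1 data).
result = needleman_wunsch("ACGT", "AGT")
```

Because every residue of both sequences must appear in the result, a global alignment is forced to span regions that a local method like BLAST may simply leave out of its HSPs.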
  
Line 353: Line 355:
 
This process was not entirely straightforward in all cases and such variations are quite normal for a database query. You need to be familiar with exceptions such as the ones described below and know how to deal with them.
 
This process was not entirely straightforward in all cases and such variations are quite normal for a database query. You need to be familiar with exceptions such as the ones described below and know how to deal with them.
  
Note: for ''Aspergillus fumigatus'' and ''Aspergillus nidulans'', the top BLAST hit is not the best match. The reason is that the best matching protein has a deletion just C-terminal to the APSES domain. This causes BLAST to split the HSP into two parts, and even though the APSES domain alone has a higher % identity, its BLAST score turns out to be lower (and its E-value correspondingly less significant) because it is a shorter sequence. Global alignment of each sequence with yeast Mbp1, as well as alignment of only the APSES domains, were consistent in showing that for both ''Aspergillus'' species the second highest BLAST score is indeed the most similar protein. The take-home message is that the '''comparison of BLAST scores can be misleading if we apply them to sequences of different length'''. For the record: ''Aspergillus fumigatus'' highest BLAST score is with XP_748947, second highest BLAST score is with XP_754232; the latter has higher global identity (25.7% vs. 22.6%) and higher identity in the APSES domain (55% vs. 45%). ''Aspergillus nidulans'' highest BLAST score is with XP_664319, second highest BLAST score is with XP_660758; the latter has higher global identity (26.7% vs. 22.8%) and higher identity in the APSES domain (59.5% vs. 50.6%). Interestingly, the ''Aspergillus terreus'' orthologue has the same deletion, but it provided the highest BLAST score to begin with.
+
'''Note''': for ''Aspergillus fumigatus'' and ''Aspergillus nidulans'', the top BLAST hit is not the best match. The reason is that the best matching protein has a deletion just C-terminal to the APSES domain. This causes BLAST to split the HSP into two parts, and even though the APSES domain alone has a higher % identity, its BLAST score turns out to be lower (and its E-value correspondingly less significant) because it is a shorter sequence. Global alignment of each sequence with yeast Mbp1, as well as alignment of only the APSES domains, were consistent in showing that for both ''Aspergillus'' species the second highest BLAST score is indeed the most similar protein. The take-home message is that the '''comparison of BLAST scores can be misleading if we apply them to sequences of different length'''. For the record: ''Aspergillus fumigatus'' highest BLAST score is with XP_748947, second highest BLAST score is with XP_754232; the latter has higher global identity (25.7% vs. 22.6%) and higher identity in the APSES domain (55% vs. 45%). ''Aspergillus nidulans'' highest BLAST score is with XP_664319, second highest BLAST score is with XP_660758; the latter has higher global identity (26.7% vs. 22.8%) and higher identity in the APSES domain (59.5% vs. 50.6%). Interestingly, the ''Aspergillus terreus'' orthologue has the same deletion, but it provided the highest BLAST score to begin with.
 
 
Note: ''Coprinopsis cinerea'' accession numbers are not yet in UniProt.
 
 
 
Note: For ''Gibberella zeae'' and ''Magnaporthe grisea'', the protein BLAST search had to go through the entire '''nr''' database, by entering an organism restriction, since genomic BLAST was not enabled.
 
  
Note: For ''Gibberella zeae'' XP_384396, no UniProt ID was returned as a cross-reference. EBI-BLAST retrieved FG04220, which is largely identical, except for short stretches that are absent in GenPept: apparently UniProt has a different gene-model for this protein.
+
'''Note''': For ''Gibberella zeae'' XP_384396, no UniProt ID was returned as a cross-reference. EBI-BLAST retrieved FG04220, which is largely identical, except for short stretches that are absent in GenPept: apparently UniProt has a different gene-model for this protein.
  
Note: The ''Neurospora crassa'' protein EAA33731 has no direct cross-reference in UniProt. The closest match is Q7SBG9 which is largely identical, except for short stretches that are absent in GenPept: apparently UniProt has a different gene-model for this protein.
+
'''Note''': The ''Magnaporthe grisea'' protein ABA02072 has greater local C-terminal similarity to the yeast protein Swi6 than to Mbp1, whereas the N-terminal APSES domain is most similar to yeast Mbp1. However a '''global''' Needleman-Wunsch alignment (BLOSUM 30, gaps: 8.0/1.0) shows greater '''overall''' similarity to yeast Mbp1 than to Swi6. Accordingly I consider this an orthologue to Mbp1 even though its database annotation calls ABA02072  the ''M. grisea'' Swi6 homologue.
  
Note: The ''Magnaporthe grisea'' protein ABA02072 has greater local C-terminal similarity to the yeast protein Swi6 than to Mbp1, whereas the N-terminal APSES domain is most similar to yeast Mbp1. However a '''global''' Needleman-Wunsch alignment (BLOSUM 30, gaps: 8.0/1.0) shows greater '''overall''' similarity to yeast Mbp1 than to Swi6. Accordingly I consider this an orthologue to Mbp1 even though its database annotation calls ABA02072  the ''M. grisea'' Swi6 homologue.
+
'''Note''': For ''Pichia stipitis'', BLAST finds two very similar sequences in GenPept as candidate Mbp1 orthologues; the RefSeq sequence XP_001386821.1 is translated according to the standard code, the entry EAZ62798.2 is translated according to the alternative nuclear code [http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG12 '''12''']. The question had to be considered which translation appears to be correct. This required looking at the conservation of the residues in question in the BLAST alignment; better conservation indeed supports the alternative code translation.
  
Note: For ''Pichia stipitis'', BLAST finds two very similar sequences in GenPept as candidate Mbp1 orthologues; the RefSeq sequence XP_001386821.1 is translated according to the standard code, the entry EAZ62798.2 is translated according to the alternative nuclear code [http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c#SG12 '''12''']. The question had to be considered which translation appears to be correct. This required looking at the conservation of the residues in question in the BLAST alignment; better conservation indeed supports the alternative code translation.
+
'''Note''': The ''Ustilago maydis'' protein XP_762343 (the protein with the systematic name UM06196) is only the second-best hit in the original BLAST list as performed on the genomic BLAST page for the organism. However, local optimal alignment (EMBOSS water) shows a much higher percentage of identity to yeast Mbp1 in the APSES domain than the top BLAST hit (XP_761485, systematic name UM05338), and global alignment (after trimming the N- and C-terminal extensions, respectively) also shows a slightly higher degree of similarity for XP_762343. Accordingly, XP_762343 is considered the Mbp1 orthologue, even though it is the second highest hit according to BLAST. The situation is similar to that of the ''Aspergillus'' species: one protein was reported as a single HSP and one protein was broken into two HSPs. This emphasizes the fact that optimal sequence alignments are not entirely equivalent to BLAST alignments. Further, performing the same search against the "'''nr'''" database and applying an '''Organism''' filter for ''Ustilago maydis'' resulted in '''both''' proteins being split and the correct orthologue having the highest BLAST score in the list. This emphasizes the fact that searches in organism databases are not entirely equivalent to searches in the global database, even if the results are filtered.
 
 
Note: The ''Ustilago maydis'' protein EAK87100 (XP_762343, the protein with the systematic name UM06196) is only the second-best hit in the original BLAST list as performed on the genomic BLAST page for the organism. However, local optimal alignment (EMBOSS water) shows a much higher percentage of identity to yeast Mbp1 in the APSES domain than the top BLAST hit EAK86587 (XP_761485, systematic name UM05338), and global alignment (after trimming the N- and C-terminal extensions, respectively) also shows a slightly higher degree of similarity for EAK87100 than EAK86587. Accordingly, EAK87100 is considered the Mbp1 orthologue, even though it is the second highest hit according to BLAST. The situation is similar to that of the ''Aspergillus'' species: one protein was reported as a single HSP and one protein was broken into two HSPs. This emphasizes the fact that optimal sequence alignments are not entirely equivalent to BLAST alignments. Further, performing the same search against the "'''nr'''" database and applying an '''Organism''' filter for ''Ustilago maydis'' resulted in '''both''' proteins being split and the correct orthologue having the highest BLAST score in the list. This emphasizes the fact that searches in organism databases are not entirely equivalent to searches in the global database, even if the results are filtered.
 
  
 
</small>
 
</small>
  
 
&nbsp;<br>
 
&nbsp;<br>
Our second task is to obtain all FASTA sequences based on a list of identifiers and to save them in a format in which we can use them as input for other programs or services. This is easy: we simply paste all GI numbers as a comma separated list into the Entrez search form and select Display FASTA, send to Text on the results page, then save the contents as a Text file.
 
 
&nbsp;<br>
 
&nbsp;<br>
  
 +
It is easy to obtain all FASTA sequences based on a list of identifiers and to save them in a format in which we can use them as input for other programs or services. We can simply paste '''all GI numbers as a comma separated list''' into the Entrez search form and, on the results page, select '''Display FASTA''' and '''send to Text'''; then save the contents as a text file. This is a multi-FASTA file, suitable for input into MSA programs.
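Before passing a saved file on to another program, it can be worth checking that it really contains one record per identifier. The following is a minimal sketch of a multi-FASTA reader for that purpose; the function name and the two toy records are hypothetical and are not taken from the Mbp1 file.

```python
def parse_multi_fasta(text):
    """Return a list of (header, sequence) tuples from multi-FASTA text.

    A record starts at a line beginning with '>' and continues until the
    next such line; sequence lines are concatenated without whitespace.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence line of the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy input (made-up identifiers and sequences).
example = """>seq1 hypothetical protein one
MKVLA
TGG
>seq2 hypothetical protein two
MSNQ
"""
records = parse_multi_fasta(example)
```

Comparing `len(records)` with the number of identifiers that were pasted into the search form is a quick way to notice a record that was silently dropped.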
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*We have applied the "reciprocal best match criterion" to assert that these sequences are '''orthologues to yeast Mbp1''' and this is how orthologues are commonly defined computationally. Briefly explain why this criterion will distinguish between orthologues and paralogues (when no genes have been lost). Consider at least the following three cases: (''i'') a gene duplication has occurred before a speciation event, (''ii'') a gene duplication in the query organism has occurred after a speciation event, (''iii'') a gene duplication in the target organism has occurred after a speciation event. Use sketches to illustrate the cases. (1 mark)
 
 
*<!-- Clarify: no need to submit anything here -->Review the resulting multi-FASTA file for the  [[All_Mbp1_proteins|'''Mbp1 proteins (linked here)''']] and make sure you understand the procedure that led to it. Depending on your personal learning style you may either carefully review the described procedure, reproduce key steps of the procedure, reproduce the entire procedure paying special attention to the problem cases discussed in the notes, or develop your own procedure. Whatever you do, you must be confident in the end that you could have produced the same input file.<br>
 
  
 +
&nbsp;<br>
 +
<div style="padding: 5px; background: #EEEEEE;">
 +
'''Review''' the resulting multi-FASTA file for the  [[All_Mbp1_proteins|'''Mbp1 proteins (linked here)''']] and make sure you understand the procedure that led to it. Depending on your personal learning style you may either carefully review the described procedure, reproduce key steps of the procedure, reproduce the entire procedure paying special attention to the problem cases discussed in the notes, or develop your own procedure. Whatever you do, you must be confident in the end that you could have produced the same input file. (You do not need to submit documentation for this part of the assignment, but you do need to understand the process.)<br>
 
</div>
 
</div>
&nbsp;<br>
 
Mbp1 orthologues are not the only proteins that contain APSES domains. In order to find all the rest, a PSI-BLAST search was performed using the yeast Mbp1 APSES domain as query. From the list of hits, the APSES domains were extracted and summarized in a file.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Review the resulting file for the  [[All_APSES_domains|'''APSES domains (linked here)''']] and make sure you understand the procedure that was used in its construction, as above.
 
</div>
 
 
&nbsp;<br>
 
&nbsp;<br>
 +
As you have seen from the results of your BLAST searches, Mbp1 orthologues are not the only proteins that contain APSES domains. In order to find all the rest, a PSI-BLAST search was performed using the yeast Mbp1 APSES domain as query. From the list of hits, the APSES domains were extracted and summarized in a file.
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
<div style="padding: 5px; background: #EEEEEE;">
 
+
*'''Review''' the resulting file for the [[All_APSES_domains|'''APSES domains (linked here)''']] and make sure you understand the procedure that was used in its construction, as above.
===(1.2) Orthologues (1 mark)===
 
 
</div>
 
</div>
&nbsp;<br>
 
 
For '''one''' of the APSES domains from your assigned organism, determine whether it is orthologous to a yeast APSES domain, according to the reciprocal-best-match criterion:
 
 
# Choose at random one sequence from the list of [[All_APSES_domains|'''APSES domains''']] from your organism (but not one from an Mbp1 orthologue) and copy its [[All_APSES_domains|sequence]] into the input window of a genomic [http://www.ncbi.nlm.nih.gov/blast/ BLAST] search against ''Saccharomyces cerevisiae'' proteins.
 
# Run the search and determine the gene name of the best hit. (This is the best match.)
 
# The BLAST-retrieved sequence may be truncated on the results page and not cover the entire APSES domain: find the sequence of your best match in yeast in the [[All_APSES_domains| sequence file]]. (Since the file contains all yeast APSES domains, your best match should be in this file, labeled with <code>????_SACCE</code> - except if the match is to Xbp1, which matches only to a part of the canonical APSES domain.)
 
# Copy the APSES domain sequence from the Wiki page and perform the same kind of BLAST search with this yeast sequence, against the proteins in your organism's genome. (This finds the reciprocal match.) If you have found Xbp1 as the best match, use only the matching segment from your BLAST report for the search.
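The logic of the forward and reverse search can be sketched in a few lines. This is only an illustration of the reciprocal-best-match test itself, assuming the BLAST hit lists have already been copied into simple Python dictionaries; the field names (`subject`, `evalue`, `bitscore`), the function names and all identifiers and numbers in the toy data are hypothetical.

```python
def best_hit(hits):
    """Return the subject ID of the hit with the lowest E-value.

    Ties are broken by the higher bit score. `hits` is a list of dicts
    with the (hypothetical) keys 'subject', 'evalue' and 'bitscore'.
    """
    top = min(hits, key=lambda h: (h["evalue"], -h["bitscore"]))
    return top["subject"]


def reciprocal_best_match(query_id, forward_hits, reverse_hits_by_subject):
    """Return (is_rbm, forward_best, reverse_best).

    forward_hits: hits of the query against the yeast proteins.
    reverse_hits_by_subject: for each yeast protein, the hits obtained
    when it is searched back against the query organism's proteins.
    """
    forward_best = best_hit(forward_hits)
    reverse_best = best_hit(reverse_hits_by_subject[forward_best])
    return reverse_best == query_id, forward_best, reverse_best


# Toy data: made-up identifiers and scores, not real BLAST output.
forward = [
    {"subject": "yeastA", "evalue": 1e-30, "bitscore": 120.0},
    {"subject": "yeastB", "evalue": 1e-12, "bitscore": 65.0},
]
reverse = {
    "yeastA": [
        {"subject": "orgX_2", "evalue": 1e-40, "bitscore": 150.0},
        {"subject": "orgX_1", "evalue": 1e-28, "bitscore": 115.0},
    ],
}
outcome = reciprocal_best_match("orgX_1", forward, reverse)
```

In this toy case the reverse search returns a different protein (`orgX_2`) than the one we started from, so the criterion is not fulfilled. Keep in mind the caveat from the notes above: ranking by BLAST score or E-value alone can mislead when hits differ in length or are split into several HSPs.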
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
* Document the process and report briefly what you have found on the forward and on the reverse search. Does the gene you have chosen have an APSES domain that fulfils the ''reciprocal best match'' criterion for orthology with a yeast gene? (1 mark)
 
</div>
 
 
&nbsp;<br>
 
&nbsp;<br>
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
==(2) Align==
+
==(2) Align and annotate==
 
</div>
 
</div>
 +
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
  
Actually performing multiple sequence alignments used to involve downloading and installing software on your own computer. While most tools were available on the Web in principle, many groups have restricted the total number of sequences or the total number of characters to be aligned. The EBI however offers three of the most commonly used tools with few limitations and it was possible to run MSAs for all Mbp1 orthologues jointly.
 
&nbsp;
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===(2.1) Aligning the Mbp1 orthologues (1 mark)===
+
===(2.1) Review of domain annotations===
 
</div>
 
</div>
&nbsp;<br>
 
 
I used the following three servers:
 
 
* [http://www.ebi.ac.uk/clustalw/ '''CLUSTAL-W''']  is a progressive alignment program, it is the most popular, most widely referenced MSA algorithm, it is reasonably fast and easy to use. But alignment errors that are made early in the process can't get corrected and thus CLUSTAL is prone to misalign sets of sequences that have poor (<30% ID) local similarity. It is no longer considered state-of-the-art for carefully done alignments.
 
* [http://www.ebi.ac.uk/muscle/ '''MUSCLE'''] essentially starts out from a CLUSTAL like alignment as a draft, then identifies similar groups of sequences from which it calculates profiles, it then re-aligns the group to the profile. This procedure is iterated.
 
* [http://www.ebi.ac.uk/t-coffee/ '''T-Coffee'''] is one of my favourites - the tradeoffs appear to be especially well balanced. It too starts from a set of pairwise global alignments, like CLUSTAL, then additionally calculates sets of best local alignments. Global and local alignments are then combined to a similarity matrix and based on this matrix a guide-tree is constructed. This determines the order of steps in which sequences are added to the multiple alignment. A nice feature of T-Coffee is color coded output that allows you to quickly judge the local reliability of the alignment.
 
 
We shall perform multiple sequence alignments for all 18 Mbp1 orthologues and compare the results. Since the results will all look the same for the same input file, I have simply prepared them. Of course you are welcome to run an alignment on your own for your own learning experience, but it is not required. The first alignment was run with CLUSTAL.
 
 
[[Image:A03_01.jpg|frame|none|Assignment 3, Figure 01<br>
 
The guide tree computed by CLUSTAL-W. The algorithm uses this tree to determine the best order for its progressive alignment for the 18 Mbp1 orthologue sequences. This tree is based on a matrix of pairwise distances.]]
 
 
Subsequently, sequence alignments were performed with T-Coffee and MUSCLE. However, the input files were re-ordered to correspond to the order of the CLUSTAL output, and the option to order the alignments according to the ''input sequences'' was chosen on the form. This makes it much easier to compare alignments, since all MSAs are displayed in the same relative order.
 
  
 
+
Let us first review some of the features of the yeast Mbp1 protein that we have defined in the second assignment (and some structural features I have compiled from various sources). Below is the yeast Mbp1 sequence with a number of annotations, compiled according to the following procedure.  
The result files are linked here:
 
 
 
* [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
 
* [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
 
* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]] and [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html (coloured according to scores)]
 
 
 
Globally speaking, the alignments are quite similar. Let's first look at the common themes, before we discuss details of the results. The  [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html (score-colored T-COFFEE alignment)] is well suited to look at general relationships between the sequences, since outliers can be easily identified.  For example, if one of the sequences would have a low-scoring domain that aligns poorly to the others of the group, it may be possible that that domain has been acquired in a separate evolutionary event and is not homologous to all others. We would notice an isolated stretch of poorly alignable sequence, i.e. it should be a segment coloured with a low score in a set of otherwise high-scoring segments. Also a gene may have acquired significant lengths of N- or C-terminal extensions which may not be homologous (unless they are the result of an internal duplication).
 
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Review the  [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html (score-colored T-COFFEE alignment)]. Based on this alignment, how do you feel about our initial assertion that these 18 proteins should be considered orthologous<!-- over their entire length-->? (Answer briefly, but with reference to specific evidence in the alignment. Note that this question does not ask about the general level of conservation, but about whether significant segments (of about the length of a domain) do not appear related/alignable at all in regions where the rest of the group are reasonably well conserved.) (1 mark)
 
</div>
 
 
 
&nbsp;
 
&nbsp;
 
 
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
 
 
==(3) Mbp1 orthologues: analysis of full length MSAs==
 
</div>
 
&nbsp;
 
&nbsp;
 
 
 
What do we mean by a ''good'' versus a ''poor'' multiple sequence alignment?
 
 
 
Let us first consider some of the features of the yeast Mbp1 protein that we have defined in the second assignment (and some structural features I have compiled from various sources). Below is the yeast Mbp1 sequence with a number of annotations, compiled according to the following procedure.  
 
  
 
# Performed [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi '''CDD'''] search with yeast Mbp1 protein sequence. This retrieves alignments of Mbp1 with the APSES and the ANKYRIN domains. These are profile based alignments and thus they are more reliable than pairwise alignments.
 
# Performed [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi '''CDD'''] search with yeast Mbp1 protein sequence. This retrieves alignments of Mbp1 with the APSES and the ANKYRIN domains. These are profile based alignments and thus they are more reliable than pairwise alignments.
Line 558: Line 496:
 
   
 
   
 
   
 
   
A '''good''' MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since it is a result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs.  
+
A '''good''' MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since it is a result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. The contiguous features annotated for Mbp1 are left intact.
 +
 
 +
A '''poor''' MSA has many errors in its columns: they contain residues that actually have different functions or structural roles, even though they may look similar to a scoring matrix. It also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted.
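One way to make "features are left intact" versus "features are disrupted" measurable is to count the indels that fall inside each annotated feature. The sketch below does this for one pair of rows taken from an alignment, assuming the features are given as 1-based, inclusive residue ranges of the ungapped reference sequence. The function name, the feature names and the toy rows are hypothetical and do not correspond to the actual Mbp1 annotation.

```python
def indels_in_features(ref_row, other_row, features):
    """Count indel columns that fall inside annotated reference features.

    ref_row, other_row: two equal-length rows of an alignment ('-' = gap).
    features: dict name -> (start, end), 1-based inclusive positions in
              the ungapped reference sequence.
    Returns a dict name -> number of indel columns inside that feature.
    """
    counts = {name: 0 for name in features}
    pos = 0  # number of reference residues consumed so far
    for r, o in zip(ref_row, other_row):
        if r != "-":
            pos += 1
        if (r == "-") == (o == "-"):
            continue  # both residues or both gaps: no indel for this pair
        for name, (start, end) in features.items():
            if r == "-":
                # Insertion relative to the reference: inside the feature
                # only if it lies between two residues of the feature.
                inside = start <= pos < end
            else:
                # Deletion of a reference residue that belongs to the feature.
                inside = start <= pos <= end
            if inside:
                counts[name] += 1
    return counts


# Toy rows and feature ranges (made up for illustration).
ref_row = "ABCD-EFGH"
other_row = "AB-DXEF-H"
features = {"helix": (2, 5), "linker": (6, 8)}
counts = indels_in_features(ref_row, other_row, features)
```

Here two indel columns interrupt the hypothetical helix and one falls into the linker. In a good alignment we would expect the counts for secondary structure elements and functional motifs to be at or near zero, with indels collecting in the connecting segments.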
 +
 
 +
&nbsp;
 +
 
 +
&nbsp;
 +
 
 +
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #EEEEEE;">
 +
*Produce a similar set of annotations for your Mbp1 orthologue protein.
 +
</div>
 +
 
 +
 
 +
&nbsp;
 +
 
 +
&nbsp;
 +
 
 +
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 +
===(2.2) Computing alignments===
 +
</div>
 +
 
 +
&nbsp;
 +
 
 +
Multiple sequence alignments are compute-intensive and this used to require downloading and installing software on your own computer. While most tools were available on the Web in principle, many groups have restricted the total number of sequences or the total number of characters to be aligned. The EBI however offers three of the most commonly used tools with few limitations and it was possible to run MSAs for all Mbp1 orthologues jointly.
 +
 +
The following three servers were used for the alignments:
 +
 
 +
* [http://www.ebi.ac.uk/clustalw/ '''CLUSTAL-W''']  is a progressive alignment program, it is the most popular, most widely referenced MSA algorithm, it is reasonably fast and easy to use. But alignment errors that are made early in the process can't get corrected and thus CLUSTAL is prone to misalign sets of sequences that have poor (<30% ID) local similarity. It is no longer considered state-of-the-art for carefully done alignments.
 +
* [http://www.ebi.ac.uk/muscle/ '''MUSCLE'''] essentially starts out from a CLUSTAL like alignment as a draft, then identifies similar groups of sequences from which it calculates profiles, it then re-aligns the group to the profile. This procedure is iterated.
 +
* [http://www.ebi.ac.uk/t-coffee/ '''T-Coffee'''] is one of my favourites - the tradeoffs appear to be especially well balanced. It too starts from a set of pairwise global alignments, like CLUSTAL, then additionally calculates sets of best local alignments. Global and local alignments are then combined to a similarity matrix and based on this matrix a guide-tree is constructed. This determines the order of steps in which sequences are added to the multiple alignment. A nice feature of T-Coffee is color coded output that allows you to quickly judge the local reliability of the alignment.
 +
 
 +
Multiple sequence alignments were performed for all 18 Mbp1 orthologues to compare the results. Since the results would all look the same for the same input file, I have simply posted them here. Of course you are welcome to run an alignment on your own for your own learning experience, or to find an alternative program.
  
A '''poor''' MSA has many errors in its columns: they contain residues that actually have different functions or structural roles, even though they may look similar to a scoring matrix. It also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities.
+
The first alignment was run with CLUSTAL.
 +
 
 +
[[Image:A03_01.jpg|frame|none|Assignment 3, Figure 01<br>
 +
The guide tree computed by CLUSTAL-W. The algorithm uses this tree to determine the best order for its progressive alignment for the 18 Mbp1 orthologue sequences. This tree is based on a matrix of pairwise distances.]]
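How a matrix of pairwise distances turns into an order of alignment steps can be sketched as follows. Note the substitution: CLUSTAL-W builds its guide tree by neighbour-joining, whereas this sketch uses the simpler average-linkage (UPGMA-style) clustering, which is enough to show the principle that the closest sequences are joined first. The function name, the sequence labels and the distances are hypothetical.

```python
from itertools import combinations


def average_linkage_order(names, dist):
    """Return the list of merges in the order a progressive aligner would use.

    names: list of unique sequence labels.
    dist:  dict mapping frozenset({a, b}) -> distance between labels a and b.
    Each merge is a pair of clusters; a cluster is a tuple of labels.
    Simplification (assumption): average linkage instead of the
    neighbour-joining method that CLUSTAL-W actually uses.
    """
    def cluster_distance(c1, c2):
        # Mean of all leaf-to-leaf distances between the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters and join them.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Toy distances between four made-up sequences.
names = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("C", "D")): 0.2,
    frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.8,
    frozenset(("B", "C")): 0.8,
    frozenset(("B", "D")): 0.8,
}
merges = average_linkage_order(names, dist)
```

The first merges pair up the most similar sequences, and only the last step aligns the two groups with each other. This is also why an error made in an early step of a progressive alignment is carried through all later steps.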
 +
 
 +
Subsequently, sequence alignments were performed with T-Coffee and MUSCLE. For these two, the input files were re-ordered to correspond to the order of the CLUSTAL output, and the option to order the alignments according to the ''input sequences'' was chosen on the form. This makes it much easier to compare alignments, since all MSAs are displayed in the same relative order.
 +
 
 +
Finally I have merged the domain annotations for the yeast Mbp1 protein into the output files.
 +
 
 +
The result files are linked here:
 +
 
 +
* [[All_Mbp1_CLUSTAL_annotated|Mbp1 proteins '''CLUSTAL''' aligned]]
 +
* [[All_Mbp1_MUSCLE_annotated|Mbp1 proteins '''MUSCLE''' aligned]]
 +
* [[All_Mbp1_T-COFFEE_annotated|Mbp1 proteins '''T-Coffee''' aligned (text version)]] or <small>[http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html Mbp1 proteins '''T-Coffee aligned'''] (coloured according to scores)</small>
 +
 
 +
Globally speaking, the alignments are quite similar. Let's first look at the common themes, before we discuss details of the results. The  [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html (score-colored T-COFFEE alignment)] is well suited to look at general relationships between the sequences, since outliers can be easily identified.  For example, if one of the sequences would have a low-scoring domain that aligns poorly to the others of the group, it may be possible that that domain has been acquired in a separate evolutionary event and is not homologous to all others. We would notice an isolated stretch of poorly alignable sequence, i.e. it should be a segment coloured with a low score in a set of otherwise high-scoring segments. Also a gene may have acquired significant lengths of N- or C-terminal extensions which may not be homologous (unless they are the result of an internal duplication).
 +
 
 +
&nbsp;<br>
 +
 
 +
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 +
*'''Review''' the  [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html (score-colored T-COFFEE alignment)]. Based on this alignment, how do you feel about our initial assertion that these 18 proteins should be considered orthologous over their entire length? <small>You do not need to discuss this in the assignment but you should study the evidence in the alignment. Note that this question does not ask about the general level of conservation, but about whether significant segments (of about the length of a domain) do not appear related/alignable at all in regions where the rest of the group are reasonably well conserved.</small>
 +
</div>
  
In order to evaluate the MSAs for our proteins, we will analyze alignments relative to the features we have annotated above.
+
&nbsp;
 
&nbsp;
 
&nbsp;
 +
 +
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 +
 +
==(3) Mbp1 orthologues: analysis of full length MSAs==
 +
</div>
 +
&nbsp;
 +
&nbsp;
 +
 +
 +
  
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===(3.1)  APSES domains (1 mark)===
+
===(3.1)  APSES domains===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
The APSES domains in all of our Mbp1 orthologues are highly conserved and any program must be able to align such obviously similar regions.
+
 
 +
The APSES domains in all of our Mbp1 orthologues are highly conserved and pretty much any alignment program must be able to align such obviously similar regions.
  
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
*Consider the CLUSTAL, Muscle and T-Coffee alignments of the Mbp1 orthologues.  Orient yourselves as to where the APSES domains are located. Briefly note whether the three alignments agree and, for one of the alignments, whether the charged residues in the proposed binding region are wholly or partially conserved across all 18 proteins. (Refer to the specific residues labelled (+) or (-) in the Mbp1 annotation above). (1 mark) <!-- Sequence variation may indicate variations in binding site -->
+
*Consider the CLUSTAL, Muscle and T-Coffee alignments of the Mbp1 orthologues.  Orient yourselves as to where the APSES domains are located. For one alignment, '''review''' whether the charged residues in the proposed binding region are wholly or partially conserved across all 18 proteins. Refer to the specific residues labelled (+) or (-) in the Mbp1 annotation above. <!-- Sequence variation may indicate variations in binding site -->
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Line 583: Line 584:
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
===(3.2)  Ankyrin domains (1 mark)===
+
===(3.2)  Ankyrin domains===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Line 590: Line 591:
  
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
*For one of the alignments of your choice (CLUSTAL, T-coffee or MUSCLE), identify the helices in the Ankyrin repeat region of Mbp1, based on the annotations given above. (This is probably easiest done by pasting that part of the alignment into a word-processor and highlighting the residues you are discussing). Briefly state whether the indels in this region are concentrated in segments that connect the helices, or if they are more or less evenly distributed along the entire region of similarity. Conclude whether the assertion that ''indels should not be placed in elements of secondary structure'' has merit in this case; in particular if you notice indels that violate this rule-of-thumb, consider whether the location of the indel has strong support from aligned sequence motifs, or whether it could apparently be placed  into a different location whithout much loss in alignment quality. Support your conclusions with specific reference to particular elements of the alignment. (1 mark)
+
*Compare the distribution of indels in the ankyrin repeat regions of the three alignments. '''Review''' whether the indels in this region are concentrated in segments that connect the helices, or if they are more or less evenly distributed along the entire region of similarity. Think about whether the assertion that ''indels should not be placed in elements of secondary structure'' has merit in this case; in particular if you notice indels that violate this rule-of-thumb (that have been placed into structurally annotated regions of secondary structure), consider whether the location of the indel has strong support from aligned sequence motifs, or whether it could apparently be placed into a different location without much loss in alignment quality.
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Line 605: Line 606:
 
Aligning functional features like ''coiled coil domains'' or ''intrinsically disordered regions'' is even more difficult, since this is to a large degree a property of the amino acid composition, not so much of the precise sequence. Thus we would expect it to be difficult to detect the correspondence between sequences in such regions.  I have annotated four low complexity regions of the yeast Mbp1 sequence.
 
Aligning functional features like ''coiled coil domains'' or ''intrinsically disordered regions'' is even more difficult, since this is to a large degree a property of the amino acid composition, not so much of the precise sequence. Thus we would expect it to be difficult to detect the correspondence between sequences in such regions.  I have annotated four low complexity regions of the yeast Mbp1 sequence.
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
&nbsp;<br><div style="padding: 5px; background: #FFCC99;">
*Copy the Mbp1 sequence from your organism from the multi-FASTA files and run a [http://smart.embl-heidelberg.de/ SMART] sequence analysis: paste your FASTA formatted sequence (or its Uniprot accession number), check only the checkbox for detecting '''intrinsic protein disorder''' and click "Sequence SMART". Locate the segments of '''low complexity''' for your sequence (they are in the lower part of the results page since they overlap with disordered segments). Now comment on '''one''' of the multiple sequence alignments: does your protein '''have''' similar low complexity regions as <code>Mbp1_SACCE</code>, and have these regions been '''aligned''' by the MSA algorithm? Briefly describe the situation: state whether these segments are found in the same general region, in the same detailed location, or perhaps even conserved in sequence, when you compare them to the ''saccharomyces cerevisiae'' protein.  Backup your conclusions with specific reference to particular elements of the alignment.
+
;Analysis (1 mark)
 +
 
 +
*Refer to your annotation of your organism's Mbp1 orthologue. Comment on '''one''' of the multiple sequence alignments: does your protein '''have''' low complexity regions similar to those of <code>Mbp1_SACCE</code>, and have these regions been '''aligned''' by the MSA algorithm? Briefly describe the situation: state whether these segments are found in the same general region, in the same detailed location, or perhaps even conserved in sequence, when you compare them to the ''Saccharomyces cerevisiae'' protein.  Back up your conclusions with specific reference to particular elements of the alignment.
  
* Briefly discuss whether this observation should lead you to conclude that disorder in these proteins appears to be a conserved feature, i.e. a feature that is selected for in evolution. (1 mark)
+
* Briefly discuss whether this observation should lead you to conclude that disorder in these proteins appears to be a conserved feature, i.e. that disorder can be a functional feature that is selected for in evolution. If this is the case, consider whether the disordered segments appear to be homologous or analogous or whether the data do not allow a conclusion.
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Line 629: Line 632:
 
&nbsp;<br>
 
&nbsp;<br>
  
You have read how to generate a source sequence file based on the results of a PSI-BLAST search for all APSES domains in fungi. Of course, since PSI-BLAST has detected these sequences due to their high-similarity to a sequence profile, this similarity implies an alignment; this is a model based MSA because the sequences are aligned to a protoypic model and not to each other. To align these domains the MUSCLE server is the tool of choice for such highly diverged sequences. For comparison, a CLUSTAL alignment has been computed as well.
+
You have read how to generate a source sequence file based on the results of a PSI-BLAST search for all APSES domains in fungi. Of course, since PSI-BLAST has detected these sequences due to their high similarity to a sequence profile, this similarity implies an alignment. In this case this is a '''model based MSA''' because the sequences are aligned to a prototypic model (the sequence profile) and not to each other. To align these highly diverged sequences the MUSCLE server is the tool of choice. For comparison, a CLUSTAL alignment has been computed as well.
  
 
* The [[APSES_domains_PSI-BLAST| resulting alignment derived from the '''PSI-BLAST''' profile]] as an example of a model-based alignment. <small>Note that PSI-BLAST has not been optimized to work as an alignment program, thus the conclusion that model-based alignments are inferior because this example is a poor alignment is not justified.</small>
 
* The [[APSES_domains_PSI-BLAST| resulting alignment derived from the '''PSI-BLAST''' profile]] as an example of a model-based alignment. <small>Note that PSI-BLAST has not been optimized to work as an alignment program, thus the conclusion that model-based alignments are inferior because this example is a poor alignment is not justified.</small>
Line 639: Line 642:
 
&nbsp;
 
&nbsp;
  
 +
<!--
 
===(4.1)  Manual improvement  (1 mark)===
 
===(4.1)  Manual improvement  (1 mark)===
  
Line 706: Line 710:
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
 +
 +
-->
 +
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
===(4.2)  Patterns of residue conservation (1 mark)===
+
===(4.2)  Patterns of residue conservation===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Line 723: Line 730:
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
Consider any '''one''' of the three APSES domain alignments.   
 
Consider any '''one''' of the three APSES domain alignments.   
 +
*'''Review''' whether the patterns of sequence variation for ''functionally conserved'' residues are compatible with the notion that orthologues have conserved binding specificities and paralogues have acquired new functions by binding to different sequences.
 +
*'''Review''' whether the patterns of sequence variation for ''structurally conserved'' residues are compatible with the notion that all APSES domains have a common fold.
  
*Are the patterns of sequence variation for ''functionally conserved'' residues compatible with the notion that orthologues have conserved binding specificities and paralogues have acquired new functions by binding to different sequences?
+
To approach these questions systematically, define (with reference to specific sequences and residues) what you would expect (hypothesis) and whether the alignment supports or contradicts your expectations (observation). We have determined that the sequences labelled as Mbp1 are orthologues, and the other labels were constructed to identify the yeast gene that each sequence is most similar to. This means you may group Mbp1 sequences as orthologues; Swi4, Sok2, and Phd1 sequences are presumably orthologous; and all sequences originating from the same organism are of course groups of paralogues. However, labels such as MbpA, MbpB etc. are arbitrary: these sequences as a group are paralogous to e.g. Mbp1 but not necessarily orthologous to each other.
*Are the patterns of sequence variation for ''structurally conserved'' residues compatible with the notion that all APSES domains have a common fold? (1 mark)
 
 
 
For both cases, state briefly (but with reference to specific sequences and residues) what you would expect (hypothesis) and whether the alignment supports or contradicts your expectations (observation). We have determined that the sequences labelled as Mbp1 are orthologues, and the other labels were constructed to identify the yeast gene that each sequence is most similar to (although a reciprocal search was not done). This means you may group Mbp1 sequences as orthologues, Swi4, Sok2, and Phd1 sequences are presumably orthologous, and all sequences originating from the same organism are of course groups of paralogues. However, labels such as MbpA, MbpB etc. are arbitrary: these sequences as a group are paralogous to e.g. Mbp1 but not necessarily orthologous to each other. Your discussion ''may'' be easier if you sort the sequences differently than they are presented, this is easy to do in a text editor. Re-sorting does not change the alignment.
 
 
</div>
 
</div>
  
Line 734: Line 740:
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===(4.3)  Visualization and analysis of alignment with VMD (2 marks)===
+
===(4.3)  Visualization and analysis of alignment with VMD===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
VMD offers a very well constructed set of tools for the analyis of sequence and structural conservation: the '''MultiSeq''' extension. In this part of the assignment you will use VMD to analyse and visualize conservation patterns and comment on the alignments the servers have produced. I highly recommend to familiarize yourself with MultiSeq and the developers have produced an [http://www.ks.uiuc.edu/Training/Tutorials/#evolution excellent tutorial on the evolution of tRNA synthetases] to showcase the program's capabilities. However I am not ''requiring'' this for the course and we will be using only a subset of the available Multiseq functions. The tool is intuitive enough, beginning to use it should require no more than following the steps below.
+
VMD offers a very well constructed set of tools for the analysis of sequence and structural conservation: the '''MultiSeq''' extension. In this part of the assignment you will use VMD to analyse and visualize conservation patterns and comment on the alignments the servers have produced. I highly recommend that you familiarize yourself with MultiSeq; the developers have produced an [http://www.ks.uiuc.edu/Training/Tutorials/#evolution excellent tutorial on the evolution of tRNA synthetases] to showcase the program's capabilities. However, I am not ''requiring'' that you go through the tutorial and we will be using only a subset of the available MultiSeq functions. The tool is intuitive enough that beginning to use it should require no more than following the steps below.
  
 
Proceed through the following steps:
 
Proceed through the following steps:
 
:(1) Save an alignment of the APSES domains on your computer.
 
:(1) Save an alignment of the APSES domains on your computer.
::(A) Choose either the CLUSTAL or MUSCLE alignment of all APSES domains, copy it from the Wiki page and save it on your computer, as a '''text file''' with some convenient filename and the extension .aln . This is a CLUSTAL formatted input file.  
+
::(A) Access the MUSCLE alignment of all APSES domains, copy it from the Wiki page and save it on your computer, as a '''text file''' with some convenient filename and the extension .aln . This is a CLUSTAL formatted input file.  
 
::(B) Edit the file to remove any header lines and lines containing the conservation symbols <code> .:*</code>. Leave the gene-names and aligned sequences as they are. Make sure you are not saving the file in MS-Word binary format (.doc) and that the extension is not changed (depending on how your computer is configured, it may silently append a <code>.txt</code> extension that will cause trouble later on).
 
::(B) Edit the file to remove any header lines and lines containing the conservation symbols <code> .:*</code>. Leave the gene-names and aligned sequences as they are. Make sure you are not saving the file in MS-Word binary format (.doc) and that the extension is not changed (depending on how your computer is configured, it may silently append a <code>.txt</code> extension that will cause trouble later on).
  
 
:(2) Open the Multiseq extension in VMD.
 
:(2) Open the Multiseq extension in VMD.
::(A) start VMD and load one of the APSES domain structures (1BM8 or 1MB1).
+
::(A) start VMD and load the 1MB1 APSES domain structure.
 
::(B) choose a stereo representation that will show you the fold of the domain and the sidechains of key residues. For example you could use a Tube representation for the protein backbone and a Licorice representation for the selection <code>((sidechain or type CA) and not element H) and resid 30 to 90</code>.  (And switch the axes display off! The axes carry no information you need).
 
::(B) choose a stereo representation that will show you the fold of the domain and the sidechains of key residues. For example you could use a Tube representation for the protein backbone and a Licorice representation for the selection <code>((sidechain or type CA) and not element H) and resid 30 to 90</code>.  (And switch the axes display off! The axes carry no information you need).
 
::(C) On the VMD Main form navigate to Extensions &rarr; Analysis &rarr; MultiSeq
 
::(C) On the VMD Main form navigate to Extensions &rarr; Analysis &rarr; MultiSeq
 
::(D) When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
 
::(D) When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
::(E) A window will appear - the ''MultiSeq'' window -it contains the sequence of the APSES domain you are visualizing. MultiSeq will also generate an additional cartoon representation of the structure.
+
::(E) A window will appear - the ''MultiSeq'' window - it contains the sequence of the APSES domain you are visualizing. MultiSeq will also generate an additional cartoon representation of the structure.
  
 
:(3) Load the APSES alignment.
 
:(3) Load the APSES alignment.
Line 757: Line 763:
 
::(C) find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the '''Sequences''' list with your mouse (the list is not static, you can re-order the sequences in any way you like).
 
::(C) find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the '''Sequences''' list with your mouse (the list is not static, you can re-order the sequences in any way you like).
  
You will see that the stucture's sequence and the APSES domain sequence do not match; at the beginning the structure has extra sequence extending its N-terminus and in the middle the APSES sequences have gaps inserted.
+
You will see that the 1MB1 sequence and the APSES domain sequence do not match; at the beginning the structure has extra sequence extending its N-terminus, and in the middle the APSES sequences have gaps inserted.
  
:(4) Bring the structure's sequence in register with the APSES alignment.
+
:(4) '''Bring the 1MB1 sequence in register with the APSES alignment'''.
::(A) MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the '''entire first column''' of the sequence group.
+
::(A) MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the '''entire first column''' of the sequences you have imported.
 
::(B) Select Edit &rarr; Enable Editing... &rarr; Gaps only to allow changing indels.  
 
::(B) Select Edit &rarr; Enable Editing... &rarr; Gaps only to allow changing indels.  
::(C) Pressing the spacebar once should insert a gap character before the selected column in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of the structure <code>S&nbsp;I&nbsp;M&nbsp;...</code>.
+
::(C) Pressing the spacebar once should insert a gap character before the selected column in all sequences. Insert as many gaps as you need to align the beginning of sequences with the corresponding residues of 1MB1 <code>S&nbsp;I&nbsp;M&nbsp;...</code>.
::(D) Now insert as many gaps as you need into the '''structure''' sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. (Simply select residues in the sequence and use the space bar to insert gaps. <small>(Note: I have noticed a bug that sometimes prevents slider or keuyboard input to the MultiSeq window; it fails to ''regain focus'' after operations in a different window. I don't know whether this is a Mac related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)</small>
+
::(D) Now insert as many gaps as you need into the '''structure''' sequence, to align it completely with the Mbp1_SACCE APSES domain sequence. (Simply select residues in the sequence and use the space bar to insert gaps.) <small>(Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to ''regain focus'' after operations in a different window. I don't know whether this is a Mac-related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)</small>
 
::(E) When you are done, it may be prudent to save the state of your alignment. Use File &rarr; Save Session...
 
::(E) When you are done, it may be prudent to save the state of your alignment. Use File &rarr; Save Session...
  
Line 780: Line 786:
  
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
&nbsp;<br><div style="padding: 5px; background: #FFCC99;">
+
;Analysis (1 mark)
 +
 
 
*Generate two parallel stereo views that show the APSES domain backbone and selected sidechains as described above. One should be colored by sequence similarity among all APSES domains, the other by similarity among only the Mbp1 orthologues. Scale and rotate the structure so that the putative DNA binding domain is easily visible. Paste both views into your assignment in a compressed format, as was explained for Assignment 2.
 
*Generate two parallel stereo views that show the APSES domain backbone and selected sidechains as described above. One should be colored by sequence similarity among all APSES domains, the other by similarity among only the Mbp1 orthologues. Scale and rotate the structure so that the putative DNA binding domain is easily visible. Paste both views into your assignment in a compressed format, as was explained for Assignment 2.
  
Line 787: Line 794:
  
 
*Briefly discuss how the situation changes when you compare only Mbp1 orthologues with each other. Never mind that overall conservation is higher: does the '''distribution''' of conserved residues in the context of the domain change, and if so, how? Are the patterns of sequence variation for ''functionally conserved'' residues compatible with the notion that all Mbp1 orthologues have a similar function?
 
*Briefly discuss how the situation changes when you compare only Mbp1 orthologues with each other. Never mind that overall conservation is higher: does the '''distribution''' of conserved residues in the context of the domain change, and if so, how? Are the patterns of sequence variation for ''functionally conserved'' residues compatible with the notion that all Mbp1 orthologues have a similar function?
 
*The structure makes it easy to confirm where gaps in the alignment have been placed. Discuss briefly (but with reference to specific instances) whether the indel placements of CLUSTAL or MUSCLE appear more plausible. To do this, define where you would expect to find indels and where they have been placed by the MSA program. (2 marks total)
 
 
 
</div>
 
</div>
  
Line 819: Line 823:
 
;Alignments
 
;Alignments
 
:'''Mbp1 proteins:'''
 
:'''Mbp1 proteins:'''
:* [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
+
:* [[All_Mbp1_CLUSTAL_annotated|Mbp1 proteins '''CLUSTAL''' aligned]]
:* [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
+
:* [[All_Mbp1_MUSCLE_annotated|Mbp1 proteins '''MUSCLE''' aligned]]
:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
+
:* [[All_Mbp1_T-COFFEE_annotated|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
 
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html Mbp1 proteins '''T-Coffee''' aligned (coloured according to scores)]
 
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/T-coffee_scores.html Mbp1 proteins '''T-Coffee''' aligned (coloured according to scores)]
  
Line 837: Line 841:
 
</div>
 
</div>
  
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2006@googlegroups.com Course Mailing List]
+
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2008@googlegroups.com Course Mailing List]

Revision as of 01:38, 15 October 2008

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 


   


   

Assignment 3 - Multiple Sequence Alignment

   

Preparation, submission and due date

Read carefully.
Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which people have simply overlooked crucial questions. Sadly, we always get assignments back in which people have not described procedural details. If you did not notice that the above were two different sentences, you are still not reading carefully enough.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Monday, October 27, at 10:00 in the morning.

   


Your documentation for the procedures you follow in this assignment will be worth 1 mark.


   


Introduction  

Take care of things, and they will take care of you.
Shunryu Suzuki

Much of what we know about a protein's physiological function is based on the conservation of that function as the organism evolves. We assess conservation by comparison to related proteins. Conservation - or variability - is a consequence of selection under constraints: the multiple effects on an organism's fitness function that are induced through changes to the structural or functional features of a protein. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, peaks of variability that emphasize domain boundaries in multi-domain proteins, or amino acid propensities as powerful predictors for protein engineering and design.

Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of the essential properties of a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for

  • functional annotation;
  • protein homology modeling;
  • phylogenetic analyses, and
  • sensitive homology searches in databases.
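
As a rough illustration of what "measuring conservation" from an alignment can mean, the following minimal sketch computes the Shannon entropy of each alignment column (low entropy means high conservation). The function names and the toy sequences are invented for this example and are not part of the assignment data; real analyses usually also weight sequences and account for similarity between amino acid types.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gap characters."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(aligned_seqs):
    """Return one entropy value per column of an MSA given as equal-length strings."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy([s[i] for s in aligned_seqs]) for i in range(length)]


# Toy alignment (invented sequences, NOT taken from the Mbp1 data).
toy_msa = ["MKV-A", "MKLSA", "MRV-A"]
profile = conservation_profile(toy_msa)
```

Columns in which all sequences carry the same residue get an entropy of zero; columns with more variation get larger values. Note that this only makes sense if the columns really contain residues that evolved in the same context, which is exactly why the quality of the alignment matters.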

As a first step, we will explore the search and retrieval of fungal proteins that are orthologous to yeast Mbp1, and of the APSES domains they contain. We will search the Entrez Protein database using BLAST and PSI-BLAST.

Then we will align and annotate sequences.

It is remarkable that by far the most frequently used MSA algorithm is CLUSTAL, a procedure that was first published for the microprocessors of the late 1980s, surpassed in performance many times, and shown to be significantly inferior to more modern approaches when aligning sequences with about 30% identity or less.

  • A model-based approach (based on the PSSM that PSI-BLAST generates)
  • A progressive alignment - the CLUSTAL algorithm
  • A consistency based alignment - T-Coffee, MUSCLE or Probcons


(1) Mbp1 homologue

   

In Assignment 2 you retrieved the protein sequence of Saccharomyces cerevisiae Mbp1 and defined its APSES domain. Let us now search for an orthologue of this sequence in your organism. (Whenever this assignment mentions your organism, this means the organism that is listed with your project group.)

  • Navigate to the NCBI's main BLAST page.
  • Follow the link to the list of all genomic BLAST databases.
  • Click on the (B) icon next to Fungi.
  • On the genomic BLAST page, identify your organism, check the box next to its name and run a BLAST search
    • of the full length yeast Mbp1 protein sequence,
    • against the protein database,
    • using the blastp algorithm and default parameters.
  • (Remember to record the parameters for your search).

Familiarize yourself with the output form you obtain; this is by far the most frequently used bioinformatics result page. Here is a list of things to look for, all of which I expect you to know and understand (you do not need to comment on these points in your submission!):

On the alignment image
  • What do the different colored bars mean?
  • What is the information you get when you "mouse-over" a colored bar on the alignment image?
  • What happens when you click on one of the bars?
In the description list
  • Where does the link next to an identifier take you?
  • Where does the link in the "score" column take you?
  • What does the icon at the end of each row mean? What other icons could appear there?
In the alignment section
  • What happens if you select three or four matches and then click Get selected sequences?
  • Why is there sometimes more than one alignment result given with one sequence identifier?
  • What do the alignment metrics mean:
    • Score?
    • Expect (E-value)?
    • Identities?
    • Positives?
    • Gaps?
  • What is the alignment length?
  • Which sequence is labelled Query and which one is labelled Sbjct?
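
To make some of the alignment metrics above concrete, here is a minimal sketch that counts alignment length, identities and gaps for a pairwise alignment written as two gapped strings. The function name and the two short sequences are made up for illustration. "Positives" are deliberately left out, since they depend on the scoring matrix used in the search, and Score and E-value cannot be derived from the aligned strings alone.

```python
def alignment_stats(query, sbjct):
    """Count length, identities and gaps of a pairwise alignment given as two gapped strings."""
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have equal length")
    # An identity is a column in which both sequences have the same residue (not a gap).
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    # A gap column is one in which either sequence has a gap character.
    gaps = sum(1 for q, s in zip(query, sbjct) if q == "-" or s == "-")
    return {"length": len(query), "identities": identities, "gaps": gaps}


# Invented example strings (hypothetical, not real BLAST output).
stats = alignment_stats("MSNQIY-SARY", "MSDKIYHSAKY")
```

Here `stats` reports 7 identities and 1 gap over an alignment length of 11, which a BLAST-style report would print as "Identities = 7/11" and "Gaps = 1/11".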


(1) Mbp1 orthologue

   

You have identified homologues to yeast Mbp1 in your organism, but which one of these (if any) is an orthologue?

  • Perform a reciprocal BLAST search with your highest scoring hit and note whether the reciprocal best match criterion has been fulfilled.
  • Repeat the procedure (yeast → your organism → yeast) with only the Mbp1 APSES domain sequence that you have defined in Assignment 2.
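
The reciprocal best match criterion can be written down as a small piece of logic. The sketch below assumes you have already collected the hits of each search by hand as lists of (identifier, E-value) pairs; the identifiers and E-values are hypothetical placeholders, and ranking by E-value alone is a simplification of how you would judge real BLAST output.

```python
def best_hit(hits):
    """Return the identifier with the lowest E-value from a list of (identifier, evalue) pairs."""
    if not hits:
        return None
    return min(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_match(query_id, forward_hits, reverse_hits_by_subject):
    """True if the best forward hit finds query_id again as its own best hit."""
    top = best_hit(forward_hits)
    if top is None:
        return False
    return best_hit(reverse_hits_by_subject.get(top, [])) == query_id


# Hypothetical search results: yeast -> your organism, then your organism -> yeast.
forward = [("hit_A", 1e-80), ("hit_B", 1e-30)]
reverse = {"hit_A": [("Mbp1_SACCE", 1e-78), ("Swi4_SACCE", 1e-25)]}
fulfilled = is_reciprocal_best_match("Mbp1_SACCE", forward, reverse)
```

In this invented example the criterion is fulfilled, because the top hit in your organism retrieves the yeast query as its own best match. If the reverse search had returned a different yeast protein first, the two sequences would be homologues but not reciprocal best matches.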

You should have noticed a number of APSES domains in your organism, apparently related to Mbp1. As well, there are matches of the full length yeast gene to your organism's genome.

 

Analysis (1 mark)
  • Based on your results, comment briefly on whether your organism appears to have an orthologue of the entire Mbp1 gene and/or only of the Mbp1 APSES domain.


Now compare the (empirical, local) BLAST alignment with a Needleman-Wunsch (optimal, global) sequence alignment. Use the correct algorithm from the set of EMBOSS tools.

  • Retrieve the full-length sequence of the orthologue to yeast Mbp1 in your organism, and use an online tool to generate an optimal global alignment between this and S. cerevisiae Mbp1. You have to figure out where to find a Web service that does such alignments, what the name of the algorithm is that you should use and how to define reasonable parameters for the alignment.
  • Review if and how the alignments are different, or whether the two alignment algorithms have given essentially the same results.

Note: When I instruct you to review..., I do not require you to include your conclusions in the submitted assignment. However I expect you to be familiar with the analysis and to be able to answer questions on the process and the conclusions.
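
To see what "optimal, global" means in practice, here is a minimal sketch of the Needleman-Wunsch dynamic programming scheme. It is a simplification under stated assumptions: it uses a plain match/mismatch score and a linear gap penalty (the values are arbitrary choices for this example), whereas the online tools you will use typically work with an amino acid substitution matrix and separate gap opening and extension penalties. It is meant to illustrate the principle, not to replace the Web service.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy strings, chosen only to keep the example small.
nw_score, nw_a, nw_b = needleman_wunsch("ACGT", "AGT")
```

Because every residue of both sequences must appear in the result, the alignment always spans the full length of both inputs. This is the key difference from a local (BLAST) alignment, which reports only the high-scoring segments.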


 
 


(1.1) Input data for multiple alignments

Preparing an input set of sequences for a multiple sequence alignment is essentially more of the same. It includes

  • searching a query sequence across a database subset of interest,
  • retrieving orthologue and/or paralogue sequences,
  • validating BLAST alignments if needed,
  • trimming the sequences to a particular region of interest, and
  • saving the result as a multi-FASTA formatted file.

To shortcut some of the repetitive work, I have generated a reference list of Mbp1 orthologues using the canonical procedure defined below: (departures from the procedure are noted below the table). You might want to check whether you have found the same sequence, and if your sequence is the correct orthologue and mine is not, let me know.


  1. Retrieved the Mbp1 protein sequence by searching Entrez for Mbp1 AND "saccharomyces cerevisiae"[organism]
  2. Clicked on the RefSeq tab to find the RefSeq ID "NP_010227"
  3. Accessed the BLAST form, followed the link to the list of all genomic BLAST databases and clicked on the (B) icon, next to Fungi to navigate to the Fungi Genomic BLAST page.
  4. Pasted "NP_010227" into the query field. Chose Protein for both Query and Database, kept default parameters but set the Filter option to none. Clicked on the check-box of each of the fungal species we have considered in the previous assignment. Ran BLAST.
  5. On the results page, checked the checkbox next to the alignment to select the most significant hit from each organism we are studying.
  6. Clicked on the "Get selected sequences" button.
  7. Separately searched for sequences from organisms that were either not included in the list or for which no hits were reported. Verified all ambiguous cases, as explained in the notes below.
  8. Verified that each of these sequences finds Mbp1 as the best match in the Saccharomyces cerevisiae genome by clicking on each "Blink" (click for example) in the retrieved list. Scrolled down the list to confirm that the top hit of a Saccharomyces cerevisiae protein is indeed Mbp1 (NP_010227).
  9. Obtained UniProt accessions for all sequences, with a single query using the UniProt ID mapping service. This service accepts a comma delimited list of RefSeq IDs, GI numbers or GenPept accession numbers and returns a list of UniProt accession numbers.

Since it was thus confirmed that each of these sequences is the protein that is most similar to yeast Mbp1 in its respective organism's genome, and that yeast Mbp1 is the most similar yeast protein to each of them, they all fulfil the criterion of a reciprocal best match with yeast Mbp1. Accordingly we can postulate that this list contains the fungal orthologues to Mbp1.


 
 

Mbp1 and its orthologues
Organism                   CODE   GI         NCBI          UniProt        Most similar yeast gene
Aspergillus fumigatus      ASPFU  70999021   XP_754232     Q4WYQ9_ASPFU   Mbp1
Aspergillus nidulans       ASPNI  67525393   XP_660758     Q5B8H6_EMENI   Mbp1
Aspergillus terreus        ASPTE  115391425  XP_001213217  Q0CQJ5_ASPTN   Mbp1
Candida albicans           CANAL  68465714   XP_722925     Q5ANP5_CANAL   Mbp1
Candida glabrata           CANGL  50286059   XP_445458     Q6FWD6_CANGA   Mbp1
Coprinopsis cinerea        COPCI  169861520  XP_001837394  A8NYC6         Mbp1
Cryptococcus neoformans    CRYNE  134110416  XP_776035     Q5KHS0_CRYNE   Mbp1
Debaryomyces hansenii      DEBHA  50420495   XP_458784     Q6BSN6_DEBHA   Mbp1
Eremothecium gossypii      EREGO  45199118   NP_986147     Q752H3_ASHGO   Mbp1
Gibberella zeae            GIBZE  46116756   XP_384396     UPI000023DBF3  Mbp1
Kluyveromyces lactis       KLULA  50308375   XP_454189     MBP1_KLULA     Mbp1
Magnaporthe grisea         MAGGR  74274844   ABA02072      Q3S405_MAGGR   Mbp1*
Neurospora crassa          NEUCR  164424100  XP_962967     Q7SBG9         Mbp1
Pichia stipitis            PICST  126275256  XP_001386821  A3GHD6_PICST   Mbp1
Saccharomyces cerevisiae   SACCE  6320147    NP_010227     MBP1_YEAST     Mbp1
Schizosaccharomyces pombe  SCHPO  19113944   NP_593032     RES2_SCHPO     Mbp1
Ustilago maydis            USTMA  71024227   XP_762343     Q4P117_USTMA   Mbp1
Yarrowia lipolytica        YARLI  50545439   XP_500257     Q6CGF5_YARLI   Mbp1

Table of yeast Mbp1 orthologues in genome-sequenced fungi. Columns from left to right: Systematic name, organism code (simply a string that lets us identify the organism in alignments), GI number, RefSeq ID (if existing) or GenPept accession, UniProt accession, most similar yeast protein.

This process was not entirely straightforward in all cases and such variations are quite normal for a database query. You need to be familiar with exceptions such as the ones described below and know how to deal with them.

Note: For Aspergillus fumigatus and Aspergillus nidulans, the top BLAST hit is not the best match. The reason is that the best matching protein has a deletion just C-terminal to the APSES domain. This causes BLAST to split the HSP into two parts, and even though the APSES domain alone has a higher % identity, its score turns out to be lower (and its E-value correspondingly less significant) because it is a shorter sequence. Global alignment of each sequence with yeast Mbp1, as well as alignment of only the APSES domains, was consistent in showing that for both Aspergillus species the protein with the second highest BLAST score is indeed the most similar protein. The take-home message is that the comparison of BLAST scores can be misleading if we apply them to sequences of different length. For the record: Aspergillus fumigatus highest BLAST score is with XP_748947, second highest BLAST score is with XP_754232; the latter has higher global identity (25.7% vs. 22.6%) and higher identity in the APSES domain (55% vs. 45%). Aspergillus nidulans highest BLAST score is with XP_664319, second highest BLAST score is with XP_660758; the latter has higher global identity (26.7% vs. 22.8%) and higher identity in the APSES domain (59.5% vs. 50.6%). Interestingly, the Aspergillus terreus orthologue has the same deletion, but it provided the highest BLAST score to begin with.
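Percent identity values such as the ones quoted above are computed from an alignment, independently of its length-dependent score. A minimal sketch follows; the choice of denominator (columns in which both sequences have a residue) is an assumption, and alignment programs differ in the convention they report, so numbers from different tools are not always directly comparable.

```python
def percent_identity(row_a, row_b):
    """Percent identity over the columns where both rows carry a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # Ignore every column in which either row has a gap character.
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Made-up aligned rows, for illustration only.
print(percent_identity("ACDEFG", "ACDEYG"))
print(percent_identity("AC-EF", "ACDEF"))
```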

Note: For Gibberella zeae XP_384396 no UniProt ID was returned as cross-reference. EBI-BLAST retrieved FG04220 which is largely identical, except for short stretches that are absent in GenPept: apparently UniProt has a different gene-model for this protein.

Note: The Magnaporthe grisea protein ABA02072 has greater local C-terminal similarity to the yeast protein Swi6 than to Mbp1, whereas the N-terminal APSES domain is most similar to yeast Mbp1. However a global Needleman-Wunsch alignment (BLOSUM 30, gaps: 8.0/1.0) shows greater overall similarity to yeast Mbp1 than to Swi6. Accordingly I consider this an orthologue to Mbp1 even though its database annotation calls ABA02072 the M. grisea Swi6 homologue.

Note: For Pichia stipitis, BLAST finds two very similar sequences in GenPept as candidate Mbp1 orthologues; the RefSeq sequence XP_001386821.1 is translated according to the standard code, the entry EAZ62798.2 is translated according to the alternative nuclear code 12. This raised the question of which translation is correct. Answering it required looking at the conservation of the residues in question in the BLAST alignment; better conservation indeed supports the alternative code translation.
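The difference between the two translations comes down to a single codon: in NCBI translation table 12 (the alternative yeast nuclear code) CTG (CUG in the mRNA) is read as serine rather than leucine. The sketch below builds both tables from the standard compact amino acid string and translates a made-up DNA fragment; the fragment is not taken from any Pichia stipitis gene.

```python
BASES = "TCAG"
# Amino acids of the standard code (NCBI table 1), codons ordered TTT, TTC, TTA, ...
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(amino_acids):
    """Map all 64 codons (in TCAG order) to the given amino acid string."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))


TABLE_1 = codon_table(STANDARD)
TABLE_12 = dict(TABLE_1, CTG="S")  # the only sense-codon change in table 12


def translate(dna, table):
    """Translate complete codons of dna in frame 1; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table[dna[i:i + 3]] for i in range(0, usable, 3))


dna = "ATGCTGAAACTGTAA"  # made-up fragment containing two CTG codons
p1, p12 = translate(dna, TABLE_1), translate(dna, TABLE_12)
differing = [i for i, (x, y) in enumerate(zip(p1, p12)) if x != y]
print(p1, p12, differing)
```

The positions listed in `differing` are exactly the residues whose conservation in the BLAST alignment is worth inspecting.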

Note: The Ustilago maydis protein XP_762343 (the protein with the systematic name UM06196) is only the second-best hit in the original BLAST list as performed on the genomic BLAST page for the organism. However, local optimal alignment (EMBOSS water) shows a much higher percentage of identity to yeast Mbp1 in the APSES domain than the top BLAST hit (XP_761485, systematic name UM05338), and global alignment (after trimming the N- and C-terminal extensions, respectively) also shows a slightly higher degree of similarity for XP_762343. Accordingly, XP_762343 is considered the Mbp1 orthologue, even though it is the second highest hit according to BLAST. The situation is similar to that of the Aspergillus species: one protein was reported as a single HSP and one protein was broken into two HSPs. This emphasizes the fact that optimal sequence alignments are not entirely equivalent to BLAST alignments. Further, performing the same search against the "nr" database and applying an Organism filter for Ustilago maydis resulted in both proteins being split and the correct orthologue having the highest BLAST score in the list. This emphasizes the fact that searches in organism databases are not entirely equivalent to searches in the global database, even if the results are filtered.

 
 

Obtaining all FASTA sequences for a list of identifiers, and saving them in a format that other programs or services accept as input, is easy. We can simply paste all GI numbers as a comma-separated list into the Entrez search form and, on the results page, select Display FASTA and send to Text; then save the contents as a text file. This is a multi-FASTA file, suitable for input into MSA programs.
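For readers who prefer to handle such files with a script, the following minimal sketch shows how a multi-FASTA text can be read, trimmed to a region of interest and written back. The function names and the toy records are assumptions made for this example; only the two GI numbers are taken from the table above.

```python
def parse_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def trim(records, start, end):
    """Keep residues start..end (1-based, inclusive) of every sequence."""
    return [(header, seq[start - 1:end]) for header, seq in records]


def format_fasta(records, width=60):
    """Write records as multi-FASTA text with fixed-width sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# A comma-separated identifier list, as pasted into a search form.
gi_list = ",".join(["6320147", "68465714"])

# Toy records, not real database entries.
toy = ">seqA toy record\nMSNQ\nIYSA\n>seqB\nMKRK\n"
records = parse_fasta(toy)
print(gi_list)
print(format_fasta(trim(records, 2, 4)), end="")
```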


 

Review the resulting multi-FASTA file for the Mbp1 proteins (linked here) and make sure you understand the procedure that led to it. Depending on your personal learning style you may either carefully review the described procedure, reproduce key steps of the procedure, reproduce the entire procedure paying special attention to the problem cases discussed in the notes, or develop your own procedure. Whatever you do, you must be confident in the end that you could have produced the same input file. (You do not need to submit documentation for this part of the assignment, but you do need to understand the process.)

 
As you have seen from the results of your BLAST searches, Mbp1 orthologues are not the only proteins that contain APSES domains. In order to find all the rest, a PSI-BLAST search was performed using the yeast Mbp1 APSES domain as query. From the list of hits, the APSES domains were extracted and summarized in a file.

  • Review the resulting file for the APSES domains (linked here) and make sure you understand the procedure that was used in its construction, as above.

 

(2) Align and annotate

   


(2.1) Review of domain annotations

Let us first review some of the features of the yeast Mbp1 protein that we have defined in the second assignment (and some structural features I have compiled from various sources). Below is the yeast Mbp1 sequence with a number of annotations, compiled according to the following procedure.

  1. Performed CDD search with yeast Mbp1 protein sequence. This retrieves alignments of Mbp1 with the APSES and the ANKYRIN domains. These are profile based alignments and thus they are more reliable than pairwise alignments.
  2. Performed SMART search with yeast Mbp1 protein sequence. This retrieved the APSES domain, annotated a number of low-complexity regions and a stretch of coiled coil.
  3. Performed a SAS search with yeast Mbp1 protein sequence. This retrieved pairwise alignments with the structures 1MB1 (APSES) and chain D of 1IKN (ankyrin domains of Ikappab), together with their respective secondary structure annotations.
  4. Copied GenPept sequence into Word-processor.
  5. Transferred annotations of low complexity and coiled-coil regions from SMART.
  6. Transferred annotations of APSES secondary structure from SAS (this is a direct annotation, since the experimentally determined structure 1MB1 is a fragment of the Mbp1 protein). The central helix that was proposed to be part of the DNA binding region is slightly distorted and SAS annotates a break in the helix; this break was bridged with lowercase "h" in the annotation.
  7. Ankyrin domain annotation was not as straightforward. While CDD, SMART and SAS all annotate the same general regions, they disagree in details of the domain boundaries and on the precise alignment. Used the profile-based CDD alignment of 1IKN. Transferred annotations of secondary structure from SAS output for 1IKN to sequence (this is a transferred annotation, the original annotation was for 1IKN and we assume that it applies to Mbp1 as well).


MBP1_SACCE
Annotations based on 
- CDD domain analysis,
- SAS structure annotation and
- literature data on binding region

Keys:

C   Coiled coil regions predicted by Coils2 program
x   Low complexity region
*   Proposed binding region
+   positively charged residues, oriented for possible DNA binding interactions
-   negatively charged residues, oriented for possible DNA binding interactions

E   beta strand
H   alpha helix
t   beta turn


                  10         20         30         40         50         60 
          MSNQIYSARY SGVDVYEFIH STGSIMKRKK DDWVNATHIL KAANFAKAKR TRILEKEVLK
1MB1      ----EEEEEt t-EEEEEEEE t-EEEEEEtt ---EEHHHHH HH----HHHH HHHHhhhHHH
                                                               * *+**-+****

                  70         80         90        100        110        120 
          ETHEKVQGGF GKYQGTWVPL NIAKQLAEKF SVYDQLKPLF DFTQTDGSAS PPPAPKHHHA
1MB1      ---EEE---- tt--EEEE-H HHHHHHHHH- --HHHHtt-         xxx xxxxxxxxxx
          **+*+***** ****

                 130        140        150        160        170        180 
          SKVDRKKAIR SASTSAIMET KRNNKKAEEN QFQSSKILGN PTAAPRKRGR PVGSTRGSRR
          x                                                                           


                 190        200        210        220        230        240 
          KLGVNLQRSQ SDMGFPRPAI PNSSISTTQL PSIRSTMGPQ SPTLGILEEE RHDSRQQQPQ
                                                                      xxxxx


                 250        260        270        280        290        300 
          QNNSAQFKEI DLEDGLSSDV EPSQQLQQVF NQNTGFVPQQ QSSLIQTQQT ESMATSVSSS
          x                                        xx xxxxxxxxxx xxxxxxxxxx


                 310        320        330        340        350        360 
          PSLPTSPGDF ADSNPFEERF PGGGTSPIIS MIPRYPVTSR PQTSDINDKV NKYLSKLVDY
          xxxxxxx

                 370        380        390        400        410        420 
          FISNEMKSNK SLPQVLLHPP PHSAPYIDAP IDPELHTAFH WACSMGNLPI AEALYEAGTS
ANKYRIN                                 -- t----HHHHH HH---HHHHH t-t--t-t--


                 430        440        450        460        470        480 
          IRSTNSQGQT PLMRSSLFHN SYTRRTFPRI FQLLHETVFD IDSQSQTVIH HIVKRKSTTP
ANKYRIN   t----t---- HHHHHHHH-- -------HHH HHHHHH-ttH HH-----HHH HHHH--tH--


                 490        500        510        520        530        540 
          SAVYYLDVVL SKIKDFSPQY RIELLLNTQD KNGDTALHIA SKNGDVVFFN TLVKMGALTT
ANKYRIN   HHHHHHHHH- ---------- -----t---- tt---HHHHH HH---HHHHH HHH--t-tt-


                 550        560        570        580        590        600 
          ISNKEGLTAN EIMNQQYEQM MIQNGTNQHV NSSNTDLNIH VNTNNIETKN DVNSMVIMSP
ANKYRIN   ---t----HH HHHHHH--HH HHH-t--HHH -t----HHHH HHH--tHHHH HHHHHH---t


                 610        620        630        640        650        660 
          VSPSDYITYP SQIATNISRN IPNVVNSMKQ MASIYNDLHE QHDNEIKSLQ KTLKSISKTK
ANKYRIN   ---tt----H HHHHHH---H HHHHHHH      CCCCCCCC CCCCCCCCCC CCCCC


                 670        680        690        700        710        720 
          IQVSLKTLEV LKESSKDENG EAQTNDDFEI LSRLQEQNTK KLRKRLIRYK RLIKQKLEYR
                                                    x xxxxxxxxxx xxxxxxx

                 730        740        750        760        770        780 
          QTVLLNKLIE DETQATTNNT VEKDNNTLER LELAQELTML QLQRKNKLSS LVKKFEDNAK


                 790        800        810        820        830 
          IHKYRRIIRE GTEMNIEEVD SSLDVILQTL IANNNKNKGA EQIITISNAN SHA


A good MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since it is a result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. The contiguous features annotated for Mbp1 are left intact.

A poor MSA has many errors in its columns: they contain residues that actually have different functions or structural roles, even though they may look similar to a scoring matrix. It also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted.

 

 

  • Produce a similar set of annotations for your Mbp1 orthologue protein.
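If you would rather generate the layout with a script than in a word processor, the blocked format used above (ten residues per block, sixty per line, a position ruler, and annotation tracks underneath) can be produced as sketched below. The function name and its parameters are assumptions; the annotation strings themselves still have to be compiled by hand from the CDD, SMART and SAS results.

```python
def format_annotated(seq, tracks, label_width=10, per_line=60, block=10):
    """Lay out a sequence and annotation tracks in numbered blocks.

    tracks is a list of (label, annotation) pairs; each annotation string
    is expected to have one character per residue of seq.
    """
    def blocks(s):
        # Split a string into space-separated groups of `block` characters.
        return " ".join(s[i:i + block] for i in range(0, len(s), block))

    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        # Residue numbers at the end of every complete block of this line.
        ends = range(start + block, start + len(chunk) + 1, block)
        ruler = " ".join(str(e).rjust(block) for e in ends)
        lines.append(" " * label_width + ruler)
        lines.append(" " * label_width + blocks(chunk))
        for label, track in tracks:
            segment = track[start:start + per_line]
            lines.append(label.ljust(label_width) + blocks(segment))
        lines.append("")
    return "\n".join(lines)


# First 20 residues of yeast Mbp1 with the 1MB1 track shown above.
print(format_annotated("MSNQIYSARYSGVDVYEFIH",
                       [("1MB1", "----EEEEEtt-EEEEEEEE")],
                       per_line=20))
```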


 

 

(2.2) Computing alignments

 

Multiple sequence alignments are compute-intensive and this used to require downloading and installing software on your own computer. While most tools were available on the Web in principle, many groups have restricted the total number of sequences or the total number of characters to be aligned. The EBI however offers three of the most commonly used tools with few limitations and it was possible to run MSAs for all Mbp1 orthologues jointly.

The following three servers were used for the alignments:

  • CLUSTAL-W is a progressive alignment program. It is the most popular and most widely referenced MSA algorithm, and it is reasonably fast and easy to use. But alignment errors that are made early in the process cannot be corrected later, and thus CLUSTAL is prone to misalign sets of sequences that have poor (<30% ID) local similarity. It is no longer considered state-of-the-art for carefully done alignments.
  • MUSCLE essentially starts out from a CLUSTAL-like alignment as a draft, then identifies similar groups of sequences from which it calculates profiles, and then re-aligns the group to the profile. This procedure is iterated.
  • T-Coffee is one of my favourites - the tradeoffs appear to be especially well balanced. It too starts from a set of pairwise global alignments, like CLUSTAL, then additionally calculates sets of best local alignments. Global and local alignments are then combined to a similarity matrix and based on this matrix a guide-tree is constructed. This determines the order of steps in which sequences are added to the multiple alignment. A nice feature of T-Coffee is color coded output that allows you to quickly judge the local reliability of the alignment.

Multiple sequence alignments were performed for all 18 Mbp1 orthologues to compare the results. Since the results would all look the same for the same input file, I have simply posted them here. Of course you are welcome to run an alignment on your own for your own learning experience, or to find an alternative program.

The first alignment was run with CLUSTAL.

Assignment 3, Figure 01
The guide tree computed by CLUSTAL-W. The algorithm uses this tree to determine the best order for its progressive alignment for the 18 Mbp1 orthologue sequences. This tree is based on a matrix of pairwise distances.
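To illustrate what a matrix of pairwise distances is, the sketch below computes a simple distance (the fraction of differing residues) for every pair in a toy set of pre-aligned sequences and reports the closest pair, which is the pair a progressive method would typically join first. This is a simplification and not CLUSTAL-W's actual procedure, which derives its distances from pairwise alignments it computes itself; the names and sequences are made up.

```python
from itertools import combinations


def p_distance(row_a, row_b):
    """Fraction of differing residues over columns without gaps."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def distance_matrix(aligned):
    """Distances for all unordered pairs of a {name: aligned row} dict."""
    return {(a, b): p_distance(aligned[a], aligned[b])
            for a, b in combinations(list(aligned), 2)}


def closest_pair(distances):
    """The pair of names with the smallest distance."""
    return min(distances, key=distances.get)


# Toy sequences, for illustration only.
toy = {"A": "MKRLSW", "B": "MKRLSY", "C": "MARVTW"}
distances = distance_matrix(toy)
print(distances)
print(closest_pair(distances))
```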

Subsequently, sequence alignments were performed with T-Coffee and MUSCLE. For these two, the input files were re-ordered to correspond to the order of the CLUSTAL output, and the option to order the alignments according to the input sequences was chosen on the form. This makes it much easier to compare alignments, since all MSAs are displayed in the same relative order.
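Re-ordering an input file by hand is error-prone for 18 sequences; a small sketch of how it could be scripted is shown here. It assumes the sequences are available as (header, sequence) pairs and that the first word of each header is the identifier used in the CLUSTAL output; the labels and sequences below are illustrative only.

```python
def reorder(records, id_order):
    """Return records sorted to match id_order (first word of each header)."""
    by_id = {header.split()[0]: (header, seq) for header, seq in records}
    missing = [name for name in id_order if name not in by_id]
    if missing:
        raise KeyError("identifiers not found in input: " + ", ".join(missing))
    return [by_id[name] for name in id_order]


# Illustrative labels and toy sequences.
records = [("Mbp1_SACCE yeast", "MSNQ"),
           ("Mbp1_CANAL", "MSKR"),
           ("Mbp1_KLULA", "MSTQ")]
clustal_order = ["Mbp1_KLULA", "Mbp1_SACCE", "Mbp1_CANAL"]
for header, seq in reorder(records, clustal_order):
    print(header, seq)
```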

Finally I have merged the domain annotations for the yeast Mbp1 protein into the output files.

The result files are linked here:

Globally speaking, the alignments are quite similar. Let's first look at the common themes, before we discuss details of the results. The (score-colored T-COFFEE alignment) is well suited to look at general relationships between the sequences, since outliers can be easily identified. For example, if one of the sequences had a low-scoring domain that aligns poorly to the others of the group, that domain may have been acquired in a separate evolutionary event and may not be homologous to the others. We would notice an isolated stretch of poorly alignable sequence, i.e. a segment coloured with a low score in a set of otherwise high-scoring segments. Also, a gene may have acquired significant lengths of N- or C-terminal extensions which may not be homologous (unless they are the result of an internal duplication).

 

 

  • Review the (score-colored T-COFFEE alignment). Based on this alignment, how do you feel about our initial assertion that these 18 proteins should be considered orthologous over their entire length? You do not need to discuss this in the assignment but you should study the evidence in the alignment. Note that this question does not ask about the general level of conservation, but about whether significant segments (of about the length of a domain) do not appear related/alignable at all in regions where the rest of the group are reasonably well conserved.

   

(3) Mbp1 orthologues: analysis of full length MSAs

   



(3.1) APSES domains

 


The APSES domains in all of our Mbp1 orthologues are highly conserved and pretty much any alignment program must be able to align such obviously similar regions.

 

  • Consider the CLUSTAL, Muscle and T-Coffee alignments of the Mbp1 orthologues. Orient yourselves as to where the APSES domains are located. For one alignment, review whether the charged residues in the proposed binding region are wholly or partially conserved across all 18 proteins. Refer to the specific residues labelled (+) or (-) in the Mbp1 annotation above.
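Whether a charged position is wholly or partially conserved can also be checked column by column with a short script. In this sketch, a column counts as conserved for a sequence if its residue has the same charge class as the reference (first) row. Treating only K and R as positive and D and E as negative is an assumption (histidine is left out), and the rows are toy data.

```python
POSITIVE = set("KR")  # assumption: histidine is not counted as charged
NEGATIVE = set("DE")


def charge_class(residue):
    """Classify a residue as '+', '-' or '0' (uncharged, or a gap)."""
    if residue in POSITIVE:
        return "+"
    if residue in NEGATIVE:
        return "-"
    return "0"


def charge_conservation(rows, column, reference=0):
    """Reference charge class at a 0-based column and the fraction of rows sharing it."""
    expected = charge_class(rows[reference][column])
    same = sum(1 for row in rows if charge_class(row[column]) == expected)
    return expected, same / len(rows)


# Toy aligned rows, for illustration only.
rows = ["AKDE", "ARDQ", "AKEE"]
for column in range(1, 4):
    print(column, charge_conservation(rows, column))
```

A fraction of 1.0 corresponds to "wholly conserved" in the sense of the task above; anything lower is at best partial conservation.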

 

   

(3.2) Ankyrin domains

 

The Ankyrin domains are more highly diverged, the boundaries are less well defined and not even CDD, SMART and SAS agree on the precise annotations. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required indels would be placed between the secondary structure elements, not in their middle.

 

  • Compare the distribution of indels in the ankyrin repeat regions of three alignments. Review whether the indels in this region are concentrated in segments that connect the helices, or if they are more or less evenly distributed along the entire region of similarity. Think about whether the assertion that indels should not be placed in elements of secondary structure has merit in this case. In particular, if you notice indels that violate this rule-of-thumb (that have been placed into structurally annotated regions of secondary structure), consider whether the location of the indel has strong support from aligned sequence motifs, or whether it could apparently be placed into a different location without much loss in alignment quality.
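A quick way to quantify this by script is to project the secondary structure annotation of the reference sequence onto the alignment columns and count how many gap characters of another sequence fall into annotated helices or strands. The sketch below assumes the annotation is given as one character per residue of the ungapped reference, using the H and E keys from the annotation above; the sequences are toy data and insertions relative to the reference are not analysed.

```python
def project_annotation(ref_row, annotation):
    """Spread a per-residue annotation over the gapped reference row."""
    if sum(1 for c in ref_row if c != "-") != len(annotation):
        raise ValueError("annotation must have one character per residue")
    projected, k = [], 0
    for c in ref_row:
        if c == "-":
            projected.append(" ")  # column has no reference residue
        else:
            projected.append(annotation[k])
            k += 1
    return "".join(projected)


def gaps_in_elements(row, column_annotation, elements="HE"):
    """Count gaps of row inside and outside annotated elements."""
    inside = sum(1 for c, a in zip(row, column_annotation)
                 if c == "-" and a in elements)
    return inside, row.count("-") - inside


# Toy data: reference row, its annotation, and one other aligned row.
reference = "MKT-AYIAKQR"
columns = project_annotation(reference, "--HHHHHH--")
print(repr(columns))
print(gaps_in_elements("M-TSAYIA-QR", columns))
```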

 

   

(3.3) Other features (1 mark)

 

Aligning functional features like coiled coil domains or intrinsically disordered regions is even more difficult, since this is to a large degree a property of the amino acid composition, not as much the precise sequence. Thus we would expect it to be difficult to detect the correspondence between sequences in such regions. I have annotated four low complexity regions of the yeast Mbp1 sequence.

 

Analysis (1 mark)
  • Refer to your annotation of your organism's Mbp1 orthologue. Comment on one of the multiple sequence alignments: does your protein have similar low complexity regions as Mbp1_SACCE, and have these regions been aligned by the MSA algorithm? Briefly describe the situation: state whether these segments are found in the same general region, in the same detailed location, or perhaps even conserved in sequence, when you compare them to the Saccharomyces cerevisiae protein. Back up your conclusions with specific reference to particular elements of the alignment.
  • Briefly discuss whether this observation should lead you to conclude that disorder in these proteins appears to be a conserved feature, i.e. that disorder can be a functional feature that is selected for in evolution. If this is the case, consider whether the disordered segments appear to be homologous or analogous or whether the data does not allow a conclusion.
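The statement that low complexity is a property of composition rather than of the precise sequence can be illustrated with a crude measure: the Shannon entropy of the residue composition in a sliding window. The sketch below is only a rough proxy and not the algorithm behind the SMART annotations; the window length of 12 and the threshold of 2.2 bits are illustrative assumptions.

```python
import math
from collections import Counter


def window_entropy(seq, window=12):
    """Shannon entropy (bits) of residue composition for each window start."""
    values = []
    for i in range(len(seq) - window + 1):
        counts = Counter(seq[i:i + window])
        entropy = -sum((n / window) * math.log2(n / window)
                       for n in counts.values())
        values.append(entropy)
    return values


def low_complexity_mask(seq, window=12, threshold=2.2):
    """Mark every residue covered by a low-entropy window with 'x'."""
    mask = [" "] * len(seq)
    for i, entropy in enumerate(window_entropy(seq, window)):
        if entropy <= threshold:
            for k in range(i, i + window):
                mask[k] = "x"
    return "".join(mask)


# Toy sequences: one homopolymer, one with twelve different residues.
print(repr(low_complexity_mask("QQQQQQQQQQQQ")))
print(repr(low_complexity_mask("ACDEFGHIKLMN")))
```

Two segments can both score as low complexity without sharing any alignable sequence, which is worth keeping in mind when judging whether such regions are homologous.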

 
 



(4) APSES domain homologues: analysis of domain MSAs

 

You have read how to generate a source sequence file based on the results of a PSI-BLAST search for all APSES domains in fungi. Of course, since PSI-BLAST has detected these sequences due to their high similarity to a sequence profile, this similarity implies an alignment. In this case this is a model based MSA because the sequences are aligned to a prototypic model (the sequence profile) and not to each other. To align these highly diverged sequences the MUSCLE server is the tool of choice. For comparison, a CLUSTAL alignment has been computed as well.

If we compare the alignments, we notice immediately that they disagree over significant portions of the sequences.


(4.2) Patterns of residue conservation

 


With any computational tool, we have to consider whether the program's objective function corresponds to our requirements. For example, the lack of conservation in a particular column does not necessarily mean that a residue has changed in evolution - sometimes this is simply a consequence of an alignment that has matched residues with a higher score at the expense of conserving columns we believe to be biologically important. MSAs can only take sequence information into account, while we may have complementary information available on structural and functional conservation patterns. This may include secondary structure (gaps should be moved out of regions of secondary structure, where possible), structurally required residues (these are expected to be conserved across all structurally similar sequences), and functionally conserved residues (these are expected to have a high likelihood of being conserved within groups of orthologues, but varying between paralogues).

In terms of structural conservation, we expect motif or consistency based alignments to be more accurate since they align to the "big picture". In terms of functional variation we expect progressive alignments to be more accurate, since they align to local similarities.

Let us consider the alignments in terms of their biological relevance. I have annotated the ligand-binding residues for the yeast Mbp1 APSES domain in the multiple sequence alignments by color coding the charged residues that putatively could bind DNA red (-) and blue (+). Thus these residues label columns of the alignment in which we expect functional conservation. I have also highlighted two residues that are associated with important structural features of the APSES domain in green. These two residues are G75, a mandatory glycine in the third position of a particular type of beta-turn, and W77, a key component of the domain's hydrophobic core. Thus these two residues label columns in which we expect structural conservation. Let's assume (i) that all the APSES domains fold into similar structures and (ii) that they all bind DNA, but (iii) they do not necessarily bind the same cognate sequence, as a consequence of the functional diversification of paralogues. This should allow you to discuss the following questions:


 

Consider any one of the three APSES domain alignments.

  • Review whether the patterns of sequence variation for functionally conserved residues are compatible with the notion that orthologues have conserved binding specificities and paralogues have acquired new functions by binding to different sequences.
  • Review whether the patterns of sequence variation for structurally conserved residues are compatible with the notion that all APSES domains have a common fold.

To approach these questions systematically, define (by reference to specific sequences and residues) what you would expect (hypothesis) and whether the alignment supports or contradicts your expectations (observation). We have determined that the sequences labelled as Mbp1 are orthologues, and the other labels were constructed to identify the yeast gene that each sequence is most similar to. This means you may group Mbp1 sequences as orthologues, Swi4, Sok2, and Phd1 sequences are presumably orthologous, and all sequences originating from the same organism are of course groups of paralogues. However, labels such as MbpA, MbpB etc. are arbitrary: these sequences as a group are paralogous to e.g. Mbp1 but not necessarily orthologous to each other.
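When referring to specific residues such as G75 or W77, it helps to know which alignment column they occupy, since gaps shift the positions. A minimal sketch of that bookkeeping is shown here; the function names are made up, and the `first_residue` offset is an assumption that must be checked against the alignment, because a domain alignment usually does not start at residue 1 of the full-length protein.

```python
def residue_to_column(gapped_row, residue_number, first_residue=1):
    """0-based alignment column that holds the given residue number.

    first_residue is the sequence number of the first residue in the row.
    """
    position = first_residue - 1
    for column, char in enumerate(gapped_row):
        if char != "-":
            position += 1
            if position == residue_number:
                return column
    raise ValueError("residue number is outside the aligned segment")


def column_residues(rows, column):
    """Residues (or gap characters) of all rows at one alignment column."""
    return [row[column] for row in rows]


# Toy rows, for illustration only; the first row plays the reference.
rows = ["--MS-NQ", "MAMSANQ", "--MT-NE"]
column = residue_to_column(rows[0], 3)
print(column, column_residues(rows, column))
```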

   

(4.3) Visualization and analysis of alignment with VMD

 

VMD offers a very well constructed set of tools for the analysis of sequence and structural conservation: the MultiSeq extension. In this part of the assignment you will use VMD to analyse and visualize conservation patterns and comment on the alignments the servers have produced. I highly recommend that you familiarize yourself with MultiSeq; the developers have produced an excellent tutorial on the evolution of tRNA synthetases to showcase the program's capabilities. However, I am not requiring that you go through the tutorial and we will be using only a subset of the available MultiSeq functions. The tool is intuitive enough that beginning to use it should require no more than following the steps below.

Proceed through the following steps:

(1) Save an alignment of the APSES domains on your computer.
(A) Access the MUSCLE alignment of all APSES domains, copy it from the Wiki page and save it on your computer, as a text file with some convenient filename and the extension .aln . This is a CLUSTAL formatted input file.
(B) Edit the file to remove any header lines and lines containing the conservation symbols .:*. Leave the gene-names and aligned sequences as they are. Make sure you are not saving the file in MS-Word binary format (.doc) and that the extension is not changed (depending on how your computer is configured, it may silently append a .txt extension that will cause trouble later on).
(2) Open the Multiseq extension in VMD.
(A) start VMD and load the 1MB1 APSES domain structure.
(B) choose a stereo representation that will show you the fold of the domain and the sidechains of key residues. For example you could use a Tube representation for the protein backbone and a Licorice representation for the selection ((sidechain or type CA) and not element H) and resid 30 to 90. (And switch the axes display off! The axes carry no information you need).
(C) On the VMD Main form navigate to Extensions → Analysis → MultiSeq
(D) When you run MultiSeq for the first time, you will be asked for a directory in which to store metadata. You can use the default or a directory of your choice; you may subsequently skip all steps that ask you to install "required" databases locally since we will not need them for this task.
(E) A window will appear - the MultiSeq window - it contains the sequence of the APSES domain you are visualizing. MultiSeq will also generate an additional cartoon representation of the structure.
(3) Load the APSES alignment.
(A) In the MultiSeq Window, navigate to File → Import Data...; Choose "From Files" and Browse to the location of the alignment you have saved. The File navigation window gives you options which files to enable: choose to Enable ALN files (these are CLUSTAL formatted multiple sequence alignments).
(B) Open the alignment file, click on Ok to Import Data, it will take a short while to load. If the data can't be loaded, the file may have the wrong extension: .aln is required.
(C) find the Mbp1_SACCE sequence in the list, click on it and move it to the top of the Sequences list with your mouse (the list is not static, you can re-order the sequences in any way you like).

You will see that the 1MB1 sequence and the APSES domain sequence do not match; at the beginning the structure has extra sequence extending its N-terminus, and in the middle the APSES sequences have gaps inserted.
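The file clean-up described in step (1)(B) can also be done with a short script instead of a text editor. The sketch below drops the header line and every line that consists only of the conservation symbols; recognising the header by a leading "CLUSTAL" or "MUSCLE" is an assumption about how the file was written, and the example content is a toy alignment.

```python
def clean_clustal(text):
    """Remove the header and conservation lines from CLUSTAL formatted text."""
    kept = []
    for line in text.splitlines():
        if line.startswith(("CLUSTAL", "MUSCLE")):
            continue  # header line (assumed to start with the program name)
        if line.strip() and set(line) <= set(" \t.:*"):
            continue  # conservation line made up only of ' ', '.', ':', '*'
        kept.append(line)
    return "\n".join(kept) + "\n"


# Toy alignment block, for illustration only.
raw = ("CLUSTAL W (1.83) multiple sequence alignment\n"
       "\n"
       "Mbp1_SACCE      MKRK\n"
       "Mbp1_CANAL      MKRR\n"
       "                ***:\n")
print(clean_clustal(raw), end="")
```

If you write the result to disk from a script, choose the filename explicitly (ending in .aln) so that no other extension is appended.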

(4) Bring the 1MB1 sequence in register with the APSES alignment.
(A) MultiSeq supports typical text-editor selection mechanisms. Clicking on a residue selects it, clicking on a row selects the whole sequence. Dragging with the mouse selects several residues, shift-clicking selects ranges, and option-clicking toggles the selection on or off for individual residues. Using the mouse and/or the shift key as required, select the entire first column of the sequences you have imported.
(B) Select Edit → Enable Editing... → Gaps only to allow changing indels.
(C) Pressing the spacebar once should insert a gap character before the selected column in all sequences. Insert as many gaps as you need to align the beginning of the sequences with the corresponding residues of 1MB1 S I M ....
(D) Now insert as many gaps as you need into the structure sequence to align it completely with the Mbp1_SACCE APSES domain sequence: simply select residues in the sequence and use the space bar to insert gaps. (Note: I have noticed a bug that sometimes prevents slider or keyboard input to the MultiSeq window; it fails to regain focus after operations in a different window. I don't know whether this is a Mac-related problem or a more general bug in MultiSeq. When this happens I quit VMD and restore the session from a saved state. It is a bit annoying but not mission-critical.)
(E) When you are done, it may be prudent to save the state of your alignment. Use File → Save Session...
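The logic of steps (C) and (D) can be sketched in Python to make clear what "bringing into register" means. This is an illustration only and has nothing to do with how MultiSeq works internally. It assumes that the structure sequence contains the ungapped reference sequence residue for residue; the function name, the seed length and the toy sequences are all hypothetical.

```python
def bring_into_register(structure_seq, aligned_ref, seed_len=5):
    """Return (gapped_structure, n_leading, n_trailing).

    n_leading / n_trailing are the numbers of gap characters to add to the
    start / end of every alignment row so all rows have equal length.
    Assumption: the structure and the reference have identical residues
    where they overlap.
    """
    ungapped = aligned_ref.replace("-", "")
    # Locate the start of the reference inside the structure sequence.
    offset = structure_seq.find(ungapped[:seed_len])
    if offset < 0:
        raise ValueError("start of reference not found in structure sequence")
    # Extra N-terminal residues of the structure come first (step C).
    gapped = list(structure_seq[:offset])
    pos = offset
    # Copy structure residues, adding a gap wherever the reference has one (step D).
    for ch in aligned_ref:
        if ch == "-" or pos >= len(structure_seq):
            gapped.append("-")
        else:
            gapped.append(structure_seq[pos])
            pos += 1
    tail = structure_seq[pos:]  # extra C-terminal residues, if any
    gapped.extend(tail)
    return "".join(gapped), offset, len(tail)


def pad_row(row, n_leading, n_trailing):
    """Add leading and trailing gaps to one alignment row."""
    return "-" * n_leading + row + "-" * n_trailing


# Hypothetical toy data: the structure has two extra N-terminal residues.
structure = "AQSIMKRVNLTQ"
reference = "SIMKRV--NLTQ"

gapped_structure, lead, trail = bring_into_register(structure, reference)
print(gapped_structure)
print(pad_row(reference, lead, trail))
```

After this, the structure row and every padded alignment row have the same length, and identical residues sit in the same column.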
(5) Color by similarity
(A) Use the View → Coloring → Sequence similarity → BLOSUM30 option to color the residues in the alignment and structure. This clearly shows you where conserved and variable residues are located and allows you to analyze their structural context.
(B) You can adjust the color scale in the usual way by navigating to VMD main → Graphics → Colors..., choosing the Color Scale tab and adjusting the scale midpoint (0.75 works well for me).
(C) Navigate to the Representations window and apply the color scheme to your tube-and-sidechain representation: double-click on the NewCartoon representation to hide it and use User coloring of your Tube and Licorice representations to apply the sequence similarity color gradient that MultiSeq has calculated. The example below shows in principle what you could expect to see (without sidechains).
Assignment 3, Figure 02
Stereo view of a tube representation of an APSES domain structure, colored according to residue similarity of all fungal APSES domains as defined in this assignment. A BLOSUM30 similarity matrix was applied and a gradient midpoint of 0.75. The domain is oriented with the putative recognition helix towards the front, left and the "wing" on the right.
(D) Now delete all non-Mbp1 sequences from the alignment and recalculate the similarity coloring using only the Mbp1 orthologues. You may want to shift the gradient midpoint to 0.9 or so since overall conservation is much higher. Again study the conservation patterns.
Assignment 3, Figure 03
Stereo view of a tube representation of an APSES domain structure, colored according to residue similarity of all Mbp1 orthologue APSES domains, as defined in this assignment. A BLOSUM50 similarity matrix was applied and a gradient midpoint of 0.90. The domain is oriented with the putative recognition helix towards the front, left and the "wing" on the right.
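To see why the coloring changes when the non-Mbp1 sequences are deleted, it may help to look at how a per-column similarity score responds to the set of sequences. The Python sketch below computes an average pairwise score for each alignment column. It is not MultiSeq's actual calculation: the scoring function is a toy stand-in for a BLOSUM matrix, the residue classes are an assumption made for illustration, and the sequence names and residues are hypothetical.

```python
from itertools import combinations

# Toy residue classes standing in for a real substitution matrix (assumption).
CLASSES = ("ST", "IVLM", "KR", "DE")


def toy_score(a, b):
    """1.0 for identical residues, 0.5 for the same class, 0.0 otherwise."""
    if a == b:
        return 1.0
    if any(a in c and b in c for c in CLASSES):
        return 0.5
    return 0.0


def column_scores(rows, score=toy_score):
    """Average pairwise score per alignment column; gaps are ignored.

    Columns with fewer than two residues get None.
    """
    scores = []
    for column in zip(*rows):
        residues = [r for r in column if r != "-"]
        pairs = list(combinations(residues, 2))
        if not pairs:
            scores.append(None)
            continue
        scores.append(sum(score(a, b) for a, b in pairs) / len(pairs))
    return scores


# Hypothetical toy alignment: two "orthologues" and one more distant sequence.
alignment = {
    "Mbp1_A": "SIMK",
    "Mbp1_B": "SIMR",
    "Other_C": "TVMD",
}

all_scores = column_scores(alignment.values())
mbp1_scores = column_scores(
    [seq for name, seq in alignment.items() if name.startswith("Mbp1_")]
)
print(all_scores)
print(mbp1_scores)
```

In the toy data the scores rise in every variable column once the distant sequence is removed, which is the same effect that makes a higher gradient midpoint useful for the orthologue-only coloring.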


 

Analysis (1 mark)
  • Generate two parallel stereo views that show the APSES domain backbone and selected sidechains as described above. One should be colored by sequence similarity among all APSES domains, the other by similarity among only the Mbp1 orthologues. Scale and rotate the structure so that the putative DNA binding domain is easily visible. Paste both views into your assignment in a compressed format, as was explained for Assignment 2.
  • Briefly discuss what you see (with reference to specific residues and sidechains) and what you conclude about residue conservation in the alignment of all APSES domains. Are the patterns of sequence variation for structurally conserved residues compatible with the notion that all APSES domains have a common fold?
  • Briefly discuss how the situation changes when you compare only Mbp1 orthologues with each other. Never mind that overall conservation is higher: does the distribution of conserved residues in the context of the domain change, and if so, how? Are the patterns of sequence variation for functionally conserved residues compatible with the notion that all Mbp1 orthologues have a similar function?

   

(5) Summary of Resources

 

Links
Sequences
Alignments
Mbp1 proteins:
APSES domains:


   

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List.