BIO Assignment 3 2011


Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 

 

 

Assignment 3 (last: 2011) - Multiple Sequence Alignment

 

Preparation, submission and due date

Read carefully.
Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which people have simply overlooked crucial questions. Sadly, we always get assignments back in which people have not described procedural details. If you did not notice that the above were two different sentences, you are still not reading carefully enough.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Monday, November 21, at 12:00.

   

Your documentation for the procedures you follow in this assignment will be worth 1 mark.


 

Introduction

 

Take care of things, and they will take care of you.
Shunryu Suzuki

Much of what we know about a protein's physiological function is based on the conservation of that function as the species evolves. We assess conservation by comparison to related proteins. Conservation - or variability - is a consequence of selection under constraints: the multiple effects on a species' fitness function that are induced through changes to the structural or functional features of a protein. Conservation patterns can thus provide evidence for many different questions: structural conservation among proteins with similar 3D-structures, functional conservation among homologues with comparable roles, peaks of sequence variability that indicate domain boundaries in multi-domain proteins, or amino acid propensities as predictors for protein engineering and design tasks.

Measuring conservation requires alignment. Therefore a carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of the essential properties of a gene or protein. MSAs are also useful to resolve ambiguities in the precise placement of indels and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for

  • functional annotation;
  • protein homology modeling;
  • phylogenetic analyses, and
  • sensitive homology searches in databases.


As a first step, we will explore the search and retrieval of fungal proteins that are orthologous to yeast Mbp1, and of the APSES domains they contain. Each student is being assigned one genome-sequenced fungus. Briefly, you will

  1. Collect sequence identifiers for all APSES domain transcription factors in your assigned species;
  2. Retrieve the sequences;
  3. Perform a multiple sequence alignment with these, and a number of reference domains;
  4. Edit the alignment and annotate.


Multiple Sequence Alignment is not a solved computational problem, and a significant number of alignment tools exist, each with different strengths and objectives. It is remarkable that by far the most frequently used MSA algorithm is CLUSTAL, a procedure that was first published for the microprocessors of the late 1980s, has been surpassed in performance many times, and has been shown to be significantly inferior to more modern approaches when aligning sequences with 30% identity or less. In this assignment we will encounter various approaches to multiple alignment:

  • A model-based approach (based on the PSSM that PSI-BLAST generates)
  • Progressive alignments - CLUSTAL and MAFFT
  • Consistency based alignment - T-Coffee and MUSCLE


(1) Mbp1 homologues


(1.1) Retrieving sequences


In Assignment 2 you retrieved the protein sequence of Saccharomyces cerevisiae Mbp1 and defined its APSES (KilA-N) domain. Let us now search for an orthologue of this sequence in your species. More precisely, you should identify proteins that fulfill the Reciprocal Best Match criterion.

First, we need to define the sequence you will use to find Mbp1 homologues. Since Mbp1 contains the very widely distributed Ankyrin motifs, a BLAST search with full length sequences will pick up a large number of Ankyrin-repeat containing proteins that are otherwise unrelated to our query. We will instead search for homologues using only the APSES domain as a query. However, the Pfam definition of the APSES domain (or KilA-N family, as it is now called) does not cover the entire length of the domain that has been crystallized. Therefore, we will use the sequence of the crystallized protein instead of the Pfam alignment. One of the results of our analysis will be whether APSES domains in fungi all have the same length as the Mbp1 domain, or whether some are indeed much shorter, as suggested by the Pfam alignment. To remind you, here is the full sequence of the 1MB1 structure (Note that the C-terminal His6 tag that has been added for purification is not part of the Mbp1 protein sequence.) ...


>PDB:1MB1
MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPL
NIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDHHHHHH


... and, for comparison, this is the corresponding alignment with the Pfam KilA-N model obtained from a RPS-BLAST search of the above sequence against the CDD database:


                           10        20        30        40        50        60        70        80
                   ....*....|....*....|....*....|....*....|....*....|....*....|....*....|....*....|
1MB1            19 IHSTGSIMKRKKDDWVNATHILKAANFAKaKRTRILEKEVLKETHEKVQ----------------GGFGKYQGTWVPLNI 82

Cdd:pfam04383    3 YNDFEIIIRRDKDGYINATKLCKAAGATK-RFRNWLRLESTKELIEELSkennidvliievenkkGKNGRLQGTYVHPDL 81


                           90
                   ....*....|....*
1MB1            83 AKQLA----EKFSVY 93

Cdd:pfam04383   82 ALAIAswisPEFALK 96


As you can see, the Pfam alignment is 18 amino acids shorter at the N-terminus and 31 amino acids shorter at the C-terminus.
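The two numbers can be checked from the sequence and the alignment coordinates shown above. The following is only a minimal sketch: it assumes that the purification tag is exactly the run of six histidines at the end of the 1MB1 sequence, and the variable names are chosen for illustration.

```python
# Sequence of 1MB1 as listed above (the two FASTA lines joined).
seq_1mb1 = (
    "MSNQIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQGTWVPL"
    "NIAKQLAEKFSVYDQLKPLFDFTQTDGSASPPPAPKHHHASKVDHHHHHH"
)

# Assumption: the His6 tag is exactly the six trailing histidines.
HIS_TAG = "HHHHHH"
domain = seq_1mb1[:-len(HIS_TAG)] if seq_1mb1.endswith(HIS_TAG) else seq_1mb1

# First and last 1MB1 residue covered by the RPS-BLAST alignment (1-based),
# read from the coordinates printed next to the alignment rows.
aln_start, aln_end = 19, 93

n_term_missing = aln_start - 1            # residues before the aligned region
c_term_missing = len(domain) - aln_end    # residues after the aligned region

print(len(domain), n_term_missing, c_term_missing)
```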


Find APSES domain proteins in your species
  1. Access the species list and identify the species that has been assigned to you.
  2. Navigate to the NCBI's main page.
  3. In the left-hand menu of links, follow the link to Genomes & Maps.
  4. Under the Databases tab, follow the link to Genome.
  5. In the Genome tools section of that page, follow the link to Genomic groups BLAST.
  6. Click on link to the eukaryotic genomes tree, then on the link for the text table. This produces a BLAST interface to a list of species for which whole-genome sequences have been sequenced, annotated and entered into the various databases.
  7. Paste the FASTA sequence of the structurally defined Mbp1 APSES domain (e.g. from 1MB1) into the search field (excluding the His-tag, of course), set the parameters correctly for a Protein search against Protein sequences using blastp. Then find your assigned species in the table and check the box next to its name. Remember to record the parameters for your search. I expect you to understand which parameters would be needed in order to make this search reproducible. Run the search.
  8. On the next screen, check the box next to Format for: PSI-BLAST. Then click on View report to show the results of the first PSI-BLAST iteration.
  9. Run subsequent iterations of PSI-BLAST simply by clicking on Go after checking the sequences that have been included.
  10. Iterate the PSI-BLAST search until convergence (i.e. until no more new sequences are added); make sure to include only sequences for which the E-value is small (smaller than about 10e-03 should be safe). Sequences with borderline E-values that improve significantly in an iteration are probably homologues. Sequences with borderline E-values that do not improve much, or for which the E-value increases are probably not homologues. If this step does not work for you or the results are not what you expect, please contact your TA right away.
  • Note: Please spend a little time on each page to understand its contents. Ask, if the page contains resources or features you don't understand. Think about what you are doing. If you simply click on the links I provide, you will miss the opportunity to understand how the resources fit into the workflow you are working on, and to be able to execute similar processes yourself. Questions on page contents can potentially appear on quizzes and exam.
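The idea of iterating "until convergence" can be sketched in a few lines. This is not how the NCBI page is implemented; it is a simplified illustration in which `search` is a hypothetical function standing in for one PSI-BLAST iteration, and the toy E-values are invented. The default cutoff simply repeats the value quoted above.

```python
def iterate_until_converged(search, e_cutoff=10e-03, max_rounds=20):
    """Repeat a profile search until no new sequence passes the cutoff.

    `search` is a hypothetical callable: it receives the set of identifiers
    currently included in the profile and returns {identifier: E-value}.
    """
    included = set()
    for round_no in range(1, max_rounds + 1):
        hits = search(included)
        passing = {sid for sid, e_value in hits.items() if e_value < e_cutoff}
        new = passing - included
        if not new:                      # nothing was added: converged
            return included, round_no
        included |= new
    raise RuntimeError("no convergence within max_rounds")


def toy_search(included):
    # Invented E-values: "seqC" is borderline at first and only improves
    # once "seqA" and "seqB" have contributed to the profile.
    hits = {"seqA": 1e-30, "seqB": 1e-12, "seqC": 0.5, "seqD": 3.0}
    if {"seqA", "seqB"} <= included:
        hits["seqC"] = 1e-5
    return hits


members, rounds = iterate_until_converged(toy_search)
print(sorted(members), rounds)
```

In the toy run, "seqC" behaves like the borderline hits described above: its E-value improves markedly after the first iteration, so it is included, whereas "seqD" never improves and stays out.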


Familiarize yourself with the output form you obtain; this is by far the most frequently used bioinformatics result page. You may want to refer to the NCBI explanation.

Here is a list of things to look for, all of which I expect you to know and understand. (However you do not need to comment on these points in your submission.)

On the alignment image
  • What do the different colored bars mean?
  • What is the information you get when you "mouse-over" a colored bar on the alignment image?
  • What happens when you click on one of the bars?
In the description list
  • Where does the link next to an identifier take you?
  • Where does the link in the "score" column take you?
  • What does the icon at the end of each row mean? What other icons could appear there?
In the alignment section
  • What do the alignment metrics mean:
    • Score?
    • Expect (E-value)?
    • Identities?
    • Positives?
    • Gaps?
  • What is the alignment length?
  • Which sequence is labeled Query and which one is labelled Sbjct?


Next, retrieve the sequences that have E-values low enough to make you conclude they contain APSES domain homologues.
  1. Review the sequences you have found: they should all be significantly similar to the query profile. In some of the assigned species you will find one hit for each distinct sequence in the genome, in others, you will find several versions of essentially the same gene (e.g. refseq and other accession numbers).
  2. Explore the relationship between the hits by clicking on select all sequences, then choosing Distance tree of results at the top or bottom of your search results to visualize a tree representation of similarity. Highly similar sequences will be collapsed into the same node in the distance tree; you can expand those nodes to list all the node's members.
  3. Identify one representative for each distinct protein you have found. If possible, use proteins with refseq identifiers. Avoid duplicates or nearly identical variants. If there are length differences, use the longer version (shorter versions may contain only partial sequences). Click on the checkbox next to each protein you have identified.
  4. Click on get selected sequences at the top or bottom of the page. Note and record the GIs for your sequences that are listed in the Search details box, you can use them to easily reproduce your results by pasting them into any Entrez search. Also note the URL that this has produced (in your browser's URL bar). As you see, you can retrieve a list of sequences from NCBI simply by adding a list of comma-separated GI numbers to the URL of the protein database.
  5. Click on Display settings and choose FASTA (text).
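The same kind of retrieval can be scripted by placing a comma-separated identifier list into a URL. The sketch below builds (but does not send) a request for the NCBI E-utilities `efetch` service. It is an illustration only: `fasta_url` is a name made up here, and the second identifier is a placeholder, not a real accession.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(identifiers):
    """Build an efetch URL that asks for the listed proteins as FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(identifiers),   # comma-separated identifier list
        "rettype": "fasta",
        "retmode": "text",
    }
    # safe="," keeps the commas readable instead of encoding them as %2C
    return EFETCH + "?" + urlencode(params, safe=",")


# NP_010227 is mentioned in this assignment; ACCESSION_2 is a placeholder.
url = fasta_url(["NP_010227", "ACCESSION_2"])
print(url)
```

Recording such a URL together with your search parameters is one way to make the retrieval step reproducible.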

If you want, for comparison, you can run a multiple alignment with an NCBI-developed MSA tool: COBALT. On the sequence list page, in the right-hand column, in the section Analyze these sequences, click on Align sequences with COBALT. It is a convenient way to get a quick first look at an alignment of NCBI retrieved sequences.

You now have a collection of APSES domain-containing homologues in your organism. There are two more tasks we need to address before we can compute alignments and analyze them. (A) we need to rename our sequences, and (B) we need to define the boundaries of their APSES domains.


(1.2) Renaming Sequences

A phylogenetic tree or multiple alignment is not really informative if it displays GI numbers or other abstract identifiers as labels of rows or nodes. The relationship between species is fundamental to the variation we observe and we need to make this relationship explicit.

Imagine that the rows in an MSA were completely unlabeled, or the nodes in the tree would be just circles: we would have a very hard time relating the computed relationships back to the biology they represent. Abstract identifiers like NP_010227 are not much better.

Typically, the information that programs use to label sequences is taken from the FASTA header. This provides us with an easy way to make sure they display the information we need and that we can interpret. Typically such programs will use the first few (often ten) characters they find. We will therefore design short strings that identify potential gene family relationships as well as species.
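If you prefer to script the renaming rather than edit headers by hand, something along the following lines would do. It is a sketch under assumptions: `relabel_fasta` is a hypothetical helper, the headers and the second accession in the example are placeholders, and the ten-character label width is just the typical value mentioned above.

```python
def relabel_fasta(fasta_text, new_names, label_width=10):
    """Put a short name in front of every header that contains a known accession.

    Returns the edited FASTA text and the labels a program would see if it
    only read the first `label_width` characters of each header.
    """
    out, labels = [], []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            for accession, name in new_names.items():
                if accession in line:
                    line = f">{name} {line[1:]}"   # keep the old header text
                    break
            labels.append(line[1:label_width + 1])
        out.append(line)
    if len(set(labels)) != len(labels):
        raise ValueError("labels are not unique within the first characters")
    return "\n".join(out), labels


# Placeholder headers and sequence fragments, for illustration only.
example = (
    ">ref|NP_010227| placeholder header text\n"
    "MSNQIYSARY\n"
    ">ref|ACCESSION_2| another placeholder header\n"
    "XXXXXXXXXX\n"
)
names = {"NP_010227": "MBP1_SACCE", "ACCESSION_2": "MBP1_ASPFU"}
relabeled, labels = relabel_fasta(example, names)
print(labels)
```

The uniqueness check matters because two headers that differ only after the tenth character would end up with the same label in the alignment or tree.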


Species codes

The scientific name of a species is formed according to Linnaean binomial nomenclature and Swissprot has for a long time condensed species names into mnemonic five-character codes: the first three characters are the first three letters of the genus name and the last two characters are the first two letters of the specific name. For example Saccharomyces cerevisiae is abbreviated as SACCE and Lachancea thermotolerans is LACTH. For the most part, this creates unique strings that are good mnemonic labels for the species. I have added these "codes" to the Species list.
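As a small illustration of the rule, the following sketch derives the code from a binomial name. The function name is made up, and the rule is only the simple one described here; always use the codes given in the Species list where they differ.

```python
def species_code(binomial):
    """First three letters of the genus plus first two of the specific name."""
    genus, specific = binomial.split()[:2]
    return (genus[:3] + specific[:2]).upper()


for name in ("Saccharomyces cerevisiae",
             "Lachancea thermotolerans",
             "Aspergillus fumigatus"):
    print(name, "->", species_code(name))
```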


Gene families

Most yeast genes have traditional names, like mbp1 or sok2. These names are convenient family labels since Saccharomyces cerevisiae is one of the best studied model organisms. Therefore, once we identify a protein family that includes a yeast gene, we can easily access expert knowledge in textbooks or manuscripts. Of course, such labels are arbitrary - whether we call a gene Mbp1 or WXYZ makes no difference - as long as all genes that we presume to be family members carry the same label. For higher eukaryotes, I would probably choose human gene names as a reference point, for bacteria I would choose E. coli.

To define which gene belongs into which family, we can align all newly found genes with all yeast APSES domain homologues, to find out which ones they are most similar to. This creates common family labels. We can use these as provisional family names for the encoded proteins, even though we may want to revise them once we have mapped out explicit phylogenetic trees.


Identifying APSES domains (general procedure).

In order to identify the APSES domain boundaries, you can simply run a multiple sequence alignment of the structurally defined APSES domain sequence (e.g. taken from PDB-ID 1MB1) against all sequences you have found. The boundaries of the aligned APSES domain then define the domain boundaries in the aligned proteins.
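Reading the boundaries off an alignment amounts to finding the first and last column occupied by the reference domain and counting the residues of the other sequence up to those columns. A minimal sketch, with a hypothetical function name and invented toy rows:

```python
def domain_bounds(aligned_ref, aligned_target, gap="-"):
    """Return 1-based (start, end) residue numbers of the target sequence
    that lie under the first and last residue of the aligned reference."""
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("rows of an alignment must have equal length")
    ref_columns = [i for i, c in enumerate(aligned_ref) if c != gap]
    first, last = ref_columns[0], ref_columns[-1]
    # count target residues before the first reference column ...
    start = sum(c != gap for c in aligned_target[:first]) + 1
    # ... and up to and including the last reference column
    end = sum(c != gap for c in aligned_target[:last + 1])
    return start, end


# Invented toy rows: the reference domain spans columns 5 to 17.
ref_row    = "----IHSTGSIMK-RKK----"
target_row = "MPAKIYSA-SIMKGRKKDEF-"
print(domain_bounds(ref_row, target_row))
```

If the target has gaps at the edges of the domain, the returned numbers point to the nearest residues inside the domain region; you should still inspect the alignment by eye.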


Identifying family relationships (in the same run)

However, for efficiency, we can also determine family relationships in the same alignment that we use to define domain boundaries, if we simply include all yeast APSES domains in the MSA. Then we can judge similarity simply from examining the guide tree of the alignment and label the families accordingly. This has the added advantage that the domain boundaries are more securely defined, since we include more sequence information into the alignment.

Proceed as follows.
  1. Open the Muscle MSA input page at the EBI.
  2. Access the Yeast APSES domain collection I have prepared and copy the FASTA sequences. Paste them into the sequence field of the MUSCLE program input form.
  3. Copy the FASTA sequences of the full length APSES domain protein sequence collection from your PSI-BLAST search (above) and paste them into the MUSCLE input form as well.
  4. Set the following parameters:
       OUTPUT FORMAT: CLUSTALW2
       OUTPUT TREE: from second iteration
       OUTPUT ORDER: aligned
  5. Click on Submit.


The output should show the MSA. The overlap of the yeast APSES domains with your sequences defines the domain boundaries. Moreover, a tree has been calculated and you can view the tree to identify family relationships.

Visualize the alignment tree and decide on names

Click on the link to the Guide tree. This is the so-called Newick tree format and there are a large number of online tree viewers to visualize such trees. The MUSCLE form will display one tree for you.

You could also navigate (for example) to the proWeb Tree viewer and paste the tree data into the User-supplied Newick Tree input field. Choose any graphics format your browser can handle (JPEG is a pretty safe bet) and click on View tree.
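The Newick format is plain text: nested parentheses describe the grouping, and each leaf is written as a name, optionally followed by a colon and a branch length. To see which labels a tree contains before pasting it into a viewer, a rough sketch like the following is enough. It assumes unquoted leaf names, and the example tree with its branch lengths is invented.

```python
import re


def newick_leaves(newick):
    """List leaf names of a Newick string (unquoted names only).

    A leaf name directly follows "(" or ","; labels of internal nodes
    follow ")" and are therefore not matched.
    """
    return [name.strip()
            for name in re.findall(r"[(,]\s*([^():,;]+)", newick)]


# Invented example tree: two yeast proteins, each grouped with one hit.
tree = ("((MBP1_SACCE:0.10,protein_1:0.12):0.05,"
        "(SOK2_SACCE:0.20,protein_2:0.25):0.07);")
print(newick_leaves(tree))
```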


  1. Interpret the tree to decide on the protein family names for your sequences:
    1. If a yeast protein is grouped with exactly one of your proteins, your protein gets the same name.
    2. If a yeast protein is grouped with more than one of your proteins, replace the number in the yeast protein with a, b, c ..., from most similar to least similar for your protein. For example: if one Aspergillus fumigatus protein is most similar to yeast Mbp1, you will give it the name MBP1_ASPFU. If two proteins are both most similar to yeast Sok2, you will name them SOKA_ASPFU and SOKB_ASPFU. Try to get it approximately right but remember that this is a process of estimation - we are not accurately measuring distances (yet).
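The naming rule can be written down compactly. This is a sketch of the rule as stated above, nothing more: the function name is hypothetical, and deciding how many of your proteins group with a yeast protein, and in which order of similarity, remains your judgement from the tree.

```python
import re
import string


def family_labels(yeast_name, species_code, n_members):
    """Labels for the proteins of one species that group with a yeast protein.

    n_members is the number of such proteins; the returned labels are
    meant to be assigned from most similar to least similar.
    """
    stem = yeast_name.upper()
    if n_members == 1:
        return [f"{stem}_{species_code}"]
    base = re.sub(r"\d+$", "", stem)      # drop the trailing number
    return [f"{base}{string.ascii_uppercase[i]}_{species_code}"
            for i in range(n_members)]


print(family_labels("Mbp1", "ASPFU", 1))
print(family_labels("Sok2", "ASPFU", 2))
```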

That done, edit your FASTA headers and save your APSES domain sequence set. We will need them for the next assignment.


(2) Align and Annotate

 


(2.1) Review of domain annotations

APSES domains are relatively easy to identify and annotate but we have had problems with the ankyrin domains in Mbp1 homologues. Both CDD and SMART have identified such domains, but while the domain model was based on the same Pfam profile for both, and both annotated approximately the same regions, the details of the alignments and the extent of the predicted region were different.

Mbp1 forms heterodimeric complexes with a homologue, Swi6. Swi6 does not have an APSES domain, thus it does not bind DNA. But it is similar to Mbp1 in the region spanning the ankyrin domains and in 1999 Foord et al. published its crystal structure (1SW6). This structure is a good model for Ankyrin repeats in Mbp1. For details, please refer to the consolidated Mbp1 annotation page I have prepared.

In what follows, we will use the program JALVIEW - a Java based multiple sequence alignment editor to load and align sequences and to consider structural similarity between yeast Mbp1 and its closest homologue in your organism.

In this part of the assignment,

  1. You will load sequences that are most similar to Mbp1 into an MSA editor;
  2. You will add sequences of ankyrin domain models;
  3. You will perform a multiple sequence alignment;
  4. You will try to improve the alignment manually.


(2.2) Jalview, loading sequences

Geoff Barton's lab in Dundee has developed an integrated MSA editor and sequence annotation workbench with a number of very useful functions. It is written in Java and should run on Mac, Linux and Windows platforms without modifications. We will use this tool for this assignment and explore its features as we go along.

  1. Navigate to the Jalview homepage, click on Download, install Jalview on your computer and start it. A number of windows that showcase the program's abilities will load; you can close these.
  2. Prepare homologous Mbp1 sequences for alignment:
    1. Find the sequence in your assigned species that fulfills the Reciprocal Best Match criterion with yeast Mbp1.
    2. Open the Mbp1 RBM reference sequences page.
    3. Copy the FASTA sequences of the reference proteins, return to Jalview and select File → Input Alignment → from Textbox and paste the sequences into the textbox.
    4. Also paste a FASTA sequence of your species' Mbp1 protein into the window.
    5. Finally copy the sequences for ankyrin domain models (below) and paste them into the Jalview textbox as well. Paste two separate copies of the CD00204 consensus sequence and one copy of 1SW6.
    6. When all the sequences are present, click on New Window. Jalview gives you all the sequences, but of course this is not yet an alignment.
Ankyrin domain models
>CD00204 ankyrin repeat consensus sequence from CDD
NARDEDGRTPLHLAASNGHLEVVKLLLENGADVNAKDNDGRTPLHLAAKNGHLEIVKLLL
EKGADVNARDKDGNTPLHLAARNGNLDVVKLLLKHGADVNARDKDGRTPLHLAAKNGHL
>1SW6 from PDB - unstructured loops replaced with xxxx
GPIITFTHDLTSDFLSSPLKIMKALPSPVVNDNEQKMKLEAFLQRLLFxxxxSFDSLLQE
VNDAFPNTQLNLNIPVDEHGNTPLHWLTSIANLELVKHLVKHGSNRLYGDNMGESCLVKA
VKSVNNYDSGTFEALLDYLYPCLILEDSMNRTILHHIIITSGMTGCSAAAKYYLDILMGW
IVKKQNRPIQSGxxxxDSILENLDLKWIIANMLNAQDSNGDTCLNIAARLGNISIVDALL
DYGADPFIANKSGLRPVDFGAG


(2.3) Computing alignments

Sequence alignments can be calculated directly from Jalview.

  1. In Jalview, select Web Service → Alignment → MAFFT Multiple Protein Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
  2. Choose Colour → Hydrophobicity and → by Conservation. Then select Modify Conservation Threshold... and adjust the slider left or right to see which columns are highly conserved. You will notice that the Swi6 sequence that was supposed to align only to the ankyrin domains was in fact aligned to other parts of the sequence as well. This is one part of the MSA that we will have to correct manually and a common problem when aligning sequences of different lengths.
  3. Other alignment algorithms are available and you may wish to explore whether the alignments differ significantly.
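To get a feeling for what a per-column conservation score expresses, here is a deliberately simple stand-in: the fraction of rows that carry the most common residue in each column. Jalview computes its conservation score differently, so the numbers will not match what the slider shows. The rows are the four sequences of the CLUSTAL excerpt used further below in this assignment.

```python
from collections import Counter


def column_conservation(rows, gap="-"):
    """Fraction of rows sharing the most common residue, per column."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("rows of an alignment must have equal length")
    scores = []
    for column in zip(*rows):
        counts = Counter(c.upper() for c in column if c != gap)
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(rows))    # gaps lower the score
    return scores


rows = [
    "MKHEKVQ------GGYGRFQ---GTW",
    "KIKNVVK------VGSMNLK---GVW",
    "VDSKHP" + "-" * 11 + "QID---GVW",
    "EICHSIT------GGALAAQ---GYW",
]
scores = column_conservation(rows)
threshold = 0.75
print([i + 1 for i, s in enumerate(scores) if s >= threshold])
```

Raising or lowering `threshold` plays the same role as moving the slider: fewer or more columns count as conserved.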


(2.4) Editing ankyrin domain alignments


A good MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since it is a result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. The contiguous features annotated for Mbp1 are expected to be left intact by a good alignment.

A poor MSA has many errors in its columns; these contain residues that actually have different functions or structural roles, even though they may look similar according to a (pairwise!) scoring matrix. A poor MSA also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted in a poor alignment and residues that are conserved may be placed into different columns.

Often errors or inconsistencies are easy to spot, and manually editing an MSA is not generally frowned upon, even though this is not a strictly objective procedure. The main goal of manual editing is to make an alignment biologically more plausible. Most commonly this means to minimize the number of rare evolutionary events that the alignment suggests and/or to emphasize conservation of known functional motifs. Here are some examples for what one might aim for in manually editing an alignment:

Reduce number of indels
From a Probcons alignment:
0447_DEBHA    ILKTE-K-T---K--SVVK      ILKTE----KTK---SVVK
9978_GIBZE    MLGLN-PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
1513_CANAL    ILKTE-K-I---K--NVVK      ILKTE----KIK---NVVK
6132_SCHPO    ELDDI-I-ESGDY--ENVD      ELDDI-IESGDY---ENVD
1244_ASPFU    ----N-PGLREIC--HSIT  ->  ----NPGLREIC---HSIT
0925_USTMA    LVKTC-PALDPHI--TKLK      LVKTCPALDPHI---TKLK
2599_ASPTE    VLDAN-PGLREIS--HSIT      VLDANPGLREIS---HSIT
9773_DEBHA    LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
0918_CANAL    LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR

Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably; however, the total number of indels in this excerpt is reduced to 13 from the original 22.
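The indel counts can be reproduced by counting every uninterrupted run of gap characters as one indel, which is the assumption made in this sketch. It also checks that the edit only moved gaps and left the residues themselves untouched.

```python
import re

before = [
    "ILKTE-K-T---K--SVVK", "MLGLN-PGLKEIT--HSIT", "ILKTE-K-I---K--NVVK",
    "ELDDI-I-ESGDY--ENVD", "----N-PGLREIC--HSIT", "LVKTC-PALDPHI--TKLK",
    "VLDAN-PGLREIS--HSIT", "LLESTPKQYHQHI--KRIR", "LLESTPKEYQQYI--KRIR",
]
after = [
    "ILKTE----KTK---SVVK", "MLGLNPGLKEIT---HSIT", "ILKTE----KIK---NVVK",
    "ELDDI-IESGDY---ENVD", "----NPGLREIC---HSIT", "LVKTCPALDPHI---TKLK",
    "VLDANPGLREIS---HSIT", "LLESTPKQYHQHI--KRIR", "LLESTPKEYQQYI--KRIR",
]


def count_indels(rows):
    """Count runs of consecutive gap characters, summed over all rows."""
    return sum(len(re.findall(r"-+", row)) for row in rows)


def same_residues(rows_a, rows_b):
    """True if both versions contain identical ungapped sequences."""
    return all(a.replace("-", "") == b.replace("-", "")
               for a, b in zip(rows_a, rows_b))


print(count_indels(before), count_indels(after), same_residues(before, after))
```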


Move indels to more plausible position
From a CLUSTAL alignment:
4966_CANGL     MKHEKVQ------GGYGRFQ---GTW      MKHEKVQ------GGYGRFQ---GTW
1513_CANAL     KIKNVVK------VGSMNLK---GVW      KIKNVVK------VGSMNLK---GVW
6132_SCHPO     VDSKHP-----------QID---GVW  ->  VDSKHPQ-----------ID---GVW
1244_ASPFU     EICHSIT------GGALAAQ---GYW      EICHSIT------GGALAAQ---GYW

The two characters marked in red were swapped. This does not change the number of indels but places the "Q" into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.

Conserve motifs
From a CLUSTAL alignment:
6166_SCHPO      --DKRVA---GLWVPP      --DKRVA--G-LWVPP
XBP1_SACCE      GGYIKIQ---GTWLPM      GGYIKIQ--G-TWLPM
6355_ASPTE      --DEIAG---NVWISP  ->  ---DEIA--GNVWISP
5262_KLULA      GGYIKIQ---GTWLPY      GGYIKIQ--G-TWLPY

The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.


The Ankyrin domains are quite highly diverged, the boundaries are not well defined, and not even CDD, SMART and SAS agree on the precise annotations. We expect there to be alignment errors in this region. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required indels would be placed between the secondary structure elements, not in their middle. But from the sequence alignment alone, we cannot judge where the secondary structure elements ought to be. You should therefore add the following "sequence" to the alignment; it contains exactly as many characters as the Swi6 sequence above and annotates the secondary structure elements. I have derived it from the 1SW6 structure.

>SecStruc 1SW6 E: strand   t: turn   H: helix   _: irregular
_EEE__tt___ttt______EE_____t___HHHHHHHHHHHHHHHH_xxxx_HHHHHHH
HHHH_t_____t_____t____HHHHHHH__tHHHHHHHHH____t___tt____HHHHH
HH__HHHH___HHHHHHHHHHHHHEE_t____HHHHHHHHH__t__HHHHHHHHHHHHHH
HHHHHH__EEE_xxxx_HHHHHt_HHHHHHH______t____HHHHHHHH__HHHHHHHH
H____t____t____HHHH___
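Before you start editing, it can be worth confirming that an annotation row really matches its sequence character for character. The following sketch (hypothetical function name) checks the length, the annotation alphabet, and that the "xxxx" placeholders for unstructured loops sit at the same positions in both rows. The example uses the last 15 characters of the first line of each of the two records above.

```python
def check_annotation(sequence, annotation, alphabet="EtH_x"):
    """Raise ValueError if the annotation does not fit the sequence.

    Returns the common length if everything is consistent.
    """
    if len(sequence) != len(annotation):
        raise ValueError(f"lengths differ: {len(sequence)} vs {len(annotation)}")
    pairs = zip(sequence, annotation)
    for pos, (residue, state) in enumerate(pairs, start=1):
        if state not in alphabet:
            raise ValueError(f"unexpected character {state!r} at {pos}")
        if (residue == "x") != (state == "x"):
            raise ValueError(f"loop placeholder mismatch at {pos}")
    return len(sequence)


seq_part = "LLFxxxxSFDSLLQE"   # 1SW6, first line, characters 46-60
ss_part  = "HH_xxxx_HHHHHHH"   # SecStruc, first line, characters 46-60
print(check_annotation(seq_part, ss_part))
```

To check the full records, join the lines of each record into one string (without the header line) and pass both strings to the function.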


To proceed:

  1. You should manually align the Swi6 sequence with yeast Mbp1
  2. You should bring the Secondary structure annotation into its correct alignment with Swi6
  3. You should bring both CDD ankyrin profiles into the correct alignment with yeast Mbp1

Proceed along the following steps:

  1. Add the secondary structure annotation to the sequence alignment in Jalview. Copy, select File → Add sequences → from Textbox and paste the sequence.
  2. Select Help → Documentation and read about Editing Alignments, Cursor Mode and Key strokes.
  3. Click on the yeast Mbp1 sequence row to select the entire row. Then use the cursor key to move that sequence directly above the 1SW6 sequence. Select the row of 1SW6 and use shift/mouse to move the sequence elements and realign them with yeast Mbp1. Refer to the alignment given in the Mbp1 annotation page.
  4. Align the secondary structure elements with the 1SW6 sequence: Every character of 1SW6 should be matched with either E, t, H, or _. The result should be similar to the Mbp1 annotation page. If you need to insert gaps into all sequences in the alignment, simply drag your mouse over all row headers - movement of sequences is constrained to selected regions, the rest is locked into place to prevent inadvertent misalignments. Remember to save your project from time to time: File → save so you can reload a previous state if anything goes wrong and can't be fixed with Edit → Undo.
  5. Finally align the two CD00204 consensus sequences to their correct positions (again, refer to the Mbp1 annotation page).
  6. You can now consider the principles stated above and see if you can improve the alignment, for example by moving indels out of regions of secondary structure if that is possible without changing the character of the aligned columns significantly. Select blocks within which to work to leave the remaining alignment unchanged. So that this does not become tedious, you can restrict your editing to one Ankyrin repeat that is structurally defined in Swi6. You may want to open the 1SW6 structure in VMD to define the boundaries of one such repeat. You can copy and paste sections from Jalview into your assignment for documentation or export sections of the alignment to HTML (see the example below).


(2.4.1) Editing ankyrin domain alignments - Sample

This sample was created by

  1. Editing the alignments as described above;
  2. Copying a block of aligned sequence;
  3. Pasting it To New Alignment;
  4. Colouring the residues by Hydrophobicity and setting the colour saturation according to Conservation;
  5. Choosing File → Export Image → HTML and pasting the resulting HTML source into this Wikipage.


(column ruler: 10 | 20 | 30 | 40)
MBP1_USTMA/341-368   - - Y G D Q L - - - A D - - - - - - - - - - I L - - - - N F Q D D E G E T P L T M A A R A R S
MBP1B_SCHCO/470-498   - R E D G D Y - - - K S - - - - - - - - - - F L - - - - D L Q D E H G D T A L N I A A R V G N
MBP1_ASHGO/465-494   F S P Q Y R I - - - E T - - - - - - - - - - L I - - - - N A Q D C K G S T P L H I A A M N R D
MBP1_CLALU/550-586   G N Q N G N S N D K K E - - - - - - - - - - L I S K F L N H Q D N E G N T A F H I A A Y N M S
MBPA_COPCI/514-542   - H E G G D F - - - R S - - - - - - - - - - L V - - - - D L Q D E H G D T A I N I A A R V G N
MBP1_DEBHA/507-550   I R D S Q E I - - - E N K K L S L S D K K E L I A K F I N H Q D I D G N T A F H I V A Y N L N
MBP1A_SCHCO/388-415   - - Y P K E L - - - A D - - - - - - - - - - V L - - - - N F Q D E D G E T A L T M A A R C R S
MBP1_AJECA/374-403   T L P P H Q I - - - S M - - - - - - - - - - L L - - - - S S Q D S N G D T A A L A A A K N G C
MBP1_PARBR/380-409   I L P P H Q I - - - S L - - - - - - - - - - L L - - - - S S Q D S N G D T A A L A A A K N G C
MBP1_NEOFI/363-392   T C S Q D E I - - - D L - - - - - - - - - - L L - - - - S C Q D S N G D T A A L V A A R N G A
MBP1_ASPNI/365-394   T F S P E E V - - - D L - - - - - - - - - - L L - - - - S C Q D S V G D T A V L V A A R N G V
MBP1_UNCRE/377-406   M Y P H H E V - - - G L - - - - - - - - - - L L - - - - A S Q D S N G D T A A L T A A K N G C
MBP1_PENCH/439-468   T C S Q D E I - - - Q M - - - - - - - - - - L L - - - - S C Q D Q N G D T A V L V A A R N G A
MBPA_TRIVE/407-436   V F P R H E I - - - S L - - - - - - - - - - L L - - - - S S Q D A N G D T A A L T A A K N G C
MBP1_PHANO/400-429   T W I P E E V - - - T R - - - - - - - - - - L L - - - - N A Q D Q N G D T A I M I A A R N G A
MBPA_SCLSC/294-313   - - - - - - - - - - - - - - - - - - - - - - - L - - - - D A R D I N G N T A I H I A A K N K A
MBPA_PYRIS/363-392   T W I P E E V - - - T R - - - - - - - - - - L L - - - - N A A D Q N G D T A I M I A A R N G A
MBP1_/361-390   - - - N H S L G V L S Q - - - - - - - - - - F M - - - - D T Q N N E G D T A L H I L A R S G A
MBP1_ASPFL/328-364   T E Q P G E V I T L G R - - - - - - - - - - F I S E I V N L R D D Q G D T A L N L A G R A R S
MBPA_MAGOR/375-404   Q H D P N F V - - - Q Q - - - - - - - - - - L L - - - - D A Q D N D G N T A V H L A A Q R G S
MBP1_CHAGL/361-390   S R S A D E L - - - Q Q - - - - - - - - - - L L - - - - D S Q D N E G N T A V H L A A M R D A
MBP1_PODAN/372-401   V R Q P E E V - - - Q A - - - - - - - - - - L L - - - - D A Q D E E G N T A L H L A A R V N A
MBP1_LACTH/458-487   F S P R Y R I - - - E N - - - - - - - - - - L I - - - - N A Q D Q N G D T A V H L A A Q N G D
MBP1_FILNE/433-460   - - Y P Q E L - - - A D - - - - - - - - - - V I - - - - N F Q D E E G E T A L T I A A R A R S
MBP1_KLULA/477-506   F T P Q Y R I - - - D V - - - - - - - - - - L I - - - - N Q Q D N D G N S P L H Y A A T N K D
MBP1_SCHST/468-501   A K D P D N K - - - K D - - - - - - - - - - L I A K F I N H Q D S D G N T A F H I C S H N L N
MBP1_SACCE/496-525   F S P Q Y R I - - - E L - - - - - - - - - - L L - - - - N T Q D K N G D T A L H I A S K N G D
CD00204/1-19   - - - - - - - - - - - - - - - - - - - - - - - - - - - - N A R D E D G R T P L H L A A S N G H
CD00204/99-118   - - - - - - - - - - - - - - - - - - - - - - - V - - - - N A R D K D G R T P L H L A A K N G H
1SW6/203-232   L D L K W I I - - - A N - - - - - - - - - - M L - - - - N A Q D S N G D T C L N I A A R L G N
SecStruc/203-232   t _ H H H H H - - - H H - - - - - - - - - - _ _ - - - - _ _ _ _ t _ _ _ _ H H H H H H H H _ _
Aligned sequences before editing. The algorithm has placed gaps into the Swi6 helix LKWIIAN, and the four-residue gaps before the block of well-aligned sequence on the right are poorly supported.


(Alignment column ruler: positions 10, 20, 30, 40)
MBP1_USTMA/341-368   - - Y G D Q L A D - - - - - - - - - - - - - - I L N F Q D D E G E T P L T M A A R A R S
MBP1B_SCHCO/470-498   - R E D G D Y K S - - - - - - - - - - - - - - F L D L Q D E H G D T A L N I A A R V G N
MBP1_ASHGO/465-494   F S P Q Y R I E T - - - - - - - - - - - - - - L I N A Q D C K G S T P L H I A A M N R D
MBP1_CLALU/550-586   G N Q N G N S N D K K E - - - - - - - L I S K F L N H Q D N E G N T A F H I A A Y N M S
MBPA_COPCI/514-542   - H E G G D F R S - - - - - - - - - - - - - - L V D L Q D E H G D T A I N I A A R V G N
MBP1_DEBHA/507-550   I R D S Q E I E N K K L S L S D K K E L I A K F I N H Q D I D G N T A F H I V A Y N L N
MBP1A_SCHCO/388-415   - - Y P K E L A D - - - - - - - - - - - - - - V L N F Q D E D G E T A L T M A A R C R S
MBP1_AJECA/374-403   T L P P H Q I S M - - - - - - - - - - - - - - L L S S Q D S N G D T A A L A A A K N G C
MBP1_PARBR/380-409   I L P P H Q I S L - - - - - - - - - - - - - - L L S S Q D S N G D T A A L A A A K N G C
MBP1_NEOFI/363-392   T C S Q D E I D L - - - - - - - - - - - - - - L L S C Q D S N G D T A A L V A A R N G A
MBP1_ASPNI/365-394   T F S P E E V D L - - - - - - - - - - - - - - L L S C Q D S V G D T A V L V A A R N G V
MBP1_UNCRE/377-406   M Y P H H E V G L - - - - - - - - - - - - - - L L A S Q D S N G D T A A L T A A K N G C
MBP1_PENCH/439-468   T C S Q D E I Q M - - - - - - - - - - - - - - L L S C Q D Q N G D T A V L V A A R N G A
MBPA_TRIVE/407-436   V F P R H E I S L - - - - - - - - - - - - - - L L S S Q D A N G D T A A L T A A K N G C
MBP1_PHANO/400-429   T W I P E E V T R - - - - - - - - - - - - - - L L N A Q D Q N G D T A I M I A A R N G A
MBPA_SCLSC/294-313   - - - - - - - - - - - - - - - - - - - - - - - - L D A R D I N G N T A I H I A A K N K A
MBPA_PYRIS/363-392   T W I P E E V T R - - - - - - - - - - - - - - L L N A A D Q N G D T A I M I A A R N G A
MBP1_/361-390   N H S L G V L S Q - - - - - - - - - - - - - - F M D T Q N N E G D T A L H I L A R S G A
MBP1_ASPFL/328-364   T E Q P G E V I T L G R F I S E - - - - - - - I V N L R D D Q G D T A L N L A G R A R S
MBPA_MAGOR/375-404   Q H D P N F V Q Q - - - - - - - - - - - - - - L L D A Q D N D G N T A V H L A A Q R G S
MBP1_CHAGL/361-390   S R S A D E L Q Q - - - - - - - - - - - - - - L L D S Q D N E G N T A V H L A A M R D A
MBP1_PODAN/372-401   V R Q P E E V Q A - - - - - - - - - - - - - - L L D A Q D E E G N T A L H L A A R V N A
MBP1_LACTH/458-487   F S P R Y R I E N - - - - - - - - - - - - - - L I N A Q D Q N G D T A V H L A A Q N G D
MBP1_FILNE/433-460   - - Y P Q E L A D - - - - - - - - - - - - - - V I N F Q D E E G E T A L T I A A R A R S
MBP1_KLULA/477-506   F T P Q Y R I D V - - - - - - - - - - - - - - L I N Q Q D N D G N S P L H Y A A T N K D
MBP1_SCHST/468-501   A K D P D N K K D - - - - - - - - - - L I A K F I N H Q D S D G N T A F H I C S H N L N
MBP1_SACCE/496-525   F S P Q Y R I E L - - - - - - - - - - - - - - L L N T Q D K N G D T A L H I A S K N G D
CD00204/1-19   - - - - - - - - - - - - - - - - - - - - - - - - - N A R D E D G R T P L H L A A S N G H
CD00204/99-118   - - - - - - - - - - - - - - - - - - - - - - - - V N A R D K D G R T P L H L A A K N G H
1SW6/203-232   L D L K W I I A N - - - - - - - - - - - - - - M L N A Q D S N G D T C L N I A A R L G N
SecStruc/203-232   t _ H H H H H H H - - - - - - - - - - - - - - _ _ _ _ _ _ t _ _ _ _ H H H H H H H H _ _
Aligned sequences after editing. A significant cleanup of the frayed region is possible. Now there is only one insertion event, and it is placed into the loop that connects two helices of the 1SW6 structure.
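When you edit an alignment by hand, it is worth checking that you have only moved gaps and not accidentally changed, deleted or duplicated residues. The following is a minimal Python sketch of such a check, using the 1SW6 row copied from the two alignments above. The function names (columns, degap, internal_gap_runs) are made up for this illustration and are not part of any alignment editor.

```python
import re

# The 1SW6 row, copied from the alignment before and after editing.
# Format assumed here: a label, then whitespace-separated column symbols.
BEFORE = "1SW6/203-232   L D L K W I I - - - A N - - - - - - - - - - M L - - - - N A Q D S N G D T C L N I A A R L G N"
AFTER = "1SW6/203-232   L D L K W I I A N - - - - - - - - - - - - - - M L N A Q D S N G D T C L N I A A R L G N"


def columns(row):
    """Drop the label and return the aligned row as one string of column symbols."""
    return "".join(row.split()[1:])


def degap(row):
    """Return the residues of a row with all gap characters removed."""
    return columns(row).replace("-", "")


def internal_gap_runs(row):
    """Count runs of consecutive gaps, ignoring leading and trailing gaps."""
    return len(re.findall(r"-+", columns(row).strip("-")))


# Editing must not change the underlying sequence ...
print("same residues:", degap(BEFORE) == degap(AFTER))
# ... but it should reduce the number of separate gap runs.
print("gap runs before:", internal_gap_runs(BEFORE))
print("gap runs after:", internal_gap_runs(AFTER))
```

For the 1SW6 row, the residues are unchanged while the three separate gap runs of the original alignment have been merged into one. Note that a gap run in a single row is only a rough proxy for an insertion event; you still need to look at the whole alignment to interpret it.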


(2.5) Final analysis


  • Compare the distribution of indels in the ankyrin repeat regions of your alignments. Review whether the indels in this region are concentrated in segments that connect the helices, or whether they are more or less evenly distributed along the entire region of similarity. Think about whether the assertion that indels should not be placed in elements of secondary structure has merit in your alignment. Recognize that an indel in an element of secondary structure could be interpreted in a number of different ways:
    • The alignment is correct, the annotation is correct too: the indel is tolerated in that particular case, for example by extending the length of an α-helix or β-strand;
    • The alignment algorithm has made an error, the structural annotation is correct: the indel should be moved a few residues;
    • The alignment is correct, the structural annotation is wrong: this is not a secondary structure element after all;
    • Both the algorithm and the annotation are probably wrong, but we have no data to improve the situation.

(NB: remember that the structural annotations have been made for the yeast protein and might have turned out differently for the other proteins...)

You should be able to analyse discrepancies between annotation and expectation in a structured and systematic way. In particular, if you notice indels that have been placed into regions annotated as secondary structure, you should be able to comment on whether the location of the indel has strong support from aligned sequence motifs, or whether the indel could be moved to a different location without much loss in alignment quality.
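One way to make this review systematic is to list every gap run in the secondary structure annotation row and check what flanks it. The following Python sketch does this for the SecStruc rows of the two example alignments. It assumes that H marks helix residues as in the annotation shown above, and that E would mark β-strand residues (an assumption, since no strands appear in this region). A gap run flanked on both sides by the same element type is reported as lying inside that element. The function names are made up for this illustration.

```python
# SecStruc rows copied from the example alignments before and after editing.
SS_BEFORE = "SecStruc/203-232   t _ H H H H H - - - H H - - - - - - - - - - _ _ - - - - _ _ _ _ t _ _ _ _ H H H H H H H H _ _"
SS_AFTER = "SecStruc/203-232   t _ H H H H H H H - - - - - - - - - - - - - - _ _ _ _ _ _ t _ _ _ _ H H H H H H H H _ _"


def gap_runs(cols, gap="-"):
    """Return (start, end) column indices (end exclusive) of consecutive gaps."""
    runs, start = [], None
    for i, symbol in enumerate(cols):
        if symbol == gap and start is None:
            start = i
        elif symbol != gap and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(cols)))
    return runs


def classify_gaps(row, elements=("H", "E")):
    """Label each internal gap run of an annotation row by its flanking symbols.

    'E' for strands is an assumption; only 'H', 't' and '_' occur in the example.
    """
    cols = row.split()[1:]  # drop the row label
    result = []
    for start, end in gap_runs(cols):
        if start == 0 or end == len(cols):
            continue  # leading or trailing gaps are not flanked on both sides
        left, right = cols[start - 1], cols[end]
        inside = left == right and left in elements
        result.append((start, end, "inside element" if inside else "between elements"))
    return result


for name, row in (("before", SS_BEFORE), ("after", SS_AFTER)):
    for start, end, label in classify_gaps(row):
        print(name, "columns", start + 1, "to", end, label)
```

A gap run that is reported as lying inside an element is not automatically an error. It is a candidate for the kind of case-by-case interpretation listed above.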


Analysis (2 marks)
  • Considering the whole alignment and your experience with editing, please note in your assignment your assessment of whether the position of indels relative to structural features of the ankyrin domains in your organism's Mbp1 protein is reliable.
  • CDD extends the ankyrin domain annotation beyond the 1SW6 domain boundaries. Given your assessment of conservation in that region, do you think that this is reasonable in your organism's protein? Is there evidence for this in the alignment of the CD00204 consensus with well-aligned blocks of sequence beyond the positions that match Swi6?


(3) Summary of Resources

 

Links
Lists


Further reading

 

[End of assignment]

 

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List.