BIO Assignment 3 2011

A carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of a gene or protein. MSAs combine the information from several related proteins, allowing us to study their essential, shared properties. They are useful to resolve ambiguities in the precise placement of gaps and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. Therefore we need MSAs as input for

protein homology modeling,
phylogenetic analyses, and
sensitive homology searches in databases.

Furthermore, conservation - or the lack of conservation - reflects the requirements of structural or functional features of our protein, emphasizes domain boundaries in multi-domain proteins, and can guide mutations for protein engineering and design.

Given the ubiquitous importance of this procedure, it is somewhat surprising that by far the most frequently used algorithm is CLUSTAL, which has been shown to be significantly inferior to more modern approaches for sequences with about 30% identity or less.

In this assignment we will explore MSAs of the Mbp1 proteins and the APSES domains they contain and try several approaches to alignment:

A model-based approach (based on the PSSM that PSI-BLAST generates)
A progressive alignment - the CLUSTAL algorithm
A consistency based alignment - T-COFFEE or Probcons

Preparation, submission and due date

Please read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which people have simply overlooked crucial questions. Sadly, we always get assignments back in which people have not described procedural details. If you did not notice that the above were two different sentences, you are still not reading carefully enough.

Prepare a Microsoft Word document with a title page that contains:

your full name
your Student ID
your e-mail address
the organism name you have been assigned (see below)

Follow the steps outlined below. You are encouraged to write your answers in short answer form or point form, like you would document an analysis in a laboratory notebook. However, you must

document what you have done,
note what Web sites and tools you have used,
paste important data sequences, alignments, information etc.

If you do not document the process of your work, we will deduct marks. Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formatting. Do not paste screendumps. Keep the size of your submission below 1.5 MB.

Write your answers into separate paragraphs and give each its title. Save your document with a filename of: A3_family name.given name.doc (for example my first assignment would be named: A3_steipe.boris.doc - and don't switch the order of your given name and family name please!)

Finally e-mail the document to [boris.steipe@utoronto.ca] before the due date.

Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.

With the number of students in the course, we have to economize on processing the assignments. Thus we will not accept assignments that are not prepared as described above. If you have technical difficulties, contact me.

The due date for the assignment is XXXXX at 10:00 in the morning.

Grading

Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted at the first minute of every twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed.

Marks are noted below in the section headings for each of the tasks. A total of 10 marks will be awarded, if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings, or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will

count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
be divided by two for BCH1441 (graduates).

Retrieve

In Assignment 2 you had retrieved the Saccharomyces cerevisiae Mbp1 protein sequence.

Mbp1 homologues

Our first task is to compile a multi-FASTA file for all Mbp1 orthologues. First we need to define which sequences we are talking about.

In your second assignment, you used BLAST to find the best matches to the yeast Mbp1 protein in your assigned organism's genome. Since there was some variation in the sequences you reported, I have generated a list de novo using the following procedure:

Retrieved the Mbp1 protein sequence by searching Entrez for Mbp1 AND "saccharomyces cerevisiae"[organism]
Clicked on the RefSeq tab to find the RefSeq ID "NP_010227"
Accessed the BLAST form for protein/protein BLAST and pasted the RefSeq ID into the query field. Chose refseq as the database to search in, from the drop-down menu. Kept default parameters but turned Filter off. Chose Fungi as an ENTREZ query limit in the Options section.
On the results page, checked the checkbox next to the alignment of the most significant hit from each of the organisms we are studying.
Clicked on the "Get selected sequences" button. The results page lists the gene that is most similar to Mbp1 in each organism.
Verified that each of these sequences finds Mbp1 as the best match in the Saccharomyces cerevisiae genome by clicking on each "BLink" (click for example) in the retrieved list. Scrolled down the list to confirm that the top hit of a Saccharomyces cerevisiae protein is indeed Mbp1 (NP_010227).
Obtained UniProt accessions for all sequences, with a single query using the new UniProt ID mapping service. This service accepts a comma delimited list of RefSeq IDs and returns a list of UniProt proteins.
Assembled this information into the following table.

*Organism*	`CODE`	GI	Refseq	Uniprot Accession	Most similar yeast gene
Aspergillus fumigatus	`ASPFU`	70986922	XP_748947	Q4WGN2	Mbp1
Aspergillus nidulans	`ASPNI`	67525393	XP_660758	Q5B8H6	Mbp1
Aspergillus terreus	`ASPTE`	115391425	XP_001213217	Q0CQJ5	Mbp1
Candida albicans	`CANAL`	68465419	XP_723071	Q5ANP5	Mbp1
Candida glabrata	`CANGL`	50286059	XP_445458	Q6FWD6	Mbp1
Cryptococcus neoformans	`CRYNE`	58266778	XP_570545	Q5KHS0	Mbp1
Debaryomyces hansenii	`DEBHA`	50420495	XP_458784	Q6BSN6	Mbp1
Eremothecium gossypii	`EREGO`	45199118	NP_986147	Q752H3	Mbp1
Gibberella zeae	`GIBZE`	46116756	XP_384396	Q4IEY8	Mbp1
Kluyveromyces lactis	`KLULA`	50308375	XP_454189	P39679	Mbp1
Magnaporthe grisea	`MAGGR`	39964664	XP_365024	ACC	Mbp1*
Neurospora crassa	`NEUCR`	85109541	XP_962967	Q7SBG9	Mbp1
Saccharomyces cerevisiae	`SACCE`	6320147	NP_010227	P39678	Mbp1
Schizosaccharomyces pombe	`SCHPO`	19113944	NP_593032	P41412	Mbp1
Ustilago maydis	`USTMA`	71024227	XP_762343	Q4P117	Mbp1
Yarrowia lipolytica	`YARLI`	50545439	XP_500257	Q6CGF5	Mbp1

* Note: This is a full-length homologue; however, BLink shows that the C-terminal half is more similar to Swi6 than to Mbp1. Thus I would consider the APSES domain orthologous, the remainder possibly paralogous.

From the information given here, briefly explain if these sequences appear to be orthologues to yeast Mbp1 (as evidenced through the "reciprocal best-match" criterion).

Briefly explain if these sequences are necessarily orthologues to each other. (1 mark)

Next the sequences were retrieved and slightly reformatted. This is our second task: obtaining all FASTA sequences from a list of identifiers and putting them in a form in which we can use them as input for other programs or services.

Review the resulting multi-FASTA file for all the Mbp1 proteins (linked here) and make sure you understand the procedure that led to it. Summarize the key steps of the procedure in point form in your submission. (Don't submit the entire file but make sure you understand (and could reproduce) the essential parts of the procedure). (1 mark)
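The reformatting step can be illustrated with a short Python sketch. It is not the procedure that produced the linked file: the function names, the line width and the second, made-up record are assumptions for illustration only. It shows how cleaned-up sequences and short labels are combined into one multi-FASTA text.

```python
import textwrap

def fasta_record(label, sequence, width=60):
    """Return one FASTA record: a '>' header line plus wrapped sequence lines."""
    cleaned = "".join(sequence.split()).upper()  # drop spaces and line breaks
    return ">" + label + "\n" + "\n".join(textwrap.wrap(cleaned, width))

def multi_fasta(records, width=60):
    """Join (label, sequence) pairs into one multi-FASTA string."""
    return "\n".join(fasta_record(label, seq, width) for label, seq in records) + "\n"

# First 20 residues of yeast Mbp1 as shown in the annotation on this page;
# the second record is a made-up placeholder, not a real sequence.
example = [
    ("MBP1_SACCE", "MSNQIYSARY SGVDVYEFIH"),
    ("DEMO_XXXXX", "ACDEFGHIKL"),
]
print(multi_fasta(example, width=10))
```

Short, unique labels of this kind matter because many alignment programs truncate long header lines.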

Other APSES domain sequences

Mbp1 orthologues are not the only proteins that contain APSES domains. In order to find all the rest, a PSI BLAST search was performed using the yeast Mbp1 APSES domain as query. From the list of hits, the APSES domains were extracted and summarized in a file.
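To get a feeling for what a position specific scoring matrix (PSSM) contains, here is a minimal Python sketch. It is a simplification and not what PSI-BLAST actually computes: a uniform background frequency and a flat pseudocount are assumed, and sequence weighting is omitted. The five rows are the first five aligned residues of five APSES domains, taken from the PSI-BLAST excerpt further down this page.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pssm(rows, pseudocount=1.0):
    """Log-odds score per column and residue against a uniform background.

    Assumes all rows have the same length; gap characters are ignored.
    """
    background = 1.0 / len(AMINO_ACIDS)
    matrix = []
    for i in range(len(rows[0])):
        counts = Counter(row[i] for row in rows if row[i] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix

rows = ["IMWDY", "IIWDY", "VIWDY", "ILRRV", "LMRRS"]
scores = pssm(rows)
# Residues seen often in a column score above zero, unseen ones below zero.
print(round(scores[0]["I"], 2), round(scores[0]["W"], 2))
```

A query residue is then scored against each column of the matrix rather than against a single sequence, which is what makes profile searches more sensitive than pairwise ones.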

Review the resulting file for the APSES domains and make sure you understand the procedure that led to it. Summarize the key steps of the procedure in point form. (1 mark)

Orthologues

Determine for one of the APSES domains in your organism which yeast APSES domain (if any) it is orthologous to:

Choose at random one of the APSES domains from your organism (but not one labelled with Mbp1) and copy its sequence into the input window of a BLAST search.
Restrict the BLAST search to RefSeq sequences in Saccharomyces cerevisiae.
Run the search and determine the gene name of the best hit. (This is the best match.)
Find the sequence of your best hit's APSES domain in the sequence list. (Since the list contains all of them, your hit should be in there.)
Copy that sequence and perform the same kind of BLAST search, this time restricted to your organism. (This finds the reciprocal match.)

Document the process and report briefly what you have found on the forward and on the reverse search. Does the gene you have chosen fulfill the reciprocal best match criterion for orthology with a yeast gene? (1 mark)
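The logic of the reciprocal best match test can be written down in a few lines of Python. This is only a sketch of the criterion itself; the gene names and best-hit tables are hypothetical and stand in for what you would read off your own BLAST results.

```python
def is_reciprocal_best_match(query, forward_best, reverse_best):
    """True if the best hit of `query` in the other genome has `query` as its own best hit."""
    hit = forward_best.get(query)
    return hit is not None and reverse_best.get(hit) == query

# Hypothetical best-hit tables (placeholder names, not real results).
forward_best = {"geneA": "yeastX", "geneB": "yeastY"}  # your organism -> yeast
reverse_best = {"yeastX": "geneA", "yeastY": "geneC"}  # yeast -> your organism

print(is_reciprocal_best_match("geneA", forward_best, reverse_best))  # True
print(is_reciprocal_best_match("geneB", forward_best, reverse_best))  # False
```

In the second case the forward search finds yeastY, but the reverse search from yeastY finds a different gene, so the criterion is not fulfilled.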

Align

Actually performing multiple sequence alignments used to involve downloading and installing software on your own computer. While most tools were available on the Web in principle, many groups have restricted the total number of sequences or the total number of characters to be aligned. The EBI however offers three of the most commonly used tools with few limitations and it was possible to run MSAs for all Mbp1 orthologues jointly.

Aligning the Mbp1 orthologues (X marks)

I used the following three servers:

CLUSTAL-W is a progressive alignment program. It is the most popular and most widely referenced tool, and it is reasonably fast and easy to use. But alignment errors that are made early can't be corrected later, and thus it is prone to misalignments on sets of sequences that have poor (<30% ID) local similarity. It is no longer considered state-of-the-art for carefully done alignments.
MUSCLE essentially starts out from a CLUSTAL-like alignment as a draft, then identifies similar groups of sequences from which it calculates profiles; it then re-aligns the group to the profile. This procedure is iterated.
T-COFFEE is one of my favourites - the tradeoffs appear to be especially well balanced. It too starts from a set of pairwise global alignments, like CLUSTAL, then additionally calculates sets of best local alignments. Global and local alignments are then combined to a similarity matrix and based on this matrix a guide-tree is constructed. This determines the order of steps in which sequences are added to the multiple alignment. A nice feature of T-COFFEE is color coded output that allows you to quickly judge the local reliability of the alignment.

We shall perform multiple sequence alignments for all 16 Mbp1 orthologues and compare the results. Since the results should look the same for all of you, it was possible to precompute the alignments to save some resources. Of course you are welcome to do this on your own, but it is not required. In fact, since we want to compare the alignments, I have also edited them: I have re-sorted the results so that the sequences appear in the same order in each case. Only CLUSTAL provides the option to order the output in the same way as the input, the other two programs order the output so that adjacent sequences are most similar. This is useful, because it emphasizes sequence features, but it makes it impossibly tedious to compare alignments.

Assignment 3, Figure 01
The guide tree computed by CLUSTAL-W for the 16 Mbp1 orthologue sequences. This tree is based on a matrix of pairwise distances. Sequences in the multiple alignments were ordered in the same way as they appear in this diagram.
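A matrix of pairwise distances of the kind a guide tree is built from can be sketched as follows. This is a simplified illustration, not CLUSTAL-W's actual scoring: distance is taken here as one minus the fraction of identical residues over columns without gaps, and the three short sequences are toy data.

```python
def percent_identity(a, b):
    """Percent identical residues over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def distance_matrix(aligned):
    """Map each pair of names to 1 - (fractional identity)."""
    names = list(aligned)
    return {
        (p, q): 1.0 - percent_identity(aligned[p], aligned[q]) / 100.0
        for p in names for q in names
    }

# Toy pairwise-aligned sequences (made up for illustration).
aligned = {"seq1": "ACDE-G", "seq2": "ACDQFG", "seq3": "TCDE-G"}
for pair, dist in distance_matrix(aligned).items():
    print(pair, round(dist, 2))
```

The same identity function is what the "30% ID" rule of thumb refers to: pairs below that value are the ones where progressive alignments become unreliable.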

The result files are linked here:

Globally speaking, the alignments are quite similar. Let's first look at the common themes, before we discuss details of the results. The (score-colored T-COFFEE alignment) is well suited to look at general relationships between the sequences, since outliers can be easily identified. For example, if one of the sequences had a low-scoring domain, that domain may have been acquired from elsewhere and may not be homologous, i.e. it would be dissimilar to all others. Also a sequence may have gained significant lengths of N- or C-terminal sequence.

Instruction

Review the (score-colored T-COFFEE alignment). Based on this alignment, how do you feel about our initial assertion that these proteins should be orthologous? (Answer briefly, but with reference to specific evidence in the alignment.)

Mbp1 orthologues: analysis of full length MSAs

What do we mean by a good versus a poor multiple sequence alignment?

Let us first consider some of the features we have defined in the second assignment (and some structural features I have added). Here is an annotation of the yeast Mbp1 sequence. It was compiled with the following procedure.

performed CDD search with yeast Mbp1 protein sequence. This retrieves alignments of Mbp1 with the APSES and the ANKYRIN domains. These are profile-based alignments and I would consider them more exact than pairwise alignments.
performed SMART search with yeast Mbp1 protein sequence. This retrieved the APSES domain, annotated a number of low-complexity regions and a stretch of coiled coil.
performed a SAS search with yeast Mbp1 protein sequence. This retrieved pairwise alignments with the structures 1mb1 (APSES) and chain D of 1ikn (ankyrin domains of I-kappaB), together with their respective secondary structure annotations.
copied GenPept sequence into Word-processor
transferred annotations of low complexity and coiled-coil regions from SMART
transferred annotations of APSES secondary structure from SAS (this is a direct annotation, since the structure 1MB1 has the same sequence as the corresponding parts of the Mbp1 protein). The central helix of the binding region is slightly distorted and SAS annotates a break in the helix; this was bridged with lowercase "h" in the annotation.
Ankyrin domain annotation was not as straightforward. While CDD, SMART and SAS all annotate the same general regions, they disagree in details of the domain boundaries and in the precise alignment. Used the profile-based CDD alignment of 1ikn. Transferred annotations of secondary structure from SAS output for 1ikn to sequence (this is a transferred annotation, the original annotation was for 1ikn and we assume that it applies to Mbp1 as well).

MBP1_SACCE
Annotations based on 
- CDD domain analysis,
- SAS structure annotation and
- literature data on binding region

Keys:

C   Coiled coil regions predicted by Coils2 program
x   Low complexity region
*   Proposed binding region
+   positively charged residues, oriented for possible DNA binding interactions
-   negatively charged residues, oriented for possible DNA binding interactions

E   beta strand
H   alpha helix
t   beta turn


        1 MSNQIYSARY SGVDVYEFIH STGSIMKRKK DDWVNATHIL KAANFAKAKR TRILEKEVLK
1MB1      ----EEEEEt t-EEEEEEEE t-EEEEEEtt ---EEHHHHH HH----HHHH HHHHhhhHHH
                                                               * *+**-+****

       61 ETHEKVQGGF GKYQGTWVPL NIAKQLAEKF SVYDQLKPLF DFTQTDGSAS PPPAPKHHHA
1MB1      ---EEE---- tt--EEEE-H HHHHHHHHH- --HHHHtt-         xxx xxxxxxxxxx
          **+*+***** ****

      121 SKVDRKKAIR SASTSAIMET KRNNKKAEEN QFQSSKILGN PTAAPRKRGR PVGSTRGSRR
          x                                                                           


      181 KLGVNLQRSQ SDMGFPRPAI PNSSISTTQL PSIRSTMGPQ SPTLGILEEE RHDSRQQQPQ
                                                                      xxxxx


      241 QNNSAQFKEI DLEDGLSSDV EPSQQLQQVF NQNTGFVPQQ QSSLIQTQQT ESMATSVSSS
          x                                        xx xxxxxxxxxx xxxxxxxxxx


      301 PSLPTSPGDF ADSNPFEERF PGGGTSPIIS MIPRYPVTSR PQTSDINDKV NKYLSKLVDY
          xxxxxxx

      361 FISNEMKSNK SLPQVLLHPP PHSAPYIDAP IDPELHTAFH WACSMGNLPI AEALYEAGTS
ANKYRIN                                 -- t----HHHHH HH---HHHHH t-t--t-t--


      421 IRSTNSQGQT PLMRSSLFHN SYTRRTFPRI FQLLHETVFD IDSQSQTVIH HIVKRKSTTP
ANKYRIN   t----t---- HHHHHHHH-- -------HHH HHHHHH-ttH HH-----HHH HHHH--tH--


      481 SAVYYLDVVL SKIKDFSPQY RIELLLNTQD KNGDTALHIA SKNGDVVFFN TLVKMGALTT
ANKYRIN   HHHHHHHHH- ---------- -----t---- tt---HHHHH HH---HHHHH HHH--t-tt-


      541 ISNKEGLTAN EIMNQQYEQM MIQNGTNQHV NSSNTDLNIH VNTNNIETKN DVNSMVIMSP
ANKYRIN   ---t----HH HHHHHH--HH HHH-t--HHH -t----HHHH HHH--tHHHH HHHHHH---t


      601 VSPSDYITYP SQIATNISRN IPNVVNSMKQ MASIYNDLHE QHDNEIKSLQ KTLKSISKTK
ANKYRIN   ---tt----H HHHHHH---H HHHHHHH      CCCCCCCC CCCCCCCCCC CCCCC


      661 IQVSLKTLEV LKESSKDENG EAQTNDDFEI LSRLQEQNTK KLRKRLIRYK RLIKQKLEYR
                                                    x xxxxxxxxxx xxxxxxx

      721 QTVLLNKLIE DETQATTNNT VEKDNNTLER LELAQELTML QLQRKNKLSS LVKKFEDNAK


      781 IHKYRRIIRE GTEMNIEEVD SSLDVILQTL IANNNKNKGA EQIITISNAN SHA
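If you want to turn the annotation rows above into residue numbers, a small Python sketch like the following can help. It assumes that the annotation row is column-aligned with the sequence row, that both start with a ten-character label column, and that the sequence row begins with the number of its first residue. These layout assumptions and the function name are mine; the two rows are the second block (residues 61 and following) of the annotation above.

```python
def annotated_positions(seq_line, ann_line, symbols, prefix_width=10):
    """List (residue number, residue, symbol) for residues marked with one of `symbols`.

    Assumes both rows share a label column of `prefix_width` characters and
    that the sequence row starts with the number of its first residue.
    """
    position = int(seq_line[:prefix_width].split()[0])
    body = seq_line[prefix_width:]
    marks = ann_line[prefix_width:].ljust(len(body))
    hits = []
    for residue, mark in zip(body, marks):
        if residue == " ":
            continue  # spacer between blocks of ten residues
        if mark in symbols:
            hits.append((position, residue, mark))
        position += 1
    return hits

seq_line = "61".rjust(9) + " " + (
    "ETHEKVQGGF GKYQGTWVPL NIAKQLAEKF SVYDQLKPLF DFTQTDGSAS PPPAPKHHHA"
)
ann_line = " " * 10 + "**+*+***** ****"
print(annotated_positions(seq_line, ann_line, "+-"))
# [(63, 'H', '+'), (65, 'K', '+')]
```

Calling the function with "*+-" as symbols lists the whole proposed binding region of that block instead of only the charged residues.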

APSES domains (X marks)

The APSES domains in these Mbp1 orthologues are highly conserved and an alignment that would not recognize that would not be worth the electrons it was computed with.

In the three Mbp1 alignments, find the APSES domain alignments. Briefly note whether the alignments agree and whether the charged residues in the proposed binding region are wholly or partially conserved.
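What "wholly or partially conserved" means for a charged position can be made explicit with a small Python sketch. The classification below is an assumption for illustration: K, R and H are counted as positive (the annotation above marks a histidine with "+"), D and E as negative, and the first row is taken as the yeast reference. The alignment rows are toy data.

```python
POSITIVE = set("KRH")
NEGATIVE = set("DE")

def charge_class(residue):
    """Return '+', '-' or '0' for a one-letter amino acid code."""
    if residue in POSITIVE:
        return "+"
    if residue in NEGATIVE:
        return "-"
    return "0"

def charge_conservation(column):
    """Compare the charge of every residue in a column with the first (reference) residue."""
    reference = charge_class(column[0])
    if reference == "0":
        return "reference uncharged"
    same = sum(charge_class(r) == reference for r in column[1:])
    if same == len(column) - 1:
        return "wholly conserved"
    return "partially conserved" if same > 0 else "not conserved"

# Toy alignment, reference sequence first (made up for illustration).
rows = ["AKE", "ARD", "AKD", "AQD"]
for i in range(len(rows[0])):
    column = "".join(row[i] for row in rows)
    print(i, column, charge_conservation(column))
```

Note that a K to R exchange counts as conserved here, since the question is about charge and not about sequence identity.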

Ankyrin domains (X marks)

The Ankyrin domains are more highly diverged, the boundaries are less well defined and not even CDD, SMART and SAS agree on the precise annotations. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required indels would be placed between the secondary structure elements, not in their middle.
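Whether indels fall between or into secondary structure elements can be checked systematically. The following Python sketch is one way to do that under simple assumptions: the reference row is the annotated yeast sequence as it appears in the alignment, its annotation has one symbol per residue (H, E, t or -), and only gaps in the other sequence opposite a reference residue are examined. Function names and the example rows are hypothetical.

```python
def project_annotation(aligned_ref, annotation):
    """Spread a per-residue annotation over alignment columns; gap columns get None."""
    if len(annotation) != len(aligned_ref.replace("-", "")):
        raise ValueError("annotation must have one symbol per reference residue")
    symbols = iter(annotation)
    return [None if c == "-" else next(symbols) for c in aligned_ref]

def gaps_inside_elements(aligned_ref, annotation, aligned_other, elements="HE"):
    """Return columns where the other sequence has a gap opposite a helix or strand residue."""
    projected = project_annotation(aligned_ref, annotation)
    return [
        i for i, (state, c) in enumerate(zip(projected, aligned_other))
        if c == "-" and state is not None and state in elements
    ]

# Toy example: the reference has one gap column; residues 2-5 form a helix.
aligned_ref = "ACDEFG-HIK"
annotation = "-HHHH-EEE"      # one symbol per reference residue
other_bad = "AC--FGWHIK"      # deletion in the middle of the helix
other_ok = "-CDEFGWHIK"       # deletion in an unstructured position

print(gaps_inside_elements(aligned_ref, annotation, other_bad))  # [2, 3]
print(gaps_inside_elements(aligned_ref, annotation, other_ok))   # []
```

An alignment that returns few or no such columns places its indels where we would expect them structurally.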

For one of the alignments of your choice, identify any four consecutive helices in the Ankyrin repeat region of Mbp1. ...

Other features (X marks)

Aligning functional features like coiled coil domains or intrinsically disordered regions is even more difficult, since this is to a certain degree a property of the amino acid composition, not as much the precise sequence. Thus we would expect alignment algorithms to have difficulty detecting the correspondence between sequences in this region. I have marked the four low complexity regions of the yeast Mbp1 sequence with bold letters in all three alignments.
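That low complexity is a property of composition rather than of the precise sequence can be illustrated with a Shannon entropy calculation. This is a crude stand-in and not the method SMART uses; the window size and threshold are arbitrary assumptions. The two ten-residue segments are taken from the yeast Mbp1 annotation above: residues 1-10 and the segment PPPAPKHHHA (residues 111-120), which is marked as low complexity.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Entropy in bits of the amino acid composition of a segment."""
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in Counter(segment).values())

def low_complexity_windows(sequence, window=12, threshold=2.2):
    """Return 1-based start positions of windows whose entropy is below the threshold."""
    return [
        i + 1
        for i in range(len(sequence) - window + 1)
        if shannon_entropy(sequence[i:i + window]) < threshold
    ]

print(round(shannon_entropy("MSNQIYSARY"), 2))  # diverse composition
print(round(shannon_entropy("PPPAPKHHHA"), 2))  # biased composition, lower entropy
```

Shuffling the residues within a segment leaves its entropy unchanged, which is exactly why such regions are hard to align by sequence.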

Copy the Mbp1 sequence from your organism from the multi-FASTA files and run a SMART sequence analysis: paste your sequence (or the UniProt accession number), check only the checkbox for detecting intrinsic protein disorder and click "Sequence SMART". Locate the segments of low complexity for your sequence (they are in the lower part of the results page since they overlap with disordered segments). Find the corresponding positions for your sequence in one of the multiple sequence alignments. Briefly describe the situation: state whether these segments are found in the same general region, in the same detailed location, or perhaps even conserved in sequence, as compared to the Saccharomyces cerevisiae sequence.

Briefly discuss whether this should lead you to conclude that disorder in these proteins appears to be an evolutionarily conserved feature.

APSES domain homologues: analysis of domain MSAs (X marks)

The procedures for obtaining the MSAs for all APSES domains are summarized at the top of the page for each alignment. Read them and make sure you understand what has been done. Three approaches were used:

An alignment based on the PSI-BLAST results as an example of a profile-based alignment.

A CLUSTAL-W alignment as an example of our standard, plain vanilla progressive alignment procedure.

A consistency based, iterated alignment using Probcons, as an example of the more modern methods. Probcons was used rather than T-COFFEE since the EBI server restricts the number of sequences it will accept to 50.

Again, comparing the alignments, we note that they do not agree universally.

Manual improvement

Often errors or inconsistencies are easy to spot and manually editing an MSA is not generally frowned upon, even though this is not a strictly objective procedure. The main goal is to make an alignment biologically more plausible. Usually this means to minimize the number of rare events that we need to postulate for the alignment, to move indels into more plausible positions and/or to emphasize conservation of known functional motifs. Here are some examples for what one might aim for in manually editing an alignment:

Reduce number of indels

From Probcons
0447_DEBHA    ILKTE-K-T---K--SVVK      ILKTE----KTK---SVVK
9978_GIBZE    MLGLN-PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
1513_CANAL    ILKTE-K-I---K--NVVK      ILKTE----KIK---NVVK
6132_SCHPO    ELDDI-I-ESGDY--ENVD      ELDDI-IESGDY---ENVD
1244_ASPFU    ----N-PGLREIC--HSIT  ->  ----NPGLREIC---HSIT
0925_USTMA    LVKTC-PALDPHI--TKLK      LVKTCPALDPHI---TKLK
2599_ASPTE    VLDAN-PGLREIS--HSIT      VLDANPGLREIS---HSIT
9773_DEBHA    LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
0918_CANAL    LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR

Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably; however, the total number of indels in this excerpt is reduced to 13 from the original 22.
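The indel counts quoted for this excerpt can be reproduced by counting each run of consecutive gap characters as one event. The following Python sketch does that; treating every run as a single indel, including the leading gap in 1244_ASPFU, is an assumption about how the numbers were obtained, but it does reproduce them.

```python
import re

def count_indels(rows):
    """Count indel events as runs of consecutive gap characters, summed over all rows."""
    return sum(len(re.findall(r"-+", row)) for row in rows)

original = [
    "ILKTE-K-T---K--SVVK",
    "MLGLN-PGLKEIT--HSIT",
    "ILKTE-K-I---K--NVVK",
    "ELDDI-I-ESGDY--ENVD",
    "----N-PGLREIC--HSIT",
    "LVKTC-PALDPHI--TKLK",
    "VLDAN-PGLREIS--HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]
edited = [
    "ILKTE----KTK---SVVK",
    "MLGLNPGLKEIT---HSIT",
    "ILKTE----KIK---NVVK",
    "ELDDI-IESGDY---ENVD",
    "----NPGLREIC---HSIT",
    "LVKTCPALDPHI---TKLK",
    "VLDANPGLREIS---HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]
print(count_indels(original), count_indels(edited))  # 22 13
```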

Move indels to more plausible position

From CLUSTAL:
4966_CANGL     MKHEKVQ------GGYGRFQ---GTW      MKHEKVQ------GGYGRFQ---GTW
1513_CANAL     KIKNVVK------VGSMNLK---GVW      KIKNVVK------VGSMNLK---GVW
6132_SCHPO     VDSKHP-----------QID---GVW  ->  VDSKHPQ-----------ID---GVW
1244_ASPFU     EICHSIT------GGALAAQ---GYW      EICHSIT------GGALAAQ---GYW

The two characters marked in red were swapped. This does not change the number of indels but places the "Q" into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.
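The effect of such a swap on a single column can be quantified with a simple score. The Python sketch below uses the fraction of non-gap residues that equal the most common residue in the column; this measure and the function names are my own choice for illustration. The rows are transcribed from the CLUSTAL excerpt above, with gap lengths written out explicitly.

```python
from collections import Counter

def column(rows, index):
    """Return one alignment column as a string."""
    return "".join(row[index] for row in rows)

def top_residue_fraction(col):
    """Fraction of non-gap residues in a column that equal the most common residue."""
    residues = [c for c in col if c != "-"]
    if not residues:
        return 0.0
    return Counter(residues).most_common(1)[0][1] / len(residues)

cangl = "MKHEKVQ" + "-" * 6 + "GGYGRFQ" + "---GTW"
canal = "KIKNVVK" + "-" * 6 + "VGSMNLK" + "---GVW"
aspfu = "EICHSIT" + "-" * 6 + "GGALAAQ" + "---GYW"
schpo_before = "VDSKHP" + "-" * 11 + "QID" + "---GVW"
schpo_after = "VDSKHPQ" + "-" * 11 + "ID" + "---GVW"

before = [cangl, canal, schpo_before, aspfu]
after = [cangl, canal, schpo_after, aspfu]

# Column 6 (0-based) is where the "Q" ends up after the edit.
print(column(before, 6), round(top_residue_fraction(column(before, 6)), 2))
print(column(after, 6), round(top_residue_fraction(column(after, 6)), 2))
```

With only four sequences the numbers are of course not meaningful statistically; the point is that the edit raises the score of the target column without adding an indel.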

Conserve motifs

From CLUSTAL
6166_SCHPO      --DKRVA---GLWVPP      --DKRVA--G-LWVPP
XBP1_SACCE      GGYIKIQ---GTWLPM      GGYIKIQ--G-TWLPM
6355_ASPTE      --DEIAG---NVWISP  ->  ---DEIA--GNVWISP
5262_KLULA      GGYIKIQ---GTWLPY      GGYIKIQ--G-TWLPY

The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.

Please consider the following excerpts from the alignments:

PSI-BLAST
MBP1_SACCE    SIMKRKKDDWVNATHILKA------A----------NFA--------KAKRTR-----
2599_ASPTE    -IMWDYNIGLVRTTPLFRS------Q----------NYS--------KTTPAK-----
9773_DEBHA    -IIWDYETGFVHLTGIWKA------S----------INDEVNTHRNLKADIVK-----
0918_CANAL    -VIWDYETGWVHLTGIWKA------SLTIDGSNVSPSHL--------KADIVK-----
9901_DEBHA    -ILRRVQDSYINISQLF--------SILLKIG----HLS--------EAQLTN-----
7766_ASPNI    -LMRRSKDGYVSATGMFKI------A-----------FP--------WAKLEEERSER
5459_GIBZE    -LMRRSYDGFVSATGMFKASFPYAEA----------SDE--------DAERKY-----
2267_NEUCR    -LMRRSQDGYISATGMFKA------TFPYASQ----EEE--------EAERKY-----
3510_ASPFU    -LMRRSKDGYVSATGMFKI------A-----------FP--------WAK--------
3762_MAGGR    -LMRRSSDGYVSATGMFKATFPYADA----------EDE--------EAERNY-----
3412_CANAL    -VLRRVQDSFVNVTQLFQI------LIKLE------VLP--------TSQVDN-----

CLUSTAL
MBP1_SACCE    SIMKRKKDDWVNATHILKAAN----------FAKAKRTRILE----------KEVLKETHE
2599_ASPTE    -IMWDYNIGLVRTTPLFRSQ----------NYSKTTPAKVLDAN--------P-GLREISH
9773_DEBHA    -IIWDYETGFVHLTGIWKASIN-DEVNTHR-NLKADIVKLLEST--------PKQYHQHIK
0918_CANAL    -VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLEST--------PKEYQQYIK
9901_DEBHA    -ILRRVQDSYINISQLFSILL----------KIGHLSEAQLTNFLNNEILTNTQYLSSGGS
7766_ASPNI    -LMRRSKDGYVSATGMFKIAF----------PWAKLEEERSE----------REYLKTRPE
5459_GIBZE    -LMRRSYDGFVSATGMFKASF----------PYAEASDEDAE----------RKYIKSLPT
2267_NEUCR    -LMRRSQDGYISATGMFKATF----------PYASQEEEEAE----------RKYIKSIPT
3510_ASPFU    -LMRRSKDGYVSATGMFKIAF----------PWAKLEEEKAE----------REYLKTREG
3762_MAGGR    -LMRRSSDGYVSATGMFKATF----------PYADAEDEEAE----------RNYIKSLPA
3412_CANAL    -VLRRVQDSFVNVTQLFQILI----------KLEVLPTSQVDNYFDNEILSNLKYFGSSSN

Probcons 
MBP1_SACCE    SIMKRKKDDWVNATHILKAANF----AKA----------KRTRILEKE-V-LKETH--E
2599_ASPTE    -IMWDYNIGLVRTTPLFRSQNY----SKT----------TPAKVLDAN-PGLREIS--H
9773_DEBHA    -IIWDYETGFVHLTGIWKASIN----DEV--NTHRNLKADIVKLLESTPKQYHQHI--K
0918_CANAL    -VIWDYETGWVHLTGIWKASLT----IDGSNVSPSHLKADIVKLLESTPKEYQQYI--K
9901_DEBHA    -ILRRVQDSYINISQLFSILLKIGHLSEA----------QLTNFLNNE-I-LTNTQYLS
7766_ASPNI    -LMRRSKDGYVSATGMFKIAFP----WAK----------LEEERSERE-Y-LK-----T
5459_GIBZE    -LMRRSYDGFVSATGMFKASFP----YAE----------ASDEDAERK-Y-IK-----S
2267_NEUCR    -LMRRSQDGYISATGMFKATFP----YAS----------QEEEEAERK-Y-IK-----S
3510_ASPFU    -LMRRSKDGYVSATGMFKIAFP----WAK----------LEEEKAERE-Y-LK-----T
3762_MAGGR    -LMRRSSDGYVSATGMFKATFP----YAD----------AEDEEAERN-Y-IK-----S
3412_CANAL    -VLRRVQDSFVNVTQLFQILIKLEVLPTS----------QVDNYFDNE-I-LSNLKYFG

In any one of these excerpts, find at least one example where the alignment could be manually improved. Show the original version, the improved version and highlight the improvement in red.

The fact that such improvements usually are not hard to find teaches us to be cautious with the results. Not in all cases will lack of conservation in a particular column mean that a residue has changed in evolution - sometimes this is simply a consequence of misalignment. MSAs can only take sequence information into account, while we may have additional information on structural and functional conservation patterns. This may include secondary structure (gaps should be moved out of regions of secondary structure, where possible), structurally required residues (expected to be conserved across all structurally similar sequences) and functionally conserved residues (expected to have a high likelihood of being conserved within groups of orthologues, but varying between orthologues and paralogues).

In terms of structural conservation, we expect motif or consistency based alignments to be better since they align to the "big picture". In terms of functional variation we expect progressive alignments to be better, since they align to local similarities.

I have transferred some of the annotations for the yeast Mbp1 APSES domain into the multiple sequence alignments. This should allow you to answer the following questions:

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List
