BIO Assignment 3 2011

From "A B C"
Revision as of 01:43, 24 November 2006 by Boris (talk | contribs) (→‎Analyse)

   

Assignment 3 - Multiple Sequence Alignment

Please note: This assignment is currently inactive. Unannounced changes may be made at any time.  


 

Introduction

A carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of a gene or protein. MSAs combine the information from several related proteins, allowing us to study their essential, shared properties. They are useful to resolve ambiguities in the precise placement of gaps and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. Therefore we need MSAs as input for

  • protein homology modeling,
  • phylogenetic analyses, and
  • sensitive homology searches in databases.

Furthermore, conservation (or the lack of it) reflects the requirements of structural or functional features of our protein, emphasizes domain boundaries in multi-domain proteins, and can guide mutations for protein engineering and design.

Given the ubiquitous importance of this procedure, it is somewhat surprising that by far the most frequently used algorithm is CLUSTAL, which has been shown to be significantly inferior to more modern approaches for sequences with about 30% identity or less.

In this assignment we will explore MSAs of the Mbp1 proteins and the APSES domains they contain and try several approaches to alignment:

  • A model-based approach (based on the PSSM that PSI-BLAST generates)
  • A progressive alignment - the CLUSTAL algorithm
  • A consistency-based alignment - T-Coffee or ProbCons


Preparation, submission and due date

Please read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which people have simply overlooked crucial questions. Sadly, we always get assignments back in which people have not described procedural details. If you did not notice that the above were two different sentences, you are still not reading carefully enough.

Prepare a Microsoft Word document with a title page that contains:

  • your full name
  • your Student ID
  • your e-mail address
  • the organism name you have been assigned (see below)

Follow the steps outlined below. You are encouraged to write your answers in short answer form or point form, like you would document an analysis in a laboratory notebook. However, you must

  • document what you have done,
  • note what Web sites and tools you have used,
  • paste important data sequences, alignments, information etc.

If you do not document the process of your work, we will deduct marks. Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formatting. Do not paste screendumps. Keep the size of your submission below 1.5 MB.

Write your answers into separate paragraphs and give each its title. Save your document with a filename of: A3_family name.given name.doc (for example my first assignment would be named: A3_steipe.boris.doc - and don't switch the order of your given name and family name please!)

Finally e-mail the document to [boris.steipe@utoronto.ca] before the due date.

Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.

With the number of students in the course, we have to economize on processing the assignments. Thus we will not accept assignments that are not prepared as described above. If you have technical difficulties, contact me.

The due date for the assignment is XXXXX at 10:00 in the morning.

Grading

Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted at the first minute of every twelve-hour period past the due date. Assignments received more than 5 days past the due date will not be assessed.

Marks are noted below in the section headings for each of the tasks. A total of 10 marks will be awarded if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will

  • count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
  • be divided by two for BCH1441 (graduates).

   

Retrieve

   

In Assignment 2 you retrieved the Saccharomyces cerevisiae Mbp1 protein sequence. Here I have compiled the most similar homologues from the organisms you have studied:

Mbp1 homologues

 

Our first task is to compile a multi-FASTA file for all Mbp1 orthologues.

In your second assignment, you used BLAST to find the best matches to the yeast Mbp1 protein in your assigned organism's genome. Since there was some variation in the sequences you reported, I have generated a list de novo using the following procedure:

  1. Retrieved the Mbp1 protein sequence by searching Entrez for Mbp1 AND "saccharomyces cerevisiae"[organism]
  2. Clicked on the RefSeq tab to find the RefSeq ID "NP_010227"
  3. Accessed the BLAST form for protein/protein BLAST and pasted the RefSeq ID into the query field. Chose refseq as the database. Kept default parameters. Chose Fungi as an ENTREZ query limit in the Options section.
  4. On the results page, checked the checkbox next to the alignment of the most significant hit from each of the organisms we are studying.
  5. Clicked on the "Get selected sequences" button. The results page lists the gene that is most similar to Mbp1 in each organism.
  6. Verified that each of these sequences finds Mbp1 as the best match in the Saccharomyces cerevisiae genome by clicking on each "BLink" (click for example) in the retrieved list. Scrolled down the list to confirm that the top hit of a Saccharomyces cerevisiae protein is Mbp1.
  7. Obtained UniProt accessions for all sequences, with a single query using the new UniProt ID mapping service. This service accepts a comma-delimited list of RefSeq IDs and returns a list of UniProt proteins.
  8. Assembled this information into a table.


Organism                   CODE   GI         RefSeq        UniProt accession  Most similar yeast gene
Aspergillus fumigatus      ASPFU  70986922   XP_748947     Q4WGN2             Mbp1
Aspergillus nidulans       ASPNI  67525393   XP_660758     Q5B8H6             Mbp1
Aspergillus terreus        ASPTE  115391425  XP_001213217  Q0CQJ5             Mbp1
Candida albicans           CANAL  68465419   XP_723071     Q5ANP5             Mbp1
Candida glabrata           CANGL  50286059   XP_445458     Q6FWD6             Mbp1
Cryptococcus neoformans    CRYNE  58266778   XP_570545     Q5KHS0             Mbp1
Debaryomyces hansenii      DEBHA  50420495   XP_458784     Q6BSN6             Mbp1
Eremothecium gossypii      EREGO  45199118   NP_986147     Q752H3             Mbp1
Gibberella zeae            GIBZE  46116756   XP_384396     Q4IEY8             Mbp1
Kluyveromyces lactis       KLULA  50308375   XP_454189     P39679             Mbp1
Magnaporthe grisea         MAGGR  39964664   XP_365024     ACC                Mbp1*
Neurospora crassa          NEUCR  85109541   XP_962967     Q7SBG9             Mbp1
Saccharomyces cerevisiae   SACCE  6320147    NP_010227     P39678             Mbp1
Schizosaccharomyces pombe  SCHPO  19113944   NP_593032     P41412             Mbp1
Ustilago maydis            USTMA  71024227   XP_762343     Q4P117             Mbp1
Yarrowia lipolytica        YARLI  50545439   XP_500257     Q6CGF5             Mbp1

* Note: This is a full-length homologue, however the C-terminal half is more similar to Swi6 than to Mbp1.

 

 

  • Briefly explain whether these sequences appear to be orthologues of yeast Mbp1 (as evidenced by the "reciprocal best-match" criterion). Briefly explain whether these sequences are necessarily orthologues of each other. (1 mark)

Since the calculation of MSAs can take up significant computer resources, many Web services restrict the size of input files, typically to something like 30 sequences or 10,000 characters overall. I have thus prepared a multi-FASTA file for the Mbp1 sequences from six taxonomically diverse fungi.

  • Aspergillus fumigatus
  • Cryptococcus neoformans
  • Neurospora crassa
  • Saccharomyces cerevisiae
  • Schizosaccharomyces pombe
  • Yarrowia lipolytica
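
As a rough illustration of the input size limits mentioned above, the following sketch counts the sequences and residues in a multi-FASTA string. The limits (30 sequences, 10,000 characters) are only the approximate figures quoted in the text, not the rule of any particular server, and the function names and toy sequence fragments are made up for this example.

```python
def parse_fasta(text):
    """Parse multi-FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def within_limits(text, max_seqs=30, max_chars=10_000):
    """Return True if the input stays within the (assumed) server limits."""
    records = parse_fasta(text)
    total = sum(len(seq) for _, seq in records)
    return len(records) <= max_seqs and total <= max_chars


# Toy input: short fragments only, not the full Mbp1 sequences.
example = """>SACCE_toy
SIMKRKKDDW
VNATHILKA
>ASPFU_toy
LMRRSKDGYVSATGMFKI
"""
print(len(parse_fasta(example)), within_limits(example))
```

Only residue characters are counted here; a server that counts header lines as well would reach its limit somewhat earlier.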

 

  • Review the resulting file for the selected Mbp1 proteins (linked here) and make sure you understand the procedure that led to it. Copy the data and save it on your computer as a text file. If the list does not contain the Mbp1 homologue for your organism, retrieve the sequence from NCBI and add it to the list. If the list does contain your organism, choose a different organism at random and include that in the list (RefSeq IDs are in the table above). Download all sequences, generate a multi-FASTA file and save it to your computer. Summarize the key steps of the procedure in point form in your submission. (Don't submit the entire file, but make sure you record how it was created.) (1 mark)

Hint: don't do this by hand, you can get the sequences all at once. Click here if you don't know how.
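
One possible way to get all sequences at once is the NCBI E-utilities `efetch` service, which accepts a comma-separated list of identifiers. This is a sketch under that assumption and not necessarily the method behind the link above. The helper name `efetch_fasta_url` is hypothetical, and the RefSeq IDs are the six from the table that correspond to the organisms listed above. The code only builds the request URL; opening that URL in a browser, or with `urllib.request.urlopen`, should return the records as multi-FASTA text.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(refseq_ids):
    """Build an efetch URL requesting the given protein records as FASTA text."""
    params = {
        "db": "protein",
        "id": ",".join(refseq_ids),   # several IDs in one request
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


# RefSeq IDs taken from the table above (ASPFU, CRYNE, NEUCR, SACCE, SCHPO, YARLI).
ids = ["XP_748947", "XP_570545", "XP_962967", "NP_010227", "NP_593032", "XP_500257"]
url = efetch_fasta_url(ids)
print(url)
```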

 


Other APSES domain sequences

 


Mbp1 orthologues are not the only proteins that contain APSES domains. In order to find all the rest, a PSI BLAST search was performed using the yeast Mbp1 APSES domain as query. From the list of hits, the APSES domains were extracted and summarized in a file.

 

  • Review the resulting file for the APSES domains and make sure you understand the procedure that led to it. Summarize the key steps of the procedure in point form. (1 mark)

 

Orthologues

 

Determine for one of the APSES domains in your organism which yeast APSES domain (if any) it is orthologous to:

  1. Choose at random one of the APSES domains from your organism and copy its sequence into the input window of a BLAST search.
  2. Restrict the BLAST search to RefSeq sequences in Saccharomyces cerevisiae.
  3. Run the search and determine the gene name of the best hit. (This is the best match.)
  4. Find the sequence of the APSES domain in the sequence list.
  5. Copy that sequence and perform the same kind of BLAST search, this time restricted to your organism. (This finds the reciprocal match.)
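
The logic of the two searches above can be written down in a few lines. This is only a sketch of the reciprocal best match test itself: the two dictionaries stand in for best hits that you would read off your own BLAST result pages, and all gene names other than Mbp1 are hypothetical placeholders.

```python
def is_reciprocal_best_match(gene_a, best_hit_in_b, best_hit_in_a):
    """Return True if the best hit of gene_a in genome B points back to gene_a.

    best_hit_in_b maps each gene of genome A to its best BLAST hit in genome B;
    best_hit_in_a maps each gene of genome B to its best BLAST hit in genome A.
    """
    forward = best_hit_in_b.get(gene_a)
    if forward is None:
        return False                      # no hit at all in genome B
    return best_hit_in_a.get(forward) == gene_a


# Hypothetical best-hit tables, filled in by hand from two BLAST searches.
organism_to_yeast = {"orgX_apses1": "Mbp1", "orgX_apses2": "Mbp1"}
yeast_to_organism = {"Mbp1": "orgX_apses1"}

print(is_reciprocal_best_match("orgX_apses1", organism_to_yeast, yeast_to_organism))
print(is_reciprocal_best_match("orgX_apses2", organism_to_yeast, yeast_to_organism))
```

In this made-up case both domains find Mbp1 as their best yeast hit, but only the first one is found again when searching back from yeast, so only the first pair passes the test.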

 

  • Report briefly what you have found. Does the gene you have chosen fulfill the reciprocal best match criterion for orthology with a yeast gene? (1 mark)

 

Align

Actually performing multiple sequence alignments used to involve downloading and installing software on your own computer. While most tools were available on the Web in principle, many groups restricted the total number of sequences or the total number of characters to be aligned. The EBI, however, offers three of the most commonly used tools (as far as I can tell) without limitations. These are:

  • CLUSTAL-W is a progressive alignment program that is reasonably fast and easy to use. However, alignment errors that are made early cannot be corrected later, and thus it is prone to misalignments on sets of sequences that have poor (<30% ID) local similarity.
  • MUSCLE essentially starts out from a CLUSTAL-like alignment as a draft, then identifies similar groups of sequences from which it calculates profiles, and then re-aligns the group to the profile. This procedure is iterated.
  • T-COFFEE is one of my favourites - the tradeoffs appear to be especially well balanced. It too starts from a set of pairwise global alignments, like CLUSTAL, then additionally calculates sets of best local alignments. Global and local alignments are then combined into a similarity matrix, and based on this matrix a guide tree is constructed. This determines the order of steps in which sequences are added to the multiple alignment. A nice feature of T-COFFEE is color-coded output that allows you to quickly judge the local reliability of the alignment.

We shall perform multiple sequence alignments for all 16 Mbp1 orthologues and compare the results. Since the results should look the same for all of you, I have precomputed them to save some resources. Of course you are welcome to do this on your own, but it is not required. In fact, since we want to compare the alignments, I have also edited them: I have re-sorted the results so that the sequences appear in the same order in each case. Only CLUSTAL provides the option to order the output in the same way as the input; the other two programs order the output so that adjacent sequences are most similar. This is useful, because it emphasizes sequence features, but it makes it virtually impossible to compare alignments.
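
Re-sorting alignment rows into a common order is a simple lookup. The sketch below assumes each alignment has already been read into a list of (name, aligned sequence) pairs; the function name `reorder` is made up for this illustration, and the rows are truncated fragments taken from the APSES sub-alignments in this assignment.

```python
def reorder(records, reference_order):
    """Return (name, aligned_sequence) records sorted to follow reference_order."""
    by_name = dict(records)
    missing = [name for name in reference_order if name not in by_name]
    if missing:
        raise KeyError(f"missing from alignment: {missing}")
    return [(name, by_name[name]) for name in reference_order]


# The order we want (e.g. the input order kept by CLUSTAL) ...
reference = ["MBP1_SACCE", "2599_ASPTE", "7766_ASPNI"]
# ... and the same rows as another program might have returned them.
other_output = [
    ("7766_ASPNI", "-LMRRSKDGYVSATGMFKI"),
    ("MBP1_SACCE", "SIMKRKKDDWVNATHILKA"),
    ("2599_ASPTE", "-IMWDYNIGLVRTTPLFRS"),
]
for name, seq in reorder(other_output, reference):
    print(f"{name:<14}{seq}")
```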

Assignment 3, Figure 01
The guide tree computed by CLUSTAL-W for the 16 Mbp1 orthologue sequences. This tree is based on a matrix of pairwise distances. Sequences in the multiple alignments were ordered in the same way as they appear in this diagram.
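
To give an idea of what a matrix of pairwise distances looks like, here is a minimal sketch that computes the fraction of differing residues (a simple p-distance) between already aligned sequences. This is an assumption made for illustration and not CLUSTAL-W's exact procedure; the function names are hypothetical and the sequences are truncated fragments from the APSES sub-alignments in this assignment.

```python
def p_distance(a, b, gap="-"):
    """Fraction of mismatches among columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0                        # nothing comparable: treat as maximally distant
    mismatches = sum(1 for x, y in pairs if x != y)
    return mismatches / len(pairs)


def distance_matrix(seqs):
    """Return {(name1, name2): distance} for all ordered pairs of sequences."""
    return {(m, n): p_distance(seqs[m], seqs[n]) for m in seqs for n in seqs}


fragments = {
    "MBP1_SACCE": "SIMKRKKDDWVNATHILKA",
    "7766_ASPNI": "-LMRRSKDGYVSATGMFKI",
    "3510_ASPFU": "-LMRRSKDGYVSATGMFKI",
}
for (m, n), d in distance_matrix(fragments).items():
    print(f"{m:<12}{n:<12}{d:.2f}")
```

A guide tree would then be built by joining the closest pairs first; in this toy example the two Aspergillus fragments are identical and would be joined before the yeast sequence.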


The result files are linked here:


Aligning the Mbp1 proteins (X marks)

 

Instruction

 

  • Task.

   

Aligning the APSES domain (X marks)

 


Instruction

 

  • Compare the three alignments

   

Consider the following sub-alignment from PSI-BLAST. Find at least one example where the alignment could be manually improved.

PSI-BLAST
MBP1_SACCE    SIMKRKKDDWVNATHILKA------A----------NFA--------KAKRTR-----
2599_ASPTE    -IMWDYNIGLVRTTPLFRS------Q----------NYS--------KTTPAK-----
9773_DEBHA    -IIWDYETGFVHLTGIWKA------S----------INDEVNTHRNLKADIVK-----
0918_CANAL    -VIWDYETGWVHLTGIWKA------SLTIDGSNVSPSHL--------KADIVK-----
9901_DEBHA    -ILRRVQDSYINISQLF--------SILLKIG----HLS--------EAQLTN-----
7766_ASPNI    -LMRRSKDGYVSATGMFKI------A-----------FP--------WAKLEEERSER
5459_GIBZE    -LMRRSYDGFVSATGMFKASFPYAEA----------SDE--------DAERKY-----
2267_NEUCR    -LMRRSQDGYISATGMFKA------TFPYASQ----EEE--------EAERKY-----
3510_ASPFU    -LMRRSKDGYVSATGMFKI------A-----------FP--------WAK--------
3762_MAGGR    -LMRRSSDGYVSATGMFKATFPYADA----------EDE--------EAERNY-----
3412_CANAL    -VLRRVQDSFVNVTQLFQI------LIKLE------VLP--------TSQVDN-----


Probcons 
MBP1_SACCE    SIMKRKKDDWVNATHILKAANF----AKA----------KRTRILEKE-V-LKETH--E
2599_ASPTE    -IMWDYNIGLVRTTPLFRSQNY----SKT----------TPAKVLDAN-PGLREIS--H
9773_DEBHA    -IIWDYETGFVHLTGIWKASIN----DEV--NTHRNLKADIVKLLESTPKQYHQHI--K
0918_CANAL    -VIWDYETGWVHLTGIWKASLT----IDGSNVSPSHLKADIVKLLESTPKEYQQYI--K
9901_DEBHA    -ILRRVQDSYINISQLFSILLKIGHLSEA----------QLTNFLNNE-I-LTNTQYLS
7766_ASPNI    -LMRRSKDGYVSATGMFKIAFP----WAK----------LEEERSERE-Y-LK-----T
5459_GIBZE    -LMRRSYDGFVSATGMFKASFP----YAE----------ASDEDAERK-Y-IK-----S
2267_NEUCR    -LMRRSQDGYISATGMFKATFP----YAS----------QEEEEAERK-Y-IK-----S
3510_ASPFU    -LMRRSKDGYVSATGMFKIAFP----WAK----------LEEEKAERE-Y-LK-----T
3762_MAGGR    -LMRRSSDGYVSATGMFKATFP----YAD----------AEDEEAERN-Y-IK-----S
3412_CANAL    -VLRRVQDSFVNVTQLFQILIKLEVLPTS----------QVDNYFDNE-I-LSNLKYFG


CLUSTAL
MBP1_SACCE    SIMKRKKDDWVNATHILKAAN----------FAKAKRTRILE----------KEVLKETHE
2599_ASPTE    -IMWDYNIGLVRTTPLFRSQ----------NYSKTTPAKVLDAN--------P-GLREISH
9773_DEBHA    -IIWDYETGFVHLTGIWKASIN-DEVNTHR-NLKADIVKLLEST--------PKQYHQHIK
0918_CANAL    -VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLEST--------PKEYQQYIK
9901_DEBHA    -ILRRVQDSYINISQLFSILL----------KIGHLSEAQLTNFLNNEILTNTQYLSSGGS
7766_ASPNI    -LMRRSKDGYVSATGMFKIAF----------PWAKLEEERSE----------REYLKTRPE
5459_GIBZE    -LMRRSYDGFVSATGMFKASF----------PYAEASDEDAE----------RKYIKSLPT
2267_NEUCR    -LMRRSQDGYISATGMFKATF----------PYASQEEEEAE----------RKYIKSIPT
3510_ASPFU    -LMRRSKDGYVSATGMFKIAF----------PWAKLEEEKAE----------REYLKTREG
3762_MAGGR    -LMRRSSDGYVSATGMFKATF----------PYADAEDEEAE----------RNYIKSLPA
3412_CANAL    -VLRRVQDSFVNVTQLFQILI----------KLEVLPTSQVDNYFDNEILSNLKYFGSSSN
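
When looking for places where an alignment could be improved by hand, it can help to score each column. The following sketch reports, for every column, the most common residue and the fraction of rows that carry it; columns that are poorly conserved or full of gaps stand out as candidates for closer inspection. The scoring scheme and the function name are assumptions chosen for simplicity, not the scoring used by any of the three programs, and the rows are the first ten columns of four sequences from the alignments above.

```python
from collections import Counter


def column_conservation(rows, gap="-"):
    """For each column return (most common residue, its fraction among all rows).

    Gaps count towards the denominator, so gappy columns score low.
    Columns that contain only gaps are reported as (gap, 0.0).
    """
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows of an alignment must have the same length")
    result = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)
        if not counts:
            result.append((gap, 0.0))
            continue
        residue, n = counts.most_common(1)[0]
        result.append((residue, n / len(column)))
    return result


rows = [
    "SIMKRKKDDW",   # MBP1_SACCE
    "-IMWDYNIGL",   # 2599_ASPTE
    "-LMRRSKDGY",   # 7766_ASPNI
    "-LMRRSKDGY",   # 3510_ASPFU
]
for i, (residue, fraction) in enumerate(column_conservation(rows), start=1):
    print(i, residue, f"{fraction:.2f}")
```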


Mbp1 alignments: analysis

   

What does a good alignment mean?


Let us first consider some of the features we have defined in the second assignment (and some structural features I have added). Here is an annotation of the yeast Mbp1 protein.




SUB section Heading (X marks)

 

Instruction

 

  • Task

 

Instruction

 

  • Task.

   

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List.