BIO Assignment 3 2011
Assignment 3 - Multiple Sequence Alignment
Please note: This assignment is currently inactive. Unannounced changes may be made at any time.
Introduction
A carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of a gene or protein. MSAs combine the information from several related proteins, allowing us to focus on essential, shared properties. They are primarily useful to resolve ambiguities in the precise placement of gaps and to ensure that columns in alignments actually contain corresponding amino acids. Therefore we need MSAs as input for protein homology modeling, for phylogenetic analyses, and for sensitive homology searches in databases. Furthermore, conservation - or the lack of conservation - can tell us something about structural or functional features of our protein and about domain boundaries in multi-domain proteins, and it can guide us regarding protein engineering and design.
Given the ubiquitous importance of this procedure, it is somewhat surprising that by far the most frequently used algorithm is CLUSTAL, which has been shown to be significantly inferior to more modern approaches for sequences with about 30% identity or less.
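To make the "30% identity" figure concrete, here is a minimal sketch of how percent identity can be computed from two rows of an alignment. The function name and the toy rows are made up for illustration. Note the assumption about the denominator: only columns in which both rows have a residue are counted. Other definitions divide by the alignment length or by the length of the shorter sequence, and they give different numbers.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity of two rows taken from the same alignment.

    Assumption: columns that contain a gap ('-') in either row are skipped,
    so the denominator is the number of residue-residue columns.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must come from the same alignment")
    compared = 0
    identical = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # gap column: not counted
        compared += 1
        if a.upper() == b.upper():
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy rows, invented for illustration (not taken from the assignment data).
print(percent_identity("ACD-EFG", "ACDKE-A"))
```

Whatever definition you use, state it when you report an identity value.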
In this assignment we will explore MSAs of the Mbp1 proteins and the APSES domains they contain and try several approaches to alignment:
- A model-based approach (based on the PSSM that PSI-BLAST generates)
- A progressive alignment - the CLUSTAL algorithm
- A consistency-based alignment - T-Coffee or ProbCons
Preparation, submission and due date
Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which people have simply overlooked crucial questions. Sadly, we always get assignments back in which people have not described procedural details. If you did not notice that the above were two different sentences, you are still not reading carefully enough.
Prepare a Microsoft Word document with a title page that contains:
- your full name
- your Student ID
- your e-mail address
- the organism name you have been assigned (see below)
Follow the steps outlined below. You are encouraged to write your answers in short answer form or point form, like you would document an analysis in a laboratory notebook. However, you must
- document what you have done,
- note what Web sites and tools you have used,
- paste important data sequences, alignments, information etc.
If you do not document the process of your work, we will deduct marks. Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formatting. Do not paste screendumps. Keep the size of your submission below 1.5 MB.
Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
A3_family name.given name.doc
(for example my first assignment would be named: A3_steipe.boris.doc - and don't switch the order of your given name and family name please!)
Finally e-mail the document to [boris.steipe@utoronto.ca] before the due date.
Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
With the number of students in the course, we have to economize on processing the assignments. Thus we will not accept assignments that are not prepared as described above. If you have technical difficulties, contact me.
The due date for the assignment is XXXXX at 10:00 in the morning.
Grading
Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted at the first minute of every twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed.
Marks are noted below in the section headings of the tasks. A total of 10 marks will be awarded if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will
- count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
- be divided by two for BCH1441 (graduates).
Retrieve
In Assignment 2 you had retrieved the Saccharomyces cerevisiae Mbp1 protein sequence. Here I have compiled the most similar homologues from the organisms you have studied:
Mbp1 homologues
Our first task is to compile a multi-FASTA file for all Mbp1 orthologues.
In your second assignment, you used BLAST to find the best matches to the yeast Mbp1 protein in your assigned organism's genome. Since there was some variation in the sequences you reported, I have generated a list de novo using the following procedure:
- Retrieved the Mbp1 protein sequence by searching Entrez for
Mbp1 AND "saccharomyces cerevisiae"[organism]
- Clicked on the RefSeq tab to find the RefSeq ID "NP_010227"
- Accessed the BLAST form for protein/protein BLAST and pasted the RefSeq ID into the query field. Chose refseq as the database. Kept default parameters. Chose Fungi as an Entrez query limit in the Options section.
- On the results page, checked the checkbox next to the alignment of the most significant hit from each of the organisms we are studying.
- Clicked on the "Get selected sequences" button. The results page lists the gene that is most similar to Mbp1 in each organism.
- Verified that each of these sequences finds Mbp1 as the best match in the Saccharomyces cerevisiae genome by clicking on each "BLink" in the retrieved list. Scrolled down the list to confirm that the top hit of a Saccharomyces cerevisiae protein is Mbp1.
- Obtained UniProt accessions for all sequences with a single query, using the new UniProt ID mapping service. This service accepts a comma-delimited list of RefSeq IDs and returns a list of UniProt proteins.
- Assembled this information into a table.
Organism | CODE | GI | RefSeq | UniProt Accession | Most similar yeast gene |
Aspergillus fumigatus | ASPFU | 70986922 | XP_748947 | Q4WGN2 | Mbp1 |
Aspergillus nidulans | ASPNI | 67525393 | XP_660758 | Q5B8H6 | Mbp1 |
Aspergillus terreus | ASPTE | 115391425 | XP_001213217 | Q0CQJ5 | Mbp1 |
Candida albicans | CANAL | 68465419 | XP_723071 | Q5ANP5 | Mbp1 |
Candida glabrata | CANGL | 50286059 | XP_445458 | Q6FWD6 | Mbp1 |
Cryptococcus neoformans | CRYNE | 58266778 | XP_570545 | Q5KHS0 | Mbp1 |
Debaryomyces hansenii | DEBHA | 50420495 | XP_458784 | Q6BSN6 | Mbp1 |
Eremothecium gossypii | EREGO | 45199118 | NP_986147 | Q752H3 | Mbp1 |
Gibberella zeae | GIBZE | 46116756 | XP_384396 | Q4IEY8 | Mbp1 |
Kluyveromyces lactis | KLULA | 50308375 | XP_454189 | P39679 | Mbp1 |
Magnaporthe grisea | MAGGR | 39964664 | XP_365024 | ACC | Mbp1* |
Neurospora crassa | NEUCR | 85109541 | XP_962967 | Q7SBG9 | Mbp1 |
Saccharomyces cerevisiae | SACCE | 6320147 | NP_010227 | P39678 | Mbp1 |
Schizosaccharomyces pombe | SCHPO | 19113944 | NP_593032 | P41412 | Mbp1 |
Ustilago maydis | USTMA | 71024227 | XP_762343 | Q4P117 | Mbp1 |
Yarrowia lipolytica | YARLI | 50545439 | XP_500257 | Q6CGF5 | Mbp1 |
* Note: This is a full-length homologue; however, the C-terminal half is more similar to Swi6 than to Mbp1.
- Download all sequences, generate a multi-FASTA file and save it to your computer. Don't submit the file but do record how you created it. (1 mark)
Hint: don't do this by hand, you can get the sequences all at once. Click here if you don't know how.
- Briefly explain if these sequences appear to be orthologues to yeast Mbp1 (as evidenced through the "reciprocal best-match" criterion). Briefly explain if these sequences are necessarily orthologues to each other. (1 mark)
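One possible way to fetch several sequences at once, offered only as a sketch and not necessarily the route the hint has in mind, is a single request to the NCBI E-utilities EFetch service. The code below only builds the request URL and counts the records in a FASTA text; it does not contact NCBI. The helper names are hypothetical, the ID list holds just three RefSeq IDs copied from the table above, and the FASTA text in the check is invented.

```python
from urllib.parse import urlencode

# NCBI E-utilities EFetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Three RefSeq IDs copied from the table (SACCE, ASPFU, ASPNI);
# extend the list with the remaining rows.
REFSEQ_IDS = ["NP_010227", "XP_748947", "XP_660758"]


def efetch_fasta_url(ids):
    """Build one EFetch URL that requests all IDs as plain-text FASTA."""
    params = {
        "db": "protein",
        "id": ",".join(ids),   # comma-delimited list of identifiers
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


def count_fasta_records(text):
    """Count records in a multi-FASTA text (one '>' header line per record)."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


url = efetch_fasta_url(REFSEQ_IDS)
print(url)

# Sanity check on a made-up two-record multi-FASTA text.
toy = ">seq1 hypothetical\nACDEFG\n>seq2 hypothetical\nACDKEF\n"
print(count_fasta_records(toy))
```

Counting the header lines of the file you saved is a quick way to confirm that you retrieved one record per organism. The same comma-delimited list is the input format described for the UniProt ID mapping step.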
Other APSES domain sequences
Mbp1 orthologues are not the only proteins that contain APSES domains. In order to find all the rest, a PSI-BLAST search was performed using the yeast Mbp1 APSES domain as query. From the list of hits, the APSES domains were extracted and summarized in a file.
- Review the resulting file for the APSES domains and make sure you understand the procedure that led to it. Summarize the key steps of the procedure in point form. (1 mark)
Orthologues
Determine for one of the APSES domains in your organism which yeast APSES domain (if any) it is orthologous to:
- Choose at random one of the APSES domains from your organism and copy its sequence into the input window of a BLAST search.
- Restrict the BLAST search to RefSeq sequences in Saccharomyces cerevisiae.
- Run the search and determine the gene name of the best hit. (This is the best match.)
- Find the sequence of the APSES domain in the sequence list.
- Copy that sequence and perform the same kind of BLAST search, this time restricted to your organism. (This finds the reciprocal match.)
- Report briefly what you have found. Does the gene you have chosen fulfill the reciprocal best match criterion for orthology with a yeast gene? (1 mark)
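The logic of the reciprocal best match test in the steps above can be sketched in a few lines. This is only an illustration of the criterion: the function name and all identifiers in the toy data are hypothetical, and in practice the two dictionaries would be filled in from the top hits of your two BLAST searches.

```python
def is_reciprocal_best_match(query, forward_best, reverse_best):
    """Return True if the best hit of `query` finds `query` again as its best hit.

    forward_best: maps a gene in your organism to its best hit in yeast.
    reverse_best: maps a yeast gene to its best hit in your organism.
    """
    hit = forward_best.get(query)
    if hit is None:
        return False  # no forward hit at all
    return reverse_best.get(hit) == query


# Hypothetical identifiers, for illustration only.
forward_best = {
    "orgA_gene1": "yeast_geneX",
    "orgA_gene2": "yeast_geneX",  # a paralogue that also hits yeast_geneX
}
reverse_best = {
    "yeast_geneX": "orgA_gene1",
}

print(is_reciprocal_best_match("orgA_gene1", forward_best, reverse_best))
print(is_reciprocal_best_match("orgA_gene2", forward_best, reverse_best))
```

In the toy data both genes find the same yeast gene, but only one of them is found in return. This is the situation to look out for when an organism has several APSES domain proteins.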
Align
Aligning the Mbp1 proteins (X marks)
Instruction
- Task.
Aligning the APSES domain (X marks)
- All APSES domains CLUSTAL-W alignment
- All APSES domains probcons alignment
- All APSES domains PSI-BLAST alignment
Instruction
- Compare the three alignments
Consider the following sub-alignment from PSI-BLAST. Find at least one example where the alignment could be manually improved.
PSI-BLAST
MBP1_SACCE SIMKRKKDDWVNATHILKA------A----------NFA--------KAKRTR-----
2599_ASPTE -IMWDYNIGLVRTTPLFRS------Q----------NYS--------KTTPAK-----
9773_DEBHA -IIWDYETGFVHLTGIWKA------S----------INDEVNTHRNLKADIVK-----
0918_CANAL -VIWDYETGWVHLTGIWKA------SLTIDGSNVSPSHL--------KADIVK-----
9901_DEBHA -ILRRVQDSYINISQLF--------SILLKIG----HLS--------EAQLTN-----
7766_ASPNI -LMRRSKDGYVSATGMFKI------A-----------FP--------WAKLEEERSER
5459_GIBZE -LMRRSYDGFVSATGMFKASFPYAEA----------SDE--------DAERKY-----
2267_NEUCR -LMRRSQDGYISATGMFKA------TFPYASQ----EEE--------EAERKY-----
3510_ASPFU -LMRRSKDGYVSATGMFKI------A-----------FP--------WAK--------
3762_MAGGR -LMRRSSDGYVSATGMFKATFPYADA----------EDE--------EAERNY-----
3412_CANAL -VLRRVQDSFVNVTQLFQI------LIKLE------VLP--------TSQVDN-----

Probcons
MBP1_SACCE SIMKRKKDDWVNATHILKAANF----AKA----------KRTRILEKE-V-LKETH--E
2599_ASPTE -IMWDYNIGLVRTTPLFRSQNY----SKT----------TPAKVLDAN-PGLREIS--H
9773_DEBHA -IIWDYETGFVHLTGIWKASIN----DEV--NTHRNLKADIVKLLESTPKQYHQHI--K
0918_CANAL -VIWDYETGWVHLTGIWKASLT----IDGSNVSPSHLKADIVKLLESTPKEYQQYI--K
9901_DEBHA -ILRRVQDSYINISQLFSILLKIGHLSEA----------QLTNFLNNE-I-LTNTQYLS
7766_ASPNI -LMRRSKDGYVSATGMFKIAFP----WAK----------LEEERSERE-Y-LK-----T
5459_GIBZE -LMRRSYDGFVSATGMFKASFP----YAE----------ASDEDAERK-Y-IK-----S
2267_NEUCR -LMRRSQDGYISATGMFKATFP----YAS----------QEEEEAERK-Y-IK-----S
3510_ASPFU -LMRRSKDGYVSATGMFKIAFP----WAK----------LEEEKAERE-Y-LK-----T
3762_MAGGR -LMRRSSDGYVSATGMFKATFP----YAD----------AEDEEAERN-Y-IK-----S
3412_CANAL -VLRRVQDSFVNVTQLFQILIKLEVLPTS----------QVDNYFDNE-I-LSNLKYFG

CLUSTAL
MBP1_SACCE SIMKRKKDDWVNATHILKAAN----------FAKAKRTRILE----------KEVLKETHE
2599_ASPTE -IMWDYNIGLVRTTPLFRSQ----------NYSKTTPAKVLDAN--------P-GLREISH
9773_DEBHA -IIWDYETGFVHLTGIWKASIN-DEVNTHR-NLKADIVKLLEST--------PKQYHQHIK
0918_CANAL -VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLEST--------PKEYQQYIK
9901_DEBHA -ILRRVQDSYINISQLFSILL----------KIGHLSEAQLTNFLNNEILTNTQYLSSGGS
7766_ASPNI -LMRRSKDGYVSATGMFKIAF----------PWAKLEEERSE----------REYLKTRPE
5459_GIBZE -LMRRSYDGFVSATGMFKASF----------PYAEASDEDAE----------RKYIKSLPT
2267_NEUCR -LMRRSQDGYISATGMFKATF----------PYASQEEEEAE----------RKYIKSIPT
3510_ASPFU -LMRRSKDGYVSATGMFKIAF----------PWAKLEEEKAE----------REYLKTREG
3762_MAGGR -LMRRSSDGYVSATGMFKATF----------PYADAEDEEAE----------RNYIKSLPA
3412_CANAL -VLRRVQDSFVNVTQLFQILI----------KLEVLPTSQVDNYFDNEILSNLKYFGSSSN
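When comparing alignments produced by different programs, one simple question is which residue pairs of two sequences are placed in the same column by both programs. The sketch below illustrates that idea. The function names are hypothetical and the rows are toy data, not taken from the alignments above. It is not a substitute for inspecting the columns yourself.

```python
def aligned_pairs(row_a, row_b):
    """Return the set of (i, j) residue index pairs that share a column.

    i and j count residues (not columns) in the ungapped sequences, so the
    result can be compared between alignments of different lengths.
    """
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def pair_agreement(aln1, aln2):
    """Fraction of residue pairs in aln1 that are also aligned in aln2.

    Each alignment is given as a (row_a, row_b) tuple for the same two sequences.
    """
    p1 = aligned_pairs(*aln1)
    p2 = aligned_pairs(*aln2)
    if not p1:
        return 0.0
    return len(p1 & p2) / len(p1)


# Toy rows: the same two sequences, with one gap placed differently.
program_1 = ("ACD-EF", "ACDKEF")
program_2 = ("ACDE-F", "ACDKEF")
print(pair_agreement(program_1, program_2))
```

Applied to two rows copied from each of the three alignments above, regions where the agreement drops are good places to look for columns that could be improved manually.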
Analyse
SUB section Heading (X marks)
Instruction
- Task
Instruction
- Task.
[End of assignment]
If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List