|
|
(189 intermediate revisions by the same user not shown) |
Line 1: |
Line 1: |
− | <div style="padding: 5px; background: #FF4560; border:solid 2px #000000;"> | + | <div id="APB"> |
− | '''Note!'''
| |
− | This assignment is currently inactive. Major and minor unannounced changes may be made at any time.
| |
− | </div>
| |
− |
| |
| | | |
− | | + | <table width="40%"><tr><td class="l1"> </td><td> |
| | | |
| + | ===Hardware=== |
| + | <table width="100%"> |
| + | <tr class="s1"><td class="l1">High performance computing <!-- (... at the bench: GPUs, FPGAs, Clusters) --></td></tr> |
| + | <tr class="s2"><td class="l1">Cloud computing</td></tr> |
| + | <tr><td class="sp"> </td></tr> |
| + | </table> |
| | | |
− | __TOC__
| + | ===Systems and Tools=== |
− |
| + | <table width="100%"> |
− |
| |
| | | |
− | <div style="padding: 5px; background: #A6AFD0; border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;"> | + | <tr class="s1"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[Unix]] |
− | Assignment 3 - Multiple Sequence Alignment
| + | <div class="mw-collapsible-content"> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Unix system administration]]</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[Unix automation]]</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Program installation]]</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[wget]]</td></tr></table> |
| </div> | | </div> |
| + | </td></tr> |
| | | |
− | Please note: This assignment is currently inactive. Unannounced changes may be made at any time.
| + | <tr class="s2"><td class="l1">[[Network Configuration]]</td></tr> |
− |
| + | <tr class="s1"><td class="l1">[[Apache]]</td></tr> |
− | <!-- '''Please note: This assignment is currently active. All significant changes will be announced on the course mailing list.''' | + | <tr class="s2"><td class="l1">[[MySQL]]</td></tr> |
− | --> | + | <tr class="s1"><td class="l1">[[Tools for the bioinformatics lab]]</td></tr> |
− | | + | <tr class="s2"><td class="l1">[[GBrowse|GBrowse and LDAS]]</td></tr> |
| + | <tr><td class="sp"> </td></tr> |
| + | </table> |
| | | |
− | <div style="padding: 2px; background: #F0F1F7; border:solid 1px #AAAAAA; font-size:125%;color:#444444"> | + | ===Programming=== |
− | Introduction
| + | <table width="100%" > |
− |
| + | <tr class="s1"><td class="l1">[[IDE|IDE (Integrated Development Environment)]]</td></tr> |
| + | <tr class="s2"><td class="l1">[[Regular Expressions]]</td></tr> |
| + | <tr class="s1"><td class="l1">[[Screenscraping]]</td></tr> |
| | | |
− | ;The difficulty lies, not in the new ideas, but in escaping the old ones, which ramify, for those brought up as most of us have been, into every corner of our minds.
| + | <tr class="s2"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[Perl]] |
− | :''John Maynard Keynes'' | + | <div class="mw-collapsible-content"> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[Perl basic programming]]</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Perl hash example]]</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[Perl LWP example]]</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Perl MySQL introduction]]</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[Perl OBO parser]]</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Perl basic programming]]</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[Perl programming exercises 1]]</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Perl programming exercises 2]]</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[Perl programming Data Structures]]</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Perl references]]</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[Perl simulation]]</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Perl: Object oriented programming]]</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[Perl: Ugly programming]]</td></tr></table> |
| </div> | | </div> |
| + | </td></tr> |
| | | |
− | ... but what confidence can we have in the new idea in the first place ?
| + | <tr class="s1"><td class="l1">[[BioPerl]]</td></tr> |
| + | <tr class="s2"><td class="l1">[[PHP]]</td></tr> |
| + | <tr class="s1"><td class="l1">[[Data modelling]]</td></tr> |
| + | <tr class="s2"><td class="l1">BioPython <!-- (scope, highlights, installation, use, support) --></td></tr> |
| + | <tr class="s1"><td class="l1">Graphical output <!-- (PNG and SVG) --></td></tr> |
| + | <tr class="s2"><td class="l1">[[Autonomous agents]]</td></tr> |
| + | </table> |
| | | |
− | A carefully done multiple sequence alignment (MSA) is a cornerstone for the annotation of a gene or protein. MSAs combine the information from several related proteins, allowing us to study their essential, shared and conserved properties. They are useful to resolve ambiguities in the precise placement of gaps and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. Therefore we need MSAs as input for
| + | ===Algorithms=== |
− | *protein homology modeling,
| + | <table width="100%" > |
− | * phylogenetic analyses, and
| + | <tr class="sh"><td class="l1">Algorithms on Sequences</td></tr> |
− | * sensitive homology searches in databases.
| + | <tr class="s1"><td class="l2">[[Dynamic Programming]]</td></tr> |
| + | <tr class="s2"><td class="l2">[[Multiple Sequence Alignment]]</td></tr> |
| + | <tr class="s1"><td class="l2">[[Genome Assembly]]</td></tr> |
| | | |
− | Furthermore conservation - or the lack of conservation - reflects the requirements of structural or functional features of our protein, emphasizes domain boundaries in multi-domain proteins and it can guide mutations for protein engineering and design.
| + | <tr><td class="sp"> </td></tr> |
| | | |
− | Given the ubiquitous importance of this procedure, it is somewhat surprising that by far the most frequently used algorithm is CLUSTAL, which has been shown to be significantly inferior to more modern approaches for sequences with about 30% identity or less.
| + | <tr class="sh"><td class="l1">Algorithms on Structures</td></tr> |
| + | <tr class="s1"><td class="l2">[[Docking]]</td></tr> |
| + | <tr class="s2"><td class="l2">Protein Structure Prediction <!-- ''ab initio'' --></td></tr> |
| | | |
− | In this assignment we will explore MSAs of the Mbp1 proteins and the APSES domains they contain and discuss several approaches to alignment:
| + | <tr><td class="sp"> </td></tr> |
| | | |
− | * A model-based approach (based on the [[Glossary#PSSM| PSSM]] that PSI-BLAST generates)
| + | <tr class="sh"><td class="l1">Algorithms on Trees</td></tr> |
− | * A progressive alignment - the CLUSTAL algorithm
| + | <tr class="s1"><td class="l2">Computing with trees <!-- Bayesian approaches for phylogenetic trees, tree comparison) --></td></tr> |
− | * A consistency-based alignment - T-Coffee or ProbCons
| |
| | | |
| + | <tr><td class="sp"> </td></tr> |
| | | |
− | <div style="padding: 2px; background: #F0F1F7; border:solid 1px #AAAAAA; font-size:125%;color:#444444"> | + | <tr class="sh"><td class="l1">Algorithms on Networks</td></tr> |
− | Preparation, submission and due date
| + | <tr class="s1"><td class="l2">Network metrics <!-- (Degree distributions, Centrality metrics, other metrics on topology, small-world- vs. random-geometric controversy) --></td></tr> |
− | </div> | + | <tr class="s2"><td class="l3">[[Dijkstras Algorithm]]</td></tr> |
| + | <tr class="s1"><td class="l3">[[Floyd Warshall Algorithm]]</td></tr> |
| + | </table> |
| | | |
− | Please read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which people have simply overlooked crucial questions. Sadly, we always get assignments back in which people have not described procedural details. If you did not notice that the above were two different sentences, you are still not reading carefully enough.
| |
| | | |
− | Prepare a Microsoft Word document with a title page that contains:
| + | ===Communication and collaboration=== |
− | *your full name
| + | <table width="100%" > |
− | *your Student ID
| + | <tr class="s1"><td class="l1">[[MediaWiki]]</td></tr> |
− | *your e-mail address
| + | <tr class="s2"><td class="l1">[[HTML essentials]]</td></tr> |
− | *the organism name you have been [[Organism_list_2006|assigned]]
| + | <tr class="s1"><td class="l1">[[HTML 5]]</td></tr> |
− | | + | <tr class="s2"><td class="l1">[[SADI|SADI Semantic Automated Discovery and Integration]]</td></tr> |
− | Follow the steps outlined below. You are encouraged to write your answers in short answer form or point form, '''like you would document an analysis in a laboratory notebook'''. However, you must
| + | <tr class="s1"><td class="l1">[[CGI]]</td></tr> |
− | *document what you have done,
| + | <tr><td class="sp"> </td></tr> |
− | *note what Web sites and tools you have used,
| + | </table> |
− | *paste important data sequences, alignments, information etc.
| |
− | | |
− | '''If you do not document the process of your work, we will deduct marks.''' Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formatting. Do not paste screen dumps. Keep the size of your submission '''below 1.5 MB'''.
| |
− | | |
− | Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
| |
− | <code>A3_family name.given name.doc</code> | |
− | <small>(for example my submission would be named: A3_steipe.boris.doc - and don't switch the order of your given name and family name please!)</small> | |
− | | |
− | Finally e-mail the document to [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] before the due date.
| |
− | | |
− | Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
| |
− | | |
− | With the number of students in the course, we have to economize on processing the assignments. '''Thus we will not accept assignments that are not prepared as described above.''' If you have technical difficulties, contact me.
| |
| | | |
− | '''The due date for the assignment is Thursday, December 7, at 24:00 (last day of class). In case you need more time since the assignment was posted late, an extension is automatically granted to Friday, December 8, at 10:00 in the morning.'''
| + | ===Statistics=== |
| + | <table width="100%" > |
| + | <tr class="s1"><td class="l1">[[Pattern discovery]]</td></tr> |
| + | <tr class="s2"><td class="l1">Correlation <!-- (Covariance matrices and their interpretation, application to large problems, collaborative filtering, MIC and MINE) --></td></tr> |
| + | <tr class="s1"><td class="l1">Clustering methods <!-- (Algorithms and choice (including: hierarchical, model-based and partition clustering, graphical methods (MCL), flow based methods (RRW) and spectral methods). Implementation in R if possible) --></td></tr> |
| + | <tr class="s2"><td class="l1">Cluster metrics <!-- (Cluster quality metrics (Akaike, BIC)–when and how) --></td></tr> |
| + | <tr class="s1"><td class="l1">[[Map equation|The Map Equation]] </td></tr> |
| + | <tr class="s2"><td class="l1">Machine learning <!-- (Classification problems: Neural Networks, HMMs, SVM..) --></td></tr> |
| | | |
− | <div style="padding: 2px; background: #F0F1F7; border:solid 1px #AAAAAA; font-size:125%;color:#444444"> | + | <tr class="s1"><td class="l1 mw-collapsible mw-collapsed" data-expandtext="Expand subtopics" data-collapsetext="Collapse">[[R]] |
− | Grading
| + | <div class="mw-collapsible-content"> |
| + | <table width="100%"><tr class="s2"><td class="l2">R plotting</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">[[R programming]]</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">R EDA</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">R regression</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">R PCA</td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">R Clustering</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">R Classification <!-- Phrasing inquiry as a classification problem, dealing with noisy data, machine learning approaches to classification, implementation in R) --></td></tr></table> |
| + | <table width="100%"><tr class="s1"><td class="l2">R hypothesis testing</td></tr></table> |
| + | <table width="100%"><tr class="s2"><td class="l2">[[Bioconductor]]</td></tr></table> |
| </div> | | </div> |
| + | </td></tr> |
| | | |
− | Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted and an additional mark for every full twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you '''must''' arrange this beforehand.
| + | <tr><td class="sp"> </td></tr> |
− | | + | </table> |
− | Marks are noted below in the section headings of the tasks. A total of 10 marks will be awarded if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will
| |
− | * count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
| |
− | * be divided by two for BCH1441 (graduates).
| |
− | | |
− |
| |
− |
| |
− | | |
− | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;"> | |
− | ==(1) Retrieve==
| |
− | </div> | |
− |
| |
− |
| |
− | | |
− | In [[Assignment 2]] you retrieved the ''Saccharomyces cerevisiae'' [http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&val=6320147 '''Mbp1'''] protein sequence. Our first task is to compile a multi-FASTA file for all Mbp1 orthologues. First we need to define which sequences we are talking about. Then we need to retrieve them from the database.
| |
− | | |
− |
| |
− | | |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− | ===(1.1) Mbp1 orthologues (1 mark)===
| |
− | </div>
| |
− | <br> | |
− | | |
− | | |
− | In your second assignment, you used BLAST to find the best matches to the yeast Mbp1 protein in your assigned organism's genome. Since there was some variation in the sequences you reported, I have generated a list ''de novo'' using the following procedure:
| |
− | | |
− | #Retrieved the Mbp1 protein sequence by searching [http://www.ncbi.nlm.nih.gov/ Entrez] for <code>Mbp1 AND "saccharomyces cerevisiae"[organism]</code>
| |
− | #Clicked on the ''RefSeq tab'' to find the RefSeq ID "<code>[http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=6320147&dopt=GenPept NP_010227]</code>"
| |
− | #Accessed the [http://www.ncbi.nlm.nih.gov/blast '''BLAST'''] form for protein/protein BLAST and pasted the RefSeq ID into the ''query field''. Chose ''refseq'' as the database to search in, from the ''drop-down menu''. Kept default parameters but turned ''Filter'' off. Chose Fungi as an ENTREZ query limit in the ''Options'' section.
| |
− | #On the results page, checked the checkbox next to the alignment '''of the most significant hit from each of the organisms''' we are studying.
| |
− | #Clicked on the "Get selected sequences" button. The results page lists the gene that is most similar to Mbp1 in each organism.
| |
− | #Verified that each of these sequences finds Mbp1 as the best match in the ''Saccharomyces cerevisiae'' genome by clicking on each "[http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?pid=68465419 BLink]" (<small>click for example</small>) in the retrieved list. Scrolled down the list to confirm that the '''top hit of a ''Saccharomyces cerevisiae'' protein''' is indeed Mbp1 (<code>NP_010227</code>).
| |
− | #Obtained UniProt accessions for all sequences, with a single query using the new UniProt [http://www.pir.uniprot.org/search/idmapping.shtml ID mapping service]. This service accepts a comma-delimited list of RefSeq IDs and returns a list of UniProt proteins.
| |
− | #Assembled this information into the following table.
| |
− | | |
− | | |
− | <table style="border-left:1px solid #AAAAAA; border-bottom:1px solid #AAAAAA;" cellpadding="10" cellspacing="0">
| |
− | <tr style="background: #BDC3DC;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b><i>Organism</i></b></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>CODE</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>GI</b></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>RefSeq</b></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>UniProt Accession</b></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>Most similar yeast gene</b></td>
| |
− | </tr> | |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus fumigatus</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>ASPFU</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">70986922</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_748947</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q4WGN2 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus nidulans</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>ASPNI</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">67525393</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_660758</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q5B8H6 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus terreus</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>ASPTE</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">115391425</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_001213217</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q0CQJ5 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida albicans</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>CANAL</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">68465419</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_723071</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q5ANP5 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida glabrata</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>CANGL</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">50286059</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_445458</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q6FWD6 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Coprinopsis cinerea</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>COPCI</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">...</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">...</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> ... </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">...</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Cryptococcus neoformans</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>CRYNE</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">58266778</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_570545</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q5KHS0 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Debaryomyces hansenii</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>DEBHA</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">50420495</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_458784</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q6BSN6 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Eremothecium gossypii</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>EREGO</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">45199118</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">NP_986147</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q752H3 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Gibberella zeae</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>GIBZE</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">46116756</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_384396</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q4IEY8 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Kluyveromyces lactis</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>KLULA</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">50308375</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_454189</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> P39679 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Magnaporthe grisea</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>MAGGR</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">39964664</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_365024</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">ACC</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1*</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Neurospora crassa</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>NEUCR</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">85109541</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_962967</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q7SBG9 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Pichia stipitis</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>PICST</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">...</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">...</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> ... </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Saccharomyces cerevisiae</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>SACCE</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">6320147 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">NP_010227</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> P39678 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Schizosaccharomyces pombe</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>SCHPO</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">19113944</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">NP_593032</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> P41412 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #FFFFFF;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Ustilago maydis</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>USTMA</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">71024227</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_762343</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q4P117 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
− | | |
− | <tr style="background: #E9EBF3;">
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Yarrowia lipolytica</i></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><code>YARLI</code></td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">50545439</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">XP_500257</td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"> Q6CGF5 </td>
| |
− | <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">Mbp1</td>
| |
− | </tr>
| |
| | | |
| + | ===Applications=== |
| + | <table width="100%" > |
| + | <tr class="s1"><td class="l1">[[Data integration]] <!-- Add BioMart: Biodata integration, and data-mining of complex, related, descriptive data --></td></tr> |
| + | <tr class="s2"><td class="l1">Text mining <!-- (Use cases, tasks and metrics, taggers, vocabulary mapping, Practicals: R-support, Python/Perl support, others...) --></td></tr> |
| + | <tr class="s1"><td class="l1">[[HMMER]]</td></tr> |
| + | <tr class="s2"><td class="l1">High-throughput sequencing</td></tr> |
| + | <tr class="s1"><td class="l1">Functional annotation <!-- GFF --></td></tr> |
| + | <tr class="s2"><td class="l1">Microarray analysis <!-- (... in R: differential expression and multiple testing; Loading and normalizing data, calculating differential expression, LOWESS, the question of significance, FWERs: Bonferroni and FDR; SAM and LIMMA) --></td></tr> |
| + | <tr><td class="sp"> </td></tr> |
| </table> | | </table> |
| + | </td></tr></table> |
| | | |
− | <small>Table of yeast Mbp1 orthologues in genome-sequenced fungi. Columns from left to right: Systematic name, organism code (a string we use as an abbreviation
| |
− |
| |
− | * Note: This is a full-length homologue; however, BLink shows that the C-terminal half is more similar to Swi6 than to Mbp1. Thus I would consider the APSES domain orthologous, the remainder possibly paralogous.</small>
| |
− |
| |
− | <br>
| |
− | Our second task is to obtain all FASTA sequences based on a list of identifiers and to save them in a format in which we can use them as input for other programs or services.
| |
− | <br>
| |
− |
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | *From the information given here, briefly explain whether the sequences listed above appear to be '''orthologues to yeast Mbp1''' (as evidenced through the "reciprocal best-match" criterion). Briefly explain whether these sequences are necessarily also '''orthologues to each other'''.
| |
− |
| |
− | *Review the resulting multi-FASTA file for the [[All_Mbp1_proteins|'''all Mbp1 proteins (linked here)''']] and make sure you understand the procedure that led to it. Summarize the key steps of the procedure in point form. (Don't submit the entire file of course but make sure you understand (and could reproduce) the essential parts of the procedure). (1 mark)<br>
| |
− |
| |
− | </div>
| |
− | <br>
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− |
| |
− | ===(1.2) Other APSES domain sequences (1 mark)===
| |
− | </div>
| |
− | <br>
| |
− |
| |
− |
| |
− | Mbp1 orthologues are not the only proteins that contain APSES domains. In order to find all the rest, a PSI-BLAST search was performed using the yeast Mbp1 APSES domain as query. From the list of hits, the APSES domains were extracted and summarized in a file.
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | *Review the resulting file for the [[All_APSES_domains|'''APSES domains''']] and make sure you understand the procedure that led to it. Summarize the key steps of the procedure in point form. (1 mark)
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− |
| |
− | ===(1.3) Orthologues (1 mark)===
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | For '''one''' of the APSES domains in your organism, determine which yeast APSES domain (if any) it is orthologous to:
| |
− | # Choose at random one of the [[All_APSES_domains|APSES domains]] from your organism (but not one labelled with Mbp1) and copy its [[All_APSES_domains|sequence]] into the input window of a [http://www.ncbi.nlm.nih.gov/blast/ BLAST] search.
| |
− | # Restrict the BLAST search to RefSeq sequences in ''Saccharomyces cerevisiae''.
| |
− | # Run the search and determine the gene name of the best hit. (This is the best match.)
| |
− | # Find the sequence of your best hit's APSES domain in the [[All_APSES_domains|sequence file]]. (Since the file contains all of them, your hit has to be in there, unless you found a non-RefSeq sequence).
| |
− | # Copy that sequence (i.e. use the exact sequence from the file, not only the possibly truncated sequence from the BLAST results alignment) and perform the same kind of BLAST search, this time restricted to your organism instead of yeast. (This finds the reciprocal match.)
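The logic of the forward and reverse searches above can be sketched in a few lines of Python. This assumes you have copied the hit lists from the BLAST result pages by hand as (identifier, score) pairs; the identifiers and scores in the example are hypothetical, and nothing here talks to the BLAST server.

```python
from typing import List, Optional, Tuple

# One BLAST hit: (subject identifier, bit score). Hypothetical representation.
Hit = Tuple[str, float]


def best_hit(hits: List[Hit]) -> Optional[str]:
    """Return the identifier of the highest-scoring hit, or None if no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[1])[0]


def is_reciprocal_best_match(query_id: str,
                             forward_hits: List[Hit],
                             reverse_hits: List[Hit]) -> bool:
    """Check the reciprocal best match criterion.

    forward_hits: hits of the query domain against yeast.
    reverse_hits: hits of the best yeast hit against the query organism.
    """
    if best_hit(forward_hits) is None:
        return False
    return best_hit(reverse_hits) == query_id


if __name__ == "__main__":
    # Hypothetical identifiers and scores, for illustration only.
    forward = [("YEAST_A", 212.0), ("YEAST_B", 95.5)]
    reverse = [("MYORG_1", 208.0), ("MYORG_2", 101.0)]
    print(best_hit(forward))
    print(is_reciprocal_best_match("MYORG_1", forward, reverse))
```

Note that the sketch ignores ties and near-ties in score; in practice a second hit that scores almost as well as the first deserves a comment in your report.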
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | * Document the process and report briefly what you have found on the forward and on the reverse search. Does the gene you have chosen fulfill the ''reciprocal best match'' criterion for orthology with a yeast gene? (1 mark)
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− |
| |
− | ==(2) Align==
| |
− | </div>
| |
− |
| |
− |
| |
− |
| |
− | Actually performing multiple sequence alignments used to involve downloading and installing software on your own computer. While most tools were available on the Web in principle, many groups have restricted the total number of sequences or the total number of characters to be aligned. The EBI, however, offers three of the most commonly used tools with few limitations, and it was possible to run MSAs for all Mbp1 orthologues jointly.
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− | ===(2.1) Aligning the Mbp1 orthologues (1 mark)===
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | I used the following three servers:
| |
− |
| |
− | * [http://www.ebi.ac.uk/clustalw/ '''CLUSTAL-W'''] is a progressive alignment program. It is the most popular and most widely referenced MSA algorithm, and it is reasonably fast and easy to use. However, alignment errors that are made early cannot be corrected later, so it is prone to misalignments on sets of sequences that have poor (<30% ID) local similarity. It is no longer considered state-of-the-art for carefully done alignments.
| |
− | * [http://www.ebi.ac.uk/muscle/ '''MUSCLE'''] essentially starts out from a CLUSTAL-like alignment as a draft, then identifies similar groups of sequences from which it calculates profiles, and then re-aligns the group to the profile. This procedure is iterated.
| |
− | * [http://www.ebi.ac.uk/t-coffee/ '''T-Coffee'''] is one of my favourites - the tradeoffs appear to be especially well balanced. It too starts from a set of pairwise global alignments, like CLUSTAL, then additionally calculates sets of best local alignments. Global and local alignments are then combined to a similarity matrix and based on this matrix a guide-tree is constructed. This determines the order of steps in which sequences are added to the multiple alignment. A nice feature of T-Coffee is color coded output that allows you to quickly judge the local reliability of the alignment.
| |
− |
| |
− | We shall perform multiple sequence alignments for all 16 Mbp1 orthologues and compare the results. Since the results should look the same for all of you, it was possible to precompute the alignments to save some resources. Of course you are welcome to do this on your own, but it is not required. In fact, since we want to compare the alignments, I have also edited them: I have '''re-sorted the results so that the sequences appear in the same order in each case'''. Only CLUSTAL provides the option to order the output in the same way as the input; the other two programs order the output so that adjacent sequences are most similar. This is useful because it emphasizes sequence features, but it makes it impossibly tedious to compare alignments.
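If you do run the alignments yourself, the re-sorting step is easy to script. The following is a minimal Python sketch, assuming each alignment has already been read into a list of (name, aligned sequence) pairs; the function name and the toy rows are hypothetical and this is not necessarily how the linked files were edited.

```python
from typing import List, Sequence, Tuple

# One alignment row: (sequence name, aligned sequence including gap characters).
Row = Tuple[str, str]


def reorder_alignment(rows: Sequence[Row],
                      reference_order: Sequence[str]) -> List[Row]:
    """Return the alignment rows sorted into the order given by reference_order."""
    by_name = dict(rows)
    missing = [name for name in reference_order if name not in by_name]
    if missing:
        raise KeyError(f"not in alignment: {missing}")
    return [(name, by_name[name]) for name in reference_order]


if __name__ == "__main__":
    # Toy rows in "most similar neighbours" order; the sequences are made up.
    muscle_like = [("SEQ_C", "MK-TW"), ("SEQ_A", "MKATW"), ("SEQ_B", "MR-TW")]
    input_order = ["SEQ_A", "SEQ_B", "SEQ_C"]
    for name, seq in reorder_alignment(muscle_like, input_order):
        print(f"{name:10s} {seq}")
```

Only the order of the rows changes; the aligned sequences themselves, including all gap characters, are left untouched.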
| |
− |
| |
− | [[Image:A03_01.jpg|frame|none|Assignment 3, Figure 01<br>
| |
− | The guide tree computed by CLUSTAL-W for the 16 Mbp1 orthologue sequences. This tree is based on a matrix of pairwise distances. Sequences in the multiple alignments have been rearranged into the same order as they appear in this diagram.
| |
− | ]]
| |
− |
| |
− |
| |
− | The result files are linked here:
| |
− |
| |
− | * [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
| |
− | * [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
| |
− | * [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]] and [[All_Mbp1_T-COFFEE_scores| (coloured according to scores)]]
| |
− |
| |
− | Globally speaking, the alignments are quite similar. Let's first look at the common themes, before we discuss details of the results. The [[All_Mbp1_T-COFFEE_scores| (score-colored T-COFFEE alignment)]] is well suited to look at general relationships between the sequences, since outliers can be easily identified. For example, if one of the sequences had a low-scoring domain that aligns poorly to the others of the group, that domain may have been acquired in a separate evolutionary event and may not be homologous to all others. We would notice an isolated stretch of poorly alignable sequence, i.e. it should be coloured with a low score in a set of otherwise high-scoring segments. A gene may also have acquired significant lengths of N- or C-terminal extensions which may not be homologous (unless they are the result of an internal duplication).
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | *Review the [[All_Mbp1_T-COFFEE_scores| (score-colored T-Coffee alignment)]]. Based on this alignment, how do you feel about our initial assertion that these proteins should be considered orthologous? (Answer briefly, but with reference to specific evidence in the alignment. Note that this is not about the general level of conservation, but about whether significant segments do not appear related/alignable at all.) (1 mark)
| |
− | </div>
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;">
| |
− |
| |
− | ==(3) Mbp1 orthologues: analysis of full length MSAs==
| |
− | </div>
| |
− |
| |
− |
| |
− |
| |
− | What do we mean by a ''good'' versus a ''poor'' multiple sequence alignment?
| |
− |
| |
− | Let us first consider some of the features we have defined in the second assignment (and some structural features I have added). Here is an annotation of the yeast Mbp1 sequence. It was compiled with the following procedure.
| |
− |
| |
− | # Performed [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi '''CDD'''] search with yeast Mbp1 protein sequence. This retrieves alignments of Mbp1 with the APSES and the ANKYRIN domains. These are profile-based alignments and I would consider them more reliable than pairwise alignments.
| |
− | # Performed [http://smart.embl-heidelberg.de/ '''SMART'''] search with yeast Mbp1 protein sequence. This retrieved the APSES domain, annotated a number of low-complexity regions and a stretch of coiled coil.
| |
− | # Performed a [http://www.ebi.ac.uk/thornton-srv/databases/sas/ '''SAS'''] search with yeast Mbp1 protein sequence. This retrieved pairwise alignments with the structures 1MB1 (APSES) and chain D of 1IKN (ankyrin domains of I<sub>kappa</sub>b), together with their respective secondary structure annotations.
| |
− | # Copied GenPept sequence into Word-processor.
| |
− | # Transferred annotations of low complexity and coiled-coil regions from SMART.
| |
− | # Transferred annotations of APSES secondary structure from SAS (this is a ''direct'' annotation, since the structure 1MB1 has the same sequence as the corresponding parts of the Mbp1 protein). The central helix of the binding region is slightly distorted and SAS annotates a break in the helix; this was bridged with lowercase "h" in the annotation.
| |
− | # Ankyrin domain annotation was not as straightforward. While CDD, SMART and SAS all annotate the same general regions, they disagree in details of the domain boundaries and in the precise alignment. Used the profile-based CDD alignment of 1IKN. Transferred annotations of secondary structure from SAS output for 1IKN to sequence (this is a ''transferred'' annotation, the original annotation was for 1IKN and we assume that it applies to Mbp1 as well).
| |
− |
| |
− |
| |
− | MBP1_SACCE
| |
− | Annotations based on
| |
− | - CDD domain analysis,
| |
− | - SAS structure annotation and
| |
− | - literature data on binding region
| |
− |
| |
− | Keys:
| |
− |
| |
− | C Coiled coil regions predicted by Coils2 program
| |
− | x Low complexity region
| |
− | * Proposed binding region
| |
− | + positively charged residues, oriented for possible DNA binding interactions
| |
− | - negatively charged residues, oriented for possible DNA binding interactions
| |
− |
| |
− | E beta strand
| |
− | H alpha helix
| |
− | t beta turn
| |
− |
| |
− |
| |
− | 10 20 30 40 50 60
| |
− | MSNQIYSARY SGVDVYEFIH STGSIMKRKK DDWVNATHIL KAANFAKAKR TRILEKEVLK
| |
− | 1MB1 ----EEEEEt t-EEEEEEEE t-EEEEEEtt ---EEHHHHH HH----HHHH HHHHhhhHHH
| |
− | * *+**-+****
| |
− |
| |
− | 70 80 90 100 110 120
| |
− | ETHEKVQGGF GKYQGTWVPL NIAKQLAEKF SVYDQLKPLF DFTQTDGSAS PPPAPKHHHA
| |
− | 1MB1 ---EEE---- tt--EEEE-H HHHHHHHHH- --HHHHtt- xxx xxxxxxxxxx
| |
− | **+*+***** ****
| |
− |
| |
− | 130 140 150 160 170 180
| |
− | SKVDRKKAIR SASTSAIMET KRNNKKAEEN QFQSSKILGN PTAAPRKRGR PVGSTRGSRR
| |
− | x
| |
− |
| |
− |
| |
− | 190 200 210 220 230 240
| |
− | KLGVNLQRSQ SDMGFPRPAI PNSSISTTQL PSIRSTMGPQ SPTLGILEEE RHDSRQQQPQ
| |
− | xxxxx
| |
− |
| |
− |
| |
− | 250 260 270 280 290 300
| |
− | QNNSAQFKEI DLEDGLSSDV EPSQQLQQVF NQNTGFVPQQ QSSLIQTQQT ESMATSVSSS
| |
− | x xx xxxxxxxxxx xxxxxxxxxx
| |
− |
| |
− |
| |
− | 310 320 330 340 350 360
| |
− | PSLPTSPGDF ADSNPFEERF PGGGTSPIIS MIPRYPVTSR PQTSDINDKV NKYLSKLVDY
| |
− | xxxxxxx
| |
− |
| |
− | 370 380 390 400 410 420
| |
− | FISNEMKSNK SLPQVLLHPP PHSAPYIDAP IDPELHTAFH WACSMGNLPI AEALYEAGTS
| |
− | ANKYRIN -- t----HHHHH HH---HHHHH t-t--t-t--
| |
− |
| |
− |
| |
− | 430 440 450 460 470 480
| |
− | IRSTNSQGQT PLMRSSLFHN SYTRRTFPRI FQLLHETVFD IDSQSQTVIH HIVKRKSTTP
| |
− | ANKYRIN t----t---- HHHHHHHH-- -------HHH HHHHHH-ttH HH-----HHH HHHH--tH--
| |
− |
| |
− |
| |
− | 490 500 510 520 530 540
| |
− | SAVYYLDVVL SKIKDFSPQY RIELLLNTQD KNGDTALHIA SKNGDVVFFN TLVKMGALTT
| |
− | ANKYRIN HHHHHHHHH- ---------- -----t---- tt---HHHHH HH---HHHHH HHH--t-tt-
| |
− |
| |
− |
| |
− | 550 560 570 580 590 600
| |
− | ISNKEGLTAN EIMNQQYEQM MIQNGTNQHV NSSNTDLNIH VNTNNIETKN DVNSMVIMSP
| |
− | ANKYRIN ---t----HH HHHHHH--HH HHH-t--HHH -t----HHHH HHH--tHHHH HHHHHH---t
| |
− |
| |
− |
| |
− | 610 620 630 640 650 660
| |
− | VSPSDYITYP SQIATNISRN IPNVVNSMKQ MASIYNDLHE QHDNEIKSLQ KTLKSISKTK
| |
− | ANKYRIN ---tt----H HHHHHH---H HHHHHHH CCCCCCCC CCCCCCCCCC CCCCC
| |
− |
| |
− |
| |
− | 670 680 690 700 710 720
| |
− | IQVSLKTLEV LKESSKDENG EAQTNDDFEI LSRLQEQNTK KLRKRLIRYK RLIKQKLEYR
| |
− | x xxxxxxxxxx xxxxxxx
| |
− |
| |
− | 730 740 750 760 770 780
| |
− | QTVLLNKLIE DETQATTNNT VEKDNNTLER LELAQELTML QLQRKNKLSS LVKKFEDNAK
| |
− |
| |
− |
| |
− | 790 800 810 820 830
| |
− | IHKYRRIIRE GTEMNIEEVD SSLDVILQTL IANNNKNKGA EQIITISNAN SHA
| |
− |
| |
− |
| |
− | A '''good''' MSA comprises only columns of residues that play similar roles in the proteins' mechanism and/or that evolve in a comparable structural context. Since it is a result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs.
| |
− |
| |
− | A '''poor''' MSA has many errors in its columns in the sense that they contain residues that actually have different functions or structural roles, even though they may look similar to a scoring matrix. It also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities.
| |
− |
| |
− | In order to evaluate the MSAs for our proteins, we will analyze alignments relative to the features we have annotated above.
| |
− |
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− | ===(3.1) APSES domains (1 mark)===
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | The APSES domains in all of our Mbp1 orthologues are highly conserved and a program that would misalign such obvious similarity would not be worth the electrons it computes with.
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | *Consider the CLUSTAL, Muscle and T-Coffee alignments of the Mbp1 orthologues. Orient yourselves as to where the APSES domains are located. Briefly note whether the three alignments agree and whether the charged residues in the proposed binding region are wholly or partially conserved. (Refer to the specific residues labelled (+) or (-) in the Mbp1 annotation above). (1 mark) <!-- Sequence variation may indicate variations in binding site -->
| |
− | </div>
| |
− | <br>
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− |
| |
− | ===(3.2) Ankyrin domains (1 mark)===
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | The Ankyrin domains are more highly diverged, the boundaries are less well defined and not even CDD, SMART and SAS agree on the precise annotations. Nevertheless we would hope that a good alignment would recognize homology in that region and that ideally the required indels would be placed between the secondary structure elements, not in their middle.
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | *For one of the alignments of your choice, identify the helices in the Ankyrin repeat region of Mbp1. To facilitate this, I have colored the annotated ankyrin helices red in the yeast Mbp1 protein. Briefly state whether the indels are concentrated in regions that connect the helices or if they are more or less evenly distributed along the entire region of similarity. Conclude whether the assertion that ''indels should not be placed in elements of secondary structure'' has merit in this case, i.e. whether the indels that violate it have strong support from aligned sequence motifs. (1 mark)
| |
− | </div>
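The question of where the indels fall relative to the annotated helices can also be tallied with a short script. The following Python sketch assumes you have the aligned yeast Mbp1 row, a per-residue secondary structure string for the ungapped yeast sequence (using the H/E/t/- keys from the annotation above), and the other aligned rows. The function name and the toy data are hypothetical, and the sketch only counts gap characters in the other sequences; gaps in the reference row itself are not classified.

```python
from typing import List, Optional, Sequence, Tuple


def gap_placement(aligned_ref: str,
                  ref_structure: str,
                  aligned_others: Sequence[str],
                  elements: str = "HE") -> Tuple[int, int]:
    """Count gap characters inside and outside secondary structure elements.

    aligned_ref:    reference row of the alignment, with '-' for gaps.
    ref_structure:  one annotation character per ungapped reference residue.
    aligned_others: the other rows of the same alignment.
    Returns (gaps in columns annotated with one of `elements`, gaps elsewhere).
    """
    if len(ref_structure) != len(aligned_ref.replace("-", "")):
        raise ValueError("annotation length must match ungapped reference")

    # Project the per-residue annotation onto the alignment columns.
    column_state: List[Optional[str]] = []
    residue_index = 0
    for char in aligned_ref:
        if char == "-":
            column_state.append(None)  # column is an insertion vs. the reference
        else:
            column_state.append(ref_structure[residue_index])
            residue_index += 1

    inside = outside = 0
    for seq in aligned_others:
        if len(seq) != len(aligned_ref):
            raise ValueError("all rows must have the alignment length")
        for state, char in zip(column_state, seq):
            if char != "-":
                continue
            if state is not None and state in elements:
                inside += 1
            else:
                outside += 1
    return inside, outside


if __name__ == "__main__":
    # Toy example; sequences and annotation are made up.
    print(gap_placement("MKT-AW", "HH-EE", ["M-TQAW", "MKT--W"]))
```

A large first number relative to the second would suggest that many indels have been placed into helices or strands, which is exactly the situation the task above asks you to look for by eye.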
| |
− | <br>
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− |
| |
− | ===(3.3) Other features (2 marks)===
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | Aligning functional features like ''coiled coil domains'' or ''intrinsically disordered regions'' is even more difficult, since this is to a large degree a property of the amino acid composition, not as much the precise sequence. Thus we would expect alignment algorithms to have difficulty detecting the correspondence between sequences in such regions. I have marked the four low complexity regions of the yeast Mbp1 sequence with '''bold''' letters in all three alignments.
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | *Copy the Mbp1 sequence from your organism from the multi-FASTA files and run a [http://smart.embl-heidelberg.de/ SMART] sequence analysis: paste your sequence (or the Uniprot accession number), check only the checkbox for detecting '''intrinsic protein disorder''' and click "Sequence SMART". Locate the segments of '''low complexity''' for your sequence (they are in the lower part of the results page since they overlap with disordered segments). Find the corresponding positions for your sequence in '''one''' of the multiple sequence alignments. Briefly describe the situation: state whether these segments are found in the same general region, in the same detailed location, or perhaps even conserved in sequence, when you compare them to the ''Saccharomyces cerevisiae'' sequence. (1 mark)
| |
− |
| |
− | * Briefly discuss whether this observation should lead you to conclude that disorder in these proteins appears to be a conserved feature, i.e. one that is selected for in evolution. (1 mark)
| |
− | </div>
| |
− | <br>
| |
− |
| |
− |
| |
− | <!-- add at a later time similar analysis of coils via 2ZIP server - conserved feature? [http://2zip.molgen.mpg.de/index.html 2Zip server]
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | *Task
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | -->
| |
− |
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;">
| |
− |
| |
− | ==(4) APSES domain homologues: analysis of domain MSAs==
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | The procedure for obtaining the MSAs for all APSES domains is summarized at the top of the page for each alignment. Read it and make sure you understand what has been done. Three approaches were used:
| |
− |
| |
− |
| |
− | * An [[APSES_domains_PSI-BLAST| alignment based on the PSI-BLAST results]] as an example of a profile-based alignment.
| |
− |
| |
− | * A [[APSES_domains_CLUSTAL| CLUSTAL-W alignment]] as an example of our standard, plain vanilla progressive alignment procedure.
| |
− |
| |
− | * A consistency based, iterated [[APSES_domains_probcons| alignment using '''probcons''']], as an example of the more modern methods. probcons was used rather than T-Coffee since the EBI server restricts the number of sequences it will accept to 50.
| |
− |
| |
− | Comparing the three alignments, we note that they do not agree in detail over large stretches.
| |
− |
| |
− |
| |
− |
| |
− | ===(4.1) Manual improvement (1 mark)===
| |
− |
| |
− | Often errors or inconsistencies are easy to spot, and manually editing an MSA is not generally frowned upon, even though this is not a strictly objective procedure. The main goal is to make an alignment biologically more plausible. Usually this means minimizing the number of rare events that we need to postulate for the alignment: moving indels into more appropriate positions and/or emphasizing conservation of known functional motifs. Here are some examples of what one might aim for in manually editing an alignment:
| |
− |
| |
− | * Reduce number of indels
| |
− |
| |
− | From Probcons:
| |
− | 0447_DEBHA ILKTE-K<span style="color:#FF0000;">-</span>T<span style="color:#FF0000;">---</span>K--SVVK ILKTE----KTK---SVVK
| |
− | 9978_GIBZE MLGLN<span style="color:#FF0000;">-</span>PGLKEIT--HSIT MLGLNPGLKEIT---HSIT
| |
− | 1513_CANAL ILKTE-K<span style="color:#FF0000;">-</span>I<span style="color:#FF0000;">---</span>K--NVVK ILKTE----KIK---NVVK
| |
− | 6132_SCHPO ELDDI-I<span style="color:#FF0000;">-</span>ESGDY--ENVD ELDDI-IESGDY---ENVD
| |
− | 1244_ASPFU ----N<span style="color:#FF0000;">-</span>PGLREIC--HSIT -> ----NPGLREIC---HSIT
| |
− | 0925_USTMA LVKTC<span style="color:#FF0000;">-</span>PALDPHI--TKLK LVKTCPALDPHI---TKLK
| |
− | 2599_ASPTE VLDAN<span style="color:#FF0000;">-</span>PGLREIS--HSIT VLDANPGLREIS---HSIT
| |
− | 9773_DEBHA LLESTPKQYHQHI--KRIR LLESTPKQYHQHI--KRIR
| |
− | 0918_CANAL LLESTPKEYQQYI--KRIR LLESTPKEYQQYI--KRIR
| |
− |
| |
− | <small>Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably, however the total number of indels in this excerpt is reduced to 13 from the original 22</small>
| |
− |
| |
− | * Move indels to more plausible position
| |
− |
| |
− | From CLUSTAL:
| |
− | 4966_CANGL MKHEKVQ------GGYGRFQ---GTW MKHEKV<span style="color:#00AA00;">Q</span>------GGYGRFQ---GTW
| |
− | 1513_CANAL KIKNVVK------VGSMNLK---GVW KIKNVV<span style="color:#00AA00;">K</span>------VGSMNLK---GVW
| |
− | 6132_SCHPO VDSKHP<span style="color:#FF0000;">-</span>----------<span style="color:#FF0000;">Q</span>ID---GVW -> VDSKHP<span style="color:#00AA00;">Q</span>-----------ID---GVW
| |
− | 1244_ASPFU EICHSIT------GGALAAQ---GYW EICHSI<span style="color:#00AA00;">T</span>------GGALAAQ---GYW
| |
− |
| |
− | <small>The two characters marked in red were swapped. This does not change the number of indels but places the "Q" into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.</small>
| |
− |
| |
− | * Conserve motifs
| |
− |
| |
− | From CLUSTAL:
| |
− | 6166_SCHPO --DKR<span style="color:#FF0000;">V</span>A---<span style="color:#FF0000;">G</span>LWVPP --DKR<span style="color:#FF0000;">V</span>A--<span style="color:#FF0000;">G</span>-LWVPP
| |
− | XBP1_SACCE GGYIK<span style="color:#FF0000;">I</span>Q---<span style="color:#FF0000;">G</span>TWLPM GGYIK<span style="color:#FF0000;">I</span>Q--<span style="color:#FF0000;">G</span>-TWLPM
| |
− | 6355_ASPTE --DE<span style="color:#FF0000;">I</span>A<span style="color:#FF0000;">G</span>---NVWISP -> ---DE<span style="color:#FF0000;">I</span>A--<span style="color:#FF0000;">G</span>NVWISP
| |
− | 5262_KLULA GGYIK<span style="color:#FF0000;">I</span>Q---<span style="color:#FF0000;">G</span>TWLPY GGYIK<span style="color:#FF0000;">I</span>Q--<span style="color:#FF0000;">G</span>-TWLPY
| |
− |
| |
− | <small>The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.</small>
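Two simple checks are useful whenever an alignment is edited by hand: the residues of every sequence must remain unchanged (only gaps may move), and the number of indels should not increase. The following Python sketch applies both checks to the Probcons excerpt from the first example above, with the colour markup removed. Counting one indel per uninterrupted run of gap characters is an assumption about how the numbers in that caption were obtained, and the function names are hypothetical.

```python
import re
from typing import Sequence


def count_indels(rows: Sequence[str]) -> int:
    """Count indels, taking each uninterrupted run of '-' as one indel."""
    return sum(len(re.findall(r"-+", row)) for row in rows)


def same_residues(before: Sequence[str], after: Sequence[str]) -> bool:
    """True if editing moved only gaps, i.e. ungapped sequences are identical."""
    if len(before) != len(after):
        return False
    return all(b.replace("-", "") == a.replace("-", "")
               for b, a in zip(before, after))


# Probcons excerpt from the first example, before and after manual editing.
BEFORE = [
    "ILKTE-K-T---K--SVVK",
    "MLGLN-PGLKEIT--HSIT",
    "ILKTE-K-I---K--NVVK",
    "ELDDI-I-ESGDY--ENVD",
    "----N-PGLREIC--HSIT",
    "LVKTC-PALDPHI--TKLK",
    "VLDAN-PGLREIS--HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]
AFTER = [
    "ILKTE----KTK---SVVK",
    "MLGLNPGLKEIT---HSIT",
    "ILKTE----KIK---NVVK",
    "ELDDI-IESGDY---ENVD",
    "----NPGLREIC---HSIT",
    "LVKTCPALDPHI---TKLK",
    "VLDANPGLREIS---HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]

if __name__ == "__main__":
    print(same_residues(BEFORE, AFTER))  # True
    print(count_indels(BEFORE))          # 22
    print(count_indels(AFTER))           # 13
```

With this way of counting, the script reproduces the reduction from 22 to 13 indels stated for that excerpt.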
| |
− |
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | Please consider the following excerpts from the alignments:
| |
− |
| |
− | PSI-BLAST
| |
− | '''MBP1_SACCE SIMKRKKDDWVNATHILKA------A----------NFA--------KAKRTR-----'''
| |
− | 2599_ASPTE -IMWDYNIGLVRTTPLFRS------Q----------NYS--------KTTPAK-----
| |
− | 9773_DEBHA -IIWDYETGFVHLTGIWKA------S----------INDEVNTHRNLKADIVK-----
| |
− | 0918_CANAL -VIWDYETGWVHLTGIWKA------SLTIDGSNVSPSHL--------KADIVK-----
| |
− | 9901_DEBHA -ILRRVQDSYINISQLF--------SILLKIG----HLS--------EAQLTN-----
| |
− | 7766_ASPNI -LMRRSKDGYVSATGMFKI------A-----------FP--------WAKLEEERSER
| |
− | 5459_GIBZE -LMRRSYDGFVSATGMFKASFPYAEA----------SDE--------DAERKY-----
| |
− | 2267_NEUCR -LMRRSQDGYISATGMFKA------TFPYASQ----EEE--------EAERKY-----
| |
− | 3510_ASPFU -LMRRSKDGYVSATGMFKI------A-----------FP--------WAK--------
| |
− | 3762_MAGGR -LMRRSSDGYVSATGMFKATFPYADA----------EDE--------EAERNY-----
| |
− | 3412_CANAL -VLRRVQDSFVNVTQLFQI------LIKLE------VLP--------TSQVDN-----
| |
− |
| |
− |
| |
− | CLUSTAL
| |
− | '''MBP1_SACCE SIMKRKKDDWVNATHILKAAN----------FAKAKRTRILE----------KEVLKETHE'''
| |
− | 2599_ASPTE -IMWDYNIGLVRTTPLFRSQ----------NYSKTTPAKVLDAN--------P-GLREISH
| |
− | 9773_DEBHA -IIWDYETGFVHLTGIWKASIN-DEVNTHR-NLKADIVKLLEST--------PKQYHQHIK
| |
− | 0918_CANAL -VIWDYETGWVHLTGIWKASLTIDGSNVSPSHLKADIVKLLEST--------PKEYQQYIK
| |
− | 9901_DEBHA -ILRRVQDSYINISQLFSILL----------KIGHLSEAQLTNFLNNEILTNTQYLSSGGS
| |
− | 7766_ASPNI -LMRRSKDGYVSATGMFKIAF----------PWAKLEEERSE----------REYLKTRPE
| |
− | 5459_GIBZE -LMRRSYDGFVSATGMFKASF----------PYAEASDEDAE----------RKYIKSLPT
| |
− | 2267_NEUCR -LMRRSQDGYISATGMFKATF----------PYASQEEEEAE----------RKYIKSIPT
| |
− | 3510_ASPFU -LMRRSKDGYVSATGMFKIAF----------PWAKLEEEKAE----------REYLKTREG
| |
− | 3762_MAGGR -LMRRSSDGYVSATGMFKATF----------PYADAEDEEAE----------RNYIKSLPA
| |
− | 3412_CANAL -VLRRVQDSFVNVTQLFQILI----------KLEVLPTSQVDNYFDNEILSNLKYFGSSSN
| |
− |
| |
− |
| |
− | Probcons
| |
− | '''MBP1_SACCE SIMKRKKDDWVNATHILKAANF----AKA----------KRTRILEKE-V-LKETH--E'''
| |
− | 2599_ASPTE -IMWDYNIGLVRTTPLFRSQNY----SKT----------TPAKVLDAN-PGLREIS--H
| |
− | 9773_DEBHA -IIWDYETGFVHLTGIWKASIN----DEV--NTHRNLKADIVKLLESTPKQYHQHI--K
| |
− | 0918_CANAL -VIWDYETGWVHLTGIWKASLT----IDGSNVSPSHLKADIVKLLESTPKEYQQYI--K
| |
− | 9901_DEBHA -ILRRVQDSYINISQLFSILLKIGHLSEA----------QLTNFLNNE-I-LTNTQYLS
| |
− | 7766_ASPNI -LMRRSKDGYVSATGMFKIAFP----WAK----------LEEERSERE-Y-LK-----T
| |
− | 5459_GIBZE -LMRRSYDGFVSATGMFKASFP----YAE----------ASDEDAERK-Y-IK-----S
| |
− | 2267_NEUCR -LMRRSQDGYISATGMFKATFP----YAS----------QEEEEAERK-Y-IK-----S
| |
− | 3510_ASPFU -LMRRSKDGYVSATGMFKIAFP----WAK----------LEEEKAERE-Y-LK-----T
| |
− | 3762_MAGGR -LMRRSSDGYVSATGMFKATFP----YAD----------AEDEEAERN-Y-IK-----S
| |
− | 3412_CANAL -VLRRVQDSFVNVTQLFQILIKLEVLPTS----------QVDNYFDNE-I-LSNLKYFG
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | *In any '''one''' of these excerpts, find at least one example where the alignment could be manually improved. Show the original version, the improved version and highlight the changes in red. (1 mark)
| |
| </div> | | </div> |
− |
| |
− |
| |
− | The fact that such improvements usually are not hard to find teaches us to be cautious with the results. Not in all cases will lack of conservation in a particular column mean that a residue has changed in evolution; sometimes this is simply a consequence of misalignment. MSAs can only take sequence information into account, while we may have additional information on structural and functional conservation patterns. This may include secondary structure (gaps should be moved out of regions of secondary structure, where possible), structurally required residues (expected to be conserved across all structurally similar sequences) and functionally conserved residues (expected to have a high likelihood of being conserved within groups of orthologues, but varying between orthologues and paralogues).
| |
− |
| |
− | In terms of structural conservation, we expect motif or consistency based alignments to be more accurate since they align to the "big picture". In terms of functional variation we expect progressive alignments to be more accurate, since they align to local similarities.
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
| |
− |
| |
− | ===(4.2) Residue conservation (1 mark)===
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | Let us finally interpret the alignments in terms of their biological relevance. I have transferred the ligand-binding annotations for the yeast Mbp1 APSES domain into the multiple sequence alignments by color coding the charged residues that putatively could bind DNA <span style="color:#FF0000;">'''red'''</span> (-) and <span style="color:#0066FF;">'''blue'''</span> (+). Thus these residues label columns in which we expect ''functional'' conservation. I have labeled two residues that are associated with important structural features <span style="color:#00AA33;">'''green'''</span>. These two residues are G75, a mandatory glycine in the third position of a particular type of beta-turn, and W77, a key component of the domain's hydrophobic core. Thus these two residues label columns in which we expect ''structural'' conservation. Let's assume that all the APSES domains fold into similar structures and that they all bind DNA, although not necessarily the same cognate sequence. This should allow you to answer the following questions:
| |
− |
| |
− |
| |
− | <br><div style="padding: 5px; background: #EEEEEE;">
| |
− | Consider any '''one''' of the three APSES domain MSAs.
| |
− |
| |
− | *Are the patterns of sequence variation for functionally conserved residues compatible with different binding specificities for different APSES domains? State briefly (but with reference to specific residues) what you would expect and what you find.
| |
− |
| |
− | *Are the patterns of sequence variation for structurally conserved residues compatible with a common fold of different APSES domains? State briefly (but with reference to specific residues) what you would expect and what you find. (1 mark)
| |
− | </div>
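To look up a labelled residue such as G75 or W77 in an alignment, the residue number in the yeast sequence first has to be translated into an alignment column, because gaps shift the positions. The following Python sketch does this and then tallies what the other sequences have in that column. The function names are hypothetical, and the three aligned rows are a made-up toy example built around residues 71 to 77 (GKYQGTW) of yeast Mbp1; they are not taken from the linked alignments.

```python
from collections import Counter
from typing import Dict


def column_of_residue(aligned_ref: str, residue_number: int,
                      first_residue: int = 1) -> int:
    """Return the 0-based alignment column of a reference residue.

    first_residue is the sequence number of the first residue in aligned_ref,
    which matters when the alignment covers only a domain.
    """
    current = first_residue - 1
    for column, char in enumerate(aligned_ref):
        if char != "-":
            current += 1
            if current == residue_number:
                return column
    raise ValueError(f"residue {residue_number} is not in the aligned reference")


def column_counts(alignment: Dict[str, str], column: int) -> Counter:
    """Tally the characters (residues or gaps) found in one alignment column."""
    return Counter(seq[column] for seq in alignment.values())


if __name__ == "__main__":
    # Toy alignment: the reference row starts at residue 71 of yeast Mbp1;
    # the gap positions and the two other rows are hypothetical.
    alignment = {
        "MBP1_SACCE": "GK-YQG-TW",
        "TOY_0001": "GR-FQG-TW",
        "TOY_0002": "SMNLKG-VW",
    }
    reference = alignment["MBP1_SACCE"]
    for number in (75, 77):
        column = column_of_residue(reference, number, first_residue=71)
        print(number, column, dict(column_counts(alignment, column)))
```

A column that contains only one kind of residue is what we would expect for the structurally required positions; a column with several different charged or uncharged residues is compatible with variation in function.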
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #BDC3DC; border:solid 1px #AAAAAA;">
| |
− | ==(5) Summary of Resources==
| |
− | </div>
| |
− | <br>
| |
− |
| |
− | ;Links
| |
− | :* [[Organism_list_2006|Assigned Organisms]]
| |
− | :* [http://www.ncbi.nlm.nih.gov/blast '''BLAST''']
| |
− | :* [http://www.pir.uniprot.org/search/idmapping.shtml '''Uniprot ID mapping''' service]
| |
− | :* [http://www.ncbi.nlm.nih.gov/sutils/blink.cgi?pid=68465419 A '''BLink''' example]
| |
− | :* [http://www.ebi.ac.uk/clustalw/ EBI '''CLUSTAL-W''' server]
| |
− | :* [http://www.ebi.ac.uk/muscle/ EBI '''MUSCLE''' server]
| |
− | :* [http://www.ebi.ac.uk/t-coffee/ EBI '''T-Coffee''' server]
| |
− | :* [http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi '''CDD''']
| |
− | :* [http://smart.embl-heidelberg.de/ '''SMART''']
| |
− | :* [http://www.ebi.ac.uk/thornton-srv/databases/sas/ '''SAS''']
| |
− |
| |
− | ;Sequences
| |
− | :* [[All_Mbp1_proteins|'''All Mbp1 proteins''']]
| |
− | :* [[All_APSES_domains|'''All APSES domains''']]
| |
− |
| |
− | ;Alignments
| |
− | :'''Mbp1 proteins:'''
| |
− | :* [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
| |
− | :* [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
| |
− | :* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
| |
− | :* [[All_Mbp1_T-COFFEE_scores|Mbp1 proteins '''T-Coffee''' aligned (coloured according to scores)]]
| |
− |
| |
− | :'''APSES domains:'''
| |
− | :* [[APSES_domains_PSI-BLAST|All APSES domains - alignment based on '''PSI-BLAST''' results]]
| |
− | :* [[APSES_domains_CLUSTAL|All APSES domains - '''CLUSTAL-W''' alignment]]
| |
− | :* [[APSES_domains_probcons|All APSES domains - '''probcons''' alignment]]
| |
− |
| |
− |
| |
− |
| |
− |
| |
− |
| |
− | <div style="padding: 5px; background: #D3D8E8; border:solid 1px #AAAAAA;">
| |
− | [End of assignment]
| |
− | </div>
| |
− |
| |
− | If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2006@googlegroups.com Course Mailing List]
| |