Difference between revisions of "BIO Assignment 5 2011"

From "A B C"
Jump to navigation Jump to search
Line 13: Line 13:
  
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
Assignment 5 - Homology modeling
+
Assignment 4 - Phylogenetic Analysis
 
</div>
 
</div>
  
 
Please note: This assignment is currently inactive. Unannounced changes may be made at any time.
 
Please note: This assignment is currently inactive. Unannounced changes may be made at any time.
 +
 
&nbsp;
 
&nbsp;
  
Line 23: Line 24:
 
&nbsp;-->
 
&nbsp;-->
  
<div style="padding: 15px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
 
;How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
+
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
::''<small>Max Perutz (on his first glimpse of the Hemoglobin structure)</small>''
+
Introduction
 +
&nbsp;
 +
 
 +
;Nothing in Biology makes sense except in the light of evolution.
 +
:''Theodosius Dobzhansky''
 
</div>
 
</div>
&nbsp;
 
&nbsp;
 
  
Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and looked at how these domains have evolved over time. We have seen that this is an ancient family, that had several members already in the cenancestor of all fungi, an organism that lived in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html vendian period] of the proterozoic era of precambrian times, more than 600,000,000 years ago.
+
... but does evolution make sense in the light of biology?
 +
 
 +
As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet - looking at orthologues - this is not always a clear one-to-one mapping of related genes to each other. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of ''function'' - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, this may be warranted. But what if that gene has duplicated in one of them, and the two paralogues now perform different, related functions in one organism? In order to be able to even ask such questions, we need to understand how we can make the evolutionary history of gene families explicit. This is the domain of '''phylogenetic analysis'''. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And how did the species benefit from this event?
  
In order to understand how particular residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to consider an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. In particular, it would be interesting to correlate the conservation patterns we have observed in the MSAs with specific DNA binding interactions. Unfortunately, the 1MB1 structure does not have DNA bound and the evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to define the details of how a DNA double helix might be bound. These details would require the structure of a complex that contains protein as well as DNA. No such complex of an APSES domain has yet been crystallized.
+
We will develop some of this kind of analysis in this assignment. In the previous assignment you have established which genes are the reciprocally most closely related orthologues to Mbp1 in yeast. In this assignment, we will analyse their evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.
  
''In this assignment you will construct a molecular model of the Mbp1 orthologue in your assigned organism, identify similar structures of distantly related domains for which protein-DNA complexes are known, define whether the available evidence allows you to distinguish between different modes of ligand binding, and assemble a hypothetical complex structure.''
+
A number of good tools for phylogenetic analysis exist; ''general purpose packages'' include the (free) PHYLIP package and the (commercial) PAUP package. ''Specialized tools'' for tree-building include Treepuzzle or Mr. Bayes. This assignment is conctructed around programs that are availble in PHYLIP, however you are welcome to use other tools that fulfil a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not and usually it is hard to tell, which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge, independent of the algorithm, to be more reliable than those that depend strongly on a particular algorithm or details of input data.
  
For the following, please remember the following terminology:
+
But regarding algorithm and rersources: we will take two shortcuts in this assignment (and both shortcuts are things you should not do ''in real life''):
  
;Target
+
One: we will use an '''efficient''' tree-building algorithm, not the best-available one. This is an algorithm which is available on the Web, without the need for you to install software on your own machine. In ''real life'' you would of course use the most accurate algortihm you can lay your hands on, regardless of the resources this requires, since it makes no sense to waste your time on a careful analysis of inaccurate trees. Your supervisor would want it so as well. And if not she, the reviewers of your manuscript.
:The protein that you are planning to model.
 
;Template
 
:The protein whose structure you are using as a guide to build the model.
 
;Model
 
:The structure that results from the modeling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
 
&nbsp;
 
  
A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might require.
+
Two: we will assume the tree the algorithm constructs is ''correct''. In ''real life'' you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with partial data and see which branches and groupings are robust and which depend on the details of the data. But we should still acknowledge that bifurcations that are very close to each other have not been" resolved". Any conscientious reviewer would flag such leniency and send your results back to you for a bootstrapping exercise at the computer. In phylogenetic analysis, not all lines that the program draws are equally trustworthy. Dont take the trees as a given fact just because a program suggests this. Look at the evidence, use your reasoning, and analyse them critically.
  
 +
In case you want to review concept of trees, clades, LCAs OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis to the resource section at the bottom of this page.
  
 +
&nbsp;
  
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
Line 54: Line 55:
 
</div>
 
</div>
  
Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you have a tendency to guess, rather than confirm possibly important information.
+
Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.
  
 
Prepare a Microsoft Word document with a title page that contains:
 
Prepare a Microsoft Word document with a title page that contains:
Line 70: Line 71:
  
 
Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
 
Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
<code>A5_family name.given name.doc</code>
+
<code>A3_family name.given name.doc</code>
<small>(for example my fifth assignment would be named: A5_steipe.boris.doc - and don't switch the order of your given name and familyname please!)</small>
+
<small>(for example my fourth assignment would be named: A4_steipe.boris.doc - and don't switch the order of your given name and familyname please!)</small>
  
Finally e-mail the document to [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] before the due date.
+
Finally e-mail the document to [boris.steipe@utoronto.ca] before the due date.
  
 
Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
 
Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
 
We do not have the resources to correct formatting errors or to convert assignments into different formats. <!-- Becoming familiar and proficient with technologies is part of the course objectives and that includes e-mail attachments. I will also not accept files that are significantly in excess of 1.5 MB. This will be enforced in this assignment, as as the assignment includes a number of image files and as a proficient user of your computer you should be aware of an image's size, its resolution, its displayed size and its file format, all of which influence the displayed image and the size of its file.--> Keep your image-file sizes manageable!
 
 
:<small>Image sizes are measured in pixels - 600px across is sufficient for the assignment, resolutions are measured in dpi (dots per imperial inch) - 72 dpi is the standard resolution for images that are viewed on a monitor; the displayed size may be scaled (in %) by an application program: stereo images should be presented so that equivalent points are approximately 6 cm apart; images can be stored uncompressed as .tiff or.bmp, or compressed as .gif or .jpg. .gif is preferred for images with large, monochrome areas and sharp, high-contrast edges; '''.jpg is preferred for images with shades and halftones such as the structure views required here;''' .tiff is preferred to archive master copies of images in a lossless fashion, use LZW compression for .tiff files if your system/application supports it; .bmp is not preferred for anything, its used because its easier to code.</small>
 
 
<!--Make it a habit to focus on information, pure and simple, and avoid HTML and RTF formatting and the like, where it does not contribute significantly to emphasize actual information. -->Information that you present (such as added colouring, formatting etc.) should be meaningful. If you have technical difficulties, post your questions to the list and/or contact me.
 
 
All required stereo views are to be presented as divergent stereo frames (left eye's view in the left frame). <!--Marks will be deducted if they are not.--> Remember to list the Rasmol command input you have used to generate the images.
 
  
 
With the number of students in the course, we have to economize on processing the assignments. '''Thus we will not accept assignments that are not prepared as described above.''' If you have technical difficulties, contact me.
 
With the number of students in the course, we have to economize on processing the assignments. '''Thus we will not accept assignments that are not prepared as described above.''' If you have technical difficulties, contact me.
Line 93: Line 86:
 
</div>
 
</div>
  
Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted and an additional mark for every full twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you '''must''' arrange this beforehand.
+
Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted and an additional mark for every full twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you '''must''' arrange this beforehand.
  
 
Marks are noted below in the section headings for of the tasks. A total of 10 marks will be awarded, if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularily interesting findings, or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will  
 
Marks are noted below in the section headings for of the tasks. A total of 10 marks will be awarded, if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularily interesting findings, or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will  
Line 103: Line 96:
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
==(1) Preparation==
+
==(1) Preparations==
 
</div>
 
</div>
 
+
&nbsp;
 
+
&nbsp;
<!--
 
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===Choosing a template (1 marks)===
+
===(1.1) Preparing Input Files (2 marks)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Often more than one related structure can be found in the PDB. We have discussed principles of selecting template structures in the lecture. Interestingly the PDB itself cannot be searched for the contents of its holdings, by structural- or sequence similarity, but there is always BLAST since the NCBI conveniently allows you to search against all sequences in PDB files.
 
 
*Use BLAST to identify all PDB files that contain APSES domains that are clearly homologuous to your target. (Document that you have searched in the correct subsection of the Genbank holdings). For the hits you find, consider how these structures differ and which features would make each more or less suitable for your task. Comment briefly on what options you have, select one template and note why you have decided to use this particular structure as a template. Include aspects of sequence similarity, length of the sequence, presence or absence of ligands and their potential effect on the structure, and experimental method and quality in your reasoning.
 
 
*Note which sequence is contained in the coordinate section of the PDB file; note if and how this implied sequence differs from the sequences ...
 
 
:*listed in the seqres records;
 
:*given in the FASTA sequence for the template that the PDB provides;
 
:*and that stored by the NCBI.
 
 
* In a table, establish the correspondence of the coordinate sequence numbering (defined by the residue numbers/insertion codes in the atom records) with your target sequence numbering.
 
  
* Retrieve the most suitable template structure coordinate file from the PDB.
+
=====Introduction: Task=====
 +
For this assignment, we start from the multiple sequence alignments we have constructed in the last assignment. We will edit alignments to make them suitable for phylogenetic analysis. We will construct a tree and we will discuss tree analysis in the end.
  
-->
+
=====Introduction: Principle=====
 +
In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all sequences have to be edited to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise an alignment but to analyse evolutionary relationships, given the alignment.
  
 +
The result of the tree estimation is a decision about likely relationships. Fundamentally all the programs do is to decide which sequences had common ancestors. '''Distance based''' phylogeny programs have a way to convert sequence comparisons into evolutionary distances (applying a model of evolution such as a mutation data matrix, calculating one number for each pair of sequences and using that to estimate a tree by grouping together the most highly related sequences (e.g. with an NJ, Neigbor Joining, algorithm). Alternatively one can find trees that are most compatible with the observed sequences and the specific model of evolutionary change through point-mutations (either  or by minimizing the number of mutation events over the tree ('''Parsimony''') or by finding the tree for which the observed sequences would be the most likely ('''ML''', Maximum Likelihood)). Clearly, in order for any of this to work, one must not include fragments of sequence which have evolved under a totally different evolutionary model, such as domain fusion, or insertion/deletion of residues. The goal is not to be as comprehensive and complete as possible but to input the columns of aligned residues that will best represent the phylogenetic relationships between the sequences.
  
&nbsp;
+
=====Introduction: Problems=====
&nbsp;
+
Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an '''alignment''' program as well as the distance score of a '''phylogeny''' program are not calculated for an ordered ''sequence'', but for a ''sum of independent values'', one for each aligned columns of characters. The order of the columns does not change the score. Hoever for the alignment, this is no longer true, once gaps are introduced: we treat an insertion gap-character differently from an elongation character. Most '''alignment''' programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigourously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most '''phylogeny''' programs, (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character like a residue with the one letter code "-". This underestimates the distance between gapped sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. When there are unambiguous gaps, one might be tempted to fudge the alignment by inserting matching characters into sequences that are ungapped (e.g. five "A"s each into the ungapped sequences and five "-" each into the gapped sequences), however, I would caution against this approach since it possibly introduces even more non-obvious implicit assumptions and potential for error. It is therefore common to remove most, if not all gaps.
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
=====Introduction: Practice=====
=== The input alignment (1 marks)===
+
In practice, follow the fundamental principle that '''all characters in a column should be related by homology'''. This implies the following rules of thumb:
</div>
 
&nbsp;<br>
 
  
The sequence alignment between target and template is the single most important factor that determines the quality of your model.
+
:*Remove all stretches of residues in which the alignment appears ambiguous.
 +
:*Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous.
 +
:*Remove all but approximately one column from gapped regions, and all residues N- and C- terminal of the gap in which the alignment appears questionable. ( I would keep one gapped column as a placeholder for a rare and very distinct evolutionary event, rather than simply deleting them all, some researchers remove all gaps).
 +
:*Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
 +
:*If your sequences are too long, you may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input.
  
No homology modeling process will repair an incorrect alignment and it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment, rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient, rather than the more sophisticated methods and more informed procedures we have discussed. Detailed analysis of fallacious models rarely leads to good results.
+
:<small>(A '''very''' useful trick with Microsoft Word is that you can select blocks of text and entire columns in the document with your mouse: hold the "ALT" key depressed while you click and drag your mouse to select. This will greatly facilitate the preparation of sequences. You can treat that selection as any other selected text, color characters, or delete them. Importantly, you can also cut and paste entire columns! Of course, this will only work as expected if you use a fixed-width font such as Courier. )</small>
  
The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Typically such an alignment will also include additional optimization steps to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
+
The preparation of the input file of aligned residues, used by the PHYLIP package is straightforward in principle; just carefully follow the instructions in PHYLIP's well written documentation. If you plan to use an outgroup for your tree, it is a good idea to move that to the first line of your alignment, since this is where PHYLIP will look for it by default.
  
Here is an excerpt from the T-coffee aligned Mbp1 sequences: it contains all the residues of the yeast sequence that are found in the 1MB1 crystal structure - the '''template''' sequence for our homology model - and it has been edited to remove the N-terminal gaps in the sequence. Thus the N-terminus is 21 amino acids longer than the definition of the APSES domain in CDD (which starts with <code>SIMKR...</code>), the C- terminus is slightly shorter.  
+
Some notes on how to avoid common editing troubles. Copy the sequences frrom the link provided below. Paste them into a document, using the Word "Edit -> Paste special -> Unformatted text". Set the page-setup to "landscape", the font-size to something small, then you can put every sequence into one line. You can replace all paragraph marks ("^p") with (nothing) to remove them, then replace the FASTA header line character ">" with paragraphs ("^p") to separate them by line again. Take special note that your files must not include tab characters. You can use Word to globally replace all tabs (specified as "^t") with a blank, to make sure. Spaces count, so display your alignment in a fixed-width font, such as Courier ("Courier New" on Windows), not a proportional-width font such as Times, Arial, or Helvetica, and ensure all characters in your alignments align as they should. As always, make sure you save your input files as "Text Only".  
  
Since the sequences are very similar between each other, there is no ambiguity in the alignment and the construction of a homology model should be straightforward. Normally one would spend considerable some effort at this stage to consider which parts of the target sequence and the template sequence appear to correctly aligned and to edit the alignment manually. In our case, evolutionary pressure was so strong that essentially all have evolved without a single indel in their sequence.
+
<small>
 +
:A note if you are working on a '''Mac''' and saving input on disk, to run with a locally installe PHYLIP version: here MS Word will play one of its usual [http://en.wikipedia.org/wiki/Shenanigan shenanigans] on you since it writes text files with the old-style OS 9 Carriage Return characters <code>(\r; ASCII 13; hex 0D; CR)</code>. Just by looking at the file, this is quite invisible but such "Carriage returns" are not going to be recognized by PHYLIP and most other self-respecting UNIX based programs. It may not make a difference when you paste your sequences to a Web server; but if you compute things locally it will appear to the program as though everything were in one line). And this can (and did) lead to head-banging rounds of frustration. You need to replace them with '''Linefeed''' resp. '''Newline''' characters <code>(\n; ASCII 10; hex 0A; LF)</code> and you can't even do that within Word(!). Open a UNIX terminal window and navigate to the directory where your files reside. Then type:
  
I have added to the alignment the APSES domain of [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=116197493&dopt=GenPept XP_001224558], the ''Chaetomium globosum'' Mbp1 orthologue (MBP1_CHAGL). This will serve as the reference and fallback sequence.
+
:'''tr "\r" "\n" &lt; infile    &gt; outfile'''
  
1MB1            NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
+
:... where outfile is different from infile (careful: if a file by the name of outfile already exists, '''tr''' will cheerfully overwrite it.) Alternatively you could type the following perl one-liner :
MBP1_CANGL      NQIYSAKYSGVDVYEFIHPTG---SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEV
 
MBP1_EREGO      TQIYSAKYSGVEVYEFLHPTG---SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEV
 
MBP1_KLULA      NQIYSAKYSGVDVYEFIHPTG---SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEV
 
MBP1_CANAL      SQIYSATYSNVPAFEFVTSEG---PIMRRKKDSWINATHILKIAKFPKAKRTRILEKDV
 
MBP1_DEBHA      TQIYSATYSNVPVFEFVTLEG---PIMRRKLDSWINATHILKIAKFPKAKRTRILEKDV
 
MBP1_YARLI      MSIYKATYSGVPVYEFQCKNV---AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEV
 
MBP1_SCHPO      SAVHVAVYSGVEVYECFIKGV---SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQV
 
MBP1_USTMA      KTIFKATYSGVPVYECIINNV---AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREI
 
MBP1_ASPNI      SNVYSATYSSVPVYEFKIGTD---SVMRRRSDDWINATHILKVAGFDKPARTRILEREV
 
MBP1_ASPTE      SKIYSATYSSVPVYEFKIEGD---SVMRRRADDWINATHILKVAGFDKPARTRILEREV
 
MBP1_CRYNE      PKVYASVYSGVPVFEAMIRGI---SVMRRASDSWVNATQILKVAGVHKSARTKILEKEV
 
MBP1_GIBZE      G-IYSASYSGVDVYEMEVNNI---AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEI
 
MBP1_NEUCR      IYSLQATYSGVGVYEMEVNNV---AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEI
 
MBP1_MAGGR      P-IYTAVYSNVEVYEFEVNGV---AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEI
 
MBP1_ASPFU      PQIYKAVYSNVSVYEMEVNGV---AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEI
 
MBP1_CHAGL      AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV
 
 
1MB1            LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
 
MBP1_CANGL      LKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLF
 
MBP1_EREGO      IKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLF
 
MBP1_KLULA      ITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLF
 
MBP1_CANAL      QTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIF
 
MBP1_DEBHA      QTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIF
 
MBP1_YARLI      QKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIF
 
MBP1_SCHPO      QIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPIL
 
MBP1_USTMA      QKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPIT
 
MBP1_ASPNI      QKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIF
 
MBP1_ASPTE      QKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIF
 
MBP1_CRYNE      LNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVF
 
MBP1_GIBZE      QTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLL
 
MBP1_NEUCR      QIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
 
MBP1_MAGGR      QTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLL
 
MBP1_ASPFU      AAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLL
 
MBP1_CHAGL      QKDVHEKIQGGYGKYQGTWIPLEQGRALAQRNNIYDRLRPIF
 
  
&nbsp;<br>
+
:'''perl -e 'while(&lt;&gt;){tr/\r/\n/;print}'  &lt; infile    &gt; outfile'''
 +
</small>
  
It should be obvious to you by now how you can copy a string of amino acids from such an alignment and create a FASTA file. However we need to take a little detour: this detour brings us to the question of sequence numbers.
 
  
It is not straightforward at all how to number sequence in such a project. The "natural" way would be to start a sequential numbering from the start-codon of the full length protein and go sequentially from there. However imagine what would happen if a curator would discover that one of the splice-sites for a gene has been missed in automatic annotation. All of a sudden a corrected sequence would have a different length than the one that may have been used for earlier studies. Unfortunatlety, there is no mechanism (''wouldn't it be nice!'') that automatically goes back through the literature and your lab-journal and updates the revised sequence numbering... But there are other possible complications, regarding sequence numbers. The first residue of the CDD-APSES domain is not Residue 1 of the Mbp1 protein. The first residue of the 1MB1 FASTA file ''is'' the first residue of Mbp1 protein, but the last five residues are an artifiical His tag. Is H125 of 1MB1 the equivalent residue to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, whereas the SEQRES records start with MET ... and so on. The take-home message is that a sequence number is nothing absolute, but something that makes sense only in a particular context. To emphasize this, we will write a FASTA header for our '''target''' sequence that lists the residues of the source sequence it correspond to. In terms of actual sequence numbering, we will adopt the numbering of the 1MB1 protein throughout to be able to consistently label particular amino acids.
+
In your assignment submission, clearly identify the source sequences you are using, as well as the alignment method you have used. Paste your unaltered source alignment into your document, clearly highlight or otherwise color the columns that you have selected, annotate why you have selected them and paste your result as well. Here is an example of what this might look like:
  
Access the sequence of "your" organism's Mbp1 Orthologue at UniProt. (You can use the links I have provided in the table below).
 
  
 +
[[Image:EditingGuide.jpg|frame|none|(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. '''a''': raw alignment (CLUSTAL format); '''b''': sequences assembled into single lines; '''c''': columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; '''d''': input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the [http://evolution.genetics.washington.edu/phylip/doc/sequence.html PHYLIP sequence format guide].]]
  
<table style="border-left:1px solid #AAAAAA; border-bottom:1px solid #AAAAAA;" cellpadding="10" cellspacing="0">
+
=====Introduction: Web Service and data=====
<tr style="background: #BDC3DC;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b><i>Organism</i></b></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>Uniprot Accession</b></td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
You have two choices for completing the assignment: either to peruse one of the on-line Webservices that generously provide a compute-intensive task such as PHYLIP, or to download and install the program at home. If you choose the former, one of your options is the [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html '''PHYLIP''' service at the Institut Pasteur]. I have tried it, and it works - however not entirely without problems. Uninformative errors will occur when your input is too large for the system's memory (like: "sequences not aligned" ... "out of memories" and such) but what is worse, after submitting a number of jobs, the system locked me out, asking me to what an unspecified time until results would be sent by e-mail (three days later, that hasn't happened). Regrettably, this is not documented. If you can live with that, the integration of their services in a logical sequence of steps is good and some of their services are a bit more advanced than plain out of the box PHYLIP. If you rateher decide to install PHYLIP, good for you. That is easy to do, well documented, there are much less limitations on memory - but if you don't read and understand the instructions carefully, you may be in for a spell of frustration.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus fumigatus</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4WGN2_ASPFU Q4WGN2]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
Either way, I have posted typical input files and result files here, to allow you to bail out in case technical problems become overwhelming. If you use the data posted here instead of your own, you '''must''' document that fact and explain what you have tried, and why that has failed. If you fail to do that, we will deduct marks - the posted data is a fallback, not a shortcut.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus nidulans</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5B8H6_EMENI Q5B8H6]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
In this assignment, we will use a simple distance based tree construction method. This represent a reasonable compromise between accurracy and speed, especially when applied to moderately dissimilar sequences. In genereal, distance methods include '''two''' steps: (1) calculate a pairwise-distance matrix between sequences, (2) construct a tree, based on the matrix. Thus all the information in the alignment bewtween two pairs of sequences is collapsed into a single number: their pairwise distance. Alternative approaches, parsimony as well as ML based algorithms take individual columns into account. Parsimony based methods construct inaccurate trees when they can't make good estimates for the required number of sequence changes, if the sequences become too dissimilar. ML based methods are considered the most accurate for dissimilar sequences, however they are also very compute intensive and the full-length APSES domain alignment for 74 species run about half a day on my workstation. Thus we will use distance based methods here, specifically the UPGMA variant of the neighbor joining algorithm.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus terreus</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q0CQJ5_ASPTE Q0CQJ5]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
Prepare an input file that is representative of the whole protein, and one that represents only the APSES domains.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida albicans</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5ANP5_CANAL Q5ANP5]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida glabrata</i></td>
+
*Access one of the MSAs for '''Mbp1 proteins''', linked from the resources section at the bottom of the page. Choose an MSA that you have determined in your third assignment to be "reliable" and (briefly) justify your choice. Prepare a PHYLIP formatted input file from this MSA, restricting the number of characters to no more than 60. Follow the considerations dicussed above. In particular you should choose some residues from each of the three aligned regions (The APSES domains, the Ankyrin domains and the C-terminal aligned region), to represent the diversity between these proteins. Document this as described above. ([[Assignment_4_fallback_data|See the fallback data in case you get stuck]]) (1 mark)
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6FWD6_CANGL Q6FWD6]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
*Prepare a second PHYLIP formatted input file from this MSA, that contains only the APSES domains.  ([[Assignment_4_fallback_data|See the fallback data in case you get stuck]]) (1 mark)
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Cryptococcus neoformans</i></td>
+
</div>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5KHS0_CRYNE Q5KHS0]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
&nbsp;<br>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Debaryomyces hansenii</i></td>
+
&nbsp;<br>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6BSN6_DEBHA Q6BSN6]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
<div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Eremothecium gossypii</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q752H3_ASHGO Q752H3]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
===(1.2) Calculating Trees (3 marks)===
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Gibberella zeae</i></td>
+
</div>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4IEY8_GIBZE Q4IEY8]</td>
+
&nbsp;<br>
</tr>
 
  
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Kluyveromyces lactis</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_KLULA P39679]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Magnaporthe grisea</i></td>
+
*Using the '''protdist''' program of PHYLIP, calculate a distance matrix for both files. ([[Assignment_4_fallback_data|See the fallback data in case you get stuck]]) (1 mark)
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q3S405_MAGGR Q3S405]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
*If you use the PHYLIP Webserver, do the following: use a neighbor joining algorithm ('''bionj''' on the PHYLIP server), construct a tree for both input files (on the server: run ... "on outfile" ) When the program is done, select the option '''drawgram''' and click '''Run the selected program on treefile'''. Choose a '''cladogram''' tree-style and a suitable output format (e.g. postscript). Paste the trees into your assignment.  
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Neurospora crassa</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q7SBG9_NEUCR Q7SBG9]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
*If you use a locally installed version of PHYLIP use '''neighbor''' with the UPGMA method to construct a tree for both input files. Open the file '''outfile''' in a text-editor, copy and paste the trees into your assignment.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Saccharomyces cerevisiae</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_YEAST P39678]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
:(1 mark) for either of the above
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Schizosaccharomyces pombe</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=RES2_SCHPO P41412]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
*Briefly discuss whether the trees are fundamentally similar or whether there are important differences (i.e. differences in '''topology'''). If there are differences in topology, which branch(es) would have to be moved to make the trees congruent? (1 mark)
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Ustilago maydis</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4P117_USTMA Q4P117]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
</div>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Yarrowia lipolytica</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6CGF5_YARLI Q6CGF5]</td>
 
</tr>
 
  
</table>
+
&nbsp;<br>
 +
&nbsp;<br>
  
 +
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
<div style="padding: 5px; background: #EEEEEE;">
+
==(2) Analysis==
*Copy your organism's Mbp1 sequence from the alignment above. Then define the start- and end- sequence numbers of the '''target''' sequence relative to the full-length protein. Prepare a FASTA formatted file for the '''target''' sequence in your organism, giving it an appropriate header and include the sequence numbers. Refer to the [[Assignment_5_fallback_data|'''Fallback data''']] file if you are not sure about the format. (1 mark)
 
 
</div>
 
</div>
&nbsp;<br>
 
  
Your FASTA sequence should look similar to this:
+
It is surprisingly difficult to find a comprehensive phylogenetic analysis of the fungal species for which the genomes have been sequenced, although one would assume this to be data of considerable utility for the community. I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such rRNA trees are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
  
>1MB1: Mbp1_SACCE 1..100
+
[[Image:FungiCladogram.jpg|frame|none|Cladogram of fungi. Cladogram is based on small subunit ribosomal rRNA sequences, excerpt from ''Tehler et al.'' (2003) ''Mycol Res.'' '''107''':901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity. I have labeled all speciation events so you can refer to these labels in your assignment.]]
NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
 
LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
 
 
 
&nbsp;
 
&nbsp;
 
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
  
==(2) Homology model==
 
</div>
 
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
=== (2.1) SwissModel (1 mark)===
+
===(2.1) Correspondence of Gene trees and Phylogenetic Tree (1 mark)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
Access the Swissmodel server at [http://swissmodel.expasy.org '''http://swissmodel.expasy.org'''] . Navigate to the '''Alignment Interface'''.
+
Both your Mbp1 trees should of course correspond to  the reference cladogram of fungi, after all, we have chosen the orthologues of Mbp1 for this task. However, in practice this is usually not exactly the case. The treest are likely going to be different in parts. '''Note that phylogenetic trees that differ only in rotations around branchpoints are considerd identical&nbsp;!''' Differences in phylogenetic trees only concern the topology, i.e. the branching order.
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
&nbsp;<br>
*Copy from the alignment above the 1MB1 sequence and the sequence from your organism, and paste it into the form field. Refer to the [[Assignment_5_fallback_data|'''Fallback Data file''']] if you are not sure about the format.
+
<div style="padding: 5px; background: #EEEEEE;">
:(You have to choose the format, and, if e.g. you choose a CLUSTAL format, you have to include a header line and a blank line. Other common problems uploading your alignment may include uploading a file that has not been saved as "text only" and periods i.e.  "."  in sequence names. Underscores appear to be safe.)
+
*Briefly compare '''one''' of the phylogenetic trees with the reference cladogram. Are there species that are not in the expected position? Does this mean the cladogram is not correct?  (1 mark)
 
 
* Click '''submit''' and define your '''target''' and '''template''' sequence. For the '''template sequence''' define the coordinate file and chain. (In our case the coordinate file is <code>'''1MB1'''</code> and the chain is "<code>'''_'''</code>" i.e. none, since the PDB file does not contain more than one chain.
 
 
 
*Click '''submit''' and request the construction of a homology model: Enter your e-mail address and check the button for '''Normal Mode''', not "Swiss-PDB Viewer mode. (Important, since there will be problems with the output otherwise). Click '''submit'''. You should receive four files files by e-mail within half an hour or so. (1 mark)
 
 
 
(You do not need to submit any coordinate files with your assignment.)
 
  
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
In case you do not wish to submit the modelling job yourself, you can access the result files for the  from the  [[Assignment_5_fallback_data|'''Fallback Data file''']].
 
 
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
 
==(3) Model analysis==
 
</div>
 
&nbsp;
 
 
&nbsp;
 
&nbsp;
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
=== (3.1) The PDB file (1 mark)===
+
 
 +
===(2.2) Evolutionary History of the APSES Domain (2 marks)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
Open your  '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions: (Alternatively, view the coordinates linked to the [[Assignment_5_fallback_data|'''Fallback Data file''']].)
+
In order to study the evolutionary history of the entire gene family, we have to prepare a tree of all 74 APSES genes. I have done this with the program '''promlk''' of the PHYLIP package. You can access the [[APSES_domains_reference_tree|'''APSES domains reference tree''']] here.
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
This is a complicated tree, and it can look impenetrably confusing at first. Here are two principles that will help you make sense of the tree.
*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of the '''model''' correspond to that? (1 mark)
 
</div>
 
  
<!-- discuss flagging of loops - setting of B-factor to 99.0 -->
+
A: '''A gene that is present in an ancestral species, is inherited in all descendent species.''' The gene has to be observed in all OTUs, unless its has been lost (which is a rare event). This means, if a gene is present in two widely divergent species, but in none other of the descendants of the LCA, it is possible that there is some problem with the tree (long branch attraction maybe), or the sequence has been acquired through horizontal gene transfer.
  
&nbsp;
+
B: '''Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the genes, in all descendants'''; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the OTUs, up to the branchpoint of their LCA.
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
<!-- (Punctuated equilibrium ?) -->
===(3.2) first visualization (3 marks)===
 
</div>
 
&nbsp;<br>
 
  
In assignment 2 you have already studied the 1MB1 coordinate file and compared it to your organism's Mbp1 APSES domain, Since a homology model inherits its structural details from the '''template''', the model should look very similar to the original structure but contain the sequence of the '''target'''.
+
With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry the [[APSES_domains_reference_tree|reference tree of all APSES domains]] apart quite nicely. A few colored pencils and a printout of the tree will help.
  
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
*Save the attachment of your '''model''' coordinates to your harddisk and visualize it in RasMol. (Alternatively, copy and save the coordinates from the [[Assignment_5_fallback_data|'''Fallback Data file''']] to your harddisk.) Make an informative view, divergent stereo and paste it into your assignment. (3 marks)
+
*Identify at least two (more or less) monophyletic subtrees of the tree that recapitulate (more or less) the species tree. Write down their branch-point numbers.  In one of these subtrees, define the correspondence between the letters in the species tree and the branchpoints in the gene tree '''as far as possible'''. (The correspondence is not going to be exact. ) (1 mark)
 +
 
 +
<!-- Discuss briefly how many APSES domain proteins the fungal cenancestor appears to have posessed and by which sequence of gene loss or gene duplications the APSES domains in "your" organism appear to have arisen. (This is a straightforward synthesis based on what you have done above, by referring to labelled nodes in the reference cladogram.) -->
  
 +
* Discuss briefly if there are features of the gene tree that are systematically inconsistent with the reference cladogram: choose '''one''' species from the gene tree that violates the expectation from the species tree and check what it's close neighbors are for the other parts of the tree. Do these sequences always appear more closely or more distantly related than they should be? (1 mark)
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
 
 
[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76).]]
 
 
&nbsp;
 
 
&nbsp;
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
===(3.3) modeling a DNA ligand (4 marks)===
+
===(2.3) Unraveling the Mbp1 clade (2 marks)===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
The really interesting question we could begin to address with our model is how sequence variation might be converted into changing DNA recognition sites, and then lead to changed cognate DNA binding sequences. But in order to address this, we would need to add a plausible model for a bound DNA molecule to our model.
+
If we consider the Mbp1 clade (the clade containing all 16 of our Mbp1 orthologues), descending from branchpoint '''115''', we note that it contains 16 additional genes, one from each of the species we have studied, however both subtrees are polyphyletic within the clade. (In what follows I will refer to subgroups of the tree as ''Clades'' and label them with their branch number in this reference tree.) It is thus tempting to speculate that this subtree, ''Clade 115'' is actually a mix-up of two ancestral genes, one of them an orthologue to Mbp1, the other to a different gene - perhaps Swi4. If this were true, there would be three possible explanations for the polyphyletic tree: (i) either we have mislabeled the orthologues, or (ii) we have constructed an incorrect tree, or (iii) our model of how these genes have evolved is wrong. Actually, it is rather obvious when we compare the clades that branch off at 117 and 128, why this is not a simple case of mislabeling the genes ...
  
Since there is currently no software available that would accurately model such a complex from first principles, we will base this on homology modeling as well. This means we need to find a similar structure for which the complex structure is known. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of a protein-DNA complex. Now what?
+
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 +
*Explain briefly why it is unlikely that the clades at branchpoints 117 and 128 are each descendants of a single cenancestral gene. (1 mark)
 +
</div>
 +
&nbsp;<br>
  
Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, however their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, but not for the purpose of sequence alignment, but for structure alignment. A kind of BLAST for structures.
+
Could this point to a problem with the columns we have selected for the tree building? Have we biased our input data and then constructed an incorrect tree? To test this, a revised tree was constructed, taking the genes from ''Clade 115'', realigning them and constructing a new tree. You can have a look at the [[Revised_Mbp1_APSES_domain_tree| '''revised Mbp1 APSES domain tree''']], and the procedure - or you can just take my word for it, the trees are substantially similar. Apparently the precise choice of columns has a minor effect on the results, as it should have. We should note in general however that '''branch points that are close together in time may be poorly resolvable, while branch points that are separated by longer evolutionary distances are more reliable'''. In summary, save for some inaccuracies, we have no reason to believe the tree is fundamentally incorrect.
  
However, very similar to BLAST, we might not want to search with the entire protein, if all we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small of a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless. The arrangement of the residues from 50 to 74 that we have already discussed in Assignment 2 suggests that the compact subdomain from 36 to 76 (see the image above) might be a useful structure to search with: it contains the residues we are interested in and enough of connected secondary structure elements to be structurally meaningful.
+
So what about or model of evolution? What do we actually expect?
  
At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is a search tool for structural similarity search tool for this purpose. Unfortunately it does not seem to be able to handle a query with such a structural subdomain (the process did not finish after several days) but at least you can get a list of structural neighbors of the 1MB1 full-length template structure, by entering the PDB ID in a small form field on the VAST home page, and then clicking on the colored bar labeled "Chain" on the MMDB structure summary page. This precomputed page for the 1MB1 structure shows a number of diverse proteins matching to various helices and strands of the structure.
+
As we had written above, we expect to see anumber of more or less faithful copies of the species tree, for each of gene in the cenancestor (or LCA for later duplications).
  
At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, the SSM (Secondary Structure Matching service) provides a well thought out interface for searching files from the PDB or uploading coordinates.  
+
And as you see, this is '''not''' exactly what we find in ''Clade 115''. What we '''do''' find however is that ''Clade 115'' reflects our expectations in part. ''Clade 123'' pretty well recapitulates the ''Saccharomycotina'', as does ''Clade 118'', and when we consider ''Clade 128'', apart from some "noise" we see both the ''Dikaryomycota'' (in ''Clade 142'') reasonably grouped together, as well as the ''Euascomycotina'' (in ''Clade 129") - in fact the most striking departure from our expected tree for Mbp1 is the placement of ''Clade 123'', which we would expect attached below branch 128, not above it where we have found it.
  
After uploading the coordinates for residues 36 to 76 of the 1MB1 structure running the search and sorting the results by alignment length, the top hits include a number of nucleotide binding proteins such as a replication terminator (1F4K), the LexA repressor (1MVD) and a "Winged Helix" protein (1KQ8). These are all members of a much larger superfamily, the "winged helix" DNA binding domains ([http://cathwww.biochem.ucl.ac.uk/cgi-bin/cath/GotoCath.pl?cath=1.10.10.10 CATH 1.10.10.10]), of which hundreds of structures have been solved. They represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A recent review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of the beta strand binding into the minor groove.
+
This may be due to insufficient resolution of the trees. Whenever we have long, evolutionary distances with a few OTUs clustered at the end, we can be reasonbly certain of the branch. When branching events occur at similar times, the order is uncertain and any specific ordering would probably not be well supported in bootstrapping.
 +
 +
What is striking however is the parallelism between ''Clade 123'' and ''Clade 118''. Apparently branchpoint 117 marks a duplication event of the ancestral Mbp1 gene in the ''Saccharomycotina''. Here our expectation appears to be borne out. And given the uncertainty in the precise ordering between branchpoints 117 and 128, the Clade overall is thus actually quite consistent with the species tree.
  
<!-- The other service the EBI structure links to is the DALI server. DALI was one of the first algorithms capable of large-scale protein structure searches; it was developed by Liisa Holm and is now hosted by her group in Helsinki. Submitting our search domain generates the e-mailed result linked to here. Both results (there are only two) are also found in the top 100 list of the SSM service. The winged helix domain 1DP7 merits some comment though: its structure shows a novel mode of binding for DNA. Here, it is the beta-wing, not the "recognition helix" that inserts into the major groove! We will consider this in more detail below.
+
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 +
*Finally, find one example of a gene duplication, below branchpoint 128.
 +
</div>
  
First we shall explore some of the structures that SSM has returned. The SSM server presents its result details in Web pages, but it also allows to download the entire result set in an XML formatted file. This is a common method of data-interchange in bioinformatics but you would not want to actually read such a file and manually extract information (even though you could, in principle). Thus I have prepared a summary file of the alignment details of the SSM results. This should allow you to rapidly find the exact aligned residues in the matched domains. While I have derived this file from the output through a computer program I have written, you could easily have accessed the same information on the Web, had you run the query yourself. -->
 
  
This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can pick one of these for which a DNA complex structure is known. I have picked one such structure from the list of hits that were returned by SSM: it is the Elk-1 transcription factor.
 
 
[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (pdb|1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
 
 
Now all that is left to do is to bring the DNA molecule  into the correct orientation for our '''model''' and then to combine the two files. We need to superimpose the Elk-1 protein/DNA complex onto our '''model'''.
 
 
;Structure superposition
 
There are quite a number of superposition servers available on the Web, a remarkably comprehensive overview can be found in [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia]. However, overengineering and black-box mentality makes our task more difficult than it need be: most tools do not allow users to specify particular alignment zones but attempt to automatically define the zones of residues to be supoerimposed according to some geometric target function. Almost none return the actual rotation matrix and translation vector that is used for the superposition. And almost none transform the coordinates of heteratoms such as solvent, ligands or DNA molecules along with the protein coordinates. An exception that I have found to be very useable is the [http://www.predictioncenter.org/local/lga/lga.html Local-Global Alignment server ('''LGA''')], written by Adam Zemla. The procedure is quite straightforward:
 
 
*Define the structure to be rotated (1DUX in this case). This is a dimer, so download the file from the PDB and manually edit to contain only DNA chains A and B and protein chain C.
 
*Define the structure to be held constant (1MB1 in this case). Download from PDB.
 
*Use the "browse" option to define both files as input on the LGA inpput form
 
*Use the option to have both coordinate sets included in your output: <code>-o2</code>
 
*Submit
 
 
The results arrive per e-mail. I have linked the resulting PDB file to the [[Assignment_5_fallback_data|'''Fallback Data page''']]. <small>If you run this analysis on your own, you may want to review the types of edits the edits I made to the PDB file to get it displayed correctly in Rasmol.</small>
 
 
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Save the superimposed  coordinates in a file, open and view in Rasmol and note how well the "recognition helix" and adjacent beta strands superimpose! (Alternatively, copy and save the coordinates from the c to your harddisk.) Make an informative view, divergent stereo and paste it into your assignment. (4 marks)
 
</div>
 
&nbsp;<br>
 
 
&nbsp;
 
&nbsp;
 
+
&nbsp;
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(4) Summary of Resources==
+
==(3) Summary of Resources==
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
 
;Links
 
;Links
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Peitsch_2002_UseOfModels.pdf '''Review (PDF, restricted)''' Manuel Peitsch on Homology Modeling]
+
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf '''Review (PDF, restricted)''' Sandra Baldauf: Phylogeny for the Faint of Heart]
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains] (background reading, not required reading)
 
 
:* [[Organism_list_2006|Assigned Organisms]]
 
:* [[Organism_list_2006|Assigned Organisms]]
:* [http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/guide2.2_frame.html '''PDB file format''']
+
:* [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP''' home page]
:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
+
:* [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html '''PHYLIP''' Web Service at the Institut Pasteur]
 +
:*[[Assignment_4_fallback_data|'''Fallback data''']]
  
:* [[Assignment_5_fallback_data|'''Fallback Data page''']]
+
;Sequences
 +
:* [[All_Mbp1_proteins|'''All Mbp1 proteins''']]
 +
:* [[All_APSES_domains|'''All APSES domains''']]
  
 
;Alignments
 
;Alignments
 +
:'''Mbp1 proteins:'''
 +
:* [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
 +
:* [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
 
:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
 
:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
 +
 +
:'''APSES domains:'''
 +
:* [[APSES_domains_PSI-BLAST|All APSES domains - alignment based on '''PSI-BLAST''' results]]
 +
:* [[APSES_domains_CLUSTAL|All APSES domains -  '''CLUSTAL-W''' alignment]]
 +
:* [[APSES_domains_probcons|All APSES domains -  '''probcons''' alignment]]
 +
 +
;Trees
 +
:*[[APSES_domains_reference_tree|'''APSES domains reference tree''']]
 +
:*[[Revised_Mbp1_APSES_domain_tree| '''revised Mbp1 APSES domain tree''']]
 +
  
 
&nbsp;
 
&nbsp;

Revision as of 12:44, 19 October 2007

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 

 


   

Assignment 4 - Phylogenetic Analysis

Please note: This assignment is currently inactive. Unannounced changes may be made at any time.

 


Introduction  

Nothing in Biology makes sense except in the light of evolution.
Theodosius Dobzhansky

... but does evolution make sense in the light of biology?

As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet - looking at orthologues - this is not always a clear one-to-one mapping of related genes to each other. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, this may be warranted. But what if that gene has duplicated in one of them, and the two paralogues now perform different, related functions in one organism? In order to be able to even ask such questions, we need to understand how we can make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And how did the species benefit from this event?

We will develop some of this kind of analysis in this assignment. In the previous assignment you have established which genes are the reciprocally most closely related orthologues to Mbp1 in yeast. In this assignment, we will analyse their evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.

A number of good tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package and the (commercial) PAUP package. Specialized tools for tree-building include Treepuzzle or Mr. Bayes. This assignment is conctructed around programs that are availble in PHYLIP, however you are welcome to use other tools that fulfil a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not and usually it is hard to tell, which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge, independent of the algorithm, to be more reliable than those that depend strongly on a particular algorithm or details of input data.

But regarding algorithm and rersources: we will take two shortcuts in this assignment (and both shortcuts are things you should not do in real life):

One: we will use an efficient tree-building algorithm, not the best-available one. This is an algorithm which is available on the Web, without the need for you to install software on your own machine. In real life you would of course use the most accurate algortihm you can lay your hands on, regardless of the resources this requires, since it makes no sense to waste your time on a careful analysis of inaccurate trees. Your supervisor would want it so as well. And if not she, the reviewers of your manuscript.

Two: we will assume the tree the algorithm constructs is correct. In real life you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with partial data and see which branches and groupings are robust and which depend on the details of the data. But we should still acknowledge that bifurcations that are very close to each other have not been" resolved". Any conscientious reviewer would flag such leniency and send your results back to you for a bootstrapping exercise at the computer. In phylogenetic analysis, not all lines that the program draws are equally trustworthy. Dont take the trees as a given fact just because a program suggests this. Look at the evidence, use your reasoning, and analyse them critically.

In case you want to review concept of trees, clades, LCAs OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis to the resource section at the bottom of this page.

 

Preparation, submission and due date

Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.

Prepare a Microsoft Word document with a title page that contains:

  • your full name
  • your Student ID
  • your e-mail address
  • the organism name you have been assigned

Follow the steps outlined below. You are encouraged to write your answers in short answer form or point form, like you would document an analysis in a laboratory notebook. However, you must

  • document what you have done,
  • note what Web sites and tools you have used,
  • paste important data sequences, alignments, information etc.

If you do not document the process of your work, we will deduct marks. Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formating. Do not paste screendumps. Keep the size of your submission below 1.5 MB.

Write your answers into separate paragraphs and give each its title. Save your document with a filename of: A3_family name.given name.doc (for example my fourth assignment would be named: A4_steipe.boris.doc - and don't switch the order of your given name and familyname please!)

Finally e-mail the document to [boris.steipe@utoronto.ca] before the due date.

Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.

With the number of students in the course, we have to economize on processing the assignments. Thus we will not accept assignments that are not prepared as described above. If you have technical difficulties, contact me.

The due date for the assignment is Thursday, December 7. at 24:00 (last day of class). In case you need more time since the assignment was posted late, an extension is automatically granted to Tuesday, December 19. at 10:00 in the morning.

Grading

Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted and an additional mark for every full twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you must arrange this beforehand.

Marks are noted below in the section headings for of the tasks. A total of 10 marks will be awarded, if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularily interesting findings, or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will

  • count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
  • be divided by two for BCH1441 (graduates).

   

(1) Preparations

   

(1.1) Preparing Input Files (2 marks)

 

Introduction: Task

For this assignment, we start from the multiple sequence alignments we have constructed in the last assignment. We will edit alignments to make them suitable for phylogenetic analysis. We will construct a tree and we will discuss tree analysis in the end.

Introduction: Principle

In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all sequences have to be edited to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise an alignment but to analyse evolutionary relationships, given the alignment.

The result of the tree estimation is a decision about likely relationships. Fundamentally all the programs do is to decide which sequences had common ancestors. Distance based phylogeny programs have a way to convert sequence comparisons into evolutionary distances (applying a model of evolution such as a mutation data matrix, calculating one number for each pair of sequences and using that to estimate a tree by grouping together the most highly related sequences (e.g. with an NJ, Neigbor Joining, algorithm). Alternatively one can find trees that are most compatible with the observed sequences and the specific model of evolutionary change through point-mutations (either or by minimizing the number of mutation events over the tree (Parsimony) or by finding the tree for which the observed sequences would be the most likely (ML, Maximum Likelihood)). Clearly, in order for any of this to work, one must not include fragments of sequence which have evolved under a totally different evolutionary model, such as domain fusion, or insertion/deletion of residues. The goal is not to be as comprehensive and complete as possible but to input the columns of aligned residues that will best represent the phylogenetic relationships between the sequences.

Introduction: Problems

Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned columns of characters. The order of the columns does not change the score. Hoever for the alignment, this is no longer true, once gaps are introduced: we treat an insertion gap-character differently from an elongation character. Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigourously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most phylogeny programs, (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character like a residue with the one letter code "-". This underestimates the distance between gapped sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. When there are unambiguous gaps, one might be tempted to fudge the alignment by inserting matching characters into sequences that are ungapped (e.g. five "A"s each into the ungapped sequences and five "-" each into the gapped sequences), however, I would caution against this approach since it possibly introduces even more non-obvious implicit assumptions and potential for error. It is therefore common to remove most, if not all gaps.

Introduction: Practice

In practice, follow the fundamental principle that all characters in a column should be related by homology. This implies the following rules of thumb:

  • Remove all stretches of residues in which the alignment appears ambiguous.
  • Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous.
  • Remove all but approximately one column from gapped regions, and all residues N- and C- terminal of the gap in which the alignment appears questionable. ( I would keep one gapped column as a placeholder for a rare and very distinct evolutionary event, rather than simply deleting them all, some researchers remove all gaps).
  • Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
  • If your sequences are too long, you may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input.
(A very useful trick with Microsoft Word is that you can select blocks of text and entire columns in the document with your mouse: hold the "ALT" key depressed while you click and drag your mouse to select. This will greatly facilitate the preparation of sequences. You can treat that selection as any other selected text, color characters, or delete them. Importantly, you can also cut and paste entire columns! Of course, this will only work as expected if you use a fixed-width font such as Courier. )

The preparation of the input file of aligned residues, used by the PHYLIP package is straightforward in principle; just carefully follow the instructions in PHYLIP's well written documentation. If you plan to use an outgroup for your tree, it is a good idea to move that to the first line of your alignment, since this is where PHYLIP will look for it by default.

Some notes on how to avoid common editing troubles. Copy the sequences frrom the link provided below. Paste them into a document, using the Word "Edit -> Paste special -> Unformatted text". Set the page-setup to "landscape", the font-size to something small, then you can put every sequence into one line. You can replace all paragraph marks ("^p") with (nothing) to remove them, then replace the FASTA header line character ">" with paragraphs ("^p") to separate them by line again. Take special note that your files must not include tab characters. You can use Word to globally replace all tabs (specified as "^t") with a blank, to make sure. Spaces count, so display your alignment in a fixed-width font, such as Courier ("Courier New" on Windows), not a proportional-width font such as Times, Arial, or Helvetica, and ensure all characters in your alignments align as they should. As always, make sure you save your input files as "Text Only".

A note if you are working on a Mac and saving input on disk, to run with a locally installe PHYLIP version: here MS Word will play one of its usual shenanigans on you since it writes text files with the old-style OS 9 Carriage Return characters (\r; ASCII 13; hex 0D; CR). Just by looking at the file, this is quite invisible but such "Carriage returns" are not going to be recognized by PHYLIP and most other self-respecting UNIX based programs. It may not make a difference when you paste your sequences to a Web server; but if you compute things locally it will appear to the program as though everything were in one line). And this can (and did) lead to head-banging rounds of frustration. You need to replace them with Linefeed resp. Newline characters (\n; ASCII 10; hex 0A; LF) and you can't even do that within Word(!). Open a UNIX terminal window and navigate to the directory where your files reside. Then type:
tr "\r" "\n" < infile > outfile
... where outfile is different from infile (careful: if a file by the name of outfile already exists, tr will cheerfully overwrite it.) Alternatively you could type the following perl one-liner :
perl -e 'while(<>){tr/\r/\n/;print}' < infile > outfile


In your assignment submission, clearly identify the source sequences you are using, as well as the alignment method you have used. Paste your unaltered source alignment into your document, clearly highlight or otherwise color the columns that you have selected, annotate why you have selected them and paste your result as well. Here is an example of what this might look like:


(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; d: input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the PHYLIP sequence format guide.
Introduction: Web Service and data

You have two choices for completing the assignment: either to peruse one of the on-line Webservices that generously provide a compute-intensive task such as PHYLIP, or to download and install the program at home. If you choose the former, one of your options is the PHYLIP service at the Institut Pasteur. I have tried it, and it works - however not entirely without problems. Uninformative errors will occur when your input is too large for the system's memory (like: "sequences not aligned" ... "out of memories" and such) but what is worse, after submitting a number of jobs, the system locked me out, asking me to what an unspecified time until results would be sent by e-mail (three days later, that hasn't happened). Regrettably, this is not documented. If you can live with that, the integration of their services in a logical sequence of steps is good and some of their services are a bit more advanced than plain out of the box PHYLIP. If you rateher decide to install PHYLIP, good for you. That is easy to do, well documented, there are much less limitations on memory - but if you don't read and understand the instructions carefully, you may be in for a spell of frustration.

Either way, I have posted typical input files and result files here, to allow you to bail out in case technical problems become overwhelming. If you use the data posted here instead of your own, you must document that fact and explain what you have tried, and why that has failed. If you fail to do that, we will deduct marks - the posted data is a fallback, not a shortcut.

In this assignment, we will use a simple distance based tree construction method. This represent a reasonable compromise between accurracy and speed, especially when applied to moderately dissimilar sequences. In genereal, distance methods include two steps: (1) calculate a pairwise-distance matrix between sequences, (2) construct a tree, based on the matrix. Thus all the information in the alignment bewtween two pairs of sequences is collapsed into a single number: their pairwise distance. Alternative approaches, parsimony as well as ML based algorithms take individual columns into account. Parsimony based methods construct inaccurate trees when they can't make good estimates for the required number of sequence changes, if the sequences become too dissimilar. ML based methods are considered the most accurate for dissimilar sequences, however they are also very compute intensive and the full-length APSES domain alignment for 74 species run about half a day on my workstation. Thus we will use distance based methods here, specifically the UPGMA variant of the neighbor joining algorithm.

Prepare an input file that is representative of the whole protein, and one that represents only the APSES domains.

 

  • Access one of the MSAs for Mbp1 proteins, linked from the resources section at the bottom of the page. Choose an MSA that you have determined in your third assignment to be "reliable" and (briefly) justify your choice. Prepare a PHYLIP formatted input file from this MSA, restricting the number of characters to no more than 60. Follow the considerations dicussed above. In particular you should choose some residues from each of the three aligned regions (The APSES domains, the Ankyrin domains and the C-terminal aligned region), to represent the diversity between these proteins. Document this as described above. (See the fallback data in case you get stuck) (1 mark)

 
 

(1.2) Calculating Trees (3 marks)

 


 

  • If you use the PHYLIP Webserver, do the following: use a neighbor joining algorithm (bionj on the PHYLIP server), construct a tree for both input files (on the server: run ... "on outfile" ) When the program is done, select the option drawgram and click Run the selected program on treefile. Choose a cladogram tree-style and a suitable output format (e.g. postscript). Paste the trees into your assignment.
  • If you use a locally installed version of PHYLIP use neighbor with the UPGMA method to construct a tree for both input files. Open the file outfile in a text-editor, copy and paste the trees into your assignment.
(1 mark) for either of the above
  • Briefly discuss whether the trees are fundamentally similar or whether there are important differences (i.e. differences in topology). If there are differences in topology, which branch(es) would have to be moved to make the trees congruent? (1 mark)

 
 

(2) Analysis

It is surprisingly difficult to find a comprehensive phylogenetic analysis of the fungal species for which the genomes have been sequenced, although one would assume this to be data of considerable utility for the community. I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such rRNA trees are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.

Cladogram of fungi. Cladogram is based on small subunit ribosomal rRNA sequences, excerpt from Tehler et al. (2003) Mycol Res. 107:901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity. I have labeled all speciation events so you can refer to these labels in your assignment.


   

(2.1) Correspondence of Gene trees and Phylogenetic Tree (1 mark)

 

Both your Mbp1 trees should of course correspond to the reference cladogram of fungi, after all, we have chosen the orthologues of Mbp1 for this task. However, in practice this is usually not exactly the case. The treest are likely going to be different in parts. Note that phylogenetic trees that differ only in rotations around branchpoints are considerd identical ! Differences in phylogenetic trees only concern the topology, i.e. the branching order.

 

  • Briefly compare one of the phylogenetic trees with the reference cladogram. Are there species that are not in the expected position? Does this mean the cladogram is not correct? (1 mark)

 
 

(2.2) Evolutionary History of the APSES Domain (2 marks)

 

In order to study the evolutionary history of the entire gene family, we have to prepare a tree of all 74 APSES genes. I have done this with the program promlk of the PHYLIP package. You can access the APSES domains reference tree here.

This is a complicated tree, and it can look impenetrably confusing at first. Here are two principles that will help you make sense of the tree.

A: A gene that is present in an ancestral species, is inherited in all descendent species. The gene has to be observed in all OTUs, unless its has been lost (which is a rare event). This means, if a gene is present in two widely divergent species, but in none other of the descendants of the LCA, it is possible that there is some problem with the tree (long branch attraction maybe), or the sequence has been acquired through horizontal gene transfer.

B: Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the genes, in all descendants; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the OTUs, up to the branchpoint of their LCA.


With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry the reference tree of all APSES domains apart quite nicely. A few colored pencils and a printout of the tree will help.

 

  • Identify at least two (more or less) monophyletic subtrees of the tree that recapitulate (more or less) the species tree. Write down their branch-point numbers. In one of these subtrees, define the correspondence between the letters in the species tree and the branchpoints in the gene tree as far as possible. (The correspondence is not going to be exact. ) (1 mark)


  • Discuss briefly if there are features of the gene tree that are systematically inconsistent with the reference cladogram: choose one species from the gene tree that violates the expectation from the species tree and check what it's close neighbors are for the other parts of the tree. Do these sequences always appear more closely or more distantly related than they should be? (1 mark)

 
 

(2.3) Unraveling the Mbp1 clade (2 marks)

 

If we consider the Mbp1 clade (the clade containing all 16 of our Mbp1 orthologues), descending from branchpoint 115, we note that it contains 16 additional genes, one from each of the species we have studied, however both subtrees are polyphyletic within the clade. (In what follows I will refer to subgroups of the tree as Clades and label them with their branch number in this reference tree.) It is thus tempting to speculate that this subtree, Clade 115 is actually a mix-up of two ancestral genes, one of them an orthologue to Mbp1, the other to a different gene - perhaps Swi4. If this were true, there would be three possible explanations for the polyphyletic tree: (i) either we have mislabeled the orthologues, or (ii) we have constructed an incorrect tree, or (iii) our model of how these genes have evolved is wrong. Actually, it is rather obvious when we compare the clades that branch off at 117 and 128, why this is not a simple case of mislabeling the genes ...

 

  • Explain briefly why it is unlikely that the clades at branchpoints 117 and 128 are each descendants of a single cenancestral gene. (1 mark)

 

Could this point to a problem with the columns we have selected for the tree building? Have we biased our input data and then constructed an incorrect tree? To test this, a revised tree was constructed, taking the genes from Clade 115, realigning them and constructing a new tree. You can have a look at the revised Mbp1 APSES domain tree, and the procedure - or you can just take my word for it, the trees are substantially similar. Apparently the precise choice of columns has a minor effect on the results, as it should have. We should note in general however that branch points that are close together in time may be poorly resolvable, while branch points that are separated by longer evolutionary distances are more reliable. In summary, save for some inaccuracies, we have no reason to believe the tree is fundamentally incorrect.

So what about or model of evolution? What do we actually expect?

As we had written above, we expect to see anumber of more or less faithful copies of the species tree, for each of gene in the cenancestor (or LCA for later duplications).

And as you see, this is not exactly what we find in Clade 115. What we do find however is that Clade 115 reflects our expectations in part. Clade 123 pretty well recapitulates the Saccharomycotina, as does Clade 118, and when we consider Clade 128, apart from some "noise" we see both the Dikaryomycota (in Clade 142) reasonably grouped together, as well as the Euascomycotina (in Clade 129") - in fact the most striking departure from our expected tree for Mbp1 is the placement of Clade 123, which we would expect attached below branch 128, not above it where we have found it.

This may be due to insufficient resolution of the trees. Whenever we have long, evolutionary distances with a few OTUs clustered at the end, we can be reasonbly certain of the branch. When branching events occur at similar times, the order is uncertain and any specific ordering would probably not be well supported in bootstrapping.

What is striking however is the parallelism between Clade 123 and Clade 118. Apparently branchpoint 117 marks a duplication event of the ancestral Mbp1 gene in the Saccharomycotina. Here our expectation appears to be borne out. And given the uncertainty in the precise ordering between branchpoints 117 and 128, the Clade overall is thus actually quite consistent with the species tree.

 

  • Finally, find one example of a gene duplication, below branchpoint 128.


   

(3) Summary of Resources

 

Links
Sequences
Alignments
Mbp1 proteins:
APSES domains:
Trees


   

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List