Difference between revisions of "BIO Assignment 4 2011"

From "A B C"
 
(44 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
<!-- {{Template:Inactive}} -->
 +
{{Template:Active}}
 +
 +
 +
 
__TOC__
 
__TOC__
 
&nbsp;
 
&nbsp;
Line 4: Line 9:
  
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
Assignment 4 - Phylogenetic Analysis
+
Assignment 4 (last: 2011) - Phylogenetic Analysis
 
</div>
 
</div>
 
Please note: This assignment is currently inactive. Unannounced changes may be made at any time.
 
&nbsp;
 
 
<!-- '''Please note: This assignment is currently active. All changes will be announced on the course mailing list.'''-->
 
 
&nbsp;
 
  
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
Introduction
 
Introduction
 
+
&nbsp;
</div>
 
  
 
;Nothing in Biology makes sense except in the light of evolution.
 
;Nothing in Biology makes sense except in the light of evolution.
 
:''Theodosius Dobzhansky''
 
:''Theodosius Dobzhansky''
 
... but does evolution make sense in the light of biology? As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet - looking at orthologues - this is not always a clear one-to-one mapping of related genes to each other. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of ''function'' - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, this may be warranted. But what if that gene has duplicated in one of them, and the two paralogues now perform different, related functions in one organism? In order to be able to even ask such questions, we need to understand how we can make the evolutionary history of gene families explicit. This is the domain of '''phylogenetic analysis'''. We can ask questions like: how many paralogues did the cenancestor of a group possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And how did the species benefit from this event?
 
 
We will develop some of this kind of analysis in this assignment. In the previous assignment you have established which genes are the reciprocally most closely related orthologues to Mbp1 in yeast. In this assignment, we will analyse their evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.
 
 
A number of good tools for phylogenetic analysis exist; ''general purpose packages'' include the (free) PHYLIP package and the (commercial) PAUP package. ''Specialized tools'' for tree-building include Treepuzzle or Mr. Bayes. This assignment is constructed around programs that are available in PHYLIP, however you are welcome to use other tools that fulfil a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge, independent of the algorithm, to be more reliable than those that depend strongly on a particular algorithm or details of input data.
 
 
But regarding algorithm and resources: we will take two shortcuts in this assignment (and both shortcuts are things you should not do ''in real life''):
 
 
One: we will use an '''efficient''' tree-building algorithm, not the best-available one. This is an algorithm which is available on the Web, without the need for you to install software on your own machine. In ''real life'' you would of course use the most accurate algorithm you can lay your hands on, regardless of the resources this requires, since it makes no sense to waste your time on a careful analysis of inaccurate trees. Your supervisor would want it so as well. And if not she, the reviewers of your manuscript.
 
 
Two: we will assume the tree the algorithm constructs is ''correct''. In ''real life'' you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with partial data and see which branches and groupings are robust and which depend on the details of the data. But we should still acknowledge that bifurcations that are very close to each other have not been "resolved". Any conscientious reviewer would flag such leniency and send your results back to you for a bootstrapping exercise at the computer. In phylogenetic analysis, not all lines that the program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, use your reasoning, and analyse them critically.
 
 
In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked two excellent and very understandable introduction-level articles on phylogenetic analysis to the resource section at the bottom of this page.
 
 
&nbsp;
 
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
Preparation, submission and due date
 
 
</div>
 
</div>
  
Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.
+
... but does evolution make sense in the light of biology?
  
Prepare a Microsoft Word document with a title page that contains:
+
As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of '''phylogenetic analysis'''. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?
*your full name
 
*your Student ID
 
*your e-mail address
 
*the organism name you have been [[Organism_list_2006|assigned]]
 
  
Follow the steps outlined below. You are encouraged to  write your answers in short answer form or point form, '''like you would document an analysis in a laboratory notebook'''. However, you must
 
*document what you have done,
 
*note what Web sites and tools you have used,
 
*paste important data sequences, alignments, information etc.
 
  
'''If you do not document the process of your work, we will deduct marks.'''  Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formatting. Do not paste screendumps. Keep the size of your submission below 1.5 MB.
+
We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 and you have identified the full complement of APSES domain genes in your assigned organism. In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.
  
Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
+
A number of excellent tools for phylogenetic analysis exist; ''general purpose packages'' include the (free) [http://evolution.genetics.washington.edu/phylip.html PHYLIP] package and the (commercial) PAUP package. ''Specialized tools'' for tree-building include Treepuzzle or Mr. Bayes. This assignment is constructed around programs that are available in PHYLIP, however you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge when computed with different algorithms, to be more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.
<code>A3_family name.given name.doc</code>
 
<small>(for example my first assignment would be named: A3_steipe.boris.doc - and don't switch the order of your given name and familyname please!)</small>
 
  
Finally e-mail the document to [boris.steipe@utoronto.ca] before the due date.
+
However: regarding algorithm and resources, we will take a shortcut in this assignment (something you should not do in real life). We will assume that the tree the algorithm constructs is correct. In "real life" you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with partial data and see which branches and groupings are robust and which depend on the details of the data. In this assignment, we should simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes we have sequenced come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.  
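
The resampling step of such a bootstrap procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it resamples alignment columns with replacement, which is one common way to generate the "partial data" mentioned above, and the sequence names and residues are made up. Each replicate would then be passed to a tree-building program, which is not shown here.

```python
import random

def bootstrap_columns(alignment, rng):
    """Return one replicate: alignment columns drawn with replacement."""
    n_cols = len(next(iter(alignment.values())))
    # Pick as many column indices as the alignment has, allowing repeats.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

# Hypothetical toy alignment (names and residues are illustrative only).
alignment = {
    "MBP1_SACCE": "IYSARYSGVDVYEF",
    "A_12345":    "VYSAVYSGVEVYEC",
    "B_67890":    "IYKACYSGVPVYEF",
}
rng = random.Random(42)  # fixed seed so that the run is repeatable
replicates = [bootstrap_columns(alignment, rng) for _ in range(100)]
```

Branches that appear in most of the trees built from these replicates are the robust ones; branches that appear only occasionally depend on details of the data.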
  
Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
+
=====Introduction: Tasks=====
  
With the number of students in the course, we have to economize on processing the assignments. '''Thus we will not accept assignments that are not prepared as described above.''' If you have technical difficulties, contact me.
+
For this assignment, we start from the APSES domains you have collected previously. You will align these domains with a set of reference domains and edit the alignment to make it suitable for phylogenetic analysis, using Jalview. Then you will construct a phylogenetic tree and interpret the tree. The goal is to identify orthologues and paralogues. <!-- Optionally, you will look at structural and functional conservation of residues. -->
  
'''The due date for the assignment is XXXXX at 10:00 in the morning.'''
+
In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf article on phylogenetic analysis (pdf)] here and to the resource section at the bottom of this page.
  
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
&nbsp;
Grading
 
</div>
 
  
Don't wait until the last day to find out there are problems! The assignment is excellent preparation for the exam, so even if it's due later, it's a good idea to do it earlier. Assignments that are received past the due date will have one mark deducted at the first minute of every twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you '''must''' arrange this beforehand.
+
{{Template:Preparation|
 +
care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.|
 +
num=4|
 +
ord=fourth|
 +
due = Monday, November 28 at 12:00 in the morning}}
  
Marks are noted below in the section headings of the tasks. A total of 10 marks will be awarded, if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings, or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will
+
;Your documentation for the procedures you follow in this assignment will be worth 1 mark.
* count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
 
* be divided by two for BCH1441 (graduates).
 
  
 
&nbsp;
 
&nbsp;
Line 83: Line 51:
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 +
 
==(1) Preparations==
 
==(1) Preparations==
 
</div>
 
</div>
Line 89: Line 58:
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===(1.1) Tools (X marks)===
+
===(1.1) Preparing Input Files===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
Instruction
+
For this assignment, we start from the multiple sequence alignments we have constructed previously. We will edit the alignment to make it suitable for phylogenetic analysis. We will construct a phylogenetic tree and we will analyse the tree.
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Task
 
</div>
 
&nbsp;<br>
 
  
Instruction
+
=====Introduction: Principle=====
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Task.
 
</div>
 
  
&nbsp;
+
In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold '''aligned characters in corresponding positions'''. Phylogeny programs are not meant to revise an alignment but to analyze evolutionary relationships, '''after''' the alignment has been determined. The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable.
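
A quick sanity check of this requirement can be scripted before submitting an alignment to a phylogeny program. The following is a minimal sketch, assuming the alignment is held as a Python dictionary that maps sequence names to aligned rows; the function name is made up for illustration.

```python
def check_alignment(alignment):
    """Return the common row length, or raise ValueError if rows are unequal."""
    lengths = {name: len(seq) for name, seq in alignment.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"rows differ in length: {lengths}")
    for name, seq in alignment.items():
        # Tabs or blanks inside a row shift every column that follows them.
        if "\t" in seq or " " in seq:
            raise ValueError(f"whitespace inside row {name!r}")
    return next(iter(lengths.values()))
```

Note that this only checks the form of the alignment. Whether the columns actually hold homologous positions still has to be judged by inspection.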
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.
===(1.2) Preparing Input Files (X marks)===
 
</div>
 
&nbsp;<br>
 
  
=====Introduction: Task=====
 
For this assignment, you will need a file of source data (linked from the resource section at the bottom of this page). It is very similar to the files from the previous assignment: it contains the orthologous Mbp1 sequences as well as the sequences for all APSES domains in fungi. I have edited the sequence identifiers, to tell us something about the gene they are taken from. In particular, I have given each yeast gene its standard name (e.g. MBP1_SACCE) and named each gene from another organism with an arbitrary "A", "B", "C" ... to make sure the first ten characters are unique (since these first ten characters will be used and displayed by Phylip). This is then followed by the gi number in all cases, so it should be easy for you to retrieve the actual sequences from NCBI in case you need to. I have also omitted sequences from organisms we are no longer considering.
 
  
=====Introduction: Principle=====
+
'''Distance based''' phylogeny programs start by using sequence comparisons to estimate evolutionary distances:
In order to use these sequences for the estimation of phylogenetic trees, you have to build a multiple alignment first, then edit it. Most importantly, all sequences have to be edited to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise your alignment but to analyse evolutionary relationships, given the alignment.
 
  
The result of the tree estimation is a decision about likely relationships; fundamentally all the programs do is to decide which sequences had common ancestors. The phylogeny programs have a way to convert sequence comparisons into evolutionary distances (applying a model of evolution such as a mutation data matrix, calculating one number for each pair of sequences and using that to estimate a tree). Alternatively you can find trees that are most compatible with the observed sequences and the specific model of evolutionary change through point-mutations (either by grouping together the most highly related sequences (NJ, Neighbor Joining), or by minimizing the number of mutation events over the tree (Parsimony) or by finding the tree for which the observed sequences would be the most likely (ML, Maximum Likelihood)). Clearly, in order for this to work, you must not include fragments of sequence which have evolved under a totally different evolutionary model, such as domain fusion, or insertion/deletion of residues. The goal is not to be as comprehensive and complete as possible but to input the columns of aligned residues that will best represent the phylogenetic relationships between the sequences.
+
* they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
 +
* this score is stored in a "distance matrix" ...
 +
* ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).
  
=====Introduction: Problems=====
+
They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
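
The first two steps can be sketched as follows. This is a simplified illustration, not what PHYLIP does: instead of a mutation data matrix it uses the plain fraction of differing positions (the so-called p-distance), and the final step only picks the closest pair, which is the starting point of simple clustering rather than a full Neighbor Joining algorithm. The function names and the toy sequences are made up.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(alignment):
    """Symmetric matrix of pairwise distances, stored as nested dictionaries."""
    names = list(alignment)
    dist = {n: {n: 0.0} for n in names}
    for p, q in combinations(names, 2):
        dist[p][q] = dist[q][p] = p_distance(alignment[p], alignment[q])
    return dist

def closest_pair(dist):
    """Names of the two sequences with the smallest pairwise distance."""
    pairs = ((dist[p][q], p, q) for p, q in combinations(dist, 2))
    return min(pairs)[1:]

# Hypothetical toy alignment: A and B differ at one position out of six.
toy = {"A": "ACDEFG", "B": "ACDEFH", "C": "MCDKLH"}
matrix = distance_matrix(toy)
```

Here <code>closest_pair(matrix)</code> returns the pair A and B, since they differ in only one of six columns. Note that each pair of sequences has been reduced to a single number, so all information about which columns differ is lost at this point.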
Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of values from aligned columns of characters. Most '''alignment''' programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. Most '''phylogeny''' programs, (such as the programs in PHYLIP) do not work in this way though. PHYLIP strictly operates on columns of characters and treats a gap character like a residue with the one letter code "-". This underestimates the distance between gapped sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. When there are unambiguous gaps, one might be tempted to fudge the alignment by inserting matching characters into sequences that are ungapped (e.g. five "A"s each into the ungapped sequences and five "-" each into the gapped sequences), however, I would caution against this approach since it possibly introduces even more non-obvious implicit assumptions and potential for error.
 
  
=====Introduction: Practice=====
 
In practice, follow the fundamental principle that '''all characters in a column should be related by homology'''. This implies the following rules of thumb:
 
  
:*Remove all stretches of residues in which the alignment appears ambiguous.
+
'''Parsimony based''' phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
:*Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous.
 
:*Remove all but ~ one column from gapped regions, and all residues N- and C- terminal of the gap in which the alignment appears questionable. ( I would keep one gapped column as a placeholder for a rare and very distinct evolutionary event, rather than simply deleting them all).
 
:*Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
 
:*If your sequences are too long, you may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input.
 
  
<small>(A '''very''' useful trick with Microsoft Word is that you can select blocks of text and entire columns in the document with your mouse: hold the "ALT" key depressed while you click and drag your mouse to select. This will greatly facilitate the preparation of sequences. You can treat that selection as any other selected text, color characters, or delete them. Importantly, you can also cut and paste entire columns! Of course, this will only work as expected if you use a fixed-width font such as Courier. )</small>
 
  
The preparation of the input file of aligned residues, used by the PHYLIP package is straightforward in principle; just carefully follow the instructions in PHYLIP's well written documentation. If you plan to use an outgroup for your tree, it is a good idea to move that to the first line of your alignment, since this is where PHYLIP will look for it by default.
+
'''ML''', or '''Maximum Likelihood''' methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute intensive and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (runs about an hour on my computer). If the problem is too large, one may split a large problem into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.
  
Some notes on how to avoid common editing troubles. Copy the sequences from the link provided below. Paste them into a document, using the Word "Edit -> Paste special -> Unformatted text". Set the page-setup to "landscape", the font-size to something small, then you can put every sequence into one line. You can replace all paragraph marks ("^p") with (nothing) to remove them, then replace the FASTA header line character ">" with paragraphs ("^p") to separate them by line again. Take special note that your files must not include tab characters. You can use Word to globally replace all tabs (specified as "^t") with a blank, to make sure. Spaces count, so display your alignment in a fixed-width font, such as Courier ("Courier New" on Windows), not a proportional-width font such as Times, Arial, or Helvetica, and ensure all characters in your alignments align as they should. As always, make sure you save your input files as "Text Only".
+
ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.
  
<small>
 
A note if you are working on a '''Mac''': here MS Word will play one of its usual [http://en.wikipedia.org/wiki/Shenanigan shenanigans] on you and use the old-style OS 9 Carriage Return characters <code>(\r; ASCII 13; hex 0D; CR)</code>. Just by looking at the file this is quite invisible but they are not going to be recognized by PHYLIP or other self-respecting UNIX based programs (it may not make a difference when you paste your sequences to a Web server; but if you compute things locally it will appear to the program as though everything were in one line). And this can (and did) lead to head-banging rounds of frustration. You need to replace them with '''Linefeed''' resp. '''Newline''' characters <code>(\n; ASCII 10; hex 0A; LF)</code> and you can't even do that within Word(!). Open a UNIX terminal window and navigate to the directory where your files reside. Then type:
 
  
'''tr "\r" "\n" &lt; infile    &gt; outfile'''
+
Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.
  
... where outfile is different from infile (careful: if a file by the name of outfile already exists, '''tr''' will cheerfully overwrite it.) Alternatively you could type the following perl one-liner :
+
=====Introduction: Gaps=====
  
'''perl -e 'while(&lt;&gt;){tr/\r/\n/;print}' &lt; infile    &gt; outfile'''
+
Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an '''alignment''' program as well as the distance score of a '''phylogeny''' program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most '''alignment''' programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most '''phylogeny''' programs, (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion- and extension- characters get the same score. For short indels, this '''underestimates''' the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this '''overestimates''' the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but one or two columns of gapped sequence, or to remove such columns altogether.
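
The gap editing described above is normally done by hand in an alignment editor, but the idea can be illustrated with a short Python sketch. It assumes that a column counts as "gapped" as soon as any row has a "-" in it, and it keeps at most <code>keep</code> columns from each consecutive run of gapped columns; both the function name and this definition are choices made for the example.

```python
def collapse_gapped_columns(alignment, keep=1):
    """Keep at most `keep` columns from every run of gapped columns."""
    names = list(alignment)
    kept = []
    run = 0  # length of the current run of consecutive gapped columns
    for col in zip(*(alignment[n] for n in names)):
        if "-" in col:
            run += 1
            if run > keep:
                continue  # drop the surplus gapped column
        else:
            run = 0
        kept.append(col)
    # Reassemble the rows from the columns that were kept.
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}

# Hypothetical toy alignment with a three-column gap.
toy = {"x": "ACD---EFG", "y": "ACDKLMEFG"}
one_left = collapse_gapped_columns(toy, keep=1)
none_left = collapse_gapped_columns(toy, keep=0)
```

With <code>keep=1</code> the rows become <code>ACD-EFG</code> and <code>ACDKEFG</code>, leaving one column as a placeholder for the indel event; with <code>keep=0</code> both rows become <code>ACDEFG</code>.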
</small>
 
  
  
In your assignment submission, clearly identify the source sequences you are using, as well as the alignment method you have used. Paste your unaltered source alignment into your document, clearly highlight or otherwise color the columns that you have selected, annotate why you have selected them and paste your result as well. Here is an example of what this might look like:
+
=====Introduction: The outgroup=====
  
 +
To analyse phylogenetic trees it is useful (and for some algorithms required) to define an outgroup, a sequence that presumably diverged from all other sequences in a clade before they split up among themselves. Wherever the outgroup inserts into the tree, this is the root of the rest of the tree. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. I have defined an outgroup sequence and added it to the [[Reference APSES domains|reference APSES domains page]]. The procedure is explained in detail on that page.
  
;IMAGE
+
>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
 +
<span style="color: #999999;">MTSFQLSLISRE</span>IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
 +
FKGGRPENQGTWVHPDIAINLAQ<span style="color: #999999;">WLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
 +
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
 +
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF</span>
 +
''E. coli'' KilA-N protein. Residues that do not align with APSES domains are shown in grey.
  
Figure 1: (Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red; d: input data for PHYLIP (don't forget to include number of sequences and sequence length in the first line, read the [http://evolution.genetics.washington.edu/phylip/doc/sequence.html PHYLIP sequence format guide].)
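
The last step (d) can also be scripted. The sketch below writes a simple, non-interleaved layout with the number of sequences and the sequence length in the first line, and with each name padded to ten characters. It is an illustration only: the function name is made up, and the PHYLIP sequence format guide linked above remains the authoritative description of what the programs accept.

```python
def to_phylip(alignment):
    """Format an alignment as text: header line, then one row per sequence."""
    names = list(alignment)
    n_chars = len(alignment[names[0]])
    if any(len(alignment[n]) != n_chars for n in names):
        raise ValueError("all rows must have the same number of characters")
    labels = [n[:10] for n in names]  # only the first ten characters are used
    if len(set(labels)) != len(labels):
        raise ValueError("the first ten characters of each name must be unique")
    lines = [f" {len(names)} {n_chars}"]
    for n, label in zip(names, labels):
        # Pad the name with blanks to exactly ten characters.
        lines.append(f"{label:<10}{alignment[n]}")
    return "\n".join(lines) + "\n"

# Hypothetical two-sequence example.
text = to_phylip({"MBP1_SACCE": "ACDE", "A_1234": "AC-E"})
```

The function refuses names that are not unique in their first ten characters, which is the reason the sequence identifiers in the source file were edited.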
+
=====Preparing APSES sequences=====
  
=====Introduction: Web Service and data=====
+
<div style="padding: 5px; background: #DDDDEE;">
 +
#Navigate to the [[Reference APSES domains|reference APSES domains page]] and copy the sequences.
 +
#Open Jalview, select '''File &rarr; Input Alignment &rarr; from Textbox''' and paste the sequences into the textbox.
 +
#Add the APSES domain sequences '''from your species''' that you have defined in the previous assignment.
 +
#When all the sequences are present, click on '''New Window'''.
 +
#In Jalview, select Web Service &rarr; Alignment &rarr; MAFFT Multiple Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
 +
#Choose any colour scheme and add '''Colour &rarr; by Conservation'''. Adjust the slider left or right to see which columns are highly conserved.
 +
#Save the alignment as a Jalview project before editing it for phylogenetic analysis. You may need it again.
 +
</div>
  
You have two choices for completing the assignment: either use one of the online web services that kindly run a compute-intensive task like PHYLIP, or download and install the program at home. If you choose the former, one of your options is the [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html '''PHYLIP''' service at the Institut Pasteur]. I have tried it, and it works - however not entirely without weirdness. Uninformative errors will occur when your input is too large for the system's memory (like: "sequences not aligned" ... "out of memories" and such) but what is worse, after submitting a number of jobs, the system locked me out, asking me to wait an unspecified time until results would be sent by e-mail. Regrettably, this is not documented. If you can live with that, the integration of their services in a logical sequence is good and some of their services are a bit more advanced than plain out-of-the-box PHYLIP. If you decide to install PHYLIP, good for you. That is easy to do and well documented, and there are far fewer limitations on memory - but if you don't read and understand the instructions carefully, you are in for a spell of frustration.
 
  
Either way, I have posted typical input files and result files here, to allow you to bail out in case technical problems become overwhelming. If you use the data posted here instead of your own, you '''must''' document that fact and explain what you have tried, and why that has failed. If you fail to do that, we will deduct marks - the posted data is a fallback, not a shortcut.
+
=====Introduction: Alignment editing for phylogenetic reconstruction=====
  
In this assignment, we will use distance based tree construction methods. They represent a reasonable compromise between accuracy and speed, especially when applied to moderately dissimilar sequences. In general, they include '''two''' steps: (1) calculate a pairwise-distance matrix between sequences, (2) construct a tree, based on the matrix. Thus all the information in the alignment between a pair of sequences is collapsed into a single number: their pairwise distance. In contrast, both parsimony and ML based algorithms take individual columns into account. Parsimony based methods construct inaccurate trees when they can't make good estimates for the required number of sequence changes, if the sequences become too dissimilar. ML based methods are considered the most accurate for dissimilar sequences, however they are also very compute intensive and a full-length APSES domain alignment for 74 species can easily run for a full day on a workstation. Thus we will use distance based methods here, specifically the UPGMA method, which PHYLIP's '''neighbor''' program offers alongside neighbor joining.
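The first of the two steps can be illustrated with a minimal Python sketch. It assumes the simplest possible distance, the fraction of differing residues (the p-distance), and skips any column in which either sequence has a gap. PHYLIP's '''protdist''' uses a substitution model instead, so its numbers will differ; the function names here are illustrative only.

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing residues between two aligned sequences.

    Columns in which either sequence has a gap ('-') are skipped.
    This is the plain p-distance, not a model-corrected distance.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(seq_a, seq_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(alignment):
    """Pairwise distances for a dict of name -> aligned sequence."""
    names = list(alignment)
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Note how each pair of sequences ends up as a single number, no matter how many columns the alignment has.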
+
In practice, follow the fundamental principle that '''all characters in a column should be related by homology'''. This implies the following rules of thumb:
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
*Remove all stretches of residues in which the ''alignment'' appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
*Access one of the MSAs for '''Mbp1 proteins''', linked from the resources section at the bottom of the page. Choose an MSA that you have determined in your third assignment to be "reliable" and (briefly) justify your choice.
+
*Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains.
 +
*Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
 +
*Remove all but approximately one column from gapped regions '''in those cases where the presence of several related insertions suggest that the indel is real, and not just an alignment artefact.''' (Some researchers simply remove all gapped regions).
 +
*Remove sections N- and C- terminal of gaps where the alignment appears questionable.
 +
*Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
 +
*If your sequences are too long, your tree calculations may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory, try removing columns of sequence.
 +
*Move the KilA-N outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.  
  
*Prepare a PHYLIP formatted input file from this MSA, restricting the number of characters to no more than 60. Follow the considerations discussed above. In particular you should choose some residues from each of the three aligned regions (the APSES domains, the Ankyrin domains and the C-terminal aligned region), to represent the diversity between these proteins. Document this as described above. [fallback]
 
  
*Prepare a second PHYLIP formatted input file from this MSA, that contains only the APSES domains. [fallback]
 
  
*Using PHYLIP, calculate a distance matrix for both files. [fallback]
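Two of the rules of thumb above are mechanical enough to sketch in Python: dropping gapped columns and dropping completely conserved columns. This is only an assumption-laden helper, not a replacement for your own judgement; deciding whether an alignment region is ambiguous or whether an indel is real still has to be done by eye. The function names and the <code>max_gap_fraction</code> parameter are made up for this example.

```python
def informative_columns(rows, max_gap_fraction=0.0):
    """Return the indices of alignment columns worth keeping.

    rows: list of aligned sequences of equal length.
    A column is dropped if its fraction of gap characters exceeds
    max_gap_fraction, or if all of its characters are identical.
    """
    keep = []
    for i in range(len(rows[0])):
        column = [row[i] for row in rows]
        if column.count("-") / len(column) > max_gap_fraction:
            continue  # gapped column
        if len(set(column)) == 1:
            continue  # completely conserved, carries no distance information
        keep.append(i)
    return keep


def select_columns(rows, keep):
    """Rebuild each sequence from the selected column indices."""
    return ["".join(row[i] for i in keep) for row in rows]
```

Keeping the list of selected indices separate from the edited sequences makes it easy to document which columns you chose.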
+
[[Image:EditingGuide.jpg|frame|none|(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. '''a''': raw alignment (CLUSTAL format); '''b''': sequences assembled into single lines; '''c''': columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; '''d''': input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the [http://evolution.genetics.washington.edu/phylip/doc/sequence.html PHYLIP sequence format guide].]]
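The format requirements in the caption above can be summarized in a short Python sketch: a first line with the number of sequences and the sequence length, then one line per sequence with the name padded to exactly 10 characters. It assumes sequential (non-interleaved) input with each sequence on a single line; the function name and example names are hypothetical.

```python
def to_phylip(records):
    """Format (name, aligned_sequence) pairs as PHYLIP sequential input.

    Names are truncated or blank-padded to exactly 10 characters.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    short_names = [name[:10] for name, _ in records]
    if len(set(short_names)) != len(short_names):
        raise ValueError("names are not unique within 10 characters")
    lines = [f"{len(records)} {lengths.pop()}"]
    for short, (_, seq) in zip(short_names, records):
        lines.append(f"{short:<10}{seq}")
    return "\n".join(lines) + "\n"
```

Checking that truncated names are still unique is worth the extra lines: two sequences that differ only after the tenth character would otherwise become indistinguishable in the tree.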
  
*Using the neighbor joining algorithm - UPGMA method, construct a tree for both input files
+
;Once you are satisfied with your editing, proceed as follows:
  
*Briefly discuss whether the trees are fundamentally similar or whether there are important differences (i.e. differences in '''topology'''). If there are differences in topology, which branch would have to be moved to make the trees congruent?
+
<div style="padding: 5px; background: #DDDDEE;">
 +
#Download the PHYLIP package from the [http://evolution.genetics.washington.edu/phylip.html Phylip homepage] and install it on your computer.
 +
#Prepare a PHYLIP input file from your Jalview alignment. The simplest way to achieve this appears to be:
 +
##In Jalview, use '''File &rarr; Output to Textbox&rarr;FASTA''', then '''Edit&rarr;Select All''' and '''Edit&rarr;copy''' the sequences.
 +
##In a browser, navigate to the [http://www-bimas.cit.nih.gov/molbio/readseq/ '''Readseq sequence conversion service'''].
 +
##Paste your sequences into the form and choose '''Phylip''' as the output format. Click on '''submit'''.
 +
##Save the resulting page as a text file in the directory where the phylip executables reside on your computer. Give it some useful name such as <code>All-APSES_domains.phy</code>.
 +
#Make a copy of that file and name it <code>infile</code>. Note: make sure that your Microsoft Windows operating system does not silently append the extension ".txt" to your file. It should be called "infile", nothing else, and you should never, never, ever permit your operating system to slyly hide file extensions from you when it displays filenames. You have been warned.
 +
</div>
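If you prefer to script the last step, the following Python sketch copies your input file to a file named exactly <code>infile</code> and warns about look-alikes such as <code>infile.txt</code>. The function name is hypothetical, and the sketch assumes that the PHYLIP programs are started from the directory you pass in.

```python
import shutil
from pathlib import Path


def make_infile(source, directory):
    """Copy a PHYLIP input file to a file named exactly 'infile'."""
    target = Path(directory) / "infile"
    shutil.copyfile(source, target)
    # Report look-alikes such as 'infile.txt' that an editor may have created.
    strays = sorted(p.name for p in Path(directory).glob("infile.*"))
    if strays:
        print("Warning: PHYLIP looks for 'infile', not:", ", ".join(strays))
    return target
```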
  
*Briefly discuss whether the APSES domain tree is similar to the rRNA derived cladogram. Are there species that are not in the expected position? 
 
  
 +
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
 +
===(1.2) Calculating a Tree===
 
</div>
 
</div>
 +
 +
&nbsp;<br>
 
&nbsp;<br>
 
&nbsp;<br>
 +
<div style="padding: 5px; background: #DDDDEE;">
  
I have prepared a Phylip formatted input file for all APSES domains, I have also added the ''Shewanella denitrificans'' APSES domain as an outgroup, and I have computed an outgroup-rooted, maximum likelihood tree with default parameters. [...]
+
*Use the '''proml''' program of PHYLIP (protein sequences, maximum likelihood tree) to calculate a phylogenetic tree. Use the default parameters except that you must change option <code>S: Speedier but rougher analysis?</code> to No - your analysis should not sacrifice accuracy for speed. The calculation will take a while.
 
 
&nbsp;
 
&nbsp;
 
  
 +
</div>
  
 +
&nbsp;<br>
 +
&nbsp;<br>
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(2) Trees==
+
==(2) Analysis (2 marks)==
 
</div>
 
</div>
  
As explained above, distance-based phylogeny methods first calculate a distance value between all sequence-pairs, then use these numbers to construct a tree.  
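To illustrate the second step, here is a minimal Python sketch of UPGMA clustering, assuming the distances are given as a dictionary keyed by pairs of names. It reports only the topology as a Newick string without branch lengths, and it is not the code PHYLIP runs; the function name and input layout are assumptions made for this example.

```python
def upgma(names, dist):
    """Cluster names by UPGMA and return the topology as a Newick string.

    dist maps frozenset({a, b}) to the distance between a and b.
    Branch lengths are omitted to keep the sketch short.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Join the two clusters with the smallest distance.
        a, b = sorted(min(d, key=d.get))
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance to every other cluster: size-weighted average.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        # Forget all distances that involve the two joined clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"
```

Because UPGMA always joins the closest pair and averages distances, it implicitly assumes a molecular clock and produces a rooted tree.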
+
I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
  
 +
[[Image:FungiCladogram.jpg|frame|none|Cladogram of fungi studied in the assignments. This cladogram is based on small subunit ribosomal rRNA sequences, and largely follows ''Tehler et al.'' (2003) ''Mycol Res.'' '''107''':901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity.]]
  
&nbsp;
+
Your species may not be included in this cladogram, but you can easily calculate your own with the following procedure:
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
<div style="padding: 5px; background: #DDDDEE;">
===(2.1) The Mbp1 Gene Tree (X marks)===
+
#Access the [http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=taxonomy NCBI taxonomy database Entrez query page].
</div>
+
#Edit the list of reference species below to include your species and paste it into the form.
&nbsp;<br>
 
 
 
Use the '''protdist''' program to calculate a distance matrix for all APSES domains. Use the default parameters for all but the following:
 
:*Use the PMB matrix for similarity.
 
Use '''neighbor''' to estimate a rooted (!) tree based on your distance matrix. Use the default parameters for all but the following:
 
 
 
:*Use the ''Shewanella denitrificans'' KilA sequence as outgroup for the tree.
 
:*Use '''Drawgram''' (or any other tree-drawing program, such as Treeview) to generate a plot of the rooted (!) tree you have inferred. Paste the plot as well as the ASCII character tree from the '''outfile''' into your assignment (you would use the plot in a publication, but you can more easily edit the ASCII character tree by hand ...).
 
  
 +
"Emericella nidulans"[Scientific Name] OR
 +
"Candida albicans"[Scientific Name] OR
 +
"Neurospora crassa"[Scientific Name] OR
 +
"Saccharomyces cerevisiae"[Scientific Name] OR
 +
"Schizosaccharomyces pombe"[Scientific Name] OR
 +
"Ustilago maydis"[Scientific Name]
  
&nbsp;<br>
+
#Next, as '''Display''' option, select '''Common Tree'''.
<div style="padding: 5px; background: #EEEEEE;">
+
#Then select the '''phylip tree''' option and click '''save as''' to save the tree in Newick format.
*Task
+
#The output can be edited, and visualized in any program that reads Newick trees.
 
</div>
 
</div>
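The query list for the taxonomy form follows a simple pattern, so it can be generated from a list of species names. The Python sketch below only builds the text to paste into the form; it does not contact NCBI, and the function name is made up for this example.

```python
def taxonomy_query(species):
    """Build an Entrez taxonomy query from a list of scientific names."""
    return " OR\n".join(f'"{name}"[Scientific Name]' for name in species)


reference_species = [
    "Emericella nidulans",
    "Candida albicans",
    "Neurospora crassa",
    "Saccharomyces cerevisiae",
    "Schizosaccharomyces pombe",
    "Ustilago maydis",
]
print(taxonomy_query(reference_species))
```

Append your own species to the list before printing the query.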
&nbsp;
 
&nbsp;
 
  
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===(2.1) The APSES Domain Tree (X marks)===
+
 
 +
===(2.2) Visualizing the APSES domain Phylogenetic Tree===
 
</div>
 
</div>
&nbsp;<br>
 
  
Instruction
 
  
&nbsp;<br>
+
Once Phylip is done calculating the tree, the tree in a text format will be contained in the Phylip <code>outfile</code> - the documentation of what the program has done. Open this textfile for a first look. The tree is complicated and it can look confusing at first. The tree in Newick format is contained in the Phylip file <code>outtree</code>. Visualize it as follows:
<div style="padding: 5px; background: #EEEEEE;">
+
 
*Task
+
<div style="padding: 5px; background: #DDDDEE;">
 +
#Open <code>outtree</code> in a texteditor and copy the tree.
 +
#Visualize the tree in alternative representations:
 +
##Navigate to the [http://www.proweb.org/treeviewer/ Proweb treeviewer], paste and visualize your tree.
 +
##Navigate to the [http://www.trex.uqam.ca/index.php?action=newick&project=trex Trex-online Newick tree viewer] for an alternative view. Visualize the tree as a phylogram. You can increase the window height to keep the labels from overlapping.
 +
##In your Jalview window, choose '''File &rarr; Load associated Tree''' and load the Phylip <code>outtree</code> file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades.
 +
##Study the tree: understand what you see and what you would have expected.
 
</div>
 
</div>
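Before you paste the tree into a viewer, it can be useful to check that every sequence you expected actually appears in it. The following Python sketch lists the leaf names in a Newick string such as the contents of <code>outtree</code>. It assumes plain, unquoted names without parentheses, commas, colons or semicolons; the function name is hypothetical.

```python
import re


def leaf_names(newick):
    """Return the leaf names of a Newick tree, in order of appearance."""
    compact = "".join(newick.split())  # trees may be wrapped over several lines
    # A leaf name directly follows '(' or ','; labels after ')' are internal.
    return re.findall(r"[(,]([^(),:;]+)", compact)
```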
&nbsp;<br>
 
 
&nbsp;
 
&nbsp;
 
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+
Here are two principles that will help you make sense of the tree.
  
==(3) Analysis==
 
</div>
 
  
It is surprisingly difficult to find a comprehensive phylogenetic analysis of the fungal species for which the genomes have been sequenced, although one would assume this to be data of considerable utility for the community. I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such rRNA trees are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
+
A: '''A gene that is present in an ancestral species is inherited in all descendant species'''. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event).
  
;IMAGE
+
B: '''Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants'''; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their LCA.
  
Figure 2: The "Reference Cladogram" of fungi based on small subunit ribosomal rRNA sequences. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity. I have labeled all speciation events so you can refer to these labels in your assignment.
 
  
 +
With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help.
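Principle B can also be checked mechanically. The Python sketch below assumes a rooted tree written as nested tuples with leaf names as strings, and tests whether a group of leaves forms a monophyletic subtree, i.e. whether some branch contains exactly those leaves and no others. The gene names in the usage note are hypothetical.

```python
def leaves(tree):
    """Set of leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some subtree contains exactly the leaves in group."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)
```

For a toy tree such as <code>(("Mbp1_A", "Mbp1_B"), ("Swi4_A", "Swi4_B"))</code>, the two Mbp1 leaves form a monophyletic group, whereas <code>Mbp1_A</code> together with <code>Swi4_A</code> does not.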
  
  
 
&nbsp;
 
&nbsp;
 
&nbsp;
 
&nbsp;
 +
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===(3.1) Correspondence of Gene trees and Phylogenetic Tree (X marks)===
+
 
 +
===(2.1) The Cenancestor's APSES Domains===
 
</div>
 
</div>
&nbsp;<br>
 
  
Your ML Mbp1 tree should of course correspond to a subtree of the APSES domain NJ tree since all the Mbp1 APSES domain sequences are included there (or should have been!). In fact, we would expect all Mbp1 domains to be attached in one monophyletic group. This allows you to compare the results of both phylogenetic analysis methods.
+
Refer to your tree for the following tasks. (Please remember to include your tree in your Assignment submission - it is a result of your computational experiment. It's easiest to copy/paste the tree from the Phylip outfile, rather than copying an image from a Tree viewer). Be specific in your discussion, i.e. refer to specific branchpoints (branchpoints are numbered in the Phylip output) and OTU or gene names in your analysis (see the example below).
  
Compare your maximum likelihood Mbp1 tree to the reference cladogram. Discuss briefly which branching events in your tree appear to correspond to speciation events in the reference cladogram and which branch points appear to be incompatible. Edit the output tree to show the subtrees which are fully compatible with the reference cladogram. Which branches did you have to remove? Did you have to remove branches which proml has labelled as highly significant? What do you conclude?
 
  
 +
<div style="padding: 5px; background: #DDDDEE;">
 +
*Consider how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so.
 +
</div>
  
Your ML Mbp1 tree should of course correspond to a subtree of the APSES domain NJ tree since all the Mbp1 APSES domain sequences are included there (or should have been!). In fact, we would expect all Mbp1 domains to be attached in one monophyletic group. This allows you to compare the results of both phylogenetic analysis methods.
 
  
Compare your maximum likelihood Mbp1 tree to the subtree containing all Mbp1 APSES domains in your large NJ tree. Are the two trees compatible? Are there important differences? Assuming that the reference cladogram is correct (and it may not be), which of the two methods has yielded the better tree for Mbp1 APSES domains?
+
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
&nbsp;<br>
+
===(2.2) Unraveling your organism's APSES domains (2 marks)===
<div style="padding: 5px; background: #EEEEEE;">
 
*Task
 
 
</div>
 
</div>
 +
 
&nbsp;<br>
 
&nbsp;<br>
 
Instruction
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Task.
 
</div>
 
 
&nbsp;
 
&nbsp;
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
===(3.2) Evolutionary History of the APSES Domain (X marks)===
 
</div>
 
 
&nbsp;<br>
 
&nbsp;<br>
 +
<div style="padding: 5px; background: #FFCC99;">
 +
;Analysis (2 marks)
  
A complicated tree, such as your NJ tree for all APSES domains can look impenetrably confusing at first. Here are three principles that will help you make sense of the tree.
+
Assume that the cladogram for fungi that I have given above is correct, and that the mixed gene tree you have calculated is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to discuss briefly through what sequence of duplications and/or gene loss your organism has ended up with the APSES domains it possesses today. Make specific reference to the cladogram of species and note in particular in case some of your sequences appear to have been placed into regions of the tree where they don't seem to belong. Also note which branchpoints in the evolutionary history of your sequences correspond to speciations and which ones to duplications.
  
A: '''A gene that is present in an ancestral species is inherited in all descendant species.''' It is thus observed in all OTUs, unless it has been lost (which is a rare event). This means, if a gene is present in two widely divergent species, but in none of the other descendants of the LCA, it is possible that there is some problem with the tree (long branch attraction maybe), or the sequence has been acquired through horizontal gene transfer.
 
  
B: '''Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the genes, in all descendants'''; each of these subtrees should independently recapitulate the reference phylogenetic tree of the OTUs, up to their LCAs.
+
Note: A common confusion about cenancestral genes arises from the fact that far from all expected genes are present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, some may have been missed by you. In general you have to ask: '''given the species represented in a subclade, what is the last common ancestor of that branch'''? The expectation is that '''all''' descendants of that ancestor should be represented in that branch '''unless''' one of the above reasons why a gene might be absent applies.
  
C: '''After a gene duplication event, one of the genes evolves at a higher evolutionary rate.''' Eric Lander's group has provided spectacular evidence for this hypothesis. The expected effect of this systematically unequal rate of evolution for paralogues is that the branch point for one set of duplicates moves higher up the tree than the speciation event for the remaining part, and the branch lengths increase (because their ancestor accumulates more mutations relative to all the other sequences that remain under strict evolutionary pressure).
 
 
(Punctuated equilibrium ?)
 
 
 
With these three simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry the NJ tree apart quite nicely. A few colored pencils and a printout of the tree will help.
 
 
 
Identify monophyletic subtrees of your NJ tree that are the result of gene duplication events. You can name the subtrees by the standard name of the yeast sequence they contain. (Unless something went very wrong in the analysis, the four yeast genes should appear in reasonably distinct subtrees.) Not all subtrees may contain a yeast gene though, name the others something meaningful.
 
 
Identify the LCA of each of the subtrees above in terms of the letters in the reference cladogram. Take into account that not all branches in the subtree may be completely reliable. (For example a subtree containing all (or nearly all) species would have "A" as its LCA, a tree that contains only Gibberella, Magnaporthe and Neurospora would have "D" as its LCA.) This is not a strictly rigorous operation, since some of the branching orders may not be resolved. You need to apply reasoning to this task.
 
 
Discuss briefly how many APSES domain proteins the fungal cenancestor appears to have possessed and by which sequence of gene loss or gene duplications the APSES domains in "your" organism appear to have arisen. (This is a straightforward synthesis based on what you have done above, by referring to labelled nodes in the reference cladogram.)
 
 
Discuss briefly if there are features of the NJ tree that are systematically inconsistent with the reference cladogram, (for example that some sequences always appear more closely or more distantly related than they should be). Do you think the reference cladogram needs to be revised?
 
  
 +
If your species does not have all the genes you would expect it to have inherited from its ancestors, you MUST note that fact and attempt to explain it.
 +
</div>
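The question "what is the last common ancestor of the species in this subclade?" can be answered with a few lines of Python once the species tree is written down. The sketch below assumes the tree is stored as a child-to-parent dictionary; the node labels in the usage note are hypothetical and do '''not''' reproduce the branchpoint labels of the reference cladogram.

```python
def ancestors(node, parent):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def last_common_ancestor(species, parent):
    """Deepest node of the species tree that is an ancestor of all species."""
    species = list(species)
    common = ancestors(species[0], parent)
    for name in species[1:]:
        other = set(ancestors(name, parent))
        # Keep the leaf-to-root order so the first entry is the deepest node.
        common = [node for node in common if node in other]
    return common[0]
```

With a toy tree <code>{"sp1": "X", "sp2": "X", "X": "R", "sp3": "R"}</code>, the LCA of <code>sp1</code> and <code>sp2</code> is <code>X</code>, and the LCA of <code>sp1</code> and <code>sp3</code> is the root <code>R</code>. Every descendant of the LCA that is missing from your subclade then needs an explanation.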
  
  
 +
For example the following discussion for ''Saccharomyces cerevisiae'' would be sufficient for full marks:
 +
:(Numbers refer to branchpoints of the mixed gene tree, letters to branchpoints of the species tree). I have found five homologues to ''Saccharomyces cerevisiae'' Mbp1 and included them in the mixed gene tree. Two subclades are well defined, and contain all current species, they branch from 41 (Xbp1) and 50 (Sok2/Phd1). The subclade below 6 includes Mbp1 orthologues as well as Swi4 orthologues that do not appear well resolved. Considering only species below the ''Saccharomycetales'' branchpoint, I postulate a duplication at that branchpoint that gave rise to yeast Mbp1 and Swi4 since the respective branches contain representatives from all fungi that descended from that branch. There is no good support for the idea that the cenancestor had a Swi4 paralogue. Therefore the cenancestor most likely possessed two paralogues: Mbp1, and Sok2. ''Saccharomyces cerevisiae'' has one gene in each of the major subclades, there is no gene loss. It also has an additional paralogue to Sok2: the Phd1 gene that duplicated at branchpoint 3.
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Task
 
</div>
 
&nbsp;<br>
 
 
Instruction
 
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
 
*Task.
 
</div>
 
 
&nbsp;
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(4) Summary of Resources==
+
==(3) Summary of Resources==
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Line 333: Line 270:
 
;Links
 
;Links
 
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf '''Review (PDF, restricted)''' Sandra Baldauf: Phylogeny for the Faint of Heart]
 
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf '''Review (PDF, restricted)''' Sandra Baldauf: Phylogeny for the Faint of Heart]
:* [[Organism_list_2006|Assigned Organisms]]
 
 
:* [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP''' home page]
 
:* [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP''' home page]
:* [http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html '''PHYLIP''' Web Service at the Institut Pasteur]
 
  
 
;Sequences
 
;Sequences
:* [[All_Mbp1_proteins|'''All Mbp1 proteins''']]
+
:* [[Reference APSES domains|Reference APSES domains page]]
:* [[All_APSES_domains|'''All APSES domains''']]
 
 
 
;Alignments
 
:'''Mbp1 proteins:'''
 
:* [[All_Mbp1_CLUSTAL|Mbp1 proteins '''CLUSTAL''' aligned]]
 
:* [[All_Mbp1_MUSCLE|Mbp1 proteins '''MUSCLE''' aligned]]
 
:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
 
 
 
:'''APSES domains:'''
 
:* [[APSES_domains_PSI-BLAST|All APSES domains - alignment based on '''PSI-BLAST''' results]]
 
:* [[APSES_domains_CLUSTAL|All APSES domains -  '''CLUSTAL-W''' alignment]]
 
:* [[APSES_domains_probcons|All APSES domains -  '''probcons''' alignment]]
 
 
 
&nbsp;
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
Line 359: Line 279:
 
</div>
 
</div>
  
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2006@googlegroups.com Course Mailing List]
+
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2011@googlegroups.com Course Mailing List]

Latest revision as of 23:34, 21 September 2012

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 


   

Assignment 4 (last: 2011) - Phylogenetic Analysis

Introduction  

Nothing in Biology makes sense except in the light of evolution.
Theodosius Dobzhansky

... but does evolution make sense in the light of biology?

As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to the gene in the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?


We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 and you have identified the full complement of APSES domain genes in your assigned organism. In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.

A number of excellent tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package and the (commercial) PAUP package. Specialized tools for tree-building include Treepuzzle or Mr. Bayes. This assignment is constructed around programs that are available in PHYLIP, however you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge when computed with different algorithms, to be more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.

However: regarding algorithm and resources, we will take a shortcut in this assignment (something you should not do in real life). We will assume that the tree the algorithm constructs is correct. In "real life" you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with partial data and see which branches and groupings are robust and which depend on the details of the data. In this assignment, we should simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes we have sequenced come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.
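The resampling step of a bootstrap can be sketched in a few lines of Python. This sketch assumes the standard bootstrap, in which alignment columns are drawn at random with replacement until the replicate has the same length as the original; the function name is made up for this example, and building a tree from each replicate is left to the phylogeny program.

```python
import random


def bootstrap_replicate(rows, rng):
    """Resample alignment columns with replacement.

    rows: list of aligned sequences of equal length.
    rng:  a random.Random instance, so that runs can be reproduced.
    """
    length = len(rows[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[i] for i in picks) for row in rows]


rng = random.Random(2011)  # fixed seed, chosen arbitrarily
replicates = [bootstrap_replicate(["ACDE", "ACDF", "GCDF"], rng) for _ in range(100)]
```

A grouping that appears in nearly all of the hundred replicate trees is robust; one that appears in only half of them depends on a few columns of the data.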

Introduction: Tasks

For this assignment, we start from the APSES domains you have collected previously. You will align these domains with a set of reference domains and edit the alignment to make it suitable for phylogenetic analysis, using Jalview. Then you will construct a phylogenetic tree and interpret the tree. The goal is to identify orthologues and paralogues.

In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis (pdf) here and in the resource section at the bottom of this page.

 

Preparation, submission and due date

Read carefully.
Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Monday, November 28 at 12:00 in the morning.

   

Your documentation for the procedures you follow in this assignment will be worth 1 mark.

   

(1) Preparations

   

(1.1) Preparing Input Files

 

For this assignment, we start from the multiple sequence alignments we have constructed previously. We will edit the alignment to make it suitable for phylogenetic analysis. We will construct a phylogenetic tree and we will analyse the tree.

Introduction: Principle

In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise an alignment but to analyze evolutionary relationships, after the alignment has been determined. The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable.
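
The requirement that all rows contain the same number of characters is easy to check before you hand a file to a phylogeny program. A minimal sketch in Python, assuming the alignment is held as a dictionary of name and sequence (the function name is hypothetical):

```python
def check_alignment(alignment):
    """Return the common row length, or raise ValueError if rows differ.

    alignment: dict mapping sequence name -> aligned sequence.
    """
    lengths = {name: len(seq) for name, seq in alignment.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"rows differ in length: {lengths}")
    # An empty alignment has length 0 by convention.
    return next(iter(lengths.values()), 0)
```

Note that this only checks the shape of the input; whether the characters in a column are actually homologous is something only you can judge.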

The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.


Distance based phylogeny programs start by using sequence comparisons to estimate evolutionary distances:

  • they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
  • this score is stored in a "distance matrix" ...
  • ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).

They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
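
To make the idea of a distance matrix concrete, here is a minimal Python sketch. As a simplifying assumption it uses the plain fraction of differing residues (the so-called p-distance) instead of a mutation data matrix, and it skips columns in which either sequence has a gap; real programs apply a proper model of evolution. All names and the toy data are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions, skipping columns with a gap in either row."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Return a dict mapping (name1, name2) -> distance, symmetric, zero diagonal."""
    names = list(alignment)
    dist = {(n, n): 0.0 for n in names}
    for n1, n2 in combinations(names, 2):
        d = p_distance(alignment[n1], alignment[n2])
        dist[(n1, n2)] = dist[(n2, n1)] = d
    return dist

# Toy alignment (hypothetical data, for illustration only).
toy = {"seqA": "MKLVE", "seqB": "MKIVE", "seqC": "MRLVD"}
dist = distance_matrix(toy)
```

Note that each pair of sequences is reduced to a single number here; this is the information loss that the other methods try to avoid.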


Parsimony based phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
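
Counting the minimum number of changes for a given tree can be sketched with Fitch's algorithm. The Python code below is a minimal illustration under stated assumptions: the tree is fixed, binary, and written as nested 2-tuples of leaf names, and all changes cost the same. A real parsimony program additionally searches over many candidate trees. Names and toy data are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one column.

    tree: a leaf name (str) or a 2-tuple of subtrees.
    states: dict mapping leaf name -> character observed in this column.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The subtrees can agree on a state: no extra change is needed here.
        return common, left_cost + right_cost
    # No shared state: one change must have happened below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
               for i in range(length))

# Toy alignment and two candidate trees (hypothetical data).
toy = {"A": "MKL", "B": "MKL", "C": "MRI", "D": "MRI"}
score_1 = parsimony_score((("A", "B"), ("C", "D")), toy)
score_2 = parsimony_score((("A", "C"), ("B", "D")), toy)
```

In this toy case the first tree needs fewer changes than the second, so parsimony would prefer it.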


ML, or Maximum Likelihood methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute intensive and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (runs about an hour on my computer). If the problem is too large, one may split it into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.

ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.


Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.

Introduction: Gaps

Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most phylogeny programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion- and extension- characters get the same score. For short indels, this underestimates the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but one or two columns of gapped sequence, or to remove such columns altogether.
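
If you prefer to remove gapped columns programmatically rather than by hand, the idea can be sketched as follows. This Python snippet is an illustration only: the function name, the threshold parameter and the toy data are hypothetical, and the choice of threshold is a judgment call that the code cannot make for you.

```python
def drop_gapped_columns(alignment, max_gap_fraction=0.0):
    """Keep only columns whose fraction of "-" characters is at most the threshold.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    max_gap_fraction: 0.0 removes every column that contains any gap.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    keep = [i for i, column in enumerate(zip(*rows))
            if column.count("-") / len(column) <= max_gap_fraction]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}

# Toy alignment (hypothetical data, for illustration only).
toy = {"a": "MK--LV", "b": "MKA-LV", "c": "MK--LI"}
strict = drop_gapped_columns(toy)        # removes all gapped columns
lenient = drop_gapped_columns(toy, 0.7)  # removes only columns that are mostly gaps
```

Removing columns by a fixed rule is cruder than editing in Jalview, where you can also judge whether an indel looks real or is an alignment artefact.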


Introduction: The outgroup

To analyse phylogenetic trees it is useful (and for some algorithms required) to define an outgroup, a sequence that presumably diverged from all other sequences in a clade before they split up among themselves. Wherever the outgroup inserts into the tree, this is the root of the rest of the tree. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. I have defined an outgroup sequence and added it to the reference APSES domains page. The procedure is explained in detail on that page.

>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
MTSFQLSLISREIDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
FKGGRPENQGTWVHPDIAINLAQWLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF

E. coli KilA-N protein. Residues that do not align with APSES domains are shown in grey.

Preparing APSES sequences
  1. Navigate to the reference APSES domains page and copy the sequences.
  2. Open Jalview, select File → Input Alignment → from Textbox and paste the sequences into the textbox.
  3. Add the APSES domain sequences from your species that you have defined in the previous assignment.
  4. When all the sequences are present, click on New Window.
  5. In Jalview, select Web Service → Alignment → MAFFT Multiple Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
  6. Choose any colour scheme and add Colour → by Conservation. Adjust the slider left or right to see which columns are highly conserved.
  7. Save the alignment as a Jalview project before editing it for phylogenetic analysis. You may need it again.


Introduction: Alignment editing for phylogenetic reconstruction

In practice, follow the fundamental principle that all characters in a column should be related by homology. This implies the following rules of thumb:

  • Remove all stretches of residues in which the alignment appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
  • Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains.
  • Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
  • Remove all but approximately one column from gapped regions in those cases where the presence of several related insertions suggest that the indel is real, and not just an alignment artefact. (Some researchers simply remove all gapped regions).
  • Remove sections N- and C- terminal of gaps where the alignment appears questionable.
  • Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
  • If your sequences are too long, your tree calculations may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory, try removing columns of sequence.
  • Move the KilA-N outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.


(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; d: input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input; read the PHYLIP sequence format guide.
Once you are satisfied with your editing, proceed as follows:
  1. Download the PHYLIP package from the Phylip homepage and install it on your computer.
  2. Prepare a PHYLIP input file from your Jalview alignment. The simplest way to achieve this appears to be:
    1. In Jalview, use File → Output to Textbox→FASTA, then Edit→Select All and Edit→copy the sequences.
    2. In a browser, navigate to the Readseq sequence conversion service.
    3. Paste your sequences into the form and choose Phylip as the output format. Click on submit.
    4. Save the resulting page as a text file in the directory where the phylip executables reside on your computer. Give it some useful name such as All-APSES_domains.phy.
  3. Make a copy of that file and name it infile. Note: make sure that your Microsoft Windows operating system does not silently append the extension ".txt" to your file. It should be called "infile", nothing else, and you should never, never, ever permit your operating system to slyly hide file extensions from you when it displays filenames. You have been warned.
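
As an alternative to the web conversion, the format described in the figure caption above (first line: number of sequences and sequence length; each following line: a name of at most 10 characters, padded with blanks, then the sequence) is simple enough to write yourself. The following Python sketch assumes that you have already shortened your sequence names to 10 characters or less and that each sequence goes on a single line; function names and toy data are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of name -> sequence.

    The name is the first word of each header line.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def to_phylip(alignment, outgroup=None):
    """Return the alignment as PHYLIP text, one sequence per line.

    If an outgroup name is given, that sequence is written first.
    """
    names = list(alignment)
    if outgroup is not None:
        names.remove(outgroup)   # raises ValueError if the name is absent
        names.insert(0, outgroup)
    lengths = {len(alignment[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    too_long = [name for name in names if len(name) > 10]
    if too_long:
        raise ValueError(f"names longer than 10 characters: {too_long}")
    lines = [f" {len(names)} {lengths.pop()}"]
    for name in names:
        # Pad the name to exactly 10 characters, then append the sequence.
        lines.append(f"{name:<10}{alignment[name]}")
    return "\n".join(lines) + "\n"

# Toy input (hypothetical names and data, for illustration only).
fasta = ">MBP1_SACCE\nMKLV\n>KILA_ESCCO outgroup\nMR\nLV\n"
phylip_text = to_phylip(parse_fasta(fasta), outgroup="KILA_ESCCO")
```

To use the result, you would write phylip_text to a file called infile in the PHYLIP directory, e.g. with open("infile", "w").write(phylip_text).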


(1.2) Calculating a Tree

 
 

  • Use the proml program of PHYLIP (protein sequences, maximum likelihood tree) to calculate a phylogenetic tree. Use the default parameters except that you must change option S: Speedier but rougher analysis? to No - your analysis should not sacrifice accuracy for speed. The calculation will take a while.

 
 

(2) Analysis (2 marks)

I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.

Cladogram of fungi studied in the assignments. This cladogram is based on small subunit ribosomal RNA sequences, and largely follows Tehler et al. (2003) Mycol Res. 107:901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity.

Your species may not be included in this cladogram, but you can easily calculate your own with the following procedure:

  1. Access the NCBI taxonomy database Entrez query page.
  2. Edit the list of reference species below to include your species and paste it into the form.
"Emericella nidulans"[Scientific Name] OR
"Candida albicans"[Scientific Name] OR
"Neurospora crassa"[Scientific Name] OR
"Saccharomyces cerevisiae"[Scientific Name] OR
"Schizosaccharomyces pombe"[Scientific Name] OR
"Ustilago maydis"[Scientific Name]
  3. Next, as Display option, select Common Tree.
  4. Then select the phylip tree option and click save as to save the tree in Newick format.
  5. The output can be edited, and visualized in any program that reads Newick trees.
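
If you work with more than a handful of species, you can assemble the query string for the taxonomy form instead of typing it. A minimal Python sketch (the function name is hypothetical; the query syntax is the one shown in the list above):

```python
def taxonomy_query(species):
    """Join species names into an Entrez query using the [Scientific Name] field."""
    return " OR ".join(f'"{name}"[Scientific Name]' for name in species)

# Two of the reference species from the list above.
query = taxonomy_query(["Candida albicans", "Ustilago maydis"])
```

The resulting string is pasted into the form in the same way as the list above.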


(2.2) Visualizing the APSES domain Phylogenetic Tree


Once Phylip is done calculating the tree, the tree in a text format will be contained in the Phylip outfile - the documentation of what the program has done. Open this textfile for a first look. The tree is complicated and it can look confusing at first. The tree in Newick format is contained in the Phylip file outtree. Visualize it as follows:

  1. Open outtree in a texteditor and copy the tree.
  2. Visualize the tree in alternative representations:
    1. Navigate to the Proweb treeviewer, paste and visualize your tree.
    2. Navigate to the Trex-online Newick tree viewer for an alternative view. Visualize the tree as a phylogram. You can increase the window height to keep the labels from overlapping.
    3. In your Jalview window, choose File → Load associated Tree and load the Phylip outtree file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades.
    4. Study the tree: understand what you see and what you would have expected.
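
If you would like to inspect the Newick tree with a script rather than only in a viewer, a small parser is sufficient. The Python sketch below is a minimal illustration under stated assumptions: labels are unquoted (as in PHYLIP output), and branch lengths and internal labels are simply discarded. The function names and the example tree are hypothetical.

```python
import re

DELIMITERS = {"(", ")", ",", ";"}

def parse_newick(text):
    """Parse a Newick string into nested lists of leaf names."""
    text = re.sub(r"\s+", "", text)   # outtree may be wrapped over several lines
    tokens = re.findall(r"[(),;]|[^(),;]+", text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                              # consume "("
            children = [node()]
            while tokens[pos] == ",":
                pos += 1                          # consume ","
                children.append(node())
            if tokens[pos] != ")":
                raise ValueError("expected ')' in Newick string")
            pos += 1                              # consume ")"
            # Skip an optional internal label or branch length such as ":0.12".
            if pos < len(tokens) and tokens[pos] not in DELIMITERS:
                pos += 1
            return children
        label = tokens[pos]
        pos += 1
        return label.split(":")[0]                # drop the branch length

    return node()

def leaves(tree):
    """Return the leaf names of a nested-list tree, in order."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]

# Example tree (hypothetical), unrooted with three branches at the top level.
tree = parse_newick("((A:0.1,B:0.2):0.05,\n(C:0.3,D:0.4):0.1,E:0.5);")
```

A quick sanity check is to compare leaves(tree) against the sequence names in your input file: every OTU should appear exactly once.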

Here are two principles that will help you make sense of the tree.


A: A gene that is present in an ancestral species is inherited in all descendant species. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event).

B: Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their LCA.


With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help.
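
Whether a group of OTUs forms a monophyletic subtree can also be checked with a few lines of code. The Python sketch below assumes a rooted tree written as nested lists of leaf names; on an unrooted tree the answer depends on where you place the root, e.g. at the outgroup. The function names, OTU names and tree topology are hypothetical and only serve as an illustration.

```python
def collect_clades(tree, found):
    """Add the leaf set below every node of the tree to found; return this node's set."""
    if isinstance(tree, str):
        leafset = frozenset([tree])
    else:
        leafset = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.add(leafset)
    return leafset

def is_monophyletic(tree, otus):
    """True if some node of the tree has exactly the given OTUs below it."""
    found = set()
    collect_clades(tree, found)
    return frozenset(otus) in found

# Hypothetical rooted gene tree with an outgroup and two paralogue subtrees.
gene_tree = [[["MBP1_SACCE", "MBP1_CANAL"], ["SWI4_SACCE", "SWI4_CANAL"]],
             "KILA_ESCCO"]
orthologues_ok = is_monophyletic(gene_tree, {"MBP1_SACCE", "MBP1_CANAL"})
mixed_group = is_monophyletic(gene_tree, {"MBP1_SACCE", "SWI4_SACCE"})
```

In this toy tree the two Mbp1 sequences form a clade, whereas the two genes of one species do not, which is what principle B leads you to expect after a duplication in an ancestor.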


   


(2.1) The Cenancestor's APSES Domains

Refer to your tree for the following tasks. (Please remember to include your tree in your Assignment submission - it is a result of your computational experiment. It's easiest to copy/paste the tree from the Phylip outfile, rather than copying an image from a Tree viewer). Be specific in your discussion, i.e. refer to specific branchpoints (branchpoints are numbered in the Phylip output) and OTU or gene names in your analysis (see the example below).


  • Consider how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so.


(2.2) Unraveling your organism's APSES domains (2 marks)

 
 

Analysis (2 marks)

Assume that the cladogram for fungi that I have given above is correct, and that the mixed gene tree you have calculated is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to discuss briefly through what sequence of duplications and/or gene loss your organism has ended up with the APSES domains it possesses today. Make specific reference to the cladogram of species and note in particular in case some of your sequences appear to have been placed into regions of the tree where they don't seem to belong. Also note which branchpoints in the evolutionary history of your sequences correspond to speciations and which ones to duplications.


Note: A common confusion about cenancestral genes arises from the fact that far from all expected genes are present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, some may have been missed by you. In general you have to ask: given the species represented in a subclade, what is the last common ancestor of that branch? The expectation is that all descendants of that ancestor should be represented in that branch unless one of the above reasons why a gene might be absent applies.


If your species does not have all the genes you would expect it to have inherited from its ancestors, you MUST note that fact and attempt to explain it.


For example, the following discussion for Saccharomyces cerevisiae would be sufficient for full marks:

(Numbers refer to branchpoints of the mixed gene tree, letters to branchpoints of the species tree). I have found five homologues to Saccharomyces cerevisiae Mbp1 and included them in the mixed gene tree. Two subclades are well defined, and contain all current species, they branch from 41 (Xbp1) and 50 (Sok2/Phd1). The subclade below 6 includes Mbp1 orthologues as well as Swi4 orthologues that do not appear well resolved. Considering only species below the saccharomycetales branchpoint, I postulate a duplication at that branchpoint that gave rise to yeast Mbp1 and Swi4 since the respective branches contain representatives from all fungi that descended from that branch. There is no good support for the idea that the cenancestor had a Swi4 paralogue. Therefore the cenancestor most likely possessed two paralogues: Mbp1, and Sok2. Saccharomyces cerevisiae has one gene in each of the major subclades, there is no gene loss. It also has an additional paralogue to Sok2: the Phd1 gene that duplicated at branchpoint 3.


(3) Summary of Resources

 

Links
Sequences

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List.