BIO Assignment 4 2011
Assignment 4 - Phylogenetic Analysis
Please note: This assignment is currently inactive. Unannounced changes may be made at any time.
Introduction
- Nothing in Biology makes sense except in the light of evolution.
- Theodosius Dobzhansky
... but does evolution make sense in the light of biology? As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet - looking at orthologues - this is not always a clear one-to-one mapping of related genes to each other. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, this may be warranted. But what if that gene has duplicated in one of them, and the two paralogues now perform different, related functions in one organism? In order to be able to even ask such questions, we need to understand how we can make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a group possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And how did the species benefit from this event?
We will develop some of this kind of analysis in this assignment. In the previous assignment you have established which genes are the reciprocally most closely related orthologues to Mbp1 in yeast. In this assignment, we will analyse their evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.
A number of good tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package and the (commercial) PAUP package. Specialized tools for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfil a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume results that converge, independent of the algorithm, to be more reliable than those that depend strongly on a particular algorithm or details of input data.
But regarding algorithm and resources: we will take two shortcuts in this assignment (and both shortcuts are things you should not do in real life):
One: we will use an efficient tree-building algorithm, not the best-available one. This is an algorithm which is available on the Web, without the need for you to install software on your own machine. In real life you would of course use the most accurate algorithm you can lay your hands on, regardless of the resources this requires, since it makes no sense to waste your time on a careful analysis of inaccurate trees. Your supervisor would want it so as well. And if not she, the reviewers of your manuscript.
Two: we will assume the tree the algorithm constructs is correct. In real life you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with resampled data (alignment columns drawn at random, with replacement) and see which branches and groupings are robust and which depend on the details of the data. But we should still acknowledge that bifurcations that are very close to each other have not been "resolved". Any conscientious reviewer would flag such leniency and send your results back to you for a bootstrapping exercise at the computer. In phylogenetic analysis, not all lines that the program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, use your reasoning, and analyse them critically.
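The resampling step of a bootstrap can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure of any particular package; the function name bootstrap_columns and the toy sequences are made up for this example.

```python
import random

def bootstrap_columns(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    Columns are drawn with replacement, so some columns appear several
    times and others not at all; the replicate keeps the original length.
    """
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

# Toy alignment (hypothetical data, for illustration only)
toy = {"MBP1_SACCE": "KRTRVWAKEG", "A_EXAMPLE": "KRTKVWSKDG"}
rng = random.Random(42)  # fixed seed so the run is reproducible
replicates = [bootstrap_columns(toy, rng) for _ in range(100)]
```

Each replicate would then be passed to the tree-building program; the fraction of replicate trees that contain a given grouping is the bootstrap support for that grouping.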
In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked two excellent and very understandable introduction-level articles on phylogenetic analysis in the resource section at the bottom of this page.
Preparation, submission and due date
Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.
Prepare a Microsoft Word document with a title page that contains:
- your full name
- your Student ID
- your e-mail address
- the organism name you have been assigned
Follow the steps outlined below. You are encouraged to write your answers in short answer form or point form, like you would document an analysis in a laboratory notebook. However, you must
- document what you have done,
- note what Web sites and tools you have used,
- paste important data sequences, alignments, information etc.
If you do not document the process of your work, we will deduct marks. Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formatting. Do not paste screendumps. Keep the size of your submission below 1.5 MB.
Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
A3_family name.given name.doc
(for example my first assignment would be named: A3_steipe.boris.doc - and don't switch the order of your given name and family name please!)
Finally e-mail the document to [boris.steipe@utoronto.ca] before the due date.
Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
With the number of students in the course, we have to economize on processing the assignments. Thus we will not accept assignments that are not prepared as described above. If you have technical difficulties, contact me.
The due date for the assignment is XXXXX at 10:00 in the morning.
Grading
Don't wait until the last day to find out there are problems! The assignment is excellent preparation for the exam, so even if it's due later, it's a good idea to do it earlier. Assignments that are received past the due date will have one mark deducted at the first minute of every twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed. If you need an extension, you must arrange this beforehand.
Marks are noted below in the section headings of the tasks. A total of 10 marks will be awarded if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will
- count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
- be divided by two for BCH1441 (graduates).
(1) Preparations
(1.1) Tools (X marks)
Instruction
- Task
Instruction
- Task.
(1.2) Preparing Input Files (X marks)
Introduction: Task
For this assignment, you will need a file of source data (linked from the resource section at the bottom of this page). It is very similar to the files from the previous assignment: it contains the orthologous Mbp1 sequences as well as the sequences for all APSES domains in fungi. I have edited the sequence identifiers to tell us something about the gene they are taken from. In particular, I have given each yeast gene its standard name (e.g. MBP1_SACCE) and named each gene from another organism with an arbitrary "A", "B", "C" ... to make sure the first ten characters are unique (since these first ten characters will be used and displayed by PHYLIP). This is then followed by the gi number in all cases, so it should be easy for you to retrieve the actual sequences from NCBI in case you need to. I have also omitted sequences from organisms we are no longer considering.
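If you rename or add sequences yourself, it is worth confirming that the first ten characters of every identifier are still unique. A minimal sketch of such a check follows; the function name and the example identifiers (apart from MBP1_SACCE) are invented for illustration.

```python
from collections import Counter

def duplicated_prefixes(names, width=10):
    """Return the name prefixes of length `width` that occur more than once."""
    counts = Counter(name[:width] for name in names)
    return sorted(prefix for prefix, n in counts.items() if n > 1)

# Hypothetical identifiers; the numeric suffixes are placeholders, not real gi numbers.
names = ["MBP1_SACCE_0000001", "A_EXAMPLE_0000002", "A_EXAMPLE_0000003"]
clashes = duplicated_prefixes(names)  # the two A_EXAMPLE names share a prefix
```

An empty result means every sequence will be displayed under a distinct ten-character name.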
Introduction: Principle
In order to use these sequences for the estimation of phylogenetic trees, you have to build a multiple alignment first, then edit it. Most importantly, all sequences have to be edited to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise your alignment but to analyse evolutionary relationships, given the alignment.
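The requirement that all sequences contain the exact same number of characters is easy to verify before you hand the file to a phylogeny program. The following is a small sketch under the assumption that the alignment is held as a Python dictionary; check_alignment is a hypothetical helper name.

```python
def check_alignment(alignment):
    """Return the common length of all aligned sequences.

    alignment: dict mapping sequence name -> aligned sequence.
    Raises ValueError if the alignment is empty or the lengths differ.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    lengths = {name: len(seq) for name, seq in alignment.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"unequal sequence lengths: {lengths}")
    return next(iter(lengths.values()))
```

The check only confirms equal length; whether the characters in each column are actually homologous still has to be judged by eye.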
The result of the tree estimation is a decision about likely relationships: fundamentally, all the programs do is to decide which sequences had common ancestors. The phylogeny programs have a way to convert sequence comparisons into evolutionary distances (applying a model of evolution such as a mutation data matrix, calculating one number for each pair of sequences and using that to estimate a tree). Alternatively you can find trees that are most compatible with the observed sequences and the specific model of evolutionary change through point-mutations (either by grouping together the most highly related sequences (NJ, Neighbor Joining), or by minimizing the number of mutation events over the tree (Parsimony) or by finding the tree for which the observed sequences would be the most likely (ML, Maximum Likelihood)). Clearly, in order for this to work, you must not include fragments of sequence which have evolved under a totally different evolutionary model, such as domain fusion, or insertion/deletion of residues. The goal is not to be as comprehensive and complete as possible but to input the columns of aligned residues that will best represent the phylogenetic relationships between the sequences.
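To make "one number for each pair of sequences" concrete, here is a sketch that uses the simplest possible measure: the fraction of aligned positions that differ (often called the p-distance). This is an assumption chosen for illustration; real programs apply a model of evolution on top of the raw comparison. The sequences and function names are made up.

```python
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq_a, seq_b))
    return differences / len(seq_a)

def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its p-distance."""
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m, n in combinations(sorted(alignment), 2)}

# Toy alignment (hypothetical data, for illustration only)
toy = {"seqA": "KRTRVWAKEG", "seqB": "KRTKVWSKDG", "seqC": "KRTRVWAKDG"}
dist = distance_matrix(toy)
```

In this toy example seqA and seqC differ in one column out of ten, whereas seqA and seqB differ in three, so a distance method would group seqA with seqC first.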
Introduction: Problems
Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of values from aligned columns of characters. Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. Most phylogeny programs (such as the programs in PHYLIP) do not work in this way though. PHYLIP strictly operates on columns of characters and treats a gap character like a residue with the one letter code "-". This underestimates the distance between gapped sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. When there are unambiguous gaps, one might be tempted to fudge the alignment by inserting matching characters into sequences that are ungapped (e.g. five "A"s each into the ungapped sequences and five "-" each into the gapped sequences); however, I would caution against this approach since it possibly introduces even more non-obvious implicit assumptions and potential for error.
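The effect of shared gaps can be seen with a toy calculation. The sketch below compares a plain fraction-of-differences score that treats "-" like any other character with one that skips gapped columns. Neither is PHYLIP's actual scoring, which applies an evolutionary model; the sequences and function names are invented for illustration.

```python
def distance_gaps_as_characters(a, b):
    """Fraction of differing columns, with '-' compared like any residue."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_ignoring_gap_columns(a, b):
    """Fraction of differing columns among those where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

# Two hypothetical sequences that share a five-column gap
s1 = "KR-----WAK"
s2 = "KT-----WSK"
with_gaps = distance_gaps_as_characters(s1, s2)       # 2 of 10 columns differ
without_gaps = distance_ignoring_gap_columns(s1, s2)  # 2 of 5 columns differ
```

The shared stretch of "-" characters counts as five identical columns in the first score and halves the apparent distance, which is exactly the unjustified similarity described above.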
Introduction: Practice
In practice, follow the fundamental principle that all characters in a column should be related by homology. This implies the following rules of thumb:
- Remove all stretches of residues in which the alignment appears ambiguous.
- Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous.
- Remove all but about one column from gapped regions, and all residues N- and C-terminal of the gap in which the alignment appears questionable. (I would keep one gapped column as a placeholder for a rare and very distinct evolutionary event, rather than simply deleting them all.)
- Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
- If your sequences are too long, you may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input.
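Some of these rules of thumb can be supported by a simple script that labels each column, which may help you decide what to highlight. This is a rough sketch only: it cannot judge whether an alignment is ambiguous, which remains your call. The function name, the labels and the toy sequences are assumptions made for this example.

```python
def classify_columns(sequences):
    """Label every column of an alignment.

    sequences: list of aligned strings of equal length.
    Returns one label per column: "gap", "conserved",
    "all different" or "informative".
    """
    labels = []
    for column in zip(*sequences):
        residues = set(column)
        if "-" in residues:
            labels.append("gap")            # at least one sequence is gapped here
        elif len(residues) == 1:
            labels.append("conserved")      # identical in every sequence
        elif len(residues) == len(column):
            labels.append("all different")  # no two sequences agree
        else:
            labels.append("informative")
    return labels

# Toy alignment (hypothetical data, for illustration only)
toy = ["KRT-A", "KKT-S", "KRSAG"]
labels = classify_columns(toy)
```

For the toy alignment the first column is conserved, the second and third are informative, the fourth contains gaps and the fifth is different in every sequence.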
(A very useful trick with Microsoft Word is that you can select blocks of text and entire columns in the document with your mouse: hold the "ALT" key depressed while you click and drag your mouse to select. This will greatly facilitate the preparation of sequences. You can treat that selection as any other selected text, color characters, or delete them. Importantly, you can also cut and paste entire columns! Of course, this will only work as expected if you use a fixed-width font such as Courier.)
The preparation of the input file of aligned residues, used by the PHYLIP package is straightforward in principle; just carefully follow the instructions in PHYLIP's well written documentation. If you plan to use an outgroup for your tree, it is a good idea to move that to the first line of your alignment, since this is where PHYLIP will look for it by default.
Some notes on how to avoid common editing troubles. Copy the sequences from the link provided below. Paste them into a document, using the Word "Edit -> Paste special -> Unformatted text". Set the page-setup to "landscape", the font-size to something small, then you can put every sequence into one line. You can replace all paragraph marks ("^p") with (nothing) to remove them, then replace the FASTA header line character ">" with paragraphs ("^p") to separate them by line again. Take special note that your files must not include tab characters. You can use Word to globally replace all tabs (specified as "^t") with a blank, to make sure. Spaces count, so display your alignment in a fixed-width font, such as Courier ("Courier New" on Windows), not a proportional-width font such as Times, Arial, or Helvetica, and ensure all characters in your alignments align as they should. As always, make sure you save your input files as "Text Only".
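If you prefer a script over search-and-replace in Word, the same clean-up (one sequence per line, no tabs) can be sketched in Python. The function name is hypothetical, and the sketch assumes plain FASTA input in which each header line starts with ">" followed by a name.

```python
def read_fasta_single_line(text):
    """Parse FASTA text into {name: sequence}, joining wrapped lines.

    The name is the first word after ">". Tabs and blanks inside
    sequence lines are dropped. Lines before the first header are ignored.
    """
    records = {}
    name = None
    for line in text.splitlines():  # copes with \n, \r and \r\n line endings
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += "".join(line.split())
    return records

# Hypothetical input, for illustration only
example = ">seqA some description\nKRT\tRV\nWAK\n>seqB\nKRTKV\n"
records = read_fasta_single_line(example)
```

Each value in the result is a single unbroken string, so printing name and sequence side by side gives the one-line-per-sequence layout described above.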
A note if you are working on a Mac: here MS Word will play one of its usual shenanigans on you and use the old-style OS 9 Carriage Return characters (\r; ASCII 13; hex 0D; CR), and these are not going to be recognized by PHYLIP or other self-respecting UNIX based programs (it may not make a difference when you paste your sequences to a Web server; but if you compute things locally it will appear to the program as though everything were in one line). You need to replace them with Linefeed (Newline) characters (\n; ASCII 10; hex 0A; LF), and you can't even do that within Word(!). Open a UNIX terminal window and navigate to the directory where your files reside. Then type:
tr "\r" "\n" < infile > outfile
... where outfile is different from infile (careful: if a file by the name of outfile already exists, the output redirection will cheerfully overwrite it). Alternatively you could type the following perl one-liner:
perl -e 'while(<>){tr/\r/\n/;print}' < infile > outfile
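If neither tr nor perl is at hand, the same replacement can be sketched in Python. Unlike the two commands above, this version also turns Windows-style CR+LF pairs into a single LF, which is a design choice made for this example; normalize_newlines is a made-up name.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Usage with files (infile and outfile are placeholder names, as above):
# with open("infile", "rb") as src, open("outfile", "wb") as dst:
#     dst.write(normalize_newlines(src.read()))
```

Working on bytes rather than text avoids any automatic newline translation by Python itself.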
In your assignment submission, clearly identify the source sequences you are using, as well as the alignment method you have used. Paste your unaltered source alignment into your document, clearly highlight or otherwise color the columns that you will delete, annotate why you have deleted them and paste your result as well. Here is an example of what this might look like:
- IMAGE
Figure 1: (Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red; d: input data for PHYLIP (don't forget to include number of sequences and sequence length in the first line, read the PHYLIP sequence format guide.)
- Access the sequence file for "ALL APSES" domains, linked from the resources section at the bottom of the page. Prepare a multiple sequence alignment of the domains.
- Prepare a PHYLIP formatted input file from your multiple alignment of the APSES domains, following the considerations discussed above.
- Prepare a second PHYLIP formatted input file for the Mbp1 orthologous sequences. You can either align and edit the Mbp1 sequences separately, or you can take the Mbp1 sequences from the first PHYLIP input.
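For orientation, writing the edited alignment in the layout sketched in Figure 1d (number of sequences and sequence length on the first line, then a ten-character name followed by the sequence on each line) could look as follows. This is a minimal sketch with a hypothetical function name; check the result against the PHYLIP sequence format guide before you rely on it.

```python
def to_phylip(alignment):
    """Format {name: aligned sequence} as PHYLIP text, one sequence per line.

    First line: number of sequences and alignment length.
    Each following line: the name cut or padded to exactly ten
    characters, immediately followed by the sequence.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"

# Toy alignment (hypothetical data, for illustration only)
phylip_text = to_phylip({"MBP1_SACCE": "KRTRV", "A_EX": "KRTKV"})
```

Dictionaries keep their insertion order, so if you plan to use an outgroup, insert it first and it will end up on the first sequence line.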
Instruction
- Task.
(2) Trees
(2.1) The Mbp1 Gene Tree (X marks)
Instruction
- Task
(2.2) The APSES Domain Tree (X marks)
Instruction
- Task
(3) Analysis
(3.1) Correspondence of Gene trees and Phylogenetic Tree (X marks)
Instruction
- Task
Instruction
- Task.
(3.2) Evolutionary History of the APSES Domain (X marks)
Instruction
- Task
Instruction
- Task.
(4) Summary of Resources
- Links
- Sequences
- Alignments
- Mbp1 proteins:
- APSES domains:
[End of assignment]
If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List