Difference between revisions of "BIO Assignment 4 2011"

From "A B C"
 
(10 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div style="padding: 5px; background: #FF4560;  border:solid 2px #000000;">
+
<!-- {{Template:Inactive}} -->
'''Note!'''
+
{{Template:Active}}
This assignment is currently inactive. Major and minor unannounced changes may be made at any time.
 
</div>
 
&nbsp;
 
  
&nbsp;
 
  
  
Line 13: Line 9:
  
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
 
<div style="padding: 5px; background: #A6AFD0;  border:solid 1px #AAAAAA; font-size:200%;font-weight:bold;">
Assignment 4 - Homology modeling
+
Assignment 4 (last: 2011) - Phylogenetic Analysis
 
</div>
 
</div>
  
<!-- '''Please note: This assignment is currently active. All significant changes will be announced on the course mailing list.'''
+
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
+
Introduction
&nbsp;-->
 
 
 
<div style="padding: 15px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
;How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
 
::''<small>Max Perutz (on his first glimpse of the Hemoglobin structure)</small>''
 
</div>
 
 
&nbsp;
 
&nbsp;
&nbsp;
 
 
Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and looked at how these domains have evolved over time. We have seen that this is an ancient family that already had several members in the cenancestor of all fungi, an organism that lived in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html Vendian period] of the Proterozoic era of Precambrian times, more than 600,000,000 years ago.
 
 
In order to understand how particular residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to consider an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. In particular, it would be interesting to correlate the conservation patterns we have observed in the MSAs with specific DNA binding interactions. Unfortunately, the 1MB1 structure does not have DNA bound and the evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to define the details of how a DNA double helix might be bound. These details would require the structure of a complex that contains protein as well as DNA. No such complex of an APSES domain has yet been crystallized.
 
  
''In this assignment you will construct a molecular model of the Mbp1 orthologue in your assigned organism, identify similar structures of distantly related domains for which protein-DNA complexes are known, define whether the available evidence allows you to distinguish between different modes of ligand binding, and assemble a hypothetical complex structure.''
+
;Nothing in Biology makes sense except in the light of evolution.
 
+
:''Theodosius Dobzhansky''
For the following, please remember the following terminology:
 
 
 
;Target
 
:The protein that you are planning to model.
 
;Template
 
:The protein whose structure you are using as a guide to build the model.
 
;Model
 
:The structure that results from the modeling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
 
&nbsp;
 
 
 
A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might require.
 
 
 
 
 
 
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
Preparation, submission and due date
 
 
</div>
 
</div>
  
Read carefully. Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we see too many assignments which, arduously effected, nevertheless intimate nescience of elementary tenets of molecular biology. If the sentence above did not trigger an urge to open a dictionary, you have a tendency to guess, rather than confirm possibly important information.
+
... but does evolution make sense in the light of biology?
  
Prepare a Microsoft Word document with a title page that contains:
+
As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to the gene in the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of '''phylogenetic analysis'''. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?
*your full name
 
*your Student ID
 
*your e-mail address
 
*the organism name you have been [[Organism_list_2006|assigned]]
 
  
Follow the steps outlined below. You are encouraged to  write your answers in short answer form or point form, '''like you would document an analysis in a laboratory notebook'''. However, you must
 
*document what you have done,
 
*note what Web sites and tools you have used,
 
*paste important data sequences, alignments, information etc.
 
  
'''If you do not document the process of your work, we will deduct marks.'''  Try to be concise, not wordy! Use your judgement: are you giving us enough information so we could exactly reproduce what you have done? If not, we will deduct marks. Avoid RTF and unnecessary formatting. Do not paste screendumps. Keep the size of your submission below 1.5 MB.
+
We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 and you have identified the full complement of APSES domain genes in your assigned organism. In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.
  
Write your answers into separate paragraphs and give each its title. Save your document with a filename of:
+
A number of excellent tools for phylogenetic analysis exist; ''general purpose packages'' include the (free) [http://evolution.genetics.washington.edu/phylip.html PHYLIP] package and the (commercial) PAUP package. ''Specialized tools'' for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume that results that converge when computed with different algorithms are more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.
<code>A5_family name.given name.doc</code>
 
<small>(for example my fifth assignment would be named: A5_steipe.boris.doc - and don't switch the order of your given name and familyname please!)</small>
 
  
Finally e-mail the document to [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] before the due date.
+
However: regarding algorithm and resources, we will take a shortcut in this assignment (something you should not do in real life). We will assume that the tree the algorithm constructs is correct. In "real life" you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with resampled data (alignment columns drawn at random, with replacement) and see which branches and groupings are robust and which depend on the details of the data. In this assignment, we should simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take a tree as a given fact just because a program suggests it. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes we have sequenced come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.
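The resampling step of a bootstrap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <code>bootstrap_columns</code> and the three toy sequences are hypothetical, and the tree-building step that would follow each replicate is not shown (in PHYLIP that resampling is the job of a dedicated program).

```python
import random

def bootstrap_columns(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    Columns are drawn at random WITH replacement, so some columns
    appear several times and others not at all. Every sequence is
    resampled with the same column picks, which keeps rows aligned.
    """
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}

# Hypothetical toy alignment (not taken from the assignment data).
alignment = {
    "seqA": "MKRRSD",
    "seqB": "MRRKSD",
    "seqC": "MKRISD",
}

rng = random.Random(42)  # fixed seed so that a run can be repeated
replicates = [bootstrap_columns(alignment, rng) for _ in range(100)]
```

Each of the hundred replicates would then be passed to the tree-building program, and a grouping is considered robust if it appears in most of the resulting trees.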
  
Your document must not contain macros. Please turn off and/or remove all macros from your Word document; we will disable macros, since they pose a security risk.
+
=====Introduction: Tasks=====
  
We do not have the resources to correct formatting errors or to convert assignments into different formats. <!-- Becoming familiar and proficient with technologies is part of the course objectives and that includes e-mail attachments. I will also not accept files that are significantly in excess of 1.5 MB. This will be enforced in this assignment, as the assignment includes a number of image files and as a proficient user of your computer you should be aware of an image's size, its resolution, its displayed size and its file format, all of which influence the displayed image and the size of its file.--> Keep your image-file sizes manageable!
+
For this assignment, we start from the APSES domains you have collected previously. You will align these domains with a set of reference domains and edit the alignment to make it suitable for phylogenetic analysis, using Jalview. Then you will construct a phylogenetic tree and interpret the tree. The goal is to identify orthologues and paralogues. <!-- Optionally, you will look at structural and functional conservation of residues. -->
  
:<small>Image sizes are measured in pixels - 600px across is sufficient for the assignment; resolutions are measured in dpi (dots per imperial inch) - 72 dpi is the standard resolution for images that are viewed on a monitor; the displayed size may be scaled (in %) by an application program: stereo images should be presented so that equivalent points are approximately 6 cm apart; images can be stored uncompressed as .tiff or .bmp, or compressed as .gif or .jpg. .gif is preferred for images with large, monochrome areas and sharp, high-contrast edges; '''.jpg is preferred for images with shades and halftones such as the structure views required here;''' .tiff is preferred to archive master copies of images in a lossless fashion, use LZW compression for .tiff files if your system/application supports it; .bmp is not preferred for anything, it's used because it's easier to code.</small>
+
In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf article on phylogenetic analysis (pdf)] here and in the resource section at the bottom of this page.
  
<!--Make it a habit to focus on information, pure and simple, and avoid HTML and RTF formatting and the like, where it does not contribute significantly to emphasize actual information. -->Information that you present (such as added colouring, formatting etc.) should be meaningful. If you have technical difficulties, post your questions to the list and/or contact me.
+
&nbsp;
 
 
All required stereo views are to be presented as divergent stereo frames (left eye's view in the left frame). <!--Marks will be deducted if they are not.--> Remember to list the Rasmol command input you have used to generate the images.
 
 
 
With the number of students in the course, we have to economize on processing the assignments. '''Thus we will not accept assignments that are not prepared as described above.''' If you have technical difficulties, contact me.
 
 
 
'''The due date for the assignment is Monday, November 5, at 10:00.'''
 
 
 
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
 
Grading
 
</div>
 
  
Don't wait until the last day to find out there are problems! Assignments that are received past the due date will have one mark deducted and an additional mark for every full twelve hour period past the due date. Assignments received more than 5 days past the due date will not be assessed.  If you need an extension, you '''must''' arrange this beforehand.
+
{{Template:Preparation|
 +
care=Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.|
 +
num=4|
 +
ord=fourth|
 +
due = Monday, November 28 at 12:00 in the morning}}
  
Marks are noted below in the section headings for each of the tasks. A total of 10 marks will be awarded if your assignment answers all of the questions. A total of 2 bonus marks (up to a maximum of 10 overall) can be awarded for particularly interesting findings or insightful comments. A total of 2 marks can be subtracted for lack of form or for glaring errors. The marks you receive will  
+
;Your documentation for the procedures you follow in this assignment will be worth 1 mark.
* count directly towards your final marks at the end of term, for BCH441 (undergraduates), or
 
* be divided by two for BCH1441 (graduates).
 
  
 
&nbsp;
 
&nbsp;
Line 100: Line 51:
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
==(1) Preparation==
+
 
 +
==(1) Preparations==
 
</div>
 
</div>
 
+
&nbsp;
 
+
&nbsp;
<!--
 
  
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
===Choosing a template (1 marks)===
+
===(1.1) Preparing Input Files===
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
Often more than one related structure can be found in the PDB. We have discussed principles of selecting template structures in the lecture. Interestingly the PDB itself cannot be searched for the contents of its holdings, by structural- or sequence similarity, but there is always BLAST since the NCBI conveniently allows you to search against all sequences in PDB files.
 
  
*Use BLAST to identify all PDB files that contain APSES domains that are clearly homologous to your target. (Document that you have searched in the correct subsection of the GenBank holdings). For the hits you find, consider how these structures differ and which features would make each more or less suitable for your task. Comment briefly on what options you have, select one template and note why you have decided to use this particular structure as a template. Include aspects of sequence similarity, length of the sequence, presence or absence of ligands and their potential effect on the structure, and experimental method and quality in your reasoning.
+
For this assignment, we start from the multiple sequence alignments we have constructed previously. We will edit the alignment to make it suitable for phylogenetic analysis. We will construct a phylogenetic tree and we will analyse the tree.
  
*Note which sequence is contained in the coordinate section of the PDB file; note if and how this implied sequence differs from the sequences ...
+
=====Introduction: Principle=====
  
:*listed in the seqres records;
+
In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold '''aligned characters in corresponding positions'''. Phylogeny programs are not meant to revise an alignment but to analyze evolutionary relationships, '''after''' the alignment has been determined. The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable.
:*given in the FASTA sequence for the template that the PDB provides;
 
:*and that stored by the NCBI.
 
  
* In a table, establish the correspondence of the coordinate sequence numbering (defined by the residue numbers/insertion codes in the atom records) with your target sequence numbering.
+
The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.
  
* Retrieve the most suitable template structure coordinate file from the PDB.
 
  
-->
+
'''Distance based''' phylogeny programs start by using sequence comparisons to estimate evolutionary distances:
  
 +
* they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
 +
* this score is stored in a "distance matrix" ...
 +
* ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).
  
&nbsp;
+
They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
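As a minimal sketch of the first two steps of a distance method, the Python below computes a pairwise distance matrix from aligned sequences. Note the simplification: it uses the plain fraction of differing residues (the "p-distance") instead of a mutation data matrix, so it is not what a real phylogeny program would compute. The function names are hypothetical; the three ten-residue fragments are the N-terminal ends of sequences from the Mbp1 alignment shown in this assignment.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing residues between two aligned sequences.

    Columns in which either sequence has a gap are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Return {(name1, name2): distance} for all pairs of sequences."""
    names = list(alignment)
    return {(p, q): p_distance(alignment[p], alignment[q])
            for p in names for q in names}

# First ten aligned residues of three sequences from the alignment.
fragments = {
    "1MB1":       "NQIYSARYSG",
    "MBP1_CANGL": "NQIYSAKYSG",
    "MBP1_SCHPO": "SAVHVAVYSG",
}

d = distance_matrix(fragments)
for (p, q), value in sorted(d.items()):
    print(f"{p:12s}{q:12s}{value:.2f}")
```

A tree-estimation step such as Neighbor Joining would take a matrix like <code>d</code> as its only input, which is why distance methods discard the column-by-column detail of the alignment.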
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
=== The input alignment (1 marks)===
 
</div>
 
&nbsp;<br>
 
  
The sequence alignment between target and template is the single most important factor that determines the quality of your model.
+
'''Parsimony based''' phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
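To illustrate what "minimizing the number of mutation events" means, here is a small Python sketch of Fitch's algorithm, one standard way to count the minimum number of changes a given tree requires. It only scores trees that are supplied to it; the search over possible trees that a real parsimony program performs is not shown. The function names, the nested-tuple tree encoding, and the four toy sequences are assumptions made for this example.

```python
def fitch(tree, states):
    """Score one alignment column on a rooted binary tree.

    tree   -- a leaf name (str) or a pair (left, right) of subtrees
    states -- dict mapping each leaf name to its residue in the column
    Returns (set of candidate ancestral residues, minimum changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # Both subtrees can agree on a residue: no extra change needed.
        return common, left_cost + right_cost
    # Otherwise at least one change is required on this branching.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the per-column minimum number of changes over all columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

# Hypothetical toy data: A and B share residues, as do C and D.
toy = {"A": "KVG", "B": "KVG", "C": "RIG", "D": "RIG"}

score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), toy)
score_ac_bd = parsimony_score((("A", "C"), ("B", "D")), toy)
print(score_ab_cd, score_ac_bd)
```

The tree that groups A with B and C with D needs fewer changes than the alternative, so a parsimony program would prefer it.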
  
No homology modeling process will repair an incorrect alignment and it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment, rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient, rather than the more sophisticated methods and more informed procedures we have discussed. Detailed analysis of fallacious models rarely leads to good results.
 
  
The best possible alignment is usually constructed from a multiple sequence alignment that includes at least the target and template sequence and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Typically such an alignment will also include additional optimization steps to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
+
'''ML''', or '''Maximum Likelihood''' methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute-intensive and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (runs about an hour on my computer). If the problem is too large, one may split it into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.
  
Here is an excerpt from the T-coffee aligned Mbp1 sequences: it contains all the residues of the yeast sequence that are found in the 1MB1 crystal structure - the '''template''' sequence for our homology model - and it has been edited to remove the N-terminal gaps in the sequence. Thus the N-terminus is 21 amino acids longer than the definition of the APSES domain in CDD (which starts with <code>SIMKR...</code>), the C- terminus is slightly shorter.  
+
ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.
  
Since the sequences are very similar to each other, there is no ambiguity in the alignment and the construction of a homology model should be straightforward. Normally one would spend considerable effort at this stage to consider which parts of the target sequence and the template sequence appear to be correctly aligned and to edit the alignment manually. In our case, evolutionary pressure was so strong that essentially all have evolved without a single indel in their sequence.
 
  
I have added to the alignment the APSES domain of [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=116197493&dopt=GenPept XP_001224558], the ''Chaetomium globosum'' Mbp1 orthologue (MBP1_CHAGL). This will serve as the reference and fallback sequence.
+
Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.
  
1MB1            NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
+
=====Introduction: Gaps=====
MBP1_CANGL      NQIYSAKYSGVDVYEFIHPTG---SIMKRKNDGWVNATHILKAANFAKAKRTRILEKEV
 
MBP1_EREGO      TQIYSAKYSGVEVYEFLHPTG---SIMKRKADDWVNATHILKAAKFAKAKRTRILEKEV
 
MBP1_KLULA      NQIYSAKYSGVDVYEFIHPTG---SIMKRKADNWVNATHILKAAKFPKAKRTRILEKEV
 
MBP1_CANAL      SQIYSATYSNVPAFEFVTSEG---PIMRRKKDSWINATHILKIAKFPKAKRTRILEKDV
 
MBP1_DEBHA      TQIYSATYSNVPVFEFVTLEG---PIMRRKLDSWINATHILKIAKFPKAKRTRILEKDV
 
MBP1_YARLI      MSIYKATYSGVPVYEFQCKNV---AVMRRKSDGWVNATHILKVAGFDKPQRTRILEKEV
 
MBP1_SCHPO      SAVHVAVYSGVEVYECFIKGV---SVMRRRRDSWLNATQILKVADFDKPQRTRVLERQV
 
MBP1_USTMA      KTIFKATYSGVPVYECIINNV---AVMRRRSDDWLNATQILKVVGLDKPQRTRVLEREI
 
MBP1_ASPNI      SNVYSATYSSVPVYEFKIGTD---SVMRRRSDDWINATHILKVAGFDKPARTRILEREV
 
MBP1_ASPTE      SKIYSATYSSVPVYEFKIEGD---SVMRRRADDWINATHILKVAGFDKPARTRILEREV
 
MBP1_CRYNE      PKVYASVYSGVPVFEAMIRGI---SVMRRASDSWVNATQILKVAGVHKSARTKILEKEV
 
MBP1_GIBZE      G-IYSASYSGVDVYEMEVNNI---AVMRRRNDSWLNATQILKVAGVDKGKRTKILEKEI
 
MBP1_NEUCR      IYSLQATYSGVGVYEMEVNNV---AVMRRQKDGWVNATQILKVANIDKGRRTKILEKEI
 
MBP1_MAGGR      P-IYTAVYSNVEVYEFEVNGV---AVMKRIGDSKLNATQILKVAGVEKGKRTKILEKEI
 
MBP1_ASPFU      PQIYKAVYSNVSVYEMEVNGV---AVMKRRSDSWLNATQILKVAGVVKARRTKTLEKEI
 
MBP1_CHAGL      AGIYSATYSGIPVYEYQFGPDMKEHVMRRREDNWINATHILKAAGFDKPARTRILERDV
 
 
1MB1            LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
 
MBP1_CANGL      LKEMHEKVQGGFGKYQGTWVPLNIAINLAEKFDVYQDLKPLF
 
MBP1_EREGO      IKDTHEKVQGGFGKYQGTWVPLDIARRLAQKFEVLEELRPLF
 
MBP1_KLULA      ITDTHEKVQGGFGKYQGTWIPLELASKLAEKFEVLDELKPLF
 
MBP1_CANAL      QTGIHEKVQGGYGKYQGTYVPLDLGAAIARNFGVYDVLKPIF
 
MBP1_DEBHA      QTGVHEKVQGGYGKYQGTYVPLDLGADIAKNFGVFDSLRPIF
 
MBP1_YARLI      QKGVHEKVQGGYGKYQGTWVPLERAREIATLYDVDSHLAPIF
 
MBP1_SCHPO      QIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPIL
 
MBP1_USTMA      QKGIHEKVQGGYGKYQGTWIPLDVAIELAERYNIQGLLQPIT
 
MBP1_ASPNI      QKGVHEKVQGGYGKYQGTWIPLQEGRQLAERNNILDKLLPIF
 
MBP1_ASPTE      QKGVHEKVQGGYGKYQGTWIPLPEGRLLAERNNIIDKLRPIF
 
MBP1_CRYNE      LNGIHEKIQGGYGKYQGTWVPLDRGRDLAEQYGVGSYLSSVF
 
MBP1_GIBZE      QTGEHEKVQGGYGKYQGTWIKFERGLQVCRQYGVEELLRPLL
 
MBP1_NEUCR      QIGEHEKVQGGYGKYQGTWIPFERGLEVCRQYGVEELLSKLL
 
MBP1_MAGGR      QTGEHEKVQGGYGKYQGTWIKYERALEVCRQYGVEELLRPLL
 
MBP1_ASPFU      AAGEHEKVQGGYGKYQGTWVNYQRGVELCREYHVEELLRPLL
 
MBP1_CHAGL      QKDVHEKIQGGYGKYQGTWIPLEQGRALAQRNNIYDRLRPIF
 
  
&nbsp;<br>
+
Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an '''alignment''' program as well as the distance score of a '''phylogeny''' program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However, in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most '''alignment''' programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most '''phylogeny''' programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion- and extension- characters get the same score. For short indels, this '''underestimates''' the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this '''overestimates''' the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but one or two columns of gapped sequence, or to remove such columns altogether.
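The most radical form of this editing, removing every column that contains a gap, can be sketched in Python as follows. In this assignment the editing is done interactively in Jalview, so the code is only meant to make the operation explicit; the function name is hypothetical, and the two fragments are the region around the three-residue indel in the 1MB1 and MBP1_CHAGL rows of the alignment shown in this assignment.

```python
def drop_gapped_columns(alignment, gap="-"):
    """Return a copy of the alignment without any gap-containing column.

    All sequences must have the same length; the result again has
    equal-length rows, so it remains a valid alignment.
    """
    names = list(alignment)
    lengths = {len(alignment[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("all rows must contain the same number of characters")
    columns = zip(*(alignment[n] for n in names))
    kept = [col for col in columns if gap not in col]
    return {n: "".join(col[i] for col in kept) for i, n in enumerate(names)}

# Region around the indel in two rows of the Mbp1 alignment.
region = {
    "1MB1":       "STG---SIMKR",
    "MBP1_CHAGL": "GPDMKEHVMRR",
}

edited = drop_gapped_columns(region)
for name, seq in edited.items():
    print(f"{name:12s}{seq}")
```

Keeping one or two of the gapped columns instead, as the text also allows, would preserve the information that an indel event occurred at this position.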
  
It should be obvious to you by now how you can copy a string of amino acids from such an alignment and create a FASTA file. However we need to take a little detour: this detour brings us to the question of sequence numbers.
 
  
It is not straightforward at all how to number sequences in such a project. The "natural" way would be to start a sequential numbering from the start-codon of the full length protein and go sequentially from there. However, imagine what would happen if a curator discovered that one of the splice-sites for a gene has been missed in automatic annotation. All of a sudden a corrected sequence would have a different length than the one that may have been used for earlier studies. Unfortunately, there is no mechanism (''wouldn't it be nice!'') that automatically goes back through the literature and your lab-journal and updates the revised sequence numbering... But there are other possible complications regarding sequence numbers. The first residue of the CDD-APSES domain is not Residue 1 of the Mbp1 protein. The first residue of the 1MB1 FASTA file ''is'' the first residue of Mbp1 protein, but the last five residues are an artificial His tag. Is H125 of 1MB1 the equivalent residue to R125 in MBP1_SACCE? The N-terminus of the Mbp1 crystal structure is disordered. The first residue in the structure is ASN 3, whereas the SEQRES records start with MET ... and so on. The take-home message is that a sequence number is nothing absolute, but something that makes sense only in a particular context. To emphasize this, we will write a FASTA header for our '''target''' sequence that lists the residues of the source sequence it corresponds to. In terms of actual sequence numbering, we will adopt the numbering of the 1MB1 protein throughout to be able to consistently label particular amino acids.
+
=====Introduction: The outgroup=====
  
Access the sequence of "your" organism's Mbp1 Orthologue at UniProt. (You can use the links I have provided in the table below).  
+
To analyse phylogenetic trees it is useful (and for some algorithms required) to define an outgroup, a sequence that presumably diverged from all other sequences in a clade before they split up among themselves. Wherever the outgroup inserts into the tree, this is the root of the rest of the tree. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. I have defined an outgroup sequence and added it to the [[Reference APSES domains|reference APSES domains page]]. The procedure is explained in detail on that page.
  
 +
>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
 +
<span style="color: #999999;">MTSFQLSLISRE</span>IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
 +
FKGGRPENQGTWVHPDIAINLAQ<span style="color: #999999;">WLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
 +
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
 +
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF</span>
 +
''E. coli'' KilA-N protein. Residues that do not align with APSES domains are shown in grey.
  
<table style="border-left:1px solid #AAAAAA; border-bottom:1px solid #AAAAAA;" cellpadding="10" cellspacing="0">
+
=====Preparing APSES sequences=====
<tr style="background: #BDC3DC;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b><i>Organism</i></b></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><b>Uniprot Accession</b></td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
<div style="padding: 5px; background: #DDDDEE;">
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus fumigatus</i></td>
+
#Navigate to the [[Reference APSES domains|reference APSES domains page]] and copy the sequences.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4WGN2_ASPFU Q4WGN2]</td>
+
#Open Jalview, select '''File &rarr; Input Alignment &rarr; from Textbox''' and paste the sequences into the textbox.
</tr>
+
#Add the APSES domain sequences '''from your species''' that you have defined in the previous assignment.
 +
#When all the sequences are present, click on '''New Window'''.
 +
#In Jalview, select Web Service &rarr; Alignment &rarr; MAFFT Multiple Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
 +
#Choose any colour scheme and add '''Colour &rarr; by Conservation'''. Adjust the slider left or right to see which columns are highly conserved.
 +
#Save the alignment as a Jalview project before editing it for phylogenetic analysis. You may need it again.  
 +
</div>
  
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus nidulans</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5B8H6_EMENI Q5B8H6]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
=====Introduction: Alignment editing for phylogenetic reconstruction=====
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Aspergillus terreus</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q0CQJ5_ASPTE Q0CQJ5]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
In practice, follow the fundamental principle that '''all characters in a column should be related by homology'''. This implies the following rules of thumb:
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida albicans</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5ANP5_CANAL Q5ANP5]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
*Remove all stretches of residues in which the ''alignment'' appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Candida glabrata</i></td>
+
*Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6FWD6_CANGL Q6FWD6]</td>
+
*Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
</tr>
+
*Remove all but approximately one column from gapped regions '''in those cases where the presence of several related insertions suggests that the indel is real, and not just an alignment artefact.''' (Some researchers simply remove all gapped regions).
 +
*Remove sections N- and C- terminal of gaps where the alignment appears questionable.
 +
*Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
 +
*If your sequences are too long, your tree calculations may run out of memory. 60-80 aligned residues should be plenty, and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory, try removing columns of sequence.
 +
*Move the KilA-N outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.
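The two mechanical rules above (gapped columns, and columns that carry no information about relationships) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the alignment is assumed to be held in a dictionary of equal-length strings, and the function names <code>drop_gapped_columns</code> and <code>informative_columns</code> are hypothetical, not part of Jalview or PHYLIP. Deciding whether an alignment region is ''ambiguous'' still requires your own judgment and is not attempted here.

```python
def drop_gapped_columns(alignment, max_gap_fraction=0.0):
    """Return a copy of the alignment without heavily gapped columns.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    max_gap_fraction: columns with a larger fraction of gaps are removed
    (the default of 0.0 removes every column that contains any gap).
    """
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = []
    for i in range(length):
        gaps = sum(1 for s in seqs if s[i] in "-.")
        if gaps / len(seqs) <= max_gap_fraction:
            keep.append(i)
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def informative_columns(alignment):
    """Indices of columns that are neither fully conserved nor
    different in every single sequence."""
    seqs = list(alignment.values())
    result = []
    for i in range(len(seqs[0])):
        distinct = len({s[i] for s in seqs})
        if 1 < distinct < len(seqs):
            result.append(i)
    return result
```

Used on a toy alignment, <code>drop_gapped_columns</code> removes the gapped column and <code>informative_columns</code> then reports which of the remaining columns vary between some, but not all, sequences. Treat the result as a suggestion for manual editing, not as a replacement for it.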
  
<tr style="background: #E9EBF3;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Cryptococcus neoformans</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q5KHS0_CRYNE Q5KHS0]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Debaryomyces hansenii</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6BSN6_DEBHA Q6BSN6]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
[[Image:EditingGuide.jpg|frame|none|(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. '''a''': raw alignment (CLUSTAL format); '''b''': sequences assembled into single lines; '''c''': columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; '''d''': input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the [http://evolution.genetics.washington.edu/phylip/doc/sequence.html PHYLIP sequence format guide].]]
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Eremothecium gossypii</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q752H3_ASHGO Q752H3]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
;Once you are satisfied with your editing, proceed as follows:
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Gibberella zeae</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4IEY8_GIBZE Q4IEY8]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
<div style="padding: 5px; background: #DDDDEE;">
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Kluyveromyces lactis</i></td>
+
#Download the PHYLIP package from the [http://evolution.genetics.washington.edu/phylip.html Phylip homepage] and install it on your computer.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_KLULA P39679]</td>
+
#Prepare a PHYLIP input file from your Jalview alignment. The simplest way to achieve this appears to be:
</tr>
+
##In Jalview, use '''File &rarr; Output to Textbox&rarr;FASTA''', then '''Edit&rarr;Select All''' and '''Edit&rarr;copy''' the sequences.
 +
##In a browser, navigate to the [http://www-bimas.cit.nih.gov/molbio/readseq/ '''Readseq sequence conversion service'''].
 +
##Paste your sequences into the form and choose '''Phylip''' as the output format. Click on '''submit'''.
 +
##Save the resulting page as a text file in the directory where the phylip executables reside on your computer. Give it some useful name such as <code>All-APSES_domains.phy</code>.  
 +
#Make a copy of that file and name it <code>infile</code>. Note: make sure that your Microsoft Windows operating system does not silently append the extension ".txt" to your file. It should be called "infile", nothing else, and you should never, never, ever permit your operating system to slyly hide file extensions from you when it displays filenames. You have been warned.
 +
</div>
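If the Readseq service is unavailable, the conversion described in the steps above can also be sketched by hand. The following Python example is a minimal sketch, not the procedure used to prepare this assignment: it assumes an already aligned FASTA file, writes the sequential (non-interleaved) layout with names truncated or padded to 10 characters, and moves the outgroup to the first line. The function names <code>parse_fasta</code> and <code>to_phylip</code> and the example sequence names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            # The name is the first word of the header line.
            name, chunks = (header[0] if header else "unnamed"), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records, outgroup=None):
    """Format aligned (name, sequence) records as sequential PHYLIP text."""
    if outgroup is not None:
        if outgroup not in [name for name, _ in records]:
            raise ValueError("outgroup name not found in alignment")
        # Stable sort: the outgroup moves to the front, the rest keep their order.
        records = sorted(records, key=lambda r: r[0] != outgroup)
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    short = [name[:10] for name, _ in records]
    if len(set(short)) != len(short):
        raise ValueError("names are not unique after truncation to 10 characters")
    # First line: number of sequences and alignment length.
    lines = [f" {len(records)} {lengths.pop()}"]
    for name, (_, seq) in zip(short, records):
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"
```

A call such as <code>to_phylip(parse_fasta(text), outgroup="KILA_ECOLI")</code> returns text that could be saved as <code>infile</code>. Check the result against the PHYLIP sequence format guide before running '''proml''', since PHYLIP is picky about its input.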
  
<tr style="background: #FFFFFF;">
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Magnaporthe grisea</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q3S405_MAGGR Q3S405]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
<div style="padding: 5px; background: #E9EBF3; border:solid 1px #AAAAAA;">
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Neurospora crassa</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q7SBG9_NEUCR Q7SBG9]</td>
 
</tr>
 
  
<tr style="background: #FFFFFF;">
+
===(1.2) Calculating a Tree===
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Saccharomyces cerevisiae</i></td>
+
</div>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=MBP1_YEAST P39678]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
&nbsp;<br>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Schizosaccharomyces pombe</i></td>
+
&nbsp;<br>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=RES2_SCHPO P41412]</td>
+
<div style="padding: 5px; background: #DDDDEE;">
</tr>
 
  
<tr style="background: #FFFFFF;">
+
*Use the '''proml''' program of PHYLIP (protein sequences, maximum likelihood tree) to calculate a phylogenetic tree. Use the default parameters except that you must change option <code>S: Speedier but rougher analysis?</code> to No - your analysis should not sacrifice accuracy for speed. The calculation will take a while.
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Ustilago maydis</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q4P117_USTMA Q4P117]</td>
 
</tr>
 
  
<tr style="background: #E9EBF3;">
+
</div>
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;"><i>Yarrowia lipolytica</i></td>
 
  <td style="border-right:1px solid #AAAAAA; border-top:1px solid #AAAAAA;">[http://www.ebi.uniprot.org/uniprot-srv/uniProtView.do?proteinId=Q6CGF5_YARLI Q6CGF5]</td>
 
</tr>
 
  
</table>
 
 
 
<div style="padding: 5px; background: #EEEEEE;">
 
*Copy your organism's Mbp1 sequence from the alignment above. Then define the start- and end- sequence numbers of the '''target''' sequence relative to the full-length protein. Prepare a FASTA formatted file for the '''target''' sequence in your organism, giving it an appropriate header and include the sequence numbers. Refer to the [[Assignment_5_fallback_data|'''Fallback data''']] file if you are not sure about the format. (1 mark)
 
</div>
 
 
&nbsp;<br>
 
&nbsp;<br>
 
+
&nbsp;<br>
Your FASTA sequence should look similar to this:
 
 
 
>1MB1: Mbp1_SACCE 1..100
 
NQIYSARYSGVDVYEFIHSTG---SIMKRKKDDWVNATHILKAANFAKAKRTRILEKEV
 
LKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLF
 
 
 
&nbsp;
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(2) Homology model==
+
==(2) Analysis (2 marks)==
</div>
 
&nbsp;
 
&nbsp;
 
 
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
=== (2.1) SwissModel (1 mark)===
 
 
</div>
 
</div>
&nbsp;<br>
 
  
Access the Swissmodel server at [http://swissmodel.expasy.org '''http://swissmodel.expasy.org'''] . Navigate to the '''Alignment Interface'''.
+
I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
[[Image:FungiCladogram.jpg|frame|none|Cladogram of fungi studied in the assignments. This cladogram is based on small subunit ribosomal RNA sequences, and largely follows ''Tehler et al.'' (2003) ''Mycol Res.'' '''107''':901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity.]]
*Copy from the alignment above the 1MB1 sequence and the sequence from your organism, and paste it into the form field. Refer to the [[Assignment_5_fallback_data|'''Fallback Data file''']] if you are not sure about the format.
 
:(You have to choose the format, and, if e.g. you choose a CLUSTAL format, you have to include a header line and a blank line. Other common problems uploading your alignment may include uploading a file that has not been saved as "text only" and periods i.e.  "."  in sequence names. Underscores appear to be safe.)
 
  
* Click '''submit''' and define your '''target''' and '''template''' sequence. For the '''template sequence''' define the coordinate file and chain. (In our case the coordinate file is <code>'''1MB1'''</code> and the chain is "<code>'''_'''</code>" i.e. none, since the PDB file does not contain more than one chain.)
+
Your species may not be included in this cladogram, but you can easily calculate your own with the following procedure:
  
*Click '''submit''' and request the construction of a homology model: Enter your e-mail address and check the button for '''Normal Mode''', not "Swiss-PDB Viewer mode". (Important, since there will be problems with the output otherwise). Click '''submit'''. You should receive four files by e-mail within half an hour or so. (1 mark)
+
<div style="padding: 5px; background: #DDDDEE;">
 +
#Access the [http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=taxonomy NCBI taxonomy database Entrez query page].
 +
#Edit the list of reference species below to include your species and paste it into the form.
  
(You do not need to submit any coordinate files with your assignment.)
+
"Emericella nidulans"[Scientific Name] OR
 +
"Candida albicans"[Scientific Name] OR
 +
"Neurospora crassa"[Scientific Name] OR
 +
"Saccharomyces cerevisiae"[Scientific Name] OR
 +
"Schizosaccharomyces pombe"[Scientific Name] OR
 +
"Ustilago maydis"[Scientific Name]
  
 +
#Next, as '''Display''' option, select '''Common Tree'''.
 +
#Then select the '''phylip tree''' option and click '''save as''' to save the tree in Newick format.
 +
#The output can be edited, and visualized in any program that reads Newick trees.
 
</div>
 
</div>
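Editing the species list by hand is easy to get wrong (a missing quote or a trailing <code>OR</code>). As a small convenience, the query text for the Entrez form can be assembled from a plain list of species names. This is only a sketch that reproduces the format of the list above; the function name <code>taxonomy_query</code> is hypothetical.

```python
def taxonomy_query(species):
    """Build an Entrez taxonomy query from a list of scientific names.

    Each name is quoted and tagged with [Scientific Name]; the terms are
    joined with OR, one term per line, as in the reference list.
    """
    return " OR\n".join(f'"{name}"[Scientific Name]' for name in species)


reference_species = [
    "Emericella nidulans",
    "Candida albicans",
    "Neurospora crassa",
    "Saccharomyces cerevisiae",
    "Schizosaccharomyces pombe",
    "Ustilago maydis",
]

# Add your own species to the list before printing the query.
print(taxonomy_query(reference_species))
```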
&nbsp;<br>
 
In case you do not wish to submit the modelling job yourself, you can access the result files from the [[Assignment_5_fallback_data|'''Fallback Data file''']].
 
  
  
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
+
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
==(3) Model analysis==
+
===(2.2) Visualizing the APSES domain Phylogenetic Tree===
 
</div>
 
</div>
&nbsp;
 
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
=== (3.1) The PDB file (1 mark)===
 
</div>
 
&nbsp;<br>
 
  
Open your  '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font) and consider the following questions: (Alternatively, view the coordinates linked to the [[Assignment_5_fallback_data|'''Fallback Data file''']].)
+
Once Phylip is done calculating the tree, the tree in a text format will be contained in the Phylip <code>outfile</code> - the documentation of what the program has done. Open this textfile for a first look. The tree is complicated and it can look confusing at first. The tree in Newick format is contained in the Phylip file <code>outtree</code>. Visualize it as follows:
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
<div style="padding: 5px; background: #DDDDEE;">
*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of the '''model''' correspond to that? (1 mark)
+
#Open <code>outtree</code> in a texteditor and copy the tree.
 +
#Visualize the tree in alternative representations:
 +
##Navigate to the [http://www.proweb.org/treeviewer/ Proweb treeviewer], paste and visualize your tree.
 +
##Navigate to the [http://www.trex.uqam.ca/index.php?action=newick&project=trex Trex-online Newick tree viewer] for an alternative view. Visualize the tree as a phylogram. You can increase the window height to keep the labels from overlapping.
 +
##In your Jalview window, choose '''File &rarr; Load associated Tree''' and load the Phylip <code>outtree</code> file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades.
 +
##Study the tree: understand what you see and what you would have expected.
 
</div>
 
</div>
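Before you interpret the tree, it can be worth checking that every sequence of your input alignment actually appears in <code>outtree</code>. The Python sketch below lists the leaf (OTU) labels of a Newick string. It is a simplification under stated assumptions: labels are unquoted and contain no whitespace, parentheses, colons, commas or semicolons. The function name <code>newick_leaves</code> and the example tree are hypothetical.

```python
import re


def newick_leaves(newick):
    """Return the leaf (OTU) labels of a Newick tree in order of appearance.

    A leaf label directly follows "(" or "," and ends at ":", ",", ")" or ";".
    Labels of internal nodes follow ")" and are therefore not reported.
    """
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


# A small made-up tree with branch lengths, broken across two lines
# the way a tree file may wrap long lines.
example = "((MBP1_SACCE:0.1,MBP1_CANAL:0.2):0.05,\nKILA_ECOLI:0.9);"
print(newick_leaves(example))
```

Comparing the resulting list with the names in your <code>infile</code> quickly shows whether a sequence was dropped or a name was truncated differently than you expected.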
  
<!-- discuss flagging of loops - setting of B-factor to 99.0 -->
+
Here are two principles that will help you make sense of the tree.
  
&nbsp;
 
&nbsp;
 
  
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
+
A: '''A gene that is present in an ancestral species is inherited in all descendant species'''. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event).
===(3.2) first visualization (3 marks)===
 
</div>
 
&nbsp;<br>
 
 
 
In assignment 2 you have already studied the 1MB1 coordinate file and compared it to your organism's Mbp1 APSES domain. Since a homology model inherits its structural details from the '''template''', the model should look very similar to the original structure but contain the sequence of the '''target'''.
 
  
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
B: '''Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants'''; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their LCA.
*Save the attachment of your '''model''' coordinates to your harddisk and visualize it in RasMol. (Alternatively, copy and save the coordinates from the [[Assignment_5_fallback_data|'''Fallback Data file''']] to your harddisk.) Make an informative view, divergent stereo and paste it into your assignment. (3 marks)
 
  
</div>
 
&nbsp;<br>
 
  
 +
With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help.
  
[[Image:A5_Mbp1_subdomain.jpg|frame|none|Stereo-view of a subdomain within the 1MB1 structure that includes residues 36 to 76. The color gradient ramps from blue (36) to green (76).]]
 
  
 
&nbsp;
 
&nbsp;
Line 369: Line 227:
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
===(3.3) modeling a DNA ligand (4 marks)===
+
===(2.1) The Cenancestor's APSES Domains===
 
</div>
 
</div>
&nbsp;<br>
 
  
The really interesting question we could begin to address with our model is how sequence variation might be converted into changing DNA recognition sites, and then lead to changed cognate DNA binding sequences. But in order to address this, we would need to add a plausible model for a bound DNA molecule to our model.
+
Refer to your tree for the following tasks. (Please remember to include your tree in your Assignment submission - it is a result of your computational experiment. It's easiest to copy/paste the tree from the Phylip outfile, rather than copying an image from a Tree viewer). Be specific in your discussion, i.e. refer to specific branchpoints (branchpoints are numbered in the Phylip output) and OTU or gene names in your analysis (see the example below).
  
Since there is currently no software available that would accurately model such a complex from first principles, we will base this on homology modeling as well. This means we need to find a similar structure for which the complex structure is known. However, you may remember from the third assignment that the APSES domains in fungi seem to be a relatively small family. And there is no structure available of a protein-DNA complex.  Now what?
 
  
Remember that homologous sequences can have diverged to the point where their sequence similarity is no longer recognizable, however their structure may be quite well conserved. Thus if we could find similar structures in the PDB, these might provide us with some plausible hypotheses for how DNA is bound by APSES domains. We thus need a tool similar to BLAST, but not for the purpose of sequence alignment, but for structure alignment. A kind of BLAST for structures.
+
<div style="padding: 5px; background: #DDDDEE;">
 +
*Consider how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so.  
 +
</div>
  
However, very similar to BLAST, we might not want to search with the entire protein, if all we are interested in is a subdomain that binds to DNA. Attempting to match all structural elements in addition to the ones we are actually interested in is likely to make the search less specific - we would find false positives that are similar to some irrelevant part of our structure. However, defining too small of a subdomain would also lead to a loss of specificity: in the extreme it is easy to imagine that the search for e.g. a single helix would retrieve very many hits that would be quite meaningless. The arrangement of the residues from 50 to 74 that we have already discussed in Assignment 2 suggests that the compact subdomain from 36 to 76 (see the image above) might be a useful structure to search with: it contains the residues we are interested in and enough of connected secondary structure elements to be structurally meaningful.
 
  
At the '''NCBI''', [http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml VAST] is a structural similarity search tool for this purpose. Unfortunately it does not seem to be able to handle a query with such a structural subdomain (the process did not finish after several days) but at least you can get a list of structural neighbors of the 1MB1 full-length template structure, by entering the PDB ID in a small form field on the VAST home page, and then clicking on the colored bar labeled "Chain" on the MMDB structure summary page. This precomputed page for the 1MB1 structure shows a number of diverse proteins matching to various helices and strands of the structure.
+
<div style="padding: 5px; background: #E9EBF3;  border:solid 1px #AAAAAA;">
  
At the '''EBI''' there are a number of very well designed structure analysis tools linked off the [http://www.ebi.ac.uk/Tools/structural.html '''Structural Analysis''' page]. As part of its MSD Services, the SSM (Secondary Structure Matching service) provides a well thought out interface for searching files from the PDB or uploading coordinates.
+
===(2.2) Unraveling your organism's APSES domains (2 marks)===
 +
</div>
  
After uploading the coordinates for residues 36 to 76 of the 1MB1 structure, running the search, and sorting the results by alignment length, the top hits include a number of nucleotide binding proteins such as a replication terminator (1F4K), the LexA repressor (1MVD) and a "Winged Helix" protein (1KQ8). These are all members of a much larger superfamily, the "winged helix" DNA binding domains ([http://cathwww.biochem.ucl.ac.uk/cgi-bin/cath/GotoCath.pl?cath=1.10.10.10 CATH 1.10.10.10]), of which hundreds of structures have been solved. They represent one branch of the tree of helix-turn-helix (HTH) DNA binding modules. (A recent review on HTH proteins is linked from the resources section at the bottom of this page). Winged Helix domains typically bind their cognate DNA with a "recognition helix" which precedes the beta hairpin and binds into the major groove; additional stabilizing interactions are provided by the edge of the beta strand binding into the minor groove.
+
&nbsp;<br>
 +
&nbsp;<br>
 +
<div style="padding: 5px; background: #FFCC99;">
 +
;Analysis (2 marks)
  
<!-- The other service the EBI structure links to is the DALI server. DALI was one of the first algorithms capable of large-scale protein structure searches; it was developed by Liisa Holm and is now hosted by her group in Helsinki. Submitting our search domain generates the e-mailed result linked to here. Both results (there are only two) are also found in the top 100 list of the SSM service. The winged helix domain 1DP7 merits some comment though: its structure shows a novel mode of binding for DNA. Here, it is the beta-wing, not the "recognition helix" that inserts into the major groove! We will consider this in more detail below.
+
Assume that the cladogram for fungi that I have given above is correct, and that the mixed gene tree you have calculated is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to discuss briefly through what sequence of duplications and/or gene loss your organism has ended up with the APSES domains it possesses today. Make specific reference to the cladogram of species and note in particular whether any of your sequences appear to have been placed into regions of the tree where they don't seem to belong. Also note which branchpoints in the evolutionary history of your sequences correspond to speciations and which ones to duplications.
  
First we shall explore some of the structures that SSM has returned. The SSM server presents its result details in Web pages, but it also allows to download the entire result set in an XML formatted file. This is a common method of data-interchange in bioinformatics but you would not want to actually read such a file and manually extract information (even though you could, in principle). Thus I have prepared a summary file of the alignment details of the SSM results. This should allow you to rapidly find the exact aligned residues in the matched domains. While I have derived this file from the output through a computer program I have written, you could easily have accessed the same information on the Web, had you run the query yourself. -->
 
  
This is good news: once we have determined that the APSES domain is actually an example of a larger group of transcription factors, we can pick one of these for which a DNA complex structure is known. I have picked one such structure from the list of hits that were returned by SSM: it is the Elk-1 transcription factor.
+
Note: A common confusion about cenancestral genes arises from the fact that far from all expected genes are present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, some may have been missed by you. In general you have to ask: '''given the species represented in a subclade, what is the last common ancestor of that branch'''? The expectation is that '''all''' descendants of that ancestor should be represented in that branch '''unless''' one of the above reasons why a gene might be absent applies.
  
[[Image:A5_canonical_wHTH.jpg|frame|none|Stereo-view of the canonical DNA binding mode of the Winged Helix domain family. Shown here is the Elk-1 transcription factor - an ETS DNA binding domain - in complex with a high-affinity binding site (pdb|1DUX). Note how the "recognition helix" inserts into the major groove of the DNA molecule. The color gradient ramps from blue (34) to green (84). Note how the first helix of the "helix-turn-helix" architecture serves only to position the recognition helix and makes few interactions by itself.]]
 
  
Now all that is left to do is to bring the DNA molecule  into the correct orientation for our '''model''' and then to combine the two files. We need to superimpose the Elk-1 protein/DNA complex onto our '''model'''.
+
If your species does not have all the genes you would expect it to have inherited from its ancestors, you MUST note that fact and attempt to explain it.  
 
+
</div>
;Structure superposition
 
There are quite a number of superposition servers available on the Web; a remarkably comprehensive overview can be found in [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia]. However, overengineering and a black-box mentality make our task more difficult than it need be: most tools do not allow users to specify particular alignment zones but attempt to automatically define the zones of residues to be superimposed according to some geometric target function. Almost none return the actual rotation matrix and translation vector that is used for the superposition. And almost none transform the coordinates of heteroatoms such as solvent, ligands or DNA molecules along with the protein coordinates. An exception that I have found to be very usable is the [http://www.predictioncenter.org/local/lga/lga.html Local-Global Alignment server ('''LGA''')], written by Adam Zemla. The procedure is quite straightforward:
 
 
 
*Define the structure to be rotated (1DUX in this case). This is a dimer, so download the file from the PDB and manually edit to contain only DNA chains A and B and protein chain C.
 
*Define the structure to be held constant (1MB1 in this case). Download from PDB.
 
*Use the "browse" option to define both files as input on the LGA input form
 
*Use the option to have both coordinate sets included in your output: <code>-o2</code>
 
*Submit
 
  
The results arrive by e-mail. I have linked the resulting PDB file to the [[Assignment_5_fallback_data|'''Fallback Data page''']]. <small>If you run this analysis on your own, you may want to review the types of edits I made to the PDB file to get it displayed correctly in Rasmol.</small>
 
  
 
+
For example, the following discussion for ''Saccharomyces cerevisiae'' would be sufficient for full marks:
&nbsp;<br><div style="padding: 5px; background: #EEEEEE;">
+
:(Numbers refer to branchpoints of the mixed gene tree, letters to branchpoints of the species tree). I have found five homologues to ''Saccharomyces cerevisiae'' Mbp1 and included them in the mixed gene tree. Two subclades are well defined and contain all current species; they branch from 41 (Xbp1) and 50 (Sok2/Phd1). The subclade below 6 includes Mbp1 orthologues as well as Swi4 orthologues that do not appear well resolved. Considering only species below the ''saccharomycetales'' branchpoint, I postulate a duplication at that branchpoint that gave rise to yeast Mbp1 and Swi4, since the respective branches contain representatives from all fungi that descended from that branch. There is no good support for the idea that the cenancestor had a Swi4 paralogue. Therefore the cenancestor most likely possessed two paralogues: Mbp1, and Sok2. ''Saccharomyces cerevisiae'' has one gene in each of the major subclades; there is no gene loss. It also has an additional paralogue to Sok2: the Phd1 gene that duplicated at branchpoint 3.
*Save the superimposed  coordinates in a file, open and view in Rasmol and note how well the "recognition helix" and adjacent beta strands superimpose! (Alternatively, copy and save the coordinates from the c to your harddisk.) Make an informative view, divergent stereo and paste it into your assignment. (4 marks)
 
</div>
 
&nbsp;<br>
 
&nbsp;
 
  
  
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #BDC3DC;  border:solid 1px #AAAAAA;">
  
==(4) Summary of Resources==
+
==(3) Summary of Resources==
 
</div>
 
</div>
 
&nbsp;<br>
 
&nbsp;<br>
  
 
;Links
 
;Links
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Peitsch_2002_UseOfModels.pdf '''Review (PDF, restricted)''' Manuel Peitsch on Homology Modeling]
+
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Baldauf_2003_PhylogenyTutorial.pdf '''Review (PDF, restricted)''' Sandra Baldauf: Phylogeny for the Faint of Heart]
:* [http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/restricted/Aravind_2005_HTHdomains.pdf '''Review (PDF, restricted)''' Aravind ''et al.'' Helix-turn-helix domains] (background reading, not required reading)
+
:* [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP''' home page]
:* [[Organism_list_2006|Assigned Organisms]]
 
:* [http://www.rcsb.org/pdb/file_formats/pdb/pdbguide2.2/guide2.2_frame.html '''PDB file format''']
 
:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
 
  
:* [[Assignment_5_fallback_data|'''Fallback Data page''']]
+
;Sequences
 
+
:* [[Reference APSES domains|Reference APSES domains page]]
;Alignments
 
:* [[All_Mbp1_T-COFFEE|Mbp1 proteins '''T-Coffee''' aligned (text version)]]
 
 
 
&nbsp;
 
&nbsp;
 
  
 
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
 
<div style="padding: 5px; background: #D3D8E8;  border:solid 1px #AAAAAA;">
Line 441: Line 279:
 
</div>
 
</div>
  
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2006@googlegroups.com Course Mailing List]
+
If you have any questions at all, don't hesitate to mail me at [mailto:boris.steipe@utoronto.ca boris.steipe@utoronto.ca] or post your question to the [mailto:bch441_2011@googlegroups.com Course Mailing List]

Latest revision as of 23:34, 21 September 2012

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 


   

Assignment 4 (last: 2011) - Phylogenetic Analysis

Introduction  

Nothing in Biology makes sense except in the light of evolution.
Theodosius Dobzhansky

... but does evolution make sense in the light of biology?

As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to the gene in the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?


We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 and you have identified the full complement of APSES domain genes in your assigned organism. In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of all fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history.

A number of excellent tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package and the (commercial) PAUP package. Specialized tools for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume that results that converge when computed with different algorithms are more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.

However: regarding algorithm and resources, we will take a shortcut in this assignment (something you should not do in real life). We will assume that the tree the algorithm constructs is correct. In "real life" you would establish its reliability with a bootstrap procedure: repeat the tree-building a hundred times with resampled data (alignment columns drawn at random, with replacement) and see which branches and groupings are robust and which depend on the details of the data. In this assignment, we should simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes we have sequenced come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.
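
To make the bootstrap idea concrete, here is a minimal sketch of how one resampled alignment could be produced. The toy sequences and the function name bootstrap_columns are assumptions made for illustration only; in practice a program such as PHYLIP's seqboot generates the replicates, and a tree is then built from each one.

```python
import random

def bootstrap_columns(alignment, rng):
    """Return one bootstrap replicate: columns drawn at random with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # Pick as many column indices as the alignment has columns; repeats are allowed.
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}

# Toy alignment (hypothetical sequences, for illustration only).
toy = {"seqA": "MKVLA", "seqB": "MRVLG", "seqC": "MKILA"}
replicate = bootstrap_columns(toy, random.Random(1))
for name, seq in replicate.items():
    print(name, seq)
```

Every column of the replicate is a complete column of the original alignment, so the rows stay aligned; only the weight each column receives changes from replicate to replicate.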

Introduction: Tasks

For this assignment, we start from the APSES domains you have collected previously. You will align these domains with a set of reference domains and edit the alignment to make it suitable for phylogenetic analysis, using Jalview. Then you will construct a phylogenetic tree and interpret the tree. The goal is to identify orthologues and paralogues.

In case you want to review the concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis (pdf) here and in the resource section at the bottom of this page.

 

Preparation, submission and due date

Read carefully.
Be sure you have understood all parts of the assignment and cover all questions in your answers! Sadly, we always get assignments back in which important aspects have simply overlooked marks unnecessarily. If you did not notice that the above did not make sense, you are reading what you expect, not what is written.

Review the guidelines for preparation and submission of BCH441 assignments.

The due date for the assignment is Monday, November 28 at 12:00 in the morning.

   

Your documentation for the procedures you follow in this assignment will be worth 1 mark.

   

(1) Preparations

   

(1.1) Preparing Input Files

 

For this assignment, we start from the multiple sequence alignments we have constructed previously. We will edit the alignment to make it suitable for phylogenetic analysis. We will construct a phylogenetic tree and we will analyse the tree.

Introduction: Principle

In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise an alignment but to analyze evolutionary relationships, after the alignment has been determined. The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable.
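
The requirement that all rows contain the same number of characters is easy to check before running a phylogeny program. The following is a minimal sketch; the function name check_alignment and the toy rows are assumptions, not part of any package used in this assignment.

```python
def check_alignment(alignment):
    """Return the common row length, or raise ValueError if rows differ in length."""
    lengths = {name: len(seq) for name, seq in alignment.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"rows do not all have the same length: {lengths}")
    return next(iter(lengths.values()))

# Toy rows (hypothetical); "-" marks a gap.
print(check_alignment({"seqA": "MK-L", "seqB": "MKVL"}))
```

Note that such a check only confirms the format. Whether the characters in a column are actually homologous is a judgment you have to make when editing the alignment.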

The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.


Distance based phylogeny programs start by using sequence comparisons to estimate evolutionary distances:

  • they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
  • this score is stored in a "distance matrix" ...
  • ... and used to estimate a tree that groups sequences with close relationships together. (e.g. by using an NJ, Neighbor Joining, algorithm).

They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
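
As an illustration of the first two steps, the sketch below fills a distance matrix for a toy alignment. It is deliberately simplified: instead of a mutation data matrix it uses the plain fraction of differing residues (the "p-distance") over columns in which neither sequence has a gap. The sequences and function names are assumptions for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns in common")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Return a symmetric dict {(name1, name2): distance} for all sequence pairs."""
    dist = {}
    for m, n in combinations(alignment, 2):
        dist[(m, n)] = dist[(n, m)] = p_distance(alignment[m], alignment[n])
    return dist

toy = {"seqA": "MKVLA", "seqB": "MRVLG", "seqC": "MKILA"}
d = distance_matrix(toy)
print(d[("seqA", "seqB")], d[("seqA", "seqC")], d[("seqB", "seqC")])
```

A tree-building step such as NJ would then work from these numbers alone, which is why all column-specific information is lost in distance methods.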


Parsimony based phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
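
To show what "minimizing the number of mutation events" means, here is a small sketch that counts the minimum number of changes a given tree requires, using Fitch's counting procedure column by column. It only scores a tree that is handed to it; a real parsimony program also searches through many candidate trees. The tree encoding (nested pairs of names) and the toy sequences are assumptions for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes) for one column.

    tree: nested 2-tuples with leaf names as strings; states: leaf name -> residue.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, states)
    rset, rcost = fitch(right, states)
    common = lset & rset
    if common:
        # The two subtrees can share an ancestral state: no extra change needed.
        return common, lcost + rcost
    # No shared state: at least one change must have happened on this branchpoint.
    return lset | rset, lcost + rcost + 1

def parsimony_score(tree, alignment):
    """Sum the per-column minimum number of changes over the whole alignment."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(tree, {n: s[i] for n, s in alignment.items()})[1]
               for i in range(length))

toy = {"A": "MKVLA", "B": "MKVLG", "C": "MRILA", "D": "MRILG"}
print(parsimony_score((("A", "B"), ("C", "D")), toy))
print(parsimony_score((("A", "C"), ("B", "D")), toy))
```

For these toy sequences the first tree needs fewer changes than the second, so a parsimony method would prefer it.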


ML, or Maximum Likelihood methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute intensive and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (runs about an hour on my computer). If the problem is too large, one may split a large problem into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.

ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.


Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.

Introduction: Gaps

Gaps are a real problem here, as usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most phylogeny programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion- and extension- characters get the same score. For short indels, this underestimates the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but one or two columns of gapped sequence, or to remove such columns altogether.
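
The difference between the two ways of counting gaps can be shown with a few lines of arithmetic. The penalty values below are arbitrary assumptions chosen only to illustrate the effect; they are not the parameters of any particular alignment or phylogeny program.

```python
def affine_gap_cost(length, gap_open=10.0, gap_extend=0.5):
    """Cost of ONE gap of `length` columns: an opening cost plus a small cost per extension."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)

def columnwise_gap_cost(length, per_column=2.0):
    """Cost if every gap column is counted as an independent, substitution-like event."""
    return per_column * length

for n in (1, 20):
    print(n, affine_gap_cost(n), columnwise_gap_cost(n))
```

With these assumed values a one-column gap is much cheaper when counted column-wise than under the affine model, while a twenty-column gap is about twice as expensive. This mirrors the under- and overestimation described above.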


Introduction: The outgroup

To analyse phylogenetic trees it is useful (and for some algorithms required) to define an outgroup, a sequence that presumably diverged from all other sequences in a clade before they split up among themselves. Wherever the outgroup inserts into the tree, this is the root of the rest of the tree. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. I have defined an outgroup sequence and added it to the reference APSES domains page. The procedure is explained in detail on that page.

>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
MTSFQLSLISREIDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
FKGGRPENQGTWVHPDIAINLAQWLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF

E. coli KilA-N protein. Residues that do not align with APSES domains are shown in grey.

Preparing APSES sequences
  1. Navigate to the reference APSES domains page and copy the sequences.
  2. Open Jalview, select File → Input Alignment → from Textbox and paste the sequences into the textbox.
  3. Add the APSES domain sequences from your species that you have defined in the previous assignment.
  4. When all the sequences are present, click on New Window.
  5. In Jalview, select Web Service → Alignment → MAFFT Multiple Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
  6. Choose any colour scheme and add Colour → by Conservation. Adjust the slider left or right to see which columns are highly conserved.
  7. Save the alignment as a Jalview project before editing it for phylogenetic analysis. You may need it again.


Introduction: Alignment editing for phylogenetic reconstruction

In practice, follow the fundamental principle that all characters in a column should be related by homology. This implies the following rules of thumb:

  • Remove all stretches of residues in which the alignment appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
  • Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains.
  • Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
  • Remove all but approximately one column from gapped regions in those cases where the presence of several related insertions suggests that the indel is real, and not just an alignment artefact. (Some researchers simply remove all gapped regions).
  • Remove sections N- and C- terminal of gaps where the alignment appears questionable.
  • Also, consider that neither residues that are completely different between all species, nor residues that are completely conserved are informative for relationship distances.
  • If your sequences are too long, your tree calculations may run out of memory. 60-80 aligned residues should be plenty and if the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory, try removing columns of sequence.
  • Move the KilA-N outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.
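
Editing in Jalview is done by hand and by eye, but two of the rules above are mechanical enough to sketch in code: removing gapped columns (here the simple variant that removes every column containing a gap) and moving the outgroup to the first row. The function names, the threshold parameter and the toy rows are assumptions for illustration.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.0):
    """Keep only columns whose fraction of '-' characters is at most max_gap_fraction."""
    names = list(alignment)
    rows = [alignment[n] for n in names]
    keep = [i for i, col in enumerate(zip(*rows))
            if col.count("-") / len(col) <= max_gap_fraction]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}

def outgroup_first(alignment, outgroup):
    """Return the alignment with the outgroup row moved to the first position."""
    rest = {n: s for n, s in alignment.items() if n != outgroup}
    return {outgroup: alignment[outgroup], **rest}

# Toy rows (hypothetical residues); "KILA" stands in for the KilA-N outgroup.
toy = {"MBP1": "MK-VL", "SWI4": "MR-IL", "KILA": "M-AVL"}
edited = outgroup_first(drop_gappy_columns(toy), "KILA")
for name, seq in edited.items():
    print(name, seq)
```

A purely mechanical filter like this cannot recognize ambiguous alignment regions or frayed termini; those still need your judgment.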


(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; d: input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input; read the PHYLIP sequence format guide.
Once you are satisfied with your editing, proceed as follows:
  1. Download the PHYLIP package from the Phylip homepage and install it on your computer.
  2. Prepare a PHYLIP input file from your Jalview alignment. The simplest way to achieve this appears to be:
    1. In Jalview, use File → Output to Textbox→FASTA, then Edit→Select All and Edit→copy the sequences.
    2. In a browser, navigate to the Readseq sequence conversion service.
    3. Paste your sequences into the form and choose Phylip as the output format. Click on submit.
    4. Save the resulting page as a text file in the directory where the phylip executables reside on your computer. Give it some useful name such as All-APSES_domains.phy.
  3. Make a copy of that file and name it infile. Note: make sure that your Microsoft Windows operating system does not silently append the extension ".txt" to your file. It should be called "infile", nothing else and you should never, never, ever permit your operating system to slyly hide file extensions from you when it displays filenames. You have been warned.
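
If you prefer not to depend on a conversion service, the PHYLIP format described in the figure caption above is simple enough to write yourself. The following is a minimal sketch for the sequential, one-line-per-sequence case: the first line holds the number of sequences and the sequence length, and each name is padded with blanks to exactly 10 characters. The function name to_phylip and the toy rows are assumptions.

```python
def to_phylip(alignment):
    """Format an alignment (name -> aligned sequence) as sequential PHYLIP text."""
    names = list(alignment)
    length = len(alignment[names[0]])
    lines = [f" {len(names)} {length}"]
    for name in names:
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        if len(alignment[name]) != length:
            raise ValueError(f"row {name} does not have length {length}")
        # The name field is exactly 10 characters wide, padded with blanks.
        lines.append(name.ljust(10) + alignment[name])
    return "\n".join(lines) + "\n"

text = to_phylip({"KILA": "MVL", "MBP1_SACCE": "MIL"})
print(text, end="")
```

Writing the returned text to a file named infile in the directory of the PHYLIP executables would then replace steps 2 and 3 above. Compare the result against the PHYLIP sequence format guide before relying on it.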


(1.2) Calculating a Tree

 
 

  • Use the proml program of PHYLIP (protein sequences, maximum likelihood tree) to calculate a phylogenetic tree. Use the default parameters except that you must change option S: Speedier but rougher analysis? to No - your analysis should not sacrifice accuracy for speed. The calculation will take a while.

 
 

(2) Analysis (2 marks)

I have constructed a cladogram for the species we are analysing, based on data published for 1551 fungal ribosomal sequences. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.

Cladogram of fungi studied in the assignments. This cladogram is based on small subunit ribosomal RNA (rRNA) sequences, and largely follows Tehler et al. (2003) Mycol Res. 107:901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity.

Your species may not be included in this cladogram, but you can easily calculate your own with the following procedure:

  1. Access the NCBI taxonomy database Entrez query page.
  2. Edit the list of reference species below to include your species and paste it into the form.
"Emericella nidulans"[Scientific Name] OR
"Candida albicans"[Scientific Name] OR
"Neurospora crassa"[Scientific Name] OR
"Saccharomyces cerevisiae"[Scientific Name] OR
"Schizosaccharomyces pombe"[Scientific Name] OR
"Ustilago maydis"[Scientific Name]
  3. Next, as Display option, select Common Tree.
  4. Then select the phylip tree option and click save as to save the tree in Newick format.
  5. The output can be edited, and visualized in any program that reads Newick trees.
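
The query string for the taxonomy form follows a regular pattern, so it can be assembled from a list of species names instead of being edited by hand. This is a small sketch; the function name taxonomy_query is an assumption, and the result is simply pasted into the Entrez form as described above.

```python
def taxonomy_query(species):
    """Join species names into an Entrez taxonomy query of OR-ed [Scientific Name] terms."""
    return " OR ".join(f'"{name}"[Scientific Name]' for name in species)

reference_species = [
    "Emericella nidulans",
    "Candida albicans",
    "Neurospora crassa",
    "Saccharomyces cerevisiae",
    "Schizosaccharomyces pombe",
    "Ustilago maydis",
]
# Append your own assigned species to reference_species before building the query.
print(taxonomy_query(reference_species))
```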


(2.2) Visualizing the APSES domain Phylogenetic Tree


Once Phylip is done calculating the tree, the tree in a text format will be contained in the Phylip outfile - the documentation of what the program has done. Open this textfile for a first look. The tree is complicated and it can look confusing at first. The tree in Newick format is contained in the Phylip file outtree. Visualize it as follows:

  1. Open outtree in a texteditor and copy the tree.
  2. Visualize the tree in alternative representations:
    1. Navigate to the Proweb treeviewer, paste and visualize your tree.
    2. Navigate to the Trex-online Newick tree viewer for an alternative view. Visualize the tree as a phylogram. You can increase the window height to keep the labels from overlapping.
    3. In your Jalview window, choose File → Load associated Tree and load the Phylip outtree file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades.
    4. Study the tree: understand what you see and what you would have expected.
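
If you would like to inspect the Newick text in outtree yourself rather than only through a viewer, a small parser is sufficient. The sketch below handles the plain form name:length with nested parentheses and assumes that names contain no spaces, quotes or comments; the example tree string is made up for illustration and is not actual program output.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length and children."""
    text = "".join(text.split()).rstrip(";")   # drop whitespace, line breaks and the final ";"
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                            # skip "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                        # skip ","
                children.append(node())
            pos += 1                            # skip ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]                  # empty for unlabelled internal nodes
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()

def leaf_names(tree):
    """List the names of all leaves (OTUs) below a node, in the order they appear."""
    if not tree["children"]:
        return [tree["name"]]
    return [n for child in tree["children"] for n in leaf_names(child)]

tree = parse_newick("(KILA:0.9,(MBP1:0.1,SWI4:0.2):0.3,XBP1:0.5);")
print(leaf_names(tree))
```

Listing the leaves below each branchpoint in this way is a quick check that the subclades you see in a viewer contain the OTUs you think they do.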

Here are two principles that will help you make sense of the tree.


A: A gene that is present in an ancestral species is inherited in all descendant species. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event).

B: Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their LCA.
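
Principle B can be phrased as a test: the genes that descend from one ancestral paralogue should form a monophyletic group, i.e. there should be a branchpoint whose descendants are exactly those genes. The sketch below checks this on a rooted tree written as nested tuples. The gene labels and the tree are hypothetical and only illustrate the expected pattern for two paralogues in three species.

```python
def leaves(tree):
    """Return the set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def is_monophyletic(tree, group):
    """True if some branchpoint (or leaf) has exactly the members of `group` below it."""
    group = frozenset(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)

# Hypothetical gene tree: two paralogues, each recapitulating the same species tree.
gene_tree = ((("Mbp1_SACCE", "Mbp1_CANAL"), "Mbp1_NEUCR"),
             (("Sok2_SACCE", "Sok2_CANAL"), "Sok2_NEUCR"))

print(is_monophyletic(gene_tree, ["Mbp1_SACCE", "Mbp1_CANAL", "Mbp1_NEUCR"]))
print(is_monophyletic(gene_tree, ["Mbp1_SACCE", "Sok2_SACCE"]))
```

In a real tree the expected groups will often fail this strict test because of gene loss, missing sequences or unresolved branches; such deviations are exactly what you should note and discuss.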


With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help.


   


(2.1) The Cenancestor's APSES Domains

Refer to your tree for the following tasks. (Please remember to include your tree in your Assignment submission - it is a result of your computational experiment. It's easiest to copy/paste the tree from the Phylip outfile, rather than copying an image from a Tree viewer). Be specific in your discussion, i.e. refer to specific branchpoints (branchpoints are numbered in the Phylip output) and OTU or gene names in your analysis (see the example below).


  • Consider how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so.


(2.2) Unraveling your organism's APSES domains (2 marks)

 
 

Analysis (2 marks)

Assume that the cladogram for fungi that I have given above is correct, and that the mixed gene tree you have calculated is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to discuss briefly through what sequence of duplications and/or gene loss your organism has ended up with the APSES domains it possesses today. Make specific reference to the cladogram of species and note in particular in case some of your sequences appear to have been placed into regions of the tree where they don't seem to belong. Also note which branchpoints in the evolutionary history of your sequences correspond to speciations and which ones to duplications.


Note: A common confusion about cenancestral genes arises from the fact that far from all expected genes are present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, some may have been missed by you. In general you have to ask: given the species represented in a subclade, what is the last common ancestor of that branch? The expectation is that all descendants of that ancestor should be represented in that branch, unless one of the above reasons for a gene's absence applies.


If your species does not have all the genes you would expect it to have inherited from its ancestors, you MUST note that fact and attempt to explain it.


For example the following discussion for Saccharomyces cerevisiae would be sufficient for full marks:

(Numbers refer to branchpoints of the mixed gene tree, letters to branchpoints of the species tree). I have found five homologues to Saccharomyces cerevisiae Mbp1 and included them in the mixed gene tree. Two subclades are well defined and contain all current species; they branch from 41 (Xbp1) and 50 (Sok2/Phd1). The subclade below 6 includes Mbp1 orthologues as well as Swi4 orthologues that do not appear well resolved. Considering only species below the saccharomycetales branchpoint, I postulate a duplication at that branchpoint that gave rise to yeast Mbp1 and Swi4, since the respective branches contain representatives from all fungi that descended from that branch. There is no good support for the idea that the cenancestor had a Swi4 paralogue. Therefore the cenancestor most likely possessed two paralogues: Mbp1 and Sok2. Saccharomyces cerevisiae has one gene in each of the major subclades; there is no gene loss. It also has an additional paralogue to Sok2: the Phd1 gene that duplicated at branchpoint 3.


(3) Summary of Resources

 

Links
Sequences

[End of assignment]

If you have any questions at all, don't hesitate to mail me at boris.steipe@utoronto.ca or post your question to the Course Mailing List