Difference between revisions of "BIO Assignment Week 7"

From "A B C"
Line 2: Line 2:
 
<div class="b1">
 
<div class="b1">
 
Assignment for Week 8<br />
 
Assignment for Week 8<br />
<span style="font-size: 70%">Homology Modeling</span>
+
<span style="font-size: 70%">Phylogenetic Analysis</span>
 
</div>
 
</div>
 
<table style="width:100%;"><tr>
 
<table style="width:100%;"><tr>
Line 18: Line 18:
  
 
&nbsp;
 
&nbsp;
==Introduction==
 
  
  
<div style="padding: 15px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
+
 
;How could the search for ultimate truth have revealed so hideous and visceral-looking an object?
+
<div style="padding: 2px; background: #F0F1F7;  border:solid 1px #AAAAAA; font-size:125%;color:#444444">
::''<small>Max Perutz (on his first glimpse of the Hemoglobin structure)</small>''
+
 
</div>
 
&nbsp;
 
 
&nbsp;
 
&nbsp;
  
Where is the hidden beauty in structure, and where, the "ultimate truth"? In the previous assignments we have studied sequence conservation in APSES family domains and we have discovered homologues in all fungal species. This is an ancient protein family that had already duplicated to several paralogues at the time the cenancestor of all fungi lived, more than 600,000,000 years ago, in the [http://www.ucmp.berkeley.edu/fungi/fungifr.html Vendian period] of the Proterozoic era of Precambrian times.
+
;Nothing in Biology makes sense except in the light of evolution.
 +
:''Theodosius Dobzhansky''
 +
</div>
  
In order to understand how specific residues in the sequence contribute to the putative function of the protein, and why and how they are conserved throughout evolution, we would need to study an explicit molecular model of an APSES domain protein, bound to its cognate DNA sequence. Explanations of a protein's observed properties and functions can't rely on the general fact that it binds DNA; we need to consider details in terms of specific residues and their spatial arrangement. In particular, it would be interesting to correlate the conservation patterns of key residues with their potential to make specific DNA binding interactions. Unfortunately, no APSES domain structure in complex with bound DNA has been solved up to now, and the experimental evidence we have considered in Assignment 2 ([http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10747782 Taylor ''et al.'', 2000]) is not sufficient to unambiguously define the details of how a DNA double helix might be bound. Moreover, at least two distinct modes of DNA binding are known for proteins of the winged-helix superfamily, of which the APSES domain is a member.
+
... but does evolution make sense in the light of biology?
  
In this and the following assignment you will (1) construct a molecular model of the APSES domain from the Mbp1 orthologue in your assigned species, (2) identify similar structures of distantly related domains for which protein-DNA complexes are known, (3) assemble a hypothetical complex structure and (4) consider whether the available evidence allows you to distinguish between different modes of ligand binding.
+
As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. If two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to the gene in the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of '''phylogenetic analysis'''. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?
  
For the following, please remember the following terminology:
+
We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 (with ''reciprocal best match'') and you have identified the full complement of APSES domain genes in your assigned organism (as a result of your PSI-BLAST search). In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of other fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history. I have prepared APSES domains from six diverse reference species; you will add YFO's APSES domain sequences and compute the phylogram for all genes in order to identify orthologues and paralogues. <!-- Optionally, you will look at structural and functional conservation of residues. -->
  
;Target
+
A number of excellent tools for phylogenetic analysis exist; ''general purpose packages'' include the (free) [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP'''] package, the [http://www.megasoftware.net/ '''MEGA''' package] and the (commercial) PAUP* package. Of these, only MEGA is still under active development, although PHYLIP still functions perfectly (except for problems with graphical windows under Mac OS 10.6). ''Specialized tools'' for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth than others, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume that results that converge when computed with different algorithms are more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.
:The protein that you are planning to model.
 
;Template
 
:The protein whose structure you are using as a guide to build the model.
 
;Model
 
:The structure that results from the modeling process. It has the '''Target sequence''' and is similar to the '''Template structure'''.
 
&nbsp;
 
  
A brief overview article on the construction and use of homology models is linked to the resource section at the bottom of this page. That section also contains links to other sites and resources you might find useful or interesting.
+
However, we will take a shortcut in this assignment (something you should not do in real life). We will skip establishing the reliability of the tree with a bootstrap procedure, i.e. repeat the tree-building a hundred times with partial data and see which branches and groupings are robust and which depend on the details of the data. <small>(If you are interested, have a look [[BIO_bootstrapping_with_PHYLIP| '''here''']] for the procedure for running a bootstrap analysis on the data set you are working with, but this may require a day or so of computing time on your computer.)</small> In this assignment, we will simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.
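To make the bootstrap idea concrete: the standard procedure resamples alignment columns with replacement, so that every replicate has the same width as the original alignment, and then rebuilds the tree for each replicate. The following is only a minimal Python sketch of the resampling step, assuming the alignment is held as a dictionary of equal-length strings; the function name and the toy sequences are illustrative, and in practice PHYLIP handles this step for you.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    names = list(alignment)
    width = len(alignment[names[0]])
    # Draw as many column indices as the alignment is wide, with replacement.
    cols = [rng.randrange(width) for _ in range(width)]
    return {name: "".join(alignment[name][c] for c in cols) for name in names}

# Toy alignment (illustrative fragments, not a real data set).
toy = {"seqA": "MKRKKDDW", "seqB": "MRRRGDDW", "seqC": "MRRRSDDW"}
replicate = bootstrap_replicate(toy, random.Random(1))
```

Each replicate keeps whole columns intact, so residues that were aligned with each other stay aligned; only the choice and multiplicity of columns changes.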
  
  
&nbsp;
+
If you would like to review concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis here and to the resource section at the bottom of this page.
==Warm-up: a minimal change==
 
Minimal changes to structure models can be done directly in Chimera. This illustrates the principle of full-scale modeling quite nicely. For an example, let us consider the residue <code>A&nbsp;42</code> of the 1BM8 structure. It is oriented towards the core of the protein, but most other Mbp1 orthologs have a larger amino acid in this position, <code>V</code>, or even <code>I</code>.
 
 
 
{{task|1=
 
# Open <code>1BM8</code> in Chimera, hide the ribbons and show all atoms as a stick model.
 
# Color the protein white.
 
# Open the sequence window and select <code>A&nbsp;42</code>. Color it red. Choose '''Actions&nbsp;&rarr;&nbsp;Set pivot'''. Then study how nicely the alanine sidechain fits into the cavity formed by its surrounding residues.
 
# To emphasize this better, hide the solvent molecules and select only the protein atoms. Display them as a '''sphere''' model to better appreciate the packing, i.e. the Van der Waals contacts we discussed in class. Use the '''Favorites&nbsp;&rarr;&nbsp;Side view''' panel to move the clipping plane and see a section through the protein. Study the packing, in particular, note that the additional methyl groups of a valine or isoleucine would not have enough space in the structure. Then restore the clipping planes so you can see the whole molecule.
 
# Let's simplify the view: choose '''Actions &rarr; Atoms/Bonds &rarr; backbone&nbsp;only &rarr; chain&nbsp;trace'''. Then select <code>A&nbsp;42</code> again in the sequence window and choose '''Actions &rarr; Atoms/Bonds &rarr; show'''.
 
# Add the surrounding residues: choose '''Select &rarr; Zone...'''. In the window, see that the box is checked that selects all atoms at a distance of less than 5&Aring; to the current selection, and check the lower box to select the whole residue of any atom that matches the distance cutoff criterion. Click '''OK''' and choose '''Actions &rarr; Atoms/Bonds &rarr; show'''.
 
#Select <code>A&nbsp;42</code> again: '''left-click''' (control click) on any atom of the alanine to select the atom, then '''up-arrow''' to select the entire residue. Now let's mutate this residue to isoleucine.
 
#Choose '''Tools &rarr; Structure&nbsp;Editing &rarr; Rotamers''' and select <code>ILE</code> as the rotamer type. Click '''OK''', a window will pop up that shows you the possible rotamers for isoleucine together with their database-derived probabilities; you can select them in the window and cycle through them with your arrow keys. But note that the probabilities are '''very''' different - and thus show you high-energy and low-energy rotamers to choose from. Therefore, unless you have compelling reasons to do otherwise, try to find the highest-probability rotamer that may fit. This is where your stereo viewing practice becomes important, if not essential. It is really, really hard to do this reasonably in a 2D image! It becomes quite obvious in 3D. Btw: I find such "quantitative" work - where the real distances are important - easier in '''orthographic''' than in '''perspective''' view (cf. the '''Camera''' panel).
 
#I find that the first rotamer is actually not such a bad fit. The <code>CD</code> atom comes close to the sidechains of <code>I&nbsp;25</code> and <code>L&nbsp;96</code>. But we can assume that these are somewhat mobile and can accommodate a denser packing, because - as you can easily verify in your Jalview alignment - it is '''NOT''' the case that sequences that have <code>I&nbsp;42</code>, have a smaller residue in position <code>25</code> and/or <code>96</code>. So let's accept the most frequent <code>ILE</code> rotamer by selecting it in the rotamer window and clicking '''OK''' (while '''existing side chain(s): replace''' is selected).
 
#Done.  
 
}}
 
  
If you want to go over this in more detail, check the video tutorial on YouTube published by the NIAID bioinformatics group [http://www.youtube.com/watch?v=bcXMexN6hjY '''here''']. I would also encourage you to go over [http://www.youtube.com/watch?v=eJkrvr-xeXY '''Part 2 of the video tutorial'''] that discusses how to check for and resolve (by energy minimization) steric clashes. But do remember that it is not clear whether energy minimization will make your structure more correct in the sense of a smaller overall RMSD with the real, mutated protein.
+
{{#pmid: 12801728}}
  
What we have done here with one residue is exactly the way homology modeling works with entire sequences. Let's now build a homology model for YFO Mbp1.
+
==Preparing input alignments==
  
==Preparation==
+
In this section, we start from a collection of homologous APSES domains, construct a multiple sequence alignment, and edit the alignment to make it suitable for phylogenetic analysis.
  
===Target sequence===
 
The first step of homology modelling is to determine which sequence to model. We have determined the putative orthologue with conserved function in YFO by reciprocal best match with ''Saccharomyces cerevisiae'' Mbp1. Your sequence was initially found with an APSES domain search in YFO and the alignments with the yeast sequence are straightforward for the most part.
 
  
There are two exceptions, however: in the alignment, the '''ASPFU''' gene XP_754232 and the '''CAPCO''' gene XP_007722875 are both missing part of the domain's N-terminus. This is odd, because it may imply that the APSES domain of these genes is not properly folded. When such surprising alignment results occur, you '''must''' consider whether there could be an error in the published sequence, perhaps stemming from an erroneous gene model. This is not absolutely germane to this assignment, so I have placed the process into the collapsible section below - optional reading. However, it may be useful for you to understand what the issue is here and how to address it.
+
===Principles===
  
<div class="mw-collapsible mw-collapsed" data-expandtext="Expand to read about gene model correction" data-collapsetext="Collapse">
+
In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold '''aligned characters in corresponding positions'''. Phylogeny programs are not meant to revise an alignment but to analyze evolutionary relationships, '''after''' the alignment has been determined. The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable. Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.
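The requirement that all rows contain the same number of characters is easy to check mechanically before handing an alignment to a phylogeny program. A minimal Python sketch follows, assuming the alignment has already been read into a dictionary that maps sequence names to aligned strings; the function name is hypothetical and not part of any package used in this assignment.

```python
def check_alignment(alignment):
    """Return the alignment width, or raise if the rows differ in length."""
    lengths = {name: len(seq) for name, seq in alignment.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"rows differ in length: {lengths}")
    # An empty alignment has width 0.
    return next(iter(lengths.values()), 0)

# Toy example: two aligned rows of equal width.
width = check_alignment({"seqA": "MKRKK", "seqB": "MRR-G"})
```

Note that such a check only confirms the format: it cannot tell whether the characters in a column are actually homologous. That judgement remains yours.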
;Correcting the ASPFU Mbp1 gene model.
 
  
  
<div class="mw-collapsible-content">
+
The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.
An alignment of APSES domain sequences shows the shortened N-termini of the ASPFU and the CAPCO proteins, relative to SACCE and e.g. the closely related ''Aspergillus nidulans'', ASPNI:
 
APSES domains:
 
Mbp1_SACCE  QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAA...
 
Mbp1_ASPNI  NVYSATYSSVPVYEFKIGTDSVMRRRSDDWINATHILKVA...
 
Mbp1_ASPFU  ----------------------MRRRGDDWINATHILKVA...
 
Mbp1_CAPCO  ----------------------MRRRSDDWVNATHILKVA...
 
  
We analyse this for the ASPFU gene.
 
  
Working from the possibility that this may be a gene model error - e.g. a false translational start, a frameshift due to a sequencing error, or an erroneously modelled intron - we check whether the translation of the genomic sequence supports the presence of the expected amino acids. This is easily done by running TBLASTN - BLASTing the protein query against the six reading frames of the ASPFU genome. We find the following:
+
'''Distance based''' phylogeny programs start by using sequence comparisons to estimate evolutionary distances:
  
 +
* they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
 +
* this score is stored in a "distance matrix" ...
 +
* ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).
  
Aspergillus fumigatus Af293 chromosome 3, whole genome shotgun sequence
+
They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
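To illustrate the first two steps of a distance-based method, here is a minimal Python sketch that fills a pairwise distance matrix. It is a simplification under stated assumptions: it uses the plain fraction of differing residues (a "p-distance") rather than a mutation data matrix, it skips any column in which either sequence has a gap, and the short sequences are fragments taken from the APSES alignment shown on this page. The function names are illustrative and do not correspond to any PHYLIP program.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its p-distance."""
    return {(m, n): p_distance(alignment[m], alignment[n])
            for m, n in combinations(alignment, 2)}

fragments = {
    "SACCE": "MKRKKDDWVNATHILKAA",
    "ASPFU": "MRRRGDDWINATHILKVA",
    "CAPCO": "MRRRSDDWVNATHILKVA",
}
dist = distance_matrix(fragments)
```

A tree-building step such as Neighbor Joining would then work from <code>dist</code> alone; this is why distance methods are fast, and also why they discard the column-by-column information that parsimony and ML methods use.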
Sequence ID: ref|NC_007196.1| Length: 4079167 Number of Matches: 2
 
[...]
 
Query  10      VDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILE ...
 
                V VYEF    S+M+R+ DDW+NATHILK A F K  RTRILE ...
 
Sbjct  3691193  VPVYEFKVDGESVMRRRGDDWINATHILKVAGFDKPARTRILE ...
 
  
Indeed, there is sequence upstream of the gene's published translation start that matches well with our query! But where is the correct translation start? For that we need to look at the actual nucleotide sequence and translate it. Remember: BLAST is a '''local''' sequence alignment algorithm and it won't retrieve everything that matches our query, just the best matching segment. ASPFU chromosome 3 is over 4 megabases large, so let us try to obtain only the region we are actually interested in: starting somewhat before base 3691193, let's say at 3691100 (make sure the offset between the two is divisible by three, to stay in the same reading frame), and extending to, say, 3691372.
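The "divisible by three" condition can be checked, and a region start adjusted, with a few lines of code. This is a minimal Python sketch: both function names are hypothetical, and the URL pattern is simply copied from the nuccore link used in this walk-through, so it is an assumption that the server still answers requests in this form.

```python
def in_frame_start(hit_start, approx_start):
    """Move approx_start back by 0-2 bases so that it is in frame with hit_start."""
    remainder = (hit_start - approx_start) % 3
    return approx_start - ((3 - remainder) % 3)

def region_fasta_url(accession, start, end):
    """Build a FASTA URL for a sequence range (pattern taken from this page)."""
    return (f"http://www.ncbi.nlm.nih.gov/nuccore/{accession}"
            f"?from={start}&to={end}&report=fasta")

# 3691193 - 3691100 = 93 = 3 * 31, so this start is already in frame.
start = in_frame_start(3691193, 3691100)
url = region_fasta_url("NC_007196.1", start, 3691372)
```

If the chosen start is not in frame, the function moves it at most two bases further back, so the region never shrinks.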
 
  
#At the [http://www.ncbi.nlm.nih.gov/genome/browse/ '''NCBI genome project site'''] we search for ''Aspergillus fumigatus''.
+
'''Parsimony based''' phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
#At the [http://www.ncbi.nlm.nih.gov/genome/18 '''''Aspergillus fumigatus''''' '''genome project site'''] we click on chromosome 3 to access the map viewer.
 
#Hovering over the ''Download/View sequence'' link shows us how a URL to access sequence data is structured:
 
<nowiki>http://www.ncbi.nlm.nih.gov/projects/mapview/seq_reg.cgi?taxid=746128&chr=3&from=1&to=4079167</nowiki>
 
:We can easily adapt this to the sequence range we need ...
 
<ol start="4">
 
<li>... and follow: http://www.ncbi.nlm.nih.gov/nuccore/NC_007196.1?from=3691003&to=3691243&report=fasta to yield:
 
</ol>
 
>gi|71025130:3691003-3691243 Aspergillus fumigatus Af293 chromosome 3, whole genome shotgun sequence
 
ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTCTGCTACATACAGCAGC
 
GTAAGTCTCTTCTAATTGCGTATCTCTGTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCAT
 
CCATTTTGCCCCTTCCTTCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAA
 
AGTCGATGGCGAAAGTGTTATGCGCCGACGA
 
  
  
<ol start="5">
+
'''ML''', or '''Maximum Likelihood''' methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute-intensive, and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (it runs about an hour on my computer). If the problem is too large, one may split it into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.
<li>To translate this, we navigate to any of the [http://bips.u-strasbg.fr/EMBOSS/ '''EMBOSS''' tools servers] and use "remap" - we want to see the translation matched to the nucleotide sequence. We turn restriction sites off, translate all three forward frames and paste and manually align the SACCE Mbp1 sequence into the output to see what we expect and what we got. I have selected only the frame(s) that actually give a match, and I have pasted the homologous CAPCO and SACCE sequences (lower case) to demonstrate their similarity:
 
</ol>
 
ASPFU    ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTCTCAAAAATCTATTCTGCTACATACAGCAGC
 
                                                                       
 
ASPFU      R  F  A  E  T  G  I  M  A  A  V  D  F  S  K  I  Y  S  A  T  Y  S  S 
 
CAPCO                          m  -  a  f  d  -  k  e  i  y  s  a  t  y  s  n 
 
SACCE                          m  s  -  -  -  -  n  q  i  y  s  a  r  y  s  g
 
 
         
 
ASPFU    GTAAGTCTCTTCTAATTGCGTATCTCTGTTTTCCCTACAGCCTCAAATTTTCCCCAATGCCTCTTTCCAT
 
 
 
ASPFU    V  S  L  F  *  ...
 
CAPCO    v  a -  -    ...
 
SACCE    v  d  -  -    ...
 
         
 
ASPFU    CCATTTTGCCCCTTCCTTCGCCGCGAAGCCAATCTAACGCAGTTCAATAGGTTCCAGTTTACGAGTTCAA
 
                                                              ...  V  Y  E  F  K
 
CAPCO                                                        ...  v  y  e l  k
 
SACCE                                                        ...  v  y  e  f  i
 
         
 
ASPFU      AGTCGATGGCGAAAGTGTTATGCGCCGACGAGGCGATGATTGGATCAATGCTACACATATTCTTAAA
 
 
ASPFU      V  D  G  E  S  V  M  R  R  R  G  D  D  W  I  N  A  T  H  I  L  K ...
 
CAPCO      v  a g  d  h  i  m  r  r  r  s  d  d  w  v  n  a t  h  i  l  k ...
 
SACCE      h  s  t  g  s  i  m  k  r  k  k  d  d  w  v  n  a  t  h  i  l  k ...
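The forward-frame translation shown above can be reproduced in a few lines. This is a minimal Python sketch using the standard genetic code; it is not the EMBOSS "remap" program, and the function names are illustrative. Unknown or ambiguous codons are reported as <code>X</code>, which is a choice made for this sketch.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna, frame=0):
    """Translate one forward frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def three_forward_frames(dna):
    """Return the translations of all three forward reading frames."""
    return [translate(dna, frame) for frame in range(3)]

# First line of the ASPFU genomic region retrieved above.
line1 = ("ACGGTTTGCGGAGACGGGCATTATGGCGGCGGTGGATTTC"
         "TCAAAAATCTATTCTGCTACATACAGCAGC")
frames = three_forward_frames(line1)
```

For this stretch, the second frame (<code>frames[1]</code>) is the one that contains the <code>MAAVDFSKIYSATYSS</code> segment shown in the annotated output.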
 
  
 +
ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.
  
:This clearly shows us that there is N-terminal sequence that ought to be added to the gene model, upstream of the reported translational start of <tt>MRRR...</tt>. The sequences thus most likely begin as follows:
 
  
ASPFU  MAAVDFSKIYSATYSSVSLFVYEFKVDGE-----SVMRRRGDDWINATHILK...
+
'''Bayesian''' methods don't estimate the tree that gives the highest likelihood for the observed data, but find the most probable tree, given that the data have been observed. If this sounds conceptually similar to you, then you are not wrong. However, the approaches employ very different algorithms. And Bayesian methods need a "prior" on trees before observation.
CAPCO  ma-fd-keiysatysnva--vyelkvagd-----himrrrsddwvnathilk...
 
SACCE  ms----nqiysarysgvd--ysgvdvyefihstgsimkrkkddwvnathilk...
 
  
The fact that the truncated N-terminus appears in both closely '''related''' genes and species suggests that what we see here is a mis-annotated intron. The take-home lesson is: if your retrieved protein sequence does not conform to your expectations, it may be worthwhile to follow up with the actual nucleotide sequence.
 
  
</div>
+
===Choosing sequences===
</div>
 
  
  
&nbsp;
+
In principle, we have discussed strategies for using PSI-BLAST to collect suitable sequences earlier. To prepare the process, I have collected all APSES domains for six reference fungal species, together with the KilA-N domain of ''E. coli''. The process is explained on the [[Reference APSES domains (reference species)|reference APSES domains page]].
  
===Template choice and template sequence===
 
  
 +
====Renaming sequences====
  
The [http://swissmodel.expasy.org/ SWISS-MODEL] server provides several different options for constructing homology models. The easiest option requires only a target sequence as input. In this mode the program will automatically choose suitable templates and create an input alignment. I would argue however that that is not the best way to use such a service: template choice and alignment both may be significantly influenced by biochemical reasoning, and an automated algorithm cannot make the necessary decisions. Should you use a structure of reduced resolution that however has a ligand bound? Should you move an indel from an active site to a loop region even though the sequence similarity score might be less? Questions like that may yield answers that are counter to the best choices an automated algorithm could make. But Swiss Model is flexible and allows us to upload an explicit alignment between target and template. Please note: the model you will produce is "easy" - the sequence similarity is high and there are no indels to consider, the automated mode would have done just as well. But the strategy we pursue here is suitable also for much more difficult problems. The automated strategy probably is not.
 
  
Template choice is the first step. Often more than one related structure can be found in the PDB. We have touched on principles of selecting template structures in the lectures; please refer to the [[Template_choice_principles|template choice principles]] page on this Wiki where I have reviewed the principles and discussed more details and alternatives. One can either search the PDB itself through its '''Advanced Search''' interface; for example one can search for sequence similarity with a BLAST search, or search for structural similarity by accessing structures according to their CATH or SCOP classification. But the BLAST search is probably the method of choice: after all, the most important measure of the probability of success for homology modeling is sequence similarity.
+
Renaming sequences so that their species is apparent is crucial for the interpretation of mixed gene trees. Refer to the [[Reference APSES domains (reference species)|reference APSES domains page]] to see how I have prepared the FASTA sequence headers.
  
In [[BIO_Assignment_Week_3#Search_input|Assignment 3]], you have defined the extent of the APSES domain in yeast Mbp1. In [[BIO_Assignment_Week_6|Assignment 6]], you have used PSI-BLAST to search for APSES domains in YFO. In [[BIO_Assignment_Week_7|Assignment 7]] you have confirmed by ''Reciprocal Best Match'' which of these APSES domain sequences is the closest related orthologue to yeast Mbp1. This sequence is the best candidate for having a conserved function similar to yeast Mbp1. Therefore, this sequence is the one you will model: it is called the '''target''' for the homology modeling procedure. In the same assignment you have also computed a multiple sequence alignment that includes the sequence of  Mbp1 with YFO.
 
  
Defining a '''template''' means finding a PDB coordinate set that has sufficient sequence similarity to your '''target''' that you can build a model based on that '''template'''. In  [[BIO_Assignment_Week_2#Structure_search|Assignment 2]] you have used a keyword search at the PDB to find "Mbp1" structures - but some of these structures were not homologs: keyword searches are notoriously unreliable. To find suitable PDB structures, we will perform a BLAST search at the PDB instead.
+
===Adding an outgroup===
  
  
<!-- NOTE TO SELF: use the following sequence to test the procedure
+
To analyse phylogenetic trees it is useful (and for some algorithms required) to define an outgroup, a sequence that presumably diverged from all other sequences in a clade before they split up among themselves. Wherever the outgroup inserts into the tree, this is the root of the rest of the tree. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. I have defined an outgroup sequence and added it to the [[Reference APSES domains (reference species)|reference APSES domains page]]. The procedure is explained in detail on that page.
>Mbp1_SCHPO/2-100 NP_593032
 
AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRVLERQVQIGAHEKVQGGYGKYQG
 
TWVPFQRGVDLATKYKVDGIMSPILSL
 
>1BM8_A
 
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQG
 
TWVPLNIAKQLAEKFSVYDQLKPLFDF
 
-->
 
  
 +
>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
 +
<span style="color: #999999;">MTSFQLSLISRE</span>IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
 +
FKGGRPENQGTWVHPDIAINLAQ<span style="color: #999999;">WLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
 +
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
 +
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF</span>
 +
''E. coli'' KilA-N protein. Residues that do not align with APSES domains are shown in grey.
  
  
 +
===Calculating alignments===
  
 
{{task|1=
 
{{task|1=
# Retrieve your YFO Mbp1-like APSES domain sequence. You can find the domain boundaries for the yeast protein in the [[Reference annotation yeast Mbp1|Mbp1 annotation reference page]], and you can get the aligned sequence from your Jalview alignment, or simply recompute it with the <code>needle</code> program of the EMBOSS suite. This YFO sequence is your '''target''' sequence.
+
#Navigate to the [[Reference APSES domains (reference species)|reference APSES domains page]] and copy the APSES/KilA-N domain sequences.
# Navigate to the [http://www.pdb.org/pdb/home/home.do PDB].
+
#Open Jalview, select '''File &rarr; Input Alignment &rarr; from Textbox''' and paste the sequences into the textbox.
# Click on '''Advanced''' to enter the advanced search interface.
+
#Add the APSES domain sequences '''from your species (YFO)''' that you have previously defined through PSI-BLAST. Don't worry that the sequences are longer, the MSA algorithm should be able to take care of that. However: do rename your sequences to follow the pattern for the other domains, i.e. edit the FASTA header line to begin with the five-letter abbreviated species code.
# Open the menu to '''Choose a Query Type:'''
+
#When all the sequences are present, click on '''New Window'''.
# Find the '''Sequence features''' section and choose '''Sequence (BLAST...)'''
+
#In Jalview, select Web Service &rarr; Alignment &rarr; MAFFT Multiple Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
# Paste your '''target''' sequence into the '''Sequence''' field, select '''not''' to mask low-complexity regions and '''Submit Query'''. Since the E-value cutoff is set rather high by default, you will get a number of low-confidence hits as well as the actual homologs; the latter have very low E-values.
+
#Choose any colour scheme and add '''Colour &rarr; by Conservation'''. Adjust the slider left or right to see which columns are highly conserved.
 +
#Save the alignment as a Jalview project before editing it for phylogenetic analysis. You may need it again.  
 +
}}
  
All hits that are homologs are potentially suitable '''templates''', but some are more suitable than others. Consider how the coordinate sets differ and which features would make each more or less suitable for creating a homology model: you should consider ...
+
===Editing sequences===
 +
As discussed in the lecture, we should edit our alignments to make them suitable for phylogeny calculations. Here are the principles:
  
:*sequence similarity to your target
+
Follow the fundamental principle that '''all characters in a column should be related by homology'''. This implies the following rules of thumb:
:*size of expected model (= length of alignment)
 
:*presence or absence of ligands
 
:*experimental method and quality of the data set
 
  
Sequence similarity is the most important, but we can have the PDB tabulate the other features concisely for this task.  
+
*Remove all stretches of residues in which the ''alignment'' appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
 +
*Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains. You want to only retain the APSES domains. All the extra residues from the YFO sequence can be deleted.
 +
*Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
 +
*Remove all but approximately one column from gapped regions '''in those cases where the presence of several related insertions suggests that the indel is real, and not just an alignment artefact.''' (Some researchers simply remove all gapped regions.)
 +
*Remove sections N- and C- terminal of gaps where the alignment appears questionable. 
 +
*If the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory try removing columns of sequence. Or remove species that you are less interested in from the alignment.
 +
*Move your outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.  
  
# There is a menu to create '''Reports:''' - select '''customizable table'''.
+
====Handling indels====
# Select (at least) the following information items:
 
;Structure Summary
 
* Experimental Method
 
;Sequence
 
* Chain Length
 
;Ligands
 
* Ligand Name
 
;Biological details
 
* Macromolecule Name
 
; refinement Details
 
* Resolution
 
* R Work
 
* R free
 
# click: '''Create report'''.
 
  
Unfortunately you don't get the E-values into the report, and those should strongly influence your final decision. However, in our case the sequences and therefore the E-values of the top three hits are all the same. None of the structures has a bound DNA ligand, but the experimental methods and structure quality are different. Two of the sequences have a longer chain length ... but the extra residues are all disordered (otherwise these would be better-suited templates; regrettably, in the ''real world'' you would need to check this yourself, since there is no automatic tool to evaluate disorder and its effects on template choice). In my opinion that leaves pretty much only one unambiguous choice: 1BM8. In case you don't agree, please let me know.
+
Gaps are a real problem, as usual. Strictly speaking, the similarity score of an '''alignment''' program as well as the distance score of a '''phylogeny''' program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However, in an optimal sequence alignment with gaps, this is no longer strictly true, since a one-character gap creation has a different penalty score than a one-character gap extension! Most '''alignment''' programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most '''phylogeny''' programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one-letter code "-". Thus gap insertion and gap extension characters get the same score. For short indels, this '''underestimates''' the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than as one long indel) and this '''overestimates''' the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but one or two columns of gapped sequence, or to remove such columns altogether.
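The column editing described above can also be sketched in a few lines of R. This is a minimal illustration only, not part of Jalview or PHYLIP: the function name <code>collapseGapColumns</code>, its <code>keep</code> parameter and the two toy sequences are assumptions made for this example. It treats every column that contains at least one "-" as gapped and keeps at most <code>keep</code> columns from each run of consecutive gapped columns.

<source lang="rsplus">
# collapseGapColumns.R
# Illustrative sketch: names and toy data are assumptions, not course material.

collapseGapColumns <- function(seqs, keep = 1) {
  # seqs: named character vector of aligned sequences, all of the same length
  # keep: number of columns to retain from each run of gapped columns
  if (length(unique(nchar(seqs))) != 1) {
    stop("All sequences must have the same length.")
  }
  # one row per sequence, one column per alignment position
  m <- do.call(rbind, strsplit(seqs, ""))
  # TRUE for every column in which at least one sequence has a gap
  hasGap <- apply(m == "-", 2, any)
  # position of each column within its run of consecutive gapped columns
  runPos <- numeric(length(hasGap))
  n <- 0
  for (i in seq_along(hasGap)) {
    n <- if (hasGap[i]) n + 1 else 0
    runPos[i] <- n
  }
  keepCol <- runPos <= keep
  out <- apply(m[ , keepCol, drop = FALSE], 1, paste, collapse = "")
  names(out) <- names(seqs)
  out
}

toy <- c(A = "MK--LV", B = "MKAALV")
collapseGapColumns(toy)            # keeps one of the two gapped columns
collapseGapColumns(toy, keep = 0)  # removes gapped columns altogether
</source>

Setting <code>keep = 0</code> corresponds to the stricter practice of removing all gapped regions. Whether a gapped column reflects a real indel or an alignment artefact still has to be judged by eye.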
  
;Finally: Click on the 1BM8 ID to navigate to the structure page for the '''template''' and save the FASTA sequence to your computer. This is '''the template sequence'''.
 
  
}}
+
[[Image:EditingGuide.jpg|frame|none|(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. '''a''': raw alignment (CLUSTAL format); '''b''': sequences assembled into single lines; '''c''': columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; '''d''': input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the [http://evolution.genetics.washington.edu/phylip/doc/sequence.html PHYLIP sequence format guide].]]
  
  
&nbsp;
+
{{task|1=
  
 +
Prepare a PHYLIP input file from the sequences you have prepared following the principles above. The simplest way to achieve this appears to be:
  
 +
##Copy the sequences you want into a text file. Make sure the "reference sequences", the outgroup and the sequences from YFO are all included.
 +
##In a browser, navigate to the [http://www-bimas.cit.nih.gov/molbio/readseq/ '''Readseq sequence conversion service'''].
 +
##Paste your sequences into the form and choose '''Phylip''' as the output format. Click on '''submit'''.
 +
##Save the resulting page as a text file. Give it some useful name such as <code>APSES_domains.phy</code>.
  
===Sequence numbering===
+
}}
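If the Readseq service is not reachable, the conversion can be approximated in R. The following is a rough sketch under stated assumptions: the function name <code>fastaToPhylip</code> and the file name in the usage note are made up for this example, and the output is the simple one-line-per-sequence layout (number of sequences and alignment length on the first line, then a name padded to 10 characters followed directly by the sequence). Check the result against the PHYLIP sequence format guide before you use it.

<source lang="rsplus">
# fastaToPhylip.R
# Illustrative sketch: function name and file names are assumptions.

fastaToPhylip <- function(fastaLines) {
  # fastaLines: character vector, one element per line of an aligned
  #             multi-FASTA file (e.g. from readLines())
  fastaLines <- trimws(fastaLines)
  fastaLines <- fastaLines[fastaLines != ""]
  isHeader <- grepl("^>", fastaLines)
  if (length(fastaLines) == 0 || !isHeader[1]) {
    stop("Input must begin with a FASTA header line.")
  }
  # first word of each header line, without the ">"
  ids <- sub("^>[[:space:]]*([^[:space:]]*).*$", "\\1", fastaLines[isHeader])
  # concatenate the sequence lines that belong to each header
  group <- cumsum(isHeader)
  seqs <- vapply(seq_along(ids), function(i) {
    paste(fastaLines[group == i & !isHeader], collapse = "")
  }, character(1))
  if (length(unique(nchar(seqs))) != 1) {
    stop("Sequences differ in length: this is not an alignment.")
  }
  # PHYLIP names must not be longer than 10 characters
  ids <- substr(ids, 1, 10)
  if (anyDuplicated(ids) > 0) {
    stop("Names are not unique after truncation to 10 characters.")
  }
  c(sprintf(" %d %d", length(seqs), nchar(seqs[1])),
    paste0(formatC(ids, width = 10, flag = "-"), seqs))
}

toy <- c(">MBP1_SACCE toy example", "MK-L", "V",
         ">OUTGROUP_ECOLI", "MKALV")
cat(fastaToPhylip(toy), sep = "\n")
</source>

To convert a file, something like <code>writeLines(fastaToPhylip(readLines("APSES_domains.fa")), "APSES_domains.phy")</code> should work; the input file name here is hypothetical. Note that truncating names to 10 characters can make them ambiguous, which is why the sketch stops rather than silently writing duplicates.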
  
  
&nbsp;
+
==Calculating trees==
  
It is not at all straightforward how to number sequences in such a project. A "natural" numbering starts with the start-codon of the full length protein and goes sequentially from there. However, this does not map exactly to other numbering schemes we have encountered. As you know, the first residue of the APSES domain (as defined by CDD) is not Residue 1 of the Mbp1 protein. The first residue of the 1BM8 FASTA file <small>(one of the related PDB structures)</small> '''is''' the fourth residue of the Mbp1 protein. The first residue in the structure is GLN 3, therefore Q is the first residue in a FASTA sequence derived from the coordinate section of the PDB file (the <code>ATOM  </code> records). In the 1MB1 structure, the original N-terminal amino acids are present in the molecule, therefore they are present in the FASTA file which starts with <code>MSNQIY...</code>, but they are disordered in the structure and no coordinates are present for M and S. A sequence derived explicitly from the coordinates is therefore different from the reported FASTA sequence, which is really bad because that is what the modeling program has to work with ... and so on. It can get complicated. You need to remember: a sequence number is not absolute, but assigned in a particular context, and you need to be careful how you do this.
+
In this section we perform the actual phylogenetic calculation.
  
Fortunately, the numbering for the residues in the coordinate section of our '''target''' structure corresponds not to its FASTA sequence, but to the numbering of the gene. Otherwise we would need to renumber the sequence <small>(e.g. by using the bio3D R package)</small>. If we did not do this, the sequence numbers in the model might not correspond to the sequence numbers of our target.
+
{{task|1=
  
<!--
+
#Download the PHYLIP package from the [http://evolution.genetics.washington.edu/phylip.html Phylip homepage] and install it on your computer.
BELOW IS NOT NECESSARY FOR THE 1BM8 TEMPLATE. ALSO extraction can be done with bio3D
+
# Make a copy of your PHYLIP formatted sequence alignment file and name it <code>infile</code>. Note: make sure that your Microsoft Windows operating system does not silently append the extension ".txt" to your file. It should be called "infile", nothing else. Place this file into the directory where the PHYLIP executables reside on your computer.
 +
#Run the '''proml''' program of PHYLIP (protein sequences, maximum likelihood tree) to calculate a phylogenetic tree (on the Mac, use proml.app). The program will automatically use "infile" for its input. Use the default parameters except that you should change option <code>S: Speedier but rougher analysis?</code> to <code>No, not rough</code> - your analysis should not sacrifice accuracy for speed. The calculation may take some fifteen minutes or so.
  
  
The homology '''model''' will be based on an alignment of '''target''' and '''template'''. Thus we have to define the target sequence. As discussed in class, PDB files have an explicit  and an implied sequence and these do not necessarily have to be the same. To compare the implied and the explicit sequence for the template, you need to extract sequence information from coordinates. One way to do this is via the Web interface for [http://swift.cmbi.ru.nl/servers/html/index.html '''WhatIf'''], a crystallography and molecular modeling package that offers many useful tools for coordinate manipulation tasks.  
+
The program produces two output files: the <code>outfile</code> contains a summary of the run, the likelihood of bifurcations, and '''an ASCII representation of the tree'''. Open it with your usual text editor to have a look, and save the file with a meaningful name. The <code>outtree</code> contains the resulting tree in so-called "Newick" format. Again, have a look and save it with a meaningful filename.
  
  
*Navigate to the '''Administration''' sub-menu of the [http://swift.cmbi.ru.nl/servers/html/index.html WhatIf Web server]. Follow the link to '''Make sequence file from PDB file'''. Enter the PDB-ID of your template into the form field and '''Send''' the request to the server. The server accesses the PDB file and extracts sequence information directly from the <code>ATOM&nbsp;&nbsp;</code> records of the file. The results will be returned in PIR format. Copy the results, edit them to FASTA format and save them in a text-only file. Make sure you create a valid FASTA formatted file! Use this '''implied''' sequence to check if and how it differs from the sequence ...
+
}}
  
:*... listed in the <code>SEQRES</code> records of the coordinate file;
 
:*... given in the FASTA sequence for the template, which is provided by the PDB;
 
:*... stored in the protein database of the NCBI.
 
: and record your results.
 
  
* Establish how the sequence numbers in the coordinate section of your template(*) correspond to your target sequence numbering.
+
<!-- Bootstrapping ...
 
+
* run seqboot
:(*) <small>These residue numbers are important, since they are referenced e.g. by VMD when you visualize the structure. The easiest way to list them is via the ''Sequence Viewer'' extension of VMD.</small>.
+
* rename outfile to infile
:<small>Don't do this for every residue individually but define ranges. Look at the correspondence of the first and last residue of target and template sequence and take indels into account. Establishing sequence correspondence precisely is crucially important! For example, when a publication refers to a residue by its sequence number, you have to be able to relate that number to the residue numbers of the model as well as your target sequence.</small>.
+
* rerun proml, use option M for multiple datasets with speedy option (use "jumble" of 1)
&nbsp;
+
* rename outtree to intree
&nbsp;
+
* run consense
 +
* Use option R to define trees as rooted
  
 +
Should run at least overnight.
 
-->
 
-->
  
 +
==Analysing your tree==
  
&nbsp;
+
In order to analyse your tree, you need a species tree as reference. Then you can begin comparing your expectations with the observed tree.
  
  
===The input alignment===
+
===The species tree reference===
  
  
&nbsp;
+
I have constructed a cladogram for many of the species we are analysing, based on data published for 1551 fungal ribosomal sequences. The six reference species are included. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
The sequence alignment between target and template is the single most important factor that determines the quality of your model. No comparative modeling process will repair an incorrect alignment; it is useful to consider a homology model rather like a three-dimensional map of a sequence alignment rather than a structure in its own right. In a homology modeling project, typically the largest amount of time should be spent on preparing the best possible alignment. Even though automated servers like the SwissModel server will align sequences and select template structures for you, it would be unwise to use these just because they are convenient. You should take advantage of the much more sophisticated alignment methods available. Analysis of wrong models can't be expected to produce right results.
 
  
The best possible alignment is usually constructed from a multiple sequence alignment that includes at least '''the target and template sequence''' and other related sequences as well. The additional sequences are an important aid in identifying the correct placement of insertions and deletions. Your alignment should have been carefully reviewed by you and wherever required, manually adjusted to move insertions or deletions between target and template out of the secondary structure elements of the template structure.
+
[[Image:FungiCladogram.jpg|frame|none|Cladogram of many fungi studied in the assignments. This cladogram is based on small subunit ribosomal RNA sequences, and largely follows ''Tehler et al.'' (2003) ''Mycol Res.'' '''107''':901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity.]]
  
In most of the Mbp1 orthologues, we do not observe indels in the APSES domain regions. Evolutionary pressure on the APSES domains has selected against indels in the more than 600 million years these sequences have evolved independently in their respective species. To obtain an alignment between the '''template sequence''' and the '''target sequence''' from your species, proceed as follows.
+
Your species may not be included in this cladogram, but you can easily create your own species tree with the following procedure:
 
 
 
 
&nbsp;
 
  
 
{{task|1=
 
{{task|1=
Choose one of the following options to align your '''target''' and '''template''' sequences.
+
#Access the [http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=retrieve&db=taxonomy NCBI taxonomy database Entrez query page].
 +
#Edit the list of reference species below to include your species and paste it into the form.
  
 +
"Aspergillus nidulans"[Scientific Name] OR
 +
"Candida albicans"[Scientific Name] OR
 +
"Neurospora crassa"[Scientific Name] OR
 +
"Saccharomyces cerevisiae"[Scientific Name] OR
 +
"Schizosaccharomyces pombe"[Scientific Name] OR
 +
"Ustilago maydis"[Scientific Name]
  
;In Jalview...
+
#Next, as '''Display Settings''' option, select '''Common Tree'''.
* Load your Jalview project with aligned APSES domain sequences or recreate it from the Mbp1 orthologue sequences from the [[Reference Mbp1 orthologues (all fungi)|'''Mbp1 protein orthologs page''']] that I prepared for Assignment 7. Include the sequence of your '''template protein''' and re-align.
 
* Delete all sequences you no longer need, i.e. keep only the APSES domains of the '''target''' (from your species) and the '''template''' (from the PDB) and choose '''Edit &rarr; Remove empty columns'''. This is your '''input alignment'''.
 
* Choose '''File&rarr;Output to textbox&rarr;FASTA''' to obtain the aligned sequences. They should both have exactly the same length, i.e. N- or C- termini have to be padded by hyphens if the original sequences had different length. Save the sequences in a text-file.
 
  
 +
You can use that tree as is - or visualize it more nicely as follows:
  
;Using a different MSA program
+
#Select the '''phylip tree''' option from the menu, and click '''save as''' to save the tree in phylip (Newick) tree format.
* Copy the FASTA formatted sequences of the Mbp1 proteins in the reference  species from the [[Reference APSES domains (reference species)|'''Reference APSES domain page''']].
+
#The output can be edited and visualized in any program that reads phylip trees. One particularly nice viewer is the [http://itol.embl.de/ '''iTOL''' - Interactive Tree of Life project]. Copy the contents of the <code>phyliptree.phy</code> file that the NCBI page has written, navigate to the iTOL project, click on the '''Data Upload''' tab, paste your tree data and click '''Upload'''. Then '''go to the main display page''' to view the tree. Change the view from '''Circular''' to '''Normal'''.
* Access e.g. the MSA tools page at the EBI.  
 
* Paste the Mbp1 sequence set, your '''target''' sequence and the '''template''' sequence into the input form.
 
*Run the alignment and save the output.
 
 
 
 
 
;Using the EMBOSS explorer
 
* Use the <code>needle</code> tool for the alignment  ... but remember that pairwise alignments will only be suitable in case the alignment is absolutely unambiguous (such as here) . If there are any indels, an MSA will give much more reliable information.
 
 
 
 
 
;By hand
 
APSES domains are strongly conserved and have few if any indels. You could also simply align by hand.
 
 
 
* Copy the CLUSTAL formatted reference alignment of the Mbp1 proteins in the reference species from the [[Reference APSES domains (reference species)|'''Reference APSES domain page''']].
 
* Open a new file in a text editor.
 
* Paste the Mbp1 sequence set, your '''target''' sequence and the '''template''' sequence into the file.
 
*Align by hand, replace all spaces with hyphens and save the output.
 
 
}}
 
}}
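The Entrez query used in the task above follows a simple pattern: each species name is quoted, tagged with <code>[Scientific Name]</code> and joined to the next one with <code>OR</code>. As a small sketch, the query text can be assembled in R so that adding your own species is a one-line change. The function name <code>entrezSpeciesQuery</code> is an assumption made for this example; the species list and the search-field tag are taken from the task.

<source lang="rsplus">
# entrezSpeciesQuery.R
# Illustrative sketch: the function name is an assumption.

entrezSpeciesQuery <- function(species) {
  # species: character vector of binomial species names
  # returns a single string to paste into the Entrez query form
  paste(sprintf('"%s"[Scientific Name]', species), collapse = " OR\n")
}

refSpecies <- c("Aspergillus nidulans",
                "Candida albicans",
                "Neurospora crassa",
                "Saccharomyces cerevisiae",
                "Schizosaccharomyces pombe",
                "Ustilago maydis")

# append the name of your own species to refSpecies before printing
cat(entrezSpeciesQuery(refSpecies), "\n")
</source>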
  
 +
;Alternatively ...
 +
You can look up your species in the latest version of the species tree for the fungi:
 +
{{#pmid: 22114356}}
  
Whatever method you use: the result should be a two sequence alignment in '''multi-FASTA''' format, that was constructed from a number of supporting sequences and that contains your aligned '''target''' and '''template''' sequence. This is your '''input alignment''' for the homology modeling server. For a ''Schizosaccharomyces pombe'' model, which I am using as an example here, it looks like this:
+
===Visualizing the tree===
 
 
>1BM8_A
 
QIYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHILKAANFAKAKRTRI
 
LEKEVLKETHEKVQGGFGKYQGTWVPLNIAKQLAEKFSVYDQLKPLFDF
 
>Mbp1_SCHPO 2-100 NP_593032
 
AVHVAVYSGVEVYECFIKGVSVMRRRRDSWLNATQILKVADFDKPQRTRV
 
LERQVQIGAHEKVQGGYGKYQGTWVPFQRGVDLATKYKVDGIMSPILSL
 
 
 
 
 
&nbsp;
 
 
 
==Homology model==
 
 
 
 
 
&nbsp;
 
 
 
 
 
===SwissModel===
 
  
&nbsp;<br>
 
  
Access the Swissmodel server at '''http://swissmodel.expasy.org''' and click on '''Start Modelling'''. Then, under the '''Supported Inputs''', click on '''Target-Template Alignment'''.
+
Once Phylip is done calculating the tree, the tree in a text format will be contained in the Phylip <code>outfile</code> - the documentation of what the program has done. Open this textfile for a first look. The tree is complicated and it can look confusing at first. The tree in Newick format is contained in the Phylip file <code>outtree</code>. Visualize it as follows:
  
 
{{task|1=
 
{{task|1=
*Paste your alignment for target and model into the form field. Click on the question mark next to "Supported Inputs" if you are not sure about the format. SwissModel will analyse the sequences and ask you to identify target and template. The YFO sequence is your target. The 1BM8 sequence is the template.
 
  
* Click '''Validate Target Template Alignment''' and check that the returned alignment is correct.
 
  
*Click '''Build Model''' to start the modeling process.
+
#Open <code>outtree</code> in a texteditor and copy the tree.
 +
#Visualize the tree in alternative representations:
 +
##I have already mentioned the [http://itol.embl.de/ '''iTOL''' - Interactive Tree of Life project] viewer.
 +
##Navigate to the [http://www.proweb.org/treeviewer/ Proweb treeviewer], paste and visualize your tree.
 +
##Navigate to the [http://www.trex.uqam.ca/index.php?action=newick&project=trex Trex-online Newick tree viewer] for an alternative view. Visualize the tree as a phylogram. You can increase the window height to keep the labels from overlapping.
 +
# A particularly useful viewer is actually Jalview.
 +
##Open Jalview, copy the sequences you have used and paste them via '''File &rarr; Input Alignment &rarr; from Textbox'''.
 +
##In the alignment window, choose '''File &rarr; Load associated Tree''' and load the Phylip <code>outtree</code> file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades (plus the outgroup). This view is particularly informative, since you can associate the clades of the tree with the actual sequences in the alignment, and get a good sense of what sequence features the tree is based on.
 +
##Try the '''Calculate &rarr; Sort &rarr; By Tree Order''' option to sort the sequences by their position in the tree. Also note that you can flip the tree around a node by double-clicking on it. This is especially useful: try to rearrange the tree so that the subdivisions into clades are apparent. Clicking into the window "cuts" the tree and colours your sequences according to the clades in which they are found. This is useful to understand what particular sequences contributed to which part of the phylogenetic inference.
 +
##Study the tree: understand what you see and what you would have expected.  
  
* The resulting page returns information about the resulting model. Mouse over the '''Model 01''', open the '''PDB file''' and save the coordinates to your computer. Read the information on what is being returned by the server (click on the question mark icon). Study the quality measures.
+
}}
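Before you interpret the tree, it can be useful to confirm that every sequence of your alignment actually appears as a leaf in <code>outtree</code>. The following R sketch pulls the leaf labels out of a Newick string with a regular expression. It assumes simple, unquoted labels without spaces or parentheses (as in PHYLIP output); the function name <code>newickLeafNames</code> and the toy tree are made up for this example, and this is not a general Newick parser.

<source lang="rsplus">
# newickLeafNames.R
# Illustrative sketch: function name and toy tree are assumptions.

newickLeafNames <- function(newick) {
  # newick: character vector holding a Newick tree, possibly split
  #         over several lines (e.g. from readLines("outtree"))
  newick <- paste(newick, collapse = "")
  # a leaf label directly follows "(" or ","; labels of internal
  # nodes and branch lengths follow ")" or ":" and are not matched
  hits <- regmatches(newick,
                     gregexpr("[(,][[:space:]]*[^[:space:]():,;]+", newick))[[1]]
  sub("^[(,][[:space:]]*", "", hits)
}

toyTree <- "((MBP1_SACCE:0.1,SWI4_SACCE:0.2):0.05,(MBP1_SCHPO:0.3,OUT_ECOLI:0.4):0.1);"
newickLeafNames(toyTree)
</source>

Comparing the result with the names in your PHYLIP input file, e.g. with <code>setdiff()</code>, shows whether any sequence was dropped or renamed along the way.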
  
* Also save:
 
  
** The output page as pdf (for reference)
 
** The modeling report (as pdf)
 
}}
 
  
==Model analysis==
+
Here are two principles that will help you make sense of the tree.
  
&nbsp;
 
&nbsp;
 
  
=== The PDB file ===
+
A: '''A gene that is present in an ancestral species is inherited in all descendant species'''. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event).
&nbsp;<br>
 
  
{{task|1=
+
B: '''Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants'''; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their LCA.
Open your '''model''' coordinates in a text-editor (make sure you view the PDB file in a fixed-width font (like "courier") so all the columns line up correctly) and consider the following questions:
 
  
*What is the residue number of the first residue in the '''model'''? What should it be, based on the alignment? If the putative DNA binding region was reported to be residues 50-74 in the Mbp1 protein, which residues of your '''model''' correspond to that region?
 
}}
 
  
<!-- discuss flagging of loops - setting of B-factor to 99.0 phps. ANOLEA vs. Gromos ... packing vs. energy? -->
+
With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help.
  
 +
===The APSES domains of LCA===
  
===R code: renumbering the model ===
+
Note: A common confusion about cenancestral genes (LCA = Last Common Ancestor) arises from the fact that far from all expected genes are present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, and some may have diverged beyond recognition. In general you have to ask: '''given the species represented in a subclade, what is the last common ancestor of that branch'''? The expectation is that '''all''' descendants of that ancestor should be represented in that branch '''unless''' one of the above reasons for a gene's absence applies.
  
As you have seen, SwissModel numbers the first residue "1" and does not keep the numbering of the template. We should renumber the model so we can compare the model and the template with the same residue numbers. Fortunately there is a very useful R package that will help us with that.
 
  
 
{{task|1=
 
{{task|1=
# Navigate to the [http://thegrantlab.org/bio3d/index.php '''bio3D'''] home page. '''bio3d''' is not available for installation via CRAN, but needs to be installed from source. Instructions for the different platforms are here http://thegrantlab.org/bio3d/tutorials/installing-bio3d Follow the instructions and install '''bio3d''' for '''R''' on your platform.
 
  
# Explore and execute the following '''R''' script. I am assuming that your model is in your working directory, change paths and filenames as required.
 
  
<source lang="rsplus">
+
* Consider how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so. Note that the hallmark of a clade that originated in the cenancestor is that it contains species from '''all''' subsequent major branches of the species tree.  
# renumberPDB.R
 
  
# This is a simple renumbering script that uses the bio3D
 
# package. We simply set the first residue number to what it
 
# should be and renumber all residues based on the first one.
 
# The script assumes your input PDBfile is in your working
 
# directory.
 
  
# To run this, you must have installed the bio3D R package; instructions
+
}}
# are here: http://thegrantlab.org/bio3d/tutorials/installing-bio3d
 
  
setwd("~/my/working/directory")
 
PDBin      <- "YFO_model.pdb"
 
PDBout    <- "YFO_model_ren.pdb"
 
  
first <- 4  # residue number that the first residue should have
 
 
# ================================================
 
#    Read coordinate file
 
# ================================================
 
 
# read PDB file using bio3D function read.pdb()
 
library(bio3d)
 
pdb  <- read.pdb(PDBin) # read the PDB file into a list
 
  
pdb            # examine the information
+
===The APSES domains of YFO===
pdb$atom[1,]  # get information for the first atom
 
  
# you can explore ?read.pdb and study the examples.
+
Assume that the cladogram for fungi that I have given above is correct, and that the mixed gene tree you have calculated is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to  identify the sequence of duplications and/or gene loss in your organism through which YFO has ended up with the APSES domains it possesses today.  
  
# ================================================
+
{{task|1=
#    Change residue numbers
 
# ================================================
 
  
 +
# Print the tree to a single sheet of paper.
 +
# Mark the clades for the genes of the cenancestor.
 +
# Label all subsequent branchpoints that affect the gene tree for YFO  with either '''"D"''' (for duplication) or '''"S"''' (for speciation). Remember that specific speciation events can appear more than once in a tree. Identify such events.
 +
# '''Bring this sheet with you to the quiz on Wednesday.'''
  
resNum <- as.numeric(pdb$atom[,"resno"])  # get residue numbers for all atoms
 
resNum <- resNum + (first - resNum[1])        # calculate offset
 
pdb$atom[,"resno"] <- resNum            # replace old numbers with new
 
pdb$atom[1,]                                  # check result
 
 
 
# ================================================
 
#    Write output to file
 
# ================================================
 
 
write.pdb(pdb=pdb,file=PDBout)
 
 
# Done. Open the PDB file you have written in a text editor and confirm
 
# that this has worked.
 
 
</source>
 
 
}}
 
}}
  
 +
==Bonus: when did it happen?==
  
&nbsp;
+
A very cool resource is [http://www.timetree.org/ '''Timetree'''] - a tool that allows you to estimate divergence times between species. For example, the speciation event that separated the main branches of the fungi - i.e. the time when the fungal cenancestor lived - is given by the divergence time of ''Schizosaccharomyces pombe'' and ''Saccharomyces cerevisiae'': 761,000,000 years ago. For comparison, these two fungi are therefore approximately as related to each other as '''you''' are ...
  
===First visualization===
+
A) to the rabbit?<br>
 +
B) to the opossum?<br>
 +
C) to the chicken?<br>
 +
D) to the rainbow trout?<br>
 +
E) to the warty sea squirt?<br>
 +
F) to the bumblebee?<br>
 +
G) to the earthworm?<br>
 +
H) to the fly agaric?<br>
  
&nbsp;<br>
+
Check it out - the question will be on the quiz.
  
Since a homology model inherits its structural details from the '''template''', your model of the YFO sequence should look very similar to the original 1BM8 structure.
+
== Links and resources ==
 
 
{{task|1=
 
# Start Chimera and load the '''model''' coordinates that you have just renumbered.
 
# From the PDB, also load the '''template''' structure. (Use File &rarr; Fetch by ID ...)
 
# In the '''Favourites''' &rarr; '''Model Panel''' window you can switch between the two molecules.
 
# Hide the ribbon and choose '''backbone only &rarr; full'''. You will note that the backbone of the two structures is virtually identical.
 
# Next, choose '''Actions &rarr; Atoms/Bonds &rarr; show''' to display the two molecules in a stick style and note how the sidechains have been modeled. Note especially how sidechain coordinates have been guessed where the template had shorter sidechains than the target. It may be clearer if you hide H-atoms: '''Select &rarr; Chemistry &rarr; Element &rarr; H''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''
 
# Display only residue 50 to 74 to focus on the putative helix-turn-helix domain. Choose '''Favourites &rarr; Sequence''', select the residues for one model, then '''Select &rarr; Invert (selected model)''' and '''Actions &rarr; Atoms/Bonds &rarr; hide'''.
 
# Study the result. A model of the HTH domain of YFO Mbp1.
 
}}
 
  
&nbsp;<br>
 
&nbsp;<br>
 
  
==Coloring the model by energy ==
 
  
SwissModel calculates energies for each residue of the model with a molecular mechanics forcefield. The SwissModel modeling summary page contains a plot of these energies as a function of sequence number. The values - between 0.0 and 1.0 - are stored in the PDB file's B-factor field.
+
;That is all.
  
 
+
==Links and Resources==
{{task|1=
+
;Literature
# Back in Chimera, use the model panel to '''close''' the 1BM8 structure.
+
{{#pmid: 22114356}}
# Choose '''Tools &rarr; Depiction &rarr; Render by attribute''' and select '''attributes of atoms''', '''Attribute: bfactor''', check '''color atoms''' and click '''OK'''.
+
{{#pmid: 19190756}}
# Study the result: It seems that residues in the core of the protein have better energies than residues at the surface. Why could that be the case?
+
{{#pmid: 12801728}}
 +
:* [http://evolution.genetics.washington.edu/phylip/phylip.html '''PHYLIP''' documentation]
 +
{{PDF
 +
|authors= Tuimala, Jarno
 +
|year= 2006
 +
|title= A primer to phylogenetic analysis using the PHYLIP package
 +
|journal=
 +
|volume=
 +
|pages=
 +
|URL= http://koti.mbnet.fi/tuimala/oppaat/phylip2.pdf
 +
|doi=
 +
|file= Tuimala_PHYLIP.pdf
 +
|abstract= The purpose of this tutorial is to demonstrate how to use PHYLIP, a collection of phylogenetic analysis software, and some of the options that are available. This tutorial is not intended to be a course in phylogenetics, although some phylogenetic concepts will be discussed briefly. There are other books available which cover the theoretical sides of the phylogenetic analysis, but the actual data analysis work is less well covered. Here we will mostly deal with molecular sequence data analysis in the current PHYLIP version 3.66.
 
}}
 
}}
  
Study the options of this window a bit; rendering by attribute is a powerful way to store and depict all manner of information with the molecule. Simply write a little R script that uses bio3D to replace the B-factor or occupancy values with any value you might be interested in: energies, conservation scores, information ... whatever. Then render this property to map it on the 3D structure of your molecule. If you want to experiment with this a bit, you could apply the information scores from the previous assignment to your model, using a script that is easy to derive from the renumbering R-script you have studied above.
 
  
 +
;Software
 +
:* [http://evolution.genetics.washington.edu/phylip.html '''PHYLIP''' home page]
 +
:* [http://itol.embl.de/ '''iTOL''' - Interactive Tree of Life project]
  
 +
;Sequences
 +
:* [[Reference APSES domains (reference species)|'''reference APSES domains page''']]
  
;That is all.
 
  
==Links and Resources==
 
  
 
&nbsp;<br>
 
&nbsp;<br>
Line 472: Line 329:
 
{{#pmid: 12117790}}
 
{{#pmid: 12117790}}
  
 
:* [http://www.wwpdb.org/documentation/format23/v2.3.html '''PDB file format'''] (see the Coordinate Section if you are unsure about chain identifiers)
 
:* [http://en.wikipedia.org/wiki/Structural_alignment Wikipedia on '''Structural Superposition'''] <small>(although the article is called "Structural Alignment")</small>
 
 
 
 
;Reference sequences
 
 
:* [[Reference Mbp1 orthologues (all fungi)|'''Mbp1 ortholog sequences (all fungi)''']]
 
  
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{#pmid: 19957275}} -->

Revision as of 14:44, 2 October 2015

Assignment for Week 8
Phylogenetic Analysis


Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.



 


 

Nothing in Biology makes sense except in the light of evolution.
Theodosius Dobzhansky

... but does evolution make sense in the light of biology?

As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. In case two organisms both have an orthologous gene for the same, distinct function, saying that the function is the same may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to the other species, but now we expect functionally significant residues to have adapted to the new role of one paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?

We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 (with reciprocal best match) and you have identified the full complement of APSES domain genes in your assigned organism (as a result of your PSI-BLAST search). In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of other fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history. I have prepared APSES domains from six diverse reference species, you will add YFO's APSES domain sequences and compute the phylogram for all genes. The goal is to identify orthologues and paralogues.

A number of excellent tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package, the MEGA package and the (commercial) PAUP* package. Of these, only MEGA is still under active development, although PHYLIP still functions perfectly (except for problems with graphical windows under Mac OS 10.6). Specialized tools for tree-building include TREE-PUZZLE or MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume that results that converge when computed with different algorithms are more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.

However, we will take a shortcut in this assignment (something you should not do in real life). We will skip establishing the reliability of the tree with a bootstrap procedure, i.e. repeat the tree-building a hundred times with partial data and see which branches and groupings are robust and which depend on the details of the data. (If you are interested, have a look here for the procedure for running a bootstrap analysis on the data set you are working with, but this may require a day or so of computing time on your computer.) In this assignment, we will simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.
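The resampling step behind a bootstrap can be sketched in a few lines of Python. This is only an illustration under assumptions: the toy alignment and its sequence names are invented, and the sketch uses the standard form of the bootstrap, in which alignment columns are drawn with replacement until the replicate has the original length. The tree-building itself is not shown; each replicate would be passed to the tree program in turn.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}

# Toy alignment with invented names and sequences (not real APSES data).
alignment = {
    "SEQ_A": "ACDEFGHIKL",
    "SEQ_B": "ACDEYGHIRL",
    "SEQ_C": "WCDKYGHVRL",
}

rng = random.Random(42)  # fixed seed so that the replicates are reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(len(replicates), "replicates of length", len(replicates[0]["SEQ_A"]))
```

Groupings that appear in most of the hundred resulting trees are the robust ones; groupings that come and go depend on details of the data.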


If you would like to review concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis here and in the resource section at the bottom of this page.

Baldauf (2003) Phylogeny for the faint of heart: a tutorial. Trends Genet 19:345-51. (pmid: 12801728)

Phylogenetic trees seem to be finding ever broader applications, and researchers from very different backgrounds are becoming interested in what they might have to say. This tutorial aims to introduce the basics of building and interpreting phylogenetic trees. It is intended for those wanting to understand better what they are looking at when they look at someone else's trees or to begin learning how to build their own. Topics covered include: how to read a tree, assembling a dataset, multiple sequence alignment (how it works and when it does not), phylogenetic methods, bootstrap analysis and long-branch artefacts, and software and resources.

Preparing input alignments

In this section, we start from a collection of homologous APSES domains, construct a multiple sequence alignment, and edit the alignment to make it suitable for phylogenetic analysis.


Principles

In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first, then edit it. This is important: all rows of sequences have to contain the exact same number of characters and to hold aligned characters in corresponding positions. Phylogeny programs are not meant to revise an alignment but to analyze evolutionary relationships, after the alignment has been determined. The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable. Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.


The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.


Distance based phylogeny programs start by using sequence comparisons to estimate evolutionary distances:

  • they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
  • this score is stored in a "distance matrix" ...
  • ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).

They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
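To make the idea of a distance matrix concrete, here is a minimal Python sketch. It rests on a simplifying assumption: instead of a mutation data matrix it uses the plain fraction of differing positions (the so-called p-distance) as the score, and the three toy sequences are invented. Real distance programs apply a proper model of evolution at this step.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of differing positions, skipping columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(alignment):
    """Symmetric matrix of pairwise distances as a dict of dicts."""
    matrix = {name: {name: 0.0} for name in alignment}
    for a, b in combinations(alignment, 2):
        matrix[a][b] = matrix[b][a] = p_distance(alignment[a], alignment[b])
    return matrix

# Invented toy sequences, all of the same aligned length.
alignment = {"SEQ_A": "ACDEF", "SEQ_B": "ACDEY", "SEQ_C": "WCDKY"}
matrix = distance_matrix(alignment)
for name, row in matrix.items():
    print(name, [round(row[other], 2) for other in alignment])
```

Note that each pair of sequences is reduced to a single number here, which is exactly the information loss that column-based methods avoid.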


Parsimony based phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
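Counting the required changes can be illustrated with a small Python sketch of Fitch's algorithm for a single alignment column on a fixed tree. This is a textbook method chosen for illustration, not necessarily what any particular PHYLIP program does internally; the nested-tuple tree representation and the taxon names A to D are invented for the example.

```python
def fitch(node, column):
    """Return (possible ancestral states, minimum changes) for a binary tree.

    A leaf is a taxon name (str); an internal node is a (left, right) tuple.
    `column` maps each taxon name to its residue in one alignment column.
    """
    if isinstance(node, str):
        return {column[node]}, 0
    left_states, left_changes = fitch(node[0], column)
    right_states, right_changes = fitch(node[1], column)
    shared = left_states & right_states
    if shared:
        # Both subtrees can agree on a state: no extra change is needed.
        return shared, left_changes + right_changes
    # No shared state: one substitution is required at this node.
    return left_states | right_states, left_changes + right_changes + 1

column = {"A": "K", "B": "K", "C": "R", "D": "R"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch(tree_1, column)[1], fitch(tree_2, column)[1])
```

For this column the first tree needs one change and the second needs two, so parsimony prefers the first. A parsimony program sums such counts over all columns and searches for the tree with the smallest total.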


ML, or Maximum Likelihood, methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute-intensive and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (runs about an hour on my computer). If the problem is too large, one may split it into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.

ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.


Bayesian methods don't estimate the tree that gives the highest likelihood for the observed data, but find the most probable tree, given that the data have been observed. If this sounds conceptually similar to you, then you are not wrong. However, the approaches employ very different algorithms. And Bayesian methods need a "prior" on trees before observation.
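The difference between the two approaches can be written down compactly. The notation here is an assumption made for illustration: D stands for the observed alignment and T for a candidate tree (with its branch lengths and model parameters folded in for simplicity).

```latex
\hat{T}_{\mathrm{ML}} = \arg\max_{T} \; P(D \mid T)
\qquad\text{versus}\qquad
P(T \mid D) = \frac{P(D \mid T)\, P(T)}{\sum_{T'} P(D \mid T')\, P(T')}
```

The likelihood P(D | T) appears in both expressions; the Bayesian posterior additionally weights it by the prior P(T) and normalizes over all candidate trees, which is why a prior on trees is needed.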


Choosing sequences

In principle, we have discussed strategies for using PSI-BLAST to collect suitable sequences earlier. To prepare the process, I have collected all APSES domains for six reference fungal species, together with the KilA-N domain of E. coli. The process is explained on the reference APSES domains page.


Renaming sequences

Renaming sequences so that their species is apparent is crucial for the interpretation of mixed gene trees. Refer to the reference APSES domains page to see how I have prepared the FASTA sequence headers.


Adding an outgroup

To analyse phylogenetic trees it is useful (and for some algorithms required) to define an outgroup, a sequence that presumably diverged from all other sequences in a clade before they split up among themselves. Wherever the outgroup inserts into the tree, this is the root of the rest of the tree. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. I have defined an outgroup sequence and added it to the reference APSES domains page. The procedure is explained in detail on that page.

>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
MTSFQLSLISREIDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
FKGGRPENQGTWVHPDIAINLAQWLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF

E. coli KilA-N protein. Residues that do not align with APSES domains are shown in grey.


Calculating alignments

Task:

  1. Navigate to the reference APSES domains page and copy the APSES/KilA-N domain sequences.
  2. Open Jalview, select File → Input Alignment → from Textbox and paste the sequences into the textbox.
  3. Add the APSES domain sequences from your species (YFO) that you have previously defined through PSI-BLAST. Don't worry that the sequences are longer, the MSA algorithm should be able to take care of that. However: do rename your sequences to follow the pattern for the other domains, i.e. edit the FASTA header line to begin with the five-letter abbreviated species code.
  4. When all the sequences are present, click on New Window.
  5. In Jalview, select Web Service → Alignment → MAFFT Multiple Sequence Alignment. The alignment is calculated in a few minutes and displayed in a new window.
  6. Choose any colour scheme and add Colour → by Conservation. Adjust the slider left or right to see which columns are highly conserved.
  7. Save the alignment as a Jalview project before editing it for phylogenetic analysis. You may need it again.

Editing sequences

As discussed in the lecture, we should edit our alignments to make them suitable for phylogeny calculations. Here are the principles:

Follow the fundamental principle that all characters in a column should be related by homology. This implies the following rules of thumb:

  • Remove all stretches of residues in which the alignment appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
  • Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains. You want to only retain the APSES domains. All the extra residues from the YFO sequence can be deleted.
  • Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
  • Remove all but approximately one column from gapped regions in those cases where the presence of several related insertions suggest that the indel is real, and not just an alignment artefact. (Some researchers simply remove all gapped regions).
  • Remove sections N- and C- terminal of gaps where the alignment appears questionable.
  • If the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory try removing columns of sequence. Or remove species that you are less interested in from the alignment.
  • Move your outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.
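One of these rules of thumb, removing heavily gapped columns, lends itself to a small Python sketch. The threshold value and the toy sequences are assumptions for illustration only; deciding whether a region is ambiguously aligned still requires looking at the alignment yourself.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only columns whose fraction of '-' is at most max_gap_fraction."""
    names = list(alignment)
    n_columns = len(alignment[names[0]])
    keep = [
        i for i in range(n_columns)
        if sum(alignment[n][i] == "-" for n in names) / len(names)
        <= max_gap_fraction
    ]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}

# Invented toy alignment: the third column is a gap in two of three rows.
alignment = {"SEQ_A": "AC-DE", "SEQ_B": "AC-DE", "SEQ_C": "ACWDE"}
edited = drop_gappy_columns(alignment)
print(edited)
```

All rows keep the same length after editing, which is what the phylogeny programs require.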

Handling indels

Gaps are a real problem, as usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most phylogeny programs (such as the programs in PHYLIP) do not work in this way. PHYLIP strictly operates on columns of characters and treats a gap character just like a residue with the one letter code "-". Thus gap insertion- and extension- characters get the same score. For short indels, this underestimates the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but one or two columns of gapped sequence, or to remove such columns altogether.
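The last point, that shared gaps can fake similarity, is easy to demonstrate with a short Python sketch. The two sequences are invented: they share a ten-column gap but differ at every residue. The function is a plain identity count written for this illustration, not PHYLIP's scoring.

```python
def identity(a, b, gaps_as_characters):
    """Fraction of identical columns, optionally counting '-' as a residue."""
    pairs = list(zip(a, b))
    if not gaps_as_characters:
        pairs = [(x, y) for x, y in pairs if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs)

# Invented sequences: no residue in common, but a long shared gap.
seq_1 = "AK" + "-" * 10 + "LM"
seq_2 = "GR" + "-" * 10 + "VW"

print(round(identity(seq_1, seq_2, gaps_as_characters=True), 2))
print(round(identity(seq_1, seq_2, gaps_as_characters=False), 2))
```

With gaps treated as characters the two sequences appear about 71% identical, even though not one of their residues matches. Deleting the gapped columns removes this artefact.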


(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red - 1, 3 and 4: large gaps; 2: uncertain alignment and 5: frayed C-terminus: both would put non-homologous characters into the same column; d: input data for PHYLIP: names for sequences must not be longer than 10 characters, the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, read the PHYLIP sequence format guide.


Task:
Prepare a PHYLIP input file from the sequences you have prepared following the principles above. The simplest way to achieve this appears to be:

    1. Copy the sequences you want into a textfile. Make sure the "reference sequences", the outgroup and the sequences from YFO are all included.
    2. In a browser, navigate to the Readseq sequence conversion service.
    3. Paste your sequences into the form and choose Phylip as the output format. Click on submit.
    4. Save the resulting page as a text file. Give it some useful name such as APSES_domains.phy.
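If you prefer to script the conversion, the format requirements stated above (a first line with the number of sequences and the sequence length, and names of no more than 10 characters) can be sketched in Python. This is a minimal sketch under assumptions: it writes each sequence on a single line with the name padded to ten columns, and the example names and sequences are invented. Check the result against the PHYLIP sequence format guide before relying on it.

```python
def to_phylip(alignment):
    """Format a dict of aligned sequences as PHYLIP text (one line each)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        if len(name) > 10:
            raise ValueError(f"name longer than 10 characters: {name}")
        # Pad the name to exactly ten columns, then append the sequence.
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"

# Invented names and sequences; put the outgroup first, as noted above.
alignment = {"OUTGROUP_X": "MTSF", "GENE1_SPEC": "ACDE"}
text = to_phylip(alignment)
print(text)
```

The function refuses input with rows of unequal length, since that is the most common reason for PHYLIP to reject a file.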


Calculating trees

In this section we perform the actual phylogenetic calculation.

Task:

  1. Download the PHYLIP package from the Phylip homepage and install it on your computer.
  2. Make a copy of your PHYLIP formatted sequence alignment file and name it infile. Note: make sure that your Microsoft Windows operating system does not silently append the extension ".txt" to your file. It should be called "infile", nothing else. Place this file into the directory where the PHYLIP executables reside on your computer.
  3. Run the proml program of PHYLIP (protein sequences, maximum likelihood tree) to calculate a phylogenetic tree (on the Mac, use proml.app). The program will automatically use "infile" for its input. Use the default parameters except that you should change option S: Speedier but rougher analysis? to No, not rough - your analysis should not sacrifice accuracy for speed. The calculation may take some fifteen minutes or so.


The program produces two output files: the outfile contains a summary of the run, the likelihood of bifurcations, and an ASCII representation of the tree. Open it with your usual text editor to have a look, and save the file with a meaningful name. The outtree contains the resulting tree in so-called "Newick" format. Again, have a look and save it with a meaningful filename.
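A Newick string is plain text, so a quick check that all your sequences made it into the tree can be done with a few lines of Python. This is a rough sketch, assuming simple unquoted labels; the example tree string is invented, and a dedicated tree library would be the better choice for anything beyond a sanity check.

```python
import re

def leaf_names(newick):
    """Extract leaf labels: names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

# Invented example; the contents of outtree have the same overall shape.
newick = "(A:0.1,(B:0.2,C:0.3):0.05);"
print(leaf_names(newick))
```

Labels of internal nodes follow a closing parenthesis and are therefore not reported, and the numbers after each colon are the branch lengths.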


Analysing your tree

In order to analyse your tree, you need a species tree as reference. Then you can begin comparing your expectations with the observed tree.


The species tree reference

I have constructed a cladogram for many of the species we are analysing, based on data published for 1551 fungal ribosomal sequences. The six reference species are included. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.

Cladogram of many fungi studied in the assignments. This cladogram is based on small subunit ribosomal rRNA sequences, and largely follows Tehler et al. (2003) Mycol Res. 107:901-916. Even though many details of fungal phylogeny remain unresolved, the branches shown here individually appear to have strong support. In a cladogram such as this, the branch lengths are not drawn to any scale of similarity.

Your species may not be included in this cladogram, but you can easily create your own species tree with the following procedure:

Task:

  1. Access the NCBI taxonomy database Entrez query page.
  2. Edit the list of reference species below to include your species and paste it into the form.
"Aspergillus nidulans"[Scientific Name] OR
"Candida albicans"[Scientific Name] OR
"Neurospora crassa"[Scientific Name] OR
"Saccharomyces cerevisiae"[Scientific Name] OR
"Schizosaccharomyces pombe"[Scientific Name] OR
"Ustilago maydis"[Scientific Name]
  3. Next, as Display Settings option, select Common Tree.
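Editing the query by hand is error-prone when the list grows. A small Python helper can assemble it in the same form as shown above; the helper name is invented for this sketch, and you would add your own species to the list.

```python
def taxonomy_query(species_names):
    """Join species names into an Entrez taxonomy query, one term per line."""
    return " OR\n".join(
        f'"{name}"[Scientific Name]' for name in species_names
    )

reference_species = [
    "Aspergillus nidulans",
    "Candida albicans",
    "Neurospora crassa",
    "Saccharomyces cerevisiae",
    "Schizosaccharomyces pombe",
    "Ustilago maydis",
]
query = taxonomy_query(reference_species)
print(query)
```

The output can be pasted into the query form directly.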

You can use that tree as is - or visualize it more nicely as follows

  1. Select the phylip tree option from the menu, and click save as to save the tree in phylip (Newick) tree format.
  2. The output can be edited, and visualized in any program that reads phylip trees. One particularly nice viewer is the iTOL - Interactive Tree of Life project. Copy the contents of the phyliptree.phy file that the NCBI page has written, navigate to the iTOL project, click on the Data Upload tab, paste your tree data and click Upload. Then go to the main display page to view the tree. Change the view from Circular to Normal.
Alternatively ...

You can look up your species in the latest version of the species tree for the fungi:

Ebersberger et al. (2012) A consistent phylogenetic backbone for the fungi. Mol Biol Evol 29:1319-34. (pmid: 22114356)

The kingdom of fungi provides model organisms for biotechnology, cell biology, genetics, and life sciences in general. Only when their phylogenetic relationships are stably resolved, can individual results from fungal research be integrated into a holistic picture of biology. However, and despite recent progress, many deep relationships within the fungi remain unclear. Here, we present the first phylogenomic study of an entire eukaryotic kingdom that uses a consistency criterion to strengthen phylogenetic conclusions. We reason that branches (splits) recovered with independent data and different tree reconstruction methods are likely to reflect true evolutionary relationships. Two complementary phylogenomic data sets based on 99 fungal genomes and 109 fungal expressed sequence tag (EST) sets analyzed with four different tree reconstruction methods shed light from different angles on the fungal tree of life. Eleven additional data sets address specifically the phylogenetic position of Blastocladiomycota, Ustilaginomycotina, and Dothideomycetes, respectively. The combined evidence from the resulting trees supports the deep-level stability of the fungal groups toward a comprehensive natural system of the fungi. In addition, our analysis reveals methodologically interesting aspects. Enrichment for EST encoded data-a common practice in phylogenomic analyses-introduces a strong bias toward slowly evolving and functionally correlated genes. Consequently, the generalization of phylogenomic data sets as collections of randomly selected genes cannot be taken for granted. A thorough characterization of the data to assess possible influences on the tree reconstruction should therefore become a standard in phylogenomic analyses.

Visualizing the tree

Once Phylip is done calculating the tree, the tree in a text format will be contained in the Phylip outfile - the documentation of what the program has done. Open this textfile for a first look. The tree is complicated and it can look confusing at first. The tree in Newick format is contained in the Phylip file outtree. Visualize it as follows:

Task:

  1. Open outtree in a texteditor and copy the tree.
  2. Visualize the tree in alternative representations:
    1. I have already mentioned the iTOL - Interactive Tree of Life project viewer.
    2. Navigate to the Proweb treeviewer, paste and visualize your tree.
    3. Navigate to the Trex-online Newick tree viewer for an alternative view. Visualize the tree as a phylogram. You can increase the window height to keep the labels from overlapping.
  3. A particularly useful viewer is actually Jalview.
    1. Open Jalview, copy the sequences you have used and paste them via File → Input Alignment → from Textbox.
    2. In the alignment window, choose File → Load associated Tree and load the Phylip outtree file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades (plus the outgroup). This view is particularly informative, since you can associate the clades of the tree with the actual sequences in the alignment, and get a good sense what sequence features the tree is based on.
    3. Try the Calculate → Sort → By Tree Order option to sort the sequences by their position in the tree. Also note that you can flip the tree around a node by double-clicking on it. This is especially useful: try to rearrange the tree so that the subdivisions into clades are apparent. Clicking into the window "cuts" the tree and colours your sequences according to the clades in which they are found. This is useful to understand what particular sequences contributed to which part of the phylogenetic inference.
    4. Study the tree: understand what you see and what you would have expected.


Here are two principles that will help you make sense of the tree.


A: A gene that is present in an ancestral species is inherited in all descendant species. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event).

B: Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants; this means: if the LCA of a branch has e.g. three genes, we would expect three copies of the species cladogram below this branchpoint, one for each of these genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their LCA.


With these two simple principles (you should draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help.

The APSES domains of LCA

Note: A common confusion about cenancestral genes (LCA = Last Common Ancestor) arises from the fact that far from all expected genes are present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, some may have diverged beyond recognizability. In general you have to ask: given the species represented in a subclade, what is the last common ancestor of that branch? The expectation is that all descendants of that ancestor should be represented in that branch, unless one of the above reasons why a gene might be absent applies.


Task:

  • Consider how many APSES domain proteins the fungal cenancestor appears to have possessed and what evidence you see in the tree that this is so. Note that the hallmark of a clade that originated in the cenancestor is that it contains species from all subsequent major branches of the species tree.


The APSES domains of YFO

Assume that the cladogram for fungi that I have given above is correct, and that the mixed gene tree you have calculated is fundamentally correct in its overall arrangement but may have local inaccuracies due to the limited resolution of the method. You have identified the APSES domain genes of the fungal cenancestor above. Apply the expectations we have stated above to identify the sequence of duplications and/or gene loss in your organism through which YFO has ended up with the APSES domains it possesses today.

Task:

  1. Print the tree to a single sheet of paper.
  2. Mark the clades for the genes of the cenancestor.
  3. Label all subsequent branchpoints that affect the gene tree for YFO with either "D" (for duplication) or "S" (for speciation). Remember that specific speciation events can appear more than once in a tree. Identify such events.
  4. Bring this sheet with you to the quiz on Wednesday.
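If you want to cross-check your hand-labelled tree, one common heuristic (the "species overlap" rule, which is an addition here and not part of the assignment) can be sketched in Python: a branchpoint is called a duplication if the same species occurs on both sides of it, and a speciation otherwise. Everything in the example is hypothetical: the nested-tuple tree, the leaf names, and the assumption that the species code is the part of the name after the last underscore.

```python
def species_of(leaf_name):
    # Assumption: the species code follows the last underscore in the name.
    return leaf_name.rsplit("_", 1)[-1]

def label_branchpoints(node, labels):
    """Walk a binary nested-tuple tree; append 'D' or 'S' per branchpoint."""
    if isinstance(node, str):
        return {species_of(node)}
    left_species = label_branchpoints(node[0], labels)
    right_species = label_branchpoints(node[1], labels)
    # The same species on both sides implies a duplication before the split.
    labels.append("D" if left_species & right_species else "S")
    return left_species | right_species

# Hypothetical gene tree: two gene families in two species, plus an outgroup.
tree = ((("GENE1_SPECA", "GENE1_SPECB"),
         ("GENE2_SPECA", "GENE2_SPECB")),
        "OUT_SPECX")
labels = []
label_branchpoints(tree, labels)
print(labels)
```

The labels are reported from the tips towards the root. In this toy tree the split between species A and species B shows up twice as "S", once in each gene family, which is the situation described in step 3. The heuristic trusts the gene tree completely, so local inaccuracies in the tree will produce spurious duplications; your own reasoning against the species tree remains the reference.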

Bonus: when did it happen?

A very cool resource is Timetree - a tool that allows you to estimate divergence times between species. For example, the speciation event that separated the main branches of the fungi - i.e. the time when the fungal cenancestor lived - is given by the divergence time of Schizosaccharomyces pombe and Saccharomyces cerevisiae: 761,000,000 years ago. For comparison, these two fungi are therefore approximately as related to each other as you are ...

A) to the rabbit?
B) to the opossum?
C) to the chicken?
D) to the rainbow trout?
E) to the warty sea squirt?
F) to the bumblebee?
G) to the earthworm?
H) to the fly agaric?

Check it out - the question will be on the quiz.

That is all.

Links and Resources

Literature
Ebersberger et al. (2012) A consistent phylogenetic backbone for the fungi. Mol Biol Evol 29:1319-34. (pmid: 22114356)

The kingdom of fungi provides model organisms for biotechnology, cell biology, genetics, and life sciences in general. Only when their phylogenetic relationships are stably resolved, can individual results from fungal research be integrated into a holistic picture of biology. However, and despite recent progress, many deep relationships within the fungi remain unclear. Here, we present the first phylogenomic study of an entire eukaryotic kingdom that uses a consistency criterion to strengthen phylogenetic conclusions. We reason that branches (splits) recovered with independent data and different tree reconstruction methods are likely to reflect true evolutionary relationships. Two complementary phylogenomic data sets based on 99 fungal genomes and 109 fungal expressed sequence tag (EST) sets analyzed with four different tree reconstruction methods shed light from different angles on the fungal tree of life. Eleven additional data sets address specifically the phylogenetic position of Blastocladiomycota, Ustilaginomycotina, and Dothideomycetes, respectively. The combined evidence from the resulting trees supports the deep-level stability of the fungal groups toward a comprehensive natural system of the fungi. In addition, our analysis reveals methodologically interesting aspects. Enrichment for EST encoded data, a common practice in phylogenomic analyses, introduces a strong bias toward slowly evolving and functionally correlated genes. Consequently, the generalization of phylogenomic data sets as collections of randomly selected genes cannot be taken for granted. A thorough characterization of the data to assess possible influences on the tree reconstruction should therefore become a standard in phylogenomic analyses.

Marcet-Houben & Gabaldón (2009) The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. PLoS ONE 4:e4357. (pmid: 19190756)

A recurrent topic in phylogenomics is the combination of various sequence alignments to reconstruct a tree that describes the evolutionary relationships within a group of species. However, such approach has been criticized for not being able to properly represent the topological diversity found among gene trees. To evaluate the representativeness of species trees based on concatenated alignments, we reconstruct several fungal species trees and compare them with the complete collection of phylogenies of genes encoded in the Saccharomyces cerevisiae genome. We found that, despite high levels of among-gene topological variation, the species trees do represent widely supported phylogenetic relationships. Most topological discrepancies between gene and species trees are concentrated in certain conflicting nodes. We propose to map such information on the species tree so that it accounts for the levels of congruence across the genome. We identified the lack of sufficient accuracy of current alignment and phylogenetic methods as an important source for the topological diversity encountered among gene trees. Finally, we discuss the implications of the high levels of topological variation for phylogeny-based orthology prediction strategies.

Baldauf (2003) Phylogeny for the faint of heart: a tutorial. Trends Genet 19:345-51. (pmid: 12801728)

Phylogenetic trees seem to be finding ever broader applications, and researchers from very different backgrounds are becoming interested in what they might have to say. This tutorial aims to introduce the basics of building and interpreting phylogenetic trees. It is intended for those wanting to understand better what they are looking at when they look at someone else's trees or to begin learning how to build their own. Topics covered include: how to read a tree, assembling a dataset, multiple sequence alignment (how it works and when it does not), phylogenetic methods, bootstrap analysis and long-branch artefacts, and software and resources.

Tuimala, Jarno (2006) A primer to phylogenetic analysis using the PHYLIP package.

The purpose of this tutorial is to demonstrate how to use PHYLIP, a collection of phylogenetic analysis software, and some of the options that are available. This tutorial is not intended to be a course in phylogenetics, although some phylogenetic concepts will be discussed briefly. There are other books available which cover the theoretical sides of the phylogenetic analysis, but the actual data analysis work is less well covered. Here we will mostly deal with molecular sequence data analysis in the current PHYLIP version 3.66.


Software
Sequences


 

Biasini et al. (2014) SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42:W252-8. (pmid: 24782522)

Protein structure homology modelling has become a routine technique to generate 3D models for proteins when experimental structures are not available. Fully automated servers such as SWISS-MODEL with user-friendly web interfaces generate reliable models without the need for complex software packages or downloading large databases. Here, we describe the latest version of the SWISS-MODEL expert system for protein structure modelling. The SWISS-MODEL template library provides annotation of quaternary structure and essential ligands and co-factors to allow for building of complete structural models, including their oligomeric structure. The improved SWISS-MODEL pipeline makes extensive use of model quality estimation for selection of the most suitable templates and provides estimates of the expected accuracy of the resulting models. The accuracy of the models generated by SWISS-MODEL is continuously evaluated by the CAMEO system. The new web site allows users to interactively search for templates, cluster them by sequence similarity, structurally compare alternative templates and select the ones to be used for model building. In cases where multiple alternative template structures are available for a protein of interest, a user-guided template selection step allows building models in different functional states. SWISS-MODEL is available at http://swissmodel.expasy.org/.

Bordoli & Schwede (2012) Automated protein structure modeling with SWISS-MODEL Workspace and the Protein Model Portal. Methods Mol Biol 857:107-36. (pmid: 22323219)

Comparative protein structure modeling is a computational approach to build three-dimensional structural models for proteins using experimental structures of related protein family members as templates. Regular blind assessments of modeling accuracy have demonstrated that comparative protein structure modeling is currently the most reliable technique to model protein structures. Homology models are often sufficiently accurate to substitute for experimental structures in a wide variety of applications. Since the usefulness of a model for specific application is determined by its accuracy, model quality estimation is an essential component of protein structure prediction. Comparative protein modeling has become a routine approach in many areas of life science research since fully automated modeling systems allow also nonexperts to build reliable models. In this chapter, we describe practical approaches for automated protein structure modeling with SWISS-MODEL Workspace and the Protein Model Portal.

Peitsch (2002) About the use of protein models. Bioinformatics 18:934-8. (pmid: 12117790)

Protein models can be of great assistance in functional genomics, as they provide the structural insights often necessary to understand protein function. Although comparative modelling is far from yielding perfect structures, this is still the most reliable method and the quality of the predictions is now well understood. Models can be classified according to their correctness and accuracy, which will impact their applicability and usefulness in functional genomics and a variety of situations.



 


Footnotes and references


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 7 Assignment 9 >