Difference between revisions of "BIN-PHYLO-Concepts"

Revision as of 02:52, 4 October 2017

Concepts of Phylogenetic Analysis


 

Keywords:  Phylogenetic trees, orthologues and paralogues, horizontal gene transfer (HGT)


 



 


Caution!

This unit is under development. There is some content here, but it is incomplete and/or may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.


 


Abstract

...


 


This unit ...

Prerequisites

You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

  • Evolution: Theory of evolution; variation, neutral drift and selection.

You need to complete the following units before beginning this one:

  • BIN-ALI-MSA (Multiple Sequence Alignment)

 


Objectives

...


 


Outcomes

...


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents

 
Nothing in Biology makes sense except in the light of evolution.
Theodosius Dobzhansky

... but does evolution make sense in the light of biology?


 

 

Task:


 

As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. If two organisms both have an orthologous gene for the same, distinct function, calling these functions "the same" may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to both their homologues in the other species, but now we expect functionally significant residues to have adapted to the new - and possibly distinct - roles of each paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding additional biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?

We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 (with reciprocal best match) and you have identified the full complement of APSES domain genes in your assigned organism (as a result of your PSI-BLAST search). In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of other fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history. All APSES domain annotations are now available in your protein "database". Now we will attempt to compute the phylogram for these proteins. The goal is to identify orthologues and paralogues.
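The reciprocal best match criterion mentioned above can be illustrated with a few lines of R. This is only a sketch under simplifying assumptions: the score matrix, the gene names and the function `reciprocalBestMatch()` are hypothetical, and a single matrix implies that the score of a pair is the same in both search directions, which is not generally true for real BLAST results.

```r
# Hypothetical similarity scores: rows are genes of species A, columns
# are genes of species B. All names and values are made up.
scores <- matrix(c(250,  40,  35,
                    60, 180, 175,
                    30,  20,  90),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("A1", "A2", "A3"),
                                 c("B1", "B2", "B3")))

reciprocalBestMatch <- function(S) {
    bestForRow <- apply(S, 1, which.max)  # best B hit for each A gene
    bestForCol <- apply(S, 2, which.max)  # best A hit for each B gene
    # A pair is reciprocal if the best hit of the best hit is the query.
    isRBM <- bestForCol[bestForRow] == seq_len(nrow(S))
    data.frame(A = rownames(S)[isRBM],
               B = colnames(S)[bestForRow[isRBM]],
               stringsAsFactors = FALSE)
}

rbm <- reciprocalBestMatch(scores)
print(rbm)
```

In this toy example the best hit of A3 is B3, but the best hit of B3 is A2, so A3 has no reciprocal best match.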

A number of excellent tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package, the MEGA package and the (commercial) PAUP* package. Of these, only MEGA is still under active development, although PHYLIP still functions perfectly (except for problems with graphical windows under Mac OS X 10.6). Specialized tools for tree-building include TREE-PUZZLE and MrBayes. This assignment is constructed around programs that are available in PHYLIP; however, you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just like with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume that results that converge when computed with different algorithms are more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.
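Distance methods such as NJ start from a matrix of pairwise distances between the aligned sequences. As a minimal sketch of that first step only, the following base R code computes the proportion of differing positions (the "p-distance") for a toy alignment. The sequences and the function `pDistance()` are hypothetical; this is not how PHYLIP computes its distances, and real analyses normally use distances that are corrected with a substitution model.

```r
# Toy alignment: hypothetical sequences of equal length, gaps as "-".
aln <- c(seqA = "MKVLAT-GH",
         seqB = "MKILATSGH",
         seqC = "MRILSTSGH")

# Proportion of differing positions between two aligned sequences.
pDistance <- function(x, y) {
    a <- strsplit(x, "")[[1]]
    b <- strsplit(y, "")[[1]]
    keep <- (a != "-") & (b != "-")   # ignore columns that contain a gap
    sum(a[keep] != b[keep]) / sum(keep)
}

# Fill the pairwise distance matrix.
n <- length(aln)
D <- matrix(0, n, n, dimnames = list(names(aln), names(aln)))
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        D[i, j] <- pDistance(aln[i], aln[j])
    }
}
print(round(D, 3))
```

A tree-building algorithm would then take `D` as its input; the matrix itself says nothing yet about the branching order.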

In this assignment, we will take a computational shortcut (something you should not do in real life): we will skip establishing the reliability of the tree with a bootstrap procedure, i.e. repeating the tree-building a hundred times with resampled data to see which branches and groupings are robust and which depend on the details of the data. (If you are interested, have a look here for the procedure for running a bootstrap analysis on the data set you are working with, but this may require a day or so of computing time on your computer.) Instead, we will simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.
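For orientation, the data side of a bootstrap procedure can be sketched in base R: each replicate is created by sampling alignment columns with replacement. The alignment, the seed, and the function `bootstrapColumns()` are hypothetical, and the sketch stops before the expensive part, which is building a tree from every replicate and counting how often each grouping recurs.

```r
set.seed(112358)  # arbitrary seed, only to make the sketch reproducible

# Toy alignment as a character matrix: rows are sequences, columns are
# alignment positions. The data are made up.
aln <- rbind(seqA = strsplit("MKVLATSGH", "")[[1]],
             seqB = strsplit("MKILATSGH", "")[[1]],
             seqC = strsplit("MRILSTSGH", "")[[1]])

# One bootstrap replicate: sample columns with replacement, keeping the
# number of columns unchanged.
bootstrapColumns <- function(m) {
    idx <- sample(ncol(m), size = ncol(m), replace = TRUE)
    m[, idx, drop = FALSE]
}

nReplicates <- 100
replicates <- lapply(seq_len(nReplicates),
                     function(i) bootstrapColumns(aln))
```

Each element of `replicates` has the same dimensions as `aln`, but some columns appear several times and others not at all.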


If you would like to review concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis here and in the resource section at the bottom of this page.

Baldauf (2003) Phylogeny for the faint of heart: a tutorial. Trends Genet 19:345-51. (pmid: 12801728)

Phylogenetic trees seem to be finding ever broader applications, and researchers from very different backgrounds are becoming interested in what they might have to say. This tutorial aims to introduce the basics of building and interpreting phylogenetic trees. It is intended for those wanting to understand better what they are looking at when they look at someone else's trees or to begin learning how to build their own. Topics covered include: how to read a tree, assembling a dataset, multiple sequence alignment (how it works and when it does not), phylogenetic methods, bootstrap analysis and long-branch artefacts, and software and resources.


 

R packages that may be useful include the following:

  • R task view Phylogenetics - this task view gives an excellent, curated overview of the important R packages in the domain.
  • package ape - general purpose phylogenetic analysis (but as far as I can tell, ape only supports analysis with DNA sequences).
  • package ips - wrapper for the MrBayes, BEAST and RAxML "heavy-duty" phylogenetic analysis packages.
  • package Rphylip - wrapper for PHYLIP, the most versatile set of phylogenetic inference tools.


 

Tidbit: the number of possible trees

Here is the formula for the number of trees one can create from n OTUs, written as an R function. It gives the number of unrooted binary trees with n labeled leaves and unlabeled internal nodes. Copy it, paste it into an R script, and try it out. Then figure out: with 10 reference species plus your MYSPE, how many possible trees are there? Could you create them all and select the best one by complete enumeration?

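# Number of unrooted binary trees with nOTU labeled leaves:
# (2n - 4)! / (2^(n - 2) * (n - 2)!)
nTrees <- function(nOTU) {
    if (nOTU < 3)  { return(1) }    # only one tree for fewer than 3 OTUs
    if (nOTU > 87) { return(Inf) }  # factorial() overflows beyond 170!
    return(factorial((2 * nOTU) - 4) / ((2 ^ (nOTU - 2)) * factorial(nOTU - 2)))
}
nTrees(5)  # 15
nTrees(22) # about 3.2e23, approximately Loschmidt's number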
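Because `factorial()` overflows to `Inf` beyond 170!, the function above cannot give a finite answer for more than 87 OTUs. One possible workaround, sketched here with the hypothetical helper `log10nTrees()`, is to evaluate the same formula on a logarithmic scale with base R's `lfactorial()`. The result is the order of magnitude of the number of trees, not the number itself.

```r
# Hypothetical helper: log10 of the number of unrooted binary trees,
# computed with log-factorials so that large nOTU do not overflow.
log10nTrees <- function(nOTU) {
    if (nOTU < 3) { return(0) }   # a single tree: log10(1) is 0
    logN <- lfactorial((2 * nOTU) - 4) -
            ((nOTU - 2) * log(2)) -
            lfactorial(nOTU - 2)
    return(logN / log(10))        # convert natural log to log10
}

10 ^ log10nTrees(5)   # recovers the value of nTrees(5) up to rounding
log10nTrees(100)      # order of magnitude for 100 OTUs
```

Working with logarithms trades exact counts for range, which is usually acceptable here since the point of the exercise is how fast the number grows.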


 


 


Further reading, links and resources

Baldauf (2003) Phylogeny for the faint of heart: a tutorial. Trends Genet 19:345-51. (pmid: 12801728)



 


Notes


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

0.1

Version history:

  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.