Building Phylogenetic Trees

Contents
Calculating trees
Further Reading
Questions, comments
References

Expected Preparations:

	Evolution: Theory of evolution; Variation, neutral drift and selection.
	If you are not already familiar with the prior knowledge listed above, you need to prepare yourself from other information sources.

	[BIN-PHYLO] Data_preparation
	The units listed above are part of this course and contain important preparatory material.

Keywords: Calculating phylogenetic trees; tree visualization

Objectives:

This unit will …

… introduce the concepts and algorithms used to build phylogenetic trees;
… teach how to compute a maximum likelihood tree with the PHYLIP proml program in R.

Outcomes:

After working through this unit you …

… are familiar with concepts and algorithms used to build phylogenetic trees;
… have computed a phylogenetic tree of Mbp1 orthologue APSES domains with the PHYLIP proml program via the Rphylip:: package.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation:

NA: This unit is not evaluated for course marks.

Building phylogenetic trees in theory - and with PHYLIP in R.

Task…

Read the introductory notes on building phylogenetic trees (PDF).
Also read:

Kapli, Paschalia, Ziheng Yang, and Maximilian J Telford. (2020). “Phylogenetic tree building in the genomic age”. Nature Reviews Genetics 21(7):428–444.
[PMID: 32424311] [DOI: 10.1038/s41576-020-0233-0]

Abstract …

Knowing phylogenetic relationships among species is fundamental for many studies in biology. An accurate phylogenetic tree underpins our understanding of the major transitions in evolution, such as the emergence of new body plans or metabolism, and is key to inferring the origin of new genes, detecting molecular adaptation, understanding morphological character evolution and reconstructing demographic changes in recently diverged species. Although data are ever more plentiful and powerful analysis methods are available, there remain many challenges to reliable tree building. Here, we discuss the major steps of phylogenetic analysis, including identification of orthologous genes or proteins, multiple sequence alignment, and choice of substitution models and inference methodologies. Understanding the different sources of errors and the strategies to mitigate them is essential for assembling an accurate tree of life.

The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.

“Distance-based” and “Parsimony-based” methods are fast, but less accurate.

Distance-based phylogeny programs start by using sequence comparisons to estimate an evolutionary distance for each pair of sequences, and then build the tree from these distances. They are fast and can work on large numbers of sequences, but they are less accurate if genes evolve at different rates.
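
The first step of a distance-based method, reducing each pair of aligned sequences to a single number, can be sketched as follows. This is a minimal illustration in Python (the course script itself uses R), and it uses the simple p-distance (fraction of differing sites) rather than a model-corrected distance. The function names and the toy alignment are hypothetical and are not course data.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of gap-free alignment columns at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore every column in which either sequence has a gap character.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its p-distance."""
    return {
        (i, j): p_distance(alignment[i], alignment[j])
        for i, j in combinations(alignment, 2)
    }


# Hypothetical toy alignment, for illustration only.
toy = {
    "seqA": "MKVLA-TGHW",
    "seqB": "MKVLAQTGHW",
    "seqC": "MRILSQSGHF",
}

dm = distance_matrix(toy)
for (i, j), d in dm.items():
    print(f"{i} {j} {d:.3f}")
```

Note that all information about which columns differ is lost once the matrix is computed: a tree-building step such as neighbour joining only ever sees one number per pair.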

Parsimony-based phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the distance methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can’t make good estimates for the required number of sequence changes.
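
To make "minimizing the number of mutation events" concrete, here is a small sketch that scores a fixed tree column by column. It uses Fitch's algorithm, which is one standard way to count the minimum number of changes on a given tree; the text above does not say which algorithm any particular program uses. The code is Python for illustration only, and the tree encoding (nested pairs of leaf names) and the toy alignment are assumptions made for this example.

```python
def fitch(tree, column):
    """Return (possible ancestral states, minimum changes) for one column.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    column : dict mapping leaf name -> character in this alignment column
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, column)
    right_states, right_changes = fitch(right, column)
    shared = left_states & right_states
    if shared:
        # The children can agree on a state: no extra change needed here.
        return shared, left_changes + right_changes
    # The children disagree: one change is required on this node's branches.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical toy alignment, for illustration only.
aln = {"A": "AAGT", "B": "AAGC", "C": "CTGT", "D": "CTGC"}

# The three possible groupings of four sequences.
candidates = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(t, aln) for name, t in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "most parsimonious:", best)
```

A parsimony program repeats this kind of scoring over many candidate trees and keeps the one with the lowest total.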

“Maximum Likelihood” and “Bayesian” methods are accurate, but can take up very significant computational resources.

ML, or Maximum Likelihood, methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute-intensive, and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (runs about an hour on my computer). If the problem is too large, one may split it into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.
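
The phrase "most likely under a particular evolutionary model" can be illustrated with the smallest possible case: two aligned sequences and a single branch length. The sketch below assumes the Jukes-Cantor (JC69) nucleotide model, which is a deliberately simple stand-in and not the amino acid model used for the trees in this unit. It evaluates the likelihood on a grid of candidate distances and picks the best one. The code is Python for illustration only, and the counts (100 sites, 20 differences) are hypothetical.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d.

    d is the expected number of substitutions per site under JC69.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site is unchanged
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # Each site contributes (base frequency 1/4) * (transition probability).
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def ml_distance(n_sites, n_diff, step=0.001, d_max=3.0):
    """Grid search for the distance that maximizes the likelihood."""
    n_steps = round(d_max / step)
    grid = [step * i for i in range(1, n_steps + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# Hypothetical data: 100 aligned sites, 20 of which differ.
n_sites, n_diff = 100, 20
d_hat = ml_distance(n_sites, n_diff)

# JC69 has a closed-form ML estimate, which the grid search should match.
d_closed = -0.75 * math.log(1.0 - 4.0 * (n_diff / n_sites) / 3.0)
print(f"grid search: {d_hat:.3f}  closed form: {d_closed:.3f}")
```

A real ML program has to do this for every branch length of every candidate topology, which is why the computation becomes expensive so quickly.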

ML methods suffer less from “long-branch attraction” - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.

Bayesian methods don’t estimate the tree that gives the highest likelihood for the observed data, but find the most probable tree, given the data that has been observed. If you think this sounds conceptually similar to ML methods, then you are not wrong. However, the approaches employ very different algorithms. And Bayesian methods need a “prior” on trees before observation.
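
The difference between "highest likelihood" and "most probable tree" comes down to Bayes' rule: the posterior probability of a tree is proportional to its likelihood times its prior. The sketch below applies this to three candidate topologies. All numbers are hypothetical, and enumerating trees like this is only feasible for toy cases; actual Bayesian programs sample trees and model parameters instead, which is where the very different algorithms come in. The code is Python for illustration only.

```python
import math


def posterior(log_likelihoods, priors):
    """Posterior probability of each tree from log-likelihoods and priors."""
    log_joint = {
        tree: log_likelihoods[tree] + math.log(priors[tree])
        for tree in log_likelihoods
    }
    # Normalize in log space to avoid underflow with very negative values.
    m = max(log_joint.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
    return {tree: math.exp(v - log_norm) for tree, v in log_joint.items()}


# Hypothetical log-likelihoods of the data under three topologies.
log_lik = {"T1": -1000.0, "T2": -1001.0, "T3": -1003.0}

# A uniform prior: before seeing data, every topology is equally probable.
uniform = {tree: 1.0 / 3.0 for tree in log_lik}
post = posterior(log_lik, uniform)

# A non-uniform prior can change which tree is most probable.
skewed = {"T1": 0.1, "T2": 0.8, "T3": 0.1}
post_skewed = posterior(log_lik, skewed)

for tree in log_lik:
    print(tree, round(post[tree], 3), round(post_skewed[tree], 3))
```

With the uniform prior the tree with the highest likelihood also has the highest posterior; with the skewed prior it does not, which shows why the choice of prior matters.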

Calculating trees

In this section we perform the actual phylogenetic calculation. For this we use an online server at the ATGC Bioinformatics platform of the French National Centre for Scientific Research, in Montpellier, which runs the PhyML tree-inference algorithm.

Task…

Open RStudio and load the ABC-units R project. If you have loaded it before, choose File ▹ Recent projects ▹ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Choose Tools ▹ Version Control ▹ Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included. This ensures that your data and code remain up to date when we update, or fix bugs.
Type init() if requested.
Open the file BIN-PHYLO-Tree_building.R and follow the instructions.

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.

From the abstract of the article describing PhyML 3.0:

PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira-Hasegawa-like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the last version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from http://www.atgc-montpellier.fr/phyml/.
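
The "nearest neighbor interchange" mentioned above is the simplest of these topological moves: around one internal branch with two subtrees on each side, swapping one subtree across the branch yields exactly two alternative arrangements. The sketch below only enumerates those two neighbours for a single branch. It is a hypothetical illustration in Python, not PhyML's implementation, and the tuple encoding of subtrees is an assumption made for this example.

```python
def nni_neighbors(left, right):
    """Return the two NNI rearrangements around one internal branch.

    left  = (a, b): the two subtrees on one side of the branch
    right = (c, d): the two subtrees on the other side
    Subtrees may be leaf names or nested tuples.
    """
    a, b = left
    c, d = right
    return [
        ((a, c), (b, d)),  # exchange b and c across the branch
        ((a, d), (b, c)),  # exchange b and d across the branch
    ]


# Current arrangement: (A,B) on one side, (C,D) on the other.
neighbors = nni_neighbors(("A", "B"), ("C", "D"))
for n in neighbors:
    print(n)
```

A search procedure would score each neighbour, for example by likelihood, keep an improvement if there is one, and repeat for other internal branches.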

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

References

[END]

Building Phylogenetic Trees

Boris Steipe