Difference between revisions of "BIN-PHYLO-Tree building"

Latest revision as of 09:10, 26 September 2020

Building Phylogenetic Trees

(Calculating phylogenetic trees; tree visualization)

Abstract:

Building phylogenetic trees in theory - and with phylip in R.

Objectives:
This unit will ...

... introduce the concepts and algorithms used to build phylogenetic trees;
... teach how to compute a maximum likelihood tree with the PHYLIP proml program in R;

Outcomes:
After working through this unit you ...

... are familiar with concepts and algorithms used to build phylogenetic trees;
... have computed a phylogenetic tree of Mbp1 orthologue APSES domains with the PHYLIP proml program via the RPhylip:: package.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

Evolution: Theory of evolution; variation, neutral drift and selection.

This unit builds on material covered in the following prerequisite units:

BIN-PHYLO-Data_preparation (Preparing Data for Phylogenetic Analysis)

Abstract of the recommended reading (PMID 32424311): Knowing phylogenetic relationships among species is fundamental for many studies in biology. An accurate phylogenetic tree underpins our understanding of the major transitions in evolution, such as the emergence of new body plans or metabolism, and is key to inferring the origin of new genes, detecting molecular adaptation, understanding morphological character evolution and reconstructing demographic changes in recently diverged species. Although data are ever more plentiful and powerful analysis methods are available, there remain many challenges to reliable tree building. Here, we discuss the major steps of phylogenetic analysis, including identification of orthologous genes or proteins, multiple sequence alignment, and choice of substitution models and inference methodologies. Understanding the different sources of errors and the strategies to mitigate them is essential for assembling an accurate tree of life.

The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.

"Distance based" and "Parsimony based" methods are fast, but less accurate.

Distance based phylogeny programs start by using sequence comparisons to estimate evolutionary distances:

they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
this score is stored in a "distance matrix" ...
... and used to estimate a tree that groups sequences with close relationships together (e.g. by using the Neighbor-Joining, NJ, algorithm).

They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
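The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not what PHYLIP does: it assumes a hypothetical four-sequence toy alignment, it uses the plain fraction of differing positions (the "p-distance") in place of a proper mutation data matrix, and it only shows which pair the Neighbor-Joining criterion would join first. All function and variable names are made up for this sketch.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned, ungapped positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions in common")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Symmetric matrix (dict of dicts) of pairwise p-distances."""
    names = sorted(seqs)
    d = {n: {m: 0.0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        d[n][m] = d[m][n] = p_distance(seqs[n], seqs[m])
    return d


def nj_first_pair(d):
    """Pair with the smallest Neighbor-Joining Q value.

    Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j), where r(i) is the sum of
    all distances from i. NJ joins the pair that minimises Q.
    """
    names = sorted(d)
    n = len(names)
    r = {i: sum(d[i].values()) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * d[i][j] - r[i] - r[j]

    return min(combinations(names, 2), key=q)


# Hypothetical toy alignment, invented for illustration only.
toy = {
    "A": "MKVLAAGIT",
    "B": "MKVLSAGIT",
    "C": "MRILSTGLT",
    "D": "MRILATGLS",
}

d = distance_matrix(toy)
print(nj_first_pair(d))
```

Note that each sequence pair is reduced to a single number before any tree is built. This is what makes the approach fast, and it is also where information is lost.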

Parsimony based phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
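To make "minimizing the number of mutation events" concrete, here is a small sketch of Fitch's algorithm, a standard way to count the minimum number of changes that one alignment column requires on a given tree. The toy alignment and the three candidate topologies are hypothetical, and real parsimony programs additionally have to search through tree space rather than score a handful of trees.

```python
def fitch(tree, column):
    """Return (possible ancestral states, minimum number of changes).

    tree   : a leaf name (str) or a tuple (left_subtree, right_subtree)
    column : dict mapping each leaf name to its character in one column
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_set, left_cost = fitch(tree[0], column)
    right_set, right_cost = fitch(tree[1], column)
    common = left_set & right_set
    if common:
        # The children can share a state: no extra change is needed here.
        return common, left_cost + right_cost
    # The children disagree: one change must have happened below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum of the per-column Fitch counts over the whole alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical toy alignment, invented for illustration only.
toy = {
    "A": "MKVLAAGIT",
    "B": "MKVLSAGIT",
    "C": "MRILSTGLT",
    "D": "MRILATGLS",
}

# The three possible topologies for four sequences.
candidates = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]

scores = {tree: parsimony_score(tree, toy) for tree in candidates}
best = min(scores, key=scores.get)
print(best, scores[best])
```

Every column contributes its own count, which is the sense in which parsimony uses all columns rather than one number per sequence pair.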

"Maximum Likelihood" and "Bayesian" methods are accurate, but can take up very significant computational resources.

ML, or Maximum Likelihood, methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute-intensive, and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (it runs for about an hour on my computer). If the problem is too large, one may split it into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.
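The core idea, choosing the parameter values under which the observed data are most likely, can be illustrated with the smallest possible "tree": two sequences joined by a single branch. The sketch below assumes a deliberately simple equal-rates model with 20 amino acid states. It is not the empirical substitution model that a program such as proml would use, and the site and difference counts are hypothetical. A real ML program optimizes all branch lengths and also searches over topologies, which is where the computational cost comes from.

```python
import math


def log_likelihood(d, n_sites, n_diff, k=20):
    """Log-likelihood of two aligned sequences separated by distance d.

    d is the expected number of substitutions per site under an
    equal-rates model with k states (an assumption of this sketch).
    """
    decay = math.exp(-k * d / (k - 1))
    p_same = 1 / k + (k - 1) / k * decay   # the site shows the same state
    p_diff = 1 / k - decay / k             # the site shows one specific other state
    return (
        n_sites * math.log(1 / k)          # equilibrium frequency of the first state
        + (n_sites - n_diff) * math.log(p_same)
        + n_diff * math.log(p_diff)
    )


def ml_distance(n_sites, n_diff, step=0.001, d_max=5.0):
    """Grid search for the distance that maximises the log-likelihood."""
    grid = [step * i for i in range(1, int(d_max / step) + 1)]
    return max(grid, key=lambda d: log_likelihood(d, n_sites, n_diff))


# Hypothetical data: 100 aligned sites, 30 of which differ.
n_sites, n_diff = 100, 30
d_hat = ml_distance(n_sites, n_diff)

# For this simple model the maximum is also known in closed form,
# which gives a check on the numerical search.
p = n_diff / n_sites
d_closed = -(19 / 20) * math.log(1 - (20 / 19) * p)
print(d_hat, d_closed)
```

The estimated distance is larger than the observed fraction of differences, because the model accounts for multiple substitutions at the same site.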

ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.

Bayesian methods don't estimate the tree that gives the highest likelihood for the observed data, but find the most probable tree, given the data that has been observed. If you think this sounds conceptually similar to ML methods, then you are not wrong. However, the approaches employ very different algorithms. And Bayesian methods need a "prior" on trees before observation.
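The role of the prior can be shown with a toy calculation. The sketch below assumes that a log-likelihood has already been computed for each of three candidate trees; the numbers are hypothetical. It is only meant to show how prior and likelihood combine. Actual Bayesian programs do not enumerate trees like this; they typically sample trees and parameters, for example with Markov chain Monte Carlo.

```python
import math


def posterior(log_likelihoods, priors):
    """Posterior probability of each tree: prior times likelihood, normalised."""
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    top = max(log_likelihoods.values())
    weights = {
        tree: priors[tree] * math.exp(ll - top)
        for tree, ll in log_likelihoods.items()
    }
    total = sum(weights.values())
    return {tree: w / total for tree, w in weights.items()}


# Hypothetical log-likelihoods for the three four-taxon topologies.
log_lik = {
    "((A,B),(C,D))": -100.0,
    "((A,C),(B,D))": -103.0,
    "((A,D),(B,C))": -102.0,
}

flat_prior = {tree: 1 / 3 for tree in log_lik}
skewed_prior = {
    "((A,B),(C,D))": 0.01,
    "((A,C),(B,D))": 0.01,
    "((A,D),(B,C))": 0.98,
}

post_flat = posterior(log_lik, flat_prior)
post_skewed = posterior(log_lik, skewed_prior)
print(max(post_flat, key=post_flat.get))
print(max(post_skewed, key=post_skewed.get))
```

In this toy example a flat prior leaves the ranking to the likelihoods, whereas a strongly informative prior can change which tree is the most probable.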

Calculating trees

In this section we perform the actual phylogenetic calculation.

Task:

Download the PHYLIP suite of programs from the Phylip homepage and install it on your computer.

Task:

Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
Type init() if requested.
Open the file BIN-PHYLO-Tree_building.R and follow the instructions.

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.

Notes

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2020-09-25

Version:

1.1

Version history:

1.1 2020 Maintenance
1.0 First live version.
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.

@@ Line 1: / Line 1: @@
-<div id="BIO">
+<div id="ABC">
-  <div class="b1">
+<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 Building Phylogenetic Trees
-  </div>
+<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
+(Calculating phylogenetic trees; tree visualization)
-  {{Vspace}}
+</div>
-<div class="keywords">
-<b>Keywords:</b>&nbsp;
-Calculating phylogenetic trees; tree visualization
 </div>
-{{Vspace}}
+{{Smallvspace}}
-__TOC__
-{{Vspace}}
-{{DEV}}
-{{Vspace}}
+<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
+<div style="font-size:118%;">
+<b>Abstract:</b><br />
-</div>
-<div id="ABC-unit-framework">
-== Abstract ==
 <section begin=abstract />
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "abstract" -->
 Building phylogenetic trees in theory - and with phylip in R.
 <section end=abstract />
+</div>
-{{Vspace}}
+<!-- ============================  -->
+<hr>
+<table>
-== This unit ... ==
+<tr>
-=== Prerequisites ===
+<td style="padding:10px;">
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "prerequisites" -->
+<b>Objectives:</b><br />
-<!-- included from "ABC-unit_components.wtxt", section: "notes-external_prerequisites" -->
+This unit will ...
-You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:
+* ... introduce the concepts and algorithms used to build phylogenetic trees;
-<!-- included from "FND-prerequisites.wtxt", section: "evolution" -->
+* ... teach how to compute a maximum likelihood tree with the PHYLIP proml program in R;
+</td>
+<td style="padding:10px;">
+<b>Outcomes:</b><br />
+After working through this unit you ...
+* ... are familiar with concepts and algorithms used to build phylogenetic trees;
+* ... have computed a phylogenetic tree of Mbp1 orthologue APSES domains with the PHYLIP proml program via the <tt>RPhylip::</tt> package.
+</td>
+</tr>
+</table>
+<!-- ============================  -->
+<hr>
+<b>Deliverables:</b><br />
+<section begin=deliverables />
+<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
+<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
+<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
+<section end=deliverables />
+<!-- ============================  -->
+<hr>
+<section begin=prerequisites />
+<b>Prerequisites:</b><br />
+You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:<br />
 *<b>Evolution</b>: Theory of evolution; variation, neutral drift and selection.
-<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
+This unit builds on material covered in the following prerequisite units:<br />
-You need to complete the following units before beginning this one:
 *[[BIN-PHYLO-Data_preparation|BIN-PHYLO-Data_preparation (Preparing Data for Phylogenetic Analysis)]]
+<section end=prerequisites />
+<!-- ============================  -->
+</div>
-{{Vspace}}
+{{Smallvspace}}
-=== Objectives ===
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "objectives" -->
-This unit will ...
-* ... introduce ;
-* ... demonstrate ;
-* ... teach ;
-{{Vspace}}
+{{Smallvspace}}
-=== Outcomes ===
+__TOC__
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "outcomes" -->
-After working through this unit you ...
-* ... can ;
-* ... are familar with ;
-* ... have begun to.
-{{Vspace}}
-=== Deliverables ===
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "deliverables" -->
-<!-- included from "ABC-unit_components.wtxt", section: "deliverables-time_management" -->
-*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
-<!-- included from "ABC-unit_components.wtxt", section: "deliverables-journal" -->
-*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
-<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
-*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 {{Vspace}}
@@ Line 82: / Line 68: @@
 === Evaluation ===
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "evaluation" -->
-<!-- included from "ABC-unit_components.wtxt", section: "eval-none" -->
 <b>Evaluation: NA</b><br />
-:This unit is not evaluated for course marks.
+<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
-{{Vspace}}
-</div>
-<div id="BIO">
 == Contents ==
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "contents" -->
 {{Task|1=
 *Read the introductory notes on {{ABC-PDF|BIN-PHYLO-Tree_building|building phylogenetic trees}}.
+* Also read:
+{{#pmid:32424311}}
 }}
@@ Line 155: / Line 135: @@
 * Download the PHYLIP suite of programs from the [http://evolution.genetics.washington.edu/phylip.html Phylip homepage] and install it on your computer.
-* Return to the '''RStudio project''' and work through <tt>PART FOUR: Calculating trees</tt>.
+}}
+{{Smallvspace}}
+{{ABC-unit|BIN-PHYLO-Tree_building.R}}
-}}
 {{Vspace}}
@@ Line 171: / Line 154: @@
 Should run at least overnight.
 -->
-==Analysing your tree==
-{{Vspace}}
-In order to analyse your tree, you need a species tree as reference. This really is an absolute prerequisite to make your expectations about the observed tree explicit. Fortunately we have all species nicely documented in our database.
-{{Vspace}}
-===The reference species tree===
-{{Vspace}}
-{{task|1=
-* Navigate to the [http://www.ncbi.nlm.nih.gov/taxonomy '''NCBI Taxonomy page''']
-* Execute the following '''R''' command to create an Entrez command that will retrieve all taxonomy records for the species in your database:
-<source lang="R">
-cat(paste(paste(c(myDB$taxonomy$ID, "83333"), "[taxid]", sep=""), collapse=" OR "))
-</source>
-* Copy the Entrez command, and enter it into the search field of the NCBI taxonomy page. Click on '''Search'''. The resulting page should have twelve species listed - ten "reference" fungi, ''E. coli'' (as the outgroup), and MYSPE. Make sure MYSPE is included! If it's not there, you did something wrong that needs to be fixed.
-* Click on the '''Summary''' options near the top-left of the page, and select '''Common Tree'''. This places all the species into the universal tree of life and identifies their relationships.
-* At the top, there is an option to '''Save as''' ... and the option to select a format to save the tree in. Select '''Phylip Tree''' as the format and click the '''Save as''' button. The file <code>phyliptree.phy</code> will be downloaded to your computer into your default download directory. Move it to the directory you have defined as <code>PROJECTDIR</code>.
-*Open the file in a text-editor. This is a tree, specified in the so-called {{WP|Newick_format|'''"Newick format"'''}}. The topology of the tree is defined through the brackets, and the branch-lengths are all the same: this is a cladogram, not a phylogram. The tree contains the long names for the species/strains and for our purposes we really need the "biCodes" instead. I can't think of a very elegant way to make that change programmatically, so just go ahead and replace the species names (not the taxonomic ranks though) with their biCode in your text editor. Remove all the single quotes, and replace any remaining blanks in names with an underscore. Take care however not to delete any colons or parentheses. Save the file.
-My version looks like this - '''Your version must have MYSPE somewhere in the tree.'''.
- (
- 'ESCCO':4,
- (
- (
- 'PUCGR':4,
- 'USTMA':4,
- (
- 'WALME':4,
- 'COPCI':4,
- 'CRYNE':4
- )Agaricomycotina:4
- )Basidiomycota:4,
- (
- (
- (
- 'ASPNI':4,
- 'BIPOR':4,
- 'NEUCR':4
- )leotiomyceta:4,
- 'SACCE':4
- )saccharomyceta:4,
- 'SCHPO':4
- )Ascomycota:4
- )Dikarya:4
- )'cellular organisms':4;
-*Now read the tree in '''R''' and plot it.
-<source lang="R">
-# Download the EDITED phyliptree.phy
-orgTree <- read.tree("phyliptree.phy")
-# Plot the tree in a new window
-dev.new(width=6, height=3)
-plot(orgTree, cex=1.0, root.edge=TRUE, no.margin=TRUE)
-nodelabels(text=orgTree$node.label, cex=0.6, adj=0.2, bg="#D4F2DA")
-</source>
-}}
-{{Vspace}}
-I have constructed a cladogram for many of the species we are analysing, based on data published for 1551 fungal ribosomal sequences. The six reference species are included. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
-{{vspace}}
-<div class="reference-box">
-[[Image:FungiCladogram.jpg|600px|none]]
-<small>'''Cladogram of the "reference" fungi''' studied in the assignments. This cladogram is based on a tree returned by the NCBI Common Tree. It is thus a digest of cladistic relationships, not a representation of a specific molecular phylogeny.</small>
-</div>
-Alternatively, you can look up your species in the latest version of the species tree for the fungi and add it to the tree by hand while resolving the trifurcations. See:
- {{#pmid: 22114356}}
-{{Vspace}}
-{{task|1=
-* Return to the RStudio project and continue with the script to its end. Note the deliverable at the end: to print out your trees and bring them to class.
-}}
-<!--
-#Copy the tree-string from the R console.
-#Visualize the tree online: navigate to the [http://www.trex.uqam.ca/index.php?action=newick&project=trex Trex-online Newick tree viewer]. Visualize the tree as a phylogram. Explore the options.
-# A particularly useful viewer is actually Jalview - although this may be more apparent with the larger alignment of '''all''' sequences we'll produce later.
-##Open Jalview and load your alignment of all APSES domain proteins.
-##Save the Newick-formatted tree.
-##In the alignment window, choose '''File &rarr; Load associated Tree''' and load your tree file. You can click into the tree-window to show which clades branch off at what level - it should be obvious that you can identify three major subclades (plus the outgroup). This view is particularly informative, since you can associate the clades of the tree with the actual sequences in the alignment, and get a good sense what sequence features the tree is based on.
-##Try the '''Calculate &rarr; Sort &rarr; By Tree Order''' option to sort the sequences by their position in the tree. Also note that you can flip the tree around a node by double-clicking on it. This is especially useful: try to rearrange the tree so that the subdivisions into clades are apparent. Clicking into the window "cuts" the tree and colours your sequences according to the clades in which they are found. This is useful to understand what particular sequences contributed to which part of the phylogenetic inference.
-ANALYSIS
-* First, the APS and ANK trees should have the same topology, since they are only different parts of the same protein (unless that protein has swapped its domains with another one during evolution). Clearly, that is not the case. The ''basidiomycota'' are reasonably consistent, although their internal ordering is poorly resolved, particularly in the APS tree. The ''ascomycota'' show two major differences, but they are actually consistent between the APS and the ANK tree: SACCE is less similar to all than we would expect from the species tree. And NEUCR is more similar to the ''basidiomycotal'' proteins.
-* Consider the scale bars: ANK domains have evolved at about twice the rate of the APS domains. This alone should tell us to be cautious with our interpretations since this shows there are different degrees of selective pressure on different parts of the protein. Moreover the <u>relative rates</u> differ as well. NEUCR's APSES domain has evolved much faster by comparison to other proteins than its ankyrin domain. Has its biological function changed?
-* Secondly, both gene trees should follow the species tree. Again, there are differences. But if we exclude SACCE and NEUCR, the remainder actually turns out relatively consistent.
-In any case: this is what the data tells us. The big picture is mostly conserved, but there are differences in the details. However: now we know what degree of accuracy we can expect from the analysis.
-{{Vspace}}
-==The mixed gene tree==
-{{vspace}}
-You have now practiced how to calculate, manipulate, plot, annotate and compare trees.
-{{task|1=
-* Now use Rproml to calculate a mixed gene tree based on '''all'' APSES domains. You saved it as <code>APSES.mfa</code>. For the fifty or so domains, each run will take about an hour. Thus run as many <code>random.addition</code> cycles as reasonable during a study break, or overnight. Thus the command will be something like:
-<source lang="R">
-allApsIn <- read.protein("APSES.mfa")
-fullApsTree <- Rproml(allApsIn, path=PROMLPATH, random.addition=3)
-#... and don't forget:
-save(fullApsTree, file="fullApsTree.rda")
-</source>
-}}
-{{Vspace}}
-{{Vspace}}
@@ Line 337: / Line 172: @@
 |abstract= The purpose of this tutorial is to demonstrate how to use PHYLIP, a collection of phylogenetic analysis software, and some of the options that are available. This tutorial is not intended to be a course in phylogenetics, although some phylogenetic concepts will be discussed briefly. There are other books available which cover the theoretical sides of the phylogenetic analysis, but the actual data analysis work is less well covered. Here we will mostly deal with molecular sequence data analysis in the current PHYLIP version 3.66.
 }}
-{{Vspace}}
 == Notes ==
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "notes" -->
-<!-- included from "ABC-unit_components.wtxt", section: "notes" -->
 <references />
 {{Vspace}}
-</div>
-<div id="ABC-unit-framework">
-== Self-evaluation ==
-<!-- included from "../components/BIN-PHYLO-Tree_building.components.wtxt", section: "self-evaluation" -->
-<!--
-=== Question 1===
-Question ...
-<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
-Answer ...
-<div class="mw-collapsible-content">
-Answer ...
-</div>
-  </div>
-  {{Vspace}}
--->
-{{Vspace}}
-{{Vspace}}
-<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
-----
-{{Vspace}}
-<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
-----
-{{Vspace}}
 <div class="about">
@@ Line 398: / Line 187: @@
 :2017-08-05
 <b>Modified:</b><br />
-:2017-10-31
+:2020-09-25
 <b>Version:</b><br />
-:1.0
+:1.1
 <b>Version history:</b><br />
+*1.1 2020 Maintenance
 *1.0 First live version.
 *0.1 First stub
 </div>
-[[Category:ABC-units]]
-<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
 {{CC-BY}}
+[[Category:ABC-units]]
+{{UNIT}}
+{{LIVE}}
 </div>
 <!-- [END] -->