Difference between revisions of "BIO Assignment Week 7"
Revision as of 00:38, 30 November 2015
Assignment for Week 8
Phylogenetic Analysis
Note! This assignment is currently active. All significant changes will be announced on the mailing list.
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
- Nothing in Biology makes sense except in the light of evolution.
- Theodosius Dobzhansky
... but does evolution make sense in the light of biology?
As we have seen in the previous assignments, the Mbp1 transcription factor has homologues in all other fungi, yet there is not always a clear one-to-one mapping between members of a family in distantly related species. It appears that various systems of APSES domain transcription factors have evolved independently. Of course this bears directly on our notion of function - what it means to say that two genes in different organisms have the "same" function. If two organisms both have an orthologous gene for the same, distinct function, calling these functions "the same" may be warranted. But what if that gene has duplicated in one species, and the two paralogues now perform different, related functions in one organism? These two are still orthologues to both their homologues in the other species, but now we expect functionally significant residues to have adapted to the new - and possibly distinct - roles of each paralogue. In order to be able to even ask such questions, we need to make the evolutionary history of gene families explicit. This is the domain of phylogenetic analysis. We can ask questions like: how many paralogues did the cenancestor of a clade possess? Which of these underwent additional duplications in the phylogenesis of the organism I am studying? Did any genes get lost? And - adding further biological insight to the picture - did the observed duplications lead to the "invention" of new biological systems? When was that? And perhaps even: how did the species benefit from this event?
We will develop this kind of analysis in this assignment. In the previous assignment you have established which gene in your species is the reciprocally most closely related orthologue to yeast Mbp1 (with reciprocal best match) and you have identified the full complement of APSES domain genes in your assigned organism (as a result of your PSI-BLAST search). In this assignment, we will analyse these genes' evolutionary relationship and compare it to the evolutionary relationship of other fungal APSES domains. The goal is to define families of related transcription factors and their evolutionary history. All APSES domain annotations are now available in your protein "database". Now we will attempt to compute the phylogram for these proteins. The goal is to identify orthologues and paralogues.
A number of excellent tools for phylogenetic analysis exist; general purpose packages include the (free) PHYLIP package, the MEGA package and the (commercial) PAUP* package. Of these, only MEGA is still under active development, although PHYLIP still functions perfectly (except for problems with graphical windows under Mac OS 10.6). Specialized tools for tree-building include TREE-PUZZLE and MrBayes. This assignment is constructed around programs that are available in PHYLIP. However, you are welcome to use other tools that fulfill a similar purpose if you wish. In this field, researchers consider trees that have been built with ML (maximum likelihood) methods to be more reliable than trees that are built with parsimony methods, or distance methods such as NJ (Neighbor Joining). However, ML methods are also much more compute-intensive. Just as with multiple sequence alignments, some algorithms will come closer to guessing the truth and others will not, and usually it is hard to tell which is the more trustworthy of two diverging results. The prudent researcher tries out alternatives and forms her own opinion. Specifically, we may usually assume that results that converge when computed with different algorithms are more reliable than those that depend strongly on a particular algorithm, parameters, or details of input data.
In this assignment, we will take a computational shortcut (something you should not do in real life). We will skip establishing the reliability of the tree with a bootstrap procedure, i.e. repeating the tree-building a hundred times with resampled data to see which branches and groupings are robust and which depend on the details of the data. (If you are interested, have a look here for the procedure for running a bootstrap analysis on the data set you are working with, but this may require a day or so of computing time on your computer.) In this assignment, we will simply acknowledge that bifurcations that are very close to each other have not been "resolved" and be appropriately cautious in our inferences. In phylogenetic analysis, not all lines a program draws are equally trustworthy. Don't take the trees as a given fact just because a program suggests this. Look at the evidence, include independent information where available, use your reasoning, and analyse the results critically. As you will see, there are some facts that we know for certain: we know which species the genes come from, and we can (usually) make good assumptions about the relationship of the species themselves - the history of speciation events that underlies all evolution of genes. This is extremely helpful information for our work.
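To make the idea of bootstrap resampling concrete, here is a minimal sketch in base R. It assumes that the alignment is held as a character matrix with one row per sequence and one column per alignment position; the toy data and the helper name bootstrapColumns() are hypothetical and not part of the course code. The sketch only produces the resampled alignments, it does not build trees.

```r
# Toy alignment: 3 sequences x 6 columns (invented data, illustration only).
ali <- matrix(c("A", "C", "D", "E", "F", "G",
                "A", "C", "D", "E", "F", "H",
                "A", "K", "D", "E", "Y", "H"),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("seq1", "seq2", "seq3"), NULL))

# Hypothetical helper: draw one bootstrap replicate by sampling
# alignment columns with replacement, keeping the alignment length.
bootstrapColumns <- function(m) {
  idx <- sample(ncol(m), size = ncol(m), replace = TRUE)
  m[ , idx, drop = FALSE]
}

set.seed(112358)
replicates <- lapply(1:100, function(i) bootstrapColumns(ali))

# Every replicate has the same dimensions as the original alignment.
dim(replicates[[1]])
```

In a real analysis, a tree would be computed from each replicate and the frequency with which each grouping recurs would be reported as its bootstrap support.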
If you would like to review concepts of trees, clades, LCAs, OTUs and the like, I have linked an excellent and very understandable introduction-level article on phylogenetic analysis here and in the resource section at the bottom of this page.
Baldauf (2003) Phylogeny for the faint of heart: a tutorial. Trends Genet 19:345-51. (pmid: 12801728)
Phylogenetic trees seem to be finding ever broader applications, and researchers from very different backgrounds are becoming interested in what they might have to say. This tutorial aims to introduce the basics of building and interpreting phylogenetic trees. It is intended for those wanting to understand better what they are looking at when they look at someone else's trees or to begin learning how to build their own. Topics covered include: how to read a tree, assembling a dataset, multiple sequence alignment (how it works and when it does not), phylogenetic methods, bootstrap analysis and long-branch artefacts, and software and resources.
R packages that may be useful include the following:
- R task view Phylogenetics - this task-view gives an excellent, curated overview of the important R-packages in the domain.
- package ape - general purpose phylogenetic analysis, but (as far as I can tell) ape only supports analysis with DNA sequences.
- package ips - wrapper for MrBayes, Beast, RAxML "heavy-duty" phylogenetic analysis packages.
- package Rphylip - Wrapper for Phylip, the most versatile set of phylogenetic inference tools.
Preparing input alignments
You have previously collected homologous sequences and their annotations. We will use these as input for phylogenetic analysis. But let's discuss first how such an input file should be constructed.
Principles
In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first. This is important: phylogenetic analysis does not build alignments, nor does it revise alignments; it analyses them after the alignment has been computed. A precondition for the analysis to be meaningful is that all rows of sequences have to contain the exact same number of characters and hold aligned characters in corresponding positions (i.e. columns). The program's inferences are made on a column-wise basis and if your columns contain data from unrelated positions, the inferences are going to be questionable. Clearly, in order for tree-estimation to work, one must not include fragments of sequence which have evolved under a different evolutionary model than all others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that will best represent the true phylogenetic relationships between the sequences.
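The precondition that all rows contain the same number of characters is easy to check before handing an alignment to a phylogeny program. The following is a small sketch in base R; the helper name isRectangular() and the example sequences are made up for illustration.

```r
# Hypothetical helper: TRUE if all aligned sequences have the same length.
isRectangular <- function(seqs) {
  length(unique(nchar(seqs))) == 1
}

# Invented examples: "good" is a proper alignment, "bad" is not.
good <- c(a = "MK-LV", b = "MKALV", c = "M--LV")
bad  <- c(a = "MKLV",  b = "MKALV")

isRectangular(good)   # TRUE
isRectangular(bad)    # FALSE
```

Note that this only tests the shape of the input. Whether the characters in each column are actually homologous is something no such check can tell you.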
The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.
Distance based phylogeny programs start by using sequence comparisons to estimate evolutionary distances:
- they apply a model of evolution such as a mutation data matrix, to calculate a score for each pair of sequences,
- this score is stored in a "distance matrix" ...
- ... and used to estimate a tree that groups sequences with close relationships together (e.g. by using an NJ, Neighbor Joining, algorithm).
They are fast, can work on large numbers of sequences, but are less accurate if genes evolve at different rates.
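The three steps above can be sketched in a few lines of base R. This is only an illustration under simplifying assumptions: the sequences are invented, the "model of evolution" is reduced to the plain fraction of differing positions (a p-distance, not a mutation data matrix), and the tree is built with UPGMA via hclust(method = "average") as a stand-in for NJ, because UPGMA ships with base R.

```r
# Toy aligned sequences (invented, equal length).
seqs <- c(A = "MKTAYIAK",
          B = "MKTAYIAR",
          C = "MRTSYLAR",
          D = "MRSSYLGR")

# Split into a character matrix: one row per sequence.
m <- do.call(rbind, strsplit(seqs, ""))

# Distance matrix: fraction of differing positions for each pair.
n <- nrow(m)
d <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- mean(m[i, ] != m[j, ])
  }
}
d

# Cluster the distance matrix (UPGMA, used here instead of NJ).
tree <- hclust(as.dist(d), method = "average")

# The first merge joins the closest pair; negative numbers index
# the original sequences, here A (1) and B (2).
tree$merge[1, ]
```

Note how each pair of sequences is reduced to a single number before any tree is built; this is what makes distance methods fast, and also what discards column-level information.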
Parsimony based phylogeny programs build a tree that minimizes the number of mutation events that are required to get from a common ancestral sequence to all observed sequences. They take all columns into account, not just a single number per sequence pair, as the Distance Methods do. For closely related sequences they work very well, but they construct inaccurate trees when they can't make good estimates for the required number of sequence changes.
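To illustrate what "minimizing the number of mutation events" means, here is a sketch of Fitch's counting procedure for a single alignment column on a fixed, rooted tree. The representation of the tree as a nested list and the helper name fitch() are assumptions made for this example; a real parsimony program sums such counts over all columns and searches over many topologies.

```r
# Hypothetical helper: Fitch's small-parsimony pass for one column.
# A tree is a nested list with two children; a leaf is a single
# character. Returns the candidate state set at the node and the
# minimum number of changes required in the subtree below it.
fitch <- function(node) {
  if (is.character(node)) {
    return(list(states = node, changes = 0))
  }
  left  <- fitch(node[[1]])
  right <- fitch(node[[2]])
  common <- intersect(left$states, right$states)
  if (length(common) > 0) {
    # children agree on at least one state: no extra change needed
    list(states = common,
         changes = left$changes + right$changes)
  } else {
    # children disagree: one additional change is required
    list(states = union(left$states, right$states),
         changes = left$changes + right$changes + 1)
  }
}

# One column observed in four sequences: A, A, S, S.
# Topology 1 groups identical residues: ((A,A),(S,S))
# Topology 2 mixes them:                ((A,S),(A,S))
tree1 <- list(list("A", "A"), list("S", "S"))
tree2 <- list(list("A", "S"), list("A", "S"))

fitch(tree1)$changes   # 1
fitch(tree2)$changes   # 2
```

For this column, topology 1 is the more parsimonious explanation because it requires only one change.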
ML, or Maximum Likelihood methods attempt to find the tree for which the observed sequences would be the most likely under a particular evolutionary model. They are based on a rigorous statistical framework and yield the most robust results. But they are also quite compute intensive and a tree of the size that we are building in this assignment is a challenge for the resources of a common workstation (runs about an hour on my computer). If the problem is too large, one may split a large problem into smaller, obvious subtrees (e.g. analysing orthologues as a group, only including a few paralogues for comparison) and then merge the smaller trees; this way even very large problems can become tractable.
ML methods suffer less from "long-branch attraction" - the phenomenon that weakly similar sequences can be grouped inappropriately close together in a tree due to spuriously shared differences.
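The principle of maximum likelihood can be shown on the smallest possible problem: estimating the evolutionary distance between just two sequences. The sketch below assumes the Jukes-Cantor model for nucleotides, which is much simpler than the protein models used for this assignment, and the counts n and k are invented. It finds the distance under which the observed number of differences is most likely and compares it to the known closed-form solution.

```r
# Invented observation: k differing sites out of n aligned sites.
n <- 100
k <- 20

# Log-likelihood of a distance d (expected substitutions per site)
# under the Jukes-Cantor model, up to a constant.
logLik <- function(d) {
  p <- 0.75 * (1 - exp(-4 * d / 3))   # probability that a site differs
  k * log(p) + (n - k) * log(1 - p)
}

# Numerically find the distance that maximizes the likelihood.
fit <- optimize(logLik, interval = c(1e-6, 3), maximum = TRUE)
fit$maximum

# Closed-form Jukes-Cantor distance for comparison.
-0.75 * log(1 - (4/3) * (k / n))
```

An ML tree-building program does conceptually the same thing, but it has to optimize all branch lengths simultaneously and also search over tree topologies, which is why it is so much more compute intensive.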
Bayesian methods don't estimate the tree that gives the highest likelihood for the observed data, but find the most probable tree, given that the data have been observed. If this sounds conceptually similar to you, then you are not wrong. However, the approaches employ very different algorithms. And Bayesian methods need a "prior" on trees before observation.
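The role of the prior can be illustrated with a toy calculation. The three candidate trees and all numbers below are invented purely for illustration; real Bayesian programs sample from an enormous space of trees rather than enumerating them.

```r
# Invented likelihoods P(data | tree) for three candidate topologies.
lik <- c(T1 = 0.020, T2 = 0.010, T3 = 0.010)

# A flat prior: every topology is equally probable before
# the data have been observed.
priorFlat <- c(T1 = 1/3, T2 = 1/3, T3 = 1/3)
postFlat <- lik * priorFlat / sum(lik * priorFlat)
postFlat

# An informative prior that strongly favours T2.
priorInf <- c(T1 = 0.1, T2 = 0.8, T3 = 0.1)
postInf <- lik * priorInf / sum(lik * priorInf)
postInf

# The tree with the highest likelihood need not be the most
# probable tree once the prior is taken into account.
names(which.max(lik))       # "T1"
names(which.max(postInf))   # "T2"
```

With a flat prior, the most probable tree is the same as the maximum likelihood tree, which is why the two approaches sound so similar.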
Choosing sequences
To illustrate the principle we will construct input files by joining APSES domain and Ankyrin domain sequences and for this we will use the Prosite annotations we have collected for the reference set of sequences and your YFO sequences.
Task:
- In order to proceed, you must have updated dbUtilities.R to version 0.7 or higher. Your feature annotations must have been added to your protein database. You must have your database loaded. Wherever I type <YFO_DB>, you need to type the name of your YFO database instead.
Carefully work through the following code:
# Start by loading libraries
library(Biostrings)
library(msa)
# Collect APSES and ankyrin region sequences from your database. The function
# getFeatureFasta() retrieves the sequence that is annotated for a feature
# from its start and end coordinates. The parameter exactlyOne ensures that
# one sequence per ID is returned - a string of hyphens if there was no feature
# annotated, and if there were several features, only the first one. This is
# necessary to obtain a one to one match for APSES and ankyrin sequences.
# outFormat df directs the function to return the output in a dataframe, not
# as FASTA text.
APSES <- getFeatureFasta(<YFO_DB>,
fName = "HTH_APSES",
exactlyOne = TRUE,
outFormat = "df")
# inspect the result
head(APSES)
ANKYRIN <- getFeatureFasta(<YFO_DB>,
fName = "ANK_REP_REGION",
exactlyOne = TRUE,
outFormat = "df")
# inspect the result
head(ANKYRIN)
# Add a "names" column to one table, from the output of
# makeNames(). makeNames() limits the length of the name to ten
# characters, which is what phylip can handle as a sequence label.
# The names use the last four characters of the "name" in the protein
# table, and a biCode for the organism. This is necessary to
# match nodes in the phylogenetic tree with the species
# they come from.
APSES <- cbind(names = makeNames(<YFO_DB>), APSES, stringsAsFactors=FALSE)
head(APSES)
ANKYRIN <- cbind(names = makeNames(<YFO_DB>), ANKYRIN, stringsAsFactors=FALSE)
head(ANKYRIN)
Adding an Outgroup
An outgroup is a sequence that is more distantly related to all of the other sequences than any of them are to each other. This allows us to root the tree, because the root - the last common ancestor to all - must be somewhere on the branch that connects the outgroup to the rest. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. Having a root that we can compare to the phylogram of species makes the tree interpretation much more intuitive. In our case, we are facing the problem that our species cover all of the known fungi, thus we can't rightly say that any of them are more distant to the rest. We have to look outside the fungi. The problem is, outside of the fungi there are no proteins with APSES domains, and certainly none that have APSES as well as ankyrin domains in the same gene. We can take the E. coli KilA-N domain sequence - a known, distant homologue to the APSES domain - and we can get an ankyrin region from e.g. a plant. Both outgroup domains then will have the property that they are more distant individually to any of the fungal sequences, even though they don't appear in the same protein.
Here is the KilA-N domain sequence:
>gi|301025594|ref|ZP_07189117.1| KilA-N domain protein [Escherichia coli MS 69-1]
MTSFQLSLISREIDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
FKGGRPENQGTWVHPDIAINLAQWLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF
E. coli KilA-N protein. Residues that do not align with APSES domains are shown in grey.
And here is an ankyrin repeat region, found by BLAST search in Solanum tuberosum, the potato, and confirmed with ScanProsite. Since the potato is more distant in evolution from any fungus than all fungi are to each other, this sequence is suitable to root our ankyrin domain tree.
>NP_001275294 ankyrin repeat containing protein [Solanum tuberosum]
MAPDATDALAVREKVNKFLKAACSGDIELFKKLAKQLDDGKGLAGTVADVKDGNKRGALIFAARESKIEL
CKYLVEELKVDVNEKDDEGETPLLHAAREGHTATVQYLIEQGADPAIPSASGATALHHAAGNGHVELVKL
LLSKGVDVDLQSEAGTPLMWAAGFGQEKVVKVLLEHHANVHAQTKDENNVCPLVSAVATDSLPCVELLAK
AGADVNVRTGDATPLLIAAHNGSAGVINCLLQAGADPNAAEEDGTKPIQVAAASGSREAVEALLPVTERI
QSVPEWSVDGVIEFVQSEYKREQERAEAGRKANKSREPIIPKRDLPEVSPEAKKRAADAKARGDEAFKRN
DFATAIDAYTQAIDFDPTDGTLFSNRSLCWLRLGQAERALSDARACRELRPDWAKGCYREGAALRLLQRF
EEAANAFYEGVQINPINMELVTAFREAVEAGRKVHATNKFNSPSSLS
S. tuberosum "ankyrin repeat and KH domain-containing protein 1-like" protein. Ankyrin repeat region shown in black.
Task:
# Let's add our outgroups to the feature sequence tables:
# APSES domain feature from E. coli
apsOutGroupSeq <- paste(
"IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGI",
"PISELIQSFKGGRPENQGTWVHPDIAINLAQ",
sep = "")
apsOutGroupHead <- ">apses domain from E. coli KilA-N"
apsOutGroupName <- "APS_OUTGRP"
# ankyrin region feature from S. tuberosum
ankOutGroupSeq <- paste(
"PEWSVDGVIEFVQSEYKREQERAEAGRKANKSREPIIPKRDLPEVSPEAK",
"KRAADAKARGDEAFKRNDFATAIDAYTQAIDFDPTDGTLFSNRSLCWLRL",
"GQAERALSDARACRELRPDWAKGCYREGAALRLLQRFEEAANAFYEGVQI",
"NPINMELVTAFREAVEAGRKVHATNKFNSPSSLS",
sep = "")
ankOutGroupHead <- ">ankyrin repeat region from S. tuberosum"
ankOutGroupName <- "ANK_OUTGRP"
# add the synthetic proteins to the feature compilations
APSES <- rbind(APSES, data.frame(names = apsOutGroupName,
head = apsOutGroupHead,
seq = apsOutGroupSeq,
stringsAsFactors = FALSE))
ANKYRIN <- rbind(ANKYRIN, data.frame(names = ankOutGroupName,
head = ankOutGroupHead,
seq = ankOutGroupSeq,
stringsAsFactors = FALSE))
# Remove hyphens, concatenate APSES and ANK_REP_REGION
# sequences and use names for rownames.
apsSeq <- character()
ankSeq <- character()
for (i in 1:nrow(APSES)) {
aps <- gsub("-", "", APSES$seq[i])
ank <- gsub("-", "", ANKYRIN$seq[i])
if (nchar(aps) > 0) {
apsSeq <- c(apsSeq, aps)
names(apsSeq)[length(apsSeq)] <- APSES$names[i]
}
if (nchar(ank) > 0) {
ankSeq <- c(ankSeq, ank)
names(ankSeq)[length(ankSeq)] <- ANKYRIN$names[i]
}
}
head(apsSeq)
head(ankSeq)
# These are vectors of sequences with named elements. We can
# import them into Biostrings objects.
apsSeqSet <- AAStringSet(apsSeq)
ankSeqSet <- AAStringSet(ankSeq)
# Run multiple sequence alignments. It seems that Muscle
# has a hardcoded maximum for number of input sequences
# of 45. That is of course very silly, more so since it
# appears to be undocumented. For this task, we will
# use Clustal Omega instead. Note that Clustal Omega is
# a completely different algorithm from Clustal W.
# DON'T USE CLUSTAL W. EVER. IT PRODUCES THE WORST OF
# ALL ALIGNMENTS. Clustal Omega is mostly fine.
# See http://www.clustal.org/omega/README for parameter
# details. I use a high number of iterations here. The
# alignment takes about 20 seconds each.
apsMsaSet <- msaClustalOmega(apsSeqSet, maxiters=10, order = "aligned")
ankMsaSet <- msaClustalOmega(ankSeqSet, maxiters=10, order = "aligned")
# inspect the alignments.
writeSeqSet(apsMsaSet, format = "ali")
writeSeqSet(ankMsaSet, format = "ali")
What do you think? Are these good alignments? Can they be used for phylogenetic inference?
Reviewing and Editing alignments
As discussed in the lecture, it is usually necessary to edit a multiple sequence alignment to make it suitable for phylogenetic inference. Here are the principles:
All characters in a column should be related by homology.
This implies the following rules of thumb:
- Remove all stretches of residues in which the alignment appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
- Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains. You want to only retain the APSES domains. All the extra residues from the YFO sequence can be deleted.
- Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
- Remove all but approximately one column from gapped regions in those cases where the presence of several related insertions suggests that the indel is real, and not just an alignment artefact. (Some researchers simply remove all gapped regions).
- Remove sections N- and C- terminal of gaps where the alignment appears questionable.
- If the sequences fit on a single line you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory try removing columns of sequence. Or remove species that you are less interested in from the alignment.
- Move your outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.
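Moving the outgroup to the first line does not have to be done by hand. Here is a small sketch in base R that assumes the aligned sequences are held in a named character vector. The label APS_OUTGRP is the one used for the outgroup in this assignment; the other names and all sequences are invented.

```r
# Invented toy alignment as a named character vector.
ali <- c(seqA       = "IYSARYSGVDVYEF",
         seqB       = "IATYSETDVYECYI",
         APS_OUTGRP = "IDGEIIHLRAKDGY")

# Put the outgroup first, keep the order of all other sequences.
outName <- "APS_OUTGRP"
ali <- ali[c(outName, setdiff(names(ali), outName))]

names(ali)   # "APS_OUTGRP" "seqA" "seqB"
```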
Indels are even more of a problem than usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but for a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However, in an optimal sequence alignment with gaps, this is no longer strictly true since a one-character gap creation has a different penalty score than a one-character gap extension! Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations. However, most phylogeny programs do not work in this way. They strictly operate on columns of characters and treat a gap character just like a residue with the one letter code "-". Thus gap-insertion and gap-extension characters get the same score. For short indels, this underestimates the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long though, all events are counted individually as many single substitutions (rather than one lengthy one) and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but a few columns of gapped sequence, or to remove such columns altogether.
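The last point, that shared gaps can make sequences appear more similar than they are, is easy to demonstrate. The sketch below uses two invented aligned sequences and compares their fraction of identical columns with and without the columns in which both have a gap. It treats "-" like any other character, which is the simplification described above.

```r
# Two invented aligned sequences that share a long gap.
s1 <- "MK------LV"
s2 <- "MR------LI"

a <- strsplit(s1, "")[[1]]
b <- strsplit(s2, "")[[1]]

# Identity if "-" is treated like any other character:
# 8 of 10 columns match.
mean(a == b)               # 0.8

# Identity after removing columns in which both sequences
# have a gap: only 2 of 4 columns match.
keep <- !(a == "-" & b == "-")
mean(a[keep] == b[keep])   # 0.5
```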
There is more to learn about this important step of working with aligned sequences, and here is an overview of the literature on various algorithms and tools that are available. Read at least the abstracts.
Talavera & Castresana (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564-77. (pmid: 17654362)
Alignment quality may have as much impact on phylogenetic reconstruction as the phylogenetic methods used. Not only the alignment algorithm, but also the method used to deal with the most problematic alignment regions, may have a critical effect on the final tree. Although some authors remove such problematic regions, either manually or using automatic methods, in order to improve phylogenetic performance, others prefer to keep such regions to avoid losing any information. Our aim in the present work was to examine whether phylogenetic reconstruction improves after alignment cleaning or not. Using simulated protein alignments with gaps, we tested the relative performance in diverse phylogenetic analyses of the whole alignments versus the alignments with problematic regions removed with our previously developed Gblocks program. We also tested the performance of more or less stringent conditions in the selection of blocks. Alignments constructed with different alignment methods (ClustalW, Mafft, and Probcons) were used to estimate phylogenetic trees by maximum likelihood, neighbor joining, and parsimony. We show that, in most alignment conditions, and for alignments that are not too short, removal of blocks leads to better trees. That is, despite losing some information, there is an increase in the actual phylogenetic signal. Overall, the best trees are obtained by maximum-likelihood reconstruction of alignments cleaned by Gblocks. In general, a relaxed selection of blocks is better for short alignment, whereas a stringent selection is more adequate for longer ones. Finally, we show that cleaned alignments produce better topologies although, paradoxically, with lower bootstrap. This indicates that divergent and problematic alignment regions may lead, when present, to apparently better supported although, in fact, more biased topologies.
Capella-Gutiérrez et al. (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972-3. (pmid: 19505945)
SUMMARY: Multiple sequence alignments are central to many areas of bioinformatics. It has been shown that the removal of poorly aligned regions from an alignment increases the quality of subsequent analyses. Such an alignment trimming phase is complicated in large-scale phylogenetic analyses that deal with thousands of alignments. Here, we present trimAl, a tool for automated alignment trimming, which is especially suited for large-scale phylogenetic analyses. trimAl can consider several parameters, alone or in multiple combinations, for selecting the most reliable positions in the alignment. These include the proportion of sequences with a gap, the level of amino acid similarity and, if several alignments for the same set of sequences are provided, the level of consistency across different alignments. Moreover, trimAl can automatically select the parameters to be used in each specific alignment so that the signal-to-noise ratio is optimized. AVAILABILITY: trimAl has been written in C++, it is portable to all platforms. trimAl is freely available for download (http://trimal.cgenomics.org) and can be used online through the Phylemon web server (http://phylemon2.bioinfo.cipf.es/). Supplementary Material is available at http://trimal.cgenomics.org/publications.
Blouin et al. (2009) Reproducing the manual annotation of multiple sequence alignments using a SVM classifier. Bioinformatics 25:3093-8. (pmid: 19770262)
MOTIVATION: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, it is expected that even the best alignment will contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to manually remove them. Although considered necessary in some cases, manual editing is time consuming and not reproducible. We present here an automated editing method based on the classification of 'valid' and 'invalid' sites. RESULTS: A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that it is possible to retrain/extend the training of the classifier by providing examples of multiple sequence alignment (MSA) annotation. Near optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments. AVAILABILITY: This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch job is available at http://fester.cs.dal.ca/manuel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Penn et al. (2010) GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Res 38:W23-8. (pmid: 20497997)
Evaluating the accuracy of multiple sequence alignment (MSA) is critical for virtually every comparative sequence analysis that uses an MSA as input. Here we present the GUIDANCE web-server, a user-friendly, open access tool for the identification of unreliable alignment regions. The web-server accepts as input a set of unaligned sequences. The server aligns the sequences and provides a simple graphic visualization of the confidence score of each column, residue and sequence of an alignment, using a color-coding scheme. The method is generic and the user is allowed to choose the alignment algorithm (ClustalW, MAFFT and PRANK are supported) as well as any type of molecular sequences (nucleotide, protein or codon sequences). The server implements two different algorithms for evaluating confidence scores: (i) the heads-or-tails (HoT) method, which measures alignment uncertainty due to co-optimal solutions; (ii) the GUIDANCE method, which measures the robustness of the alignment to guide-tree uncertainty. The server projects the confidence scores onto the MSA and points to columns and sequences that are unreliably aligned. These can be automatically removed in preparation for downstream analyses. GUIDANCE is freely available for use at http://guidance.tau.ac.il.
Rajan (2013) A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments. Mol Biol Evol 30:689-712. (pmid: 23193120)
Inaccurate inference of positional homologies in multiple sequence alignments and systematic errors introduced by alignment heuristics obfuscate phylogenetic inference. Alignment masking, the elimination of phylogenetically uninformative or misleading sites from an alignment before phylogenetic analysis, is a common practice in phylogenetic analysis. Although masking is often done manually, automated methods are necessary to handle the much larger data sets being prepared today. In this study, we introduce the concept of subsplits and demonstrate their use in extracting phylogenetic signal from alignments. We design a clustering approach for alignment masking where each cluster contains similar columns, with similarity being defined on the basis of compatible subsplits; our approach then identifies noisy clusters and eliminates them. Trees inferred from the columns in the retained clusters are found to be topologically closer to the reference trees. We test our method on numerous standard benchmarks (both synthetic and biological data sets) and compare its performance with other methods of alignment masking. We find that our method can eliminate sites more accurately than other methods, particularly on divergent data, and can improve the topologies of the inferred trees in likelihood-based analyses. Software available upon request from the author.
Sequence masking with R
As you saw while inspecting the alignment above, there are many regions that are uncertain due to the large numbers of gaps - or the missing (or not annotated) ankyrin domain regions.
A good approach to edit the alignment is to import your sequences into Jalview and remove uncertain columns by hand.
But for this assignment, let's write code for a simple masking heuristic.
# Let's mask out all columns that have observations for
# fewer than 1/3 of the sequences in the dataset. This
# means they have more than round(nrow(apsMsaSet) * (2/3))
# hyphens in a column.
# We take all sequences, split them into single
# characters, and put them into a matrix. Then we
# go through the matrix, column by column and decide
# whether we want to include that column.
# Step 1. Go through this by hand...
# get the length of the alignment
lenAli <- apsMsaSet@unmasked@ranges@width[1]
# initialize a matrix that can hold all characters
# individually
msaMatrix <- matrix(character(nrow(apsMsaSet) * lenAli),
ncol = lenAli)
# assign the correct rownames
rownames(msaMatrix) <- apsMsaSet@unmasked@ranges@NAMES
for (i in 1:nrow(apsMsaSet)) {
seq <- as.character(apsMsaSet@unmasked[i])
msaMatrix[i, ] <- unlist(strsplit(seq, ""))
}
# inspect the result
msaMatrix[1:5, 1:15]
# Now let's make a logical vector with an element
# for each column that selects which columns should
# be masked out.
# To count the number of elements in a vector, R has
# the table() function. For example ...
table(msaMatrix[ , 4])
# ... says: there are mostly hyphens, and a few
# other residues in column 4 of the alignment.
# Since the return value of table() is a named vector, where
# the name is the element that was counted in each slot,
# we can simply get the counts for hyphens from the
# return value of table(). We don't even need to assign
# the result to an intermediate variable, but we
# can attach the selection via square brackets,
# i.e.: ["-"], directly to the function call:
table(msaMatrix[ , 4])["-"]
# ... to get the number of hyphens. And we can compare
# whether it is e.g. > 20.
table(msaMatrix[ , 4])["-"] > 20
# Thus filling our logical vector is really simple:
# initialize the mask with TRUE values
colMask <- rep(TRUE, lenAli)
# define the threshold for rejecting a column
limit <- round(nrow(apsMsaSet) * (2/3))
# iterate over all columns, and write FALSE whenever
# a column should be rejected
for (i in 1:lenAli) {
count <- table(msaMatrix[ , i])["-"]
if (! is.na(count) && count > limit) {
colMask[i] <- FALSE
}
}
# inspect the mask
colMask
# How many positions were masked? R has a simple trick
# to count the number of TRUE and FALSE in a logical
# vector. If a logical TRUE or FALSE is converted into
# a number, it becomes 1 or 0 respectively. If we use
# the sum() function on the vector, the conversion is
# done implicitly. Thus ...
sum(colMask)
# ... gives the number of TRUE elements.
cat(sprintf("We are masking %4.2f %% of alignment columns.\n",
100 * (1 - (sum(colMask) / length(colMask)))))
# Next, we use colMask to remove the masked columns from the matrix
# in one step:
maskedMatrix <- msaMatrix[ , colMask]
# check:
ncol(maskedMatrix)
# ... then collapse each row back into a sequence ...
apsMaskedSeq <- character()
for (i in 1:nrow(maskedMatrix)) {
apsMaskedSeq[i] <- paste(maskedMatrix[i, ], collapse="")
}
names(apsMaskedSeq) <- rownames(maskedMatrix)
# ... and read it back into an AAStringSet object
apsMaskedSet <- AAStringSet(apsMaskedSeq)
# inspect ...
writeSeqSet(apsMaskedSet, format = "ali")
# Step 2. Turn this code into a function...
# Even though the procedure is simple, doing this
# more than once is tedious and prone to errors. Let's
# assemble the steps we just did into a function
# instead.
maskSet <- function(set,
fGap = (2/3),
cGap="-",
verbose = TRUE) {
# mask columns in "set" that contain more
# than fGap fraction of cGap characters.
if (class(set) != "MsaAAMultipleAlignment") {
stop(paste("This function needs an object of class",
" MsaAAMultipleAlignment as input."))
}
lenAli <- set@unmasked@ranges@width[1]
mat <- matrix(character(nrow(set) * lenAli),
ncol = lenAli)
rownames(mat) <- set@unmasked@ranges@NAMES
for (i in 1:nrow(set)) {
seq <- as.character(set@unmasked[i])
mat[i, ] <- unlist(strsplit(seq, ""))
}
colMask <- rep(TRUE, lenAli)
limit <- round(nrow(set) * fGap)
for (i in 1:lenAli) {
count <- table(mat[ , i])[cGap]
if (! is.na(count) && count > limit) {
colMask[i] <- FALSE
}
}
if (verbose) {
cat(sprintf("Masking %4.2f %% of alignment columns.\n",
100 * (1 - (sum(colMask) / length(colMask)))))
}
mat <- mat[ , colMask]
seqSet <- character()
for (i in 1:nrow(mat)) {
seqSet[i] <- paste(mat[i, ], collapse="")
}
names(seqSet) <- rownames(mat)
return(AAStringSet(seqSet))
}
# Check that the function gives identical results
# to what we did before by hand:
identical(apsMaskedSet, maskSet(apsMsaSet))
# The result must be TRUE. If it's not TRUE you have
# an error somewhere.
# Step 3: Mask the ankyrin set:
ankMaskedSet <- maskSet(ankMsaSet)
# ... and inspect it.
writeSeqSet(ankMaskedSet, format = "ali")
# This little piece of code has made the
# alignment a lot more suitable for analysis.
# We save the aligned, masked domains to file in FASTA format.
writeSeqSet(apsMaskedSet, file = "APSES.mfa", format = "mfa")
writeSeqSet(ankMaskedSet, file = "ANKYRIN.mfa", format = "mfa")
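The column-by-column loop above is written for clarity. As a minimal sketch, assuming a made-up toy alignment (the sequences and the names toySeqs, toyMat and toyMasked are not part of the assignment data), the same threshold rule can also be applied in one vectorized step with base R:

```r
# Toy alignment: 4 sequences, 6 columns (made-up data)
toySeqs <- c(s1 = "MK-A-L",
             s2 = "M--A-L",
             s3 = "MR-AVL",
             s4 = "M--G-I")

# Split each sequence into single characters: one row per sequence
toyMat <- do.call(rbind, strsplit(toySeqs, ""))

# Count the gap characters of every column in one step
gapCounts <- colSums(toyMat == "-")

# Same rule as above: reject a column if it has more than
# round(nrow * 2/3) gap characters
limit <- round(nrow(toyMat) * (2/3))
keep <- gapCounts <= limit

# Collapse the retained columns back into one string per sequence
toyMasked <- apply(toyMat[ , keep, drop = FALSE], 1, paste, collapse = "")
toyMasked
```

Comparing a character matrix to "-" gives a logical matrix, and colSums() counts the TRUE values per column, which is the same implicit conversion that sum(colMask) relies on.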
Selecting a Sequence Subset
Running a full maximum-likelihood tree calculation on this set of fifty-something sequences takes more than half a day. To hone the procedure and test our parameters, we will first calculate a tree only with sequences that fulfil the RBM with Mbp1, and with the outgroup. Obviously, the best we could hope for from such an analysis is to reproduce the species tree. But we already know what that should be. This computational experiment thus serves as our control: to estimate how accurate the tree is going to be in the best case, especially since we can check whether our APSES and ankyrin trees give the same results.
Go through this part of code carefully: you will need to change some parts to work with YFO.
Task:
# To extract the Mbp1 sequences from the set, we
# define a vector of names for the sequences we
# want.
# Here are the "names" for Mbp1 RBM proteins in the
# reference protein set. You need to add the name of
# the YFO RBM to Mbp1 to this set. Make sure you
# include the "biCode" and use no more than ten
# characters total!
apsMbp1Names <- c("APS_OUTGRP",
"res2_SCHPO",
"7587_NEUCR",
"_338_BIPOR",
"54.2_ASPNI",
"9726_WALME",
"0840_CRYNE",
"1306_COPCI",
"8863_PUCGR",
"1222_USTMA",
"MBP1_SACCE",
"<YFO-RBM_name>")
# YOU MUST ADD THE NAME OF YOUR YFO RBM TO MBP1 TO
# THIS VECTOR BEFORE YOU CONTINUE.
# Copy this vector for ankyrin domains, and substitute the
# correct outgroup name.
ankMbp1Names <- apsMbp1Names
ankMbp1Names[1] <- "ANK_OUTGRP"
# First we create a vector of TRUE/FALSE by checking
# which of the names in the sequence set is also
# in our vector of Mbp1 names:
select <- apsMaskedSet@ranges@NAMES %in% apsMbp1Names
# inspect this
select
# Then we apply the vector to copy only the sequences
# we want to a new Set.
apsMbp1Set <- apsMaskedSet[select]
# Finally, we write the set to a file in mfa format.
# We need the file as an input file for Rphylip.
writeSeqSet(apsMbp1Set, file = "apsMbp1Set.mfa", format = "mfa")
# Same for the ankyrin domains ...
select <- ankMaskedSet@ranges@NAMES %in% ankMbp1Names
ankMbp1Set <- ankMaskedSet[select]
writeSeqSet(ankMbp1Set, file = "ankMbp1Set.mfa", format = "mfa")
- And with that, we have finally prepared the data we need to calculate trees.
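Since a mistyped name is silently dropped by %in%, it can be worth checking the name vector before you continue. The following is only a sketch with hypothetical values: "XXXX_YFOXX" stands in for your own YFO name, and the vector available stands in for the names stored in your masked sequence set.

```r
# Hypothetical names; "XXXX_YFOXX" stands in for your YFO RBM name
wanted <- c("APS_OUTGRP", "res2_SCHPO", "MBP1_SACCE", "XXXX_YFOXX")

# Hypothetical names as they might appear in the masked sequence set
available <- c("APS_OUTGRP", "res2_SCHPO", "MBP1_SACCE", "SWI4_SACCE")

# Names that are longer than ten characters
tooLong <- wanted[nchar(wanted) > 10]

# Requested names that do not occur in the set (typos, missing YFO)
notFound <- setdiff(wanted, available)

# The selection itself, as in the code above
select <- available %in% wanted

tooLong
notFound
select
```

If notFound is not empty, the corresponding sequences will be missing from the subset, and thus from the tree.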
Calculating trees
In this section we perform the actual phylogenetic calculation.
Task:
- Download the PHYLIP suite of programs from the Phylip homepage and install it on your computer.
- Execute the following code.
install.packages("Rphylip")
library(Rphylip)
# This will install Rphylip, as well as its dependency, the package "ape".
# The next part may be tricky. You will need to figure out where
# on your computer Phylip has been installed and define the path
# to the proml program that calculates a maximum-likelihood tree.
# I give you instructions for the Mac below.
# You'll need to figure out the equivalent Windows commands and
# please post instructions on the mailing list once you have got
# this to work on Windows.
# On the Mac, the standard installation places a phylip folder
# in the /Applications directory. That folder contains all the
# individual phylip programs as <name>.app files. These are not
# the actual executables, but "app" files are actually directories
# that contain the required resources for a program to run.
# The executable is in a subdirectory and you can point Rphylip
# directly to that subdirectory to find the program it needs:
PROMLPATH <- "/Applications/phylip-3.695/exe/proml.app/Contents/MacOS"
# Now read the mfa files you have saved as "proseq" objects:
apsIn <- read.protein("apsMbp1Set.mfa")
ankIn <- read.protein("ankMbp1Set.mfa")
# ... and you are ready to build a tree.
# Building maximum-likelihood trees can eat as much computer time
# as you can throw at it. Calculating a tree of 48 APSES domains
# with default parameters of Rproml() runs for more than half a day
# on my computer. This is why we'll start off with a smaller subset
# and we'll also calculate a less accurate tree:
# We'll use a faster algorithm ... speedier = TRUE
# We'll reduce the number of random addition replicates from
# ten (default) to three.
apsTree <- Rproml(apsIn, path=PROMLPATH, speedier = TRUE, random.addition=3)
ankTree <- Rproml(ankIn, path=PROMLPATH, speedier = TRUE, random.addition=3)
# This should take about half a minute each.
# A quick first look:
layout(matrix(1:2, 1, 2))
plot(apsTree)
plot(ankTree)
layout(matrix(1), widths=1.0, heights=1.0)
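If Rproml() complains that it cannot find the program, the path is the first thing to check. Here is a small sketch of such a check. The helper hasProgram() is hypothetical, and both the executable name "proml" and the ".exe" variant for Windows are assumptions about how PHYLIP was installed on your computer.

```r
# Hypothetical helper: does "path" contain a file named "prog"?
# The ".exe" variant is an assumption about Windows installations.
hasProgram <- function(path, prog = "proml") {
  candidates <- file.path(path, c(prog, paste0(prog, ".exe")))
  any(file.exists(candidates))
}

# Demonstration with a temporary directory that stands in for
# the PHYLIP installation directory
demoDir <- file.path(tempdir(), "phylipDemo")
dir.create(demoDir, showWarnings = FALSE)
file.create(file.path(demoDir, "proml"))

hasProgram(demoDir)            # finds the file we just created
hasProgram(demoDir, "dnaml")   # no such file in this directory
```

With your real installation you would call hasProgram(PROMLPATH) instead.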
If everything went as planned, you are looking at two phylogenetic trees. But they look quite different. How different is their topology really? Shouldn't they be virtually identical because after all we are only analysing different parts of the same sequence? And how do they relate to what we would expect: a recapitulation of the species cladogram in the gene tree?
Time to analyse the results.
Analysing your tree
In order to analyse your tree, you need a species tree as reference. This really is an absolute prerequisite to make your expectations about the observed tree explicit. Fortunately we have all species nicely documented in our database.
The reference species tree
Task:
- Navigate to the NCBI Taxonomy page
- Execute the following R command to create an Entrez command that will retrieve all taxonomy records for the species in your database:
cat(paste(<YFO_DB>$taxonomy$id, collapse="[taxid] OR "), "[taxid]")
- Copy the command, and enter it into the search field of the NCBI taxonomy page. Click on Search. The resulting page should have twelve species listed - ten "reference" fungi, E. coli (as the outgroup), and YFO. Make sure YFO is included! If it's not there, you did something wrong that needs to be fixed.
- Click on the Summary options near the top-left of the page, and select Common Tree. This places all the species into the universal tree of life and identifies their relationships.
- At the top, there is an option to Save as ... and the option to select a format to save the tree in. Select Phylip Tree as the format and click the Save as button. The file phyliptree.phy will be downloaded to your computer into your default download directory. Move it to the directory you have defined as PROJECTDIR.
- Open the file in a text-editor. This is a tree, specified in the so-called "Newick Format". The topology of the tree is defined through the brackets, and the branch-lengths are all the same: this is a cladogram, not a phylogram. The tree contains the long names for the species/strains and for our purposes we really need the "biCodes" instead. I can't think of a very elegant way to make that change programmatically, so just go ahead and replace the species names (not the taxonomic ranks though) with their biCode in your text editor. Remove all the single quotes, and replace any remaining blanks in names with an underscore. Take care however not to delete any colons or parentheses. Save the file.
My version looks like this. Your version must have YFO somewhere in the tree.
((( PUCGR:4, ( WALME:4, CRYNE:4, COPCI:4 )Agaricomycotina:4, USTMA:4 )Basidiomycota:4, ( SCHPO:4, ( SACCE:4, ( BIPOR:4, NEUCR:4, ASPNI:4 )leotiomyceta:4 )saccharomyceta:4 )Ascomycota:4 )Dikarya:4, ESCCO:4 )cellular_organisms:4;
- Now read the tree in R and plot it.
# Read the EDITED phyliptree.phy
orgTree <- read.tree("phyliptree.phy")
# Plot the tree in a new window
dev.new(width=6, height=3)
plot(orgTree, cex=1.0, root.edge=TRUE, no.margin=TRUE)
nodelabels(text=orgTree$node.label, cex=0.6, adj=0.2, bg="#D4F2DA")
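If you would rather not edit the Newick file by hand, plain text substitution is one (not particularly elegant) option. This is only a sketch: the Newick excerpt and the lookup table of long names are made up for illustration, and you would have to supply the exact names that appear in your own phyliptree.phy file.

```r
# Made-up excerpt of an unedited Common Tree file
newick <- paste0("(('Saccharomyces cerevisiae S288c':4,",
                 "'Neurospora crassa OR74A':4)saccharomyceta:4,",
                 "'Escherichia coli K-12':4)'cellular organisms':4;")

# Hypothetical lookup table: long name -> biCode
biCodes <- c("Saccharomyces cerevisiae S288c" = "SACCE",
             "Neurospora crassa OR74A"        = "NEUCR",
             "Escherichia coli K-12"          = "ESCCO")

# Replace each long name by its biCode (literal matching, no regex)
for (longName in names(biCodes)) {
  newick <- gsub(longName, biCodes[[longName]], newick, fixed = TRUE)
}

# Remove all single quotes, then replace remaining blanks
# with underscores
newick <- gsub("'", "", newick, fixed = TRUE)
newick <- gsub(" ", "_", newick, fixed = TRUE)
newick
```

The edited string can be written to a file with writeLines() and then read with read.tree() as above.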
I have constructed a cladogram for many of the species we are analysing, based on data published for 1551 fungal ribosomal sequences. The six reference species are included. Such reference trees from rRNA data are a standard method of phylogenetic analysis, supported by the assumption that rRNA sequences are monophyletic and have evolved under comparable selective pressure in all species.
Cladogram of the "reference" fungi studied in the assignments. This cladogram is based on a tree returned by the NCBI Common Tree. It is thus a digest of cladistic relationships, not a representation of a specific molecular phylogeny.
Alternatively, you can look up your species in the latest version of the species tree for the fungi and add it to the tree by hand while resolving the trifurcations. See:
Ebersberger et al. (2012) A consistent phylogenetic backbone for the fungi. Mol Biol Evol 29:1319-34. (pmid: 22114356) |
[ PubMed ] [ DOI ] The kingdom of fungi provides model organisms for biotechnology, cell biology, genetics, and life sciences in general. Only when their phylogenetic relationships are stably resolved, can individual results from fungal research be integrated into a holistic picture of biology. However, and despite recent progress, many deep relationships within the fungi remain unclear. Here, we present the first phylogenomic study of an entire eukaryotic kingdom that uses a consistency criterion to strengthen phylogenetic conclusions. We reason that branches (splits) recovered with independent data and different tree reconstruction methods are likely to reflect true evolutionary relationships. Two complementary phylogenomic data sets based on 99 fungal genomes and 109 fungal expressed sequence tag (EST) sets analyzed with four different tree reconstruction methods shed light from different angles on the fungal tree of life. Eleven additional data sets address specifically the phylogenetic position of Blastocladiomycota, Ustilaginomycotina, and Dothideomycetes, respectively. The combined evidence from the resulting trees supports the deep-level stability of the fungal groups toward a comprehensive natural system of the fungi. In addition, our analysis reveals methodologically interesting aspects. Enrichment for EST encoded data-a common practice in phylogenomic analyses-introduces a strong bias toward slowly evolving and functionally correlated genes. Consequently, the generalization of phylogenomic data sets as collections of randomly selected genes cannot be taken for granted. A thorough characterization of the data to assess possible influences on the tree reconstruction should therefore become a standard in phylogenomic analyses. |
Visualizing your tree
The trees that are produced by Rphylip are stored as an object of class phylo. This is a class for phylogenetic trees that is widely used in the community; practically all R phylogenetics packages will have options to read and manipulate such trees. Outside of R, a popular interchange format is the Newick format that you have seen above. It's easy to output your calculated trees in Newick format and visualize them elsewhere.
Task:
# The "phylo" class object is one of R's "S3"
# objects and methods to plot and print it have been added
# to the system. You can simply call plot(<your-tree>) and
# R knows what to do with <your-tree> and how to plot it.
# The underlying function is plot.phylo(), and documentation
# for its many options can be found by typing:
?plot.phylo
plot(apsTree) # default type is "phylogram"
plot(apsTree, type="unrooted")
plot(apsTree, type="fan", no.margin = TRUE)
# rescale to show all of the labels:
# record the current plot parameters ...
tmp <- plot(apsTree, type="fan", no.margin = TRUE, plot=FALSE)
# ... and adjust the plot limits for a new plot
plot(apsTree,
type="fan",
x.lim = tmp$x.lim * 1.8,
y.lim = tmp$y.lim * 1.8,
cex = 0.8,
no.margin = TRUE)
# Inspect the tree object
str(apsTree)
apsTree$tip.label
apsTree$edge
apsTree$edge.length
# show the node / edge and tip labels on a plot
plot(apsTree)
nodelabels()
edgelabels()
tiplabels()
# show the number of nodes, edges and tips
Nnode(apsTree)
Nedge(apsTree)
Ntip(apsTree)
# Finally, write the tree to console in Newick format
write.tree(apsTree)
- Copy the tree-string from the R console.
- Visualize the tree online: navigate to the Trex-online Newick tree viewer. Visualize the tree as a phylogram. Explore the options.
Tree Analysis
In order to analyse the tree, it is helpful to root it first and reorder its clades.
Rooting Trees
Task:
# Contrary to documentation, Rproml() returns an unrooted tree.
is.rooted(apsTree)
is.rooted(ankTree)
# You can root the tree with the command root() from the "ape"
# package. ape is automatically installed and loaded with
# Rphylip.
plot(apsTree)
# add labels for internal nodes and tips
nodelabels(cex=0.5, frame="circle")
tiplabels(cex=0.5, frame="rect")
# The outgroup of the tree is tip "11" in my sample
# tree, it may be a different number in yours. If that's
# the case substitute the correct node number below for
# "outgroup".
apsTree <- root(apsTree, outgroup = 11, resolve.root = TRUE)
plot(apsTree)
is.rooted(apsTree)
# this tree _looks_ unchanged, because when the root
# trifurcation was resolved, an edge of length zero
# was added to connect the MRCA (Most Recent Common
# Ancestor) of the ingroup.
# The edge lengths are stored in the phylo object:
apsTree$edge.length
# ... and you can assign a small arbitrary value to the edge
# to show how it connects to the tree without having an
# overlap.
apsTree$edge.length[1] <- 0.1
plot(apsTree, cex=0.7)
nodelabels(text="MRCA", node=12, cex=0.5, adj=0.1, bg="#ff8866")
# Repeat for the ankyrin domain tree. Be careful to
# change the code if necessary for YOUR tree.
ankTree <- root(ankTree, outgroup = 11, resolve.root = TRUE)
ankTree$edge.length[1] <- 0.1
# This procedure does however not assign an actual length to
# a root edge, and therefore no root edge is visible on the
# plot. Why, you might ask. I ask myself that too. We'll
# just add a length by hand.
apsTree$root.edge <- mean(apsTree$edge.length) * 1.5
ankTree$root.edge <- mean(ankTree$edge.length) * 1.5
# compare the two trees to confirm they are now rooted
layout(matrix(1:2, 1, 2))
plot(apsTree, cex=0.7, root.edge=TRUE)
nodelabels(text="MRCA", node=12, cex=0.5, adj=0.8, bg="#ff8866")
plot(ankTree, cex=0.7, root.edge=TRUE)
nodelabels(text="MRCA", node=12, cex=0.5, adj=0.8, bg="#ff8866")
layout(matrix(1), widths=1.0, heights=1.0)
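Since tip and node numbers depend on your own tree, it may be safer to look them up than to type them in. The sketch below uses a made-up toy tree with hypothetical tip labels. It relies on two properties of the ape package: root() accepts a tip label as the outgroup, and the root of a phylo object is the node with number Ntip + 1.

```r
library(ape)

# Made-up unrooted toy tree with hypothetical tip labels
toyTree <- read.tree(text = paste0("(MBP1_SACCE:0.3,res2_SCHPO:0.4,",
                                   "(1222_USTMA:0.5,APS_OUTGRP:0.9):0.2);"))
is.rooted(toyTree)

# Look up the tip number of the outgroup by its label ...
iOut <- which(toyTree$tip.label == "APS_OUTGRP")
iOut

# ... or pass the label itself to root()
toyRooted <- root(toyTree, outgroup = "APS_OUTGRP", resolve.root = TRUE)
is.rooted(toyRooted)

# The root node of the rooted tree has the number Ntip + 1
iRoot <- Ntip(toyRooted) + 1
iRoot
```

For a tree with eleven tips, Ntip + 1 gives the node number 12 that is used for the "MRCA" label above. Once YFO is included, your tree has one more tip and the number changes accordingly.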
Rotating Clades
To interpret the tree, it is useful to rotate the clades so that they appear in the order expected from the cladogram of species.
Task:
# We can either rotate around individual internal nodes:
layout(matrix(1:2, 1, 2))
plot(apsTree, no.margin=TRUE, root.edge=TRUE)
nodelabels(node=17, cex=0.7, bg="#ff8866")
plot(rotate(apsTree, node=17), no.margin=TRUE, root.edge=TRUE)
nodelabels(node=17, cex=0.7, bg="#88ff66")
layout(matrix(1), widths=1.0, heights=1.0)
# ... or we can plot the tree so it corresponds as
# well as possible to a predefined tip ordering. Here
# we use the ordering that NCBI Common Tree returns
# for the reference species - we have used it above to
# make the vectors apsMbp1Names and ankMbp1Names. You
# inserted your YFO name into that vector - but you
# should move it to its correct position in the
# cladogram.
# (Nb. we need to reverse the ordering for the plot.
# This is why we use the expression [nOrg:1] below
# instead of using the vector directly.)
nOrg <- length(apsMbp1Names)
dev.new(width=9, height=5)
layout(matrix(1:3, 1, 3))
plot(orgTree,
no.margin=TRUE, root.edge=TRUE)
nodelabels(text=orgTree$node.label, cex=0.5, adj=0.2, bg="#D4F2DA")
plot(rotateConstr(apsTree, apsMbp1Names[nOrg:1]),
no.margin=TRUE, root.edge=TRUE)
add.scale.bar(length=0.5)
plot(rotateConstr(ankTree, ankMbp1Names[nOrg:1]),
no.margin=TRUE, root.edge=TRUE)
add.scale.bar(length=0.5)
layout(matrix(1), widths=1.0, heights=1.0)
Study the three trees and consider their similarities and differences. What do you expect? What do you find?
- First, the APS and ANK trees should have the same topology, since they are only different parts of the same protein (unless that protein has swapped its domains with another one during evolution). Clearly, that is not the case. The basidiomycota are reasonably consistent, although their internal ordering is poorly resolved, particularly in the APS tree. The ascomycota show two major differences, but they are actually consistent between the APS and the ANK tree: SACCE is less similar to all than we would expect from the species tree. And NEUCR is more similar to the basidiomycotal proteins.
- Consider the scale bars: ANK domains have evolved at about twice the rate of the APS domains. This alone should tell us to be cautious with our interpretations since this shows there are different degrees of selective pressure on different parts of the protein. Moreover the relative rates differ as well. NEUCR's APSES domain has evolved much faster by comparison to other proteins than its ankyrin domain. Has its biological function changed?
- Secondly, both gene trees should follow the species tree. Again, there are differences. But if we exclude SACCE and NEUCR, the remainder actually turns out relatively consistent.
In any case: this is what the data tells us. The big picture is mostly conserved, but there are differences in the details. However: now we know what degree of accuracy we can expect from the analysis.
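Comparing trees by eye can be supplemented with a number. As a rough sketch, the dist.topo() function of the ape package computes a topological distance between two trees that have the same tip labels. The two toy trees here are made up for illustration.

```r
library(ape)

# Two made-up unrooted trees on the same five tips
treeA <- read.tree(text = "((A,B),(C,D),E);")
treeB <- read.tree(text = "((A,C),(B,D),E);")

# Distance of a tree to itself: identical topologies
dSame <- dist.topo(treeA, treeA)

# Distance between the two different topologies
# (default method: Penny and Hendy, 1985)
dDiff <- dist.topo(treeA, treeB)

dSame
dDiff
```

To apply this to your APSES and ankyrin trees, the tip labels must match first. Since the outgroups are named differently, you would need to rename one of them in a copy of the tree, e.g. the "ANK_OUTGRP" tip label to "APS_OUTGRP".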
The mixed gene tree
You have now practiced how to calculate, manipulate, plot, annotate and compare trees.
Task:
- Now use Rproml to calculate a mixed gene tree based on all APSES domains. You saved the alignment as APSES.mfa. For the fifty or so domains, each run will take about an hour, so run as many random.addition cycles as is reasonable during a study break, or overnight. The command will be something like:
allApsIn <- read.protein("APSES.mfa")
fullApsTree <- Rproml(allApsIn, path=PROMLPATH, random.addition=3)
#... and don't forget:
save(fullApsTree, file="fullApsTree.rda")
Analysis
Here are two principles that will help you make sense of the tree.
A: A gene that is present in an ancestral species is inherited in all descendant species. The gene has to be observed in all OTUs, unless it has been lost (which is a rare event).
B: Paralogous genes in an ancestral species should give rise to monophyletic subtrees for each of the paralogues, in all descendants; this means: if the MRCA of a branch has e.g. three genes, we would expect three copies of that branch below this node, one for each of the three genes. Each of these subtrees should recapitulate the reference phylogenetic tree of the species, up to the branchpoint of their MRCA. The precise relationships may not be readily apparent, due to the noise and limited resolution we saw above, but the gene ought to be somewhere in the tree and you can often assume that it is closest to where it ought to be if the topology was correct. In this way you try to reconcile your expectations with your observations - preferably with as small a number of changes as possible.
With these two simple principles (draw them out on a piece of paper if they do not seem obvious to you), you can probably pry your tree apart quite nicely. A few colored pencils and a printout of the tree will help. I would start by identifying all of the Mbp1 RBMs in the tree.
Here is a bit of code that you can use to colour the labels of the Mbp1 RBMs:
# You have previously defined the names for Mbp1 RBMs in
# the vector apsMbp1Names. You can use these to check
# which of the tree tipLabels are in that vector and
# then color them red in the plot.
# You'll need to replace <TREE> with whatever you called
# your full tree with all APSES domain proteins.
#First, have a look at the tip labels in your tree:
<TREE>$tip.label
# We'll create a vector of black colours of the same length
# as the tip label vector:
tipColors <- rep("#000000", Ntip(<TREE>))
# ... then we replace each one for which the label is
# in apsMbp1Names with "#BB0000" (red)
tipColors[<TREE>$tip.label %in% apsMbp1Names] <- "#BB0000"
#inspect:
tipColors
# ... and then we plot:
plot(<TREE>, tip.color=tipColors,
cex=0.7, root.edge=TRUE, no.margin=TRUE)
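The same approach works for colouring by species, which may help you trace principle B through the tree. This sketch assumes that all tip labels follow the "name_biCode" convention. The labels are made up for illustration, and spColors is a hypothetical name.

```r
# Made-up tip labels in the "<name>_<biCode>" convention
labels <- c("MBP1_SACCE", "SWI4_SACCE", "res2_SCHPO", "_338_BIPOR")

# The biCode is everything after the last underscore
biCodes <- sub("^.*_", "", labels)
biCodes

# Black by default, blue for all tips of one species of interest
spColors <- rep("#000000", length(labels))
spColors[biCodes == "SACCE"] <- "#0000BB"
spColors

# Number of sequences per species
table(biCodes)
```

With a real tree you would start from its tip.label vector and pass spColors as the tip.color argument of plot().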
The APSES domains of the MRCA
Note: A common confusion about cenancestral genes (MRCA = Most Recent Common Ancestor) arises from the fact that many of the expected genes are not present in the OTUs. Some will have been lost, some will have been incorrectly annotated in their genome (frameshifts!) and not been found with PSI-BLAST, some may have diverged beyond recognizability. In general you have to ask: given the species represented in a subclade, what is the last common ancestor of that branch? The expectation is that all descendants of that ancestor should be represented in that branch unless one of the above reasons why a gene might be absent applies. E.g. if a branch contains species from Basidiomycota and Ascomycota, this means that its MRCA was the ancestor of all fungi.
Task:
- Consider the APSES domain proteins of the fungal cenancestor. What evidence do you see in the tree that identifies them? Note that the hallmark of a clade that originated in the cenancestor is that it contains species from all subsequent major branches of the species tree. How many of these proteins are there? What are the names of their SACCE descendants?
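The hallmark described above can be phrased as a simple test on the tip labels of a clade. This is a sketch only: the function spansBoth() is hypothetical, the biCode groups are taken from the reference species cladogram above, and you would need to add YFO to the group it belongs to.

```r
# biCodes of the reference species, grouped as in the cladogram
basidio <- c("PUCGR", "WALME", "CRYNE", "COPCI", "USTMA")
asco    <- c("SCHPO", "SACCE", "BIPOR", "NEUCR", "ASPNI")

# Hypothetical helper: does a set of tip labels contain species
# from both major branches?
spansBoth <- function(tipLabels) {
  codes <- sub("^.*_", "", tipLabels)   # text after last underscore
  any(codes %in% basidio) && any(codes %in% asco)
}

# Made-up clades for illustration
cladeX <- c("MBP1_SACCE", "res2_SCHPO", "1222_USTMA")
cladeY <- c("MBP1_SACCE", "res2_SCHPO")

spansBoth(cladeX)   # contains Ascomycota and Basidiomycota
spansBoth(cladeY)   # contains Ascomycota only
```

A clade for which this returns TRUE is a candidate for a gene that was already present in the fungal cenancestor. Because of the limited resolution we saw above, you still have to judge each case on the tree itself.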
The APSES domains of YFO
You have identified the APSES domain genes of the fungal cenancestor above. Accordingly, this defines the number of APSES protein genes the ancestor to YFO had. Identify the sequence of duplications and/or gene loss in your organism through which YFO has ended up with the APSES domains it possesses today.
Task:
- Print the tree to a single sheet of paper.
- Mark the clades for the genes of the cenancestor.
- Label all subsequent branchpoints that affect the gene tree for YFO with either "D" (for duplication) or "S" (for speciation). Remember that specific speciation events can appear more than once in a tree. Identify such events.
- Bring this sheet with you to the quiz on Tuesday. Your annotated printout will be worth half of the phylogeny quiz marks.
Bonus: when did it happen?
A very cool resource is Timetree - a tool that allows you to estimate divergence times between species. For example, the speciation event that separated the main branches of the fungi - i.e. the time when the fungal cenancestor lived - is given by the divergence time of Schizosaccharomyces pombe and Saccharomyces cerevisiae: 761,000,000 years ago. For comparison, these two fungi are therefore approximately as related to each other as you are ...
A) to the rabbit?
B) to the opossum?
C) to the chicken?
D) to the rainbow trout?
E) to the warty sea squirt?
F) to the bumblebee?
G) to the earthworm?
H) to the fly agaric?
Check it out - the question will be on the quiz.
Links and Resources
- Literature
Ebersberger et al. (2012) A consistent phylogenetic backbone for the fungi. Mol Biol Evol 29:1319-34. (pmid: 22114356) |
[ PubMed ] [ DOI ] The kingdom of fungi provides model organisms for biotechnology, cell biology, genetics, and life sciences in general. Only when their phylogenetic relationships are stably resolved, can individual results from fungal research be integrated into a holistic picture of biology. However, and despite recent progress, many deep relationships within the fungi remain unclear. Here, we present the first phylogenomic study of an entire eukaryotic kingdom that uses a consistency criterion to strengthen phylogenetic conclusions. We reason that branches (splits) recovered with independent data and different tree reconstruction methods are likely to reflect true evolutionary relationships. Two complementary phylogenomic data sets based on 99 fungal genomes and 109 fungal expressed sequence tag (EST) sets analyzed with four different tree reconstruction methods shed light from different angles on the fungal tree of life. Eleven additional data sets address specifically the phylogenetic position of Blastocladiomycota, Ustilaginomycotina, and Dothideomycetes, respectively. The combined evidence from the resulting trees supports the deep-level stability of the fungal groups toward a comprehensive natural system of the fungi. In addition, our analysis reveals methodologically interesting aspects. Enrichment for EST encoded data-a common practice in phylogenomic analyses-introduces a strong bias toward slowly evolving and functionally correlated genes. Consequently, the generalization of phylogenomic data sets as collections of randomly selected genes cannot be taken for granted. A thorough characterization of the data to assess possible influences on the tree reconstruction should therefore become a standard in phylogenomic analyses. |
Marcet-Houben & Gabaldón (2009) The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. PLoS ONE 4:e4357. (pmid: 19190756) |
[ PubMed ] [ DOI ] A recurrent topic in phylogenomics is the combination of various sequence alignments to reconstruct a tree that describes the evolutionary relationships within a group of species. However, such approach has been criticized for not being able to properly represent the topological diversity found among gene trees. To evaluate the representativeness of species trees based on concatenated alignments, we reconstruct several fungal species trees and compare them with the complete collection of phylogenies of genes encoded in the Saccharomyces cerevisiae genome. We found that, despite high levels of among-gene topological variation, the species trees do represent widely supported phylogenetic relationships. Most topological discrepancies between gene and species trees are concentrated in certain conflicting nodes. We propose to map such information on the species tree so that it accounts for the levels of congruence across the genome. We identified the lack of sufficient accuracy of current alignment and phylogenetic methods as an important source for the topological diversity encountered among gene trees. Finally, we discuss the implications of the high levels of topological variation for phylogeny-based orthology prediction strategies. |
Baldauf (2003) Phylogeny for the faint of heart: a tutorial. Trends Genet 19:345-51. (pmid: 12801728) |
[ PubMed ] [ DOI ] Phylogenetic trees seem to be finding ever broader applications, and researchers from very different backgrounds are becoming interested in what they might have to say. This tutorial aims to introduce the basics of building and interpreting phylogenetic trees. It is intended for those wanting to understand better what they are looking at when they look at someone else's trees or to begin learning how to build their own. Topics covered include: how to read a tree, assembling a dataset, multiple sequence alignment (how it works and when it does not), phylogenetic methods, bootstrap analysis and long-branch artefacts, and software and resources. |
Tuimala, Jarno (2006) A primer to phylogenetic analysis using the PHYLIP package. |
(pmid: None) [ Source URL ] The purpose of this tutorial is to demonstrate how to use PHYLIP, a collection of phylogenetic analysis software, and some of the options that are available. This tutorial is not intended to be a course in phylogenetics, although some phylogenetic concepts will be discussed briefly. There are other books available which cover the theoretical sides of the phylogenetic analysis, but the actual data analysis work is less well covered. Here we will mostly deal with molecular sequence data analysis in the current PHYLIP version 3.66. |
- Software
- Sequences
Biasini et al. (2014) SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Res 42:W252-8. (pmid: 24782522)
Protein structure homology modelling has become a routine technique to generate 3D models for proteins when experimental structures are not available. Fully automated servers such as SWISS-MODEL with user-friendly web interfaces generate reliable models without the need for complex software packages or downloading large databases. Here, we describe the latest version of the SWISS-MODEL expert system for protein structure modelling. The SWISS-MODEL template library provides annotation of quaternary structure and essential ligands and co-factors to allow for building of complete structural models, including their oligomeric structure. The improved SWISS-MODEL pipeline makes extensive use of model quality estimation for selection of the most suitable templates and provides estimates of the expected accuracy of the resulting models. The accuracy of the models generated by SWISS-MODEL is continuously evaluated by the CAMEO system. The new web site allows users to interactively search for templates, cluster them by sequence similarity, structurally compare alternative templates and select the ones to be used for model building. In cases where multiple alternative template structures are available for a protein of interest, a user-guided template selection step allows building models in different functional states. SWISS-MODEL is available at http://swissmodel.expasy.org/.
Bordoli & Schwede (2012) Automated protein structure modeling with SWISS-MODEL Workspace and the Protein Model Portal. Methods Mol Biol 857:107-36. (pmid: 22323219)
Comparative protein structure modeling is a computational approach to build three-dimensional structural models for proteins using experimental structures of related protein family members as templates. Regular blind assessments of modeling accuracy have demonstrated that comparative protein structure modeling is currently the most reliable technique to model protein structures. Homology models are often sufficiently accurate to substitute for experimental structures in a wide variety of applications. Since the usefulness of a model for specific application is determined by its accuracy, model quality estimation is an essential component of protein structure prediction. Comparative protein modeling has become a routine approach in many areas of life science research since fully automated modeling systems allow also nonexperts to build reliable models. In this chapter, we describe practical approaches for automated protein structure modeling with SWISS-MODEL Workspace and the Protein Model Portal.
Peitsch (2002) About the use of protein models. Bioinformatics 18:934-8. (pmid: 12117790)
Protein models can be of great assistance in functional genomics, as they provide the structural insights often necessary to understand protein function. Although comparative modelling is far from yielding perfect structures, this is still the most reliable method and the quality of the predictions is now well understood. Models can be classified according to their correctness and accuracy, which will impact their applicability and usefulness in functional genomics and a variety of situations.
Footnotes and references
Ask if things don't work for you!
- If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.
- Do consider how to ask your questions so that a meaningful answer is possible. The following two guides are required reading:
- How to create a Minimal, Complete, and Verifiable example (on Stack Overflow)
- How to make a great R reproducible example