Expected Preparations:
Keywords: Preparing data for phylogenetic analysis
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation: NA: This unit is not evaluated for course marks.
Preparing multiple sequence alignments as input for phylogenetic tree estimation programs.
Task…
You have previously collected homologous sequences and their annotations. We will use these as input for phylogenetic analysis. But let’s discuss first how such an input file should be constructed.
In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first. This is important: phylogenetic analysis does not build alignments, nor does it revise them; it analyses relationships after an alignment has been computed. A precondition for the analysis to be meaningful is that all rows of sequences contain exactly the same number of characters and hold aligned characters in corresponding positions (i.e. columns). The program’s inferences are made on a column-wise basis, and if your columns contain data from unrelated positions, the inferences are going to be questionable. Clearly, for tree estimation to work, one must not include fragments of sequence that have evolved under a different evolutionary model than all the others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that are informative and will best represent the true phylogenetic relationships between the sequences.
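The precondition described above (every row has the same length, and inference proceeds column by column) can be checked before any tree estimation is attempted. The following is a minimal Python sketch, independent of the course's R script; the function names `check_alignment` and `columns` and the toy sequences are hypothetical and chosen only for illustration.

```python
def check_alignment(rows):
    """Return the common row length, or raise ValueError if rows differ in length."""
    lengths = {name: len(seq) for name, seq in rows.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"rows differ in length: {lengths}")
    return next(iter(lengths.values()))


def columns(rows):
    """Return the alignment columns as strings, one character per row."""
    check_alignment(rows)
    return ["".join(col) for col in zip(*rows.values())]


# Toy alignment (made-up sequences, for illustration only)
toy = {
    "seqA": "MK-LV",
    "seqB": "MKALV",
    "seqC": "MR-LI",
}

print(check_alignment(toy))  # number of columns shared by all rows
print(columns(toy))          # the units a phylogeny program actually works on
```

A check like this only confirms that the input is rectangular. Whether the characters in each column are truly homologous is a question about the quality of the alignment, and no length check can answer it.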
The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.
To illustrate the principle we will construct input files from APSES domains. You made the annotations previously and saved them in myDB. The database contains APSES proteins from ten reference species that were chosen to span the phylogenetic tree of all fungi. Thus it should provide a good scaffold for analyzing MYSPE as well.
An outgroup is a sequence that is more distantly related to all of the other sequences than any of them are to each other. This allows us to root the tree, because the root (the last common ancestor of all) must be somewhere on the branch that connects the outgroup to the rest. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. Having a root that we can compare to the phylogram of species makes the tree interpretation much more intuitive. In our case, we face the problem that our species cover all of the known fungi, so we can’t rightly say that any one of them is more distant from the rest. We have to look outside the fungi. The problem is that outside of the fungi there are no proteins with APSES domains. Instead, we can take the E. coli KilA-N domain sequence, a known, distant homologue of the APSES domain, even though it aligns to only a part of the APSES domains.
Here is the KilA-N domain sequence in the E. coli Kil-A protein:
WP_000200358.1 hypothetical protein [Escherichia coli]
MTSFQLSLISREIDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
FKGGRPENQGTWVHPDIAINLAQWLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF
E. coli KilA-N protein. Residues that do not align with APSES domains are shown in grey.
The R code for this assignment contains code to add it to the group of APSES sequences.
Task…
Open the ABC-units R project. If you have loaded it before, choose File ▸ Recent projects ▸ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Type init() if requested.
Open BIN-PHYLO-Data_preparation.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.
As discussed in the notes, it is often necessary to edit a multiple sequence alignment to make it suitable for phylogenetic inference. Here are the principles:
All characters in a column should be related by homology.
This implies the following rules of thumb:
Indels are even more of a problem than usual. Strictly speaking, the similarity score of an alignment program and the distance score of a phylogeny program are not calculated for an ordered sequence, but as a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. In an optimal sequence alignment with gaps, however, this is no longer strictly true, since creating a one-character gap has a different penalty than extending a gap by one character. Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say “tweaked”) to correspond to our observations.
Most phylogeny programs do not work in this way. They operate strictly on columns of characters and treat a gap character just like a residue with the one-letter code “-”. Thus gap insertion and gap extension characters get the same score. For short indels, this underestimates the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long, though, every gapped column is counted as a separate substitution (rather than the whole gap as one long event), and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the “-” character. It is therefore common and acceptable to edit gaps in the alignment and delete all but a few columns of gapped sequence, or to remove such columns altogether.
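To make the last point concrete, here is a small Python sketch that compares two aligned rows column by column. It uses a plain proportion of differing columns (a p-distance) as a stand-in for a phylogeny program's distance score; real programs use more elaborate models, and the function name `p_distance`, its `gap_as_character` switch, and the sequences are hypothetical.

```python
def p_distance(a, b, gap_as_character=True):
    """Fraction of compared columns in which two aligned rows differ."""
    if len(a) != len(b):
        raise ValueError("rows must have the same length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if not gap_as_character and (x == "-" or y == "-"):
            continue  # skip every column that has a gap in either row
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no columns left to compare")
    return differing / compared


# Made-up rows: the residues differ at 4 of 6 positions,
# but both rows share the same 10-column stretch of gaps.
row1 = "MKTAYI" + "-" * 10
row2 = "MRSGWI" + "-" * 10

print(p_distance(row1, row2, gap_as_character=True))   # gaps count as matching "-" characters
print(p_distance(row1, row2, gap_as_character=False))  # gapped columns are ignored
```

When "-" is treated like any other character, the ten shared gap columns count as ten matches and the two rows look far more similar (4 differences in 16 columns) than their residues justify (4 differences in 6 columns).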
There is more to learn about this important step of working with aligned sequences. Here is an overview of the literature on various algorithms and tools that are available. Read at least the abstracts.
Talavera, Gerard and Jose Castresana. (2007). “Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments”. Systematic Biology 56(4):564–77. [PMID: 17654362] [DOI: 10.1080/10635150701472164]
Capella-Gutiérrez, Salvador, José M Silla-Martínez, and Toni Gabaldón. (2009). “trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses”. Bioinformatics (Oxford, England) 25(15):1972–3. [PMID: 19505945] [DOI: 10.1093/bioinformatics/btp348]
Blouin, Christian et al. (2009). “Reproducing the manual annotation of multiple sequence alignments using a SVM classifier”. Bioinformatics (Oxford, England) 25(23):3093–8. [PMID: 19770262] [DOI: 10.1093/bioinformatics/btp552]
Penn, Osnat et al. (2010). “GUIDANCE: a web server for assessing alignment confidence scores”. Nucleic Acids Research 38(Web Server issue):W23–8. [PMID: 20497997] [DOI: 10.1093/nar/gkq443]
Rajan, Vaibhav. (2013). “A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments”. Molecular Biology and Evolution 30(3):689–712. [PMID: 23193120] [DOI: 10.1093/molbev/mss264]
As you saw while inspecting the multiple sequence alignment, there are regions that are poorly suited for phylogenetic analysis due to the large numbers of gaps.
A good approach to editing the alignment is to import your sequences into Jalview and remove uncertain columns by hand.
But for this unit, let’s write code for a simple masking heuristic.
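One simple heuristic of this kind is to drop every column in which more than a chosen fraction of the rows has a gap. The Python sketch below illustrates that idea only; it is not the masking code from the course's R script, and the function name `mask_gappy_columns`, the `max_gap_fraction` parameter, its default of 0.5, and the toy sequences are all assumptions made for illustration.

```python
def mask_gappy_columns(rows, max_gap_fraction=0.5):
    """Keep only columns whose fraction of '-' characters is at most max_gap_fraction."""
    names = list(rows)
    seqs = [rows[name] for name in names]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("all rows must have the same length")
    # Indices of the columns that pass the gap-fraction threshold
    keep = [
        i for i, col in enumerate(zip(*seqs))
        if col.count("-") / len(col) <= max_gap_fraction
    ]
    return {name: "".join(rows[name][i] for i in keep) for name in names}


# Toy alignment (made-up sequences, for illustration only)
toy = {
    "seqA": "MK--LV-",
    "seqB": "MKA-LVG",
    "seqC": "MR--LI-",
    "seqD": "MKS-LV-",
}

masked = mask_gappy_columns(toy, max_gap_fraction=0.5)
for name, seq in masked.items():
    print(name, seq)
```

The threshold is a judgment call: a stricter value removes more uncertain columns but also discards more potentially informative positions, which is the trade-off the papers listed above examine in detail.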
Task…
Head back to the RStudio project and work through the section titled Reviewing and Editing Alignments.
Reference APSES domains (reference species)
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]