Difference between revisions of "BIN-PHYLO-Data preparation"

Latest revision as of 06:48, 26 September 2020

Preparing Data for Phylogenetic Analysis

(Preparing data for phylogenetic analysis)

Abstract:

Preparing multiple sequence alignments as input for phylogenetic tree estimation programs.

Objectives:
This unit will ...

... introduce the concepts of how to prepare input sequences for phylogenetic analysis.

Outcomes:
After working through this unit you ...

... can prepare an alignment for phylogenetic analysis.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Prerequisites:
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

Evolution: Theory of evolution; variation, neutral drift and selection.

This unit builds on material covered in the following prerequisite units:

BIN-PHYLO-Concepts (Concepts of Phylogenetic Analysis)

Preparing input alignments

You have previously collected homologous sequences and their annotations. We will use these as input for phylogenetic analysis. But let's discuss first how such an input file should be constructed.

Principles

In order to use molecular sequences for the construction of phylogenetic trees, you have to build a multiple alignment first. This is important: phylogenetic analysis does not build alignments, nor does it revise them; it analyses relationships after an alignment has been computed. A precondition for the analysis to be meaningful is that all rows of sequences contain exactly the same number of characters and hold aligned characters in corresponding positions (i.e. columns). The program's inferences are made on a column-wise basis, and if your columns contain data from unrelated positions, the inferences are going to be questionable. For tree estimation to work, one must not include fragments of sequence that have evolved under a different evolutionary model than all the others, e.g. after domain fusion, or after accommodating large stretches of indels. Thus it is appropriate to edit the sequences and pare them down to a most characteristic subset of amino acids. The goal is not to be as comprehensive as possible, but to input those columns of aligned residues that are informative and will best represent the true phylogenetic relationships between the sequences.
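The precondition that all rows have the same length can be checked before any tree is estimated. The following is a minimal sketch in base R; the toy alignment and the helper names isRectangular() and toMatrix() are assumptions made for illustration and are not taken from the course script.

```r
# Hypothetical toy alignment: a named character vector, one string per row.
ali <- c(SEQ_A = "MKV-LSTE",
         SEQ_B = "MRVALSTD",
         SEQ_C = "MKV-LS-E")

# TRUE only if every row has the same number of characters.
isRectangular <- function(x) {
  length(unique(nchar(x))) == 1
}

# Split the rows into a character matrix: one row per sequence,
# one column per aligned position.
toMatrix <- function(x) {
  stopifnot(isRectangular(x))
  m <- do.call(rbind, strsplit(x, ""))
  rownames(m) <- names(x)
  m
}

aliMat <- toMatrix(ali)
dim(aliMat)     # number of sequences, number of columns
aliMat[ , 4]    # the characters that are compared in column 4
```

Working with a matrix makes the column-wise view explicit: each column is the unit on which the tree-estimation program makes its inferences.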

The result of the tree construction is a decision about the most likely evolutionary relationships. Fundamentally, tree-construction programs decide which sequences had common ancestors.

Choosing sequences

To illustrate the principle we will construct input files from APSES domains. The annotations have been made by you previously and you have saved them in myDB. The database contains APSES proteins from ten reference species that were chosen to span the phylogenetic tree of all fungi. Thus it should provide a good scaffold for analyzing MYSPE as well.

Adding an Outgroup

An outgroup is a sequence that is more distantly related to all of the other sequences than any of them are to each other. This allows us to root the tree, because the root (the last common ancestor of all sequences) must be somewhere on the branch that connects the outgroup to the rest. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. Having a root that we can compare to the phylogram of species makes the tree interpretation much more intuitive. In our case, we face the problem that our species cover all of the known fungi, so we cannot rightly say that any one of them is more distant from the rest. We have to look outside the fungi. The problem is that outside of the fungi there are no proteins with APSES domains. Instead, we can take the E. coli KilA-N domain sequence, a known, distant homologue of the APSES domain, even though it aligns with only a part of the APSES domains.

Here is the KilA-N domain sequence in the E. coli Kil-A protein:

>WP_000200358.1 hypothetical protein [Escherichia coli]
MTSFQLSLISREIDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQS
FKGGRPENQGTWVHPDIAINLAQWLSPKFAVQVSRWVREWMSGERTTAEMPVHLKRYMVNRSRIPHTHFS
ILNELTFNLVAPLEQAGYTLPEKMVPDISQGRVFSQWLRDNRNVEPKTFPTYDHEYPDGRVYPARLYPNE
YLADFKEHFNNIWLPQYAPKYFADRDKKALALIEKIMLPNLDGNEQF

E. coli KilA-N protein. Residues that do not align with APSES domains are shown in grey.

The R code for this assignment contains code to add it to the group of APSES sequences.
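As a rough illustration of what adding an outgroup involves, here is a sketch in base R that collapses a FASTA record into one string and places it in front of a set of domain sequences. The helper fastaToVector(), the object names, and the two short "APSES" strings are assumptions for illustration only; they are not the sequences stored in myDB, and the script in the project may proceed differently. Only the first 30 residues of the KilA-N record are used to keep the example short.

```r
# Collapse the lines of one FASTA record into a single named string.
fastaToVector <- function(lines) {
  header <- lines[1]
  stopifnot(startsWith(header, ">"))
  id <- sub("^>([^ ]+).*$", "\\1", header)   # first token after ">"
  s <- paste(lines[-1], collapse = "")
  s <- gsub(" ", "", s)                      # drop stray blanks
  setNames(s, id)
}

# Hypothetical, shortened stand-ins for APSES domain sequences.
apses <- c(MBP1_SACCE = "QIYSARYSGVDVYEF",
           MBP1_MYSPE = "QIYKATYSGVPVYEC")

# Header and the first 30 residues of the KilA-N record.
outgroupLines <- c(">WP_000200358.1 hypothetical protein [Escherichia coli]",
                   "MTSFQLSLISREIDG",
                   "EIIHLRAKDGYINAT")
kilA <- fastaToVector(outgroupLines)

# Put the outgroup first, followed by the ingroup sequences.
seqSet <- c(kilA, apses)
names(seqSet)
```

Note that the combined set is still unaligned at this point: the outgroup has to be included when the multiple sequence alignment is computed.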

Task:

Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
Type init() if requested.
Open the file BIN-PHYLO-Data_preparation.R and follow the instructions.

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.

Reviewing and Editing Alignments

As discussed in the notes, it is often necessary to edit a multiple sequence alignment to make it suitable for phylogenetic inference. Here are the principles:

All characters in a column should be related by homology.

This implies the following rules of thumb:

Remove all stretches of residues in which the alignment appears ambiguous (not just highly variable, but ambiguous regarding the aligned positions).
Remove all frayed N- and C- termini, especially regions in which not all sequences that are being compared appear homologous and that may stem from unrelated domains. You want to only retain the APSES domains. All the extra residues from the MYSPE sequence can be deleted.
Remove all gapped regions that appear to be alignment artefacts due to inappropriate input sequences.
Remove all but approximately one column from gapped regions in those cases where the presence of several related insertions suggests that the indel is real, and not just an alignment artefact. (Some researchers simply remove all gapped regions.)
Remove sections N- and C-terminal of gaps where the alignment appears questionable.
If the sequences fit on a single line, you will save yourself potential trouble with block-wise vs. interleaved input. If you do run out of memory, try removing columns of sequence, or remove species that you are less interested in from the alignment.
Move your outgroup sequence to the first line of your alignment, since this is where PHYLIP will look for it by default.
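Two of the rules of thumb above, deleting columns and moving the outgroup to the first row, are simple matrix operations once the alignment is held as a character matrix. The sketch below uses a made-up alignment; the column indices in dropCols stand for decisions you would make by inspecting a real alignment and are purely hypothetical.

```r
# Hypothetical toy alignment with an outgroup in the last row.
ali <- c(MBP1_A = "MKTV--LSAEQ",
         MBP1_B = "MRTVGGLSADK",
         OUTGRP = "-KSV--LTAE-")
m <- do.call(rbind, strsplit(ali, ""))   # rows: sequences, columns: positions

# Columns judged unsuitable by inspection (assumed for this example):
# the frayed termini (1 and 11) and all but one column of the indel (6).
dropCols <- c(1, 6, 11)
m <- m[ , -dropCols, drop = FALSE]

# Move the outgroup to the first row.
first <- "OUTGRP"
m <- m[c(first, setdiff(rownames(m), first)), , drop = FALSE]

# Collapse the rows back into strings.
edited <- apply(m, 1, paste, collapse = "")
edited
```

The argument drop = FALSE keeps the object a matrix even if only one column or one row remains.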

Indels are even more of a problem than usual. Strictly speaking, the similarity score of an alignment program as well as the distance score of a phylogeny program are not calculated for an ordered sequence, but as a sum of independent values, one for each aligned column of characters. The order of the columns does not change the score. However, in an optimal sequence alignment with gaps this is no longer strictly true, since a one-character gap creation has a different penalty score than a one-character gap extension. Most alignment programs use a model with a constant gap insertion penalty and a linear gap extension penalty. This is not rigorously justified from biology, but parametrized (or you could say "tweaked") to correspond to our observations.

However, most phylogeny programs do not work in this way. They strictly operate on columns of characters and treat a gap character just like a residue with the one-letter code "-". Thus gap insertion and gap extension characters get the same score. For short indels, this underestimates the distance between pairs of sequences, since any evolutionary model should reflect the fact that gaps are much less likely than point mutations. If the gap is very long, though, every gapped position is counted individually as one of many single substitutions (rather than as one long event), and this overestimates the distance. And it gets worse: long stretches of gaps can make sequences appear similar in a way that is not justified, just because they are identical in the "-" character. It is therefore common and acceptable to edit gaps in the alignment and delete all but a few columns of gapped sequence, or to remove such columns altogether.
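A small numerical example makes both effects visible. The sketch below uses the simplest possible distance, the fraction of differing columns (p-distance), with "-" treated as an ordinary character. This is a deliberate simplification: real phylogeny programs use substitution models, and the three sequences are invented for illustration.

```r
# Fraction of columns in which two aligned sequences differ,
# with "-" treated like any other character.
pDist <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  mean(x != y)
}

# The same, but ignoring every column in which either sequence has a gap.
pDistNoGaps <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))
  keep <- x != "-" & y != "-"
  mean(x[keep] != y[keep])
}

ref  <- "MKTVAAAAAAAALSAE"   # carries an 8-residue insertion
delA <- "MKTV--------LSAE"   # lacks the insertion, flanks identical to ref
delB <- "MRSI--------LTGD"   # lacks the insertion, flanks diverged

pDist(ref, delA)         # 0.5: one indel event is counted as 8 differences
pDist(delA, delB)        # 0.375: the shared "-" columns count as identities
pDistNoGaps(delA, delB)  # 0.75: distance over the residue columns only
```

The single indel between ref and delA inflates their distance, while the eight shared gap columns make delA and delB look twice as similar as their residues alone would suggest.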

(Possible) steps in editing a multiple sequence alignment towards a PHYLIP input file. a: raw alignment (CLUSTAL format); b: sequences assembled into single lines; c: columns to be deleted highlighted in red (1, 3 and 4: large gaps; 2: uncertain alignment; 5: frayed C-terminus; the latter two would put non-homologous characters into the same column); d: input data for PHYLIP: names for sequences must not be longer than 10 characters, and the first line must contain the number of sequences and the sequence length. PHYLIP is very picky about incorrectly formatted input, so read the PHYLIP sequence format guide. Fortunately, Rphylip does the formatting step for you.
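To make the layout of step d concrete, here is a sketch that writes an alignment as sequential PHYLIP text in base R: a first line with the number of sequences and the alignment length, then one line per sequence with the name padded to exactly ten characters. The function name toPhylip() and the toy alignment are assumptions for illustration; in the course workflow this formatting is left to Rphylip.

```r
# Format a named character vector of aligned sequences as sequential
# PHYLIP lines. Names must be unique and at most 10 characters long.
toPhylip <- function(ali) {
  stopifnot(length(unique(nchar(ali))) == 1)     # all rows the same length
  stopifnot(all(nchar(names(ali)) <= 10))
  stopifnot(!anyDuplicated(names(ali)))
  header <- sprintf("%d %d", length(ali), nchar(ali[[1]]))
  rows <- sprintf("%-10s%s", names(ali), unname(ali))  # left-justify, pad to 10
  c(header, rows)
}

# Hypothetical toy alignment with the outgroup in the first row.
ali <- c(OUTGRP = "KSV-LTAE",
         MBP1_A = "KTV-LSAE",
         MBP1_B = "RTVGLSAD")

cat(toPhylip(ali), sep = "\n")
```

Because each sequence is written on a single line, the question of block-wise versus interleaved input does not arise.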

There is more to learn about this important step of working with aligned sequences. Here is an overview of the literature on various algorithms and tools that are available. Read at least the abstracts.

Talavera & Castresana (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564-77. (pmid: 17654362)

Alignment quality may have as much impact on phylogenetic reconstruction as the phylogenetic methods used. Not only the alignment algorithm, but also the method used to deal with the most problematic alignment regions, may have a critical effect on the final tree. Although some authors remove such problematic regions, either manually or using automatic methods, in order to improve phylogenetic performance, others prefer to keep such regions to avoid losing any information. Our aim in the present work was to examine whether phylogenetic reconstruction improves after alignment cleaning or not. Using simulated protein alignments with gaps, we tested the relative performance in diverse phylogenetic analyses of the whole alignments versus the alignments with problematic regions removed with our previously developed Gblocks program. We also tested the performance of more or less stringent conditions in the selection of blocks. Alignments constructed with different alignment methods (ClustalW, Mafft, and Probcons) were used to estimate phylogenetic trees by maximum likelihood, neighbor joining, and parsimony. We show that, in most alignment conditions, and for alignments that are not too short, removal of blocks leads to better trees. That is, despite losing some information, there is an increase in the actual phylogenetic signal. Overall, the best trees are obtained by maximum-likelihood reconstruction of alignments cleaned by Gblocks. In general, a relaxed selection of blocks is better for short alignment, whereas a stringent selection is more adequate for longer ones. Finally, we show that cleaned alignments produce better topologies although, paradoxically, with lower bootstrap. This indicates that divergent and problematic alignment regions may lead, when present, to apparently better supported although, in fact, more biased topologies.

Capella-Gutiérrez et al. (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25:1972-3. (pmid: 19505945)

SUMMARY: Multiple sequence alignments are central to many areas of bioinformatics. It has been shown that the removal of poorly aligned regions from an alignment increases the quality of subsequent analyses. Such an alignment trimming phase is complicated in large-scale phylogenetic analyses that deal with thousands of alignments. Here, we present trimAl, a tool for automated alignment trimming, which is especially suited for large-scale phylogenetic analyses. trimAl can consider several parameters, alone or in multiple combinations, for selecting the most reliable positions in the alignment. These include the proportion of sequences with a gap, the level of amino acid similarity and, if several alignments for the same set of sequences are provided, the level of consistency across different alignments. Moreover, trimAl can automatically select the parameters to be used in each specific alignment so that the signal-to-noise ratio is optimized. AVAILABILITY: trimAl has been written in C++, it is portable to all platforms. trimAl is freely available for download (http://trimal.cgenomics.org) and can be used online through the Phylemon web server (http://phylemon2.bioinfo.cipf.es/). Supplementary Material is available at http://trimal.cgenomics.org/publications.

Blouin et al. (2009) Reproducing the manual annotation of multiple sequence alignments using a SVM classifier. Bioinformatics 25:3093-8. (pmid: 19770262)

MOTIVATION: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, it is expected that even the best alignment will contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to manually remove them. Although considered necessary in some cases, manual editing is time consuming and not reproducible. We present here an automated editing method based on the classification of 'valid' and 'invalid' sites. RESULTS: A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that it is possible to retrain/extend the training of the classifier by providing examples of multiple sequence alignment (MSA) annotation. Near optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments. AVAILABILITY: This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch job is available at http://fester.cs.dal.ca/manuel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Penn et al. (2010) GUIDANCE: a web server for assessing alignment confidence scores. Nucleic Acids Res 38:W23-8. (pmid: 20497997)

Evaluating the accuracy of multiple sequence alignment (MSA) is critical for virtually every comparative sequence analysis that uses an MSA as input. Here we present the GUIDANCE web-server, a user-friendly, open access tool for the identification of unreliable alignment regions. The web-server accepts as input a set of unaligned sequences. The server aligns the sequences and provides a simple graphic visualization of the confidence score of each column, residue and sequence of an alignment, using a color-coding scheme. The method is generic and the user is allowed to choose the alignment algorithm (ClustalW, MAFFT and PRANK are supported) as well as any type of molecular sequences (nucleotide, protein or codon sequences). The server implements two different algorithms for evaluating confidence scores: (i) the heads-or-tails (HoT) method, which measures alignment uncertainty due to co-optimal solutions; (ii) the GUIDANCE method, which measures the robustness of the alignment to guide-tree uncertainty. The server projects the confidence scores onto the MSA and points to columns and sequences that are unreliably aligned. These can be automatically removed in preparation for downstream analyses. GUIDANCE is freely available for use at http://guidance.tau.ac.il.

Rajan (2013) A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments. Mol Biol Evol 30:689-712. (pmid: 23193120)

Inaccurate inference of positional homologies in multiple sequence alignments and systematic errors introduced by alignment heuristics obfuscate phylogenetic inference. Alignment masking, the elimination of phylogenetically uninformative or misleading sites from an alignment before phylogenetic analysis, is a common practice in phylogenetic analysis. Although masking is often done manually, automated methods are necessary to handle the much larger data sets being prepared today. In this study, we introduce the concept of subsplits and demonstrate their use in extracting phylogenetic signal from alignments. We design a clustering approach for alignment masking where each cluster contains similar columns-similarity being defined on the basis of compatible subsplits; our approach then identifies noisy clusters and eliminates them. Trees inferred from the columns in the retained clusters are found to be topologically closer to the reference trees. We test our method on numerous standard benchmarks (both synthetic and biological data sets) and compare its performance with other methods of alignment masking. We find that our method can eliminate sites more accurately than other methods, particularly on divergent data, and can improve the topologies of the inferred trees in likelihood-based analyses. Software available upon request from the author.

Sequence masking with R

As you saw while inspecting the multiple sequence alignment, there are regions that are poorly suited for phylogenetic analysis due to the large numbers of gaps.

A good approach to edit the alignment is to import your sequences into Jalview and remove uncertain columns by hand.

But for this unit, let's write code for a simple masking heuristic.
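One very simple heuristic of this kind is to discard every column in which the fraction of gap characters exceeds a threshold. The sketch below shows the idea in base R; the function name maskGaps(), the default threshold of 0.5, and the toy alignment are assumptions, and the heuristic in the course script may well differ.

```r
# Remove all columns in which more than maxGapFraction of the
# characters are gaps. Returns the masked rows as named strings.
maskGaps <- function(ali, maxGapFraction = 0.5) {
  m <- do.call(rbind, strsplit(ali, ""))
  gapFraction <- colMeans(m == "-")        # per-column fraction of "-"
  keep <- gapFraction <= maxGapFraction
  masked <- apply(m[ , keep, drop = FALSE], 1, paste, collapse = "")
  setNames(masked, names(ali))
}

# Hypothetical toy alignment.
ali <- c(A = "MK--TVLS-E",
         B = "MKG-TVLSAE",
         C = "MR--SVLT-D",
         D = "MK--TILSAE")

maskGaps(ali)                          # drops columns 3 and 4
maskGaps(ali, maxGapFraction = 0.25)   # also drops column 9
```

A lower threshold is more stringent: it removes more columns and leaves fewer, but more reliably aligned, positions for the tree estimation.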

Task:

Head back to the RStudio project and work through the section titled "Reviewing and Editing Alignments".

Notes

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2020-09-25

Version:

1.1

Version history:

1.1 2020 Maintenance
1.0 First live version.
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.

@@ Line 1: / Line 1: @@
 <div id="ABC">
-<div style="padding:5px; border:1px solid #000000; background-color:#f4d7b7; font-size:300%; font-weight:400; color: #000000; width:100%;">
+<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 Preparing Data for Phylogenetic Analysis
-<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#f4d7b7; font-size:30%; font-weight:200; color: #000000; ">
+<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 (Preparing data for phylogenetic analysis)
 </div>
@@ Line 10: / Line 10: @@
-<div style="padding:5px; border:1px solid #000000; background-color:#f4d7b733; font-size:85%;">
+<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
 <div style="font-size:118%;">
 <b>Abstract:</b><br />
@@ Line 56: / Line 56: @@
-{{REVISE}}
 {{Smallvspace}}
@@ Line 75: / Line 74: @@
 *Read the introductory notes on {{ABC-PDF|BIN-PHYLO-Data_preparation|preparing data for phylogenetic analysis}}.
 }}
@@ Line 99: / Line 96: @@
 {{Vspace}}
-To illustrate the principle we will construct input files from APSES domains<!-- by joining APSES domains and Ankyrin domain sequences (by similarity to the Swi6 fold) -->. The annotations are in <code>myDB</code>. The database contains APSES proteins from ten reference species that were chosen to span the phylogenetic tree of all fungi. Thus it should provide a good scaffold for anlyzing MYSPE too.
+To illustrate the principle we will construct input files from APSES domains<!-- by joining APSES domains and Ankyrin domain sequences (by similarity to the Swi6 fold) -->. The annotations have been made by you previously and you have saved them in <code>myDB</code>. The database contains APSES proteins from ten reference species that were chosen to span the phylogenetic tree of all fungi. Thus it should provide a good scaffold for anlyzing MYSPE as well.
 {{Vspace}}
@@ Line 107: / Line 104: @@
 {{Vspace}}
-An outgroup is a sequence that is more distantly related to all of the other sequences than any of them are to each other. This allows us to root the tree, because the root - the last common ancestor to all - must be somewhere on the branch that connects the outgroup to the rest. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. Having a root that we can compare to the phylogram of species makes the tree interpretation '''much''' more intuitive. In our case, we are facing the problem that our species cover all of the known fungi, thus we can' rightly say that any of them are more distant to the rest. We have to look outside the fungi. The problem is, outside of the fungi there are no proteins with APSES domains<!--, and certainly none that have APSES as well as ankyrin domains in the same gene-->. We can take the ''E. coli'' KilA-N domain sequence - a known, distant homologue to the APSES domain instead, even though it only aligns to a part of the APSES domains<!-- , and we can get an ankyrin region from e.g. a plant. Both outgroup domains then will have the property that they are more distant individually to any of the fungal sequences, even though they don't appear in the same protein -->.
+An outgroup is a sequence that is more distantly related to all of the other sequences than any of them are to each other. This allows us to root the tree, because the root - the last common ancestor to all - must be somewhere on the branch that connects the outgroup to the rest. And whenever a molecular clock is assumed, the branching point that connects the outgroup can be assumed to be the oldest divergence event. Having a root that we can compare to the phylogram of species makes the tree interpretation '''much''' more intuitive. In our case, we are facing the problem that our species cover all of the known fungi, thus we can't rightly say that any of them are more distant to the rest. We have to look outside the fungi. The problem is, outside of the fungi there are no proteins with APSES domains<!--, and certainly none that have APSES as well as ankyrin domains in the same gene-->. We can take the ''E. coli'' KilA-N domain sequence - a known, distant homologue to the APSES domain instead, even though it only aligns to a part of the APSES domains<!-- , and we can get an ankyrin region from e.g. a plant. Both outgroup domains then will have the property that they are more distant individually to any of the fungal sequences, even though they don't appear in the same protein -->.
 Here is the KilA-N domain sequence in the E. coli Kil-A protein:
@@ Line 146: / Line 143: @@
-<source lang="R">
+<pre>
 # Let's add our outgroups to the feature sequence tables:
@@ Line 201: / Line 198: @@
 head(ankSeq)
+</pre>
 -->
@@ Line 278: / Line 276: @@
 {{Vspace}}
-== Self-evaluation ==
-<!--
-=== Question 1===
-Question ...
-<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
-Answer ...
-<div class="mw-collapsible-content">
-Answer ...
-</div>
-  </div>
-  {{Vspace}}
--->
 == Further reading, links and resources ==
@@ Line 313: / Line 294: @@
 :2017-08-05
 <b>Modified:</b><br />
-:2017-10-31
+:2020-09-25
 <b>Version:</b><br />
-:1.0
+:1.1
 <b>Version history:</b><br />
+*1.1 2020 Maintenance
 *1.0 First live version.
 *0.1 First stub
@@ Line 325: / Line 307: @@
 [[Category:ABC-units]]
 {{UNIT}}
-{{REVISE}}
+{{LIVE}}
 </div>
 <!-- [END] -->
