Expected Preparations:
Keywords: Multiple sequence alignment
Objectives:
This unit will …
Outcomes:
After working through this unit you …
Deliverables:
* Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
* Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.
* Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation:
A carefully produced multiple sequence alignment (MSA) is an indispensable, extraordinarily valuable asset for the analysis of sequence features. Fully automated methods are regularly inferior to knowledgeable manual curation of alignments. In this unit we will discuss the concepts, practice producing MSAs online and in R, and analyze, write and display alignments. The goal is to empower you to produce the best alignments possible.
Task…
Multiple sequence alignments (MSAs) are enormously useful to resolve ambiguities in the precise placement of “indels”[1] and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for
* functional annotation;
* protein homology modelling;
* phylogenetic analyses;
* sensitive homology searches in databases;
* and more.
In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. Choosing them is not trivial: all interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologues, or paralogues as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:
* orthologues are expected to be functionally and structurally conserved;
* paralogues may have divergent function but have similar structure;
* missing genes may make paralogues look like orthologues; and
* selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.
The EBI hosts a number of excellent MSA programs on its website. Let’s perform an MSA of full-length MBP1 orthologues:
Task…
NP_010227 NP_593032 XP_660758 XP_007682304 XP_955821
XP_001837394 XP_569090 XP_003327086 XP_011392621 XP_006957051
(add your MBP1_MYSPE RefSeq ID too!)
msaT.aln
.
(.aln is the standard extension for CLUSTAL-formatted alignment files, so it helps if we give the file that extension. Of course, you know better than to rely on an extension to signal the file type and format.)
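Rather than trusting the extension, one can look at the file contents: CLUSTAL-formatted files start with a header line beginning with the word CLUSTAL. The following is a minimal Python sketch of that check plus a simple parser (the course's own scripts are in R; this is only an illustration). The function name `read_clustal` and the two example sequences are made up, and the parser assumes the common layout of an identifier, whitespace, and a sequence chunk on each line.

```python
def read_clustal(text):
    """Parse CLUSTAL-formatted text into a dict of {sequence id: aligned sequence}."""
    lines = text.splitlines()
    # The format announces itself in the first line, so test that, not the file name.
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("first line does not start with 'CLUSTAL'")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (which begin with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        # fields[0] is the id, fields[1] the sequence chunk; an optional
        # trailing residue count (fields[2]) is ignored.
        alignment[fields[0]] = alignment.get(fields[0], "") + fields[1]
    return alignment


# Made-up two-sequence example, wrapped into two blocks the way CLUSTAL output is.
example = (
    "CLUSTAL O(1.2.4) multiple sequence alignment\n"
    "\n"
    "seqA      MKHEKVQ---\n"
    "seqB      KIKNVVKGGS\n"
    "               *\n"
    "\n"
    "seqA      GGYG\n"
    "seqB      VGSM\n"
)

for seq_id, seq in read_clustal(example).items():
    print(seq_id, seq)
```

Text that lacks the header, for example a FASTA file that was saved with an .aln extension, raises a `ValueError` instead of being silently misread.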
Let’s move to our RStudio project to explore producing and analyzing multiple sequence alignments in R.
Task…
* Open the ABC-units R project. If you have loaded it before, choose File ▹ Recent projects ▹ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
* Type init() if requested.
* Open the file BIN-ALI-MSA.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.
Really excellent software tools have been written that help you visualize and manually curate multiple sequence alignments. If anything, I think they tend to do too much. Past versions of the course have used Jalview, and I have heard good things about AliView; there are others as well.
Here, I am just linking to three alignment editors and encourage you to explore and use them. If you have experience with comparing them, or know of other useful editors, let us know. (There are also many good alignment viewers!)
Before we all start editing computed alignments, we should spend a moment to consider the kind of improvements manual editing of alignments can aim for.
A good MSA comprises only columns of residues that play similar roles in the proteins’ mechanism and/or that evolve in a comparable structural context. Since the alignment reflects the result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. For example, the contiguous features annotated for Mbp1 are expected to be left intact by a good alignment.
A poor MSA has many errors in its columns; these contain residues that actually have different functions or structural roles, even though they may look similar according to a (pairwise!) scoring matrix. A poor MSA also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted in a poor alignment and residues that are conserved may be placed into different columns.
Often errors or inconsistencies are easy to spot. The main goal of manual editing is to make an alignment biologically more plausible. Most commonly this means minimizing the number of rare evolutionary events that the alignment suggests and/or emphasizing conservation of known functional motifs. Here are some examples:
Reduce number of indels
From a ProbCons alignment:
0447_DEBHA  ILKTE-K-T---K--SVVK      ILKTE----KTK---SVVK
9978_GIBZE  MLGLN-PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
1513_CANAL  ILKTE-K-I---K--NVVK      ILKTE----KIK---NVVK
6132_SCHPO  ELDDI-I-ESGDY--ENVD      ELDDI-IESGDY---ENVD
1244_ASPFU  ----N-PGLREIC--HSIT  ->  ----NPGLREIC---HSIT
0925_USTMA  LVKTC-PALDPHI--TKLK      LVKTCPALDPHI---TKLK
2599_ASPTE  VLDAN-PGLREIS--HSIT      VLDANPGLREIS---HSIT
9773_DEBHA  LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
0918_CANAL  LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR
Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably; however, the total number of indels in this excerpt is reduced from the original 22 to 13.
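The indel counts quoted above can be reproduced by treating every run of consecutive gap characters within a row as one indel. The following Python sketch applies that assumed definition to the excerpt (the course's own scripts are in R; `count_indels` is a name made up for illustration). It also confirms that the edit only moved gaps and left the residues of each sequence untouched.

```python
import re

def count_indels(rows):
    """Count runs of consecutive gap characters; each run is taken as one indel."""
    return sum(len(re.findall(r"-+", row)) for row in rows)

before = [
    "ILKTE-K-T---K--SVVK", "MLGLN-PGLKEIT--HSIT", "ILKTE-K-I---K--NVVK",
    "ELDDI-I-ESGDY--ENVD", "----N-PGLREIC--HSIT", "LVKTC-PALDPHI--TKLK",
    "VLDAN-PGLREIS--HSIT", "LLESTPKQYHQHI--KRIR", "LLESTPKEYQQYI--KRIR",
]
after = [
    "ILKTE----KTK---SVVK", "MLGLNPGLKEIT---HSIT", "ILKTE----KIK---NVVK",
    "ELDDI-IESGDY---ENVD", "----NPGLREIC---HSIT", "LVKTCPALDPHI---TKLK",
    "VLDANPGLREIS---HSIT", "LLESTPKQYHQHI--KRIR", "LLESTPKEYQQYI--KRIR",
]

# Editing must never change a sequence: stripping the gaps has to give
# the same residues before and after.
residues_unchanged = all(
    b.replace("-", "") == a.replace("-", "") for b, a in zip(before, after)
)

print(count_indels(before), count_indels(after))  # 22 13
print(residues_unchanged)                         # True
```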
Move indels to more plausible position
From a CLUSTAL alignment:
4966_CANGL  MKHEKVQ------GGYGRFQ---GTW      MKHEKVQ------GGYGRFQ---GTW
1513_CANAL  KIKNVVK------VGSMNLK---GVW      KIKNVVK------VGSMNLK---GVW
6132_SCHPO  VDSKHP-----------QID---GVW  ->  VDSKHPQ-----------ID---GVW
1244_ASPFU  EICHSIT------GGALAAQ---GYW      EICHSIT------GGALAAQ---GYW
The two characters marked in red were swapped. This does not change the number of indels but places the “Q” into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.
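One way to check the claim about the moved “Q” is to count how often it occurs in its column before and after the edit. This is a minimal Python sketch, assuming the four-sequence excerpt above is the whole alignment; `column` is a hypothetical helper, and the course's own scripts are in R.

```python
def column(rows, index):
    """Return the residues of one alignment column as a string."""
    return "".join(row[index] for row in rows)

before = [
    "MKHEKVQ------GGYGRFQ---GTW",
    "KIKNVVK------VGSMNLK---GVW",
    "VDSKHP-----------QID---GVW",  # 6132_SCHPO
    "EICHSIT------GGALAAQ---GYW",
]
after = list(before)
after[2] = "VDSKHPQ-----------ID---GVW"  # the "Q" swapped with the gap run

# Locate the "Q" of 6132_SCHPO (row 2) in each version of the alignment.
col_before = before[2].index("Q")
col_after = after[2].index("Q")

print(col_before, column(before, col_before))  # 17 RNQA
print(col_after, column(after, col_after))     # 6 QKQT
```

In the original placement the “Q” is the only one in its column; after the edit it shares its column with the “Q” of 4966_CANGL.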
Conserve motifs
From a CLUSTAL W alignment:
6166_SCHPO  --DKRVA---GLWVPP      --DKRVA--G-LWVPP
XBP1_SACCE  GGYIKIQ---GTWLPM      GGYIKIQ--G-TWLPM
6355_ASPTE  --DEIAG---NVWISP  ->  ---DEIA--GNVWISP
5262_KLULA  GGYIKIQ---GTWLPY      GGYIKIQ--G-TWLPY
The first of the two residues marked in red is a conserved, solvent-exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and an insertion in one sequence improves the conservation of both motifs.
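The effect of this edit on the glycine can be made visible by listing the columns in which every sequence carries the same residue. The Python sketch below does this for the four-sequence excerpt only; `invariant_columns` is a name made up for illustration, and columns that contain a gap are deliberately not counted as conserved.

```python
def invariant_columns(rows, gap="-"):
    """Return 0-based indices of columns holding one identical, non-gap residue."""
    hits = []
    for index, residues in enumerate(zip(*rows)):
        if residues[0] != gap and len(set(residues)) == 1:
            hits.append(index)
    return hits

before = [
    "--DKRVA---GLWVPP",
    "GGYIKIQ---GTWLPM",
    "--DEIAG---NVWISP",
    "GGYIKIQ---GTWLPY",
]
after = [
    "--DKRVA--G-LWVPP",
    "GGYIKIQ--G-TWLPM",
    "---DEIA--GNVWISP",
    "GGYIKIQ--G-TWLPY",
]

print(invariant_columns(before))  # [12]
print(invariant_columns(after))   # [9, 12]
```

Before the edit only the tryptophan column is invariant; afterwards the glycine column (index 9) is invariant as well.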
An example of alignment editing for ankyrin domains. The example below came from alignment editing in Jalview. Columns were coloured by hydrophobicity, and the examples were exported to HTML and then pasted into the page source. Note that the bottom row of the alignment contains a manually added sequence that represents secondary structure elements that were determined by X-ray crystallography of the Swi6 ankyrin domain.
Aligned sequences before editing. The algorithm has placed gaps into the Swi6 helix LKWIIAN, and the four-residue gaps before the block of well-aligned sequence on the right are poorly supported.
Aligned sequence after editing. A significant cleanup of the frayed region is possible. Now there is only one insertion event, and it is placed into the loop that connects two helices of the 1SW6 structure.
This is a good, current recapitulation of many of the concepts you have encountered in this unit. It is compact to read, and I highly recommend this paper to reinforce what you have just learned.
Bawono, Punto et al. (2017). “Multiple Sequence Alignment”. Methods in Molecular Biology (Clifton, N.J.) 1525:167–189. [PMID: 27896722] [DOI: 10.1007/978-1-4939-6622-6_8]
Benítez-Páez, Alfonso, Sonia Cárdenas-Brito, and Andrés J Gutiérrez. (2012). “A practical guide for the computational selection of residues to be experimentally characterized in protein families”. Briefings in Bioinformatics 13(3):329–36. [PMID: 21930656] [DOI: 10.1093/bib/bbr052]
Pais, Fabiano S et al. (2014). “Assessing the efficiency of multiple sequence alignment programs”. Algorithms for Molecular Biology: AMB 9(1):4. [PMID: 24602402] [DOI: 10.1186/1748-7188-9-4]
Sievers, Fabian and Desmond G Higgins. (2018). “Clustal Omega for making accurate alignments of many protein sequences”. Protein Science: A Publication of The Protein Society 27(1):135–145. [PMID: 28884485] [DOI: 10.1002/pro.3290]
Iantorno, Stefano et al. (2014). “Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment”. Methods in Molecular Biology (Clifton, N.J.) 1079:59–73. [PMID: 24170395] [DOI: 10.1007/978-1-62703-646-7_4]
Notredame, Cédric. (2007). “Recent evolutions of multiple sequence alignment algorithms”. PLoS Computational Biology 3(8):e123. [PMID: 17784778] [DOI: 10.1371/journal.pcbi.0030123]
If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.
Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.
[END]
[1] “indel”: insertion / deletion – a difference in sequence length between two aligned sequences that is accommodated by gaps in the alignment. Since we can’t tell from the comparison of two sequences whether such a change was introduced by insertion into or deletion from the ancestral sequence, we join both into a portmanteau.