Multiple Sequence Alignment

Contents
Multiple Sequence Alignment
Further Reading
Questions, comments
References

Expected Preparations:

	[BIN-ALI-PSI] BLAST		[FND-STA] Information_theory
	The units listed above are part of this course and contain important preparatory material.

Keywords: Multiple sequence alignment

Objectives:

This unit will …

… introduce the benefits of multiple sequence alignments (MSA), the objective functions they pursue, algorithms and methods, practical considerations, and the analysis of alignments;
… demonstrate Web services that calculate MSAs;
… teach how to compute and analyze MSAs in R.

Outcomes:

After working through this unit you …

… can critically assess available options for producing Multiple Sequence Alignments;
… are familiar with online and R programming tools to produce alignments;
… have aligned the full length sequence of the MYSPE Mbp1 orthologue to a selected set of reference sequences.

Deliverables:

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.

Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don’t overlook these.

Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation:

Evaluation deliverables under revision: While we are rebalancing formative feedback (advice) and summative feedback (grades) for this course, please ignore all instructions for the submission of assignments until this alert is removed.

A carefully produced multiple sequence alignment is an indispensable, extraordinarily valuable asset for the analysis of sequence features. Fully automated methods are regularly inferior to knowledgeable manual curation of alignments. In this unit we will discuss the concepts, practice producing MSAs online and in R, and analyze, write and display alignments. The goal is to empower you to produce the best alignments possible.

Task…

Read the introductory notes on concepts of multiple sequence alignments (PDF).

Multiple sequence alignments (MSAs) are enormously useful to resolve ambiguities in the precise placement of “indels”¹ and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for:
- functional annotation;
- protein homology modelling;
- phylogenetic analyses;
- sensitive homology searches in databases;
- and more.

Multiple Sequence Alignment

In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is not trivial. All interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologues, or paralogues as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:
- orthologues are expected to be functionally and structurally conserved;
- paralogues may have divergent function but have similar structure;
- missing genes may make paralogues look like orthologues; and
- selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.

MSAs on the web at the EBI

The EBI hosts a number of excellent MSA programs on their Website. Let’s perform an MSA of full length MBP1 orthologues:

Task…

Navigate to the NCBI protein database and paste the MBP1 protein RefSeq IDs from our database into the search form:

NP_010227 NP_593032 XP_660758 XP_007682304 XP_955821
XP_001837394 XP_569090 XP_003327086 XP_011392621 XP_006957051
(add your MBP1_MYSPE RefSeq ID too!)

This will give you a page with links to the retrieved sequences. Click on Summary and choose FASTA (text) as the Format to retrieve all sequences at once as a multi-FASTA formatted page (this is useful, remember it!).
Open another browser window and navigate to the EBI MSA tools page.
Click on Launch T-coffee.
Copy the FASTA sequences from the NCBI page, and paste them into the form at the EBI’s T-Coffee page.
Choose Output Format ▹ HTML. Then click Submit.
The result should show you the aligned sequences, with three blocks of high similarity:
- The most N-terminal block is the APSES domain - the main DNA binding domain of these transcription factors.
- In the middle, we have Ankyrin domains: these are protein-protein interaction modules that Mbp1 uses to recruit other proteins to the bound complex.
- At the end, there is one additional, shorter segment of high similarity.
Explore the tabs that are available, in particular note that you can save the result to a file.
Click on the Download Alignment File tab to load the alignment as text into a browser window. Then save the file into your project directory with a filename of msaT.aln. (.aln is the standard extension for CLUSTAL formatted alignment files, so it helps if we give the file that extension. Of course you know better than to rely on an extension to signal the filetype and format.)
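The remark about file extensions can be made concrete by checking the file content itself. The following is a minimal Python sketch, independent of the course's R script: the function name `parse_clustal` and the tiny sample alignment are hypothetical, and the parser assumes that the file begins with a line starting with "CLUSTAL", that sequence names contain no whitespace, and that conservation lines begin with whitespace.

```python
def parse_clustal(text):
    """Return {name: aligned_sequence} from CLUSTAL-formatted text.

    Raises ValueError if the first line does not start with 'CLUSTAL',
    so the format is confirmed from the content, not from the extension.
    """
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL file: header line missing")
    alignment = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (which start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Blocks for the same name are concatenated in order of appearance.
        alignment[name] = alignment.get(name, "") + chunk
    return alignment


# Hypothetical two-block sample, for illustration only.
sample = """CLUSTAL W (1.83) multiple sequence alignment

seqA      MKT-AY
seqB      MKTQAY
          *** **

seqA      IAK
seqB      I-K
          * *
"""

msa = parse_clustal(sample)
for name, row in msa.items():
    print(name, row)
```

For the file saved in this task, one would presumably read it with `open("msaT.aln").read()` and pass the text to the same function; a file that fails the header check deserves a closer look regardless of its name.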

MSAs in R

Let’s move to our RStudio project to explore producing and analyzing multiple sequence alignments in R.

Task…

Open RStudio and load the ABC-units R project. If you have loaded it before, choose File ▹ Recent projects ▹ ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Choose Tools ▹ Version Control ▹ Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included. This ensures that your data and code remain up to date when we update, or fix bugs.
Type init() if requested.
Open the file BIN-ALI-MSA.R and follow the instructions.

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.

Sequence alignment editors

Really excellent software tools have been written that help you visualize and manually curate multiple sequence alignments. If anything, I think they tend to do too much. Past versions of the course have used Jalview, and I have heard good things about AliView; there are others as well.

Here, I am just linking to three alignment editors and encourage you to explore and use them. If you have experience with comparing them, or know of other useful editors, let us know. (There are also many good alignment viewers!)

Jalview is an integrated MSA editor and sequence annotation workbench from the Barton lab in Dundee. Lots of functions. Active development.
AliView by Anders Larsson in Uppsala: fast, lean, looks to be very practical. In recent development.
Strap by Christoph Gille at the Charité in Berlin. Haven’t played with that one yet, but there is also a browser-based version.

Before we all start editing computed alignments, we should spend a moment to consider the kind of improvements manual editing of alignments can aim for.

Alignment Editing

A good MSA comprises only columns of residues that play similar roles in the proteins’ mechanism and/or that evolve in a comparable structural context. Since the alignment reflects the result of biological selection and conservation, it has relatively few indels and the indels it has are usually not placed into elements of secondary structure or into functional motifs. For example, the contiguous features annotated for Mbp1 are expected to be left intact by a good alignment.

A poor MSA has many errors in its columns; these contain residues that actually have different functions or structural roles, even though they may look similar according to a (pairwise!) scoring matrix. A poor MSA also may have introduced indels in biologically irrelevant positions, to maximize spurious sequence similarities. Some of the features annotated for Mbp1 will be disrupted in a poor alignment and residues that are conserved may be placed into different columns.

Often errors or inconsistencies are easy to spot. The main goal of manual editing is to make an alignment biologically more plausible. Most commonly this means minimizing the number of rare evolutionary events that the alignment suggests and/or emphasizing conservation of known functional motifs. Here are some examples:

Reduce number of indels

 From a Probcons alignment:
 
 0447_DEBHA    ILKTE-K-T---K--SVVK      ILKTE----KTK---SVVK
 9978_GIBZE    MLGLN-PGLKEIT--HSIT      MLGLNPGLKEIT---HSIT
 1513_CANAL    ILKTE-K-I---K--NVVK      ILKTE----KIK---NVVK
 6132_SCHPO    ELDDI-I-ESGDY--ENVD      ELDDI-IESGDY---ENVD
 1244_ASPFU    ----N-PGLREIC--HSIT  ->  ----NPGLREIC---HSIT
 0925_USTMA    LVKTC-PALDPHI--TKLK      LVKTCPALDPHI---TKLK
 2599_ASPTE    VLDAN-PGLREIS--HSIT      VLDANPGLREIS---HSIT
 9773_DEBHA    LLESTPKQYHQHI--KRIR      LLESTPKQYHQHI--KRIR
 0918_CANAL    LLESTPKEYQQYI--KRIR      LLESTPKEYQQYI--KRIR

Gaps marked in red were moved. The sequence similarity in the alignment does not change considerably; however, the total number of indels in this excerpt is reduced to 13 from the original 22.
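The indel counts quoted for this excerpt can be checked with a few lines of code. This is a minimal Python sketch, assuming that an "indel" is counted as one run of consecutive gap characters within a row (leading runs included); the function names `count_gap_runs` and `ungapped` are hypothetical. The rows are copied from the Probcons excerpt above, with the sequence labels omitted.

```python
import re

# Rows of the excerpt before manual editing.
before = [
    "ILKTE-K-T---K--SVVK",
    "MLGLN-PGLKEIT--HSIT",
    "ILKTE-K-I---K--NVVK",
    "ELDDI-I-ESGDY--ENVD",
    "----N-PGLREIC--HSIT",
    "LVKTC-PALDPHI--TKLK",
    "VLDAN-PGLREIS--HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]

# The same rows after manual editing.
after = [
    "ILKTE----KTK---SVVK",
    "MLGLNPGLKEIT---HSIT",
    "ILKTE----KIK---NVVK",
    "ELDDI-IESGDY---ENVD",
    "----NPGLREIC---HSIT",
    "LVKTCPALDPHI---TKLK",
    "VLDANPGLREIS---HSIT",
    "LLESTPKQYHQHI--KRIR",
    "LLESTPKEYQQYI--KRIR",
]


def count_gap_runs(rows):
    """Count runs of consecutive '-' characters, summed over all rows."""
    return sum(len(re.findall(r"-+", row)) for row in rows)


def ungapped(rows):
    """Strip gap characters so the underlying sequences can be compared."""
    return [row.replace("-", "") for row in rows]


print(count_gap_runs(before), count_gap_runs(after))
# Editing must only move gaps; the residues themselves stay untouched.
print(ungapped(before) == ungapped(after))
```

Under this counting convention the script reports 22 gap runs before and 13 after editing, and confirms that the ungapped sequences are identical in both versions. The second check is worth running after any manual edit, since accidentally deleting or duplicating a residue is easy to do in an editor.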

Move indels to more plausible position

 From a CLUSTAL alignment:
 
 4966_CANGL     MKHEKVQ------GGYGRFQ---GTW      MKHEKVQ------GGYGRFQ---GTW
 1513_CANAL     KIKNVVK------VGSMNLK---GVW      KIKNVVK------VGSMNLK---GVW
 6132_SCHPO     VDSKHP-----------QID---GVW  ->  VDSKHPQ-----------ID---GVW
 1244_ASPFU     EICHSIT------GGALAAQ---GYW      EICHSIT------GGALAAQ---GYW

The two characters marked in red were swapped. This does not change the number of indels but places the “Q” into a column in which it is more highly conserved (green). Progressive alignments are especially prone to this type of error.
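Whether a moved residue ends up in a better-supported column can be checked by tallying the column it occupies. The sketch below is in Python and makes its own assumptions: the helper names `column_of` and `column_counts` are hypothetical, and "support" is measured simply as the number of rows that carry the same residue in that column. The rows are copied from the CLUSTAL excerpt above.

```python
from collections import Counter

before = {
    "4966_CANGL": "MKHEKVQ------GGYGRFQ---GTW",
    "1513_CANAL": "KIKNVVK------VGSMNLK---GVW",
    "6132_SCHPO": "VDSKHP-----------QID---GVW",
    "1244_ASPFU": "EICHSIT------GGALAAQ---GYW",
}

# Only the 6132_SCHPO row differs after editing.
after = dict(before)
after["6132_SCHPO"] = "VDSKHPQ-----------ID---GVW"


def column_of(row, residue_index):
    """Alignment column of the residue_index-th non-gap character in row."""
    seen = -1
    for col, ch in enumerate(row):
        if ch != "-":
            seen += 1
            if seen == residue_index:
                return col
    raise IndexError("row has fewer residues than requested")


def column_counts(alignment, col):
    """Tally the characters found in one alignment column."""
    return Counter(row[col] for row in alignment.values())


# Position of the "Q" within the ungapped 6132_SCHPO sequence.
q_index = before["6132_SCHPO"].replace("-", "").index("Q")

for label, aln in (("before", before), ("after", after)):
    col = column_of(aln["6132_SCHPO"], q_index)
    print(label, "column", col, dict(column_counts(aln, col)))
```

In this excerpt the "Q" shares its column with no other "Q" before editing and with one after. Such a raw count is only a rough indicator, since it ignores similarity between different amino acids.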

Conserve motifs

 From a CLUSTAL W alignment:
 
 6166_SCHPO      --DKRVA---GLWVPP      --DKRVA--G-LWVPP
 XBP1_SACCE      GGYIKIQ---GTWLPM      GGYIKIQ--G-TWLPM
 6355_ASPTE      --DEIAG---NVWISP  ->  ---DEIA--GNVWISP
 5262_KLULA      GGYIKIQ---GTWLPY      GGYIKIQ--G-TWLPY

The first of the two residues marked in red is a conserved, solvent exposed hydrophobic residue that may mediate domain interactions. The second residue is the conserved glycine in a beta turn that cannot be mutated without structural disruption. Changing the position of a gap and insertion in one sequence improves the conservation of both motifs.
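The effect of such an edit on motif conservation can be made visible by listing the columns that are identical in every sequence. This is a small Python sketch under a deliberately strict assumption: a column counts as conserved only if all rows carry the same residue and none carries a gap. The function name `invariant_columns` is hypothetical; the rows are copied from the CLUSTAL W excerpt above.

```python
before = [
    "--DKRVA---GLWVPP",  # 6166_SCHPO
    "GGYIKIQ---GTWLPM",  # XBP1_SACCE
    "--DEIAG---NVWISP",  # 6355_ASPTE
    "GGYIKIQ---GTWLPY",  # 5262_KLULA
]

after = [
    "--DKRVA--G-LWVPP",  # 6166_SCHPO
    "GGYIKIQ--G-TWLPM",  # XBP1_SACCE
    "---DEIA--GNVWISP",  # 6355_ASPTE
    "GGYIKIQ--G-TWLPY",  # 5262_KLULA
]


def invariant_columns(rows):
    """Return (column, residue) pairs where all rows share one residue."""
    found = []
    for col, column in enumerate(zip(*rows)):
        residues = set(column)
        # Exclude all-gap columns and any column that mixes residues.
        if len(residues) == 1 and "-" not in residues:
            found.append((col, column[0]))
    return found


print("before:", invariant_columns(before))
print("after: ", invariant_columns(after))
```

With these rows, the only invariant column before editing is the tryptophan; after editing, the glycine column becomes invariant as well. Strict identity is a crude criterion, and a real analysis would more likely use a substitution-matrix or entropy-based conservation score.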

An example of alignment editing for ankyrin domains. The example below came from alignment editing in Jalview. Columns were coloured by hydrophobicity, and the examples were exported to HTML and then pasted into the page source. Note that the bottom row of the alignment contains a manually added sequence that represents secondary structure elements that were determined by X-ray crystallography of the Swi6 ankyrin domain.

[Figure: Jalview alignment, ruler marked at columns 10 to 40, of 27 Mbp1-family ankyrin domain segments (MBP1_USTMA/341-368 through MBP1_SACCE/496-525) together with CD00204/1-19, CD00204/99-118, 1SW6/203-232 and the SecStruc/203-232 row.]

Aligned sequences before editing. The algorithm has placed gaps into the Swi6 helix LKWIIAN and the four-residue gaps before the block of well aligned sequence on the right are poorly supported.

[Figure: the same Jalview alignment after manual editing, ruler marked at columns 10 to 40, with the same 27 Mbp1-family segments, CD00204/1-19, CD00204/99-118, 1SW6/203-232 and the SecStruc/203-232 row.]

Aligned sequence after editing. A significant cleanup of the frayed region is possible. Now there is only one insertion event, and it is placed into the loop that connects two helices of the 1SW6 structure.

Further Reading

The increasing importance of Next Generation Sequencing (NGS) techniques has highlighted the key role of multiple sequence alignment (MSA) in comparative structure and function analysis of biological sequences. MSA often leads to fundamental biological insight into sequence-structure-function relationships of nucleotide or protein sequence families. Significant advances have been achieved in this field, and many useful tools have been developed for constructing alignments, although many biological and methodological issues are still open. This chapter first provides some background information and considerations associated with MSA techniques, concentrating on the alignment of protein sequences. Then, a practical overview of currently available methods and a description of their specific advantages and limitations are given, to serve as a helpful guide or starting point for researchers who aim to construct a reliable MSA.

Benítez-Páez, Alfonso, Sonia Cárdenas-Brito, and Andrés J Gutiérrez. (2012). “A practical guide for the computational selection of residues to be experimentally characterized in protein families”. Briefings in Bioinformatics 13(3):329–36.
[PMID: 21930656] [DOI: 10.1093/bib/bbr052]

Abstract …

In recent years, numerous biocomputational tools have been designed to extract functional and evolutionary information from multiple sequence alignments (MSAs) of proteins and genes. Most biologists working actively on the characterization of proteins from a single or family perspective use the MSA analysis to retrieve valuable information about amino acid conservation and the functional role of residues in query protein(s). In MSAs, adjustment of alignment parameters is a key point to improve the quality of MSA output. However, this issue is frequently underestimated and/or misunderstood by scientists and there is no in-depth knowledge available in this field. This brief review focuses on biocomputational approaches complementary to MSA to help distinguish functional residues in protein families. These additional analyses involve issues ranging from phylogenetic to statistical, which address the detection of amino acids pivotal for protein function at any level. In recent years, a large number of tools has been designed for this very purpose. Using some of these relevant, useful tools, we have designed a practical pipeline to perform in silico studies with a view to improving the characterization of family proteins and their functional residues. This review-guide aims to present biologists a set of specially designed tools to study proteins. These tools are user-friendly as they use web servers or easy-to-handle applications. Such criteria are essential for this review as most of the biologists (experimentalists) working in this field are unfamiliar with these biocomputational analysis approaches.

Pais, Fabiano S et al. (2014). “Assessing the efficiency of multiple sequence alignment programs”. Algorithms for Molecular Biology 9(1):4.
[PMID: 24602402] [DOI: 10.1186/1748-7188-9-4]

Abstract …

BACKGROUND: Multiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program’s algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution.

Sievers, Fabian and Desmond G Higgins. (2018). “Clustal Omega for making accurate alignments of many protein sequences”. Protein Science 27(1):135–145.
[PMID: 28884485] [DOI: 10.1002/pro.3290]

Abstract …

Clustal Omega is a widely used package for carrying out multiple sequence alignment. Here, we describe some recent additions to the package and benchmark some alternative ways of making alignments. These benchmarks are based on protein structure comparisons or predictions and include a recently described method based on secondary structure prediction. In general, Clustal Omega is fast enough to make very large alignments and the accuracy of protein alignments is high when compared to alternative packages. The package is freely available as executables or source code from www.clustal.org or can be run on-line from a variety of sites, especially the EBI www.ebi.ac.uk.

Iantorno, Stefano et al. (2014). “Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment”. Methods in Molecular Biology (Clifton, N.J.) 1079:59–73.
[PMID: 24170395] [DOI: 10.1007/978-1-62703-646-7_4]

Abstract …

Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies-based on simulation, consistency, protein structure, and phylogeny-and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should base their choice of benchmark depending on the context of application-with a keen awareness of the assumptions underlying each benchmarking strategy.

Notredame, Cédric. (2007). “Recent evolutions of multiple sequence alignment algorithms”. PLoS Computational Biology 3(8):e123.
[PMID: 17784778] [DOI: 10.1371/journal.pcbi.0030123]

Questions, comments

If in doubt, ask! If anything about this content is not clear to you, do not proceed but ask for clarification. If you have ideas about how to make this material better, let’s hear them. We are aiming to compile a list of FAQs for all learning units, and your contributions will count towards your participation marks.

Improve this page! If you have questions or comments, please post them on the Quercus Discussion board with a subject line that includes the name of the unit.

References


¹ “indel”: insertion / deletion – a difference in sequence length between two aligned sequences that is accommodated by gaps in the alignment. Since we can’t tell from the comparison of two sequences whether such a change was introduced by insertion into or deletion from the ancestral sequence, we join both into a portmanteau.

Multiple Sequence Alignment

Boris Steipe
