Difference between revisions of "Multiple Sequence Alignment"

 
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
<!-- <div style="color: #000000; background-color:#FF4560; font-size:100%; text-decoration:none; border:solid 2px #000000; padding: 5px;">
+
<div id="APB">
'''Update Warning!'''
+
<div class="b1">
This page has not been revised yet for the 2009 Fall term.
+
Multiple Sequence Alignment
Some of the slides will probably be reused, but please consider the page as a whole out of date
+
</div>
as long as this warning appears here. Also, the lectures may be taught in a different order than stated on the Main page.
+
 
</div> -->
+
 
&nbsp;<br>
+
{{dev}}
&nbsp;<br>
+
 
<div class="toclimit-3">__TOC__</div>
+
 
&nbsp;<br>
+
MSA: Multiple sequence alignments
&nbsp;<br><br>
+
 
&nbsp;<br>
+
 
<div style="color: #FFFFFF; background-color:#457DB5; font-size:150%; text-decoration:none; border:solid 4px #999999; padding:10px;">
+
__TOC__
Multiple Sequence Alignment (MSA)
+
 
&nbsp;<br>
+
 
</div>&nbsp;<br>
+
 
<div style="color: #000000; background-color:#A6AFD0; font-size:100%; text-decoration:none; border:solid 2px #999999; padding: 5px;">
+
&nbsp;
==Objectives==
+
==Introductory reading==
</div><br>
+
<section begin=reading />
* Understand that MSA is an unsolved, difficult problem with different "best" solutions for different purposes.<br>
+
Caution: 2005 article.
* Be familiar with different biological heuristics that distinguish a "good" alignment from a "poor" alignment.<br>
+
{{#pmid: 15963889}}
* Understand the importance of benchmarks for assessing the performance of computational tools.<br>
+
<section end=reading />
* Be aware of how different biological priorities have resulted in different algorithmic strategies and know some of the available software tools that represent them.<br>
+
 
* Be aware that the most frequently used and referenced tool - CLUSTAL - is no longer state-of-the-art and know which modern tools are much better.<br>
+
 
* Confidently be able to survey recent developments and choose an appropriate algorithm.<br>
+
&nbsp;
* Be able to perform and interpret MSAs in practice, know how to prepare input, which formats to use and what common output formats look like.<br>
+
==Contents==
* Understand strategies to prepare input and improve alignments, based on the requirement of columnwise homology.<br>
+
{{PDFlink|[http://biochemistry.utoronto.ca/undergraduates/courses/BCH441H/resources/13-Homology_IV_MSA.pdf Multiple Sequence Alignment]}} - Lecture by Boris Steipe. BCH441 - 2011<br />
* Know about strategies and tools for manual editing of alignments.<br>
+
{{PDFlink|[[Media:N-Nursimulu_BCB410_2011_MSA-Presentation.pdf|MSA]]}} - Presentation by Nirvana Nursimulu, BCB410 - 2011
&nbsp;<br>&nbsp;<br>
 
<div style="color: #000000; background-color:#A6AFD0; font-size:100%; text-decoration:none; border:solid 2px #999999; padding: 5px;">
 
==Links==
 
</div><br>
 
*[http://en.wikipedia.org/wiki/Multiple_sequence_alignment '''Wikipedia''' page on Multiple Sequence Alignment]<br>
*[http://www.ebi.ac.uk/clustalw/ Clustal alignments online]<br>
 
*[http://cbcsrv.watson.ibm.com/Tmsa.html MUSCA]<br>
 
*[http://meme.sdsc.edu/ MEME algorithm for motif discovery]<br>
 
*[http://hmmer.janelia.org/ HMMER, a profile hidden Markov model tool]<br>
 
*[http://pfam.sanger.ac.uk/Pfam Pfam domain database]<br>
 
*[http://www.ebi.ac.uk/t-coffee/ EBI '''TCoffee''' Web server]<br>
 
*[http://www.tcoffee.org/ '''TCoffee Web page''']<br>
 
*[http://www.ebi.ac.uk/muscle/ EBI '''MUSCLE Web server''']<br>
 
*[http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py &nbsp;Berkeley '''MUSCLE server''']<br>
 
*[http://www.drive5.com/muscle/ Muscle homepage]<br>
 
*[http://sparks.informatics.iupui.edu/Softwares-Services_files/spem.htm Indiana '''SPEM server''']<br>
 
*[http://prodata.swmed.edu/promals/ Dallas '''PROMALS Web server''']<br>
 
*[http://www.jalview.org/ '''Jalview''' home page]<br>
 
*[http://pfaat.sourceforge.net/ The '''PFAAT''' homepage]<br>
 
*[http://aig.cs.man.ac.uk/research/utopia/cinema/cinema.php The '''CINEMA''' homepage]<br>
 
*[http://www.ch.embnet.org/software/BOX_form.html Embnet BOXSHADE server]<br>
 
*[http://bioweb.pasteur.fr/seqanal/interfaces/boxshade.html Pasteur institute BOXSHADE server]<br>
 
  
<br>
+
&nbsp;
&nbsp;<br>&nbsp;<br>
 
<div style="color: #000000; background-color:#A6AFD0; font-size:100%; text-decoration:none; border:solid 2px #999999; padding: 5px;">
 
 
==Exercises==
 
==Exercises==
</div><br>
+
<section begin=exercises />
[...]
+
{{PDFlink|[[Media:N-Nursimulu_BCB410_2011_MSA-Exercises.pdf|Exercises]]}} - by Nirvana Nursimulu, BCB410 - 2011
<br>
+
<section end=exercises />
&nbsp;<br>&nbsp;<br>
+
 
<div style="color: #000000; background-color:#A6AFD0; font-size:100%; text-decoration:none; border:solid 2px #999999; padding: 5px;">
+
 
==Slides==
+
<!--
</div><br>
+
&nbsp;
<br>
+
==Notes==
&nbsp;<br>&nbsp;<br>
+
<references />
<div style="color: #000000; background-color:#BDC3DC; font-size:100%; text-decoration:none; border:solid 2px #999999; padding: 5px;">
+
 
===Uses and problems===
+
 
</div><br>
+
-->
<br>
+
&nbsp;
&nbsp;<br><div style="padding: 5px;">
+
 
=====Slide 0008=====
+
==Further reading and resources==
</div>
+
<div class="reference-box">Methods in Molecular Biology <br />
[[Image:Multiple Sequence Alignment_slide0008.jpg|frame|none|Multiple Sequence Alignment, slide 0008<br>
+
[http://link.springer.com/book/10.1007/978-1-62703-646-7 '''Multiple Sequence Alignment Methods''']<br />
MSA show conservation patterns.
+
Springer (2014)</div>
]]
+
{{#pmid: 24222208}}
&nbsp;<br>
+
{{#pmid: 24170400}}
Multiple sequence alignments don't only match residues. They also give information on how strongly a residue is conserved, what it can be replaced with, which species share particular sequence patterns, and where in the sequence indels can be tolerated. An analysis of conservation even allows us to distinguish between structurally and functionally conserved residues! This makes multiple sequence alignments the method of choice for many applications.<br>
+
{{#pmid: 21979275}}
*Multiple sequence alignments are more accurate than pairwise alignments, thus they are the method of choice for starting '''homology modeling''' projects.<br>
+
{{#pmid: 21465564}}
*Combined information from numerous sequences is invaluable for '''secondary structure prediction''' and '''sensitive sequence database searches'''.<br>
+
{{#pmid: 15318951}}
*They contain the information needed for inferences about '''evolutionary relationships''', i.e. the order in which particular sequence changes occurred.
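
To make "how strongly a residue is conserved" concrete, here is a minimal sketch that scores each alignment column by its Shannon entropy (0 bits means perfectly conserved). The function names and the toy alignment are made up for illustration, and entropy is only one of several possible conservation measures.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of the residues in one column, ignoring gaps."""
    residues = [c for c in column if c not in "-."]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def conservation_profile(aligned):
    """Entropy for each column of a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned)]

# Hypothetical toy alignment: columns 1 and 4 are invariant.
toy_alignment = ["MKVL", "MKIL", "MRVL", "MK-L"]
profile = conservation_profile(toy_alignment)
```

Low-entropy columns are the conserved ones; whether conservation reflects structure or function still has to be judged from additional information.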
+
{{#pmid: 18372315}}
&nbsp;<br><div style="padding: 5px;">
+
{{#pmid: 19648142}}
=====Slide 0009=====
+
{{#pmid: 22536955}}
</div>
+
<!-- {{#pmid: 19645596}} -->
[[Image:Multiple Sequence Alignment_slide0009.jpg|frame|none|Multiple Sequence Alignment, slide 0009<br>
+
<!-- {{WWW|WWW_UniProt}} -->
<!-- caption -->
+
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
]]
+
 
&nbsp;<br>
+
 
Multiple alignments cannot necessarily be '''constructed''' from pairwise alignments. Moreover, it may be impossible to merge three mutually pairwise alignments into a non-contradicting multiple alignment. However the inverse is always possible: a multiple alignment can be '''decomposed''' into pairwise alignments.
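
The decomposition direction can be sketched in a few lines: pick two rows of the multiple alignment and drop every column in which both rows carry a gap. The function name and the toy sequences below are assumptions for illustration only.

```python
def project_pair(msa, name_a, name_b):
    """Pairwise alignment induced by an MSA: keep two rows, drop gap-only columns."""
    cols = [(a, b) for a, b in zip(msa[name_a], msa[name_b])
            if not (a == "-" and b == "-")]
    row_a = "".join(a for a, _ in cols)
    row_b = "".join(b for _, b in cols)
    return row_a, row_b

# Hypothetical toy alignment of three short sequences.
toy_msa = {
    "seq1": "MK-VL-",
    "seq2": "MR-VLA",
    "seq3": "MKGV-A",
}
pair = project_pair(toy_msa, "seq1", "seq2")
```

Going the other way is the hard part: three independently computed pairwise alignments need not agree on a common set of columns.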
+
&nbsp;
&nbsp;<br><div style="padding: 5px;">
+
[[Category:Applied_Bioinformatics]]
=====Slide 0010=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0010.jpg|frame|none|Multiple Sequence Alignment, slide 0010<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Besides the fact that computing an optimal multiple alignment is intractable, it is questionable how meaningful the objective function of optimal sequence alignments is for multiple alignments. This objective function maximizes the pair score, which is derived from a mutation data matrix, for pairs of aligned residues. But, for example, the pair score does not optimize the pattern of indel placements, or whether a particular motif is well-conserved.
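
As a rough illustration of such a pair score, the sketch below sums a score over all pairs of residues in every column. The match, mismatch and gap values are arbitrary placeholders standing in for a real mutation data matrix and gap model; they are assumptions, not values from any published method.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy substitution score; a real implementation would look up a mutation data matrix."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sum_of_pairs(aligned):
    """Sum the pair score over all pairs of rows in every column."""
    return sum(pair_score(a, b)
               for col in zip(*aligned)
               for a, b in combinations(col, 2))

score = sum_of_pairs(["MKV", "MRV", "M-V"])
```

Note that nothing in this score knows about motifs or about whether gaps cluster in plausible places, which is exactly the limitation described above.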
 
&nbsp;<br>&nbsp;<br>
 
<div style="color: #000000; background-color:#BDC3DC; font-size:100%; text-decoration:none; border:solid 2px #999999; padding: 5px;">
 
===Good, poor, right, wrong - the objective functions===
 
</div><br>
 
<br>
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0012=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0012.jpg|frame|none|Multiple Sequence Alignment, slide 0012<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
If we want an algorithm to optimize anything at all, we first must define how we can measure the quality of the result. This metric defines the '''target function''' or '''objective function'''. (Note that "objective" is not used in the sense of "unbiased" but in the sense of "target", or "goal".)
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0013=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0013.jpg|frame|none|Multiple Sequence Alignment, slide 0013<br>
 
Reasonable alignment metrics are based on models of how evolution has shaped a family of related sequences.
 
]]
 
&nbsp;<br>
 
Each of the reasonable biological objectives suggests a different alignment strategy! The most modern algorithms currently available attempt to satisfy these heuristics simultaneously. Note that these are '''heuristics''': they are not the result of some rigorously applied theory, but reflect the complex relationship between protein sequence, structure, evolution and selection.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0014=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0014.jpg|frame|none|Multiple Sequence Alignment, slide 0014<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
<!-- text -->
 
&nbsp;<br>&nbsp;<br>
 
<div style="color: #000000; background-color:#BDC3DC; font-size:100%; text-decoration:none; border:solid 2px #999999; padding: 5px;">
 
===Algorithms and software tools===
 
</div><br>
 
<br>
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0016=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0016.jpg|frame|none|Multiple Sequence Alignment, slide 0016<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Exact methods certainly have their place when it comes to analyzing and improving algorithms; they are especially of interest to computer science because high-dimensional optimal alignment is a difficult problem. However, they cannot compete in terms of result quality with modern heuristic methods.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0017=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0017.jpg|frame|none|Multiple Sequence Alignment, slide 0017<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
'''Progressive''' alignment is one of three fundamental algorithmic approaches to MSA. The EBI offers [http://www.ebi.ac.uk/clustalw/ Clustal alignments online].
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0018=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0018.jpg|frame|none|Multiple Sequence Alignment, slide 0018<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
'''Consistency''' based multiple alignment is one of three fundamental algorithmic approaches to MSA. Many modern algorithms include a consistency based step; however, none of them relies solely on consistency, since spurious local similarity can corrupt the alignment. [http://cbcsrv.watson.ibm.com/Tmsa.html MUSCA, based on the Teiresias pattern discovery algorithm] is offered through IBM's Watson Labs Web server. Similarly, the [http://meme.sdsc.edu/ MEME algorithm for motif discovery], which is more commonly used in sequence analysis, infers a motif-based alignment.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0019=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0019.jpg|frame|none|Multiple Sequence Alignment, slide 0019<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
'''Probabilistic''' multiple alignment is one of three fundamental algorithmic approaches to MSA. A statistical model of the sequences is built, then the alignment can be generated by aligning the sequences to the model. Of course, aligning sequences to a profile is a special case of this procedure: PSI BLAST can thus be used as an alignment algorithm. The most widely used algorithm is Sean Eddy's [http://hmmer.janelia.org/ HMMER, a profile hidden Markov model tool]&nbsp; which is also used in the generation of the&nbsp; [http://pfam.sanger.ac.uk/Pfam Pfam domain database].
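
The idea of "building a statistical model, then scoring sequences against it" can be sketched with a very simple profile: per-column residue probabilities with a uniform pseudocount, scored as log-odds against a uniform background. This is a deliberately reduced stand-in and not how HMMER or PSI BLAST work internally; it has no insert or delete states, and all names and parameters are assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, alphabet=ALPHABET, pseudocount=1.0):
    """Per-column residue probabilities with a uniform pseudocount; gaps are ignored."""
    profile = []
    for col in zip(*aligned):
        counts = Counter(c for c in col if c in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({r: (counts[r] + pseudocount) / total for r in alphabet})
    return profile

def log_odds(segment, profile, alphabet=ALPHABET):
    """Log-odds (bits) of an ungapped segment against a uniform background."""
    background = 1.0 / len(alphabet)
    return sum(math.log2(profile[i][r] / background)
               for i, r in enumerate(segment))

# Hypothetical toy family of three aligned sequences.
toy_profile = build_profile(["MKVL", "MKIL", "MRVL"])
good = log_odds("MKVL", toy_profile)
poor = log_odds("WWWW", toy_profile)
```

A family member scores well above zero while an unrelated segment scores below zero; a profile HMM refines the same idea with position-specific gap probabilities.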
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0020=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0020.jpg|frame|none|Multiple Sequence Alignment, slide 0020<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
[http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=9254694 Altschul ''et al.'' (1997) ''Nucleic Acids Research'' '''25''':3389-3402]
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0021=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0021.jpg|frame|none|Multiple Sequence Alignment, slide 0021<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
I personally rate TCoffee as the most useful and useable tool that is currently available. It is robust, fast, and gives reasonable results for many cases. Usually it is '''very''' noticeably better than CLUSTAL and I would reject any result based on CLUSTAL for that reason. Run TCoffee via the&nbsp; [http://www.ebi.ac.uk/t-coffee/ EBI '''TCoffee''' Web server], which is very easy to use (although alignment size is limited). Source code can be obtained and a local installation on UNIX machines is straightforward. The [http://www.tcoffee.org/ '''TCoffee Web page'''] links to another Web server and also offers 3DCoffee, a variant that automatically fetches related structures and incorporates structural alignments for increased accuracy.<br>
 
&nbsp;<br>
 
The inset image shows one of the useful features of TCoffee: an alignment output in which sequence is coloured according to the local quality of the alignment. This makes reliable and unreliable regions easy to spot, and immediately highlights outliers that could for example be due to sequence errors, such as frameshifts in exons. (MSA taken from the Mbp1 full-length alignment).
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0022=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0022.jpg|frame|none|Multiple Sequence Alignment, slide 0022<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Run the MUSCLE MSAs via the [http://www.ebi.ac.uk/muscle/ EBI '''MUSCLE Web server'''] which is very easy to use, or via the [http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py &nbsp;Berkeley '''MUSCLE server'''] courtesy of Kimmen Sjolander's lab. Source code and compiled code can be obtained from the [http://www.drive5.com/muscle/ Muscle homepage] and a local installation on UNIX and Windows machines is straightforward. The site also hosts the PREFAB multiple alignment benchmark.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0023=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0023.jpg|frame|none|Multiple Sequence Alignment, slide 0023<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
ProbCons is one of the best algorithms that align sequences without additional database information. Run it on the web via the [http://probcons.stanford.edu Stanford '''PROBCONS server'''], or download the code and install locally.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0024=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0024.jpg|frame|none|Multiple Sequence Alignment, slide 0024<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
SPEM is one of the most accurate algorithms currently available, in particular for sequences of very low similarity. Run alignments via the [http://sparks.informatics.iupui.edu/Softwares-Services_files/spem.htm Indiana '''SPEM server'''].
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0025=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0025.jpg|frame|none|Multiple Sequence Alignment, slide 0025<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
One of the latest additions to the toolkit, '''PROMALS is currently the most accurate MSA tool available'''. Run it on the [http://prodata.swmed.edu/promals/ Dallas '''PROMALS Web server''']. Read the [http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btm017?ijkey=8VzLUe2lszEStAI&keytype=ref PROMALS paper in the 2007 NAR Web server issue].
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0026=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0026.jpg|frame|none|Multiple Sequence Alignment, slide 0026<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Just what does PROMALS' improved performance mean, relative to e.g. CLUSTAL? For one, we can see a clear leap in performance through the inclusion of database information and consensus structure predictions (SPEM and PROMALS). On the other hand, regarding the SABmark superfamily dataset that is perhaps most characteristic of "typical" alignment problems with recognizable, but low % identity, PROMALS achieves a 50% improvement relative to CLUSTAL, and a 30% improvement relative to MUSCLE and ProbCons. This is much more than just statistical noise.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0027=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0027.jpg|frame|none|Multiple Sequence Alignment, slide 0027<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
... from the SPEM paper (Zhou & Zhou, 2005). Above ~35% pairwise sequence identity, all algorithms get it more or less right. Below ~20% pairwise sequence identity the differences are dramatic, with the methods that rely on the sequences only scoring more than 20% better than CLUSTAL and SPEM outperforming CLUSTAL by about 40%.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0028=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0028.jpg|frame|none|Multiple Sequence Alignment, slide 0028<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
How do we know that a new algorithm is better than a previous one? Benchmarks, or "Gold Standards" are an essential part of scientific hygiene. We as users must demand objective comparisons to existing methods, as referees we must require them for publication, as members of the research community we must participate in defining them and provide raw data for their construction. But we must also realize that an "arms-race" of sorts may be ensuing: as developers use the benchmarks as a training set, artificially high performance scores may be generated and performance on novel problems may degrade.
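
One common way to compare a test alignment against a benchmark reference is to ask what fraction of the residue pairs aligned in the reference are also aligned in the test alignment. The sketch below is a minimal version of that idea, assuming both alignments contain the same sequences in the same row order; the function names and toy data are illustrative, and real benchmarks typically restrict the count to reliable regions.

```python
from itertools import combinations

def aligned_pairs(aligned):
    """Set of residue pairs ((row, position), (row, position)) sharing a column."""
    counters = [0] * len(aligned)
    pairs = set()
    for col in zip(*aligned):
        present = []
        for row, char in enumerate(col):
            if char != "-":
                # Identify each residue by its row and ungapped position.
                present.append((row, counters[row]))
                counters[row] += 1
        pairs.update(combinations(present, 2))
    return pairs

def pair_recovery(test, reference):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref = aligned_pairs(reference)
    return len(aligned_pairs(test) & ref) / len(ref) if ref else 0.0

# Hypothetical toy case: the test alignment misplaces the final residue.
reference = ["MKV-L", "MRVAL"]
test_aln = ["MKVL-", "MRVAL"]
q = pair_recovery(test_aln, reference)
```

Three of the four reference pairs are recovered here, so the score is 0.75; a tool tuned on the benchmark itself can of course inflate exactly this kind of number.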
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0029=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0029.jpg|frame|none|Multiple Sequence Alignment, slide 0029<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Access [http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/ the original BAliBASE] (1999) here. Two updated versions have been created: [http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE2/ BAliBASE 2.0] (2000) and [http://www-bio3d-igbmc.u-strasbg.fr/~julie/balibase/index.html BAliBASE 3.0]. Central to BAliBASE is the concept of '''core blocks''' of alignable regions in which a pairwise correspondence of residues can be defined; outside these regions an alignment is not possible since the structural differences are too large. ([http://bioinformatics.oxfordjournals.org/cgi/content/abstract/15/1/87 BAliBASE: Thompson J. ''et al.,'' (1999) ''Bioinformatics'' '''15''':87-88.]).
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0030=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0030.jpg|frame|none|Multiple Sequence Alignment, slide 0030<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
[http://bioinformatics.oxfordjournals.org/cgi/content/full/21/7/1267 SABmark: Van Walle ''et al.'' (2005) ''Bioinformatics'' '''21''':1267-1268].&nbsp; [http://bioinformatics.vub.ac.be/databases/databases.html SABmark homepage].
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0031=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0031.jpg|frame|none|Multiple Sequence Alignment, slide 0031<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Construction of PREFAB is described in [http://nar.oxfordjournals.org/cgi/content/full/32/5/1792?ijkey=48Nmt1tta0fMg&keytype=ref MUSCLE: Edgar (2004) ''Nucl Acids Res'' '''32''':1792-1797].
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0032=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0032.jpg|frame|none|Multiple Sequence Alignment, slide 0032<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
<!-- text -->
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0033=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0033.jpg|frame|none|Multiple Sequence Alignment, slide 0033<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
<!-- text -->
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0034=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0034.jpg|frame|none|Multiple Sequence Alignment, slide 0034<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Using CLUSTAL for anything but the simplest alignment problems is ''Cargo Cult Bioinformatics''. You are doing something that may look good to the non-expert, but you can't get good results. Benchmark results have identified '''significant''' progress in the field!<br>
 
&nbsp;<br>
 
"Relevance" for Google may not be the same as relevance for your work. For some applications, novelty is more important than cross-references and page-hits. For a more curated view, you can try the [http://en.wikipedia.org/wiki/Multiple_sequence_alignment '''Wikipedia''' page on Multiple Sequence Alignment] or the [http://wikiomics.org/wiki/Multiple_sequence_alignment Wikiomics] page. <small>(Wikiomics is a project you should know about, but it doesn't appear to be catching on very well.)</small>
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0035=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0035.jpg|frame|none|Multiple Sequence Alignment, slide 0035<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
The obvious first approach is to search for a recent review. For the last year of sequence alignment literature in PubMed: search <tt>("multiple sequence alignment"[ti] OR "multiple alignment"[ti]) AND (server OR algorithm) AND "last 1 years"[dp]</tt> or just [http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&Db=pubmed&term=%22multiple+sequence+alignment%22%5Bti%5D+OR+%22multiple+alignment%22%5Bti%5D+AND+%22last+1+Years%22%5Bdp%5D '''click here'''.] Note that not all "reviews" have been tagged by the PubMed curators as such. In the list returned in October 2008, the most recent review was found by the above search strategy. Of course, no recent review may be available, or the available reviews may not be very informative. [http://compbiol.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pcbi.0030123 Cedric Notredame's MSA review (2007)] is technical and probably less helpful for the non-expert, although it emphasizes the paradigm shift towards '''template based alignment''' strategies well. [http://www.ncbi.nlm.nih.gov/sites/entrez?Db=pubmed&Cmd=ShowDetailView&TermToSearch=16679011 Edgar and Batzoglou's MSA review (2005)], by the authors of MUSCLE and ProbCons, is much more readable and a good, comprehensive introduction to modern methods. <!-- URL encoding in example needed for Wikification of link, not NCBI -->
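
If you want to run the same search from a script, the query string can be URL-encoded and sent to NCBI's E-utilities <tt>esearch</tt> endpoint. The sketch below only builds the URL and performs no network access; the helper name and the <tt>retmax</tt> value are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(term, retmax=20):
    """Build an esearch URL for a PubMed query; special characters are percent-encoded."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# The search term given in the text above.
query = ('("multiple sequence alignment"[ti] OR "multiple alignment"[ti]) '
         'AND (server OR algorithm) AND "last 1 years"[dp]')
url = pubmed_search_url(query)
```

Letting <tt>urlencode</tt> handle quotes, brackets and spaces avoids the hand-encoding that the link above required.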
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0036=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0036.jpg|frame|none|Multiple Sequence Alignment, slide 0036<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
An alternative and more exploratory approach is to choose a recent '''highly relevant''' article, then to use the NCBI's "Related Articles" service. This search strategy allows you to search '''forward''' in time from a particular publication. In the above example, a search for <tt>clustal[ti]</tt> yielded a publication on CLUSTAL from 2003 ...
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0037=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0037.jpg|frame|none|Multiple Sequence Alignment, slide 0037<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
... in the list of related articles (in September 2007) the article on PROMALS (2007) was number 5 in the hit-list, SPEM (2005) came as number 50.
 
&nbsp;<br>&nbsp;<br>
 
<div style="color: #000000; background-color:#BDC3DC; font-size:100%; text-decoration:none; border:solid 2px #999999; padding: 5px;">
 
===Uses and problems===
 
</div><br>
 
<br>
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0039=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0039.jpg|frame|none|Multiple Sequence Alignment, slide 0039<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Spend some time and thought '''before''' you run the MSA to review the sequences that you are planning to align. Including un-alignable sequence '''will''' lead the algorithms astray and has the potential to degrade the entire alignment. The requirement not to align "non-homologous" sequence should really be extended not to align (or at least: not to evaluate) sequence segments that have evolved in different contexts, such as in different local structural environments after insertions or deletions have occurred. The reason is: if the structural environment is not conserved, the mutation data matrix scores are irrelevant for the residues that are paired up. They may be "aligned" by the algorithm, but they are really not equivalent in structure or function, thus whether they have a good or poor similarity score is meaningless.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0040=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0040.jpg|frame|none|Multiple Sequence Alignment, slide 0040<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Three common formats exist for MSA results. An '''aligned''' multi FASTA file contains FASTA formatted sequences into which gap characters have been inserted. Of course, multi FASTA files can also be unaligned and they are the most common way of formatting '''input files''' for MSAs.
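
A minimal sketch of reading such a file: collect the lines under each <tt>&gt;</tt> header, then check whether all sequences have the same length, which is what distinguishes an aligned multi FASTA file from an unaligned one. The function names and the example records are made up for illustration.

```python
def parse_fasta(text):
    """Parse multi FASTA text into a dict of identifier -> sequence (gaps kept)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Use the first word of the header as the identifier.
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def is_aligned(records):
    """True if all sequences (including gap characters) have the same length."""
    return len({len(s) for s in records.values()}) <= 1

# Hypothetical example with sequences wrapped over two lines.
example = """>seqA toy sequence
MK-VL
A
>seqB
MRGVL
-
"""
records = parse_fasta(example)
```

The same parser reads unaligned input; only the length check tells the two cases apart.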
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0041=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0041.jpg|frame|none|Multiple Sequence Alignment, slide 0041<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Three common formats exist for MSA results. MSF is a legacy format from the GCG package of sequence alignments, also produced by the EMBOSS tool EMMA, and supported as a valid input format for many programs. Gaps are denoted by periods and checksums are calculated for the sequences and for the alignment.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0042=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0042.jpg|frame|none|Multiple Sequence Alignment, slide 0042<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Three common formats exist for MSA results. A CLUSTAL formatted alignment is the format in most common use. Take care when formatting input files to ensure the '''first 10 characters of each sequence name are unique''' and contain '''no special characters'''! I have seen programs break on blanks, hyphens and &#124; (pipe). The latter is especially annoying, since the &#124; character is used in NCBI FASTA files to separate the database identifier from the accession number.&nbsp; (More information at the&nbsp; [http://www.ebi.ac.uk/help/formats_frame.html EBI help page on formats].)
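
One way to guard against such problems is to rename sequences before alignment and keep a mapping back to the original headers. The sketch below replaces every character other than letters, digits and underscore, truncates to 10 characters, and appends a counter where truncated names collide. The helper name and the example headers are hypothetical, and the set of "safe" characters is an assumption rather than a rule taken from any particular program.

```python
import re

def safe_ids(headers, width=10):
    """Map FASTA headers to unique identifiers of at most `width` safe characters."""
    mapping = {}
    used = set()
    for header in headers:
        # Replace blanks, hyphens, pipes and other special characters.
        base = re.sub(r"[^A-Za-z0-9_]", "_", header)[:width]
        candidate = base
        n = 1
        while candidate in used:
            # Make room for a numeric suffix to resolve collisions.
            suffix = str(n)
            candidate = base[:width - len(suffix)] + suffix
            n += 1
        used.add(candidate)
        mapping[header] = candidate
    return mapping

# Made-up headers in NCBI style; the first two collide after truncation.
headers = ["gi|6320147|ref|NP_010227.1|",
           "gi|6320147|ref|NP_010228.1|",
           "Mbp1 yeast"]
ids = safe_ids(headers)
```

Keep the returned mapping so that the original names can be restored in the finished alignment.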
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0043=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0043.jpg|frame|none|Multiple Sequence Alignment, slide 0043<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
It is common and perfectly permissible to manually edit an MSA with some biologically motivated heuristic in mind '''as long as you document what you have done'''! In the early days of MSAs, editing was simply <u>required</u> since the results were often obviously inadequate. In all cases in which the algorithm uses only the input sequences for the alignment, this still holds true. However, regarding the more modern template-based procedures (e.g. SPEM, PROMALS or PRALINE) I would be more reluctant to edit, since we may be actively ignoring/discarding the additional information the algorithm has used.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0044=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0044.jpg|frame|none|Multiple Sequence Alignment, slide 0044<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
<!-- text -->
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0045=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0045.jpg|frame|none|Multiple Sequence Alignment, slide 0045<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
Jalview is integrated into the EBI multiple sequence alignment services, or you can access the [http://www.jalview.org/ '''Jalview''' home page].
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0046=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0046.jpg|frame|none|Multiple Sequence Alignment, slide 0046<br>
 
Multiple structural alignment of a representative set of the class II aminoacyl-tRNA synthetases. Structures are colored by structural conservation Q<sub>res</sub>. User-selected residues are highlighted on both the sequence and structure displays. See Eargle ''et al.'' (2006)
 
]]
 
&nbsp;<br>
 
[http://bioinformatics.oxfordjournals.org/cgi/content/full/22/4/504 VMD structural alignment: Eargle ''et al.'' (2006) ''Bioinformatics'' '''22''':504-506]
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0047=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0047.jpg|frame|none|Multiple Sequence Alignment, slide 0047<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
[http://pfaat.sourceforge.net/ The '''PFAAT''' homepage]
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0048=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0048.jpg|frame|none|Multiple Sequence Alignment, slide 0048<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
[http://aig.cs.man.ac.uk/research/utopia/cinema/cinema.php The '''CINEMA''' homepage].
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0049=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0049.jpg|frame|none|Multiple Sequence Alignment, slide 0049<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
BOXSHADE is purely for alignment visualization: run it from the [http://www.ch.embnet.org/software/BOX_form.html Embnet BOXSHADE server] or the [http://bioweb.pasteur.fr/seqanal/interfaces/boxshade.html Pasteur institute BOXSHADE server]. The EMBOSS package has tools with similar functionality.
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0050=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0050.jpg|frame|none|Multiple Sequence Alignment, slide 0050<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
<!-- text -->
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0051=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0051.jpg|frame|none|Multiple Sequence Alignment, slide 0051<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
<!-- text -->
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0052=====
 
</div>
 
[[Image:Multiple Sequence Alignment_slide0052.jpg|frame|none|Multiple Sequence Alignment, slide 0052<br>
 
For sequence logos, see e.g. http://weblogo.berkeley.edu/ and references there.
 
]]
 
&nbsp;<br>
 
<!-- text -->
 
&nbsp;<br><div style="padding: 5px;">
 
=====Slide 0053=====
 
 
</div>
 
</div>
[[Image:Multiple Sequence Alignment_slide0053.jpg|frame|none|Multiple Sequence Alignment, slide 0053<br>
 
<!-- caption -->
 
]]
 
&nbsp;<br>
 
<!-- text -->
 
&nbsp;<br>&nbsp;<br>
 
&nbsp;<br>
 

Latest revision as of 22:38, 15 November 2014

Multiple Sequence Alignment


This page is a placeholder, or under current development; it is here principally to establish the logical framework of the site. The material on this page is correct, but incomplete.


MSA: Multiple sequence alignments



 

Introductory reading

Caution: 2005 article.

Wallace et al. (2005) Multiple sequence alignments. Curr Opin Struct Biol 15:261-6. (pmid: 15963889)

Multiple sequence alignments are very widely used in all areas of DNA and protein sequence analysis. The main methods that are still in use are based on 'progressive alignment' and date from the mid to late 1980s. Recently, some dramatic improvements have been made to the methodology with respect either to speed and capacity to deal with large numbers of sequences or to accuracy. There have also been some practical advances concerning how to combine three-dimensional structural information with primary sequences to give more accurate alignments, when structures are available.


 

Contents

Multiple Sequence Alignment - Lecture by Boris Steipe. BCH441 - 2011
MSA - Presentation by Nirvana Nursimulu, BCB410 - 2011

 

Exercises

Exercises - by Nirvana Nursimulu, BCB410 - 2011


 

Further reading and resources

Methods in Molecular Biology

Multiple Sequence Alignment Methods

Springer (2014)
Kim & Ma (2014) PSAR-align: improving multiple sequence alignment using probabilistic sampling. Bioinformatics 30:1010-2. (pmid: 24222208)

SUMMARY: We developed PSAR-Align, a multiple sequence realignment tool that can refine a given multiple sequence alignment based on suboptimal alignments generated by probabilistic sampling. Our evaluation demonstrated that PSAR-Align is able to improve the results from various multiple sequence alignment tools. AVAILABILITY AND IMPLEMENTATION: The PSAR-Align source code (implemented mainly in C++) is freely available for download at http://bioen-compbio.bioen.illinois.edu/PSAR-Align.

Roshan (2014) Multiple sequence alignment using Probcons and Probalign. Methods Mol Biol 1079:147-53. (pmid: 24170400)

Sequence alignment remains a fundamental task in bioinformatics. The literature contains programs that employ a host of exact and heuristic strategies available in computer science. Probcons was the first program to construct maximum expected accuracy sequence alignments with hidden Markov models and at the time of its publication achieved the highest accuracies on standard protein multiple alignment benchmarks. Probalign followed this strategy except that it used a partition function approach instead of hidden Markov models. Several programs employing both strategies have been published since then. In this chapter we describe Probcons and Probalign.

Taly et al. (2011) Using the T-Coffee package to build multiple sequence alignments of protein, RNA, DNA sequences and 3D structures. Nat Protoc 6:1669-82. (pmid: 21979275)

T-Coffee (Tree-based consistency objective function for alignment evaluation) is a versatile multiple sequence alignment (MSA) method suitable for aligning most types of biological sequences. The main strength of T-Coffee is its ability to combine third party aligners and to integrate structural (or homology) information when building MSAs. The series of protocols presented here show how the package can be used to multiply align proteins, RNA and DNA sequences. The protein section shows how users can select the most suitable T-Coffee mode for their data set. Detailed protocols include T-Coffee, the default mode, M-Coffee, a meta version able to combine several third party aligners into one, PSI (position-specific iterated)-Coffee, the homology extended mode suitable for remote homologs and Expresso, the structure-based multiple aligner. We then also show how the T-RMSD (tree based on root mean square deviation) option can be used to produce a functionally informative structure-based clustering. RNA alignment procedures are described for using R-Coffee, a mode able to use predicted RNA secondary structures when aligning RNA sequences. DNA alignments are illustrated with Pro-Coffee, a multiple aligner specific of promoter regions. We also present some of the many reformatting utilities bundled with T-Coffee. The package is an open-source freeware available from http://www.tcoffee.org/.

Peng & Xu (2011) A multiple-template approach to protein threading. Proteins 79:1930-9. (pmid: 21465564)

Most threading methods predict the structure of a protein using only a single template. Due to the increasing number of solved structures, a protein without solved structure is very likely to have more than one similar template structures. Therefore, a natural question to ask is if we can improve modeling accuracy using multiple templates. This article describes a new multiple-template threading method to answer this question. At the heart of this multiple-template threading method is a novel probabilistic-consistency algorithm that can accurately align a single protein sequence simultaneously to multiple templates. Experimental results indicate that our multiple-template method can improve pairwise sequence-template alignment accuracy and generate models with better quality than single-template models even if they are built from the best single templates (P-value < 10^-6) while many popular multiple sequence/structure alignment tools fail to do so. The underlying reason is that our probabilistic-consistency algorithm can generate accurate multiple sequence/template alignments. In another word, without an accurate multiple sequence/template alignment, the modeling accuracy cannot be improved by simply using multiple templates to increase alignment coverage. Blindly tested on the CASP9 targets with more than one good template structures, our method outperforms all other CASP9 servers except two (Zhang-Server and QUARK of the same group). Our probabilistic-consistency algorithm can possibly be extended to align multiple protein/RNA sequences and structures.

Edgar (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113. (pmid: 15318951)

BACKGROUND: In a previous paper, we introduced MUSCLE, a new program for creating multiple alignments of protein sequences, giving a brief summary of the algorithm and showing MUSCLE to achieve the highest scores reported to date on four alignment accuracy benchmarks. Here we present a more complete discussion of the algorithm, describing several previously unpublished techniques that improve biological accuracy and / or computational complexity. We introduce a new option, MUSCLE-fast, designed for high-throughput applications. We also describe a new protocol for evaluating objective functions that align two profiles. RESULTS: We compare the speed and accuracy of MUSCLE with CLUSTALW, Progressive POA and the MAFFT script FFTNS1, the fastest previously published program known to the author. Accuracy is measured using four benchmarks: BAliBASE, PREFAB, SABmark and SMART. We test three variants that offer highest accuracy (MUSCLE with default settings), highest speed (MUSCLE-fast), and a carefully chosen compromise between the two (MUSCLE-prog). We find MUSCLE-fast to be the fastest algorithm on all test sets, achieving average alignment accuracy similar to CLUSTALW in times that are typically two to three orders of magnitude less. MUSCLE-fast is able to align 1,000 sequences of average length 282 in 21 seconds on a current desktop computer. CONCLUSIONS: MUSCLE offers a range of options that provide improved speed and / or alignment accuracy compared with currently available programs. MUSCLE is freely available at http://www.drive5.com/muscle.

Katoh & Toh (2008) Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinformatics 9:286-98. (pmid: 18372315)

The accuracy and scalability of multiple sequence alignment (MSA) of DNAs and proteins have long been and are still important issues in bioinformatics. To rapidly construct a reasonable MSA, we developed the initial version of the MAFFT program in 2002. MSA software is now facing greater challenges in both scalability and accuracy than those of 5 years ago. As increasing amounts of sequence data are being generated by large-scale sequencing projects, scalability is now critical in many situations. The requirement of accuracy has also entered a new stage since the discovery of functional noncoding RNAs (ncRNAs); the secondary structure should be considered for constructing a high-quality alignment of distantly related ncRNAs. To deal with these problems, in 2007, we updated MAFFT to Version 6 with two new techniques: the PartTree algorithm and the Four-way consistency objective function. The former improved the scalability of progressive alignment and the latter improved the accuracy of ncRNA alignment. We review these and other techniques that MAFFT uses and suggest possible future directions of MSA software as a basis of comparative analyses. MAFFT is available at http://align.bmr.kyushu-u.ac.jp/mafft/software/.

Kemena & Notredame (2009) Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics 25:2455-65. (pmid: 19648142)

This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches.

Chang et al. (2012) Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee. BMC Bioinformatics 13 Suppl 4:S1. (pmid: 22536955)

BACKGROUND: Transmembrane proteins (TMPs) constitute about 20~30% of all protein coding genes. The relative lack of experimental structure has so far made it hard to develop specific alignment methods and the current state of the art (PRALINE™) only manages to recapitulate 50% of the positions in the reference alignments available from the BAliBASE2-ref7. METHODS: We show how homology extension can be adapted and combined with a consistency based approach in order to significantly improve the multiple sequence alignment of alpha-helical TMPs. TM-Coffee is a special mode of PSI-Coffee able to efficiently align TMPs, while using a reduced reference database for homology extension. RESULTS: Our benchmarking on BAliBASE2-ref7 alpha-helical TMPs shows a significant improvement over the most accurate methods such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. We also estimated the influence of the database used for homology extension and show that highly non-redundant UniRef databases can be used to obtain similar results at a significantly reduced computational cost over full protein databases. TM-Coffee is part of the T-Coffee package, a web server is also available from http://tcoffee.crg.cat/tmcoffee and a freeware open source code can be downloaded from http://www.tcoffee.org/Packages/Stable/Latest.