Difference between revisions of "BIN-ALI-MSA"

Revision as of 22:52, 28 October 2017

Multiple Sequence Alignment

Keywords: Multiple sequence alignment

This unit is under development. There is some content here, but it is incomplete and may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.

Abstract

...

This unit ...

Prerequisites

You need to complete the following units before beginning this one:

Objectives

This unit will ...

... introduce the benefits of multiple sequence alignments (MSA), the objective functions they pursue, algorithms and methods, practical considerations, and the analysis of alignments;
... demonstrate Web services that calculate MSAs;
... teach how to compute MSAs in R.

Outcomes

After working through this unit you ...

... can ;
... are familiar with ;
... have begun to.

Deliverables

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation

Evaluation: Integrated Unit

This unit should be submitted for evaluation for a maximum of 10 marks. Details TBD.

Multiple Sequence Alignment

In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is not trivial. All interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologues, or paralogues as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:

orthologues are expected to be functionally and structurally conserved;
paralogues may have divergent function but have similar structure;
missing genes may make paralogues look like orthologues; and
selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.

MSAs on the web at the EBI

The EBI hosts a number of excellent MSA programs on its website. Let's perform an MSA of full-length MBP1 orthologues:

Task:

Navigate to the NCBI protein database and paste the MBP1 protein RefSeq IDs from our database into the search form:

NP_010227 NP_593032 XP_660758 XP_007682304 XP_955821 XP_001837394
XP_569090 XP_003327086 XP_011392621 XP_006957051

(add your MBP1_MYSPE RefSeq ID too!)

This will give you a page with links to the retrieved sequences. Click on Summary and choose FASTA (text) as the Format to retrieve all sequences at once as a multi-FASTA formatted page (this is useful, remember it!).
Open another browser window and navigate to the EBI MSA tools page.
Click on Launch T-coffee.
Copy the FASTA sequences from the NCBI page, and paste them into the form at the EBI's T-Coffee page. Click Submit.
The result should show you the aligned sequences, with three blocks of high similarity:
- The most N-terminal block is the APSES domain - the main DNA binding domain of these transcription factors.
- In the middle, we have Ankyrin domains: these are protein-protein interaction modules that Mbp1 uses to recruit other proteins to the bound complex.
- At the end, there is one additional, shorter segment of high similarity.

Explore the tabs that are available, in particular note that you can save the result to a file.
Click on the Download Alignment File tab to load the alignment as text into a browser window. Then save the file into your project directory with a filename of msaT.aln. (.aln is the standard extension for CLUSTAL-formatted alignment files, so it helps if we give the file that extension. Of course you know better than to rely on an extension to signal the filetype and format.)
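
Since a file extension is no guarantee of format, it can help to look at what a CLUSTAL-formatted file actually contains: the alignment is written in blocks, and each sequence is spread over several rows that have to be joined by ID. The following is a minimal sketch in Python, not the course's R script. The names SEQ_LINE, read_clustal, column_identity and the toy SAMPLE alignment are made up for illustration, and the pattern is a simplification that may not be fully compliant with the format specification.

```python
import re
from collections import Counter

# A sequence row: an ID of word characters, at least two spaces, then residues
# or gap characters. The header row and the conservation row do not match.
SEQ_LINE = re.compile(r"^(\w+) {2,}([A-Za-z.\- ]+)$")

# Toy alignment with hypothetical IDs (not real MBP1 data).
SAMPLE = """CLUSTAL W (1.83) multiple sequence alignment

SEQ_A      MKV-LT
SEQ_B      MKVALT
SEQ_C      MRV-LS
           *:* *:

SEQ_A      GG
SEQ_B      GA
SEQ_C      GG
           *
"""


def read_clustal(text):
    """Join the per-block rows of a CLUSTAL alignment into one string per ID."""
    msa = {}
    for line in text.splitlines():
        match = SEQ_LINE.match(line.rstrip())
        if match:
            seq_id, chunk = match.groups()
            msa[seq_id] = msa.get(seq_id, "") + chunk.replace(" ", "")
    return msa


def column_identity(msa):
    """Per column: fraction of rows that share the most common non-gap residue."""
    rows = list(msa.values())
    scores = []
    for column in zip(*rows):
        residues = [c for c in column if c not in "-."]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(rows))
    return scores


if __name__ == "__main__":
    alignment = read_clustal(SAMPLE)
    for seq_id, seq in alignment.items():
        print(seq_id, seq)
    print([round(s, 2) for s in column_identity(alignment)])
```

To try this on your own download, replace SAMPLE with the text of msaT.aln, for example open("msaT.aln").read(). Runs of columns with scores near 1 should correspond to the blocks of high similarity you saw on the result page. Identity is only the crudest measure of conservation; it ignores similarity between residues.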

MSAs in R

Let's move to our RStudio project to explore producing and analyzing multiple sequence alignments in R.

Task:

Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
Type init() if requested.
Open the file BIN-ALI-MSA.R and follow the instructions.

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.

Sequence alignment editors

Really excellent software tools have been written that help you visualize and manually curate multiple sequence alignments. If anything, I think they tend to do too much. Past versions of the course have used Jalview, but I have heard good things about AliView (and if you are on a Mac, Seqotron might interest you, but I only cover software that is free and runs on all three major platforms).

Here, I am just mentioning the two alignment editors and encourage you to explore and use them. If you have experience with comparing them, let us know.

Jalview (http://www.jalview.org/): an integrated MSA editor and sequence annotation workbench from the Barton lab in Dundee. Lots of functions.
AliView (http://www.ormbunkar.se/aliview/): from Uppsala; fast, lean, looks to be very practical.

Further reading, links and resources

Bawono et al. (2017) Multiple Sequence Alignment. Methods Mol Biol 1525:167-189. (pmid: 27896722)

Benítez-Páez et al. (2012) A practical guide for the computational selection of residues to be experimentally characterized in protein families. Brief Bioinformatics 13:329-36. (pmid: 21930656)

Pais et al. (2014) Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol 9:4. (pmid: 24602402)

BACKGROUND: Multiple sequence alignment (MSA) is an extremely useful tool for molecular and evolutionary biology and there are several programs and algorithms available for this purpose. Although previous studies have compared the alignment accuracy of different MSA programs, their computational time and memory usage have not been systematically evaluated. Given the unprecedented amount of data produced by next generation deep sequencing platforms, and increasing demand for large-scale data analysis, it is imperative to optimize the application of software. Therefore, a balance between alignment accuracy and computational cost has become a critical indicator of the most suitable MSA program. We compared both accuracy and cost of nine popular MSA programs, namely CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, Probalign, Probcons and T-Coffee, against the benchmark alignment dataset BAliBASE and discuss the relevance of some implementations embedded in each program's algorithm. Accuracy of alignment was calculated with the two standard scoring functions provided by BAliBASE, the sum-of-pairs and total-column scores, and computational costs were determined by collecting peak memory usage and time of execution. RESULTS: Our results indicate that mostly the consistency-based programs Probcons, T-Coffee, Probalign and MAFFT outperformed the other programs in accuracy. Whenever sequences with large N/C terminal extensions were present in the BAliBASE suite, Probalign, MAFFT and also CLUSTAL OMEGA outperformed Probcons and T-Coffee. The drawback of these programs is that they are more memory-greedy and slower than POA, CLUSTALW, DIALIGN-TX, and MUSCLE. CLUSTALW and MUSCLE were the fastest programs, being CLUSTALW the least RAM memory demanding program. CONCLUSIONS: Based on the results presented herein, all four programs Probcons, T-Coffee, Probalign and MAFFT are well recommended for better accuracy of multiple sequence alignments. 
T-Coffee and recent versions of MAFFT can deliver faster and reliable alignments, which are specially suited for larger datasets than those encountered in the BAliBASE suite, if multi-core computers are available. In fact, parallelization of alignments for multi-core computers should probably be addressed by more programs in a near future, which will certainly improve performance significantly.

Sievers & Higgins (2018) Clustal Omega for making accurate alignments of many protein sequences. Protein Sci 27:135-145. (pmid: 28884485)

Iantorno et al. (2014) Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods Mol Biol 1079:59-73. (pmid: 24170395)

Notredame (2007) Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 3:e123. (pmid: 17784778)

Notes

"indel": insertion / deletion – a difference in sequence length between two aligned sequences that is accommodated by gaps in the alignment. Since we can't tell from the comparison of two sequences whether such a change was introduced by insertion into or deletion from the ancestral sequence, we join both into a portmanteau.

Self-evaluation

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.

About ...

Author: Boris Steipe <boris.steipe@utoronto.ca>
Created: 2017-08-05
Modified: 2017-10-22
Version: 0.1
Version history:
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.

@@ Line 84: / Line 84: @@
 <b>Evaluation: Integrated Unit</b><br />
 :This unit should be submitted for evaluation for a maximum of 10 marks. Details TBD.
+<!--
+Import a MAFFT alignment and compare
+Explore MUSCLE parameters and compare
+-->
 {{Vspace}}
@@ Line 144: / Line 157: @@
 * Explore the tabs that are available, in particular note that you can save the result to a file.
-* Click on the '''Download Alignment File''' tab to load the alignment as text into a browser window. Then save the file into your project directory with a filename of <code>MBP1orthologues.aln</code>. (<code>.aln</code> is the standard extension for CLUSTAL-formatted alignment files, so it helps if we give the file that extension. Of course you know better than to '''rely''' on an extension to signal the filetype and format.)
+* Click on the '''Download Alignment File''' tab to load the alignment as text into a browser window. Then save the file into your project directory with a filename of <code>msaT.aln</code>. (<code>.aln</code> is the standard extension for CLUSTAL-formatted alignment files, so it helps if we give the file that extension. Of course you know better than to '''rely''' on an extension to signal the filetype and format.)
 }}
@@ Line 151: / Line 164: @@
+===MSA's in R===
-Let's use the Bioconductor msa package to align the sequences we have. Study and run the following code
-===Computing an MSA's in R===
 {{Vspace}}
@@ Line 179: / Line 186: @@
 * [[http://www.jalview.org/ '''Jalview''']] an integrated MSA editor and sequence annotation workbench from the Barton lab in Dundee. Lots of functions.
 * [[http://www.ormbunkar.se/aliview/ '''AliView''']] from Uppsala: fast, lean, looks to be very practical.
 {{Vspace}}
+<!--
-====Jalview: alignment editor====
-Geoff Barton's lab in Dundee has developed an integrated MSA editor and sequence annotation workbench with a number of very useful functions. It is written in Java and should run on Mac, Linux and Windows platforms without modifications.
-{{#pmid: 19151095}}
-We will quickly install Jalview and explore its features in other assignments.
-{{task|1=
-#Navigate to the [http://www.jalview.org/ Jalview homepage] click on the '''Download''' link, and install Jalview on your computer. For Mac OS X, use the '''Install Jalview Only''' link.
-# Start Jalview. A number of windows that showcase the program's abilities will load, you can close these.
-#Select File &rarr; Input Alignment &rarr; from File and open the <code>APSES_proteins.mfa</code> file you have prepared above. An alignment window with sequences should appear.
-# Choose '''Web Service''' &rarr; '''Alignment''' &rarr; '''Tcoffee with Defaults''' to run a Tcoffee MSA remotely at the Barton lab. The program should execute remotely and download the aligned results into a new window. Scroll along the window to get a sense of what has and hasn't been aligned.
-#Select File &rarr; Input Alignment &rarr; from File and open the <code>APSES_proteins_muscle.mfa</code> file you have prepared above. An alignment window with your Muscle alignment should appear.
-#Compare the two alignments and get a sense for how similar or different they are.
-}}
-===Computing alignments===
- try two MSA's algorithms and load them in Jalview.
- Locally: which one do you prefer? Modify the consensus. Annotate domains.
-The EBI has a very convenient [http://www.ebi.ac.uk/Tools/msa/ page to access a number of MSA algorithms]. This is especially convenient when you want to compare, e.g. T-Coffee and Muscle and MAFFT results to see which regions of your alignment are robust. You could use any of these tools, just paste your sequences into a Webform, download the results and load into Jalview. Easy.
-But even easier is to calculate the alignments directly from Jalview.  available. (Not today. <small>Bummer.</small>)
- No. Calculate an external alignment and import.
-;Calculate a MAFFT alignment using the Jalview Web service option:
-{{task|1=
-#In Jalview, select '''Web Service &rarr; Alignment &rarr; MAFFT with defaults...'''. The alignment is calculated in a few minutes and displayed in a new window.
-}}
-;Calculate a MAFFT alignment when the Jalview Web service is NOT available:
-{{task|1=
-#In Jalview, select '''File &rarr; Output to Textbox &rarr; FASTA'''
-#Copy the sequences.
-#Navigate to the [http://www.ebi.ac.uk/Tools/msa/mafft/ '''MAFFT Input form'''] at the EBI.
-#Paste your sequences into the form.
-#Click on '''Submit'''.
-#Close the Jalview sequence window and either save your MAFFT alignment to file and load it in Jalview, or simply choose '''File &rarr; Input Alignment &rarr; from Textbox''', paste and click '''New Window'''.
-}}
-In any case, you should now have an alignment.
-{{task|1=
-#Choose '''Colour &rarr; Hydrophobicity''' and '''&rarr; by Conservation'''. Then adjust the slider left or right to see which columns are highly conserved. You will notice that the Swi6 sequence that was supposed to align only to the ankyrin domains was in fact aligned to other parts of the sequence as well. This is one part of the MSA that we will have to correct manually and a common problem when aligning sequences of different lengths.
-}}
-[[Image:InformationPlot.jpg|frame|none|Plot of information vs. sequence position produced by the '''R''' script above, for an alignment of Mbp1 ortholog APSES domains.]]
-== Calculating conservation scores ==
-  Regex task:
-===Multiple sequence alignment===
-  Write a program in a language of your choice that extracts the multi-line sequences from a CLUSTAL or MSF formatted multiple sequence alignment and concatenates them into single sequences .
-Sample input data ...
-;CLUSTAL formatted alignment:
- CLUSTAL multiple sequence alignment by MUSCLE (3.8)
- SOK2_SACCE      --NGISVVRRADNDMVNGTKLLN-----VTKMTRGRRDGILKAEKIR----------HVV
- PHD1_SACCE      --NGISVVRRADNNMINGTKLLN-----VTKMTRGRRDGILRSEKVR----------EVV
- KILA_ESCCO      -IDGEIIHLRAKDGYINATSMCRTAGKLLSDYTRLKTTQEFFDELSRDMGIPISELIQSF
- MBP1_SACCE      IHSTGSIMKRKKDDWVNATHILK-----AANFAKAKRTRILEKEVLKETH-------EKV
- SWI4_SACCE      ---TKIVMRRTKDDWINITQVFK-----IAQFSKTKRTKILEKESNDMQH-------EKV
-                       :  * .:. :* * : .      :. :. .    :  *               .
- SOK2_SACCE      KIGSMHLKGVWIPFERALAIAQREKI-
- PHD1_SACCE      KIGSMHLKGVWIPFERAYILAQREQI-
- KILA_ESCCO      KGGRPENQGTWVHPDIAINLAQ-----
- MBP1_SACCE      QGGFGKYQGTWVPLNIAKQLAEKFSVY
- SWI4_SACCE      QGGYGRFQGTWIPLDSAKFLVNKYEI-
-  ----
-;MSF formatted alignment:
- PileUp
-MSF: 87  Type: A  Check: 0000  ..
-Name: SOK2_SACCE  Len: 87  Check:  9836  Weight: 0.160458
-Name: PHD1_SACCE  Len: 87  Check:  2117  Weight: 0.160458
-Name: KILA_ESCCO  Len: 87  Check:  6044  Weight: 0.256296
-Name: MBP1_SACCE  Len: 87  Check:  4979  Weight: 0.211395
-Name: SWI4_SACCE  Len: 87  Check:  5197  Weight: 0.211395
-//
-  SOK2_SACCE    ..NGISVVRR ADNDMVNGTK LLN.....VT KMTRGRRDGI LKAEKIR...
-PHD1_SACCE    ..NGISVVRR ADNNMINGTK LLN.....VT KMTRGRRDGI LRSEKVR...
-KILA_ESCCO    .IDGEIIHLR AKDGYINATS MCRTAGKLLS DYTRLKTTQE FFDELSRDMG
-MBP1_SACCE    IHSTGSIMKR KKDDWVNATH ILK.....AA NFAKAKRTRI LEKEVLKETH
-SWI4_SACCE    ...TKIVMRR TKDDWINITQ VFK.....IA QFSKTKRTKI LEKESNDMQH
-SOK2_SACCE    .......HVV KIGSMHLKGV WIPFERALAI AQREKI.
-PHD1_SACCE    .......EVV KIGSMHLKGV WIPFERAYIL AQREQI.
-KILA_ESCCO    IPISELIQSF KGGRPENQGT WVHPDIAINL AQ.....
-MBP1_SACCE    .......EKV QGGFGKYQGT WVPLNIAKQL AEKFSVY
-SWI4_SACCE    .......EKV QGGYGRFQGT WIPLDSAKFL VNKYEI.
-Write a regex for a valid sequence line. Capture the ID part and the sequence part separately. Use the ID part as a key to a hash, and add the sequence part to the value for that key.
-;Perl example:
-:This code uses a regex that recognizes both CLUSTAL and MSF formats:
-:<code>^(\w+) {2,}([A-Za-z.\- ]+)$</code>
-:Capture a sequence of word characters, followed by at least two consecutive blank spaces, and capture a sequence of alphabetic characters, gap characters (<code>-</code> or <code>.</code>) or spaces until the end of line. Note that <code>-</code> needs to be escaped (<code>\-</code>) since it has the meaning of a character range in the context of a character class (i.e. square brackets). Lines whose sequence part contains numerals fail the match, as well as lines that contain special characters, or lines that begin with spaces. <small>Caution: this may not be fully compliant with the format specification.</small>
-<source lang="perl">
-#!/usr/bin/perl
-use strict;
-use warnings;
-my %MSA;  # Hash to store the MSA: ID => concatenated sequence
-while (my $line = <STDIN>) {
-  if ($line =~ m/^(\w+) {2,}([A-Za-z.\- ]+)$/) {
-    my $k = $1;  # save special variables so they don't get mangled before using them
-    my $v = $2;
-    $v =~ s/\s//g;   # remove blanks in case there are any
-    $MSA{$k} .= $v;  # "." is the perl string concatenation operator
-  }
-}
-# Done. Now do something with the sequences ...
-foreach my $k (keys(%MSA)) {
-  print("$k: $MSA{$k}\n");
-}
-exit();
-</source>
 ==Model Based Alignments: PSSMs and HMMs==
@@ Line 348: / Line 197: @@
 The sensitivity of PSI-BLAST is based on the alignment of profiles of related sequences. The profiles are represented as position specific scoring matrices compiled from the alignment of hits, first to the original sequence and then to the profile. Incidentally, this process can also be turned around, and a collection of pre-compiled PSSMs can be used to annotate protein sequence: this is the principle employed by RPS-BLAST, the tool that identifies conserved domains at the beginning of every BLAST search, and has been used to build the CDD database of conserved domains (for a very informative help-page on CDD [https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml '''see here''']).
+-->
-=== CDD domain annotation ===
-In the last assignment, you followed a link to '''CDD Search Results''' from the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1] and briefly looked at the information offered by the NCBI's Conserved Domain Database, a database of ''Position Specific Scoring Matrices'' that embody domain definitions. Rather than access precomputed results, you can also search CDD with sequences: assuming you have saved the MYSPE Mbp1 sequence in FASTA format, this is straightforward. If you did not save this sequence, return to [[BIO_Assignment_Week_3|Assignment 3]] and retrieve it again.
-{{task|1=
-# Access the [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml '''CDD database'''] at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
-# Read the information. CDD is a superset of various other database domain annotations as well as NCBI-curated domain definitions.
-# Copy the MYSPE Mbp1 FASTA sequence, paste it into the search form and click '''Submit'''.
-## On the result page, click on '''View full result'''
-## Note that there are a number of partially overlapping ankyrin domain modules. We will study ankyrin domains in a later assignment.
-## Also note that there may be blocks of sequence colored cyan in the sequence bar. Hover your mouse over the blocks to see what these blocks signify.
-## Open the link to '''Search for similar domain architecture''' in a separate window and study it. This is the '''CDART''' database. Think about what these results may be useful for.
-## Click on one of the ANK superfamily graphics and see what the associated information looks like: there is a summary of structure and function, links to specific literature and a tree of the relationship of related sequences.
-}}
-; Hidden Markov Models (HMMs)
-An approach to representing such profile information that is more general than PSSMs is a {{WP|Hidden Markov model|'''Hidden Markov model (HMM)'''}}, and the standard tool for using HMMs in bioinformatics is [http://hmmer.org/ '''HMMER'''], written by Sean Eddy. HMMER has made it possible to represent the entirety of protein sequences as a collection of profiles, stored in databases such as [http://pfam.xfam.org/ '''Pfam'''], [https://www.ebi.ac.uk/interpro/ '''Interpro'''], and [http://smart.embl-heidelberg.de/ '''SMART''']. While the details are slightly different, all of these services allow you to scan sequences for the presence of domains. Importantly, the alignment results are thus not collections of full-length protein families but annotations of domain families, i.e. full-length proteins are decomposed into their homologous domains. This is a very powerful approach towards the functional annotation of unknown sequences.
-In this section, we will annotate the MYSPE sequence with the domains it contains, using the database of domain HMMs curated by SMART in Heidelberg and Pfam at the EMBL. We will then compare these annotations with those determined for the orthologues in the reference species. In this way we can enhance the information about one protein by determining how its features are conserved.
 {{Vspace}}