Difference between revisions of "BIN-ALI-MSA"
Line 84: | Line 84: | ||
<b>Evaluation: Integrated Unit</b><br /> | <b>Evaluation: Integrated Unit</b><br /> | ||
:This unit should be submitted for evaluation for a maximum of 10 marks. Details TBD. | :This unit should be submitted for evaluation for a maximum of 10 marks. Details TBD. | ||
+ | |||
+ | <!-- | ||
+ | Import a MAFFT alignment and compare | ||
+ | Explore MUSCLE parameters and compare | ||
+ | |||
+ | --> | ||
{{Vspace}} | {{Vspace}} | ||
Line 144: | Line 157: | ||
* Explore the tabs that are available, in particular note that you can save the result to a file. | * Explore the tabs that are available, in particular note that you can save the result to a file. | ||
− | * Click on the '''Download Alignment File''' tab to load the alignment as text into a browser window. Then save the file into your project directory with a filename of <code> | + | * Click on the '''Download Alignment File''' tab to load the alignment as text into a browser window. Then save the file into your project directory with a filename of <code>msaT.aln</code>. (<code>.aln</code> is the standard extension for CLUSTAL-formatted alignment files, so it helps if we give the file that extension. Of course you know better than to '''rely''' on an extension to signal the filetype and format.) |
}} | }} | ||
Line 151: | Line 164: | ||
− | + | ===MSA's in R=== | |
− | === | ||
{{Vspace}} | {{Vspace}} | ||
Line 179: | Line 186: | ||
* [[http://www.jalview.org/ '''Jalview''']] an integrated MSA editor and sequence annotation workbench from the Barton lab in Dundee. Lots of functions. | * [[http://www.jalview.org/ '''Jalview''']] an integrated MSA editor and sequence annotation workbench from the Barton lab in Dundee. Lots of functions. | ||
* [[http://www.ormbunkar.se/aliview/ '''AliView''']] from Uppsala: fast, lean, looks to be very practical. | * [[http://www.ormbunkar.se/aliview/ '''AliView''']] from Uppsala: fast, lean, looks to be very practical. | ||
− | |||
{{Vspace}} | {{Vspace}} | ||
− | + | <!-- | |
==Model Based Alignments: PSSMs and HMMs== | ==Model Based Alignments: PSSMs and HMMs== | ||
Line 348: | Line 197: | ||
The sensitivity of PSI-BLAST is based on the alignment of profiles of related sequences. The profiles are represented as position-specific scoring matrices (PSSMs) compiled from the alignment of hits, first to the original sequence and then to the profile. Incidentally, this process can also be turned around, and a collection of pre-compiled PSSMs can be used to annotate protein sequences: this is the principle employed by RPS-BLAST, the tool that identifies conserved domains at the beginning of every BLAST search, and has been used to build the CDD database of conserved domains (for a very informative help-page on CDD [https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml '''see here''']). | The sensitivity of PSI-BLAST is based on the alignment of profiles of related sequences. The profiles are represented as position-specific scoring matrices (PSSMs) compiled from the alignment of hits, first to the original sequence and then to the profile. Incidentally, this process can also be turned around, and a collection of pre-compiled PSSMs can be used to annotate protein sequences: this is the principle employed by RPS-BLAST, the tool that identifies conserved domains at the beginning of every BLAST search, and has been used to build the CDD database of conserved domains (for a very informative help-page on CDD [https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml '''see here''']). | ||
− | + | --> | |
{{Vspace}} | {{Vspace}} | ||
Revision as of 22:52, 28 October 2017
Multiple Sequence Alignment
Keywords: Multiple sequence alignment
This unit is under development. There is some content here, but it is incomplete and may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables, etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.
Abstract
...
This unit ...
Prerequisites
You need to complete the following units before beginning this one:
- BIN-ALI-Optimal_sequence_alignment (Optimal global and local sequence alignment)
- BIN-ALI-PSI-BLAST (PSI-BLAST)
- FND-STA-Information_theory (Concepts of Information Theory)
Objectives
This unit will ...
- ... introduce the benefits of multiple sequence alignments (MSA), the objective functions they pursue, algorithms and methods, practical considerations, and the analysis of alignments;
- ... demonstrate Web services that calculate MSAs;
- ... teach how to compute MSAs in R.
Outcomes
After working through this unit you ...
- ... can ;
- ... are familiar with ;
- ... have begun to.
Deliverables
- Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
- Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
- Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
Evaluation
Evaluation: Integrated Unit
- This unit should be submitted for evaluation for a maximum of 10 marks. Details TBD.
Contents
Task:
- Read the introductory notes on concepts of multiple sequence alignments.
Multiple sequence alignments (MSAs) are enormously useful to resolve ambiguities in the precise placement of "indels"[1] and to ensure that columns in alignments actually contain amino acids that evolve in a similar context. MSAs serve as input for
- functional annotation;
- protein homology modelling;
- phylogenetic analyses;
- sensitive homology searches in databases;
- and more.
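To make the idea of comparable columns concrete, here is a minimal sketch that scores each column of an alignment by its Shannon entropy (the concept from the information theory prerequisite): a low value indicates a conserved column. It is written in Python for illustration only and is not part of the course's R script. The toy alignment and the function names `column_entropy` and `alignment_entropies` are made up for this example, and gaps are simply ignored, which is a simplifying assumption.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    return sum(-(k / n) * math.log2(k / n) for k in counts.values())


def alignment_entropies(aligned):
    """Per-column entropies for a list of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*aligned) walks through the alignment column by column.
    return [column_entropy(col) for col in zip(*aligned)]


# Toy alignment, invented for illustration (not real Mbp1 data).
toy = ["MKT-A", "MKS-A", "MRTGA", "MKTGC"]
entropies = alignment_entropies(toy)
for position, h in enumerate(entropies):
    print(position, round(h, 3))
```

Columns that contain only one kind of residue score 0 bits; the more evenly a column is spread over different residues, the higher its entropy.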
Multiple Sequence Alignment
In order to perform a multiple sequence alignment, we obviously need a set of homologous sequences. This is not trivial. All interpretation of MSA results depends absolutely on how the input sequences were chosen. Should we include only orthologues, or paralogues as well? Should we include only species with fully sequenced genomes, or can we tolerate that some orthologous genes are possibly missing for a species? Should we include all sequences we can lay our hands on, or should we restrict the selection to a manageable number of representative sequences? All of these choices influence our interpretation:
- orthologues are expected to be functionally and structurally conserved;
- paralogues may have divergent function but have similar structure;
- missing genes may make paralogues look like orthologues; and
- selection bias may weight our results toward sequences that are over-represented and do not provide a fair representation of evolutionary divergence.
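One common way to counteract over-represented sequences is to weight them down. The sketch below uses the position-based weighting scheme of Henikoff and Henikoff, which this unit does not itself prescribe; it is shown only to illustrate the idea. In each column, a sequence receives 1 / (k × n), where k is the number of distinct symbols in the column and n is the number of sequences that share its symbol. The Python function name `henikoff_weights` and the toy alignment are hypothetical, and treating the gap character as an ordinary symbol is a simplifying assumption.

```python
from collections import Counter


def henikoff_weights(aligned):
    """Position-based sequence weights, normalised to sum to 1."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    weights = [0.0] * len(aligned)
    for column in zip(*aligned):
        counts = Counter(column)      # gaps count as a symbol (simplification)
        k = len(counts)               # number of distinct symbols in the column
        for i, symbol in enumerate(column):
            weights[i] += 1.0 / (k * counts[symbol])
    total = sum(weights)
    return [w / total for w in weights]


# Toy alignment: three identical sequences and one divergent one (invented).
toy = ["MKTA", "MKTA", "MKTA", "MRSC"]
weights = henikoff_weights(toy)
for seq, w in zip(toy, weights):
    print(seq, round(w, 4))
```

The three identical sequences share their weight, while the single divergent sequence receives a larger weight, so the redundant group no longer dominates any statistic computed from the alignment.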
MSA's on the web at the EBI
The EBI hosts a number of excellent MSA programs on their Website. Let's perform an MSA of full length MBP1 orthologues:
Task:
- Navigate to the NCBI protein database and paste the MBP1 protein RefSeq IDs from our database into the search form:
NP_010227 NP_593032 XP_660758 XP_007682304 XP_955821 XP_001837394 XP_569090 XP_003327086 XP_011392621 XP_006957051
(add your MBP1_MYSPE RefSeq ID too!)
- This will give you a page with links to the retrieved sequences. Click on Summary and choose FASTA(text) as the Format to retrieve all sequences at once as a multi-FASTA formatted page (this is useful, remember it!)
- Open another browser window and navigate to the EBI MSA tools page.
- Click on Launch T-coffee.
- Copy the FASTA sequences from the NCBI page, and paste them into the form at the EBI's T-Coffee page. Click Submit.
- The result should show you the aligned sequences, with three blocks of high similarity:
- The most N-terminal block is the APSES domain - the main DNA binding domain of these transcription factors.
- In the middle, we have Ankyrin domains: these are protein-protein interaction modules that Mbp1 uses to recruit other proteins to the bound complex.
- At the end, there is one additional, shorter segment of high similarity.
- Explore the tabs that are available, in particular note that you can save the result to a file.
- Click on the Download Alignment File tab to load the alignment as text into a browser window. Then save the file into your project directory with the filename msaT.aln. (.aln is the standard extension for CLUSTAL-formatted alignment files, so it helps if we give the file that extension. Of course, you know better than to rely on an extension to signal the file type and format.)
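Since the extension alone proves nothing, a program that reads the saved file should check its contents. The following is a minimal Python sketch, not part of the course's R script, that reads CLUSTAL-formatted text into a dictionary of aligned sequences. It assumes the usual layout: a first line that begins with the word CLUSTAL, then blocks of "name sequence" lines, with conservation lines that start with whitespace. The function name `parse_clustal` and the toy alignment are hypothetical.

```python
def parse_clustal(text):
    """Return {name: aligned sequence} from CLUSTAL-formatted text."""
    lines = text.splitlines()
    # Check the content, not the file extension.
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL-formatted alignment: header line missing")
    sequences = {}
    for line in lines[1:]:
        # Skip blank lines and conservation lines (they start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        name, chunk = fields[0], fields[1]
        # Each block contributes the next chunk of every sequence.
        sequences[name] = sequences.get(name, "") + chunk
    if len({len(s) for s in sequences.values()}) > 1:
        raise ValueError("aligned sequences differ in length")
    return sequences


# Toy input, invented for illustration (two blocks of two sequences).
example = """CLUSTAL format alignment (toy example)

seqA   MKT-AG
seqB   MKS-AG
       **: **

seqA   HLL
seqB   H-L
       * *
"""

alignment = parse_clustal(example)
for name, seq in alignment.items():
    print(name, seq)
```

To apply this to the file you saved, you could read it with `open("msaT.aln").read()` and pass the text to `parse_clustal`; a file that lacks the CLUSTAL header raises an error instead of being silently misread.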
MSA's in R
Let's move to our RStudio project to explore producing and analyzing multiple sequence alignments in R.
Task:
- Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
- Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type init() if requested.
- Open the file BIN-ALI-MSA.R and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.
Sequence alignment editors
Really excellent software tools have been written that help you visualize and manually curate multiple sequence alignments. If anything, I think they tend to do too much. Past versions of the course have used Jalview, but I have heard good things about AliView (and if you are on a Mac, Seqotron might interest you, but I only cover software that is free and runs on all three major platforms).
Here, I am just mentioning the two alignment editors and encourage you to explore and use them. If you have experience with comparing them, let us know.
- [Jalview] an integrated MSA editor and sequence annotation workbench from the Barton lab in Dundee. Lots of functions.
- [AliView] from Uppsala: fast, lean, looks to be very practical.
Further reading, links and resources
Bawono et al. (2017) Multiple Sequence Alignment. Methods Mol Biol 1525:167-189. (pmid: 27896722)
Benítez-Páez et al. (2012) A practical guide for the computational selection of residues to be experimentally characterized in protein families. Brief Bioinformatics 13:329-36. (pmid: 21930656)
Pais et al. (2014) Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol 9:4. (pmid: 24602402)
Sievers & Higgins (2018) Clustal Omega for making accurate alignments of many protein sequences. Protein Sci 27:135-145. (pmid: 28884485)
Iantorno et al. (2014) Who watches the watchmen? An appraisal of benchmarks for multiple sequence alignment. Methods Mol Biol 1079:59-73. (pmid: 24170395)
Notredame (2007) Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol 3:e123. (pmid: 17784778)
Notes
- ↑ "indel": insertion / deletion – a difference in sequence length between two aligned sequences that is accommodated by gaps in the alignment. Since we can't tell from the comparison of two sequences whether such a change was introduced by insertion into or deletion from the ancestral sequence, we join both into a portmanteau.
Self-evaluation
If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2017-10-22
Version:
- 0.1
Version history:
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.