BIN-ALI-Optimal sequence alignment

Revision as of 02:52, 4 October 2017 by Boris

Optimal global and local sequence alignment


 

Keywords:  NWS (optimal global) and SW (optimal local) algorithms, alignment via EMBOSS tools in practice, interpretation of alignments


 



 


Caution!

This unit is under development. There is some content here, but it is incomplete and may change significantly: links may lead nowhere, the content is likely to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.


 


Abstract

...


 


This unit ...

Prerequisites

You need to complete the following units before beginning this one:


 


Objectives

...


 


Outcomes

...


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents

Pairwise Alignments: Optimal

 

Task:



Optimal pairwise sequence alignment is the mainstay of sequence comparison. To consider such alignments in practice, we'll align the same sequences that we have just mapped in the dotplot exercise: Mbp1 and its MYSPE relative. For simplicity, I will call the two proteins MBP1_SACCE and MBP1_MYSPE through the remainder of the assignment. Your dotplots should have shown you two regions of similarity: a highly similar region focussed somewhere around the N-terminal 100 amino acids, and a more extended, but somewhat less similar region in the middle of the sequences. You can think of the sequence alignment algorithm as building the similarity matrix, and then discovering the best path along high-scoring diagonals.
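To make the "similarity matrix and high-scoring diagonals" picture concrete, here is a minimal sketch in base R. The function name simMatrix(), the two toy sequences and the match/mismatch scores are assumptions chosen for illustration; real tools score residue pairs with a substitution matrix and additionally allow the path to jump between diagonals at the cost of a gap penalty.

```r
# Build a toy similarity matrix for two short sequences and score its diagonals.
# simMatrix() is a hypothetical helper; match/mismatch values are illustrative.
simMatrix <- function(a, b, match = 1, mismatch = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  # one cell per residue pair: rows follow sequence a, columns follow sequence b
  S <- outer(x, y, function(p, q) ifelse(p == q, match, mismatch))
  dimnames(S) <- list(x, y)
  S
}

S <- simMatrix("TTGATTACA", "GATTACACC")

# Cells with the same (column - row) offset lie on the same diagonal.
offsets  <- col(S) - row(S)
diagSums <- tapply(S, offsets, sum)

# The best ungapped diagonal and its summed score
diagSums[which.max(diagSums)]
```

The name of the winning element is the offset between the two sequences along that diagonal, which is the same information you read off a dotplot by eye.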


 

Optimal Sequence Alignment: EMBOSS online tools

 

Online programs for optimal sequence alignment are part of the EMBOSS tools. The programs take FASTA files or raw text files as input.
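If you prefer to paste FASTA-formatted text rather than a raw sequence, a small helper can produce it. This is only a sketch: toFASTA() is a hypothetical function, not part of the course scripts, and the sequence in the usage line is an arbitrary placeholder rather than a real protein.

```r
# Format a sequence as FASTA text: a ">" header line followed by the
# sequence wrapped to a fixed line width. toFASTA() is a hypothetical helper.
toFASTA <- function(header, s, width = 60) {
  stopifnot(nchar(s) > 0, width > 0)
  starts <- seq(1, nchar(s), by = width)
  chunks <- substring(s, starts, pmin(starts + width - 1, nchar(s)))
  paste(c(paste0(">", header), chunks), collapse = "\n")
}

# Usage with a placeholder sequence (not a real protein)
cat(toFASTA("PLACEHOLDER", "ACDEFGHIKLMNPQRSTVWYACDEFGHIKLMNPQRSTVWY", width = 30), "\n")
```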

Local optimal sequence alignment using "water"

Task:

  1. Fetch the sequences for MBP1_SACCE and MBP1_MYSPE from your database. If you gave your sequences the suggested names when you entered them into your database, you can simply select them by name. Paste the following into the console:
  • to print the MBP1_SACCE protein sequence to the console:
myDB$protein$sequence[myDB$protein$name == "MBP1_SACCE"]
  • to print the MBP1_MYSPE protein sequence to the console:
MYSPEseq <- paste("MBP1_", biCode(MYSPE), sep="")
myDB$protein$sequence[myDB$protein$name == MYSPEseq]

(If this didn't work, fix it. Did you give your sequence the right name?)

  2. Access the EMBOSS Explorer site (if you haven't done so yet, you might want to bookmark it).
  3. Look for ALIGNMENT LOCAL, click on water, paste your sequences and run the program with default parameters.
  4. Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
  5. Considering the sequence identity cutoff we discussed in class (25% over the length of a domain), do you believe that the N-terminal domains (the APSES domains) are homologous?
  6. Change the Gap opening and Gap extension parameters to high values (e.g. 30 and 5). Then run the alignment again.
  7. Note what is different.
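When you compare a region of the alignment against the 25% identity cutoff, it can help to compute the number yourself for just the columns you care about. The following is a minimal sketch: pctIdentity() is a hypothetical helper, the two aligned strings are made up, and dividing by the number of gap-free columns is only one of several conventions, so the value may differ from the identity percentage a tool reports.

```r
# Percent identity of a pairwise alignment given as two equal-length strings
# in which "-" marks a gap. pctIdentity() is a hypothetical helper.
pctIdentity <- function(alnA, alnB) {
  a <- strsplit(alnA, "")[[1]]
  b <- strsplit(alnB, "")[[1]]
  stopifnot(length(a) == length(b))
  aligned <- a != "-" & b != "-"          # columns without a gap
  100 * sum(a[aligned] == b[aligned]) / sum(aligned)
}

# Toy alignment: 5 gap-free columns, 4 of them identical
pctIdentity("MKV-LAT", "MRVQLA-")   # 80
```

To judge a single domain, pass only the aligned substrings that cover that domain.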


Global optimal sequence alignment using "needle"

Task:

  1. Look for ALIGNMENT GLOBAL, click on needle, paste the MBP1_SACCE and MBP1_MYSPE sequences again and run the program with default parameters.
  2. Study the results. You will find that the alignment extends over the entire protein, likely with long indels at the termini.
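The difference between the two results (termini left out by the local alignment, termini forced into the global one) follows from a small change in the dynamic programming recursion. The sketch below is a toy version under stated assumptions: alignScore() is a hypothetical function, it uses a simple match/mismatch score and a linear gap penalty, whereas the EMBOSS programs use a substitution matrix and separate gap opening and extension penalties. It returns only the optimal score, not the alignment itself.

```r
# Toy optimal alignment scores: "global" fills the whole path matrix and reads
# the last cell; "local" never lets a cell drop below zero and reads the maximum.
# alignScore() and its scoring values are illustrative assumptions.
alignScore <- function(a, b, type = c("global", "local"),
                       match = 1, mismatch = -1, gap = -2) {
  type <- match.arg(type)
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  M <- matrix(0, nrow = length(x) + 1, ncol = length(y) + 1)
  if (type == "global") {
    # leading gaps are penalized in a global alignment
    M[, 1] <- gap * (0:length(x))
    M[1, ] <- gap * (0:length(y))
  }
  for (i in seq_along(x)) {
    for (j in seq_along(y)) {
      s <- if (x[i] == y[j]) match else mismatch
      best <- max(M[i, j] + s,        # continue along the diagonal
                  M[i, j + 1] + gap,  # residue of a aligned to a gap
                  M[i + 1, j] + gap)  # residue of b aligned to a gap
      M[i + 1, j + 1] <- if (type == "local") max(0, best) else best
    }
  }
  if (type == "local") max(M) else M[length(x) + 1, length(y) + 1]
}

# A shared core ("GATTACA") with unrelated flanking residues
alignScore("TTTTGATTACA", "GATTACACCCC", type = "local")
alignScore("TTTTGATTACA", "GATTACACCCC", type = "global")
```

The local score reflects only the shared core, while the global score is pulled down by the unrelated flanks that must be aligned or gapped.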



 


Optimal Sequence Alignment with R: Biostrings

 

Biostrings has extensive functions for sequence alignments. They are generally well written and tightly integrated with the rest of Bioconductor's functions. There are a few quirks, however: for example, alignments won't work with lower-case sequences[1].
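One way to sidestep the lower-case quirk is to normalize sequences before handing them to an alignment function. The sketch below uses base R only; cleanSeq() is a hypothetical helper, and restricting the alphabet to the 20 standard one-letter amino acid codes is an assumption you may want to relax (for example to allow ambiguity codes).

```r
# Normalize a sequence to upper case and check that it contains only the
# 20 standard one-letter amino acid codes. cleanSeq() is a hypothetical helper.
cleanSeq <- function(s) {
  s <- toupper(gsub("\\s", "", s))   # drop whitespace, force upper case
  if (grepl("[^ACDEFGHIKLMNPQRSTVWY]", s)) {
    stop("sequence contains unexpected characters")
  }
  s
}

cleanSeq("mkv laT")   # "MKVLAT"
```

A stray FASTA header line would make the check fail, because ">" and digits are not in the allowed alphabet.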


 

Task:

  • Return to your RStudio session.
  • Once again, if you've been away from it for a while, it's always a good idea to pull updates from the master file on GitHub.
  • Study and work through the code in the Biostrings Pairwise Alignment section of the BCH441_A04.R script.


 



 


Further reading, links and resources

 


Notes

  1. While this seems like an unnecessary limitation, given that one could easily convert to upper case when looking up values in the MDM, perhaps it is meant as an additional sanity check that we haven't inadvertently included text that does not belong in the sequence, such as the FASTA header line.


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

0.1

Version history:

  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.