Difference between revisions of "BIN-ALI-Optimal sequence alignment"

Revision as of 03:37, 23 October 2017

Optimal global and local sequence alignment

Keywords: NWS (optimal global) and SW (optimal local) algorithms, alignment via EMBOSS tools in practice, interpretation of alignments

Abstract

This unit covers the concepts and algorithms for optimal pairwise sequence alignments.

This unit ...

Prerequisites

You need to complete the following units before beginning this one:

Objectives

This unit will ...

... discuss how homology is inferred from optimal sequence alignments, by using scoring matrices that represent an evolutionary relationship;
... introduce how dynamic programming alignment works: by optimizing the sum of (context-independent) pairscores, using an affine gap model for indels, and backtracking to reconstruct an alignment from contributing cells in the path-matrix;
... point out problems associated with affine gap functions and how parameter choice influences size and distribution of indels;
... teach the difference between global and local optimal alignment and in which situation these algorithms are appropriately used;
... demonstrate how to calculate optimal sequence alignments with online EMBOSS tools, and in R code with the Biostrings package.

Outcomes

After working through this unit you ...

... can produce and interpret optimal sequence alignments, online, and in R code.

Deliverables

Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.

Evaluation

Evaluation: NA

This unit is not evaluated for course marks.

Pairwise Alignments: Optimal

Task:

Read the introductory notes on concepts of optimal sequence alignment.

Optimal pairwise sequence alignment is the mainstay of sequence comparison. To try our first alignments in practice, we will start with aligning Mbp1 and its MYSPE relative. For simplicity, I will call the two proteins MBP1_SACCE and MBP1_MYSPE through the remainder of the unit.

Optimal Sequence Alignment: EMBOSS online tools

EMBOSS tools are a collection of standard sequence analysis programs. The most important ones are hosted at the EBI, but the EMBOSS explorer site hosts many more. They offer Needleman-Wunsch (global) and Smith-Waterman (local) alignments.

Task:

Fetch the sequences for MBP1_SACCE and MBP1_MYSPE from the database that you prepared in the BIN-Storing_data unit. Open the RStudio project and enter the code below, substituting the proper name for MYSPE where appropriate.

source("makeProteinDB.R")

# Print the MBP1_SACCE sequence
sel <- myDB$protein$name == "MBP1_SACCE"
myDB$protein$sequence[sel]

# Print the MBP1_MYSPE sequence
sel <- myDB$protein$name == paste0("MBP1_", biCode(MYSPE))
myDB$protein$sequence[sel]

(If this didn't work, fix it. Did you give your sequence the right name?)

Access the EMBOSS tools page at the EBI.
Look for Water, click on protein, paste your sequences and run the program with default parameters.
Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
Considering the sequence identity cutoff we discussed in class (25% over the length of a domain), do you believe that the N-terminal domains (the APSES domains) are homologous?
Change the Gap opening and Gap extension parameters to high values (e.g. 25 and 5). Then run the alignment again.
Note what is different.
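
To judge an alignment against the 25% identity cutoff, it helps to be explicit about how percent identity is counted. The following is a minimal sketch in base R, not part of the course scripts: `percentIdentity()` is a hypothetical helper, the two input strings are made-up toy rows of an alignment, and the choice of denominator (all alignment columns, including gap columns) is an assumption that you should compare against what the alignment program reports.

```r
# Percent identity between the two rows of a pairwise alignment.
# Hypothetical helper for illustration; both rows must have equal length
# and use gapChar to mark indels.
percentIdentity <- function(alnA, alnB, gapChar = "-") {
  a <- strsplit(toupper(alnA), "")[[1]]
  b <- strsplit(toupper(alnB), "")[[1]]
  stopifnot(length(a) == length(b))
  aligned <- a != gapChar & b != gapChar   # columns that pair two residues
  same <- aligned & (a == b)               # paired residues that are identical
  # Assumption: the denominator is the full alignment length (with gaps).
  100 * sum(same) / length(a)
}

# Toy example (made-up strings, not the real Mbp1 sequences)
percentIdentity("MSNQ-IYSA",
                "MSDQKIY-A")
```

To answer the question about the APSES domains, apply the function only to the alignment columns that cover the N-terminal domain, not to the whole alignment.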

Global optimal sequence alignment using "needle"

Task:

Look for Needle, click on protein, paste the MBP1_SACCE and MBP1_MYSPE sequences again and run the program with default parameters.
Study the results. You will find that the alignment extends over the entire protein, likely with significant indels at the termini.
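
The difference between the global and the local result can be reproduced with a few lines of dynamic programming. The sketch below is a simplified illustration in base R, not the EMBOSS implementation: `alignScore()` is a hypothetical function, it uses plain match/mismatch scores instead of a mutation data matrix, a linear gap penalty instead of an affine one, and it returns only the optimal score without backtracking.

```r
# Optimal alignment score by dynamic programming (score only).
# Assumptions of this sketch: match/mismatch scoring, linear gap penalty.
alignScore <- function(s1, s2, type = c("global", "local"),
                       matchScore = 1, mismatchScore = -1, gapScore = -2) {
  type <- match.arg(type)
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  n <- length(a)
  m <- length(b)
  # S[i + 1, j + 1] holds the best score for a[1..i] versus b[1..j]
  S <- matrix(0, nrow = n + 1, ncol = m + 1)
  if (type == "global") {
    S[, 1] <- (0:n) * gapScore   # leading gaps are penalized
    S[1, ] <- (0:m) * gapScore
  }
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      pair <- if (a[i] == b[j]) matchScore else mismatchScore
      best <- max(S[i, j] + pair,           # pair a[i] with b[j]
                  S[i, j + 1] + gapScore,   # a[i] opposite a gap
                  S[i + 1, j] + gapScore)   # b[j] opposite a gap
      if (type == "local") best <- max(best, 0)   # local: restart at zero
      S[i + 1, j + 1] <- best
    }
  }
  # global: score of the last cell; local: best cell anywhere in the matrix
  if (type == "global") S[n + 1, m + 1] else max(S)
}

# Toy sequences that share only the core "ACDEF"
alignScore("KKKKACDEF", "ACDEFRRRR", type = "local")
alignScore("KKKKACDEF", "ACDEFRRRR", type = "global")
```

The local score reflects only the shared core, whereas the global score is pulled down because every flanking residue must either be mismatched or placed opposite a gap. This is the same effect that produces the indels at the termini in the Needle output.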

Optimal Sequence Alignment with R: Biostrings

Biostrings has extensive functions for sequence alignments. They are generally well written and tightly integrated with the rest of Bioconductor's functions. There are a few quirks, however: for example, alignments won't work with lower-case sequences^[1].
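
One way to avoid the lower-case problem is to normalize a sequence string before handing it to an alignment function. The following is a minimal sketch in base R: `cleanSeq()` is a hypothetical helper, not a Biostrings function, and restricting the alphabet to the 20 standard one-letter amino acid codes is an assumption that you may want to relax if your sequences contain ambiguity codes.

```r
# Prepare a raw sequence string for alignment: join lines, drop whitespace,
# convert to upper case, and refuse anything that is not an amino acid code.
# Hypothetical helper for illustration.
cleanSeq <- function(x) {
  x <- paste(x, collapse = "")   # join multi-line input
  x <- gsub("\\s+", "", x)       # remove blanks and line breaks
  x <- toupper(x)
  # Assumption: only the 20 standard one-letter codes are valid.
  if (grepl("[^ACDEFGHIKLMNPQRSTVWY]", x)) {
    stop("Sequence contains characters other than the 20 standard amino acid codes.")
  }
  x
}

cleanSeq("msnq iysa")   # lower-case toy input with a blank
```

Because the function stops on unexpected characters, an accidentally included FASTA header line is reported as an error and is not silently aligned.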

Task:

Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
Type init() if requested.
Open the file BIN-ALI-Optimal_sequence_alignment.R and follow the instructions.

Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.

Notes

↑ While this seems like an unnecessary limitation, given that we could easily write such code to transform to-upper when looking up values in the MDM, perhaps it is meant as an additional sanity check that we haven't inadvertently included text in the sequence that does not belong there, such as the FASTA header line.

Self-evaluation

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.

About ...

Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

1.0

Version history:

1.0 First live
0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.

@@ Line 19: / Line 19: @@
-{{DEV}}
+{{LIVE}}
 {{Vspace}}
@@ Line 29: / Line 29: @@
 <section begin=abstract />
 <!-- included from "../components/BIN-ALI-Optimal_sequence_alignment.components.wtxt", section: "abstract" -->
-...
+This unit covers the concepts and algorithms for optimal pairwise sequence alignments.
 <section end=abstract />
@@ Line 48: / Line 48: @@
 === Objectives ===
 <!-- included from "../components/BIN-ALI-Optimal_sequence_alignment.components.wtxt", section: "objectives" -->
-...
+This unit will ...
+* ... discuss  how  homology is inferred from optimal sequence alignments, by using scoring matrices that represent an evolutionary relationship;
+* ... introduce the principle of dynamic programming alignment works by optimizing the sum of (context independent) pairscores, using an affine gap model for indels, and backtracking to reconstruct an alignment from contributing cells in the path-matrix;
+* ... point out problems associated with affine gap functions and how parameter choice influences size and distribution of indels;
+* ... teach the difference between global and local optimal alignment and in which situation these algorithms are appropriately used;
+* ... demonstrate how to calculate optimal sequence alignments with online EMBOSS tools, and in R code with the Biostrings package.;
 {{Vspace}}
@@ Line 55: / Line 60: @@
 === Outcomes ===
 <!-- included from "../components/BIN-ALI-Optimal_sequence_alignment.components.wtxt", section: "outcomes" -->
-...
+After working through this unit you ...
+* ... can produce and interpret optimal sequence alignments, online, and in R code.
 {{Vspace}}
@@ Line 85: / Line 91: @@
 == Contents ==
 <!-- included from "../components/BIN-ALI-Optimal_sequence_alignment.components.wtxt", section: "contents" -->
 == Pairwise Alignments: Optimal ==
@@ Line 95: / Line 100: @@
 }}
+{{Vspace}}
+Optimal pairwise sequence alignment is the mainstay of sequence comparison. To try our first alignments in practice, we will start with aligning Mbp1 and its MYSPE relative. For simplicity, I will call the two proteins <code>MBP1_SACCE</code> and <code>MBP1_MYSPE</code> through the remainder of the unit.
-Optimal pairwise sequence alignment is the mainstay of sequence comparison. To consider such alignments in practice, we'll align the same sequences that we have just mapped in the dotplot exercise: Mbp1 and its MYSPE relative. For simplicity, I will call the two proteins <code>MBP1_SACCE</code> and <code>MBP1_MYSPE</code> through the remainder of the assignment. Your dotplots should have shown you two regions of similarity: a highly similar region focussed somewhere around the N-terminal 100 amino acids, and a more extended, but somewhat less similar region in the middle of the sequences. You can think of the sequence alignment algorithm as building the similarity matrix, and then discovering the best path along high-scoring diagonals.
 {{Vspace}}
@@ Line 106: / Line 110: @@
 {{Vspace}}
-Online programs for optimal sequence alignment are part of the EMBOSS tools. The programs take FASTA files or raw text files as input.
+[https://www.ebi.ac.uk/Tools/emboss/ EMBOSS tools] are a collection of standard sequence analysis programs. The most important ones are hosted at the EBI, but the [http://www.bioinformatics.nl/emboss-explorer/ EMBOSS explorer site] hosts many more. They offer Needlman-Wunsch and Smith-Waterman alignments.
-'''Local''' optimal sequence alignment using "water"
 {{task|1=
-# Fetch the sequences for <code>MBP1_SACCE</code> and <code>MBP1_MYSPE</code> from your database. You can simply select them by name (if you have given your sequence the suggested name when you eneterd it into your database): paste the following into the console:
+* Fetch the sequences for <code>MBP1_SACCE</code> and <code>MBP1_MYSPE</code> from your database that you have prepared in the [[BIN-Storing_data]] unit. Open the RStudio project and enter the code below - substituting the proper name for MYSPE where appropriate.
-* to print the <code>MBP1_SACCE</code> protein sequence to the console
 <source lang="R">
-myDB$protein$sequence[myDB$protein$name == "MBP1_SACCE"]
+source("makeProteinDB.R")
-</source>
+# Print the MBP1_SACCE sequence
+sel <- myDB$protein$name == "MBP1_SACCE"
+myDB$protein$sequence[sel]
+# Print the MBP1_MYSPE sequence
+sel <- myDB$protein$name == paste0("MBP1_", biCode(MYSPE))
+myDB$protein$RefSeqID[sel]
-* to print the <code>MBP1_MYSPE</code> protein sequence to the console:
-<source lang="R">
-MYSPEseq <- paste("MBP1_", biCode(MYSPE), sep="")
-myDB$protein$sequence[myDB$protein$name == MYSPEseq]
 </source>
 (If this didn't work, fix it. Did you give your sequence the right '''name'''?)
-# Access the [http://emboss.bioinformatics.nl/ EMBOSS Explorer site] (if you haven't done so yet, you might want to bookmark it.)
+# Access the [https://www.ebi.ac.uk/Tools/emboss/ EMBOSS tools page] at the EBI.
-# Look for '''ALIGNMENT LOCAL''', click on '''water''', paste your sequences and run the program with default parameters.
+# Look for '''Water''', click on '''protein''', paste your sequences and run the program with default parameters.
 # Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
 # Considering the sequence identity cutoff we discussed in class (25% over the length of a domain), do you believe that the N-terminal domains (the APSES domains) are homologous?
-# Change the '''Gap opening''' and '''Gap extension''' parameters to high values (e.g. 30 and 5). Then run the alignment again.
+# Change the '''Gap opening''' and '''Gap extension''' parameters to high values (e.g. 25 and 5). Then run the alignment again.
 # Note what is different.
 }}
@@ Line 137: / Line 142: @@
 '''Global''' optimal sequence alignment using "needle"
 {{task|1=
-# Look for '''ALIGNMENT GLOBAL''', click on '''needle''', paste the <code>MBP1_SACCE</code> and <code>MBP1_MYSPE</code> sequences again and run the program with default parameters.
+# Look for '''Needle''', click on '''protein''', paste the <code>MBP1_SACCE</code> and <code>MBP1_MYSPE</code> sequences again and run the program with default parameters.
-# Study the results. You will find that the alignment extends over the entire protein, likely with long ''indels'' at the termini.
+# Study the results. You will find that the alignment extends over the entire protein, likely with significant ''indels'' at the termini.
 }}
@@ Line 150: / Line 155: @@
 {{Vspace}}
-Biostrings has extensive functions for sequence alignments. They are generally well written and tightly integrated with the rest of Bioconductor's functions. There are a few quirks however: for example alignments won't work with lower-case sequences<ref>While this seems like an unnecessary limitation, given that we could easily write such code to transform to-upper when looking up values in the MDM, perhaps it is meant as an additional sanity check that we haven't inadvertently included text in the sequence that does not belong there, such as the FASTA header line perhaps.</ref>.
+Biostrings has extensive functions for sequence alignments. They are generally well written and tightly integrated with the rest of Bioconductor's functions. There are a few quirks however: for example alignments won't work with lower-case sequences<ref>While this seems like an unnecessary limitation, given that we could easily write such code to transform to-upper when looking up values in the MDM, perhaps it is meant as an additional sanity check that we haven't inadvertently included text in the sequence that does not belong there, such as the FASTA header line.</ref>.
 {{Vspace}}
-{{task|1 =
+{{ABC-unit|BIN-ALI-Optimal_sequence_alignment.R}}
-* Return to your RStudio session.
-* Once again, if you've been away from it for a while, it's always a good idea to update to pull updtaes from the master file on GitHub.
-* Study and work through the code in the <code>Biostrings Pairwise Alignment</code> section of the <code>BCH441_A04.R</code> script
-}}
 {{Vspace}}
@@ Line 172: / Line 169: @@
 == Further reading, links and resources ==
-<!-- {{#pmid: 19957275}} -->
+{{#pmid: 10782117}}
 <!-- {{WWW|WWW_GMOD}} -->
 <!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
@@ Line 237: / Line 235: @@
 :2017-08-05
 <b>Version:</b><br />
-:0.1
+:1.0
 <b>Version history:</b><br />
+*1.0 First live
 *0.1 First stub
 </div>

