Difference between revisions of "BIN-ALI-Optimal sequence alignment"

 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div id="BIO">
+
<div id="ABC">
  <div class="b1">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Optimal global and local sequence alignment
 
Optimal global and local sequence alignment
  </div>
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 
+
(NWS (optimal global) and SW (optimal local) algorithms, alignment via EMBOSS tools in practice, interpretation of alignments)
  {{Vspace}}
+
</div>
 
 
<div class="keywords">
 
<b>Keywords:</b>&nbsp;
 
NWS (optimal global) and SW (optimal local) algorithms, alignment via EMBOSS tools in practice, interpretation of alignments
 
 
</div>
 
</div>
  
{{Vspace}}
+
{{Smallvspace}}
 
 
 
 
__TOC__
 
  
{{Vspace}}
 
  
 
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
{{LIVE}}
+
<div style="font-size:118%;">
 
+
<b>Abstract:</b><br />
{{Vspace}}
 
 
 
 
 
</div>
 
<div id="ABC-unit-framework">
 
== Abstract ==
 
 
<section begin=abstract />
 
<section begin=abstract />
<!-- included from "./components/BIN-ALI-Optimal_sequence_alignment.components.txt", section: "abstract" -->
 
 
This unit covers the concepts and algorithms for optimal pairwise sequence alignments.
 
This unit covers the concepts and algorithms for optimal pairwise sequence alignments.
 
<section end=abstract />
 
<section end=abstract />
 
+
</div>
{{Vspace}}
+
<!-- ============================ -->
 
+
<hr>
 
+
<table>
== This unit ... ==
+
<tr>
=== Prerequisites ===
+
<td style="padding:10px;">
<!-- included from "./components/BIN-ALI-Optimal_sequence_alignment.components.txt", section: "prerequisites" -->
+
<b>Objectives:</b><br />
<!-- included from "./data/ABC-unit_components.txt", section: "notes-prerequisites" -->
 
You need to complete the following units before beginning this one:
 
*[[BIN-ALI-Alignment|BIN-ALI-Alignment (Sequence alignment concepts)]]
 
*[[BIN-ALI-Similarity|BIN-ALI-Similarity (Measuring Sequence Similarity)]]
 
 
 
{{Vspace}}
 
 
 
 
 
=== Objectives ===
 
<!-- included from "./components/BIN-ALI-Optimal_sequence_alignment.components.txt", section: "objectives" -->
 
 
This unit will ...
 
This unit will ...
 
* ... discuss  how  homology is inferred from optimal sequence alignments, by using scoring matrices that represent an evolutionary relationship;
 
* ... discuss  how  homology is inferred from optimal sequence alignments, by using scoring matrices that represent an evolutionary relationship;
Line 54: Line 29:
 
* ... teach the difference between global and local optimal alignment and in which situation these algorithms are appropriately used;
 
* ... teach the difference between global and local optimal alignment and in which situation these algorithms are appropriately used;
 
* ... demonstrate how to calculate optimal sequence alignments with online EMBOSS tools, and in R code with the Biostrings package.;
 
* ... demonstrate how to calculate optimal sequence alignments with online EMBOSS tools, and in R code with the Biostrings package.;
 +
</td>
 +
<td style="padding:10px;">
 +
<b>Outcomes:</b><br />
 +
After working through this unit you ...
 +
* ... can produce and interpret optimal sequence alignments, online, and in R code.
 +
</td>
 +
</tr>
 +
</table>
 +
<!-- ============================  -->
 +
<hr>
 +
<b>Deliverables:</b><br />
 +
<section begin=deliverables />
 +
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 +
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 +
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
;Your protein database: Add APSES domain annotations for MBP1_MYSPE proteins to your database.
 +
<section end=deliverables />
 +
<!-- ============================  -->
 +
<hr>
 +
<section begin=prerequisites />
 +
<b>Prerequisites:</b><br />
 +
This unit builds on material covered in the following prerequisite units:<br />
 +
*[[BIN-ALI-Alignment|BIN-ALI-Alignment (Sequence alignment concepts)]]
 +
*[[BIN-ALI-Similarity|BIN-ALI-Similarity (Measuring Sequence Similarity)]]
 +
<section end=prerequisites />
 +
<!-- ============================  -->
 +
</div>
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Outcomes ===
 
<!-- included from "./components/BIN-ALI-Optimal_sequence_alignment.components.txt", section: "outcomes" -->
 
After working through this unit you ...
 
* ... can produce and interpret optimal sequence alignments, online, and in R code.
 
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Deliverables ===
+
__TOC__
<!-- included from "./components/BIN-ALI-Optimal_sequence_alignment.components.txt", section: "deliverables" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.
 
<!-- included from "./data/ABC-unit_components.txt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].
 
;Your protein database: Add APSES domain annotations for MBP1_MYSPE proteins to your database.
 
  
 
{{Vspace}}
 
{{Vspace}}
  
  
</div>
+
=== Evaluation ===
<div id="BIO">
+
<b>Evaluation: NA</b><br />
 +
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>
 
== Contents ==
 
== Contents ==
<!-- included from "./components/BIN-ALI-Optimal_sequence_alignment.components.txt", section: "contents" -->
 
  
 
== Pairwise Alignments: Optimal ==
 
== Pairwise Alignments: Optimal ==
Line 89: Line 79:
  
 
{{Task|1=
 
{{Task|1=
*Read the introductory notes on {{ABC-PDF|BIN-ALI-Optimal_sequence_alignment|concpets of optimal sequence alignment}}.
+
*Read the introductory notes on {{ABC-PDF|BIN-ALI-Optimal_sequence_alignment|concepts of optimal sequence alignment}}.
 
}}
 
}}
  
Line 108: Line 98:
 
* Fetch the sequences for <code>MBP1_SACCE</code> and <code>MBP1_MYSPE</code> from your database that you have prepared in the [[BIN-Storing_data]] unit. Open the RStudio project and enter the code below - substituting the proper name for MYSPE where appropriate.
 
* Fetch the sequences for <code>MBP1_SACCE</code> and <code>MBP1_MYSPE</code> from your database that you have prepared in the [[BIN-Storing_data]] unit. Open the RStudio project and enter the code below - substituting the proper name for MYSPE where appropriate.
  
<source lang="R">
+
<pre>
 
source("makeProteinDB.R")
 
source("makeProteinDB.R")
  
Line 119: Line 109:
 
myDB$protein$RefSeqID[sel]
 
myDB$protein$RefSeqID[sel]
  
</source>
+
</pre>
  
(If this didn't work, fix it. Did you give your sequence the right '''name'''?)
+
(If this didn't work, fix the problem. Did you give your sequence the right '''name''' in your database?)
  
 
# Access the [https://www.ebi.ac.uk/Tools/emboss/ EMBOSS tools page] at the EBI.
 
# Access the [https://www.ebi.ac.uk/Tools/emboss/ EMBOSS tools page] at the EBI.
Line 137: Line 127:
 
# Study the results. You will find that the alignment extends over the entire protein, likely with significant ''indels'' at the termini.
 
# Study the results. You will find that the alignment extends over the entire protein, likely with significant ''indels'' at the termini.
 
}}
 
}}
 
  
  
Line 155: Line 144:
  
 
{{Vspace}}
 
{{Vspace}}
 
 
{{Vspace}}
 
 
  
 
== Further reading, links and resources ==
 
== Further reading, links and resources ==
Line 165: Line 150:
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
{{Vspace}}
 
 
 
 
== Notes ==
 
== Notes ==
<!-- included from "./components/BIN-ALI-Optimal_sequence_alignment.components.txt", section: "notes" -->
 
<!-- included from "./data/ABC-unit_components.txt", section: "notes" -->
 
 
<references />
 
<references />
  
 
{{Vspace}}
 
{{Vspace}}
  
 
</div>
 
<div id="ABC-unit-framework">
 
== Self-evaluation ==
 
<!-- included from "./components/BIN-ALI-Optimal_sequence_alignment.components.txt", section: "self-evaluation" -->
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
 
{{Vspace}}
 
 
 
 
{{Vspace}}
 
 
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 225: Line 164:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-08-05
+
:2020-09-24
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:1.0
+
:1.1
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.1 2020 Updates
 
*1.0 First live
 
*1.0 First live
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "./data/ABC-unit_components.txt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{LIVE}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->

Latest revision as of 11:28, 25 September 2020

Optimal global and local sequence alignment

(NWS (optimal global) and SW (optimal local) algorithms, alignment via EMBOSS tools in practice, interpretation of alignments)


 


Abstract:

This unit covers the concepts and algorithms for optimal pairwise sequence alignments.


Objectives:
This unit will ...

  • ... discuss how homology is inferred from optimal sequence alignments, by using scoring matrices that represent an evolutionary relationship;
  • ... introduce how dynamic programming alignment works: by optimizing the sum of (context-independent) pair scores, using an affine gap model for indels, and backtracking to reconstruct an alignment from contributing cells in the path matrix;
  • ... point out problems associated with affine gap functions and how parameter choice influences size and distribution of indels;
  • ... teach the difference between global and local optimal alignment and in which situation these algorithms are appropriately used;
  • ... demonstrate how to calculate optimal sequence alignments with online EMBOSS tools, and in R code with the Biostrings package.

Outcomes:
After working through this unit you ...

  • ... can produce and interpret optimal sequence alignments, online, and in R code.

Deliverables:

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.
  • Your protein database
    Add APSES domain annotations for MBP1_MYSPE proteins to your database.

    Prerequisites:
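    This unit builds on material covered in the following prerequisite units:

    • BIN-ALI-Alignment (Sequence alignment concepts)
    • BIN-ALI-Similarity (Measuring Sequence Similarity)
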

    Evaluation

    Evaluation: NA

    This unit is not evaluated for course marks.

    Contents

    Pairwise Alignments: Optimal

     

    Task:

    • Read the introductory notes on concepts of optimal sequence alignment.


    Optimal pairwise sequence alignment is the mainstay of sequence comparison. To try our first alignments in practice, we will start with aligning Mbp1 and its MYSPE relative. For simplicity, I will call the two proteins MBP1_SACCE and MBP1_MYSPE through the remainder of the unit.


     

    Optimal Sequence Alignment: EMBOSS online tools

     

    EMBOSS tools are a collection of standard sequence analysis programs. The most important ones are hosted at the EBI, but the EMBOSS explorer site hosts many more. They offer Needleman-Wunsch and Smith-Waterman alignments.
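As a rough illustration of what these two algorithms compute, the following base R sketch fills a dynamic programming path matrix and returns the optimal global or local score. It is not how EMBOSS is implemented: the function name `alignScore`, the match/mismatch scores, and the linear gap penalty are assumptions made for brevity, whereas the EMBOSS tools use a substitution matrix and affine gap costs.

```r
# Minimal sketch: optimal alignment scores by dynamic programming.
# Assumptions (not from this unit): match = +1, mismatch = -1, and a
# linear gap penalty of 2 per gap position.
alignScore <- function(a, b, local = FALSE,
                       match = 1, mismatch = -1, gap = 2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # Path matrix with one extra row and column for the empty prefix.
  S <- matrix(0, nrow = n + 1, ncol = m + 1)
  if (!local) {
    # Global alignment: leading gaps are penalized.
    S[, 1] <- -gap * (0:n)
    S[1, ] <- -gap * (0:m)
  }
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      pair <- if (x[i] == y[j]) match else mismatch
      best <- max(S[i, j] + pair,      # align x[i] with y[j]
                  S[i, j + 1] - gap,   # x[i] opposite a gap in b
                  S[i + 1, j] - gap)   # y[j] opposite a gap in a
      # Local alignment: a cell never drops below zero.
      S[i + 1, j + 1] <- if (local) max(0, best) else best
    }
  }
  # Global: score of the last cell. Local: best cell anywhere.
  if (local) max(S) else S[n + 1, m + 1]
}

alignScore("HEAGAWGHEE", "PAWHEAE")                # global score
alignScore("HEAGAWGHEE", "PAWHEAE", local = TRUE)  # local score
```

The only differences between the two variants are the initialization of the first row and column, the floor at zero, and where the final score is read. This is why a local alignment can ignore poorly matching termini while a global alignment must account for them.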


    Task:

    • Fetch the sequences for MBP1_SACCE and MBP1_MYSPE from your database that you have prepared in the BIN-Storing_data unit. Open the RStudio project and enter the code below - substituting the proper name for MYSPE where appropriate.
    source("makeProteinDB.R")
    
    # Print the MBP1_SACCE sequence
    sel <- myDB$protein$name == "MBP1_SACCE"
    myDB$protein$sequence[sel]
    
    # Print the MBP1_MYSPE sequence
    sel <- myDB$protein$name == paste0("MBP1_", biCode(MYSPE))
    myDB$protein$sequence[sel]
    
    

    (If this didn't work, fix the problem. Did you give your sequence the right name in your database?)

    1. Access the EMBOSS tools page at the EBI.
    2. Look for Water, click on protein, paste your sequences and run the program with default parameters.
    3. Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
    4. Considering the sequence identity cutoff we discussed in class (25% over the length of a domain), do you believe that the N-terminal domains (the APSES domains) are homologous?
    5. Change the Gap opening and Gap extension parameters to high values (e.g. 25 and 5). Then run the alignment again.
    6. Note what is different.
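To see why raising the gap parameters changes the alignment, it can help to tabulate what a single gap costs under each setting. The sketch below assumes the convention that the first gap position costs the opening penalty and every further position costs the extension penalty. The function name `gapCost` is hypothetical, and the "default" values of 10 and 0.5 are an assumption that should be checked against the Water form.

```r
# Sketch of an affine gap cost.
# Assumption: the first gap position costs `open`, each additional
# position costs `extend`.
gapCost <- function(len, open, extend) {
  stopifnot(len >= 1)
  open + (len - 1) * extend
}

lens <- c(1, 5, 20)
data.frame(
  length   = lens,
  default  = gapCost(lens, open = 10, extend = 0.5),  # assumed defaults
  stricter = gapCost(lens, open = 25, extend = 5)     # values from step 5
)
```

Under these assumptions a long gap becomes far more expensive with the higher parameters, so the optimal alignment should contain fewer and shorter indels, and a local alignment may shrink to the best-conserved region.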


    Global optimal sequence alignment using "needle"

    Task:

    1. Look for Needle, click on protein, paste the MBP1_SACCE and MBP1_MYSPE sequences again and run the program with default parameters.
    2. Study the results. You will find that the alignment extends over the entire protein, likely with significant indels at the termini.
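When comparing an alignment against the 25% identity cutoff mentioned in the Water task, it matters which columns are counted. The following sketch computes percent identity from two aligned strings copied out of an alignment. The helper name `percentIdentity` and the example fragments are hypothetical, and counting over all alignment columns (gaps included in the denominator) is an assumption, not necessarily the definition a given tool reports.

```r
# Sketch: percent identity of two aligned sequences of equal length,
# with "-" marking gap positions.
# Assumption: the denominator is the full alignment length.
percentIdentity <- function(alnA, alnB) {
  a <- strsplit(alnA, "")[[1]]
  b <- strsplit(alnB, "")[[1]]
  stopifnot(length(a) == length(b), length(a) > 0)
  same <- a == b & a != "-"   # identical residues, never gap vs. gap
  100 * sum(same) / length(a)
}

# Hypothetical aligned fragments, for illustration only.
percentIdentity("MSNQIYSARYS", "MS-QVYSAKYS")
```

Because a global alignment includes long terminal indels, its overall identity can be much lower than the identity within a single well-aligned domain, so restrict the strings to the domain of interest before applying a per-domain cutoff.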


     


    Optimal Sequence Alignment with R: Biostrings

     

    Biostrings has extensive functions for sequence alignments. They are generally well written and tightly integrated with the rest of Bioconductor's functions. There are a few quirks, however: for example, alignments won't work with lower-case sequences[1].
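One way to guard against the lower-case quirk is to normalize sequences before passing them to an alignment function. The base R sketch below is an illustration only: `cleanSeq` is a hypothetical helper, not part of Biostrings, and restricting input to the 20 standard one-letter amino acid codes is an assumption that may be too strict for sequences containing ambiguity codes.

```r
# Sketch: normalize a protein sequence before alignment.
# `cleanSeq` is a hypothetical helper, not a Biostrings function.
cleanSeq <- function(s) {
  # Remove whitespace (including line breaks) and force upper case.
  s <- toupper(gsub("[[:space:]]+", "", s))
  # Sanity check: only the 20 standard amino acid one-letter codes.
  if (!grepl("^[ACDEFGHIKLMNPQRSTVWY]+$", s)) {
    stop("sequence contains unexpected characters")
  }
  s
}

cleanSeq("msnqiysarys\nGVDVYEFIHS")
```

The explicit character check serves the same purpose as the sanity check suggested in the note: text that does not belong in the sequence, such as a FASTA header line, causes an error instead of being silently aligned.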


     

    Task:

     
    • Open RStudio and load the ABC-units R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit.
    • Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
    • Type init() if requested.
    • Open the file BIN-ALI-Optimal_sequence_alignment.R and follow the instructions.


     

    Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of the code.


     


     

    Further reading, links and resources

    Fitch (2000) Homology: a personal view on some of the problems. Trends Genet 16:227-31. (pmid: 10782117)

    There are many problems relating to defining the terminology used to describe various biological relationships and getting agreement on which definitions are best. Here, I examine 15 terminological problems, all of which are current, and all of which relate to the usage of homology and its associated terms. I suggest a set of definitions that are intended to be totally consistent among themselves and also as consistent as possible with most current usage.

    Notes

    1. While this seems like an unnecessary limitation, given that we could easily write such code to transform to-upper when looking up values in the MDM, perhaps it is meant as an additional sanity check that we haven't inadvertently included text in the sequence that does not belong there, such as the FASTA header line.


     


    About ...
     
    Author:

    Boris Steipe <boris.steipe@utoronto.ca>

    Created:

    2017-08-05

    Modified:

    2020-09-24

    Version:

    1.1

    Version history:

    • 1.1 2020 Updates
    • 1.0 First live
    • 0.1 First stub

    This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.