Difference between revisions of "BIN-ALI-Similarity"

Line 19: Line 19:
  
  
{{DEV}}
+
{{LIVE}}
  
 
{{Vspace}}
 
{{Vspace}}
Line 29: Line 29:
 
<section begin=abstract />
 
<section begin=abstract />
 
<!-- included from "../components/BIN-ALI-Similarity.components.wtxt", section: "abstract" -->
 
<!-- included from "../components/BIN-ALI-Similarity.components.wtxt", section: "abstract" -->
...
+
In order to compare protein sequences quantitatively, we must define how to measure the similarity of two amino acids. This can be done according to biophysical considerations, or empirically, based on the propensity of amino acids to substitute for each other in homologous sequences. "Mutation Data Matrices" make this information conveniently available.
 
<section end=abstract />
 
<section end=abstract />
  
Line 40: Line 40:
 
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
 
You need to complete the following units before beginning this one:
 
You need to complete the following units before beginning this one:
*[[RPR-Biostrings]]
+
*[[RPR-Biostrings|RPR-Biostrings (The biostrings R Package)]]
  
 
{{Vspace}}
 
{{Vspace}}
Line 47: Line 47:
 
=== Objectives ===
 
=== Objectives ===
 
<!-- included from "../components/BIN-ALI-Similarity.components.wtxt", section: "objectives" -->
 
<!-- included from "../components/BIN-ALI-Similarity.components.wtxt", section: "objectives" -->
...
+
This unit will ...
 +
* ... introduce issues of defining amino acid similarity;
 +
* ... teach how to use the amino acid property tables from the seqinr package;
 +
* ... teach the use of mutation data matrices from the Biostrings package.
  
 
{{Vspace}}
 
{{Vspace}}
Line 54: Line 57:
 
=== Outcomes ===
 
=== Outcomes ===
 
<!-- included from "../components/BIN-ALI-Similarity.components.wtxt", section: "outcomes" -->
 
<!-- included from "../components/BIN-ALI-Similarity.components.wtxt", section: "outcomes" -->
...
+
After working through this unit you ...
 +
* ... can access and work with amino acid property tables from the seqinr package;
 +
* ... can access and work with mutation data matrices from the Biostrings package, in particular BLOSUM62.
  
 
{{Vspace}}
 
{{Vspace}}
Line 89: Line 94:
 
}}
 
}}
  
 
=== DotPlots and the Mutation Data Matrix ===
 
 
Before we start calculating alignments, we should get a better sense of the underlying sequence similarity. A Dotplot is a perfect tool for that, because it displays alignment-free similarity information. (For a deeper introduction into dotplots, see the [[BIN-ALI-Dotplot|Dotplots learning unit]]). Let's make a dotplot that uses the BLOSUM62 Mutation Data Matrix to measure pairwise amino acid similarity. The NCBI makes its alignment matrices available by ftp. They are located at  ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 '''BLOSUM62 matrix''']<ref>That directory also contains sourcecode to generate the PAM matrices. This may be of interest if you ever want to produce scoring matrices from your own datasets.</ref>.
 
  
 
{{Vspace}}
 
{{Vspace}}
  
 
The NCBI makes its alignment matrices available by ftp. They are located at  ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 '''BLOSUM62 matrix''']<ref>That directory also contains sourcecode to generate the PAM matrices. This may be of interest if you ever want to produce scoring matrices from your own datasets.</ref>.
 
The NCBI makes its alignment matrices available by ftp. They are located at  ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 '''BLOSUM62 matrix''']<ref>That directory also contains sourcecode to generate the PAM matrices. This may be of interest if you ever want to produce scoring matrices from your own datasets.</ref>.
 
Scoring matrices are also available in the Bioconductor Biostrings package.
 
  
 
<source lang="text">
 
<source lang="text">
Line 144: Line 143:
 
{{Vspace}}
 
{{Vspace}}
  
Next, let's apply the scoring matrix for actual comparison:
+
{{ABC-Unit|BIN-ALI-Similarity.R}}
 
 
{{Vspace}}
 
 
 
{{task|1 =
 
 
 
* Return to your RStudio session.
 
* If you've been away from it for a while, it's probably a good idea to update to the newest versions of scripts and data by pulling from the master file on GitHub.
 
* Study and work through the code in the <code>Dotplot and MDM</code> section of the <code>BCH441_A04.R</code> script
 
 
 
}}
 
  
 
{{Vspace}}
 
{{Vspace}}
Line 165: Line 154:
  
 
*{{#pmid: 15286655}}
 
*{{#pmid: 15286655}}
 +
*{{WP|BLOSUM|'''BLOSUM''' article at Wikipedia}} (Good article.)
  
  
Line 226: Line 216:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-08-05
+
:2017-10-20
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:0.1
+
:1.0
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.0 First live version
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>

Revision as of 03:37, 23 October 2017

Measuring Sequence Similarity


 

Keywords:  sequence similarity: measurement via MDM; BLOSUM62 matrix; affine gap penalties


 



 


 


Abstract

In order to compare protein sequences quantitatively, we must define how to measure the similarity of two amino acids. This can be done according to biophysical considerations, or empirically, based on the propensity of amino acids to substitute for each other in homologous sequences. "Mutation Data Matrices" make this information conveniently available.
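
As background for the empirical approach: matrices such as BLOSUM are usually described as log-odds scores (this is the subject of the Eddy 2004 article listed under further reading). A sketch of the general form, with symbols introduced here for illustration only: p_ij is the frequency with which residues i and j are observed aligned in trusted alignments, q_i and q_j are their background frequencies, and λ is a scaling factor.

```latex
s_{ij} = \frac{1}{\lambda} \, \log \frac{p_{ij}}{q_i \, q_j}
```

A positive score means the pair is observed more often than expected by chance, a negative score less often. BLOSUM62 is commonly reported in half-bit units, i.e. 2 log2 of the ratio, rounded to the nearest integer.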


 


This unit ...

Prerequisites

You need to complete the following units before beginning this one:

  • RPR-Biostrings (The biostrings R Package)


 


Objectives

This unit will ...

  • ... introduce issues of defining amino acid similarity;
  • ... teach how to use the amino acid property tables from the seqinr package;
  • ... teach the use of mutation data matrices from the Biostrings package.


 


Outcomes

After working through this unit you ...

  • ... can access and work with amino acid property tables from the seqinr package;
  • ... can access and work with mutation data matrices from the Biostrings package, in particular BLOSUM62.


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents


 

The NCBI makes its alignment matrices available by FTP at ftp://ftp.ncbi.nih.gov/blast/matrices; for example, that directory includes the BLOSUM62 matrix[1].

BLOSUM62

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1 -1 -1 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  4 -3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4 -3  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0 -2  4 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1 -3  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0 -3  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3  3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4  3 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0 -3  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3  2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3  0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0 -2  0 -1 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -1 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3  2 -2 -1 -4
B -2 -1  4  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4 -3  0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3  3  3 -3  2  0 -3 -2 -1 -2 -1  2 -3  3 -3 -1 -4
Z -1  0  0  1 -3  4  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -2 -2 -2  0 -3  4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
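
To make the table concrete, here is a minimal sketch in Python (an illustration only; the course scripts themselves use R and the Biostrings package). It parses a matrix in the layout shown above into a nested dictionary and looks up pair scores. The names BLOSUM62_TEXT, parse_matrix and scores are hypothetical; the values are copied from the table above.

```python
# Matrix text copied from the table above; the first row is the column header.
BLOSUM62_TEXT = """\
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1 -1 -1 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  4 -3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4 -3  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0 -2  4 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1 -3  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0 -3  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3  3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4  3 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0 -3  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3  2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3  0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0 -2  0 -1 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -1 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3  2 -2 -1 -4
B -2 -1  4  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4 -3  0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3  3  3 -3  2  0 -3 -2 -1 -2 -1  2 -3  3 -3 -1 -4
Z -1  0  0  1 -3  4  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -2 -2 -2  0 -3  4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1
"""


def parse_matrix(text):
    """Parse a whitespace-delimited scoring matrix into {row: {column: score}}."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()
    matrix = {}
    for line in lines[1:]:
        fields = line.split()
        row, values = fields[0], [int(v) for v in fields[1:]]
        if len(values) != len(columns):
            raise ValueError(f"row {row!r}: expected {len(columns)} scores")
        matrix[row] = dict(zip(columns, values))
    return matrix


scores = parse_matrix(BLOSUM62_TEXT)

# The matrix is symmetric: the order of the two residues does not matter.
assert all(scores[a][b] == scores[b][a] for a in scores for b in scores)

print(scores["W"]["W"])  # identity score for tryptophan: 11
print(scores["D"]["E"])  # aspartate vs. glutamate: 2
```

A nested dictionary keeps lookups readable (`scores["D"]["E"]`); a two-dimensional array with an index per residue letter would work equally well.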


Task:

  • Study this table and make sure you understand what it is, how it can be used, and what range of values is reasonable for identities and for pair scores of non-identical residues, whether similar or dissimilar. Ask on the mailing list if you have questions. This piece of data is the foundation of any sequence alignment: without it, no sensible alignment could be produced!
  • Figure out the following values:
    • Compare an identical match of histidine with an identical match of serine. What does this mean?
    • How similar are lysine and leucine, as compared to leucine and isoleucine? Is this what you expect?
    • PAM matrices are sensitive to an interesting artefact. Since W and R can be interchanged with a single point mutation, the probability of observing W→R and R→W exchanges in closely related sequences is much higher than one would expect from the two amino acids' biophysical properties. (Why?) PAM matrices were compiled from hypothetical point exchanges and then extrapolated. Therefore these matrices assign a relatively high degree of similarity to (W, R) that is not warranted considering what actually happens in nature. Do you see this problem in the BLOSUM matrix? If BLOSUM does not have this issue, why not?
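
One way to see how such a table is used: the score of an ungapped alignment is the sum of the pair scores of its columns. The following Python sketch (an illustration, not the unit's R script) uses a small excerpt of the BLOSUM62 values from the table above. PAIR_SCORES, pair_score and ungapped_score are hypothetical names, and the two peptides are made up.

```python
# Excerpt of BLOSUM62 pair scores, copied from the table above.
# Keys are stored with the two residue letters in alphabetical order.
PAIR_SCORES = {
    ("H", "H"): 8, ("S", "S"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("K", "K"): 5, ("R", "R"): 5, ("W", "W"): 11,
    ("I", "L"): 2, ("K", "L"): -2, ("K", "R"): 2, ("R", "W"): -3,
}


def pair_score(a, b):
    """Look up a pair score; the matrix is symmetric, so order is ignored."""
    return PAIR_SCORES[tuple(sorted((a, b)))]


def ungapped_score(seq1, seq2):
    """Sum the pair scores over the columns of an ungapped alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


# Conservative exchanges (L/I, K/R) keep the total high ...
print(ungapped_score("HLKS", "HIRS"))  # 8 + 2 + 2 + 4 = 16
# ... while a dissimilar pair (L/K) lowers it.
print(ungapped_score("HLKS", "HKRS"))  # 8 - 2 + 2 + 4 = 12
```

Residue pairs outside the excerpt raise a KeyError here; a full implementation would load the complete matrix instead.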


 

Template:ABC-Unit


 


 


Further reading, links and resources

  • Eddy (2004) Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 22:1035-6. (pmid: 15286655)

    Many sequence alignment programs use the BLOSUM62 score matrix to score pairs of aligned residues. Where did BLOSUM62 come from?


     


    Notes

    1. That directory also contains source code to generate the PAM matrices. This may be of interest if you ever want to produce scoring matrices from your own datasets.


     


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-10-20

Version:

1.0

Version history:

  • 1.0 First live version
  • 0.1 First stub

This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License.