Difference between revisions of "BIN-ALI-Similarity"
m (Created page with "<div id="BIO"> <div class="b1"> Measuring Sequence Similarity </div> {{Vspace}} <div class="keywords"> <b>Keywords:</b> sequence similarity: measurement via MD...") |
m |
||
(7 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | <div id=" | + | <div id="ABC"> |
− | + | <div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;"> | |
Measuring Sequence Similarity | Measuring Sequence Similarity | ||
− | + | <div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; "> | |
− | + | (sequence similarity: measurement via MDM; BLOSUM 62 matrix, affine gap penalties) | |
− | + | </div> | |
− | |||
− | |||
− | |||
− | sequence similarity: measurement via MDM; BLOSUM 62 matrix, affine gap penalties | ||
</div> | </div> | ||
− | {{ | + | {{Smallvspace}} |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
+ | <div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;"> | ||
+ | <div style="font-size:118%;"> | ||
+ | <b>Abstract:</b><br /> | ||
+ | <section begin=abstract /> | ||
+ | In order to compare protein sequences quantitatively, we must define how to measure the similarity of two amino acids. This can be done according to biophysical considerations, or empirically, based on the propensity of amino acids to substitute for each other in homologous sequences. "Mutation Data Matrices" make this information conveniently available. | ||
+ | <section end=abstract /> | ||
+ | </div> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <table> | ||
+ | <tr> | ||
+ | <td style="padding:10px;"> | ||
+ | <b>Objectives:</b><br /> | ||
+ | This unit will ... | ||
+ | * ... introduce issues of defining amino acid similarity; | ||
+ | * ... teach how to use the amino acid property tables included in the <tt>seqinr::</tt> package; | ||
+ | * ... teach the use of mutation data matrices from the Biostrings package. | ||
+ | </td> | ||
+ | <td style="padding:10px;"> | ||
+ | <b>Outcomes:</b><br /> | ||
+ | After working through this unit you ... | ||
+ | * ... can access and work with amino acid property tables from the <tt>seqinr::</tt> package; | ||
+ | * ... can access and work with mutation data matrices from the <tt>Biostrings::</tt> package, in particular <tt>BLOSUM62</tt>. | ||
+ | </td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <b>Deliverables:</b><br /> | ||
+ | <section begin=deliverables /> | ||
+ | <li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li> | ||
+ | <li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li> | ||
+ | <li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li> | ||
+ | <section end=deliverables /> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <section begin=prerequisites /> | ||
+ | <b>Prerequisites:</b><br /> | ||
+ | This unit builds on material covered in the following prerequisite units:<br /> | ||
+ | *[[RPR-Biostrings|RPR-Biostrings (The biostrings R Package)]] | ||
+ | <section end=prerequisites /> | ||
+ | <!-- ============================ --> | ||
</div> | </div> | ||
− | |||
− | |||
− | |||
− | |||
− | {{ | + | {{Smallvspace}} |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | {{ | + | {{Smallvspace}} |
− | + | __TOC__ | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
{{Vspace}} | {{Vspace}} | ||
Line 70: | Line 67: | ||
=== Evaluation === | === Evaluation === | ||
− | |||
− | |||
<b>Evaluation: NA</b><br /> | <b>Evaluation: NA</b><br /> | ||
− | :This unit is not evaluated for course marks. | + | <div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div> |
− | |||
− | |||
− | |||
− | |||
− | </div | ||
− | |||
== Contents == | == Contents == | ||
− | |||
− | |||
− | {{ | + | {{Task|1= |
+ | *Read the introductory notes on {{ABC-PDF|BIN-ALI-Similarity|the concepts behind quantifying amino acid sequence similarity}}. | ||
+ | }} | ||
− | |||
− | |||
− | |||
− | |||
− | |||
{{Vspace}} | {{Vspace}} | ||
+ | The NCBI makes its alignment matrices available by ftp. They are located at ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the [ftp://ftp.ncbi.nih.gov/blast/matrices/BLOSUM62 '''BLOSUM62 matrix''']<ref>That directory also contains sourcecode to generate the PAM matrices. This may be of interest if you ever want to produce scoring matrices from your own datasets.</ref>. | ||
− | = | + | <source lang="text"> |
− | |||
− | |||
− | |||
− | |||
− | |||
+ | BLOSUM62 | ||
− | + | A R N D C Q E G H I L K M F P S T W Y V B J Z X * | |
− | + | A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 -1 -1 -4 | |
− | + | R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2 0 -1 -4 | |
− | + | N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 4 -3 0 -1 -4 | |
− | + | D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 -3 1 -1 -4 | |
− | + | C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4 | |
+ | Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 -2 4 -1 -4 | ||
+ | E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 -3 4 -1 -4 | ||
+ | G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -4 -2 -1 -4 | ||
+ | H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 -3 0 -1 -4 | ||
+ | I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 3 -3 -1 -4 | ||
+ | L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 3 -3 -1 -4 | ||
+ | K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 -3 1 -1 -4 | ||
+ | M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 2 -1 -1 -4 | ||
+ | F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 0 -3 -1 -4 | ||
+ | P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4 | ||
+ | S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 -2 0 -1 -4 | ||
+ | T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 -1 -1 -4 | ||
+ | W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -2 -2 -1 -4 | ||
+ | Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -1 -2 -1 -4 | ||
+ | V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 2 -2 -1 -4 | ||
+ | B -2 -1 4 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 -3 0 -1 -4 | ||
+ | J -1 -2 -3 -3 -1 -2 -3 -4 -3 3 3 -3 2 0 -3 -2 -1 -2 -1 2 -3 3 -3 -1 -4 | ||
+ | Z -1 0 0 1 -3 4 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -2 -2 -2 0 -3 4 -1 -4 | ||
+ | X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4 | ||
+ | * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1 | ||
− | + | </source> | |
− | |||
− | |||
− | |||
− | |||
− | + | {{task| | |
− | + | * Study this and make sure you understand what this table is, how it can be used, and what a reasonable range of values for identities and pairscores for non-identical, similar and dissimilar residues is. Ask on the mailing list in case you have questions. '''This piece of data is the foundation of any sequence alignment. without it, no sensible alignment could be produced!''' | |
− | + | * Figure out the following values: | |
− | + | ** Compare an identical match of histidine with an identical match of serine. What does this mean? | |
− | + | ** How similar are lysine and leucine, as compared to leucine and isoleucine? Is this what you expect? | |
− | + | ** PAM matrices are sensitive to an interesting artefact. Since W and R can be interchanged with a single point mutation, the probability of observing W→R and R→W exchanges in '''closely related sequences''' is much higher than one would expect from the two amino acid's biophysical properties. (Why?) PAM matrices were compiled from hypothetical point exchanges and then extrapolated. Therefore these matrices assign a relatively high degree of similarity to (W, R), that is not warranted considering what actually happens in nature. Do you see this problem in the BLOSUM matrix? If BLOSUM does not have this issue, why not? | |
+ | }} | ||
{{Vspace}} | {{Vspace}} | ||
− | + | {{ABC-unit|BIN-ALI-Similarity.R}} | |
{{Vspace}} | {{Vspace}} | ||
+ | == Further reading, links and resources == | ||
− | + | *{{#pmid: 15286655}} | |
+ | *{{WP|BLOSUM|'''BLOSUM''' article at Wikipedia}} (Good article.) | ||
− | + | == Notes == | |
+ | <references /> | ||
{{Vspace}} | {{Vspace}} | ||
− | |||
− | |||
− | |||
− | |||
− | |||
<div class="about"> | <div class="about"> | ||
Line 151: | Line 147: | ||
:2017-08-05 | :2017-08-05 | ||
<b>Modified:</b><br /> | <b>Modified:</b><br /> | ||
− | : | + | :2020-09-24 |
<b>Version:</b><br /> | <b>Version:</b><br /> | ||
− | : | + | :1.1 |
<b>Version history:</b><br /> | <b>Version history:</b><br /> | ||
+ | *1.1 2020 Updates | ||
+ | *1.0 First live version | ||
*0.1 First stub | *0.1 First stub | ||
</div> | </div> | ||
− | |||
− | |||
{{CC-BY}} | {{CC-BY}} | ||
+ | [[Category:ABC-units]] | ||
+ | {{UNIT}} | ||
+ | {{LIVE}} | ||
</div> | </div> | ||
<!-- [END] --> | <!-- [END] --> |
Latest revision as of 11:10, 25 September 2020
Measuring Sequence Similarity
(sequence similarity: measurement via MDM; BLOSUM 62 matrix, affine gap penalties)
Abstract:
In order to compare protein sequences quantitatively, we must define how to measure the similarity of two amino acids. This can be done according to biophysical considerations, or empirically, based on the propensity of amino acids to substitute for each other in homologous sequences. "Mutation Data Matrices" make this information conveniently available.
Objectives:
|
Outcomes:
|
Deliverables:
Prerequisites:
This unit builds on material covered in the following prerequisite units:
Evaluation
Evaluation: NA
Contents
Task:
- Read the introductory notes on the concepts behind quantifying amino acid sequence similarity.
The NCBI makes its alignment matrices available by ftp. They are located at ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the BLOSUM62 matrix[1].
BLOSUM62
A R N D C Q E G H I L K M F P S T W Y V B J Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 -1 -1 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 4 -3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 -3 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 -2 4 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 -3 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 -3 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 3 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 -3 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 -2 0 -1 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -1 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 2 -2 -1 -4
B -2 -1 4 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 -3 0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3 3 3 -3 2 0 -3 -2 -1 -2 -1 2 -3 3 -3 -1 -4
Z -1 0 0 1 -3 4 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -2 -2 -2 0 -3 4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
Task:
- Study this and make sure you understand what this table is, how it can be used, and what a reasonable range of values for identities and pairscores for non-identical, similar and dissimilar residues is. Ask on the mailing list in case you have questions. This piece of data is the foundation of any sequence alignment. without it, no sensible alignment could be produced!
- Figure out the following values:
- Compare an identical match of histidine with an identical match of serine. What does this mean?
- How similar are lysine and leucine, as compared to leucine and isoleucine? Is this what you expect?
- PAM matrices are sensitive to an interesting artefact. Since W and R can be interchanged with a single point mutation, the probability of observing W→R and R→W exchanges in closely related sequences is much higher than one would expect from the two amino acid's biophysical properties. (Why?) PAM matrices were compiled from hypothetical point exchanges and then extrapolated. Therefore these matrices assign a relatively high degree of similarity to (W, R), that is not warranted considering what actually happens in nature. Do you see this problem in the BLOSUM matrix? If BLOSUM does not have this issue, why not?
Task:
- Open RStudio and load the
ABC-units
R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit. - Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type
init()
if requested. - Open the file
BIN-ALI-Similarity.R
and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
Further reading, links and resources
Eddy (2004) Where did the BLOSUM62 alignment score matrix come from?. Nat Biotechnol 22:1035-6. (pmid: 15286655) [ PubMed ] [ DOI ] Many sequence alignment programs use the BLOSUM62 score matrix to score pairs of aligned residues. Where did BLOSUM62 come from?
- BLOSUM article at Wikipedia (Good article.)
Notes
- ↑ That directory also contains sourcecode to generate the PAM matrices. This may be of interest if you ever want to produce scoring matrices from your own datasets.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2020-09-24
Version:
- 1.1
Version history:
- 1.1 2020 Updates
- 1.0 First live version
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.