Difference between revisions of "BIO Assignment Week 4"
m |
|||
Line 488: | Line 488: | ||
help(package = "Biostrings") | help(package = "Biostrings") | ||
+ | data(package = "Biostrings") | ||
+ | data(BLOSUM62) | ||
BLOSUM62 | BLOSUM62 |
Revision as of 11:39, 11 October 2015
Assignment for Week 4
Sequence alignment
< Assignment 3 | Assignment 5 > |
Note! This assignment is currently active. All significant changes will be announced on the mailing list.
Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.
Contents
Introduction
Sequence alignment is a very large, and important topic. In this assignment we will encounter the essentials of
- optimal global and local alignment;
- BLAST searches for best matches;
- PSI BLAST searches for exhaustive matches; and
- Multiple sequence alignments.
As usual, the focus will be on practical, hands on approaches.
This is the scenario in brief: you have identified a best match for a Mbp1 relative in YFO. HOw can you identify all related genes in YFO? And, what can you learn from this collection?
Optimal sequence alignments
Let's start by aligning the sequences of Mbp1 and the YFO relative. For simplicity, I will call the two proteins MBP1_SACCE
and MBP1_YFORG
through the remainder of the assignment, and even if I casually refer to a gene when I'm really talking about a protein (sorry), you should recognize from context what is meant.
Preparation: Updated Database Functions
First we need to pull out the two sequences from the database object we created last time. You could recreate the state of your database by re-running the relevant parts of the script, or piece things together from the code of the previous assignment.
Keeping things in scripts is really useful.
But since we'll be working more with our database, adding to the data model, updating code for getting and setting data, and adding proteins, annotations and cross-references, let's spend a moment to organize things in a more principled way.
- We should create a script that loads the functions to manage the database;
- We should save our database so we can easily reload the contents.
Task:
Here's how we should organize this:
- We'll define a variable called
PROJECTDIR
which automatically gets set whenever you startup R. - A scriptfile with the necessary functions should automatically get
source()
'd at startup; - The database should be saved so it can easily be loaded.
You will find the code below. It looks long, but it's really quite straightforward bookkeeping. I have added a number of tests to help make sure the input is sane. That actually makes up the majority of the code. Sanitizing user input is always much more effort than the actual algorithm. I have tested the functions and think they should work as expected. But if you come across a situation where your input produces an error, or creates an inconsistency in the database, by all means let me know so the code can be improved.
1. Create a project directory for the assignments on your computer if you don't have one yet.
2. Adapt the code below as needed, and execute it to update .Rprofile
.
file.edit("~/.Rprofile")
# Add:
PROJECTDIR <- "full/path/to/your/directory/" # including the final backslash.
source(paste(PROJECTDIR, "dbUtilities.R", sep=""))
# ... and save the file.
# To make the definition available, run it.
source("~/.Rprofile")
# Now let's create the script for the database functions:
file.edit(paste(PROJECTDIR, "dbUtilities.R", sep=""))
An edit window for the file has opened. Copy the entire code block below, and paste it into the editor.
# dbUtilities.R
#
# Purpose: Utility functions for a Protein datamodel
# Version: 0.1
# Date: Oct 2015
# Author: Boris and class
#
# ToDo: Add more tables.
# Accept either taxonomy_id OR species name
# and pull the other from NCBI.
# Notes: Cf. schema sketch at
# http://steipe.biochemistry.utoronto.ca/abc/index.php/File:ProteinDataModel.1.jpg
# Currently implements only "protein" and
# "taxonomy" table.
# ==========================================================
# ==== FUNCTIONS =========================================
# ==== createDB =============================================
# Returns an empty list
# We use a separate function because we might want to
# some initialization code later.
createDB <- function() {
return(list())
}
# ==== in2seq ==============================================
# Utility function to sanitize input and convert it into a
# sequence string. Case can be optionally changed.
# Letters that are not one-letter code - such as
# ambiguity codes - throw an error if not explicitly
# permitted.
in2seq <- function(s, uc = FALSE, lc = FALSE, noAmbig = TRUE) {
s <- paste(unlist(s), collapse="") # flatten whatever structure it has
s <- gsub("[^a-zA-Z]", "", s)
if (noAmbig) {
ambCodes <- "([bjouxzBJOUXZ])" # parentheses capture the match
ambChar <- unlist(regmatches(s, regexec(ambCodes, s)))[1]
if (! is.na(ambChar)) {
stop(paste("Input contains ambiguous letter: \"", ambChar, "\"", sep=""))
}
}
if (uc) { s <- toupper(s)}
if (lc) { s <- tolower(s)}
return(s)
}
# ==== in2vec ==============================================
# Utility function to sanitize input and expand it into a
# vector of single characters. Arguments for in2seq are
# passed through via the three-dots parameter syntax.
in2vec <- function(s, ...) {
s <- in2seq(s, ...)
return(unlist(strsplit(s, "")))
}
# ==== addToDB =============================================
# Add a new protein entry to the database, with associated
# taxonomy entry
addToDB <- function(database,
name = "",
refseq_id = "",
uniprot_id = "",
taxonomy_id,
genome_xref = numeric(),
genome_from = numeric(),
genome_to = numeric(),
sequence = "",
species_name = "") {
if (missing(database)) {
stop("\"database\" argument is missing with no default.")
}
if (missing(taxonomy_id)) {
stop("taxonomy_id argument is missing with no default.")
}
if (! is.numeric(taxonomy_id)) {
stop(paste("taxonomy_id \"",
taxonomy_id,
"\" is not numeric. Please correct.", sep=""))
}
# check taxonomy_id
if (! any(database$taxonomy$id == taxonomy_id)) { # new taxonomy_id
if (missing(species_name)) {
stop(paste("taxonomy_id",
taxonomy_id,
"is not yet in database, but species_name",
"is missing with no default."))
}
else {
# add this species to the taxonomy table
database$taxonomy <- rbind(database$taxonomy,
data.frame(id = taxonomy_id,
species_name = species_name,
stringsAsFactors = FALSE))
}
}
# handle protein
# pid is 1 if the table is empty, max() + 1 otherwise.
if (is.null(nrow(database$protein))) { pid <- 1 }
else {pid <- max(database$protein$id) + 1}
database$protein <- rbind(database$protein,
data.frame(id = pid,
name = name,
refseq_id = refseq_id,
uniprot_id = uniprot_id,
taxonomy_id = taxonomy_id,
genome_xref = genome_xref,
genome_from = genome_from,
genome_to = genome_to,
sequence = in2seq(sequence),
stringsAsFactors = FALSE))
return(database)
}
# ==== setDB ===============================================
# Update database values
setDB <- function(database,
table,
id = NULL,
name = NULL,
refseq_id = NULL,
uniprot_id = NULL,
taxonomy_id = NULL,
genome_xref = NULL,
genome_from = NULL,
genome_to = NULL,
sequence = NULL,
species_name = NULL) {
if (missing(database) | missing(table)) {
stop("Database or table is missing with no default.")
}
if (table == "protein") {
if (is.null(id)) {
stop("Protein id is missing with no default.")
}
row <- which(database$protein$id == id)
if (! is.null(name)) { database$protein[row, "name"] <- as.character(name) }
if (! is.null(refseq_id)) { database$protein[row, "refseq_id"] <- as.character(refseq_id) }
if (! is.null(uniprot_id)) { database$protein[row, "uniprot_id"] <- as.character(uniprot_id) }
if (! is.null(taxonomy_id)) {
# must be numeric ...
if (! is.numeric(taxonomy_id)) {
stop(paste("taxonomy_id",
taxonomy_id,
"is not numeric. Please correct."))
}
# must exist in taxonomy table ...
if (! any(database$taxonomy$id == taxonomy_id)) { # new taxonomy_id
stop(paste("taxonomy_id",
taxonomy_id,
"not found in taxonomy table. Please update taxonomy table and try again."))
}
# all good, update it...
database$protein[row, "taxonomy_id"] <- taxonomy_id
}
if (! is.null(genome_xref)) { database$protein[row, "genome_xref"] <- genome_xref}
if (! is.null(genome_from)) { database$protein[row, "genome_from"] <- genome_from}
if (! is.null(genome_to)) { database$protein[row, "genome_to"] <- genome_to}
if (! is.null(sequence)) { database$protein[row, "sequence"] <- in2seq(sequence)}
}
else if (table == "taxonomy") {
if (missing(taxonomy_id)) {
stop("taxonomy_id is missing with no default.")
}
if (! any(database$taxonomy$id == taxonomy_id)) {
stop(paste(" Can't set values for this taxonomy_id.",
taxonomy_id,
"was not found in taxonomy table."))
}
row <- which(database$taxonomy$id == taxonomy_id)
if (species_name != "") { database$taxonomy[row, "species_name"] <- species_name }
}
else {
stop(paste("This function has no code to update table \"",
table,
"\". Please enter a valid table name."))
}
return(database)
}
# ==== getDBid =============================================
# Get a vector of IDs from a database table from all rows
# for which all of the requested attributes are true.
# Note: if no restrictions are entered, ALL ids are returned.
# We don't have code to select from genome coordinates, or
# query from sequence.
getDBid <- function(database,
table,
name = NULL,
refseq_id = NULL,
uniprot_id = NULL,
taxonomy_id = NULL,
species_name = NULL) {
if (missing(database) | missing(table)) {
stop("Database or table is missing with no default.")
}
if (table == "protein") {
sel <- rep(TRUE, nrow(database$protein)) # initialize
if (! is.null(name) ) { sel <- sel & database$protein[, "name"] == name }
if (! is.null(refseq_id) ) { sel <- sel & database$protein[, "refseq_id"] == refseq_id }
if (! is.null(uniprot_id) ) { sel <- sel & database$protein[, "uniprot_id"] == uniprot_id }
if (! is.null(taxonomy_id)) { sel <- sel & database$protein[, "taxonomy_id"] == taxonomy_id }
sel <- db$protein$id[sel] # get ids by selecting from vector
}
else if (table == "taxonomy") {
sel <- rep(TRUE, nrow(database$taxonomy)) # initialize
if (! is.null(taxonomy_id) ) { sel <- sel & database$taxonomy[, "id"] == taxonomy_id }
if (! is.null(species_name)) { sel <- sel & database$taxonomy[, "species_name"] == species_name }
sel <- db$taxonomy$id[sel] # get ids by selecting from vector
}
else {
stop(paste("This function has no code to select from table \"",
table,
"\". Please enter a valid table name."))
}
return(sel)
}
# ==== getSeq ==============================================
# Retrieve the sequences for given id matches from the
# protein table. Uppercase, to make Biostrings happy.
getSeq <- function(database, ...) {
if (missing(database)) {
stop("Database argument is missing with no default.")
}
ids <- getDBid(database, table= "protein", ...)
seq <- db$protein[ids, "sequence"]
return(toupper(seq))
}
# ==== MESSAGE ============================================
cat("db_utilities.R has been loaded. The following functions are now available:\n")
cat(" createDB()\n")
cat(" addToDB()\n")
cat(" setDB()\n")
cat(" getDBid()\n")
cat(" getSeq()\n")
cat(" in2seq()\n")
cat(" in2vec()\n")
cat(" \n")
# ==== TESTS =============================================
# TBD
# [END]
Save dbUtilities.R and source()
it to make the functions immediately available. They will also be available when you next start R.
source(paste(PROJECTDIR, "dbUtilities.R", sep=""))
We now have a first set of somewhat credible database functions. Let's create a database and add two proteins.
db <- createDB()
db <- addToDB(db,
name = "Mbp1",
refseq_id = "NP_010227",
uniprot_id = "P39678",
taxonomy_id = 4932,
genome_xref = "NC_001136.10",
genome_from = 352877,
genome_to = 355378,
sequence = "
1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
",
species_name = "Saccharomyces cerevisiae")
db <- addToDB(db,
name = "Res2",
refseq_id = "NP_593032",
uniprot_id = "P41412",
taxonomy_id = 4896,
genome_xref = "NC_003424.3",
genome_from = 686543,
genome_to = 689179,
sequence = "
1 maprssavhv avysgvevye cfikgvsvmr rrrdswlnat qilkvadfdk pqrtrvlerq
61 vqigahekvq ggygkyqgtw vpfqrgvdla tkykvdgims pilsldideg kaiapkkkqt
121 kqkkpsvrgr rgrkpsslss stlhsvnekq pnssisptie ssmnkvnlpg aeeqvsatpl
181 paspnallsp ndntikpvee lgmleapldk yeeslldffl hpeegripsf lyspppdfqv
241 nsvidddght slhwacsmgh iemiklllra nadigvcnrl sqtplmrsvi ftnnydcqtf
301 gqvlellqst iyavdtngqs ifhhivqsts tpskvaaaky yldcilekli siqpfenvvr
361 lvnlqdsngd tslliaarng amdcvnslls ynanpsipnr qrrtaseyll eadkkphsll
421 qsnsnashsa fsfsgispai ispscsshaf vkaipsissk fsqlaeeyes qlrekeedli
481 ranrlkqdtl neisrtyqel tflqknnpty sqsmenlire aqetyqqlsk rlliwlearq
541 ifdlerslkp htslsisfps dflkkedgls lnndfkkpac nnvtnsdeye qlinkltslq
601 asrkkdtlyi rklyeelgid dtvnsyrrli amscginped lsleildave ealtrek
",
species_name = "Schizosaccharomyces pombe")
Now for YFO. Copy one of the samples above, edit it for the your Mbp1 homologue in YFO and add it to the database.
Then save the database, delete it and reload it:
save(db, file="proteinDB.RData") # write to file
rm(db) # remove
db # it's gone
load("proteinDB.RData") # read it back
db # verify
When that is done, we're ready to run some alignments.
Optimal Sequence Alignment at EMBOSS
Online programs for optimal sequence alignment are part of the EMBOSS tools. The programs take FASTA files or raw text files as input.
- Local optimal SEQUENCE alignment "water"
Task:
- Fetch the sequences for
MBP1_SACCE
andMBP1_YFORG
from your database. Something like:
getSeq(db, refseq_id = "NP_010227")
- Access the EMBOSS Explorer site (if you haven't done so yet, you might want to bookmark it.)
- Look for ALIGNMENT LOCAL, click on water, paste your sequences and run the program with default parameters.
- Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
- Considering the sequence identity cutoff we discussed in class (25% over the length of a domain), do you believe that the N-terminal domains (the APSES domains) are homologous?
- Change the Gap opening and Gap extension parameters to high values (e.g. 30 and 5). Then run the alignment again.
- Note what is different.
- Global optimal SEQUENCE alignment "needle"
Task:
- Look for ALIGNMENT GLOBAL, click on needle, paste the
MBP1_SACCE
andMBP1_YFORG
sequences again and run the program with default parameters. - Study the results. You will find that the alignment extends over the entire protein, likely with long indels at the termini.
The Mutation Data Matrix
The NCBI makes its alignment matrices available by ftp. They are located at ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the BLOSUM62 matrix[1].
Scoring matrices are also available in the Bioconductor Biostrings package.
if (!require(Biostrings, quietly=TRUE)) {
source("https://bioconductor.org/biocLite.R")
biocLite("Biostrings")
library(Biostrings)
}
help(package = "Biostrings")
data(package = "Biostrings")
data(BLOSUM62)
BLOSUM62
A R N D C Q E G H I L K M F P S T W Y V B J Z X *
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 -1 -1 -4
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2 0 -1 -4
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 4 -3 0 -1 -4
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 -3 1 -1 -4
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 -2 4 -1 -4
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 -3 4 -1 -4
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 -3 0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 3 -3 -1 -4
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 -3 1 -1 -4
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 -2 0 -1 -4
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -1 -2 -1 -4
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 2 -2 -1 -4
B -2 -1 4 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 -3 0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3 3 3 -3 2 0 -3 -2 -1 -2 -1 2 -3 3 -3 -1 -4
Z -1 0 0 1 -3 4 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -2 -2 -2 0 -3 4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
BLOSUM62["H", "H"]
BLOSUM62["L", "L"]
BLOSUM62["S", "T"]
BLOSUM62["L", "D"]
Task:
- Study this and make sure you understand what this table is, how it can be used, and what a reasonable range of values for identities and pairscores for non-identical, similar and dissimilar residues is. Ask on the mailing list in case you have questions.
Alignment with Biostrings
Biostrings has extensive functions for sequence alignments. They are generally well written and tightly integrated with the rest of Bioconductor's functions. There are a few quirks however: for example alignments won't work with lower-case sequences. This is why our getSeq()
changes sequences to uppercase.
# sequence are stored in AAstring objects
?AAString
seq1 <- AAString(getSeq(db, refseq_id = "NP_010227"))
seq2 <- AAString(getSeq(db, refseq_id = "NP_593032")) # use MBP1_YFORG instead!
?pairwiseAlignment
# global alignment with end-gap penalties is default.
ali1 <- pairwiseAlignment(
seq1,
seq2,
substitutionMatrix = "BLOSUM62",
gapOpening = 10,
gapExtension = 0.5)
writePairwiseAlignments(ali1)
# local alignment
ali2 <- pairwiseAlignment(
seq1,
seq2,
type = "local",
substitutionMatrix = "BLOSUM62",
gapOpening = 50,
gapExtension = 10)
writePairwiseAlignments(ali2)
- TBC
Links and resources
Footnotes and references
- ↑ That directory also contains sourcecode to generate the PAM matrices. This may be of interest if you ever want to produce scoring matrices from your own datasets.
Ask, if things don't work for you!
- If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.
- Do consider how to ask your questions so that a meaningful answer is possible:
- How to create a Minimal, Complete, and Verifiable example on stackoverflow and ...
- How to make a great R reproducible example are required reading.
< Assignment 3 | Assignment 5 > |