BIO Assignment Week 4

From "A B C"
Revision as of 05:49, 10 October 2015 by Boris (talk | contribs)
Jump to navigation Jump to search

Assignment for Week 4
Sequence alignment

< Assignment 3 Assignment 5 >

Note! This assignment is currently active. All significant changes will be announced on the mailing list.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.


 

Introduction

Sequence alignment is a very large, and important topic. In this assignment we will encounter the essentials of

  • optimal global and local alignment;
  • BLAST searches for best matches;
  • PSI BLAST searches for exhaustive matches; and
  • Multiple sequence alignments.

As usual, the focus will be on practical, hands on approaches.

This is the scenario in brief: you have identified a best match for a Mbp1 relative in YFO. HOw can you identify all related genes in YFO? And, what can you learn from this collection?


 

Optimal sequence alignments

Let's start by aligning the sequences of Mbp1 and the YFO relative. For simplicity, I will call the two proteins MBP1_SACCE and MBP1_YFORG through the remainder of the assignment, and even if I casually refer to a gene when I'm really talking about a protein (sorry), you should recognize from context what is meant.


Preparation: Updated Database Functions

First we need to pull out the two sequences from the database object we created last time. You could recreate the state of your database by re-running the relevant parts of the script, or piece things together from the code of the previous assignment.

Keeping things in scripts is really useful.

But since we'll be working more with our database, adding to the data model, updating code for getting and setting data, and adding proteins, annotations and cross-references, let's spend a moment to organize things in a more principled way.

  • We should create a script that loads the functions to manage the database;
  • We should save our database so we can easily reload the contents.

Task:
Here's how we should organize this:

  1. We'll define a variable called PROJECTDIR which automatically gets set whenever you startup R.
  2. A scriptfile with the necessary functions should automatically get source()'d at startup;
  3. The database should be saved so it can easily be loaded.

You will find the code below. It looks long, but it's really quite straightforward bookkeeping. I have added a number of tests to help make sure the input is sane. That actually makes up the majority of the code. Sanitizing user input is always much more effort than the actual algorithm. I have tested the functions and think they should work as expected. But if you come across a situation where your input produces an error, or creates an inconsistency in the database, by all means let me know so the code can be improved.


1. Create a project directory for the assignments on your computer if you don't have one yet.

2. Adapt the code below as needed, and execute it to update .Rprofile.

file.edit("~/.Rprofile")

# Add:
PROJECTDIR <- "full/path/to/your/directory/"  # including the final backslash.
source(paste(PROJECTDIR, "dbUtilities.R", sep=""))

# ... and save the file.
# To make the definition available, run it.
source("~/.Rprofile")

# Now let's create  the script for the database functions:

file.edit(paste(PROJECTDIR, "dbUtilities.R", sep=""))


An edit window for the file has opened. Copy the entire code block below, and paste it into the editor.


# dbUtilities.R
#
# Purpose: Utility functions for a Protein datamodel
# Version: 0.1
# Date:    Oct 2015
# Author:  Boris and class
#
# ToDo:    Add more tables.
#          Accept either taxonomy_id OR species name
#             and pull the other from NCBI. 
# Notes:   Cf. schema sketch at 
# http://steipe.biochemistry.utoronto.ca/abc/index.php/File:ProteinDataModel.1.jpg
#          Currently implements only "protein" and
#          "taxonomy" table.
# ==========================================================


# ====  FUNCTIONS  =========================================

# ==== createDB =============================================
# Returns an empty list
# We use a separate function because we might want to
# some initialization code later.
createDB <- function() {
	return(list())
}


# ==== in2seq ==============================================
# Utility function to sanitize input and convert it into a
# sequence string. Case can be optionally changed.
# Letters that are not one-letter code - such as
# ambiguity codes - throw an error if not explicitly
# permitted.

in2seq <- function(s, uc = FALSE, lc = FALSE, noAmbig = TRUE) {
	s <- paste(unlist(s), collapse="") # flatten whatever structure it has
	s <- gsub("[^a-zA-Z]", "", s)
	if (noAmbig) {
		ambCodes <- "([bjouxzBJOUXZ])"  # parentheses capture the match
		ambChar <- unlist(regmatches(s, regexec(ambCodes, s)))[1] 
         if (! is.na(ambChar)) {
         	    stop(paste("Input contains ambiguous letter: \"", ambChar, "\"", sep=""))
         }		
	}
	if (uc) { s <- toupper(s)}
	if (lc) { s <- tolower(s)}
	return(s)
}

# ==== in2vec ==============================================
# Utility function to sanitize input and expand it into a
# vector of single characters. Arguments for in2seq are
# passed through via the three-dots parameter syntax.
in2vec <- function(s, ...) {
	s <- in2seq(s, ...)
	return(unlist(strsplit(s, "")))
}



# ==== addToDB =============================================
# Add a new protein entry to the database, with associated
# taxonomy entry
addToDB <- function(database,
                    name = "",
                    refseq_id = "",
                    uniprot_id = "",
                    taxonomy_id,
                    genome_xref = numeric(),
                    genome_from = numeric(),
                    genome_to = numeric(),
                    sequence = "",
                    species_name = "") {
    if (missing(database)) {
    	stop("\"database\" argument is missing with no default.")
    }
    if (missing(taxonomy_id)) {
    	stop("taxonomy_id argument is missing with no default.")
    }
    
    if (! is.numeric(taxonomy_id)) {
   		stop(paste("taxonomy_id \"", 
   		            taxonomy_id, 
   		            "\" is not numeric. Please correct.", sep=""))
   	}
    
    # check taxonomy_id
    if (! any(database$taxonomy$id == taxonomy_id)) {  # new taxonomy_id
        if (missing(species_name)) {
    		stop(paste("taxonomy_id", 
    		           taxonomy_id, 
    		           "is not yet in database, but species_name", 
    		           "is missing with no default."))
    	}
    	else {
    		# add this species to the taxonomy table
            database$taxonomy <- rbind(database$taxonomy,
              data.frame(id = taxonomy_id,
                species_name = species_name,
                stringsAsFactors = FALSE))
    	}
    }
    # handle protein
    
    # pid is 1 if the table is empty, max() + 1 otherwise.
    if (is.null(nrow(database$protein))) { pid <- 1 }
    else {pid <- max(database$protein$id) + 1}
    
    database$protein <- rbind(database$protein,
      data.frame(id = pid,
        name = name,
        refseq_id = refseq_id,
        uniprot_id = uniprot_id,
        taxonomy_id = taxonomy_id,
        genome_xref = genome_xref,
        genome_from = genome_from,
        genome_to = genome_to,
        sequence = in2seq(sequence),
        stringsAsFactors = FALSE))
 
    return(database)
}


# ==== setDB ===============================================
# Update database values

setDB <- function(database,
                  table,
                  id   =         NULL,
                  name =         NULL,
                  refseq_id =    NULL,
                  uniprot_id =   NULL,
                  taxonomy_id =  NULL,
                  genome_xref =  NULL,
                  genome_from =  NULL,
                  genome_to =    NULL,
                  sequence =     NULL,
                  species_name = NULL) {
    if (missing(database) | missing(table)) {
    	stop("Database or table is missing with no default.")
    }
    if (table == "protein") {
	    if (is.null(id)) {
	    	stop("Protein id is missing with no default.")
	    }
    	row <- which(database$protein$id == id)
    	if (! is.null(name)) { database$protein[row, "name"] <- as.character(name) } 
    	if (! is.null(refseq_id)) { database$protein[row, "refseq_id"] <- as.character(refseq_id) } 
    	if (! is.null(uniprot_id)) { database$protein[row, "uniprot_id"] <- as.character(uniprot_id) } 

    	if (! is.null(taxonomy_id)) {
    		# must be numeric ...
    		if (! is.numeric(taxonomy_id)) {
    		stop(paste("taxonomy_id", 
    		           taxonomy_id, 
    		           "is not numeric. Please correct."))
    		}
    		# must exist in taxonomy table ...
	        if (! any(database$taxonomy$id == taxonomy_id)) {  # new taxonomy_id
	    		stop(paste("taxonomy_id", 
	    		           taxonomy_id, 
	    		           "not found in taxonomy table. Please update taxonomy table and try again."))
	        }
	        # all good, update it...
    		database$protein[row, "taxonomy_id"] <- taxonomy_id
        } 
    	if (! is.null(genome_xref)) { database$protein[row, "genome_xref"] <- genome_xref} 
    	if (! is.null(genome_from)) { database$protein[row, "genome_from"] <- genome_from} 
    	if (! is.null(genome_to)) { database$protein[row, "genome_to"] <- genome_to} 
    	if (! is.null(sequence)) { database$protein[row, "sequence"] <- in2seq(sequence)} 
    }
    else if (table == "taxonomy") {
	    if (missing(taxonomy_id)) {
	    	stop("taxonomy_id is missing with no default.")
	    }
    if (! any(database$taxonomy$id == taxonomy_id)) { 
	       stop(paste(" Can't set values for this taxonomy_id.", 
	    		       taxonomy_id, 
	    		       "was not found in taxonomy table."))
	    }
    	row <- which(database$taxonomy$id == taxonomy_id)
    	if (species_name != "") { database$taxonomy[row, "species_name"] <- species_name } 
    }
    else {
    	stop(paste("This function has no code to update table \"", 
	    	       table, 
	    	       "\". Please enter a valid table name."))
	}
    
    return(database)
}


# ==== getDBid =============================================
# Get a vector of IDs from a database table from all rows
# for which all of the requested attributes are true.
# Note: if no restrictions are entered, ALL ids are returned.
# We don't have code to select from genome coordinates, or
# query from sequence.

getDBid <- function(database,
                  table,
                  name =         NULL,
                  refseq_id =    NULL,
                  uniprot_id =   NULL,
                  taxonomy_id =  NULL,
                  species_name = NULL) {
    if (missing(database) | missing(table)) {
    	stop("Database or table is missing with no default.")
    }
    if (table == "protein") {
    	sel <- rep(TRUE, nrow(database$protein))  # initialize
    	if (! is.null(name)       ) { sel <- sel & database$protein[, "name"]        == name } 
    	if (! is.null(refseq_id)  ) { sel <- sel & database$protein[, "refseq_id"]   == refseq_id } 
    	if (! is.null(uniprot_id) ) { sel <- sel & database$protein[, "uniprot_id"]  == uniprot_id } 
    	if (! is.null(taxonomy_id)) { sel <- sel & database$protein[, "taxonomy_id"] == taxonomy_id } 
        sel <- db$protein$id[sel]  # get ids by selecting from vector
    }
    else if (table == "taxonomy") {
    	sel <- rep(TRUE, nrow(database$taxonomy))  # initialize
    	if (! is.null(taxonomy_id) ) { sel <- sel & database$taxonomy[, "id"]           == taxonomy_id } 
    	if (! is.null(species_name)) { sel <- sel & database$taxonomy[, "species_name"] == species_name } 
        sel <- db$taxonomy$id[sel]  # get ids by selecting from vector
    }
    else {
    	stop(paste("This function has no code to select from table \"", 
	    	       table, 
	    	       "\". Please enter a valid table name."))
	}
    
    return(sel)

}

# ==== getSeq ==============================================
# Retrieve the sequences for given id matches from the
# protein table. Uppercase, to make Biostrings happy.
getSeq <- function(database, ...) {
    if (missing(database)) {
    	stop("Database argument is missing with no default.")
    }
    ids <- getDBid(database, table= "protein", ...)
    seq <- db$protein[ids, "sequence"]
    return(toupper(seq))
}


# ====  MESSAGE ============================================

cat("db_utilities.R has been loaded. The following functions are now available:\n")
cat("    createDB()\n")
cat("    addToDB()\n")
cat("    setDB()\n")
cat("    getDBid()\n")
cat("    getSeq()\n")
cat("    in2seq()\n")
cat("    in2vec()\n")
cat("    \n")


# ====  TESTS  =============================================

# TBD



# [END]


Save dbUtilities.R and source() it to make the functions immediately available. They will also be available when you next start R.

source(paste(PROJECTDIR, "dbUtilities.R", sep=""))


We now have a first set of somewhat credible database functions. Let's create a database and add two proteins.


db <- createDB()

db <- addToDB(db,
              name = "Mbp1",
              refseq_id = "NP_010227",
              uniprot_id = "P39678",
              taxonomy_id = 4932,
              genome_xref = "NC_001136.10",
              genome_from = 352877,
              genome_to = 355378,
              sequence = "
       1 msnqiysary sgvdvyefih stgsimkrkk ddwvnathil kaanfakakr trilekevlk
      61 ethekvqggf gkyqgtwvpl niakqlaekf svydqlkplf dftqtdgsas pppapkhhha
     121 skvdrkkair sastsaimet krnnkkaeen qfqsskilgn ptaaprkrgr pvgstrgsrr
     181 klgvnlqrsq sdmgfprpai pnssisttql psirstmgpq sptlgileee rhdsrqqqpq
     241 qnnsaqfkei dledglssdv epsqqlqqvf nqntgfvpqq qssliqtqqt esmatsvsss
     301 pslptspgdf adsnpfeerf pgggtspiis miprypvtsr pqtsdindkv nkylsklvdy
     361 fisnemksnk slpqvllhpp phsapyidap idpelhtafh wacsmgnlpi aealyeagts
     421 irstnsqgqt plmrsslfhn sytrrtfpri fqllhetvfd idsqsqtvih hivkrksttp
     481 savyyldvvl skikdfspqy rielllntqd kngdtalhia skngdvvffn tlvkmgaltt
     541 isnkegltan eimnqqyeqm miqngtnqhv nssntdlnih vntnnietkn dvnsmvimsp
     601 vspsdyityp sqiatnisrn ipnvvnsmkq masiyndlhe qhdneikslq ktlksisktk
     661 iqvslktlev lkesskdeng eaqtnddfei lsrlqeqntk klrkrliryk rlikqkleyr
     721 qtvllnklie detqattnnt vekdnntler lelaqeltml qlqrknklss lvkkfednak
     781 ihkyrriire gtemnieevd ssldvilqtl iannnknkga eqiitisnan sha
                         ",
              species_name = "Saccharomyces cerevisiae")


db <- addToDB(db,
              name = "Res2",
              refseq_id = "NP_593032",
              uniprot_id = "P41412",
              taxonomy_id = 4896,
              genome_xref = "NC_003424.3",
              genome_from = 686543,
              genome_to = 689179,
              sequence = "
        1 maprssavhv avysgvevye cfikgvsvmr rrrdswlnat qilkvadfdk pqrtrvlerq
       61 vqigahekvq ggygkyqgtw vpfqrgvdla tkykvdgims pilsldideg kaiapkkkqt
      121 kqkkpsvrgr rgrkpsslss stlhsvnekq pnssisptie ssmnkvnlpg aeeqvsatpl
      181 paspnallsp ndntikpvee lgmleapldk yeeslldffl hpeegripsf lyspppdfqv
      241 nsvidddght slhwacsmgh iemiklllra nadigvcnrl sqtplmrsvi ftnnydcqtf
      301 gqvlellqst iyavdtngqs ifhhivqsts tpskvaaaky yldcilekli siqpfenvvr
      361 lvnlqdsngd tslliaarng amdcvnslls ynanpsipnr qrrtaseyll eadkkphsll
      421 qsnsnashsa fsfsgispai ispscsshaf vkaipsissk fsqlaeeyes qlrekeedli
      481 ranrlkqdtl neisrtyqel tflqknnpty sqsmenlire aqetyqqlsk rlliwlearq
      541 ifdlerslkp htslsisfps dflkkedgls lnndfkkpac nnvtnsdeye qlinkltslq
      601 asrkkdtlyi rklyeelgid dtvnsyrrli amscginped lsleildave ealtrek
                         ",
              species_name = "Schizosaccharomyces pombe")


Now for YFO. Copy one of the samples above, edit it for the your Mbp1 homologue in YFO and add it to the database.

Then save the database, delete it and reload it:

save(db, file="proteinDB.RData")  # write to file
rm(db)                            # remove
db                                # it's gone

load("proteinDB.RData")           # read it back
db                                # verify

When that is done, we're ready to run some alignments.


 

Optimal Sequence Alignment at EMBOSS

Online programs for optimal sequence alignment are part of the EMBOSS tools. The programs take FASTA files or raw text files as input.

Local optimal SEQUENCE alignment "water"

Task:

  1. Fetch the sequences for MBP1_SACCE and MBP1_YFORG from your database. Something like:
getSeq(db, refseq_id = "NP_010227")
  1. Access the EMBOSS Explorer site (if you haven't done so yet, you might want to bookmark it.)
  2. Look for ALIGNMENT LOCAL, click on water, paste your sequences and run the program with default parameters.
  3. Study the results. You will probably find that the alignment extends over most of the protein, but does not include the termini.
  4. Considering the sequence identity cutoff we discussed in class (25% over the length of a domain), do you believe that the N-terminal domains (the APSES domains) are homologous?
  5. Change the Gap opening and Gap extension parameters to high values (e.g. 30 and 5). Then run the alignment again.
  6. Note what is different.


Global optimal SEQUENCE alignment "needle"

Task:

  1. Look for ALIGNMENT GLOBAL, click on needle, paste the MBP1_SACCE and MBP1_YFORG sequences again and run the program with default parameters.
  2. Study the results. You will find that the alignment extends over the entire protein, likely with long indels at the termini.


 

The Mutation Data Matrix

The NCBI makes its alignment matrices available by ftp. They are located at ftp://ftp.ncbi.nih.gov/blast/matrices - for example here is a link to the BLOSUM62 matrix[1].

Scoring matrices are also available in the Bioconductor Biostrings package.

if (!require(Biostrings, quietly=TRUE)) {
    source("https://bioconductor.org/biocLite.R")
    biocLite("Biostrings")
    library(Biostrings)
}

help(package = "Biostrings")

BLOSUM62

   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  J  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1 -1 -1 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1 -2  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  4 -3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4 -3  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -1 -3 -1 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0 -2  4 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1 -3  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -4 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0 -3  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3  3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4  3 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0 -3  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3  2 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3  0 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -3 -1 -1 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0 -2  0 -1 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1 -1 -1 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -2 -2 -1 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -1 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3  2 -2 -1 -4
B -2 -1  4  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4 -3  0 -1 -4
J -1 -2 -3 -3 -1 -2 -3 -4 -3  3  3 -3  2  0 -3 -2 -1 -2 -1  2 -3  3 -3 -1 -4
Z -1  0  0  1 -3  4  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -2 -2 -2  0 -3  4 -1 -4
X -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

BLOSUM62["H", "H"]
BLOSUM62["L", "L"]
BLOSUM62["S", "T"]
BLOSUM62["L", "D"]


Task:

  • Study this and make sure you understand what this table is, how it can be used, and what a reasonable range of values for identities and pairscores for non-identical, similar and dissimilar residues is. Ask on the mailing list in case you have questions.


 

Alignment with Biostrings

Biostrings has extensive functions for sequence alignments. They are generally well written and tightly integrated with the rest of Bioconductor's functions. There are a few quirks however: for example alignments won't work with lower-case sequences. This is why our getSeq() changes sequences to uppercase.


# sequence are stored in AAstring objects
?AAString

seq1 <- AAString(getSeq(db, refseq_id = "NP_010227"))
seq2 <- AAString(getSeq(db, refseq_id = "NP_593032")) # use MBP1_YFORG instead!


?pairwiseAlignment

# global alignment with end-gap penalties is default.
ali1 <-  pairwiseAlignment(
            seq1,
            seq2,
            substitutionMatrix = "BLOSUM62",
            gapOpening = 10,
            gapExtension = 0.5)

writePairwiseAlignments(ali1)

# local alignment
ali2 <-  pairwiseAlignment(
            seq1,
            seq2,
            type = "local",
            substitutionMatrix = "BLOSUM62",
            gapOpening = 50,
            gapExtension = 10)

writePairwiseAlignments(ali2)


 


TBC


 

Links and resources

 


Footnotes and references

  1. That directory also contains sourcecode to generate the PAM matrices. This may be of interest if you ever want to produce scoring matrices from your own datasets.


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 3 Assignment 5 >