Difference between revisions of "BIO Assignment Week 6"

From "A B C"
Jump to navigation Jump to search
m
 
(9 intermediate revisions by the same user not shown)
Line 24: Line 24:
 
 
 
 
  
In this assignment we will perform a few computations with coordinate data in PDB files, look more in depth at domain annotations, and compare them across different proteins related to yeast Mbp1. We will write a function in '''R''' to help us plot a graphic of the comparison and collaborate to share data for our plots. But first we need to go through a few more '''R''' concepts.
+
In this assignment we will first download a number of APSES domain containing sequences into our database - and we will automate the process. Then we will annotate them with domain data. First manually, and then again, we will automate this. Next we will extract the APSES domains from our database according to the annotations. And finally we will align them, and visualize domain conservation in the 3D model to study parts of the protein that are conserved.
  
  
 
 
 
 
  
== Programming '''R''' code  ==
+
==Downloading Protein Data From the Web==
  
First, we will cover essentials of '''R''' programming: the fundamental statements that are needed to write programs–conditional expressions and loops, and how to define ''functions'' that allow us to use programming code. But let's start with a few more data types of '''R''' so we can use the concepts later on: matrices, lists and data frames.
 
  
 +
In [[BIO_Assignment_Week_3|Assignment 3]] we created a schema for a local protein sequence collection, and implemented it as an R list. We added sequences to this database by hand, but since the information should be cross-referenced and available based on a protein's RefSeq ID, we should really have a function that automates this process. It is far too easy to make mistakes and enter erroneous information otherwise.
  
{{task|
 
Please begin by working through the short [http://biochemistry.utoronto.ca/steipe/abc/index.php/R_tutorial#Matrices '''R''' - tutorial: matrices] section and the following sections on "Lists" and "Data frames".
 
 
Note that what we have done here is just the bare minimum on vectors, matrices and lists. The concepts are very generally useful, and there are many useful ways to extract subsets of values. We'll come across these in various parts of '''R''' sample code. But unless you read the provided code examples line by line, make sure you understand every single statement and '''ask''' if you are not clear about the syntax, this will all be quickly forgotten. Use it or lose it!
 
}}
 
 
 
'''R''' is a full featured programming language and in order to be that, it needs the ability to manipulate '''Control Flow''', i.e. the order in which instructions are executed. By far the most important control flow statement is a '''conditional branch''' i.e. the instruction to execute code at alternative points of the program, depending on the presence or absence of a condition. Using such conditional branch instructions, two main types of programming constructs can be realized: ''conditional expressions'' and ''loops''.
 
 
=== Conditional expressions ===
 
 
The template for conditional expression in R is:
 
 
<source lang="rsplus">
 
if( <expression 1> ) {
 
  <statement 1>
 
}
 
else if ( <expresssion 2> ) {
 
  <statement 2>
 
}
 
else {
 
  <statement 3>
 
}
 
 
</source>
 
 
...where both the <code>else if (...) { ... }</code> and the <code>else (...)</code> block are optional. We have encountered this construct previously, when we assigned the appropriate colors for amino acids in the frequency plot:
 
 
<source lang="rsplus">
 
if      (names(logRatio[i]) == "F") { barColors[i] <- hydrophobic }
 
else if (names(logRatio[i]) == "G") { barColors[i] <- plain }
 
  [... etc ...]
 
else                                { barColors[i] <- plain }
 
 
</source>
 
 
==== Logical expressions ====
 
 
We have to consider the <code>&lt;expression&gt;</code> in a bit more detail: anything that is, or produces, or can be interpreted as, a Boolean TRUE or FALSE value can serve as an expression in a conditional statement.
 
  
 
{{task|1=
 
{{task|1=
Here are some examples. Copy the code to an '''R''' script, predict what will happen on in each line and try it out:
 
  
<source lang="rsplus">
+
Work through the following code examples.
# A boolean constant is interpreted as is:
+
<source lang="R">
if (TRUE)    {print("true")} else {print("false")}
 
if (FALSE)  {print("true")} else {print("false")}
 
 
 
# Strings of "true" and "false" are coerced to their
 
# Boolean equivalent, but contrary to other programming languages
 
# arbitrary, non-empty or empty strings are not interpreted.
 
if ("true") {print("true")} else {print("false")}
 
if ("false") {print("true")} else {print("false")}
 
if ("widdershins")    {print("true")} else {print("false")}
 
if ("") {print("true")} else {print("false")}
 
 
 
# All non-zero, defined numbers are TRUE
 
if (1)      {print("true")} else {print("false")}
 
if (0)      {print("true")} else {print("false")}
 
if (-1)    {print("true")} else {print("false")}
 
if (pi)    {print("true")} else {print("false")}
 
if (NULL)  {print("true")} else {print("false")}
 
if (NA)    {print("true")} else {print("false")}
 
if (NaN)    {print("true")} else {print("false")}
 
if (Inf)    {print("true")} else {print("false")}
 
 
 
# functions can return Boolean values
 
affirm <- function() { return(TRUE) }
 
deny <- function() { return(FALSE) }
 
if (affirm())    {print("true")} else {print("false")}
 
if (deny())    {print("true")} else {print("false")}
 
 
 
# N.B. coercion of Booleans into numbers can be done as well
 
and is sometimes useful: consider ...
 
a <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
 
a
 
as.numeric(a)
 
sum(a)
 
  
# ... or coercing the other way ...
+
# To begin, we load some libraries with functions
as.logical(-1:1)
+
# we need...
</source>
 
}}
 
  
==== Logical operators ====
+
# httr sends and receives information via the http
 
+
# protocol, just like a Web browser.
To actually write a conditional statement, we have to be able to '''test a condition''' and this is what logical operators do. Is something '''equal''' to something else? Is it less? Does something exist? Is it a number?
+
if (!require(httr, quietly=TRUE)) {
 
+
install.packages("httr")
{{task|1=
+
library(httr)
 
 
Here are some examples. Again, predict what will happen ...
 
 
 
<source lang="rsplus">
 
TRUE            # Just a statement.
 
 
 
#  unary operator
 
! TRUE          # NOT ...
 
 
 
# binary operators
 
FALSE > TRUE    # GREATER THAN ...
 
FALSE < TRUE    # ... these are coerced to numbers
 
FALSE < -1       
 
0 == FALSE      # Careful! == compares, = assigns!!!
 
 
 
"x" == "u"      # using lexical sort order ...
 
"x" >= "u"
 
"x" <= "u"
 
"x" != "u"
 
"aa" > "u"      # ... not just length, if different.
 
"abc" < "u" 
 
 
 
TRUE | FALSE    # OR: TRUE if either is true
 
TRUE & FALSE    # AND: TRUE if both are TRUE
 
 
 
# equality and identity
 
?identical
 
a <- c(TRUE)
 
b <- c(TRUE)
 
a; b
 
a == b
 
identical(a, b)
 
 
 
b <- 1
 
a; b
 
a == b
 
identical(a, b)  # Aha: equal, but not identical
 
 
 
 
 
# some other useful tests for conditional expressions
 
?all
 
?any
 
?duplicated
 
?exists
 
?is.character
 
?is.factor
 
?is.integer
 
?is.null
 
?is.numeric
 
?is.unsorted
 
?is.vector
 
</source>
 
}}
 
 
 
 
 
=== Loops ===
 
 
 
Loops allow you to repeat tasks many times over. The template is:
 
 
 
<source lang="rsplus">
 
for (<name> in <vector>) {
 
  <statement>
 
 
}
 
}
</source>
 
  
{{task|1=
+
# NCBI's eUtils send information in XML format; we
Consider the following: Again, copy the code to a script, study it, predict what will happen and then run it.
+
# need to be able to parse XML.
 
+
if (!require(XML, quietly=TRUE)) {
<source lang="rsplus">
+
install.packages("XML")
# simple for loop
+
library(XML)
for (i in 1:10) {
 
print(c(i, i^2, i^3))
 
 
}
 
}
  
# Compare excution times: one million square roots from a vector ...
+
# stringr has a number of useful utility functions
n <- 1000000
+
# to work with strings. E.g. a function that
x <- 1:n
+
# strips leading and trailing whitespace from
y <- sqrt(x)
+
# strings.
 
+
if (!require(stringr, quietly=TRUE)) {
# ... or done explicitly in a for-loop
+
install.packages("stringr")
for (i in 1:n) {
+
library(stringr)
  y[i] <- sqrt (x[i])
 
 
}
 
}
  
</source>
 
 
''If'' you can achieve your result with an '''R''' vector expression, it will be faster than using a loop. But sometimes you need to do things explicitly, for example if you need to access intermediate results.
 
 
}}
 
 
 
Here is an example to play with loops: a password generator. Passwords are a '''pain'''. We need them everywhere, they are frustrating to type, to remember and since the cracking programs are getting smarter they become more and more likely to be broken. Here is a simple password generator that creates random strings with consonant/vowel alterations. These are melodic and easy to memorize, but actually as '''strong''' as an 8-character, fully random password that uses all characters of the keyboard such as <code>)He.{2jJ</code> or <code>#h$bB2X^</code> (which is pretty much unmemorizable). The former is taken from 20<sup>7</sup> * 7<sup>7</sup> 10<sup>15</sup> possibilities, the latter is from 94<sup>8</sup> ~ 6*10<sup>15</sup> possibilities. HIgh-end GPU supported {{WP|Password cracking|password crackers}} can test about 10<sup>9</sup> passwords a second, the passwords generated by this little algorithm would thus take on the order of 10<sup>6</sup> seconds or eleven days to crack<ref>That's assuming the worst case in that the attacker needs to know the pattern with which the password is formed, i.e. the number of characters and the alphabet that we chose from. But note that there is an even worse case: if the attacker had access to our code and the seed to our random number generator. When the random number generator starts up, a new seed is generated from system time, thus the possible space of seeds can be devastatingly small. But even if a seed is set explicitly with the <code>set.seed()</code> function, that seed is a 32-bit integer and thus can take only a bit more than 4*10<sup>9</sup> values, six orders of magnitude less than the 10<sup>15</sup> password complexity we thought we had. It turns out that the code may be a much greater vulnerability than the password itself. Keep that in mind. <small>Keep it secret. <small>Keep it safe.</small></small></ref>. This is probably good enough to deter a casual attack.
 
 
{{task|1=
 
Copy, study and run ...
 
<source lang="rsplus">
 
# Suggest memorizable passwords
 
# Below we use the functions:
 
?nchar
 
?sample
 
?substr
 
?paste
 
?print
 
 
#define a string of  consonants ...
 
con <- "bcdfghjklmnpqrstvwxz"
 
# ... and a string of of vowels
 
vow <- "aeiouy"
 
  
for (i in 1:10) {  # ten sample passwords to choose from ...
+
# We will walk through the process with the refSeqID
    pass = rep("", 14)  # make an empty character vector
+
# of yeast Mbp1
    for (j in 1:7) {    # seven consonant/vowel pairs to be created ...
+
refSeqID <- "NP_010227"
        k  <- sample(1:nchar(con), 1)  # pick a random index for consonants ...
 
        ch  <- substr(con,k,k)          #  ... get the corresponding character ...
 
        idx <- (2*j)-1                  # ... compute the position (index) of where to put the consonant ...
 
        pass[idx] <- ch                # ...  and put it in the right spot
 
 
 
        # same thing for the vowel, but coded with fewer intermediate assignments
 
        # of results to variables
 
        k <- sample(1:nchar(vow), 1)
 
        pass[(2*j)] <- substr(vow,k,k)
 
    }
 
    print( paste(pass, collapse="") )  # collapse the vector in to a string and print
 
}
 
</source>
 
}}
 
  
=== Functions ===
 
  
Finally: functions. Functions look very much like the statements we have seen above. the template looks like:
+
# UniProt.
 +
# The UniProt ID mapping service supports a "RESTful
 +
# API": responses can be obtained simply via a Web-
 +
# browsers request. Such requests are commonly sent
 +
# via the GET or POST verbs that a Webserver responds
 +
# to, when a client asks for data. GET requests are
 +
# visible in the URL of the request; POST requests
 +
# are not directly visible, they are commonly used
 +
# to send the contents of forms, or when transmitting
 +
# larger, complex data items. The UniProt ID mapping
 +
# sevice can accept long lists of IDs, thus using the
 +
# POST mechanism makes sense.
  
<source lang="rsplus">
+
# R has a POST() function as part of the httr package.
<name> <- function (<parameters>) {
 
  <statements>
 
}
 
</source>
 
  
In this statement, the function is assigned to the ''name'' - any valid name in '''R'''. Once it is assigned, it the function can be invoked with <code>name()</code>. The parameter list (the values we write into the parentheses followin the function name) can be empty, or hold a list of variable names. If variable names are present, you need to enter the corresponding parameters when you execute the function. These assigned variables are available inside the function, and can be used for computations. This is called "passing the variable into the function".
+
# It's very straightforward to use: just define the URL
 +
# of the server and send a list of items as the  
 +
# body of the request.
  
You have encountered a function to choose YFO names. In this function, your Student ID was the parameter. Here is another example to play with: a function that calculates how old you are. In days. This is neat - you can celebrate your 10,000 birth'''day''' - or so.
+
# uniProt ID mapping service
 +
URL <- "http://www.uniprot.org/mapping/"
 +
response <- POST(URL,
 +
                body = list(from = "P_REFSEQ_AC",
 +
                            to = "ACC",
 +
                            format = "tab",
 +
                            query = refSeqID))
  
{{task|1=
+
response
  
Copy, explore and run ...
+
# If the query is successful, tabbed text is returned.
 +
# and we capture the fourth element as the requested
 +
# mapped ID.
 +
unlist(strsplit(content(response), "\\s+"))
  
;Define the function ...
+
# If the query can't be fulfilled because of a problem
<source lang = "rsplus">
+
# with the server, a WebPage is rturned. But the server status
# A lifedays calculator function
+
# is also returned and we can check the status code. I have
 +
# lately gotten many "503" status codes: Server Not Available...
  
myLifeDays <- function(date = NULL) { # give "date" a default value so we can test whether it has been set
+
if (response$status_code == 200) { # 200: oK
    if (is.null(date)) {
+
uniProtID <- unlist(strsplit(content(response), "\\s+"))[4]
        print ("Enter your birthday as a string in \"YYYY-MM-DD\" format.")
+
if (is.na(uniProtID)) {
        return()
+
warning(paste("UniProt ID mapping service returned NA.",
    }
+
              "Check your RefSeqID."))
    x <- strptime(date, "%Y-%m-%d") # convert string to time
+
}
    y <- format(Sys.time(), "%Y-%m-%d") # convert "now" to time
+
} else {
    diff <- round(as.numeric(difftime(y, x, unit="days")))
+
uniProtID <- NA
    print(paste("This date was ", diff, " days ago."))
+
warning(paste("No uniProt ID mapping available:",
 +
              "server returned status",
 +
              response$status_code))
 
}
 
}
</source>
 
  
;Use the function (example):
+
uniProtID  # Let's see what we got...
<source lang = "rsplus">
+
          # This should be "P39678"
  myLifeDays("1932-09-25") # Glenn Gould's birthday
+
          # (or NA if the query failed)
 
</source>
 
</source>
}}
 
  
Here is a good opportunity to play and practice programming: modify this function to accept a second argument. When a second argument is present (e.g. 10000) the function should print the calendar date on which the input date will be that number of days ago. Then you could use it to know when to celebrate your 10,000<sup>th</sup> lifeDay, or your 777<sup>th</sup> anniversary day or whatever.
 
  
Enjoy.
+
Next, we'll retrieve data from the various NCBI databases.
  
== Protein structure annotation ==
+
It is has become unreasonably difficult to screenscrape the NCBI site
 +
since the actual page contents are dynamically loaded via
 +
AJAX. This may be intentional, or just  overengineering.
 +
While NCBI offers a subset of their data via the eutils API and
 +
that works well enough, some of the data that is available to the
 +
Web browser's eyes is not served to a program.
  
Let's do something very quick in Chimera: calculate a Ramachandran plot.
+
The eutils API returns data in XML format. Have a
 +
look at the following URL in your browser to see what that looks like:
  
{{task|1=
+
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=NP_010227
# Open Chimera and load <code>1BM8</code>.
 
# Choose '''Favorites''' &rarr; '''Model Panel'''
 
# Look for the Option '''Ramachandran plot...''' in the choices on the right.
 
# Click the button and study the result. What do you see? What does this mean? What options are there? What do they do? What significance does this have?
 
}}
 
  
  
&nbsp;
+
<source lang="R">
  
== CDD domain annotation ==
+
# In order to parse such data, we need tools from the
 +
# XML package.
  
In the last assignment, you followed a link to '''CDD Search Results''' from the [http://www.ncbi.nlm.nih.gov/protein/NP_010227 RefSeq record for yeast Mbp1] and briefly looked at the information offered by the NCBI's Conserved Domain Database, a database of ''Position Specific Scoring Matrices'' that embody domain definitions. Rather than access precomputed results, you can also search CDD with sequences: assuming you have saved the YFO Mbp1 sequence in FASTA format, this is straightforward. If you did not save this sequence, return to [[BIO_Assignment_Week_3|Assignment 3]] and retrieve it again.
+
# First we build a query URL...
 +
eUtilsBase <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
  
  
{{task|1=
+
# Then we assemble an URL that will search for get the
# Access the [http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml '''CDD database'''] at http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
+
# unique, NCBI internal identifier,  the GI number,
# Read the information. CDD is a superset of various other database domain annotations as well as NCBI-curated domain definitions.
+
# for our refSeqID...
# Copy the YFO Mbp1 FASTA sequence, paste it into the search form and click '''Submit'''.
+
URL <- paste(eUtilsBase,
## On the result page, clik on '''View full result'''
+
            "esearch.fcgi?",    # ...using the esearch program
## Note that there are a number of partially overlapping ankyrin domain modules. We will study ankyrin domains in a later assignment.
+
                                  # that finds an entry in an
## Also note that there may be blocks of sequence colored cyan in the sequence bar. Hover your mouse over the blocks to see what these blocks signify.
+
                                  # NCBI database
## Open the link to '''Search for similar domain architecture''' in a separate window and study it. This is the '''CDART''' database. Think about what these results may be useful for.
+
            "db=protein",
## Click on one of the ANK superfamily graphics and see what the associated information looks like: there is a summary of structure and function, links to specific literature and a tree of the relationship of related sequences.
+
            "&term=", refSeqID,
}}
+
            sep="")
 +
# Copy the URL and paste it into your browser to see
 +
# what the response should look like.
 +
URL
  
== SMART domain annotation ==
+
# To fetch a response in R, we use the function htmlParse()
 +
# with our URL as its argument.
 +
response <- htmlParse(URL)
 +
response
  
 +
# This is XML. We can take the response apart into
 +
# its indvidual components with the xmlToList function.
  
The [http://smart.embl-heidelberg.de/ SMART database] at the EMBL in Heidelberg offers an alternative view on domain architectures. I personally find it more useful for annotations because it integrates a number of additional features. You can search by sequence, or by accession number and that raises the question of how to retrieve a database cross-reference from an NCBI sequence identifier to a UniProt sequence ID:
+
xmlToList(response)
  
 +
# Note how the XML "tree" is represented as a list of
 +
# lists of lists ...
 +
# If we know exactly what elelement we are looking for,
 +
# we can extract it from this structure:
 +
xmlToList(response)[["body"]][["esearchresult"]][["idlist"]][["id"]]
  
===ID mapping===
+
# But this is not very robus, it would break with the
 +
# slightest change that the NCBI makes to their response
 +
# and the NCBI changes things A LOT!
  
 +
# Somewhat more robust is to specify the type of element
 +
# we want - its the text contained in an <id>...</id>
 +
# elelement, and use the XPath XML parsing language to
 +
# retrieve it.
  
{{task|
+
# getNodeSet() lets us fetch tagged contents by
<!-- Yeast:  NP_010227 ... P39678 -->
+
# applying toString.XMLNode() to it...
# Access the [http://www.uniprot.org/mapping/ UniProt ID mapping service] at http://www.uniprot.org/mapping/
 
# Paste the RefSeq identifier for YFO Mbp1 into the search field.
 
# Use the menu to choose '''From''' ''RefSeq Protein'' and '''To''' ''UniprotKB AC''&ndash;the UniProt Knowledge Base Accession number.
 
# Click on '''Map''' to execute the search.
 
# Note the ID - it probably starts with a letter, followed by numbers and letters. Here are some examples for fungal Mbp1-like proteins: <code>P39678 Q5B8H6 Q5ANP5 P41412</code> ''etc.''
 
# Click on the link, and explore how the UniProt sequence page is similar or different from the RefSeq page.
 
}}
 
  
===SMART search===
+
node <- getNodeSet(response, "//id/text()")
 +
unlist(lapply(node, toString.XMLNode))  # "6320147 "
  
{{task|1=
+
# We will be doing this a lot, so we write a function
# Access the [http://smart.embl-heidelberg.de/ '''SMART database'''] at http://smart.embl-heidelberg.de/
+
# for it...
# Click the lick to access SMART in the '''normal''' mode.
+
node2string <- function(doc, tag) {
# Paste the YFO Mbp1 UniProtKB Accession number into the '''Sequence ID or ACC''' field.
+
    # an extractor function for the contents of elements
# Check the boxes for:
+
    # between given tags in an XML response.
## '''PFAM domains''' (domains defined by sequence similarity in the PFAM database)
+
    # Contents of all matching elements is returned in
## '''signal peptides''' (using the Gunnar von Heijne's SignalP 4.0 server at the Technical University in Lyngby, Denmark)
+
    # a vector of strings.
## '''internal repeats''' (using the programs ''ariadne'' and ''prospero'' at the Wellcome Trust Centre for Human Genetics at Oxford University, England)
+
path <- paste("//", tag, "/text()", sep="")
## '''intrinsic protein disorder''' (using Rune Linding's DisEMBL program at the EMBL in Heidelberg, Germany)
+
nodes <- getNodeSet(doc, path)
# Click on '''Sequence SMART''' to run the search and annotation. <small>(In case you get an error like: "Sorry, your entry seems to have no SMART domain ...", let me know and repeat the search with the actual FASTA sequence instead of the accession number.)</small>
+
return(unlist(lapply(nodes, toString.XMLNode)))
 
+
}
Study the results. Specifically, have a look at the proteins with similar domain '''ORGANISATION''' and '''COMPOSITION'''. This is similar to the NCBI's CDART.
 
 
 
}}
 
 
 
=== Visual comparison of domain annotations in '''R''' ===
 
  
 +
# using node2string() ...
 +
GID <- node2string(response, "id")
 +
GID
  
The versatile plotting functions of '''R''' allow us to compare domain annotations. The distribution of segments that are annotated as being "low-complexity" or "disordered is particulalry interesting: these are functional features of the amino acid sequence that are often not associated with sequence similarity.
+
# The GI is the pivot for all our data requests at the
 +
# NCBI.  
  
In the following code tutorial, we create a plot similar to the CDD and SMART displays. It is based on the SMART domain annotations of the six-fungal reference species for the course.
+
# Let's first get the associated data for this GI
 +
URL <- paste(eUtilsBase,
 +
            "esummary.fcgi?",
 +
            "db=protein",
 +
            "&id=",
 +
            GID,
 +
            "&version=2.0",
 +
            sep="")
 +
response <- htmlParse(URL)
 +
URL
 +
response
  
{{task|1=
+
taxID <- node2string(response, "taxid")
 +
organism <- node2string(response, "organism")
 +
taxID
 +
organism
  
Copy the code to an '''R''' script, study and execute it.
 
<source lang="R">
 
  
# plotDomains
+
# Next, fetch the actual sequence
# tutorial and functions to plot a colored rectangle from a list of domain annotations
+
URL <- paste(eUtilsBase,
 +
            "efetch.fcgi?",
 +
            "db=protein",
 +
            "&id=",
 +
            GID,
 +
            "&retmode=text&rettype=fasta",
 +
            sep="")
 +
response <- htmlParse(URL)
 +
URL
 +
response
  
 +
fasta <- node2string(response, "p")
 +
fasta
  
# First task: create a list structure for the annotations: this is a list of lists
+
seq <- unlist(strsplit(fasta, "\\n"))[-1] # Drop the first elelment,
# As you see below, we need to mix strings, numbers and vectors of numbers. In R
+
                                          # it is the FASTA header.
# such mixed data types must go into a list.
+
seq
  
Mbp1Domains <- list()  # start with an empty list
 
  
# For each species annotation, compile the SMART domain annotations in a list.
+
# Next, fetch the crossreference to the NCBI Gene
Mbp1Domains <- rbind(Mbp1Domains, list(  # rbind() appends the list to the existing
+
# database
    species = "Saccharomyces cerevisiae",
+
URL <- paste(eUtilsBase,
    code    = "SACCE",
+
            "elink.fcgi?",
    ACC    = "P39678",
+
            "dbfrom=protein",
    length  = 833,
+
            "&db=gene",
    KilAN  = c(18,102),  # Note: Vector of (start, end) pairs
+
            "&id=",
    AThook  = NULL,      # Note: NULL, because this annotation was not observed in this sequence
+
            GID,
    Seg    = c(108,122,236,241,279,307,700,717),
+
            sep="")
    DisEMBL = NULL,
+
response <- htmlParse(URL)
    Ankyrin = c(394,423,427,463,512,541),  # Note: Merge overlapping domains, if present
+
URL
    Coils  = c(633, 655)
+
response
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
geneID <- node2string(response, "linksetdb/id")
    species = "Emericella nidulans",
+
geneID
    code    = "ASPNI",
 
    ACC    = "Q5B8H6",
 
    length  = 695,
 
    KilAN  = c(23,94),
 
    AThook  = NULL,
 
    Seg    = c(529,543),
 
    DisEMBL = NULL,
 
    Ankyrin = c(260,289,381,413),
 
    Coils  = c(509,572)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
# ... and the actual Gene record:
    species = "Candida albicans",
+
URL <- paste(eUtilsBase,
    code    = "CANAL",
+
            "esummary.fcgi?",
    ACC    = "Q5ANP5",
+
            "&db=gene",
    length  = 852,
+
            "&id=",
    KilAN  = c(19,102),
+
            geneID,
    AThook  = NULL,
+
            sep="")
    Seg    = c(351,365,678,692),
+
response <- htmlParse(URL)
    DisEMBL = NULL,
+
URL
    Ankyrin = c(376,408,412,448,516,545),
+
response
    Coils  = c(665,692)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
name <- node2string(response, "name")
    species = "Neurospora crassa",
+
genome_xref <- node2string(response, "chraccver")
    code    = "NEUCR",
+
genome_from <- node2string(response, "chrstart")[1]
    ACC    = "Q7RW59",
+
genome_to <- node2string(response, "chrstop")[1]
    length  = 833,
+
name
    KilAN  = c(31,110),
+
genome_xref
    AThook  = NULL,
+
genome_from
    Seg    = c(130,141,253,266,514,525,554,564,601,618,620,629,636,652,658,672,725,735,752,771),
+
genome_to
    DisEMBL = NULL,
 
    Ankyrin = c(268,297,390,419),
 
    Coils  = c(500,550)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
# So far so good. But since we need to do this a lot
    species = "Schizosaccharomyces pombe",
+
# we need to roll all of this into a function.
    code    = "SCHPO",
 
    ACC    = "P41412",
 
    length  = 657,
 
    KilAN  = c(21,97),
 
    AThook  = NULL,
 
    Seg    = c(111,125,136,145,176,191,422,447),
 
    DisEMBL = NULL,
 
    Ankyrin = c(247,276,368,397),
 
    Coils  = c(457,538)
 
    ))
 
  
Mbp1Domains <- rbind(Mbp1Domains, list(
+
# I have added the function to the dbUtilities code
    species = "Ustilago maydis",
+
# so you can update it easily.
    code   = "USTMA",
 
    ACC    = "Q4P117",
 
    length  = 956,
 
    KilAN  = c(21,98),
 
    AThook  = NULL,
 
    Seg    = c(106,116,161,183,657,672,776,796),
 
    DisEMBL = NULL,
 
    Ankyrin = c(245,274,355,384),
 
    Coils  = c(581,609)
 
    ))
 
  
 +
# Run:
  
# Working with data in lists and dataframes can be awkward, since the result
+
updateDbUtilities("55ca561e2944af6e9ce5cf2a558d0a3c588ea9af")
# of filters and slices are themselves lists, not vectors.
 
# Therefore we need to use the unlist() function a lot. When in doubt: unlist()
 
  
#### Boxes #####
+
# If that is successful, try these three testcases
# Define a function to draw colored boxes, given input of a vector with
 
# (start,end) pairs, a color, and the height where the box should go.
 
drawBoxes <- function(v, c, h) {  # vector of xleft, xright pairs; color; height
 
    if (is.null(v)) { return() }
 
    for (i in seq(1,length(v),by=2)) {
 
        rect(v[i], h-0.1, v[i+1], h+0.1, border="black", col=c)
 
    }
 
}
 
  
#### Annotations ####
+
myNewDB <- createDB()
# Define a function to write the species code, draw a grey
+
tmp <- fetchProteinData("NP_010227") # Mbp1p
# horizontal line and call drawBoxes() for every annotation type
+
tmp
# in the list
+
myNewDB <- addToDB(myNewDB, tmp)
drawGene <- function(rIndex) {
+
myNewDB
    # define colors:
 
    kil <- "#32344F"
 
    ank <- "#691A2C"
 
    seg <- "#598C9E"
 
    coi <- "#8B9998"
 
    xxx <- "#EDF7F7"
 
   
 
    text (-30, rIndex, adj=1, labels=unlist(Mbp1Domains[rIndex,"code"]), cex=0.75 )
 
    lines(c(0, unlist(Mbp1Domains[rIndex,"length"])), c(rIndex, rIndex), lwd=3, col="#999999")
 
  
    drawBoxes(unlist(Mbp1Domains[rIndex,"KilAN"]),  kil, rIndex)
+
tmp <- fetchProteinData("NP_011036") # Swi4p
    drawBoxes(unlist(Mbp1Domains[rIndex,"AThook"]),  xxx, rIndex)
+
tmp
    drawBoxes(unlist(Mbp1Domains[rIndex,"Seg"]),    seg, rIndex)
+
myNewDB <- addToDB(myNewDB, tmp)
    drawBoxes(unlist(Mbp1Domains[rIndex,"DisEMBL"]), xxx, rIndex)
+
myNewDB
    drawBoxes(unlist(Mbp1Domains[rIndex,"Ankyrin"]), ank, rIndex)
 
    drawBoxes(unlist(Mbp1Domains[rIndex,"Coils"]),  coi, rIndex)
 
}
 
  
#### Plot ####
+
tmp <- fetchProteinData("NP_012881") # Phd1p
# define the size of the plot-frame
+
tmp
yMax <- length(Mbp1Domains[,1])   # number of domains in list
+
myNewDB <- addToDB(myNewDB, tmp)
xMax <- max(unlist(Mbp1Domains[,"length"]))  # largest sequence length
+
myNewDB
  
# plot an empty frame
 
plot(1,1, xlim=c(-100,xMax), ylim=c(0, yMax) , type="n", yaxt="n", bty="n", xlab="sequence number", ylab="")
 
  
# Finally, iterate over all species and call drawGene()
 
for (i in 1:length(Mbp1Domains[,1])) {
 
    drawGene(i)
 
}
 
 
# end
 
  
 
</source>
 
</source>
}}
 
 
  
When you execute the code, your plot should look similar to this one:
 
  
[[Image:DomainAnnotations.jpg|frame|none|SMART domain annotations for Mbp1 proteins from six fungal species.
 
]]
 
 
{{task|1=
 
 
On the Student Wiki, add the annotations for YFO to the plot:
 
 
# Copy one of the list definitions for Mbp1 domains and edit it with the appropriate values for your own annotations.
 
# Test that you can add the YFO annotation to the plot.
 
# Submit your validated code block to the [http://biochemistry.utoronto.ca/steipe/abc/students/index.php/BCH441_2014_Assignment_4_domain_annotations '''Student Wiki here''']. The goal is to compile an overview of all species we are studying in class. 
 
# If your working annotation block is in the Wiki before noontime on Wednesday, you will be awarded a 10% bonus on the quiz.
 
 
}}
 
}}
  
  
==EC==
 
  
 +
This new <code>fetchProteinData()</code> function seems to be quite convenient. I have compiled a [[Reference_APSES_proteins_(reference_species)|set of APSES domain proteins]] for ten fungi species and loaded the 48 protein's data into an R database in a few minutes. This "reference database" will be automatically loaded for you with the '''next''' dbUtilities update. Note that it will be recreated every time you start up '''R'''. This means two things: (i) if you break something in the reference database, it's not a problem. (ii) if you store your own data in it, it will be lost. In order to add your own genes, you need to make a working copy for yourself.
  
  
==GO==
+
====Computer literacy====
  
==Introduction==
 
  
{{#pmid: 18563371}}
+
;Digression - some musings on computer literacy and code engineering.
{{#pmid: 19957156}}
+
It's really useful to get into a consistent habit of giving your files a meaningful name. The name should include something that tells you what the file contains, and something that tells you the date or version. I give versions major and minor numbers, and - knowing how much things always change - I write major version numbers with a leading zero eg. <code>04</code> so that they will be correctly sorted by name in a directory listing. The same goes for dates: always write <code>YYYY-MM-DD</code> to ensure proper sorting.
  
==GO==
+
On the topic of versions: creating the database with its data structures and the functions that operate on them is an ongoing process, and changes in one part of the code may have important consequences for another part. Imagine I made a poor choice of a column name early on: changing that would need to be done in every single function of the code that reads or writes or analyzes data. Once the code reaches a certain level of complexity, organizing it well is just as important as writing it in the first place. In the new update of <code>dbUtilities.R</code>, a database has a <code>$version</code> element, and every function checks that the database version matches the version for which the function was written. Obviously, this also means the developer must provide tools to migrate contents from an older version to a newer version. And since migrating can run into trouble and leave all data in an inconsistent and unfixable state, it's a good time to remind you to back up important data frequently. Of course you will want to save your database once you've done any significant work with it. And you will especially want to save the databases you create for your Term Project. But you should also (and perhaps more importantly) save the script that you use to create the database in the first place. And on that note: when was the last time you made a full backup of your computer's hard-drive? Too long ago? I thought so.
The Gene Ontology project is the most influential contributor to the definition of function in computational biology and the use of GO terms and GO annotations is ubiquitous.
 
  
{{WWW|WWW_GO}}
+
;Backup your hard-drive now.
{{#pmid: 21330331}}
 
  
The GO actually comprises three separate ontologies:
 
  
;Molecular function
+
If your last backup at the time of next week's quiz was less than two days ago, you will receive a 0.5 mark bonus.
:...
 
  
  
;Biological Process
+
===New Database ===
:...
 
  
 +
Here is some sample code to work with the new database, enter new protein data for YFO, save it and load it again when needed.
  
;Cellular component:
 
: ...
 
  
 +
<source lang="R">
 +
# You don't need to load the reference database refDB. If
 +
# everything is set up correctly, it gets loaded at startup.
 +
# (Just so you know: you can turn off that behaviour if you
 +
# ever should want to...)
  
===GO terms===
 
GO terms comprise the core of the information in the ontology: a carefully crafted definition of a term in any of GO's separate ontologies.
 
  
 +
# First you need to load the newest version of dbUtilities.R
  
 +
updateDButilities("7bb32ab3d0861ad81bdcb9294f0f6a737b820bf9")
  
===GO relationships===
+
# If you get an error:
The nature of the relationships is as much a part of the ontology as the terms themselves. GO uses three categories of relationships:
+
#    Error: could not find function "updateDButilities"
 +
# ... then it seems you didn't do the previous update.
  
* is a
+
# Try getting the update with the new key but the previous function:
* part of
+
# updateDbUtilities()
* regulates
+
#
 +
# If that function is not found either, confirm that your ~/.Rprofile
 +
# actually loads dbUtilites.R from your project directory.
  
 +
# As a desperate last resort, you could uncomment
 +
# the following piece of code and run the update
 +
# without verification...
 +
#
 +
# URL <- "http://steipe.biochemistry.utoronto.ca/abc/images/f/f9/DbUtilities.R"
 +
# download.file(URL, paste(PROJECTDIR, "dbUtilities.R", sep="")), method="auto")
 +
# source(paste(PROJECTDIR, "dbUtilities.R", sep=""))
 +
#
 +
# But be cautious: there is no verification. You yourself need
 +
# to satisfy yourself that this "file from the internet" is what
 +
# it should be, before source()'ing it...
  
===GO annotations===
 
The GO terms are conceptual in nature, and while they represent our interpretation of biological phenomena, they do not intrinsically represent biological objects, such a specific genes or proteins. In order to link molecules with these concepts, the ontology is used to '''annotate''' genes. The annotation project is referred to as GOA.
 
  
{{#pmid:18287709}}
+
# After the file has been source()'d,  refDB exists.
 +
ls(refDB)
  
  
===GO evidence codes===
+
# check the contents of refDB:
Annotations can be made according to literature data or computational inference and it is important to note how an annotation has been justified by the curator to evaluate the level of trust we should have in the annotation. GO uses evidence codes to make this process transparent. When computing with the ontology, we may want to filter (exclude) particular terms in order to avoid tautologies: for example if we were to infer functional relationships between homologous genes, we should exclude annotations that have been based on the same inference or similar, and compute only with the actual experimental data.
+
refDB$protein$name
 +
refDB$taxonomy
  
The following evidence codes are in current use; if you want to exclude inferred anotations you would restrict the codes you use to the ones shown in bold: EXP, IDA, IPI, IMP, IEP, and perhaps IGI, although the interpretation of genetic interactions can require assumptions.
 
  
;Automatically-assigned Evidence Codes
+
# list refSeqIDs for saccharomyces cerevisiae genes.
*IEA: Inferred from Electronic Annotation
+
refDB$protein[refDB$protein$taxID == 559292, "refSeqID"]
;Curator-assigned Evidence Codes
 
*'''Experimental Evidence Codes'''
 
**EXP: Inferred from Experiment
 
**IDA: Inferred from Direct Assay
 
**IPI: Inferred from Physical Interaction
 
**IMP: Inferred from Mutant Phenotype
 
**IGI: Inferred from Genetic Interaction
 
**IEP: Inferred from Expression Pattern</b>
 
*'''Computational Analysis Evidence Codes'''
 
**ISS: Inferred from Sequence or Structural Similarity
 
**ISO: Inferred from Sequence Orthology
 
**ISA: Inferred from Sequence Alignment
 
**ISM: Inferred from Sequence Model
 
**IGC: Inferred from Genomic Context
 
**IBA: Inferred from Biological aspect of Ancestor
 
**IBD: Inferred from Biological aspect of Descendant
 
**IKR: Inferred from Key Residues
 
**IRD: Inferred from Rapid Divergence
 
**RCA: inferred from Reviewed Computational Analysis
 
*'''Author Statement Evidence Codes'''
 
**TAS: Traceable Author Statement
 
**NAS: Non-traceable Author Statement
 
*'''Curator Statement Evidence Codes'''
 
**IC: Inferred by Curator
 
**ND: No biological Data available
 
  
For further details, see the [http://www.geneontology.org/GO.evidence.shtml Guide to GO Evidence Codes] and the [http://www.geneontology.org/GO.evidence.tree.shtml GO Evidence Code Decision Tree].
 
  
 +
# To add some genes from YFO, I proceed as follows.
 +
# Obviously, you need to adapt this to your YFO
 +
# and the sequences in YFO that you have found
 +
# with your PSI-BLAST search.
  
&nbsp;
+
# Let's assume my YFO is the fly agaric (amanita muscaria)
 +
# and its APSES domain proteins have the following IDs
 +
# (these are not refSeq btw. and thus unlikely
 +
# to be found in UniProt) ...
 +
# KIL68212
 +
# KIL69256
 +
# KIL65817
 +
#
  
===GO tools===
 
  
For many projects, the simplest approach will be to download the GO ontology itself. It is a well constructed, easily parseable file that is well suited for computation. For details, see [[Computing with GO]] on this wiki.
+
# First, I create a copy of the database with a name that
 +
# I will recognize to be associated with my YFO.
 +
amamuDB <- refDB
  
  
 +
# Then I fetch my protein data ...
 +
tmp1 <- fetchProteinData("KIL68212")
 +
tmp2 <- fetchProteinData("KIL69256")
 +
tmp3 <- fetchProteinData("KIL65817")
  
  
 +
# ... and if I am satisfied that it contains what I
 +
# want, I add it to the database.
 +
amamuDB <- addToDB(amamuDB, tmp1)
 +
amamuDB <- addToDB(amamuDB, tmp2)
 +
amamuDB <- addToDB(amamuDB, tmp3)
  
===AmiGO===
 
practical work with GO: at first via the AmiGO browser
 
[http://amigo.geneontology.org/cgi-bin/amigo/go.cgi '''AmiGO'''] is a [http://www.geneontology.org/ '''GO'''] browser developed by the Gene Ontology consortium and hosted on their website.
 
  
====AmiGO - Gene products====
+
# Then I make a local backup copy. Note the filename and
{{task|1=
+
# version number :-)
# Navigate to the [http://www.geneontology.org/ '''GO'''] homepage.
+
save(amamuDB, file="amamuDB.01.RData")
# Enter <code>SOX2</code> into the search box to initiate a search for the human SOX2 transcription factor ({{WP|SOX2|WP}}, [http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=11195 HUGO]) (as ''gene or protein name'').
+
# There are a number of hits in various organisms: ''sulfhydryl oxidases'' and ''(sex determining region Y)-box'' genes. Check to see the various ways by which you could filter and restrict the results.
 
# Select ''Homo sapiens'' as the '''species''' filter and set the filter. Note that this still does not give you a unique hit, but ...
 
# ... you can identify the '''[http://amigo.geneontology.org/cgi-bin/amigo/gp-details.cgi?gp=UniProtKB:P48431 Transcription factor SOX-2]''' and follow its gene product information link. Study the information on that page.
 
# Later, we will need Entrez Gene IDs. The GOA information page provides these as '''GeneID''' in the '''External references''' section. Note it down. With the same approach, find and record the Gene IDs (''a'') of the functionally related [http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=9221 Oct4 (POU5F1)] protein, (''b'') the human cell-cycle transcription factor [http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=3113 E2F1], (''c'') the human bone morphogenetic protein-4 transforming growth factor [http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=1071 BMP4], (''d'') the human UDP glucuronosyltransferase 1 family protein 1, an enzyme that is differentially expressed in some cancers, [http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=12530 UGT1A1], and (''d'') as a positive control, SOX2's interaction partner [http://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=20857 NANOG], which we would expect to be annotated as functionally similar to both Oct4 and SOX2.
 
}}
 
  
 +
# Now I can explore my new database ...
 +
amamuDB$protein[amamuDB$protein$taxID == 946122, "refSeqID"]
  
<!--
 
SOX2: 6657
 
POU5F1: 5460
 
E2F1: 1869
 
BMP4: 652
 
UGT1A1: 54658
 
NANOG: 79923
 
  
mgeneSim(c("6657", "5460", "1869", "652", "54658", "79923"), ont="BP", organism="human", measure="Wang")
+
# ... but if anything goes wrong, for example
-->
+
# if I make a mistake in checking which
 +
# rows contain taxID 946122 ...
 +
amamuDB$protein$taxID = 946122
  
====AmiGO - Associations====
+
# Ooops ... what did I just do wrong?
GO annotations for a protein are called ''associations''.
+
#      ... wnat happened instead?
  
{{task|1=
+
amamuDB$protein$taxID
# Open the ''associations'' information page for the human SOX2 protein via the [http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:P48431 link in the right column] in a separate tab. Study the information on that page.
 
# Note that you can filter the associations by ontology and evidence code. You have read about the three GO ontologies in your previous assignment, but you should also be familiar with the evidence codes. Click on any of the evidence links to access the Evidence code definition page and study the [http://www.geneontology.org/GO.evidence.shtml definitions of the codes]. '''Make sure you understand which codes point to experimental observation, and which codes denote computational inference, or say that the evidence is someone's opinion (TAS, IC ''etc''.).''' <small>Note: it is good practice - but regrettably not universally implemented standard - to clearly document database semantics and keep definitions associated with database entries easily accessible, as GO is doing here. You won't find this everywhere, but as a user please feel encouraged to complain to the database providers if you come across a database where the semantics are not clear. Seriously: opaque semantics make database annotations useless.</small> 
 
# There are many associations (around 60) and a good way to select which ones to pursue is to follow the '''most specific''' ones. Set <code>IDA</code> as a filter and among the returned terms select <code>GO:0035019</code> &ndash; [http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0035019 ''somatic stem cell maintenance''] in the '''Biological Process''' ontology. Follow that link.
 
# Study the information available on that page and through the tabs on the page, especially the graph view.
 
# In the '''Inferred Tree View''' tab, find the genes annotated to this go term for ''homo sapiens''. There should be about 55. Click on [http://amigo.geneontology.org/cgi-bin/amigo/term-assoc.cgi?term=GO:0035019&speciesdb=all&taxid=9606 the number behind the term]. The resulting page will give you all human proteins that have been annotated with this particular term. Note that the great majority of these is via the <code>IEA</code> evidence code.
 
}}
 
  
  
====Semantic similarity====
+
# ... I can simply recover from my backup copy:
 +
load("amamuDB.01.RData")   
 +
amamuDB$protein$taxID
  
A good, recent overview of ontology based functional annotation is found in the following article. This is not a formal reading assignment, but do familiarize yourself with section 3: ''Derivation of Semantic Similarity between Terms in an Ontology'' as an introduction to the code-based annotations below.
 
  
{{#pmid: 23533360}}
+
</source>
  
  
Practical work with GO:  bioconductor.
+
&nbsp;
 
 
The bioconductor project hosts the GOSemSim package for semantic similarity.
 
 
 
 
{{task|1=
 
{{task|1=
# Work through the following R-code. If you have problems, discuss them on the mailing list. Don't go through the code mechanically but make sure you are clear about what it does.
 
<source lang="R">
 
# GOsemanticSimilarity.R
 
# GO semantic similarity example
 
# B. Steipe for BCB420, January 2014
 
  
setwd("~/your-R-project-directory")
+
;Create your own version of the protein database by adding all the genes from YFO that you have discovered with the PSI-BLAST search for the APSES domain. Save it.
 
 
# GOSemSim is an R-package in the bioconductor project. It is not installed via
 
# the usual install.packages() comand (via CRAN) but via an installation script
 
# that is run from the bioconductor Website.
 
 
 
source("http://bioconductor.org/biocLite.R")
 
biocLite("GOSemSim")
 
 
 
library(GOSemSim)
 
 
 
# This loads the library and starts the Bioconductor environment.
 
# You can get an overview of functions by executing ...
 
browseVignettes()
 
# ... which will open a listing in your Web browser. Open the
 
# introduction to GOSemSim PDF. As the introduction suggests,
 
# now is a good time to execute ...
 
help(GOSemSim)
 
 
 
# The simplest function is to measure the semantic similarity of two GO
 
# terms. For example, SOX2 was annotated with GO:0035019 (somatic stem cell
 
# maintenance), QSOX2 was annotated with GO:0045454 (cell redox homeostasis),
 
# and Oct4 (POU5F1) with GO:0009786 (regulation of asymmetric cell division),
 
# among other associations. Lets calculate these similarities.
 
goSim("GO:0035019", "GO:0009786", ont="BP", measure="Wang")
 
goSim("GO:0035019", "GO:0045454", ont="BP", measure="Wang")
 
 
 
# Fair enough. Two numbers. Clearly we would appreciate an idea of the values
 
# that high similarity and low similarity can take. But in any case -
 
# we are really less interested in the similarity of GO terms - these
 
# are a function of how the Ontology was constructed. We are more
 
# interested in the functional similarity of our genes, and these
 
# have a number of GO terms associated with them.
 
 
 
# GOSemSim provides the functions ...
 
?geneSim()
 
?mgeneSim()
 
# ... to compute these values. Refer to the vignette for details, in
 
# particular, consider how multiple GO terms are combined, and how to
 
# keep/drop evidence codes.
 
# Here is a pairwise similarity example: the gene IDs are the ones you
 
# have recorded previously. Note that this will download a package
 
# of GO annotations - you might not want to do this on a low-bandwidth
 
# connection.
 
geneSim("6657", "5460", ont = "BP", measure="Wang", combine = "BMA")
 
# Another number. And the list of GO terms that were considered.
 
 
 
# Your task: use the mgeneSim() function to calculate the similarities
 
# between all six proteins for which you have recorded the GeneIDs
 
# previously (SOX2, POU5F1, E2F1, BMP4, UGT1A1 and NANOG) in the
 
# biological process ontology.
 
 
 
# This will run for some time. On my machine, half an hour or so.  
 
 
 
# Do the results correspond to your expectations?
 
 
 
</source>
 
  
 
}}
 
}}
  
===GO reading and resources===
 
;General
 
<div class="reference-box">[http://www.obofoundry.org/ '''OBO Foundry''' (Open Biological and Biomedical Ontologies)]</div>
 
{{#pmid: 18793134}}
 
  
 
+
&nbsp;
;Phenotype ''etc.'' Ontologies
 
<div class="reference-box">[http://http://www.human-phenotype-ontology.org/ '''Human Phenotype Ontology''']<br/>
 
See also: {{#pmid: 24217912}}</div>
 
{{#pmid: 22080554}}
 
{{#pmid: 21437033}}
 
{{#pmid: 20004759}}
 
{{#pmid: 16982638}}
 
 
 
 
 
;Semantic similarity
 
{{#pmid: 23741529}}
 
{{#pmid: 23533360}}
 
{{#pmid: 22084008}}
 
{{#pmid: 21078182}}
 
{{#pmid: 20179076}}
 
 
 
;GO
 
{{#pmid: 22102568}}
 
{{#pmid: 21779995}}
 
{{#pmid: 19920128}}
 
Carol Goble on the tension between purists and pragmatists in life-science ontology construction. Plenary talk at SOFG2...
 
{{#pmid: 18629186}}
 
  
  
 +
;TBC
  
;That is all.
 
  
  
 
&nbsp;
 
&nbsp;
  
== Links and resources ==
 
  
<!-- {{#pmid: 19957275}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
  
  

Latest revision as of 05:54, 17 November 2015

Assignment for Week 6
Function

< Assignment 5 Assignment 7 >

Note! This assignment is currently inactive. Major and minor unannounced changes may be made at any time.

 
 

Concepts and activities (and reading, if applicable) for this assignment will be topics on next week's quiz.




 

Introduction

 

In this assignment we will first download a number of APSES domain containing sequences into our database - and we will automate the process. Then we will annotate them with domain data. First manually, and then again, we will automate this. Next we will extract the APSES domains from our database according to the annotations. And finally we will align them, and visualize domain conservation in the 3D model to study parts of the protein that are conserved.


 

Downloading Protein Data From the Web

In Assignment 3 we created a schema for a local protein sequence collection, and implemented it as an R list. We added sequences to this database by hand, but since the information should be cross-referenced and available based on a protein's RefSeq ID, we should really have a function that automates this process. It is far too easy to make mistakes and enter erroneous information otherwise.


Task:
Work through the following code examples.

# To begin, we load some libraries with functions
# we need...

# httr sends and receives information via the http
# protocol, just like a Web browser.
if (!require(httr, quietly=TRUE)) { 
	install.packages("httr")
	library(httr)
}

# NCBI's eUtils send information in XML format; we
# need to be able to parse XML.
if (!require(XML, quietly=TRUE)) {
	install.packages("XML")
	library(XML)
}

# stringr has a number of useful utility functions
# to work with strings. E.g. a function that
# strips leading and trailing whitespace from
# strings.
if (!require(stringr, quietly=TRUE)) {
	install.packages("stringr")
	library(stringr)
}


# We will walk through the process with the refSeqID
# of yeast Mbp1
refSeqID <- "NP_010227"


# UniProt.
# The UniProt ID mapping service supports a "RESTful
# API": responses can be obtained simply via a Web-
# browsers request. Such requests are commonly sent
# via the GET or POST verbs that a Webserver responds
# to, when a client asks for data. GET requests are 
# visible in the URL of the request; POST requests
# are not directly visible, they are commonly used
# to send the contents of forms, or when transmitting
# larger, complex data items. The UniProt ID mapping
# sevice can accept long lists of IDs, thus using the
# POST mechanism makes sense.

# R has a POST() function as part of the httr package.

# It's very straightforward to use: just define the URL
# of the server and send a list of items as the 
# body of the request.

# uniProt ID mapping service
URL <- "http://www.uniprot.org/mapping/"
response <- POST(URL, 
                 body = list(from = "P_REFSEQ_AC",
                             to = "ACC",
                             format = "tab",
                             query = refSeqID))

response

# If the query is successful, tabbed text is returned.
# and we capture the fourth element as the requested
# mapped ID.
unlist(strsplit(content(response), "\\s+"))

# If the query can't be fulfilled because of a problem
# with the server, a WebPage is rturned. But the server status
# is also returned and we can check the status code. I have
# lately gotten many "503" status codes: Server Not Available...

if (response$status_code == 200) { # 200: oK
	uniProtID <- unlist(strsplit(content(response), "\\s+"))[4]
	if (is.na(uniProtID)) {
	warning(paste("UniProt ID mapping service returned NA.",
	              "Check your RefSeqID."))
	}
} else {
	uniProtID <- NA
	warning(paste("No uniProt ID mapping available:",
	              "server returned status",
	              response$status_code))
}

uniProtID  # Let's see what we got...
           # This should be "P39678"
           # (or NA if the query failed)


Next, we'll retrieve data from the various NCBI databases.

It is has become unreasonably difficult to screenscrape the NCBI site since the actual page contents are dynamically loaded via AJAX. This may be intentional, or just overengineering. While NCBI offers a subset of their data via the eutils API and that works well enough, some of the data that is available to the Web browser's eyes is not served to a program.

The eutils API returns data in XML format. Have a look at the following URL in your browser to see what that looks like:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=NP_010227


# In order to parse such data, we need tools from the 
# XML package. 

# First we build a query URL...
eUtilsBase <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


# Then we assemble an URL that will search for get the
# unique, NCBI internal identifier,  the GI number,
# for our refSeqID...
URL <- paste(eUtilsBase,
             "esearch.fcgi?",     # ...using the esearch program
                                  # that finds an entry in an
                                  # NCBI database
             "db=protein",
             "&term=", refSeqID,
             sep="")
# Copy the URL and paste it into your browser to see
# what the response should look like.
URL

# To fetch a response in R, we use the function htmlParse()
# with our URL as its argument.
response <- htmlParse(URL)
response

# This is XML. We can take the response apart into
# its indvidual components with the xmlToList function.

xmlToList(response)

# Note how the XML "tree" is represented as a list of
# lists of lists ...
# If we know exactly what elelement we are looking for,
# we can extract it from this structure:
xmlToList(response)[["body"]][["esearchresult"]][["idlist"]][["id"]]

# But this is not very robus, it would break with the
# slightest change that the NCBI makes to their response
# and the NCBI changes things A LOT!

# Somewhat more robust is to specify the type of element
# we want - its the text contained in an <id>...</id>
# elelement, and use the XPath XML parsing language to
# retrieve it.

# getNodeSet() lets us fetch tagged contents by 
# applying toString.XMLNode() to it...

node <- getNodeSet(response, "//id/text()")
unlist(lapply(node, toString.XMLNode))  # "6320147 "

# We will be doing this a lot, so we write a function
# for it...
node2string <- function(doc, tag) {
    # an extractor function for the contents of elements
    # between given tags in an XML response.
    # Contents of all matching elements is returned in
    # a vector of strings.
	path <- paste("//", tag, "/text()", sep="")
	nodes <- getNodeSet(doc, path)
	return(unlist(lapply(nodes, toString.XMLNode)))
}

# using node2string() ...
GID <- node2string(response, "id")
GID

# The GI is the pivot for all our data requests at the
# NCBI. 

# Let's first get the associated data for this GI
URL <- paste(eUtilsBase,
             "esummary.fcgi?",
             "db=protein",
             "&id=",
             GID,
             "&version=2.0",
             sep="")
response <- htmlParse(URL)
URL
response

taxID <- node2string(response, "taxid")
organism <- node2string(response, "organism")
taxID
organism


# Next, fetch the actual sequence
URL <- paste(eUtilsBase,
             "efetch.fcgi?",
             "db=protein",
             "&id=",
             GID,
             "&retmode=text&rettype=fasta",
             sep="")
response <- htmlParse(URL)
URL
response

fasta <- node2string(response, "p")
fasta

seq <- unlist(strsplit(fasta, "\\n"))[-1] # Drop the first elelment,
                                          # it is the FASTA header.
seq


# Next, fetch the crossreference to the NCBI Gene
# database
URL <- paste(eUtilsBase,
             "elink.fcgi?",
             "dbfrom=protein",
             "&db=gene",
             "&id=",
             GID,
             sep="")
response <- htmlParse(URL)
URL
response

geneID <- node2string(response, "linksetdb/id")
geneID

# ... and the actual Gene record:
URL <- paste(eUtilsBase,
             "esummary.fcgi?",
             "&db=gene",
             "&id=",
             geneID,
             sep="")
response <- htmlParse(URL)
URL
response

name <- node2string(response, "name")
genome_xref <- node2string(response, "chraccver")
genome_from <- node2string(response, "chrstart")[1]
genome_to <- node2string(response, "chrstop")[1]
name
genome_xref
genome_from
genome_to

# So far so good. But since we need to do this a lot
# we need to roll all of this into a function. 

# I have added the function to the dbUtilities code
# so you can update it easily.

# Run:

updateDbUtilities("55ca561e2944af6e9ce5cf2a558d0a3c588ea9af")

# If that is successful, try these three testcases

myNewDB <- createDB()
tmp <- fetchProteinData("NP_010227") # Mbp1p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB

tmp <- fetchProteinData("NP_011036") # Swi4p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB

tmp <- fetchProteinData("NP_012881") # Phd1p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB


This new fetchProteinData() function seems to be quite convenient. I have compiled a set of APSES domain proteins for ten fungi species and loaded the 48 protein's data into an R database in a few minutes. This "reference database" will be automatically loaded for you with the next dbUtilities update. Note that it will be recreated every time you start up R. This means two things: (i) if you break something in the reference database, it's not a problem. (ii) if you store your own data in it, it will be lost. In order to add your own genes, you need to make a working copy for yourself.


Computer literacy

Digression - some musings on computer literacy and code engineering.

It's really useful to get into a consistent habit of giving your files a meaningful name. The name should include something that tells you what the file contains, and something that tells you the date or version. I give versions major and minor numbers, and - knowing how much things always change - I write major version numbers with a leading zero eg. 04 so that they will be correctly sorted by name in a directory listing. The same goes for dates: always write YYYY-MM-DD to ensure proper sorting.

On the topic of versions: creating the database with its data structures and the functions that operate on them is an ongoing process, and changes in one part of the code may have important consequences for another part. Imagine I made a poor choice of a column name early on: changing that would need to be done in every single function of the code that reads or writes or analyzes data. Once the code reaches a certain level of complexity, organizing it well is just as important as writing it in the first place. In the new update of dbUtilities.R, a database has a $version element, and every function checks that the database version matches the version for which the function was written. Obviously, this also means the developer must provide tools to migrate contents from an older version to a newer version. And since migrating can run into trouble and leave all data in an inconsistent and unfixable state, it's a good time to remind you to back up important data frequently. Of course you will want to save your database once you've done any significant work with it. And you will especially want to save the databases you create for your Term Project. But you should also (and perhaps more importantly) save the script that you use to create the database in the first place. And on that note: when was the last time you made a full backup of your computer's hard-drive? Too long ago? I thought so.

Backup your hard-drive now.


If your last backup at the time of next week's quiz was less than two days ago, you will receive a 0.5 mark bonus.


New Database

Here is some sample code to work with the new database, enter new protein data for YFO, save it and load it again when needed.


# You don't need to load the reference database refDB. If
# everything is set up correctly, it gets loaded at startup.
# (Just so you know: you can turn off that behaviour if you
# ever should want to...)


# First you need to load the newest version of dbUtilities.R

updateDButilities("7bb32ab3d0861ad81bdcb9294f0f6a737b820bf9")

# If you get an error: 
#    Error: could not find function "updateDButilities"
# ... then it seems you didn't do the previous update.

# Try getting the update with the new key but the previous function:
# updateDbUtilities()
#
# If that function is not found either, confirm that your ~/.Rprofile
# actually loads dbUtilites.R from your project directory. 

# As a desperate last resort, you could uncomment
# the following piece of code and run the update
# without verification...
#
# URL <- "http://steipe.biochemistry.utoronto.ca/abc/images/f/f9/DbUtilities.R"
# download.file(URL, paste(PROJECTDIR, "dbUtilities.R", sep="")), method="auto")
# source(paste(PROJECTDIR, "dbUtilities.R", sep=""))
#
# But be cautious: there is no verification. You yourself need
# to satisfy yourself that this "file from the internet" is what 
# it should be, before source()'ing it...


# After the file has been source()'d,  refDB exists.
ls(refDB)


# check the contents of refDB:
refDB$protein$name
refDB$taxonomy


# list refSeqIDs for saccharomyces cerevisiae genes.
refDB$protein[refDB$protein$taxID == 559292, "refSeqID"]


# To add some genes from YFO, I proceed as follows.
# Obviously, you need to adapt this to your YFO
# and the sequences in YFO that you have found
# with your PSI-BLAST search.

# Let's assume my YFO is the fly agaric (amanita muscaria)
# and its APSES domain proteins have the following IDs
# (these are not refSeq btw. and thus unlikely
# to be found in UniProt) ...
# KIL68212
# KIL69256
# KIL65817
#


# First, I create a copy of the database with a name that
# I will recognize to be associated with my YFO.
amamuDB <- refDB


# Then I fetch my protein data ...
tmp1 <- fetchProteinData("KIL68212")
tmp2 <- fetchProteinData("KIL69256")
tmp3 <- fetchProteinData("KIL65817")


# ... and if I am satisfied that it contains what I
# want, I add it to the database.
amamuDB <- addToDB(amamuDB, tmp1)
amamuDB <- addToDB(amamuDB, tmp2)
amamuDB <- addToDB(amamuDB, tmp3)


# Then I make a local backup copy. Note the filename and
# version number  :-)
save(amamuDB, file="amamuDB.01.RData")
 

# Now I can explore my new database ...
amamuDB$protein[amamuDB$protein$taxID == 946122, "refSeqID"]


# ... but if anything goes wrong, for example 
# if I make a mistake in checking which
# rows contain taxID 946122 ... 
amamuDB$protein$taxID = 946122

# Ooops ... what did I just do wrong?
#       ... wnat happened instead? 

amamuDB$protein$taxID


# ... I can simply recover from my backup copy:
load("amamuDB.01.RData")    
amamuDB$protein$taxID


 

Task:

Create your own version of the protein database by adding all the genes from YFO that you have discovered with the PSI-BLAST search for the APSES domain. Save it.


 


TBC


 



 


Footnotes and references


 

Ask, if things don't work for you!

If anything about the assignment is not clear to you, please ask on the mailing list. You can be certain that others will have had similar problems. Success comes from joining the conversation.



< Assignment 5 Assignment 7 >