Difference between revisions of "RPR-Scripting data downloads"

From "A B C"
Jump to navigation Jump to search
m
m
Line 592: Line 592:
  
 
At the bottom of these sequences, you should see the APSES sequences from
 
At the bottom of these sequences, you should see the APSES sequences from
YFO, '''in particular the Mbp1 RBM sequence from YFO'''. Email me if you have trouble getting to that stage.
+
MYSPE, '''in particular the Mbp1 RBM sequence from MYSPE'''. Email me if you have trouble getting to that stage.
  
 
We'll need to align these sequences with the template...
 
We'll need to align these sequences with the template...

Revision as of 02:52, 4 October 2017

Scripting Data Downloads


 

Keywords:  Techniques for accessing databases and downloading data


 



 


Caution!

This unit is under development. There is some contents here but it is incomplete and/or may change significantly: links may lead to nowhere, the contents is likely going to be rearranged, and objectives, deliverables etc. may be incomplete or missing. Do not work with this material until it is updated to "live" status.


 


Abstract

...


 


This unit ...

Prerequisites

You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:

  • The Central Dogma: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.

You need to complete the following units before beginning this one:


 


Objectives

...


 


Outcomes

...


 


Deliverables

  • Time management: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
  • Journal: Document your progress in your Course Journal. Some tasks may ask you to include specific items in your journal. Don't overlook these.
  • Insights: If you find something particularly noteworthy about this unit, make a note in your insights! page.


 


Evaluation

Evaluation: NA

This unit is not evaluated for course marks.


 


Contents

Downloading Protein Data From the Web

In Assignment 3 we created a schema for a local protein sequence collection, and implemented it as an R list. We added sequences to this database by hand, but since the information should be cross-referenced and available based on a protein's RefSeq ID, we should really have a function that automates this process. It is far too easy to make mistakes and enter erroneous information otherwise.


Task:
Work through the following code examples.

# To begin, we load some libraries with functions
# we need...

# httr sends and receives information via the http
# protocol, just like a Web browser.
if (!require(httr, quietly=TRUE)) {
	install.packages("httr")
	library(httr)
}

# NCBI's eUtils send information in XML format; we
# need to be able to parse XML.
if (!require(XML, quietly=TRUE)) {
	install.packages("XML")
	library(XML)
}

# stringr has a number of useful utility functions
# to work with strings. E.g. a function that
# strips leading and trailing whitespace from
# strings.
if (!require(stringr, quietly=TRUE)) {
	install.packages("stringr")
	library(stringr)
}


# We will walk through the process with the refSeqID
# of yeast Mbp1
refSeqID <- "NP_010227"


# UniProt.
# The UniProt ID mapping service supports a "RESTful
# API": responses can be obtained simply via a Web-
# browsers request. Such requests are commonly sent
# via the GET or POST verbs that a Webserver responds
# to, when a client asks for data. GET requests are
# visible in the URL of the request; POST requests
# are not directly visible, they are commonly used
# to send the contents of forms, or when transmitting
# larger, complex data items. The UniProt ID mapping
# sevice can accept long lists of IDs, thus using the
# POST mechanism makes sense.

# R has a POST() function as part of the httr package.

# It's very straightforward to use: just define the URL
# of the server and send a list of items as the
# body of the request.

# uniProt ID mapping service
URL <- "http://www.uniprot.org/mapping/"
response <- POST(URL,
                 body = list(from = "P_REFSEQ_AC",
                             to = "ACC",
                             format = "tab",
                             query = refSeqID))

response

# If the query is successful, tabbed text is returned.
# and we capture the fourth element as the requested
# mapped ID.
unlist(strsplit(content(response), "\\s+"))

# If the query can't be fulfilled because of a problem
# with the server, a WebPage is rturned. But the server status
# is also returned and we can check the status code. I have
# lately gotten many "503" status codes: Server Not Available...

if (response$status_code == 200) { # 200: oK
	uniProtID <- unlist(strsplit(content(response), "\\s+"))[4]
	if (is.na(uniProtID)) {
	warning(paste("UniProt ID mapping service returned NA.",
	              "Check your RefSeqID."))
	}
} else {
	uniProtID <- NA
	warning(paste("No uniProt ID mapping available:",
	              "server returned status",
	              response$status_code))
}

uniProtID  # Let's see what we got...
           # This should be "P39678"
           # (or NA if the query failed)


Next, we'll retrieve data from the various NCBI databases.

It is has become unreasonably difficult to screenscrape the NCBI site since the actual page contents are dynamically loaded via AJAX. This may be intentional, or just overengineering. While NCBI offers a subset of their data via the eutils API and that works well enough, some of the data that is available to the Web browser's eyes is not served to a program.

The eutils API returns data in XML format. Have a look at the following URL in your browser to see what that looks like:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=NP_010227


# In order to parse such data, we need tools from the
# XML package.

# First we build a query URL...
eUtilsBase <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


# Then we assemble an URL that will search for get the
# unique, NCBI internal identifier,  the GI number,
# for our refSeqID...
URL <- paste(eUtilsBase,
             "esearch.fcgi?",     # ...using the esearch program
                                  # that finds an entry in an
                                  # NCBI database
             "db=protein",
             "&term=", refSeqID,
             sep="")
# Copy the URL and paste it into your browser to see
# what the response should look like.
URL

# To fetch a response in R, we use the function htmlParse()
# with our URL as its argument.
response <- htmlParse(URL)
response

# This is XML. We can take the response apart into
# its indvidual components with the xmlToList function.

xmlToList(response)

# Note how the XML "tree" is represented as a list of
# lists of lists ...
# If we know exactly what elelement we are looking for,
# we can extract it from this structure:
xmlToList(response)[["body"]][["esearchresult"]][["idlist"]][["id"]]

# But this is not very robus, it would break with the
# slightest change that the NCBI makes to their response
# and the NCBI changes things A LOT!

# Somewhat more robust is to specify the type of element
# we want - its the text contained in an <id>...</id>
# elelement, and use the XPath XML parsing language to
# retrieve it.

# getNodeSet() lets us fetch tagged contents by
# applying toString.XMLNode() to it...

node <- getNodeSet(response, "//id/text()")
unlist(lapply(node, toString.XMLNode))  # "6320147 "

# We will be doing this a lot, so we write a function
# for it...
node2string <- function(doc, tag) {
    # an extractor function for the contents of elements
    # between given tags in an XML response.
    # Contents of all matching elements is returned in
    # a vector of strings.
	path <- paste("//", tag, "/text()", sep="")
	nodes <- getNodeSet(doc, path)
	return(unlist(lapply(nodes, toString.XMLNode)))
}

# using node2string() ...
GID <- node2string(response, "id")
GID

# The GI is the pivot for all our data requests at the
# NCBI.

# Let's first get the associated data for this GI
URL <- paste(eUtilsBase,
             "esummary.fcgi?",
             "db=protein",
             "&id=",
             GID,
             "&version=2.0",
             sep="")
response <- htmlParse(URL)
URL
response

taxID <- node2string(response, "taxid")
organism <- node2string(response, "organism")
taxID
organism


# Next, fetch the actual sequence
URL <- paste(eUtilsBase,
             "efetch.fcgi?",
             "db=protein",
             "&id=",
             GID,
             "&retmode=text&rettype=fasta",
             sep="")
response <- htmlParse(URL)
URL
response

fasta <- node2string(response, "p")
fasta

seq <- unlist(strsplit(fasta, "\\n"))[-1] # Drop the first elelment,
                                          # it is the FASTA header.
seq


# Next, fetch the crossreference to the NCBI Gene
# database
URL <- paste(eUtilsBase,
             "elink.fcgi?",
             "dbfrom=protein",
             "&db=gene",
             "&id=",
             GID,
             sep="")
response <- htmlParse(URL)
URL
response

geneID <- node2string(response, "linksetdb/id")
geneID

# ... and the actual Gene record:
URL <- paste(eUtilsBase,
             "esummary.fcgi?",
             "&db=gene",
             "&id=",
             geneID,
             sep="")
response <- htmlParse(URL)
URL
response

name <- node2string(response, "name")
genome_xref <- node2string(response, "chraccver")
genome_from <- node2string(response, "chrstart")[1]
genome_to <- node2string(response, "chrstop")[1]
name
genome_xref
genome_from
genome_to

# So far so good. But since we need to do this a lot
# we need to roll all of this into a function.

# I have added the function to the dbUtilities code
# so you can update it easily.

# Run:

updateDbUtilities("55ca561e2944af6e9ce5cf2a558d0a3c588ea9af")

# If that is successful, try these three testcases

myNewDB <- createDB()
tmp <- fetchProteinData("NP_010227") # Mbp1p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB

tmp <- fetchProteinData("NP_011036") # Swi4p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB

tmp <- fetchProteinData("NP_012881") # Phd1p
tmp
myNewDB <- addToDB(myNewDB, tmp)
myNewDB


This new fetchProteinData() function seems to be quite convenient. I have compiled a set of APSES domain proteins for ten fungi species and loaded the 48 protein's data into an R database in a few minutes. This "reference database" will be automatically loaded for you with the next dbUtilities update. Note that it will be recreated every time you start up R. This means two things: (i) if you break something in the reference database, it's not a problem. (ii) if you store your own data in it, it will be lost. In order to add your own genes, you need to make a working copy for yourself.



Task:

  • In our case it seems the best results are had when searching the Prosite database with the ScanProsite interface.

Let's have a first look at ScanProsite, using the yeast Mbp1 sequence. We need the UniProt ID to search Prosite. With your protein database loaded in a fresh R session, type

# (commands indented, to align their components and
# help you understand their relationship)

       refDB$protein$uniProtID
                               which(refDB$protein$name == "MBP1")
       refDB$protein$uniProtID[which(refDB$protein$name == "MBP1")]
uID <- refDB$protein$uniProtID[which(refDB$protein$name == "MBP1")]
uID
  • Navigate to ScanProsite, paste the UniprotID for yeast Mbp1 into the text field, select Table output for STEP 3, and START THE SCAN.

You should see four feature hits: the APSES domain, and three ankyrin domain sequences that partially overlap. We could copy and paste the start and end numbers and IDs but that would be lame. Let's get them directly from Prosite instead, because we will want to fetch a few of these. Prosite does not have a nice API interface like UniProt, but the principles of using R's httr package to send POST requests and retrieve the results are the same. Getting data informally from Webpages is called screenscraping and really a life-saving skill. The first step to capture the data from this page via screenscraping is to look into the HTML code of the page.

(I am writing this section from the perspective of the Chrome browser - I don't think other browsers have all of the functionality that I am describing here. You may need to install Chrome to try this...)

  • Use the menu and access ViewDeveloperView Source. Scroll through the page. You should easily be able to identify the data table. That's fair enough: each of the lines contain the UniProt ID and we should be able to identify them. But how to send the request to get this page in the first place?
  • Use the browser's back button, and again: ViewDeveloperView Source. This is the page that accepts user input in a so called form via several different types of elements: "radio-buttons", a "text-box", "check-boxes", a "drop down menu" and a "submit" button. We need to figure out what each of the values are so that we can construct a valid POST request. If we get them wrong, in the wrong order, or have parts missing, it is likely that the server will simply ignore our request. These elements are much harder to identify thean the lines of feature information, and it's really easy to get them wrong, miss something and get no output. But Chrome has a great tool to help us: it allows you to see the exact, assembled POST header that it sent to the Prosite server!
  • On the scanProsite page, open ViewDeveloperDeveloper Tools in the Chrome menu. Then click again on START THE SCAN. The Developer Tools page will show you information about what just happened in the transaction it negotiated to retrieve the results page. Click on the Network tab, and then on the top element: PSScan.cgi. This contains the form data. Then click on the Headers tab and scroll down until you see the Request Payload. This has all the the required POST elements nicely spelled out. No guesswork required. What worked from the browser should work the same way from an R script. Analogous to our UniProt fetch code, we create a POST query:
URL <- "http://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi"
response <- POST(URL,
                 body = list(meta = "opt1",
                             meta1_protein = "opt1",
                             seq = "P39678",
                             skip = "on",
                             output = "tabular"))
# Note how the list-elements correspond to the page header's
# Request Payload. We include everything but the value of the
# submit button (which is for display only) in our POST
# request.

# Send off this request, and you should have a response in a few
# seconds.

# The text contents of the response is available with the
# content() function:
content(response, "text")

# ... should show you the same as the page contents that
# you have seen in the browser. Now we need to extract
# the data from the page: we need regular expressions, but
# only simple ones. First, we strsplit() the response into
# individual lines, since each of our data elements is on
# its own line. We simply split on the "\\n" newline character.

lines <- unlist(strsplit(content(response, "text"), "\\n"))
head(lines)

# Now we define a query pattern for the lines we want:
# we can use the uID, bracketed by two "|" pipe
# characters:

pattern <- paste("\\|", uID, "\\|", sep="")

# ... and select only the lines that match this
# pattern:

lines <- lines[grep(pattern, lines)]
lines

# ... captures the four lines of output.

# Now we break the lines apart into
# apart in tokens: this is another application of
# strsplit(), but this time we split either on
# "pipe" characters, "|" OR on tabs "\t". Look at the
# regex "\\t|\\|" in the strsplit() call:

strsplit(lines[1], "\\t|\\|")

# Its parts are (\\t)=tab (|)=or (\\|)=pipe.
# Both "t" and "|" need to be escaped with a backslash.
# "t" has to be escaped because we want to match a tab (\t),
# not the literal character "t". And "|" has to be escaped
# because we mean the literal pipe character, not its
# usual (special) meaning OR. Thus sometimes the backslash
# turns a special meaning off, and sometimes it turns a
# special meaning on. Unfortunately there's no easy way
# to tell - you just need to remember the characters - or
# have a reference handy. The special characters are
# (){}[]^$?*+.|&-   ... and some of them have different
# meanings depending on where in the regex they are.

# Let's put the tokens into named slots of a vector.

features <- list()
for (line in lines) {
    tokens <- unlist(strsplit(line, "\\t|\\|"))
    features <- rbind(features, c(uID   =  tokens[2],
                                  start =  tokens[4],
                                  end   =  tokens[5],
                                  psID  =  tokens[6],
                                  psName = tokens[7]))
}
features

This forms the base of a function that collects the features automatically from a PrositeScan result. We still need to do a bit more on the database part, but this is mostly bookkeeping:

  • We need to put the feature annotations into a database table and link them to a protein ID and to a description of the feature itself.
  • We need a function that extracts feature sequences in FASTA format.
  • And, since we are changing the structure of the database, we need a way to migrate your old database contents to a newer version.

I don't think much new can be learned from this, so I have written those functions and put them into dbUtilities.R But you can certainly learn something from having a look at the code of

  • fetchPrositeFeatures()
  • addFeatureToDB()
  • getFeatureFASTA()

Also, have a quick look back at our database schema: this update has implemented the proteinFeature and the feature table. Do you remember what they were good for?

Time for a database update. You must be up to date with the latest version of dbUtilities.R for this to work. When you are, execute the following steps:

updateVerifiedFile("363ffbae3ff21ba80aa4fbf90dcc75164dbf10f8")

# Make a backup copy of your protein database.
# Load your protein database. Then merge the data in your database
# with the updated reference database. (Obviously, substitute the
# actual filename in the placeholder strings below. And don't type
# the angled brackets!)

<my-new-database> <- mergeDB(<my-old-database>, refDB)

# check that this has worked:
str(<my-new-database>)

# and save your database.

save(<my-new-database>, file="<my-DB-filename.02>.RData")

# Now, for each of your proteins, add the domain annotations to
# the database. You could write a loop to do this but it's probably
# better to check the results of each annotation before committing
# it to the database. So just paste the UniProt Ids as argument of
# the function fetchPrositeFeatures(), execute and repeat.


features <- fetchPrositeFeatures(<one-of-my-proteins-uniProt-IDs>)
refDB <- addFeatureToDB(refDB, features)

# When you are done, save your database.

Finally, we can create a sequence selection of APSES domains from our reference proteins. The function getFeatureFasta()

  • accepts a feature name such as "HTH_APSES";
  • finds the corresponding feature ID;
  • finds all matching entries in the proteinFeature table;
  • looks up the start and end position of each feature;
  • fetches the corresponding substring from the sequence entries;
  • adds a meaningful header line; and
  • writes everything to output.

... so that you can simply execute:

cat(getFeatureFasta(<my-new-database>, "HTH_APSES"))

Here are the first five sequences from that result:

>CC1G_01306_COPCI    HTH_APSES 6:112
IFKATYSGIPVYEMMCKGVAVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREVQKGEHE
KVQGGYGKYQGTWIPLERGMQLAKQYNCEHLLRPIIEFTPAAKSPPL
>CNBB4890_CRYNE    HTH_APSES 17:123
IYKATYSGVPVYEMVCRDVAVMRRRSDAYLNATQILKVAGFDKPQRTRVLEREVQKGEHE
KVQGGYGKYQGTWIPIERGLALAKQYGVEDILRPIIDYVPTSVSPPP
>COCMIDRAFT_338_BIPOR    HTH_APSES 9:115
IYSATYSNVPVYECNVNGHHVMRRRADDWINATHILKVADYDKPARTRILEREVQKGVHE
KVQGGYGKYQGTWIPLEEGRGLAERNGVLDKMRAIFDYVPGDRSPPP
>WALSEDRAFT_68476_WALME    HTH_APSES 83:192
IYSAVYSGVGVYEAMIRGIAVMRRRADGYMNATQILKVAGVDKGRRTKILEREILAGLHE
KIQGGYGKYQGTWIPFERGRELALQYGCDHLLAPIFDFNPSVMQPSAGRS
>PGTG_08863_PUCGR    HTH_APSES 90:196
IYKATYSGVPVLEMPCEGIAVMRRRSDSWLNATQILKVAGFDKPQRTRVLEREIQKGTHE
KIQGGYGKYQGTWVPLDRGIDLAKQYGVDHLLSALFNFQPSSNESPP
[...]


At the bottom of these sequences, you should see the APSES sequences from MYSPE, in particular the Mbp1 RBM sequence from MYSPE. Email me if you have trouble getting to that stage.

We'll need to align these sequences with the template...




 


Further reading, links and resources

 


Notes


 


Self-evaluation

 



 




 

If in doubt, ask! If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.



 

About ...
 
Author:

Boris Steipe <boris.steipe@utoronto.ca>

Created:

2017-08-05

Modified:

2017-08-05

Version:

0.1

Version history:

  • 0.1 First stub

CreativeCommonsBy.png This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.