<div id="BIO">
+
<div id="ABC">
  <div class="b1">
+
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;">
 
Scripting Data Downloads
 
Scripting Data Downloads
  </div>
+
<div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; ">
 
+
(Techniques for accessing databases and downloading data)
  {{Vspace}}
+
</div>
 
 
<div class="keywords">
 
<b>Keywords:</b>&nbsp;
 
Techniques for accessing databases and downloading data
 
 
</div>
 
</div>
  
{{Vspace}}
+
{{Smallvspace}}
 
 
 
 
__TOC__
 
 
 
{{Vspace}}
 
 
 
 
 
{{DEV}}
 
 
 
{{Vspace}}
 
  
  
 +
<div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;">
 +
<div style="font-size:118%;">
 +
<b>Abstract:</b><br />
 +
<section begin=abstract />
 +
Often we need to automate access to databases that provide their data via Web interfaces, or are only designed to be viewed in Web browsers. This unit discussess three strategies: working with text data that is accessed through GET and POST commands, and parsing simple XML formatted data.
 +
<section end=abstract />
 
</div>
 
</div>
<div id="ABC-unit-framework">
+
<!-- ============================  -->
== Abstract ==
+
<hr>
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "abstract" -->
+
<table>
...
+
<tr>
 
+
<td style="padding:10px;">
{{Vspace}}
+
<b>Objectives:</b><br />
 
+
This unit will ...
 
+
* ... introduce the GET and POST verbs to interface with Web servers;
== This unit ... ==
+
* ... demonstrate parsing of text and XML responses.
=== Prerequisites ===
+
</td>
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "prerequisites" -->
+
<td style="padding:10px;">
<!-- included from "ABC-unit_components.wtxt", section: "notes-external_prerequisites" -->
+
<b>Outcomes:</b><br />
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:
+
After working through this unit you ...
<!-- included from "FND-prerequisites.wtxt", section: "central_dogma" -->
+
* ... can construct GET and POST queries to Web servers;
 +
* ... can parse text data and XML data;
 +
* ... have integrated sample code for this into a utility function.
 +
</td>
 +
</tr>
 +
</table>
 +
<!-- ============================ -->
 +
<hr>
 +
<b>Deliverables:</b><br />
 +
<section begin=deliverables />
 +
<li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li>
 +
<li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li>
 +
<li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li>
 +
<section end=deliverables />
 +
<!-- ============================  -->
 +
<hr>
 +
<section begin=prerequisites />
 +
<b>Prerequisites:</b><br />
 +
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:<br />
 
*<b>The Central Dogma</b>: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
 
*<b>The Central Dogma</b>: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
<!-- included from "ABC-unit_components.wtxt", section: "notes-prerequisites" -->
+
This unit builds on material covered in the following prerequisite units:<br />
You need to complete the following units before beginning this one:
+
*[[BIN-Data_integration|BIN-Data_integration (Data Integration)]]
*[[BIN-Data_integration]]
+
<section end=prerequisites />
 +
<!-- ============================  -->
 +
</div>
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Objectives ===
 
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "objectives" -->
 
...
 
  
{{Vspace}}
+
{{Smallvspace}}
  
  
=== Outcomes ===
+
__TOC__
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "outcomes" -->
 
...
 
 
 
{{Vspace}}
 
 
 
 
 
=== Deliverables ===
 
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "deliverables" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-time_management" -->
 
*<b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-journal" -->
 
*<b>Journal</b>: Document your progress in your [[FND-Journal|course journal]].
 
<!-- included from "ABC-unit_components.wtxt", section: "deliverables-insights" -->
 
*<b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|insights! page]].
 
  
 
{{Vspace}}
 
{{Vspace}}
=== Evaluation ===
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "evaluation" -->
<!-- included from "ABC-unit_components.wtxt", section: "eval-none" -->
<b>Evaluation: NA</b><br />
<div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div>

{{Vspace}}

</div>
<div id="BIO">
 
 
== Contents ==
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "contents" -->

Many databases provide download links for their holdings, and/or convenient subsets of their data, but sometimes we need to access the data piece by piece - either because no bulk download is available, or because the full dataset is unreasonably large. For the odd protein here or there we may be able to get the information from a Web page by hand, but this is tedious and it is easy to make mistakes. It is much better to learn how to script data downloads.

In this unit we cover three download strategies. The first example is the UniProt interface, from which we retrieve the FASTA sequence of a protein with a simple GET request for a text file. The second example retrieves motif annotations from PROSITE - a POST request, with subsequent parsing of a table. The final example retrieves XML data from the NCBI via their E-utils interface.

{{Vspace}}

===UniProt GET===

{{ABC-unit|RPR-UniProt_GET.R}}
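The script works through the details. As a minimal sketch of the idea - assuming that UniProt serves plain-text FASTA at a URL of the form <code>https://www.uniprot.org/uniprot/&lt;ID>.fasta</code>, which you should verify against the current UniProt documentation - a GET request with <code>httr</code> could look like this:

<source lang="R">
# Minimal sketch: fetch a FASTA sequence from UniProt with a GET request.
# (Assumption: UniProt serves plain-text FASTA at this URL; check the
#  current UniProt REST documentation if this has changed.)

if (! requireNamespace("httr", quietly = TRUE)) {
  install.packages("httr")
}

myURL <- "https://www.uniprot.org/uniprot/P39678.fasta"   # yeast Mbp1

response <- httr::GET(myURL)

if (httr::status_code(response) == 200) {                 # 200: success
  fasta <- httr::content(response, "text", encoding = "UTF-8")
  cat(fasta)                                              # header line plus sequence
} else {
  warning(sprintf("Server returned status %d.", httr::status_code(response)))
}
</source>

Splitting the result on newline characters and dropping the first element (the FASTA header) leaves just the sequence lines.

{{Vspace}}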
<source lang="R">
 
  
# To begin, we load some libraries with functions
+
===ScanPprosite POST===
# we need...
 
  
# httr sends and receives information via the http
+
ScanProsite is a tool to search for the occurrence of expert-curated motifs in the PROSITE database in a sequence of interest.
# protocol, just like a Web browser.
 
if (!require(httr, quietly=TRUE)) {
 
install.packages("httr")
 
library(httr)
 
}
 
  
# NCBI's eUtils send information in XML format; we
+
{{task|1=
# need to be able to parse XML.
 
if (!require(XML, quietly=TRUE)) {
 
install.packages("XML")
 
library(XML)
 
}
 
  
# stringr has a number of useful utility functions
+
ScanProsite uses UniProt IDs. The UniProt ID for yeast Mbp1 is <code>P39678</code>.
# to work with strings. E.g. a function that
 
# strips leading and trailing whitespace from
 
# strings.
 
if (!require(stringr, quietly=TRUE)) {
 
install.packages("stringr")
 
library(stringr)
 
}
 
  
 +
* Navigate to [http://prosite.expasy.org/scanprosite/ ScanProsite], paste <code>P39678</code> into the text field, select '''Table''' output from the dropdown menu in the STEP 3 section, and '''START THE SCAN'''.
  
# We will walk through the process with the refSeqID
+
You should see four feature hits: the APSES domain, and three ankyrin domain sequences that partially overlap. We could copy and paste the start and end numbers and IDs but that would be lame. Let's get them directly from Prosite instead, because later we will want to fetch a few of these annotations. Prosite does not have a nice API interface like UniProt, but the principles of using '''R''''s <code>httr</code> package to send POST requests and retrieve the results are the same. The parameters for the POST request are hidden in the so-called "form" element" that your browser sends to the PROSITE Web server. In order to construct our request correctly, we need to use the correct parameter names in our script, that the Web page assigns when it constructs input. The first step to capture the data from this page via screenscraping is to look into the HTML code of the page.
# of yeast Mbp1
 
refSeqID <- "NP_010227"
 
  
 +
(I am writing this section from the perspective of the Chrome browser - I don't think other browsers have all of the functionality that I am describing here. You may need to install Chrome to try this...)
  
# UniProt.
+
* Use the menu and access '''View''' &rarr; '''Developer''' &rarr; '''View Source'''. Scroll through the page. You should easily be able to identify the data table. That's fair enough: each of the lines contain the UniProt ID and we should be able to identify them. But how to send the request to get this page in the first place?
# The UniProt ID mapping service supports a "RESTful
 
# API": responses can be obtained simply via a Web-
 
# browsers request. Such requests are commonly sent
 
# via the GET or POST verbs that a Webserver responds
 
# to, when a client asks for data. GET requests are
 
# visible in the URL of the request; POST requests
 
# are not directly visible, they are commonly used
 
# to send the contents of forms, or when transmitting
 
# larger, complex data items. The UniProt ID mapping
 
# sevice can accept long lists of IDs, thus using the
 
# POST mechanism makes sense.
 
  
# R has a POST() function as part of the httr package.
+
*Use the browser's back button to go back to the original query form, and again: '''View''' &rarr; '''Developer''' &rarr; '''View Source'''. This is the page that accepts user input in a so called <code>form</code> via several different types of elements: "radio-buttons", a "text-box", "check-boxes", a "drop down menu" and a "submit" button. We need to figure out what each of the values are so that we can  construct a valid <code>POST</code> request. If we get them wrong, in the wrong order, or have parts missing, it is likely that the server will simply ignore our request. These elements are harder to identify than the lines of feature information, and it's really easy to get them wrong, miss something and get no output. But Chrome has a great tool to help us: it allows you to see the exact, assembled <code>POST</code> header that it sent to the Prosite server!
  
# It's very straightforward to use: just define the URL
+
* Close the page source, open '''View''' &rarr; '''Developer''' &rarr; '''Developer Tools''' in the Chrome menu. '''Then''' click again on '''START THE SCAN'''. The Developer Tools page will show you information about what just happened in the entire transaction that the browser negotiated to retrieve the results page. Click on the '''Network''' tab, on '''All''', and in the '''Names''' column, select: <code>PSScan.cgi</code>. This contains the form data. Then click on the '''Headers''' tab and scroll down until you see the '''Form Data'''. This has all the the required <code>POST</code> elements nicely spelled out. What you are looking for are key value pairs like:
# of the server and send a list of items as the
 
# body of the request.
 
  
# uniProt ID mapping service
+
::'''meta''':  <code>opt1</code>
URL <- "http://www.uniprot.org/mapping/"
+
::'''meta1_protein''':  <code>opt1</code>
response <- POST(URL,
+
::'''seq''':  <code>P39678</code>
                body = list(from = "P_REFSEQ_AC",
+
::etc ...
                            to = "ACC",
 
                            format = "tab",
 
                            query = refSeqID))
 
  
response
+
These are the field keys, and the required values. You have now reverse-engineered a Web form. Armed with this knowledge we can script it: what worked from the browser should work the same way from an '''R''' script.
  
# If the query is successful, tabbed text is returned.
+
}}
# and we capture the fourth element as the requested
 
# mapped ID.
 
unlist(strsplit(content(response), "\\s+"))
 
  
# If the query can't be fulfilled because of a problem
+
{{Smallvspace}}
# with the server, a WebPage is rturned. But the server status
 
# is also returned and we can check the status code. I have
 
# lately gotten many "503" status codes: Server Not Available...
 
  
if (response$status_code == 200) { # 200: oK
+
{{ABC-unit|RPR-PROSITE_POST.R}}
uniProtID <- unlist(strsplit(content(response), "\\s+"))[4]
 
if (is.na(uniProtID)) {
 
warning(paste("UniProt ID mapping service returned NA.",
 
              "Check your RefSeqID."))
 
}
 
} else {
 
uniProtID <- NA
 
warning(paste("No uniProt ID mapping available:",
 
              "server returned status",
 
              response$status_code))
 
}
 
  
uniProtID  # Let's see what we got...
+
{{Vspace}}
          # This should be "P39678"
 
          # (or NA if the query failed)
 
</source>
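The course script <code>RPR-PROSITE_POST.R</code> develops this fully. As a minimal sketch of the request itself - the endpoint URL and field names below are the ones captured from the form data above, so verify them against the current ScanProsite page if the request is ignored - the <code>POST</code> call and a first pass at parsing the tabular response could look like this:

<source lang="R">
# Sketch: send the reverse-engineered form as a POST request and pick the
# data lines out of the tabular response.

if (! requireNamespace("httr", quietly = TRUE)) {
  install.packages("httr")
}

myURL <- "http://prosite.expasy.org/cgi-bin/prosite/PSScan.cgi"

response <- httr::POST(myURL,
                       body = list(meta = "opt1",
                                   meta1_protein = "opt1",
                                   seq = "P39678",
                                   skip = "on",
                                   output = "tabular"))

# The response body is the same text you saw in the browser ...
myLines <- unlist(strsplit(httr::content(response, "text", encoding = "UTF-8"), "\n"))

# ... and the feature lines can be recognized by the UniProt ID bracketed
# by "|" characters:
myLines <- myLines[grep("\\|P39678\\|", myLines)]

# Splitting each line on tabs or pipes gives tokens that contain the
# start and end coordinates and the PROSITE identifiers:
strsplit(myLines, "\\t|\\|")
</source>

Note that the list elements of <code>body</code> correspond exactly to the key-value pairs of the captured form data; the value of the submit button itself is not needed.

{{Vspace}}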
 
  
===NCBI Entrez E-Utils===

It has become unreasonably difficult to screenscrape the NCBI site, since the actual page contents are dynamically loaded via AJAX. This may be intentional, or just overengineering. While NCBI offers a subset of their data via the eutils API, and that works well enough, some of the data that is visible to a Web browser is not served to a program.

The eutils API returns data in XML format. Have a look at the following URL in your browser to see what that looks like:

<div class="reference-box"><tt>
[http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=NP_010227 http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&<span style="color:#CC0000;">term=NP_010227</span>]
</tt></div>
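To issue the same query from '''R''' and extract the identifier from the XML, one possible approach - adapted from an earlier version of this walkthrough, using the <code>XML</code> package; see <code>RPR-eUtils_XML.R</code> below for the version used in this unit - is the following:

<source lang="R">
# Sketch: send the esearch query and extract the contents of the <Id>
# element from the XML response.

if (! requireNamespace("httr", quietly = TRUE)) { install.packages("httr") }
if (! requireNamespace("XML",  quietly = TRUE)) { install.packages("XML") }
library(XML)

eUtilsBase <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

myURL <- paste0(eUtilsBase,
                "esearch.fcgi?",         # esearch finds records in an NCBI database
                "db=protein",
                "&term=", "NP_010227")   # RefSeq ID of yeast Mbp1

myXML <- httr::content(httr::GET(myURL), "text", encoding = "UTF-8")
doc   <- htmlParse(myXML, asText = TRUE) # htmlParse() lower-cases element names

# Use an XPath expression to collect the text of all <id> elements:
nodes <- getNodeSet(doc, "//id/text()")
unlist(lapply(nodes, toString.XMLNode))  # the record's unique identifier
</source>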
<source lang="R">
 
 
 
# In order to parse such data, we need tools from the
 
# XML package.
 
 
 
# First we build a query URL...
 
eUtilsBase <- "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
 
 
 
 
 
# Then we assemble an URL that will search for get the
 
# unique, NCBI internal identifier,  the GI number,
 
# for our refSeqID...
 
URL <- paste(eUtilsBase,
 
            "esearch.fcgi?",    # ...using the esearch program
 
                                  # that finds an entry in an
 
                                  # NCBI database
 
            "db=protein",
 
            "&term=", refSeqID,
 
            sep="")
 
# Copy the URL and paste it into your browser to see
 
# what the response should look like.
 
URL
 
 
 
# To fetch a response in R, we use the function htmlParse()
 
# with our URL as its argument.
 
response <- htmlParse(URL)
 
response
 
 
 
# This is XML. We can take the response apart into
 
# its indvidual components with the xmlToList function.
 
 
 
xmlToList(response)
 
 
 
# Note how the XML "tree" is represented as a list of
 
# lists of lists ...
 
# If we know exactly what elelement we are looking for,
 
# we can extract it from this structure:
 
xmlToList(response)[["body"]][["esearchresult"]][["idlist"]][["id"]]
 
 
 
# But this is not very robus, it would break with the
 
# slightest change that the NCBI makes to their response
 
# and the NCBI changes things A LOT!
 
 
 
# Somewhat more robust is to specify the type of element
 
# we want - its the text contained in an <id>...</id>
 
# elelement, and use the XPath XML parsing language to
 
# retrieve it.
 
 
 
# getNodeSet() lets us fetch tagged contents by
 
# applying toString.XMLNode() to it...
 
 
 
node <- getNodeSet(response, "//id/text()")
 
unlist(lapply(node, toString.XMLNode))  # "6320147 "
 
 
 
# We will be doing this a lot, so we write a function
 
# for it...
 
node2string <- function(doc, tag) {
 
    # an extractor function for the contents of elements
 
    # between given tags in an XML response.
 
    # Contents of all matching elements is returned in
 
    # a vector of strings.
 
path <- paste("//", tag, "/text()", sep="")
 
nodes <- getNodeSet(doc, path)
 
return(unlist(lapply(nodes, toString.XMLNode)))
 
}
 
 
 
# using node2string() ...
 
GID <- node2string(response, "id")
 
GID
 
 
 
# The GI is the pivot for all our data requests at the
 
# NCBI.
 
 
 
# Let's first get the associated data for this GI
 
URL <- paste(eUtilsBase,
 
            "esummary.fcgi?",
 
            "db=protein",
 
            "&id=",
 
            GID,
 
            "&version=2.0",
 
            sep="")
 
response <- htmlParse(URL)
 
URL
 
response
 
 
 
taxID <- node2string(response, "taxid")
 
organism <- node2string(response, "organism")
 
taxID
 
organism
 
 
 
 
 
# Next, fetch the actual sequence
 
URL <- paste(eUtilsBase,
 
            "efetch.fcgi?",
 
            "db=protein",
 
            "&id=",
 
            GID,
 
            "&retmode=text&rettype=fasta",
 
            sep="")
 
response <- htmlParse(URL)
 
URL
 
response
 
 
 
fasta <- node2string(response, "p")
 
fasta
 
 
 
seq <- unlist(strsplit(fasta, "\\n"))[-1] # Drop the first elelment,
 
                                          # it is the FASTA header.
 
seq
 
 
 
 
 
# Next, fetch the crossreference to the NCBI Gene
 
# database
 
URL <- paste(eUtilsBase,
 
            "elink.fcgi?",
 
            "dbfrom=protein",
 
            "&db=gene",
 
            "&id=",
 
            GID,
 
            sep="")
 
response <- htmlParse(URL)
 
URL
 
response
 
 
 
geneID <- node2string(response, "linksetdb/id")
 
geneID
 
 
 
# ... and the actual Gene record:
 
URL <- paste(eUtilsBase,
 
            "esummary.fcgi?",
 
            "&db=gene",
 
            "&id=",
 
            geneID,
 
            sep="")
 
response <- htmlParse(URL)
 
URL
 
response
 
 
 
name <- node2string(response, "name")
 
genome_xref <- node2string(response, "chraccver")
 
genome_from <- node2string(response, "chrstart")[1]
 
genome_to <- node2string(response, "chrstop")[1]
 
name
 
genome_xref
 
genome_from
 
genome_to
 
 
 
# So far so good. But since we need to do this a lot
 
# we need to roll all of this into a function.
 
 
 
# I have added the function to the dbUtilities code
 
# so you can update it easily.
 
  
# Run:
+
Look at the contents of the <tt>&lt;ID>...&lt;/ID><tt> tag, and follow the next query:
 
 
updateDbUtilities("55ca561e2944af6e9ce5cf2a558d0a3c588ea9af")
 
 
 
# If that is successful, try these three testcases
 
 
 
myNewDB <- createDB()
 
tmp <- fetchProteinData("NP_010227") # Mbp1p
 
tmp
 
myNewDB <- addToDB(myNewDB, tmp)
 
myNewDB
 
 
 
tmp <- fetchProteinData("NP_011036") # Swi4p
 
tmp
 
myNewDB <- addToDB(myNewDB, tmp)
 
myNewDB
 
 
 
tmp <- fetchProteinData("NP_012881") # Phd1p
 
tmp
 
myNewDB <- addToDB(myNewDB, tmp)
 
myNewDB
 
 
 
 
 
 
 
</source>
 
 
 
 
 
}}
 
 
 
 
 
 
 
This new <code>fetchProteinData()</code> function seems to be quite convenient. I have compiled a [[Reference_APSES_proteins_(reference_species)|set of APSES domain proteins]] for ten fungi species and loaded the 48 protein's data into an R database in a few minutes. This "reference database" will be automatically loaded for you with the '''next''' dbUtilities update. Note that it will be recreated every time you start up '''R'''. This means two things: (i) if you break something in the reference database, it's not a problem. (ii) if you store your own data in it, it will be lost. In order to add your own genes, you need to make a working copy for yourself.
 
  
 +
<div class="reference-box"><tt>
 +
[http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=protein&id=6320147&version=2.0 http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=protein&<span style="color:#CC0000;">id=6320147</span>&version=2.0]
 +
</tt></div>
  
 +
Note the conceptual difference between search "term" and retrieval "id".
  
 +
An API to NCBIs Entrez system is provided through eUtils.
  
 
{{task|1=
 
{{task|1=
  
* In our case it seems the best results are had when searching the [http://prosite.expasy.org/prosite.html Prosite] database with the [http://prosite.expasy.org/scanprosite/ ScanProsite] interface.
+
Browse through the [https://www.ncbi.nlm.nih.gov/books/NBK25500/ E-utilities Quick Start chapter] of the NCBI's Entrez Programming Utilites Handbook for a quick overview.
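As a sketch of where this leads: once you have the unique identifier returned by esearch (<code>6320147</code> in the example above), <code>efetch</code> can return the sequence itself as FASTA-formatted text. The parameters below are standard <code>efetch</code> parameters; the course script linked below covers the full XML workflow.

<source lang="R">
# Sketch: use the identifier found by esearch to fetch the sequence
# record via efetch, as plain FASTA text.

if (! requireNamespace("httr", quietly = TRUE)) {
  install.packages("httr")
}

eUtilsBase <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

myURL <- paste0(eUtilsBase,
                "efetch.fcgi?",
                "db=protein",
                "&id=", "6320147",
                "&retmode=text&rettype=fasta")

fasta <- httr::content(httr::GET(myURL), "text", encoding = "UTF-8")
cat(fasta)

# Drop the header line to keep only the sequence:
mySeq <- unlist(strsplit(fasta, "\n"))[-1]
mySeq
</source>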
 
 
{{Smallvspace}}

{{ABC-unit|RPR-eUtils_XML.R}}

{{Vspace}}
 
 
 
== Further reading, links and resources ==
 
<!-- {{#pmid: 19957275}} -->
 
<!-- {{WWW|WWW_GMOD}} -->
 
<!-- <div class="reference-box">[http://www.ncbi.nlm.nih.gov]</div> -->
 
  
 
{{Vspace}}
  
 
== Notes ==
 
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "notes" -->
 
<!-- included from "ABC-unit_components.wtxt", section: "notes" -->
 
<references />
 
 
{{Vspace}}
 
 
 
</div>
 
<div id="ABC-unit-framework">
 
== Self-evaluation ==
 
<!-- included from "../components/RPR-Scripting_data_downloads.components.wtxt", section: "self-evaluation" -->
 
<!--
 
=== Question 1===
 
 
Question ...
 
 
<div class="toccolours mw-collapsible mw-collapsed" style="width:800px">
 
Answer ...
 
<div class="mw-collapsible-content">
 
Answer ...
 
 
</div>
 
  </div>
 
 
  {{Vspace}}
 
 
-->
 
 
{{Vspace}}
 
 
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_ask" -->
 
 
----
 
 
{{Vspace}}
 
 
<b>If in doubt, ask!</b> If anything about this learning unit is not clear to you, do not proceed blindly but ask for clarification. Post your question on the course mailing list: others are likely to have similar problems. Or send an email to your instructor.
 
 
----
 
 
{{Vspace}}
 
  
 
<div class="about">
 
<div class="about">
Line 668: Line 173:
 
:2017-08-05
 
:2017-08-05
 
<b>Modified:</b><br />
 
<b>Modified:</b><br />
:2017-08-05
+
:2020-09-24
 
<b>Version:</b><br />
 
<b>Version:</b><br />
:0.1
+
:1.2
 
<b>Version history:</b><br />
 
<b>Version history:</b><br />
 +
*1.2 2020 Updates
 +
*1.0.1 Update for Chromes' developer tools layout change
 +
*1.0 Working version
 
*0.1 First stub
 
*0.1 First stub
 
</div>
 
</div>
[[Category:ABC-units]]
 
<!-- included from "ABC-unit_components.wtxt", section: "ABC-unit_footer" -->
 
  
 
{{CC-BY}}
 
{{CC-BY}}
  
 +
[[Category:ABC-units]]
 +
{{UNIT}}
 +
{{LIVE}}
 
</div>
 
</div>
 
<!-- [END] -->
 
<!-- [END] -->
