Difference between revisions of "RPR-Scripting data downloads"
m |
m |
||
(11 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | <div id=" | + | <div id="ABC"> |
− | + | <div style="padding:5px; border:1px solid #000000; background-color:#b3dbce; font-size:300%; font-weight:400; color: #000000; width:100%;"> | |
Scripting Data Downloads | Scripting Data Downloads | ||
− | + | <div style="padding:5px; margin-top:20px; margin-bottom:10px; background-color:#b3dbce; font-size:30%; font-weight:200; color: #000000; "> | |
− | + | (Techniques for accessing databases and downloading data) | |
− | + | </div> | |
− | |||
− | |||
− | |||
− | Techniques for accessing databases and downloading data | ||
</div> | </div> | ||
− | {{ | + | {{Smallvspace}} |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | < | + | <div style="padding:5px; border:1px solid #000000; background-color:#b3dbce33; font-size:85%;"> |
− | <div | + | <div style="font-size:118%;"> |
− | + | <b>Abstract:</b><br /> | |
<section begin=abstract /> | <section begin=abstract /> | ||
− | + | Often we need to automate access to databases that provide their data via Web interfaces, or are only designed to be viewed in Web browsers. This unit discussess three strategies: working with text data that is accessed through GET and POST commands, and parsing simple XML formatted data. | |
− | |||
<section end=abstract /> | <section end=abstract /> | ||
− | + | </div> | |
− | + | <!-- ============================ --> | |
− | + | <hr> | |
− | + | <table> | |
− | == | + | <tr> |
− | === | + | <td style="padding:10px;"> |
− | < | + | <b>Objectives:</b><br /> |
− | <!-- | + | This unit will ... |
− | You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources: | + | * ... introduce the GET and POST verbs to interface with Web servers; |
− | < | + | * ... demonstrate parsing of text and XML responses. |
+ | </td> | ||
+ | <td style="padding:10px;"> | ||
+ | <b>Outcomes:</b><br /> | ||
+ | After working through this unit you ... | ||
+ | * ... can construct GET and POST queries to Web servers; | ||
+ | * ... can parse text data and XML data; | ||
+ | * ... have integrated sample code for this into a utility function. | ||
+ | </td> | ||
+ | </tr> | ||
+ | </table> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <b>Deliverables:</b><br /> | ||
+ | <section begin=deliverables /> | ||
+ | <li><b>Time management</b>: Before you begin, estimate how long it will take you to complete this unit. Then, record in your course journal: the number of hours you estimated, the number of hours you worked on the unit, and the amount of time that passed between start and completion of this unit.</li> | ||
+ | <li><b>Journal</b>: Document your progress in your [[FND-Journal|Course Journal]]. Some tasks may ask you to include specific items in your journal. Don't overlook these.</li> | ||
+ | <li><b>Insights</b>: If you find something particularly noteworthy about this unit, make a note in your [[ABC-Insights|'''insights!''' page]].</li> | ||
+ | <section end=deliverables /> | ||
+ | <!-- ============================ --> | ||
+ | <hr> | ||
+ | <section begin=prerequisites /> | ||
+ | <b>Prerequisites:</b><br /> | ||
+ | You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:<br /> | ||
*<b>The Central Dogma</b>: Regulation of transcription and translation; protein biosynthesis and degradation; quality control. | *<b>The Central Dogma</b>: Regulation of transcription and translation; protein biosynthesis and degradation; quality control. | ||
− | + | This unit builds on material covered in the following prerequisite units:<br /> | |
− | + | *[[BIN-Data_integration|BIN-Data_integration (Data Integration)]] | |
− | *[[BIN-Data_integration]] | + | <section end=prerequisites /> |
+ | <!-- ============================ --> | ||
+ | </div> | ||
− | {{ | + | {{Smallvspace}} |
− | |||
− | |||
− | |||
− | {{ | + | {{Smallvspace}} |
− | + | __TOC__ | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
{{Vspace}} | {{Vspace}} | ||
Line 76: | Line 69: | ||
=== Evaluation === | === Evaluation === | ||
− | |||
− | |||
<b>Evaluation: NA</b><br /> | <b>Evaluation: NA</b><br /> | ||
− | :This unit is not evaluated for course marks. | + | <div style="margin-left: 2rem;">This unit is not evaluated for course marks.</div> |
− | |||
− | |||
− | |||
− | |||
− | </div | ||
− | |||
== Contents == | == Contents == | ||
− | |||
+ | Many databases provide download links of their holdings, and/or covenient subsets of data, but sometimes we need to access the data piece by piece - either because no bulk download is available, or because the full dataset is unreasonably large. For the odd protein here or there, we may be able to get the information from a Web-page by hand, but this is tedious, and it is easy to make mistakes. Much better to learn how to script data downloads. | ||
− | + | In this unit we will cover three download strategies. Our first example is the UniProt interface from which we will retrieve the FASTA sequence of a protein with a simple GET request for a text file. The second example is to retieve motif annotations from PROSITE - a POST request, with subsequent parsing of a table. The final example is to retrieve XML data from the NCBI via their E-utils interface. | |
+ | {{Vspace}} | ||
− | + | ===UniProt GET=== | |
+ | {{ABC-unit|RPR-UniProt_GET.R}} | ||
− | |||
− | + | {{Vspace}} | |
− | |||
− | + | ===ScanPprosite POST=== | |
− | |||
− | + | ScanProsite is a tool to search for the occurrence of expert-curated motifs in the PROSITE database in a sequence of interest. | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | {{task|1= | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | ScanProsite uses UniProt IDs. The UniProt ID for yeast Mbp1 is <code>P39678</code>. | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
+ | * Navigate to [http://prosite.expasy.org/scanprosite/ ScanProsite], paste <code>P39678</code> into the text field, select '''Table''' output from the dropdown menu in the STEP 3 section, and '''START THE SCAN'''. | ||
− | + | You should see four feature hits: the APSES domain, and three ankyrin domain sequences that partially overlap. We could copy and paste the start and end numbers and IDs but that would be lame. Let's get them directly from Prosite instead, because later we will want to fetch a few of these annotations. Prosite does not have a nice API interface like UniProt, but the principles of using '''R''''s <code>httr</code> package to send POST requests and retrieve the results are the same. The parameters for the POST request are hidden in the so-called "form" element" that your browser sends to the PROSITE Web server. In order to construct our request correctly, we need to use the correct parameter names in our script, that the Web page assigns when it constructs input. The first step to capture the data from this page via screenscraping is to look into the HTML code of the page. | |
− | |||
− | |||
+ | (I am writing this section from the perspective of the Chrome browser - I don't think other browsers have all of the functionality that I am describing here. You may need to install Chrome to try this...) | ||
− | + | * Use the menu and access '''View''' → '''Developer''' → '''View Source'''. Scroll through the page. You should easily be able to identify the data table. That's fair enough: each of the lines contain the UniProt ID and we should be able to identify them. But how to send the request to get this page in the first place? | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | *Use the browser's back button to go back to the original query form, and again: '''View''' → '''Developer''' → '''View Source'''. This is the page that accepts user input in a so called <code>form</code> via several different types of elements: "radio-buttons", a "text-box", "check-boxes", a "drop down menu" and a "submit" button. We need to figure out what each of the values are so that we can construct a valid <code>POST</code> request. If we get them wrong, in the wrong order, or have parts missing, it is likely that the server will simply ignore our request. These elements are harder to identify than the lines of feature information, and it's really easy to get them wrong, miss something and get no output. But Chrome has a great tool to help us: it allows you to see the exact, assembled <code>POST</code> header that it sent to the Prosite server! | |
− | + | * Close the page source, open '''View''' → '''Developer''' → '''Developer Tools''' in the Chrome menu. '''Then''' click again on '''START THE SCAN'''. The Developer Tools page will show you information about what just happened in the entire transaction that the browser negotiated to retrieve the results page. Click on the '''Network''' tab, on '''All''', and in the '''Names''' column, select: <code>PSScan.cgi</code>. This contains the form data. Then click on the '''Headers''' tab and scroll down until you see the '''Form Data'''. This has all the the required <code>POST</code> elements nicely spelled out. What you are looking for are key value pairs like: | |
− | |||
− | |||
− | + | ::'''meta''': <code>opt1</code> | |
− | + | ::'''meta1_protein''': <code>opt1</code> | |
− | + | ::'''seq''': <code>P39678</code> | |
− | + | ::etc ... | |
− | |||
− | |||
− | |||
− | + | These are the field keys, and the required values. You have now reverse-engineered a Web form. Armed with this knowledge we can script it: what worked from the browser should work the same way from an '''R''' script. | |
− | + | }} | |
− | |||
− | |||
− | |||
− | + | {{Smallvspace}} | |
− | |||
− | |||
− | |||
− | + | {{ABC-unit|RPR-PROSITE_POST.R}} | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | } | ||
− | |||
− | |||
− | |||
− | |||
− | } | ||
− | + | {{Vspace}} | |
− | |||
− | |||
− | |||
+ | ===NCBI Entrez E-Utils=== | ||
− | |||
It is has become unreasonably difficult to screenscrape the NCBI site | It is has become unreasonably difficult to screenscrape the NCBI site | ||
Line 203: | Line 134: | ||
look at the following URL in your browser to see what that looks like: | look at the following URL in your browser to see what that looks like: | ||
− | http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=NP_010227 | + | <div class="reference-box"><tt> |
+ | [http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=NP_010227 http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&<span style="color:#CC0000;">term=NP_010227</span>] | ||
+ | </tt></div> | ||
+ | Look at the contents of the <tt><ID>...</ID><tt> tag, and follow the next query: | ||
− | < | + | <div class="reference-box"><tt> |
− | + | [http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=protein&id=6320147&version=2.0 http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=protein&<span style="color:#CC0000;">id=6320147</span>&version=2.0] | |
− | + | </tt></div> | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | </ | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
+ | Note the conceptual difference between search "term" and retrieval "id". | ||
+ | An API to NCBIs Entrez system is provided through eUtils. | ||
{{task|1= | {{task|1= | ||
− | + | Browse through the [https://www.ncbi.nlm.nih.gov/books/NBK25500/ E-utilities Quick Start chapter] of the NCBI's Entrez Programming Utilites Handbook for a quick overview. | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | for | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
}} | }} | ||
+ | {{Smallvspace}} | ||
+ | {{ABC-unit|RPR-eUtils_XML.R}} | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
{{Vspace}} | {{Vspace}} | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
<div class="about"> | <div class="about"> | ||
Line 670: | Line 173: | ||
:2017-08-05 | :2017-08-05 | ||
<b>Modified:</b><br /> | <b>Modified:</b><br /> | ||
− | : | + | :2020-09-24 |
<b>Version:</b><br /> | <b>Version:</b><br /> | ||
− | : | + | :1.2 |
<b>Version history:</b><br /> | <b>Version history:</b><br /> | ||
+ | *1.2 2020 Updates | ||
+ | *1.0.1 Update for Chromes' developer tools layout change | ||
+ | *1.0 Working version | ||
*0.1 First stub | *0.1 First stub | ||
</div> | </div> | ||
− | |||
− | |||
{{CC-BY}} | {{CC-BY}} | ||
+ | [[Category:ABC-units]] | ||
+ | {{UNIT}} | ||
+ | {{LIVE}} | ||
</div> | </div> | ||
<!-- [END] --> | <!-- [END] --> |
Latest revision as of 02:12, 25 September 2020
Scripting Data Downloads
(Techniques for accessing databases and downloading data)
Abstract:
Often we need to automate access to databases that provide their data via Web interfaces, or are only designed to be viewed in Web browsers. This unit discussess three strategies: working with text data that is accessed through GET and POST commands, and parsing simple XML formatted data.
Objectives:
|
Outcomes:
|
Deliverables:
Prerequisites:
You need the following preparation before beginning this unit. If you are not familiar with this material from courses you took previously, you need to prepare yourself from other information sources:
- The Central Dogma: Regulation of transcription and translation; protein biosynthesis and degradation; quality control.
This unit builds on material covered in the following prerequisite units:
Evaluation
Evaluation: NA
Contents
Many databases provide download links of their holdings, and/or covenient subsets of data, but sometimes we need to access the data piece by piece - either because no bulk download is available, or because the full dataset is unreasonably large. For the odd protein here or there, we may be able to get the information from a Web-page by hand, but this is tedious, and it is easy to make mistakes. Much better to learn how to script data downloads.
In this unit we will cover three download strategies. Our first example is the UniProt interface from which we will retrieve the FASTA sequence of a protein with a simple GET request for a text file. The second example is to retieve motif annotations from PROSITE - a POST request, with subsequent parsing of a table. The final example is to retrieve XML data from the NCBI via their E-utils interface.
UniProt GET
Task:
- Open RStudio and load the
ABC-units
R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit. - Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type
init()
if requested. - Open the file
RPR-UniProt_GET.R
and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
ScanPprosite POST
ScanProsite is a tool to search for the occurrence of expert-curated motifs in the PROSITE database in a sequence of interest.
Task:
ScanProsite uses UniProt IDs. The UniProt ID for yeast Mbp1 is P39678
.
- Navigate to ScanProsite, paste
P39678
into the text field, select Table output from the dropdown menu in the STEP 3 section, and START THE SCAN.
You should see four feature hits: the APSES domain, and three ankyrin domain sequences that partially overlap. We could copy and paste the start and end numbers and IDs but that would be lame. Let's get them directly from Prosite instead, because later we will want to fetch a few of these annotations. Prosite does not have a nice API interface like UniProt, but the principles of using R's httr
package to send POST requests and retrieve the results are the same. The parameters for the POST request are hidden in the so-called "form" element" that your browser sends to the PROSITE Web server. In order to construct our request correctly, we need to use the correct parameter names in our script, that the Web page assigns when it constructs input. The first step to capture the data from this page via screenscraping is to look into the HTML code of the page.
(I am writing this section from the perspective of the Chrome browser - I don't think other browsers have all of the functionality that I am describing here. You may need to install Chrome to try this...)
- Use the menu and access View → Developer → View Source. Scroll through the page. You should easily be able to identify the data table. That's fair enough: each of the lines contain the UniProt ID and we should be able to identify them. But how to send the request to get this page in the first place?
- Use the browser's back button to go back to the original query form, and again: View → Developer → View Source. This is the page that accepts user input in a so called
form
via several different types of elements: "radio-buttons", a "text-box", "check-boxes", a "drop down menu" and a "submit" button. We need to figure out what each of the values are so that we can construct a validPOST
request. If we get them wrong, in the wrong order, or have parts missing, it is likely that the server will simply ignore our request. These elements are harder to identify than the lines of feature information, and it's really easy to get them wrong, miss something and get no output. But Chrome has a great tool to help us: it allows you to see the exact, assembledPOST
header that it sent to the Prosite server!
- Close the page source, open View → Developer → Developer Tools in the Chrome menu. Then click again on START THE SCAN. The Developer Tools page will show you information about what just happened in the entire transaction that the browser negotiated to retrieve the results page. Click on the Network tab, on All, and in the Names column, select:
PSScan.cgi
. This contains the form data. Then click on the Headers tab and scroll down until you see the Form Data. This has all the the requiredPOST
elements nicely spelled out. What you are looking for are key value pairs like:
- meta:
opt1
- meta1_protein:
opt1
- seq:
P39678
- etc ...
- meta:
These are the field keys, and the required values. You have now reverse-engineered a Web form. Armed with this knowledge we can script it: what worked from the browser should work the same way from an R script.
Task:
- Open RStudio and load the
ABC-units
R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit. - Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type
init()
if requested. - Open the file
RPR-PROSITE_POST.R
and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
NCBI Entrez E-Utils
It is has become unreasonably difficult to screenscrape the NCBI site since the actual page contents are dynamically loaded via AJAX. This may be intentional, or just overengineering. While NCBI offers a subset of their data via the eutils API and that works well enough, some of the data that is available to the Web browser's eyes is not served to a program.
The eutils API returns data in XML format. Have a look at the following URL in your browser to see what that looks like:
Look at the contents of the <ID>...</ID> tag, and follow the next query:
Note the conceptual difference between search "term" and retrieval "id".
An API to NCBIs Entrez system is provided through eUtils.
Task:
Browse through the E-utilities Quick Start chapter of the NCBI's Entrez Programming Utilites Handbook for a quick overview.
Task:
- Open RStudio and load the
ABC-units
R project. If you have loaded it before, choose File → Recent projects → ABC-Units. If you have not loaded it before, follow the instructions in the RPR-Introduction unit. - Choose Tools → Version Control → Pull Branches to fetch the most recent version of the project from its GitHub repository with all changes and bug fixes included.
- Type
init()
if requested. - Open the file
RPR-eUtils_XML.R
and follow the instructions.
Note: take care that you understand all of the code in the script. Evaluation in this course is cumulative and you may be asked to explain any part of code.
About ...
Author:
- Boris Steipe <boris.steipe@utoronto.ca>
Created:
- 2017-08-05
Modified:
- 2020-09-24
Version:
- 1.2
Version history:
- 1.2 2020 Updates
- 1.0.1 Update for Chromes' developer tools layout change
- 1.0 Working version
- 0.1 First stub
This copyrighted material is licensed under a Creative Commons Attribution 4.0 International License. Follow the link to learn more.