Screenscraping
This material is slowly being phased out and replaced with an RStudio project that you can load from https://github.com/hyginn/R_Exercise-Screenscraping
Screenscraping is a highly informal method of parsing HTML output for data.
The right way to work with programs on non-local hosts is through simple RPC (remote procedure calls) or APIs (Application Programming Interfaces) that may include more complex objects or data structures. In fact, the contents of webpages have changed dramatically from the simple HTML we would have seen even a few years ago, to highly dynamic containers of elaborate Javascript programs that assemble their information payload client-side, often from multiple sources. For such pages, API access is likely to be the only sensible way, and reverse engineering their output would be a major project. However, many less sophisticated sites give us simpler output, and we need to work with what we have: the contents of a Webpage. Taking a server's HTML output and parsing the relevant data-elements from it is therefore often called screenscraping. It is generally used
- when data is available only through a Web-server;
- when no formal specifications for the data and/or procedures exist, or such specifications are not easily available;
- when robust access and reproducible behaviour are of less importance than speed of deployment.
Screen scraping is therefore highly informal and ad hoc - a quick and dirty solution to common tasks.
Retrieving
The first issue to address is how to retrieve the data. Let us assume it is textual data - I could not imagine that it would be less work to try to parse images than to contact the maintainers of the data and negotiate a proper interface.
Web browser
Simply navigate to a page, then save it as HTML or text-only. It's often easier to work with HTML because the markup may simplify parsing. Let's access the domain information for the yeast Mbp1 cell cycle regulator at UniProt. The UniProt ID of the protein is P39678, and with this ID we can access the protein's entry directly:

- http://www.uniprot.org/uniprot/P39678
What I would like to do from this page is to access the start and end coordinates for the annotated domains. Let's have a look at the HTML source first.
It is certainly quite messy - but it seems well enough structured to work with. We could copy and paste it and take it apart ...
wget
A better approach is to download the page directly. wget is a Unix command-line interface to network protocols. It makes downloading the Mbp1 UniProt page simple:
wget -O - http://www.uniprot.org/uniprot/P39678
If a file name is specified instead of "-", the output will be written to that file instead of STDOUT. If -O is not specified, the output will be written to a file in the local directory with the same name as the file on the server.
curl
curl is an alternative to wget. For a comparison of the two see here. Its syntax is "cleaner" Unix, supporting redirection and pipes.
curl http://www.uniprot.org/uniprot/P39678 > P39678.html
head P39678.html
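The same retrieval can also be done from within R itself - a minimal sketch, assuming a working Internet connection; base R's download.file() and readLines() are all that is needed:

# Save the page to a local file, as in the curl example above ...
download.file("http://www.uniprot.org/uniprot/P39678",
              destfile = "P39678.html")
# ... or read it straight into a character vector, one element per line,
# and inspect the beginning of the page source
src <- readLines("http://www.uniprot.org/uniprot/P39678")
head(src)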
Perl
backticks
The easiest way to use Perl to retrieve a Web server document is actually through wget or curl. Consider the following code:
use strict;
use warnings;
my $url = 'http://www.uniprot.org/uniprot/P39678';
my $out = `curl $url`;
print $out;
exit();
The fun part is in the backticks: strings in backticks are executed as system commands and the resulting output from STDOUT is assigned to a variable.
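The same idea - run a command-line tool and capture its STDOUT in a variable - is also available in R via system(). A minimal sketch, assuming curl is installed on your machine:

# system(..., intern = TRUE) returns the command's STDOUT as a
# character vector, one element per line of output
url <- "http://www.uniprot.org/uniprot/P39678"
out <- system(paste("curl -s", url), intern = TRUE)
head(out)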
LWP
A much more flexible way to use Perl to interact with Web servers is LWP (Library for WWW in Perl); see the linked example and the online book on the topic. Typical uses include sites into which you have to log in, accept cookies, or otherwise interact in more complex ways with the server. While wget will retrieve the contents of a URL, LWP simulates much of the behaviour of a browser.
Neither Javascript nor any plugins will work through LWP. That's not to say it can't be done, just not with LWP.
PHP
PHP has built-in support for retrieving HTML documents. Here is an example that retrieves this page and parses only the PHP section from it.
<html>
<head>
<title>Example</title>
</head>
<body>
<?php
$source = "http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping";
$raw = file_get_contents($source);
# print($raw);
preg_match("/(PHP<\/span><\/h4>)(.*?)(<h4>)/s", $raw, $matches);
echo $matches[1] . $matches[2] . "\n";
?>
</body>
</html>
The modifier s after the matching pattern allows matching across newline boundaries. Otherwise matches would only be retrieved if they were completely contained within a single line.
Parsing
To parse anything meaningful from the raw code you have retrieved, you will need regular expressions ...
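For example, here is a minimal sketch in R that pulls the <title> element out of the raw page source; the URL is the UniProt entry used above, and the (?s) flag plays the same role as the s modifier in the PHP example, letting "." match across newlines:

# Fetch the page source as one long string, then use a Perl-compatible
# regular expression to extract the <title> element.
pageText <- paste(readLines("http://www.uniprot.org/uniprot/P39678"),
                  collapse = "\n")
m <- regexpr("(?s)<title>.*?</title>", pageText, perl = TRUE)
regmatches(pageText, m)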
R
Screenscraping with R is almost ridiculously easy with the XML package and the readHTMLTable() function[1] ...
# screenscrapingUniprot.R
# sample code to demonstrate the R XML package and
# readHTMLTable()
#
# Boris Steipe for BCB410
# 2015
setwd("my/project/dir")
if (!require(XML, quietly = TRUE)) {
install.packages("XML")
library(XML)
}
# Fetch the yeast Mbp1 page from Uniprot
queryURL <- "http://www.uniprot.org/uniprot/P39678"
data <- readHTMLTable(queryURL, stringsAsFactors = FALSE)
# That's all we need to do to parse the page and return all table data
# for further processing. The return value is a list of dataframes. Most
# likely the dataframes will contain columns of factors.
data
# The dataframes can be easily accessed as named elements of the list:
names(data)
# It seems the "domainsAnno_section" is what we are looking for.
data$domainsAnno_section
# We can extract the information with our normal R-syntax:
pos <- data$domainsAnno_section[,"Position(s)"]
pos
# Obviously, we would like to get the actual sequence of these domains.
# The protein sequence is however not contained in a table and we have
# to pull it out of the HTML source.
# Let's capture it:
rawHTML <- htmlParse(queryURL) # This returns an XML node-set
str(rawHTML)
# As we found in the source, the sequence is contained in a <pre>
# element, labelled with class="sequence"
# To extract it from the XML tree in the response object,
# we use the function getNodeSet().
# getNodeSet takes two (or more) parameters: an xml tree, and a "path"
# that describes what nodes should be considered. Paths are expressed
# in the xpath language (see: http://www.w3.org/TR/xpath/). Without
# getting too technical, we use the following notation for the path:
# // shorthand for /descendant-or-self::node()/
# pre selects all <pre> elements
# [@class='sequence'] keeps only those elements whose class
#     attribute is 'sequence'
# Once the node set is found, we use toString.XMLNode to convert the
# node into a string.
raw <- toString.XMLNode(
getNodeSet(rawHTML, "//pre[@class='sequence']")
)
raw
# to assemble the sequence, we need to split this along the
# <br/> elements
lines <- strsplit(raw, "<br/>")[[1]]
lines
# the sequence is in the even lines ...
lines[seq(2, length(lines), by=2)]
# ...and from there we can collapse it:
seq <- paste(lines[seq(2, length(lines), by=2)], collapse="")
# ... and remove the remaining whitespace. gsub() is the base-R
# approach, but the package stringr has more flexible functions:
if (!require(stringr, quietly = TRUE)) {
install.packages("stringr")
library(stringr)
}
seq <- str_replace_all(seq, " ", "")
seq
# now all that's left to do is to parse the start and end
# position of the domain, and use a substr() call to get the
# sequence.
getStartEnd <- function(s) {
return(as.numeric(strsplit(s, "\\s")[[1]][c(1,3)]))
}
for (i in 1:length(pos)) {
se <- getStartEnd(pos[i])
s <- substr(seq, se[1], se[2])
print(paste(se[1], " - ",
se[2], ": ",
s))
}
Automation
What if you want to extract data from multiple pages? What if the data is dynamically generated in response to a query and you don't know the URL? What if you simply need much more control over your data retrieval? You will need to write some code that emulates user behaviour - essentially a bot, or spider. Note that there may be legal issues involved in doing so[2].
Here is a simple example of real XML parsing that performs a BLAST search via the Web API, using R's XML package[3].
# BLASTsearch.r
# Tutorial to send off one BLAST search
# Boris Steipe for BCB410
# This script uses the BLAST URL-API (Application Programming Interface)
# at the NCBI. Read about the constraints here:
# http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=DeveloperInfo
# We will send off one BLAST search for the APSES domain we have found in the
# previous example.
# The Bioconductor "annotate" package contains code for BLAST searches,
# in case you need to do something more involved.
# ====== Basic parameters ============================================
emailAddress <- "<your.name>@<your.host>"
# ====== The query ===================================================
# Queries can be either sequences, or database IDs. In case you
# use a numeric database ID - like a GI number - you might wrap
# the ID in as.character() to be sure the query is passed as a string.
# the APSES domain of P39678 (yeast Mbp1)
querySeq <- paste("IYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHI",
"LKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQ",
"GTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASP",
sep="")
# ====== The Entrez restriction ======================================
# Entrez restrictions work in the usual way: you can specify an
# organism as a binomial (eg. "saccharomyces cerevisiae"[organism]), or
# as an NCBI taxonomy ID (e.g. txid4932[organism]). For details and
# options see the Search Builder section of the Advanced Search
# interface of any query at the NCBI.
# ====== The command URL =============================================
# To use the BLAST URL API, we specify the required parameters in an
# URL string. See the extensive documentation on the possible
# parameters at
# http://www.ncbi.nlm.nih.gov/blast/Doc/urlapi.html
# I separate commands and arguments in the code below, so they are
# easy to replace with other strings or variables.
queryURL <- paste(
"http://www.ncbi.nlm.nih.gov/blast/Blast.cgi",
"?",
"QUERY=", querySeq,
"&DATABASE=", "refseq_protein", # or: nr, pdb, swissprot ...
"&HITLIST_SIZE=", "30",
"&EXPECT=", "3", # hit probably meaningless if E-value worse
"&PROGRAM=", "blastp", #
"&ENTREZ_QUERY=", paste("\"saccharomyces cerevisiae\"[organism]", sep=""),
"&NOHEADER=", "true", # turn off graphic header in result
"&EMAIL=", emailAddress, # contact address for problem feedback
"&CMD=Put",
sep = "")
# ====== Sending the URL off to BLAST ================================
# To communicate over the Internet, we need functions to post a query
# and receive results. These can be found in the XML package.
if (!require(XML)) {
install.packages("XML")
library(XML)
}
# send the query string off, and capture the response.
response <- htmlParse(queryURL)
# The response contains two items we need to continue:
# RID - the request ID. With this ID we can retrieve results
# from the BLAST server.
# RTOE - the expected time to complete.
# We extract these two items from the response, to be able
# to construct further queries to pick up the results.
# They are contained in a comment of the returned HTML document:
# <!--QBlastInfoBegin
# RID = 7W7M0JM4015
# RTOE = 27
# QBlastInfoEnd
# -->
# To extract the comment from the XML tree in the response object,
# we use the function getNodeSet().
# getNodeSet takes two (or more) parameters: an xml tree, and a "path"
# that describes what nodes should be considered. Paths are expressed
# in the xpath language (see: http://www.w3.org/TR/xpath/). Without
# getting too technical, we use the following notation for the path:
# // shorthand for /descendant-or-self::node()/
# comment() iterates over all comments and returns a list
# comment()[<...>] returns a subset of list items, selected by <...>
# contains(X, Y) is true if X contains the string in Y
# Once the node set is found, we use toString.XMLNode to convert the
# node into a string.
info <- toString.XMLNode(
getNodeSet(response, "//comment()[contains(., \"QBlastInfo\")]")
)
# Finally we use regular expression matching to extract the information we need.
rid <- regmatches(info, regexec("RID = (\\w+)" , info))[[1]][2]
rtoe <- as.numeric(regmatches(info, regexec("RTOE = (\\d+)" , info))[[1]][2])
rid
rtoe
# Now: we sleep for rtoe seconds, then we access the result by querying
# for the request ID.
ridURL <- paste(
"http://www.ncbi.nlm.nih.gov/blast/Blast.cgi",
"?",
"RID=", rid, # the request ID
"&FORMAT_TYPE=", "XML",
"&EMAIL=", emailAddress,
"&CMD=Get",
sep = "")
Sys.sleep(rtoe)
result <- htmlParse(ridURL)
# The rtoe number of seconds is an estimate - our query may or may not be done in
# time. If it's not done, there will be no element <hit> ... </hit> in the result.
length(getNodeSet(result, "//hat")) # should be zero - no such node
length(getNodeSet(result, "//hit")) # will be > 0 when the result has returned
# Run interactively, we may simply try multiple times to get the result.
# When we put everything together as a function, we'll poll for the result
# at regular intervals or until a timeout limit is reached.
# But you have to be careful: when you script this, you might generate
# too many requests in a given time interval, and the NCBI might block
# your IP address! Here is the NCBI usage policy:
# http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/node2.html
# In a nutshell: don't submit more than three jobs per second, and
# don't poll for the same RID more frequently than in minute intervals.
#
# Please take this seriously: not only is it a discourtesy not to
# comply, but getting e.g. your lab's IP blocked by the NCBI is going
# to be a bit of a problem going forward.
#
# Once the result is available, we can parse it for the data we need and
# populate our list. Examine the contents of the hit:
getNodeSet(result, "//hit")
# Done
#[End]
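The polling strategy described in the comments above could be sketched as a small function along the following lines; the interval and timeout values are arbitrary choices (the one-minute minimum between polls for the same RID follows the NCBI policy, the rest does not):

# Wait for the server's RTOE estimate, then re-check the result URL at
# fixed intervals until a <hit> node appears or we give up.
pollBLASTResult <- function(ridURL, rtoe, interval = 60, timeout = 600) {
    Sys.sleep(rtoe)               # initial estimate from the server
    waited <- rtoe
    repeat {
        result <- htmlParse(ridURL)
        if (length(getNodeSet(result, "//hit")) > 0) {
            return(result)        # done - hits are present
        }
        if (waited >= timeout) {
            stop("BLAST result not available before timeout.")
        }
        Sys.sleep(interval)       # respect the NCBI polling policy
        waited <- waited + interval
    }
}

# usage, with the objects constructed above:
# result <- pollBLASTResult(ridURL, rtoe)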
Beyond regular expressions: XPath
One of the problems we encounter is that HTML cannot be parsed with regular expressions. Of course, it often works, but for robust applications it is not advisable. Use an XML parser instead. Work through an online XPath tutorial.
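As a small illustration with the XML package used above: the <title> element that we fished out with a regular expression earlier can be retrieved much more robustly by its path, using xpathSApply() to apply xmlValue() (which returns a node's text content) to every matching node. A minimal sketch:

# Parse the page once, then address elements by their path - no
# worries about line breaks, greediness, or nested tags.
doc <- htmlParse("http://www.uniprot.org/uniprot/P39678")
xpathSApply(doc, "//title", xmlValue)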
Exercises
A BLAST RBM tool for orthologue discovery
For this exercise, I would like you to write a tool that may be invaluable for your future research, and I would like you to write it in a way that reflects robust software development practice. You can write in any language of your choice; I will give an R solution.
- Extend the BLAST example to make a tool for reciprocal best match searches. You should use the BLAST API as documented, and you should parse the XML of the result, not use regular expressions on the unstructured stream. Allow a forward search based on sequence or accession number.
- Check your parameters for sanity and completeness before sending off the request to the NCBI.
- Parse the result so that the reverse search covers at least min-% of the forward search, where min is an adjustable parameter.
- Adjust your handling of the results so that you may choose the second-best or even lower hit if its E-value is not much worse, but its coverage is much better. You will need to invent a way to balance the tradeoff and define a rational decision threshold.
- Factor your code into clearly defined functions so that no functionality is repeated in your code.
- Add a progress bar, or some other indication that the function is alive, is waiting for a response, and what the RTOE is. Update if the result is not yet available.
- Format your output in a meaningful way.
- Make sure your code is documented.
- Add tests for all functions.
Sample data ...
- An APSES domain sequence
IYSARYSGVDVYEFIHSTGSIMKRKKDDWVNATHI
LKAANFAKAKRTRILEKEVLKETHEKVQGGFGKYQ
GTWVPLNIAKQLAEKFSVYDQLKPLFDFTQTDGSASP
- An accession number for an almost identical sequence
1BM8_A
Note: this is a PDB ID plus chain, which is a valid refSeq ID.
- An organism for the RBM search
schizosaccharomyces pombe
Not sure what kind of hint would be helpful here - email me if you're stuck and I'll add something here. Don't just jump ahead and read the solution code however, that would be pointless.
Don't peek! You didn't even try yet.
Notes and references
1. Note that readHTMLTable() returns dataframes and by default turns strings into factors. This R code simply avoids creating factors in the first place, but for more principled approaches see the discussion at stackoverflow (http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters).
2. For a discussion of the legal issues see e.g. Web scraping on Wikipedia.
3. For a quick introduction to the package see http://www.omegahat.org/RSXML/shortIntro.pdf. The vignette is here: http://cran.r-project.org/web/packages/XML/XML.pdf. Type ??xml in R to see the available functions.
Further reading and resources
I Don't Need No Stinking API: Web Scraping For Fun and Profit (http://blog.hartleybrody.com/web-scraping/) - Blog by a Web developer that discusses options like building correct GET strings and using Firebug to traverse the DOM.
Six tools for web scraping (http://www.notprovided.eu/six-tools-web-scraping-use-data-journalism-creating-insightful-content/) - Includes some commercial options ...