Screenscraping

From "A B C"
Revision as of 07:31, 23 September 2015 by Boris (talk | contribs) (→‎R)


Screenscraping is a highly informal method of parsing HTML output for data.



The right way to work with programs on non-local hosts is through simple RPC (remote procedure calls) or APIs (Application Programming Interfaces) that may include more complex objects or data structures. In fact, the contents of webpages have changed dramatically from the simple HTML we would have seen even a few years ago to highly dynamic containers of elaborate JavaScript programs that assemble their information payload client-side, often from multiple sources. For such pages, API access is likely to be the only sensible way, and reverse engineering such output would be a major project. However, many less sophisticated sites give us simpler output, and we need to work with what we have: the contents of a Web page. Taking a server's HTML output and parsing the relevant data elements from it is often called screenscraping. It is generally used

  • when data is available only through a Web-server;
  • when no formal specifications for the data and/or procedures exist, or they are not easily available;
  • when robust access and reproducible behaviour are of less importance than speed of deployment.

Screenscraping is therefore highly informal and ad hoc - a quick and dirty solution to common tasks.

Retrieving

The first issue to address is how to retrieve the data. Let us assume it is textual data: trying to parse images would almost certainly be more work than contacting the maintainers of the data and negotiating a proper interface.

Web browser

Simply navigate to a page, then save it as HTML or text-only. It is often easier to work with HTML because the markup may simplify parsing. Let's access the domain information for the yeast Mbp1 cell cycle regulator at UniProt. The UniProt ID of the protein is P39678, and with this knowledge we can directly access the information for the protein at UniProt: http://www.uniprot.org/uniprot/P39678

What I would like to get from this page are the start and end coordinates of the annotated domains. Let's have a look at the HTML source first.

It is certainly quite messy - but it seems well enough structured to work with it. We could copy and paste it and take it apart ...


wget

It is better to download the page directly. wget is a Unix command-line tool for retrieving files over network protocols such as HTTP. It is simple to use for downloading the Mbp1 UniProt page:

wget -O - http://www.uniprot.org/uniprot/P39678

If a file name is specified instead of "-", the output will be written to that file instead of STDOUT. If -O is not specified, the output will be written to a file in the local directory with the same name as the file on the server.

curl

curl is an alternative to wget. For a comparison of the two see here. Because curl writes to STDOUT by default, it fits more naturally into the Unix style of redirects and pipes.

curl http://www.uniprot.org/uniprot/P39678 > P39678.html
head P39678.html

Perl

backticks

The easiest way to use Perl to retrieve a Web server document is actually through wget or curl. Consider the following code:

use strict;
use warnings;
 
my $url = 'http://www.uniprot.org/uniprot/P39678';
my $out = `curl $url`;
print $out;

exit();

The fun part is in the backticks: strings in backticks are executed as system commands and the resulting output from STDOUT is assigned to a variable.

LWP

A much more flexible way to use Perl to interact with Web servers is LWP (Library for WWW in Perl); an online book on the topic is available. Typical uses include sites where you have to log in, accept cookies or otherwise interact with the server in more complex ways. While wget will retrieve the contents of a URL, LWP simulates much of the behaviour of a browser.

Neither Javascript nor any plugins will work through LWP. That's not to say it can't be done, just not with LWP.

PHP

PHP has built-in support for retrieving HTML documents. Here is an example that retrieves this page and extracts only the PHP section from it.


<html>
    <head>
        <title>Example</title>
    </head>
    <body>
<?php

$source = "http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping";

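$raw = file_get_contents($source);
# print($raw);

# Only print the section if the page was retrieved and the pattern matched.
if ($raw !== false && preg_match("/(PHP<\/span><\/h4>)(.*?)(<h4>)/s", $raw, $matches)) {
    echo $matches[1] . $matches[2] . "\n";
} else {
    echo "Could not retrieve or parse the page.\n";
}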

?>
    </body>
</html>

The modifier s after the matching pattern makes the dot match newline characters as well, which allows a match to extend across line boundaries. Without it, a match would only be found if it were completely contained within a single line.

Parsing

To parse anything meaningful from the raw code you have retrieved, you will need regular expressions ...
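
As a minimal sketch of what such parsing can look like in R: the HTML fragment and the pattern below are invented for illustration and are not taken from any real page. The example uses only base R functions, and shows why a "dot matches newline" setting matters when the data of interest spans several lines.

```r
# Invented HTML fragment; a real page would be retrieved first.
html <- "<h4>Domains</h4>\n<p>HTH APSES-type\n5 - 111</p>\n<h4>Repeats</h4>"

# (?s) makes "." match newlines too (PCRE syntax, hence perl = TRUE);
# .*? is non-greedy, so the match stops at the first closing tag.
m <- regmatches(html, regexec("(?s)<p>(.*?)</p>", html, perl = TRUE))[[1]]
m[2]   # the text captured by the first group

# Without (?s) the dot stops at the line break and nothing is matched.
noMatch <- regmatches(html, regexec("<p>(.*?)</p>", html, perl = TRUE))[[1]]
length(noMatch)
```

Note that regexec() accepts perl = TRUE only in reasonably recent versions of R.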


R

Screenscraping with R is almost ridiculously easy with the XML package and the readHTMLTable() function[1] ...


# screenscrapingUniprot.R
# sample code to demonstrate the R XML package and
# readHTMLTable()
#
# Boris Steipe for BCB410
# 2015

setwd("my/project/dir")

if (!require(XML, quietly = TRUE)) {
	install.packages("XML")
	library(XML)
}

# Fetch the yeast Mbp1 page from Uniprot
queryURL <- "http://www.uniprot.org/uniprot/P39678"
data <- readHTMLTable(queryURL, stringsAsFactors = FALSE)

# That's all we need to do to parse the page and return all table data
# for further processing. The return value is a list of dataframes.
# Because we set stringsAsFactors = FALSE, text columns are returned
# as character vectors, not as factors.

data

# The dataframes can be easily accessed as named elements of the list:
names(data)

# It seems the "domainsAnno_section" is what we are looking for.

data$domainsAnno_section

# We can extract the information with our normal R-syntax:
pos <- data$domainsAnno_section[,"Position(s)"]
pos

# Obviously, we would like to get the actual sequence of these domains.
# The protein sequence is however not contained in a table and we have
# to pull it out of the HTML source.

# Let's capture it:

rawHTML <- htmlParse(queryURL)  # This returns the parsed document tree
str(rawHTML)

# As we found in the source, the sequence is contained in a <pre>
# element, labelled with class="sequence"

# To extract it from the XML tree in the response object,
# we use the function getNodeSet(). 
# getNodeSet takes two (or more) parameters: an xml tree, and a "path"
# that describes what nodes should be considered. Paths are expressed
# in the xpath language (see: http://www.w3.org/TR/xpath/). Without
# getting too technical, we use the following notation for the path:
#    //                  shorthand for /descendant-or-self::node()/
#    pre                 selects all <pre> elements
#    [<...>]             keeps only the nodes for which <...> is true
#    @class='sequence'   is true if the class attribute equals "sequence"
# Once the node set is found, we use toString.XMLNode to convert the
# node into a string.

raw <- toString.XMLNode(
            getNodeSet(rawHTML, "//pre[@class='sequence']")
        )
raw

# to assemble the sequence, we need to split this along the
# <br/> elements

lines <- strsplit(raw, "<br/>")[[1]]
lines

# the sequence is in the even lines ...
lines[seq(2, length(lines), by=2)]
# ...and from there we can collapse it (we call the result protSeq
# so as not to mask the base function seq()):
protSeq <- paste(lines[seq(2, length(lines), by=2)], collapse="")
# ... and remove the remaining whitespace. gsub() is the base-R
# approach, but the package stringr has more flexible functions:
if (!require(stringr, quietly = TRUE)) {
	install.packages("stringr")
	library(stringr)
}

protSeq <- str_replace_all(protSeq, " ", "")
protSeq

# now all that's left to do is to parse the start and end
# position of the domain, and use a substr() call to get the
# sequence.

# getStartEnd() expects whitespace around the separator, i.e. three
# whitespace-delimited tokens: start, separator, end.
getStartEnd <- function(s) {
	return(as.numeric(strsplit(s, "\\s")[[1]][c(1,3)]))
}

for (i in seq_along(pos)) {
	se <- getStartEnd(pos[i])
	s <- substr(protSeq, se[1], se[2])
	print(paste(se[1], " - ",
	            se[2], ": ",
	            s))
}
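
The position parser above depends on the exact spacing of the position strings. As a hedged sketch of a more tolerant alternative, one could extract every run of digits instead. The function name getStartEndDigits() is hypothetical, and the sample strings and the toy sequence are invented; they only illustrate formats that a position column might plausibly use.

```r
# Hypothetical alternative: pull out every run of digits, so the result
# does not depend on the separator character or on spacing.
getStartEndDigits <- function(s) {
	as.numeric(regmatches(s, gregexpr("[0-9]+", s))[[1]])
}

# Invented sample strings with different separators and spacing
getStartEndDigits("5 - 111")
getStartEndDigits("368-455")

# substr() then cuts the matching region out of a sequence string
toySeq <- "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
se <- getStartEndDigits("5 - 12")
substr(toySeq, se[1], se[2])
```

The trade-off is that this version returns every number it finds, so a string containing more than two numbers would need additional checks.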


   

Automation

What if you want to extract data from multiple pages? What if the data is dynamically generated in response to a query and you don't know the URL? What if you simply need much more control over your data retrieval? You will need to write some code that emulates user behaviour, essentially a bot, or spider. Note that there may be legal issues involved in doing so[2].
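
The simplest case, retrieving several pages whose URLs follow a known pattern, might be sketched as follows. The function names makeURLs(), fetchAll() and fakeFetch() are hypothetical, "ID_PLACEHOLDER" stands in for a second accession of your choice, and the pause between requests is an assumption about what a server would consider polite, not a documented requirement.

```r
# Build one URL per accession, using the UniProt base URL from this page.
makeURLs <- function(ids, base = "http://www.uniprot.org/uniprot/") {
	paste0(base, ids)
}

# Fetch each URL in turn, pausing between requests.
# fetch is any function that maps a URL to page text.
fetchAll <- function(urls, fetch, pause = 1) {
	pages <- vector("list", length(urls))
	names(pages) <- urls
	for (i in seq_along(urls)) {
		# try() keeps one failed request from aborting the whole run
		pages[[i]] <- try(fetch(urls[i]), silent = TRUE)
		if (i < length(urls)) Sys.sleep(pause)
	}
	pages
}

urls <- makeURLs(c("P39678", "ID_PLACEHOLDER"))
urls

# Offline stand-in for a real fetcher, so the sketch runs without a network
fakeFetch <- function(u) paste("<html>page for", u, "</html>")
pages <- fetchAll(urls, fakeFetch, pause = 0)
```

For real use, fakeFetch() could be replaced by something like function(u) paste(readLines(u, warn = FALSE), collapse = "\n"). Passing the fetcher in as an argument keeps the looping logic testable without contacting a server.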

Here is a simple example for real XML parsing that pulls some data on language distributions in Canada from Ethnologue using R's XML package[3].

# EthnoSpider.R
# sample code to demonstrate accessing multiple URLs
# via XML tree parsing functions
#
# Boris Steipe for BCB410
# 2014

setwd("my/R/working/directory")

if (!require(XML, quietly = TRUE)) { install.packages("XML") }
library(XML)

country_page <- htmlParse("http://www.ethnologue.com/country/CA/languages", isURL = TRUE)
pages <- getNodeSet(country_page, "//a[@href]")

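urls <- c()
for (i in seq_along(pages)) {
	# look only at the href attribute, not at all attributes of the node
	href <- xmlGetAttr(pages[[i]], "href")
	# keep links that end in "/language/" plus a three-character code
	if (grepl("/language/...$", href)) {
		urls <- c(urls, paste("http://www.ethnologue.com", href, sep=""))
	}
}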

urls

#... etc.
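
The link filtering in this spider can also be written without a loop. The following is a minimal sketch on an invented vector of href values; in real use the vector might be collected from the node set, for example with sapply(pages, xmlGetAttr, "href"), which is an assumption about usage and is not run here.

```r
# Invented href values of the kind one might find on a country page
hrefs <- c("/language/abc", "/country/CA/maps", "/language/xyz", "/about")

# grepl() tests all elements at once; "..." matches any three characters
# and "$" anchors the match at the end of the string.
isLanguage <- grepl("/language/...$", hrefs)

# paste0() is vectorised too, so no loop is needed
urls <- paste0("http://www.ethnologue.com", hrefs[isLanguage])
urls
```

Vectorised code of this kind avoids growing a vector one element at a time, which becomes slow for long link lists.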



Beyond regular expressions: XPath

One of the problems we encounter is that HTML cannot be parsed reliably with regular expressions. Regular expressions often work in practice, but they break easily when attribute order, whitespace or nesting changes, so they are not advisable for robust applications. Use an XML parser instead.
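
A small sketch may illustrate the difference. It assumes that the XML package is installed; the HTML fragment is invented so that the two matching pre elements are written slightly differently, as they easily might be in a real page.

```r
library(XML)   # assumes the XML package is installed

# Invented fragment: the second <pre> has an extra attribute and a
# line break inside the tag.
html <- '<html><body>
<pre class="sequence">AAAA</pre>
<pre id="x"
     class="sequence">CCCC</pre>
<pre class="other">skip</pre>
</body></html>'

# A regular expression tied to one exact spelling of the tag finds only
# the first element ...
regexHits <- regmatches(html,
                        gregexpr('<pre class="sequence">[^<]*</pre>', html))[[1]]
regexHits

# ... whereas an XPath query works on the parsed tree and finds both.
doc <- htmlParse(html, asText = TRUE)
seqs <- xpathSApply(doc, "//pre[@class='sequence']", xmlValue)
seqs
```

The regular expression could of course be extended to cover this particular variation, but each new variation in the markup would need another fix, whereas the parser handles them all.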

TBC


 

Exercises


TBD





Notes and references

  1. Note that readHTMLTable() returns dataframes and by default turns strings into factors. The R code here simply avoids creating factors in the first place by setting stringsAsFactors = FALSE; for more principled approaches see the discussions on Stack Overflow.
  2. For a discussion of the legal issues see e.g. Web scraping on Wikipedia.
  3. For a quick introduction to the package see http://www.omegahat.org/RSXML/shortIntro.pdf. The reference manual is here: http://cran.r-project.org/web/packages/XML/XML.pdf. Type ??xml in R to see the available functions.

 

Further reading and resources

I Don't Need No Stinking API: Web Scraping For Fun and Profit
Blog by a Web developer that discusses options like building correct GET strings and using Firebug to traverse the DOM.
Six tools for web scraping
Includes some commercial options ...