Revision as of 12:42, 1 October 2014
Screenscraping
Screenscraping is a highly informal method of parsing HTML output for data.
The right way to work with programs on non-local hosts is through simple RPC (remote procedure calls) or APIs (Application Programming Interfaces) that may include more complex objects or data structures. However, quite often all one has to work with is some sort of Web browser screen. Taking a server's HTML output and parsing the relevant data elements from it is therefore often called screenscraping. It is generally used
- when data is available only through a Web-server
- when no formal specifications for the data and/or procedures exist, or they are not easily available
- when robust access and reproducible behaviour are of less importance than speed of deployment
Screenscraping is therefore highly informal and ad hoc: a quick and dirty solution to common tasks.
Retrieving
The first issue to address is how to retrieve the data. Let us assume it is textual data - I could not imagine that it would be less work to try to parse images than to contact the maintainers of the data and negotiate a proper interface.
Web browser
Simply navigate to a page, then save it as HTML or text-only. It's often easier to work with HTML because the markup may simplify parsing.
wget
wget is a Unix command-line tool for retrieving files over network protocols such as HTTP and FTP. For example, the following will write the contents of this page to STDOUT.
wget -O - http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping
If a file name is specified instead of "-", the output will be written to that file instead of STDOUT. If -O is not specified, the output will be written to a file in the local directory with the same name as the file on the server.
curl
curl is an alternative to wget. Unlike wget, it writes to STDOUT by default, so it combines naturally with Unix redirects and pipes.
curl http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping > test.txt
head test.txt
Perl
backticks
The easiest way to use Perl to retrieve a Web server document is actually through wget or curl. Consider the following code:
use strict;
use warnings;
my $url = 'http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping';
my $out = `curl $url`;
print $out;
exit();
The fun part is in the backticks: strings in backticks are executed as system commands and the resulting output from STDOUT is assigned to a variable.
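For comparison, here is a minimal Python sketch of the same idea: run an external command and capture what it writes to STDOUT. It is not part of the Perl example above. The helper name run_and_capture is hypothetical, and the child command is a harmless Python one-liner rather than curl, so that the sketch runs without network access.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run a command and return everything it wrote to STDOUT as a string."""
    # check=True raises CalledProcessError if the command exits with an error.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in for `curl $url`: a child process that simply prints one line.
out = run_and_capture([sys.executable, "-c", "print('hello from a child process')"])
print(out, end="")
```

Replacing the argument list with ["curl", "-s", url] should, assuming curl is installed, mirror the Perl backtick call.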
LWP
A much more flexible way to use Perl to interact with Web servers is LWP (the Library for WWW in Perl). Typical uses include sites where you have to log in, accept cookies, or otherwise interact with the server in more complex ways. While wget will retrieve the contents of a URL, LWP simulates much of the behaviour of a browser.
Neither Javascript nor any plugins will work through LWP. That's not to say it can't be done, just not with LWP.
PHP
PHP has built-in support for retrieving HTML documents. Here is an example that retrieves this page and extracts only the PHP section from it.
<html>
<head>
<title>Example</title>
</head>
<body>
<?php
$source = "http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping";
$raw = file_get_contents($source);
# print($raw);
preg_match("/(PHP<\/span><\/h4>)(.*?)(<h4>)/s", $raw, $matches);
echo $matches[1] . $matches[2] . "\n";
?>
</body>
</html>
The modifier s after the pattern makes the dot (.) match newline characters as well, so that a match can span several lines. Without it, a match would only be found if it were completely contained within a single line.
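The effect of that modifier can be tried out on a small string. The following is a sketch in Python, whose re.DOTALL flag plays the same role as the s modifier; the HTML fragment is made up for illustration and is much simpler than the real page.

```python
import re

# Hypothetical fragment: the text between the two headings spans several lines.
html = "<h4>PHP</h4>\n<p>first line\nsecond line</p>\n<h4>Parsing</h4>"

pattern = r"<h4>PHP</h4>(.*?)<h4>"

# Without the flag, "." stops at the first newline and the pattern cannot match.
without_s = re.search(pattern, html)

# With re.DOTALL, "." also matches newlines, so the lazy group spans the lines.
with_s = re.search(pattern, html, re.DOTALL)

print(without_s)              # None
print(repr(with_s.group(1)))  # '\n<p>first line\nsecond line</p>\n'
```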
Parsing
To parse anything meaningful from the raw code you have retrieved, you will need regular expressions ...
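As a rough illustration of what such parsing can look like, the Python sketch below pulls the cells out of a small HTML table with two regular expressions. The table contents, the names ROW, CELL and scrape_rows, and the assumption that the tags carry no attributes are all invented for this example; real pages are usually messier.

```python
import re

# Made-up sample table; the numbers are placeholders, not real data.
SAMPLE = """
<table>
<tr><th>Date</th><th>Cases</th><th>Deaths</th></tr>
<tr><td>1 Jan 2000</td><td>1,234</td><td>567</td></tr>
<tr><td>8 Jan 2000</td><td>2,345</td><td>678</td></tr>
</table>
"""

# Lazy groups keep each match inside a single row or a single cell.
ROW = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)
CELL = re.compile(r"<td>(.*?)</td>", re.DOTALL)


def scrape_rows(html):
    """Return the <td> contents of every table row as a list of lists."""
    rows = []
    for row_html in ROW.findall(html):
        cells = [cell.strip() for cell in CELL.findall(row_html)]
        if cells:  # the header row uses <th> and yields no <td> cells
            rows.append(cells)
    return rows


for row in scrape_rows(SAMPLE):
    print(row)
```

This kind of pattern breaks as soon as the markup changes, for example when a tag gains an attribute, which is exactly the fragility that makes screenscraping a quick and dirty approach.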
R
Screenscraping with R is almost ridiculously easy with the XML package and the readHTMLTable() function[1] ...
# screenscraping.R
# sample code to demonstrate the R XML package and
# readHTMLTable()
#
# Boris Steipe for BCB410
# 2014
setwd("my/R/working/directory")
if (!require(XML, quietly = TRUE)) { install.packages("XML") }
library(XML)
# Fetch the datatable for Ebola cases and deaths timeline from the
# Wikipedia page on the 2014 outbreak.
data <- readHTMLTable("http://en.wikipedia.org/wiki/Ebola_virus_epidemic_in_West_Africa")
# That's all we need to do to parse the page and return all table data
# for further processing. The return value is a list of dataframes. Most
# likely the dataframes will contain columns of factors.
data
# The dataframes can be easily accessed as named elements of the list:
data$`Ebola cases and deaths by country and by date - 1 August to present.`
data$`Archived Ebola cases and deaths by country. - 22 March to 30 July`
# Extract the first three columns for these tables:
d1 <- data$`Ebola cases and deaths by country and by date - 1 August to present.`[1:3]
d1 <- d1[2:nrow(d1),1:3] # slice off the first row
d2 <- data$`Archived Ebola cases and deaths by country. - 22 March to 30 July`[1:3]
d2 <- d2[2:nrow(d2),1:3] # slice off the first row
d1 <- rbind(d1, d2)
#
# Let's clean up the data:
# First we convert the factors to strings, (column by column
# otherwise R gets confused)
d1[,1] <- as.character(d1[,1])
d1[,2] <- as.character(d1[,2])
d1[,3] <- as.character(d1[,3])
d1
# We see that some cells contain linebreaks. Let's remove these
# rows.
d1 <- d1[apply(d1[,2:3], 1, function(row) length(grep("\\n",row))==0),]
d1
# Now remove the comma from numbers and convert them to numeric
# gsub() will help us:
# the expression we need is as.numeric(gsub(",","", data)) ...
apply(d1[,2:3],2,function(x) as.numeric(gsub(",","", x)))
d1[,2:3] <- apply(d1[,2:3], 2, function(x) as.numeric(gsub(",","", x)))
# finally we need to convert the dates ...
as.integer(as.Date(d1[,1], format="%d %b %Y")) # d1[,1] is a vector, so no apply() is needed
d1[,1] <- as.integer(as.Date(d1[,1], format="%d %b %Y"))
# ... and start with day 1:
d1[,1] <- d1[,1] - min(d1[,1]) + 1
# Done. Plot this:
plot(d1[,1:2])
plot(d1[,1], log(d1[,2]))
# Case fatality ratio:
plot(d1[,1], d1[,3]/d1[,2])
# Done.
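The cleanup steps in the R script (strip the thousands separators, parse the dates, and renumber the days so that the earliest date is day 1) are independent of R. As a hedged sketch, the same logic might look like this in Python; the function name clean and the sample rows are invented for illustration, and the "%d %b %Y" format assumes English month abbreviations in the current locale.

```python
from datetime import datetime


def clean(rows):
    """Convert (date, cases, deaths) strings to (day number, int, int) tuples."""
    parsed = []
    for date_text, cases_text, deaths_text in rows:
        day = datetime.strptime(date_text, "%d %b %Y").date()
        cases = int(cases_text.replace(",", ""))    # "2,345" -> 2345
        deaths = int(deaths_text.replace(",", ""))
        parsed.append((day, cases, deaths))

    # Renumber the dates so that the earliest one becomes day 1.
    first = min(day for day, _, _ in parsed)
    return [((day - first).days + 1, cases, deaths)
            for day, cases, deaths in parsed]


# Made-up rows, newest first; the numbers are placeholders, not real data.
sample = [("8 Jan 2000", "2,345", "678"), ("1 Jan 2000", "1,234", "567")]
print(clean(sample))
```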
Exercises
TBD
Further reading and resources
- ↑ Note that readHTMLTable() returns dataframes and by default turns strings into factors. This R code converts the factors in a very pedestrian way, but for more principled approaches to avoid getting factors in the first place, or to convert them if you do, see here at stackoverflow: http://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters