
Revision as of 15:30, 16 September 2012

Screenscraping


The contents of this page have recently been imported from an older version of this Wiki. This page may contain outdated information, information that is irrelevant for this Wiki, information that needs to be structured differently, outdated syntax, and/or broken links. Use with caution!


Screenscraping is a highly informal method of parsing HTML output for data.



The right way to work with programs on non-local hosts is through simple RPC (remote procedure calls) or APIs (Application Programming Interfaces) that may involve more complex objects or data structures. However, quite often all one has to work with is some sort of Web browser screen. Taking a server's HTML output and parsing the relevant data elements from it is therefore often called screenscraping. It is generally used

  • when data is available only through a Web server
  • when no formal specifications for the data and/or procedures exist, or are not easily available
  • when robust access and reproducible behaviour are less important than speed of deployment

Screenscraping is therefore highly informal and ad hoc - a quick and dirty solution to common tasks.

Retrieving

The first issue to address is how to retrieve the data. Let us assume it is textual data - I could not imagine that it would be less work to try to parse images than to contact the maintainers of the data and negotiate a proper interface.

Web browser

Simply navigate to a page, then save it as HTML or text-only. It's often easier to work with HTML because the markup may simplify parsing.


wget

wget is a Unix command-line interface to network protocols. For example, the following will write the contents of this page to STDOUT.

wget -O - http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping

If a file name is specified instead of "-", the output will be written to that file instead of STDOUT. If -O is not specified, the output will be written to a file in the local directory with the same name as the file on the server.


Perl

backticks

The easiest way to use Perl to retrieve a Web server document is actually through wget. Consider the following code:

use strict;
use warnings;
 
my $url = 'http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping';
my $out = `wget -O - $url`;
print $out;

exit();

The fun part is in the backticks: strings in backticks are executed as system commands and the resulting output from STDOUT is assigned to a variable.

LWP

A much more flexible way to use Perl to interact with Web servers is LWP (Library for WWW in Perl). Typical uses include sites into which you have to log in, accept cookies, or otherwise interact with the server in more complex ways. While wget will retrieve the contents of a URL, LWP simulates much of the behaviour of a browser.

Neither JavaScript nor any plugins will work through LWP. That's not to say it can't be done, just not with LWP.
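As a minimal sketch, assuming the libwww-perl modules are installed, a GET request through LWP::UserAgent looks like the following; the timeout value and agent string are arbitrary example choices, not requirements:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $url = 'http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping';

# Create a browser-like client; timeout and agent string are example values.
my $ua = LWP::UserAgent->new( timeout => 30 );
$ua->agent('ScreenscrapingExample/0.1');

my $response = $ua->get($url);

if ( $response->is_success ) {
    print $response->decoded_content;   # the page body, decoded to text
} else {
    die 'Request failed: ' . $response->status_line . "\n";
}
```

Unlike the backticks approach, the response object lets you inspect status codes and headers, and the same user agent can carry session state (e.g. cookies) across several requests.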

PHP

PHP has built-in support for retrieving HTML documents. Here is an example that retrieves this page and parses only the PHP section from it.


<html>
    <head>
        <title>Example</title>
    </head>
    <body>
<?php

$source = "http://biochemistry.utoronto.ca/steipe/abc/index.php/Screenscraping";

$raw = file_get_contents($source);
preg_match("/(<h4>PHP<\/h4>)(.*?)(<div class=\"editsection\")/s", $raw, $matches);

echo $matches[1] . $matches[2] . "\n";

?>
    </body>
</html>

The modifier s after the matching pattern allows the dot to match across newline boundaries. Otherwise, matches would only be found if they were completely contained within a single line.

Parsing

To parse anything meaningful from the raw code you have retrieved, you will need Regular_Expressions ...
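For example, a pattern in the spirit of the one used in the PHP section above can pull a single element out of the raw HTML. Here is a small Perl sketch on a hypothetical, hard-coded document standing in for a page fetched with wget or LWP:

```perl
use strict;
use warnings;

# Hypothetical raw HTML, standing in for output fetched with wget or LWP:
my $raw = '<html><head><title>Example</title></head>'
        . '<body><h4>PHP</h4>Some text</body></html>';

# Extract the contents of the <title> element; the s modifier lets
# the dot match newlines, as in the PHP example above.
if ( $raw =~ m{<title>(.*?)</title>}s ) {
    print "Title: $1\n";   # prints "Title: Example"
}
```

The non-greedy quantifier (.*?) stops at the first closing tag; without the ?, the match would run to the last </title> in the document.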



   

Further reading and resources