URL Spider
How to write a web robot, or 'spider', in Perl
|
|
See also
|
[ home ]
-
[ search ]
-
[ sitemap ]
Originally written by
Tkil.
See more of his work at
http://slinky.scrye.com/~tkil/perl/.
Republished by kind permission of Tkil and
The Perl Journal
Today, someone on the IRC #perl channel was asking some confused questions. We finally managed to figure out that he was trying to write a web robot, or "spider", in Perl. Which is a grand idea, except that:
- Perfectly good spiders have already been written and are freely available at http://info.webcrawler.com/mak/ projects/robots/robots.html.
- A Perl-based web spider is probably not an ideal project for a novice Perl programmer. Work your way up to it.
Having said that, I immediately pictured a one-line Perl robot. It wouldn't do much, but it would be amusing. After a few abortive attempts, I ended up with this monster, which requires Perl 5.005. I've split it onto separate lines for easier reading.
perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '
$ua = LWP::UserAgent->new;
while (my $link = shift @ARGV) {
print STDERR "working on $link";
HTML::LinkExtor->new(
sub {
my ($t, %a) = @_;
my @links = map { url($_, $link)->abs() }
grep { defined } @a{qw/href img/};
print STDERR "+ $_" foreach @links;
push @ARGV, @links;
} ) -> parse(
do {
my $r = $ua->simple_request
(HTTP::Request->new("GET", $link));
$r->content_type eq "text/html" ? $r->content : "";
}
)
}' http://slinky.scrye.com/~tkil/
I actually edited this on a single line; I use shell-mode inside of Emacs, so it wasn't that much of a terror. Here's the one-line version.
perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe
'$ua = LWP::UserAgent->new; while (my $link = shift @ARGV) {
print STDERR "working on $link";HTML::LinkExtor->new( sub
{ my ($t, %a) = @_; my @links = map { url($_, $link)->abs()
} grep { defined } @a{qw/href img/}; print STDERR "+ $_"
foreach @links; push @ARGV, @links} )->parse(do { my $r =
$ua->simple_request (HTTP::Request->new("GET", $link));
$r->content_type eq "text/html" ? $r-> content : ""; } )
}' http://slinky.scrye.com/~tkil/
After getting an ego-raising chorus of groans from the hapless onlookers in #perl, I thought I'd try to identify some cute things I did with this code that might actually be instructive to TPJ readers.
Callbacks and Closures
Many modules are designed to do grunt work. In this case, HTML::LinkExtor (a specialized version of HTML::Parser) knows how to look through an HTML document and find links. Once it finds them, however, it needs to know what to do with them.
This is where "callbacks" come in. They're well-known in GUI circles, since interfaces need to know what to do when one presses a button or selects a menu item. Here, HTML::LinkExtor needs to know what to do with links (all tags, actually) when it finds them.
My callback is an anonymous subroutine reference:
sub {
my ($t, %a) = @_;
my @links = map { url($_, $link)->abs() }
grep { defined } @a{qw/href img/};
print STDERR "+ $_" foreach @links;
push @ARGV, @links;
}
I didn't notice until later that $link is actually scoped just outside of this subroutine (in the while loop), making this subroutine look almost like a closure. It's not a classical closure - it doesn't define its own storage - but it does use a lexical value far away from where it is defined. (Enough justification for a section title!)
Cascading Arrows
It's amusing to note that, aside from debugging output, the while loop consists of a single statement. The arrow operator (->) only cares about the value of the left hand side; this is the heart of the Perl/Tk idiom:
my $button = $main->Button( ... )->pack();
We use a similar approach, except we don't keep a copy of the created reference (which is stored in $button above):
HTML::LinkExtor->new(...)->parse(...);
This is a nice shortcut to use whenever you want to create an object for a single use.
Using Modules with One-Liners
From my first thought of this one-liner, I knew I'd be using modules from the libwww-perl (LWP) library. The first few iterations of this "one-liner" used LWP::Simple, which explicitly states that it should be ideal for one-liners. The -M flag is easy to use, and makes many things very easy. LWP::Simple fetched the files just fine. I used something like:
HTML::LinkExtor->new(...)->parse( get $link );
Where get() is a function provided by LWP::Simple; it returns the contents of a given URL.
Unfortunately, I needed to check the Content-Type of the returned data. The first version merrily tried to parse .tar.gz files and got confused:
working on ./dist/irchat/irchat-3.03.tar.gz
Use of uninitialized value at
/usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 104.
Use of uninitialized value at
/usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 107.
Use of uninitialized value at
/usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 82.
Ooops.
Switching to the "industrial strength" LWP::UserAgent module allowed me to check the Content-Type of the fetched page. Using this information, together with the HTTP::Response module and a quick ?: construct, I could parse either the HTML content or an empty string.
Spidering sample in script form
#!/usr/bin/perl -w
use strict;
my $VERSION = "0.82";
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
$| = 1;
sub spider (%);
spider URL => 'http://mobile.sunrise.ch';
sub spider (%) {
my %args = @_;
my @startlinks = ();
push(@startlinks, $args{URL});
my $ua = LWP::UserAgent->new;
WORKLOOP: while (my $link = shift @startlinks) {
for (my $i = 0; $i< $#startlinks; $i++) {
next WORKLOOP if $link eq $startlinks[$i];
}
print ">>>>> working on $link\n";
HTML::LinkExtor->new(
sub {
my ($t, %a) = @_;
my @links = map { url($_, $link)->abs() }
grep { defined } @a{qw/href img/};
# mark already spidered links for removal
foreach my $start_link (@startlinks) {
my $i = 0;
for (0 .. $#links) {
if ($links[$i++] eq $start_link) {
$links[$i -1] = "'REMOVE'";
}
}
}
# remove already spidered links
@links = sort @links;
for (my $i = 0; $i< $#links; $i++) {
$links[$i] = "'REMOVE'" if $links[$i] eq $links[$i +1];
}
@links = grep { $_ ne "'REMOVE'" } @links;
print "+ $_\n" foreach @links;
push @startlinks, @links if @links;
} ) -> parse(
do {
my $r = $ua->simple_request
(HTTP::Request->new("GET", $link));
$r->content_type eq "text/html" ? $r->content : "";
}
)
}
}
|
|