7.9 Handling Proxy Requests
The HTTP proxy protocol was
originally designed to allow users unfortunate enough to be stuck
behind a firewall to access external web sites. Instead of connecting
to the remote server directly, an action forbidden by the firewall,
users point their browsers at a proxy server located on the firewall
machine itself. The proxy goes out and fetches the requested document
from the remote site and forwards the retrieved document to the user.
Nowadays most
firewall systems have a web proxy built
right in so there's no need for dedicated proxying servers.
However, proxy servers are still useful for a variety of purposes.
For example, a caching proxy (of which Apache is one example) will
store frequently requested remote documents in a disk directory and
return the cached documents directly to the browser instead of
fetching them anew. Anonymizing proxies take the outgoing request and
strip out all the headers that can be used to identify the user or
his browser. By writing Apache API modules that participate in the
proxy process, you can achieve your own special processing of proxy
requests.
The proxy request/response protocol is nearly the same as vanilla
HTTP. The major difference is that instead of requesting a
server-relative URI in the request line, the client asks for a full
URL, complete with scheme and host. In addition, a few optional HTTP
headers beginning with Proxy- may be added to
the request. For example, a normal (nonproxy) HTTP request sent by a
browser might look like this:
GET /foo/index.html HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Pragma: no-cache
Connection: Keep-Alive
User-Agent: Mozilla/2.01 (WinNT; I)
Host: www.modperl.com:80
In contrast, the corresponding HTTP proxy request will look like
this:
GET http://www.modperl.com/foo/index.html HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Pragma: no-cache
User-Agent: Mozilla/2.01 (WinNT; I)
Host: www.modperl.com:80
Proxy-Connection: Keep-Alive
Notice that the URL in the request line of an HTTP proxy request
includes the scheme and hostname. This information enables the proxy
server to initiate a connection to the distant server. To generate
this type of request, the user must configure his browser so that
HTTP and, optionally, FTP requests are proxied to the server. This
usually involves setting values in the browser's preference
screens. An Apache server will be able to respond to this type of
request if it has been compiled with the
mod_proxy
module. This module is part of the core
Apache distribution but is not compiled in by default.
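To make the difference between the two request lines concrete, here is a minimal sketch in plain Perl, independent of Apache. The helper name parse_request_line() and its regular expression are our own illustrative assumptions, not part of mod_proxy; the expression is deliberately simple and does not attempt full URL validation.

```perl
use strict;
use warnings;

# Hypothetical helper: split a request line and report whether the target
# is a full URL (proxy form) or a server-relative path (ordinary form).
sub parse_request_line {
    my ($line) = @_;
    my ($method, $target, $protocol) = split ' ', $line;
    my %req = (method => $method, target => $target, protocol => $protocol);

    # scheme://host[:port][/path] marks a proxy-style request
    if ($target =~ m{^([a-z][a-z0-9+.-]*)://([^/:]+)(?::(\d+))?(/.*)?$}i) {
        my ($scheme, $host, $port, $path) = ($1, $2, $3, $4);
        $req{is_proxy} = 1;
        $req{scheme}   = lc($scheme);
        $req{host}     = $host;
        $req{port}     = $port;                        # undef when omitted
        $req{path}     = defined $path ? $path : '/';
    }
    else {
        $req{is_proxy} = 0;
        $req{path}     = $target;
    }
    return \%req;
}

my $plain = parse_request_line('GET /foo/index.html HTTP/1.0');
my $proxy = parse_request_line(
    'GET http://www.modperl.com/foo/index.html HTTP/1.0');

for my $req ($plain, $proxy) {
    printf "%-5s %s\n", ($req->{is_proxy} ? 'proxy' : 'plain'), $req->{path};
}
```

The scheme and host captured in the proxy case are exactly the pieces of information that let a proxy server open its own connection to the remote site.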
You can interact with Apache's proxy mechanism at the
translation handler phase. There are two types of interventions you
can make. You can take an ordinary (nonproxy) request and change it
into a proxy request so that it will be handled by Apache's standard proxy
module, or you can take an incoming proxy request and install your
own content handler for it so that you can examine and possibly
modify the response from the remote server.
7.9.1 Invoking mod_proxy for Nonproxy Requests
We'll look first
at Apache::PassThru, an
example of how to turn an ordinary request into a proxy
request. Because this
technique uses Apache's mod_proxy module,
this module will have to be compiled and installed in order for this
example to run on your system.
The idea behind the example is simple. Requests for URIs beginning
with a certain path will be dynamically transformed into a proxy
request. For example, we might transform requests for URLs beginning
with /CPAN/ into a request for
http://www.perl.com/CPAN/. The request to
www.perl.com will be done completely behind the
scenes; nothing will reveal to the user that the directory hierarchy
is being served from a third-party server rather than our own. This
functionality is the same as the ProxyPass
directive provided by mod_proxy itself. You can
also achieve the same effect by providing an appropriate rewrite rule
to mod_rewrite.
The configuration for this example uses a
PerlSetVar to set a variable named
PerlPassThru. A typical entry in the
configuration file will look like this:
PerlTransHandler Apache::PassThru
PerlSetVar PerlPassThru '/CPAN/ => http://www.perl.com/,\
/search/ => http://www.altavista.digital.com/'
The PerlPassThru variable contains a string
representing a series of URI=>proxy pairs,
separated by commas. A backslash at the end of a line can be used to
split the string over several lines, improving readability (the
ability to use backslash as a continuation character is actually an
Apache configuration file feature but not a well-publicized one). In
this example, we map the URI /CPAN/ to
http://www.perl.com/ and
/search/ to
http://www.altavista.digital.com/. For the
mapping to work correctly, local directory names should end with a
slash in the manner shown in the example.
The code for Apache::PassThru is given in Example 7.10. The handler()
subroutine begins by retrieving the request object and calling its
proxyreq() method to determine whether the
current request is a proxy request:
sub handler {
my $r = shift;
return DECLINED if $r->proxyreq;
If this is already a proxy request, we don't want to alter it
in any way, so we decline the transaction. Otherwise, we retrieve the
value of PerlPassThru, split it into its
key/value components with a pattern match, and store the result in a
hash named %mappings:
my $uri = $r->uri;
my %mappings = split /\s*(?:,|=>)\s*/, $r->dir_config('PerlPassThru');
We now loop through each of the local paths, looking for a match with
the current request's URI. If a match is found, we perform a
string substitution to replace the local path with the corresponding
proxy URI. Otherwise, we continue to loop:
for my $src (keys %mappings) {
next unless $uri =~ s/^$src/$mappings{$src}/;
$r->proxyreq(1);
$r->uri($uri);
$r->filename("proxy:$uri");
$r->handler('proxy-server');
return OK;
}
return DECLINED;
}
If the URI substitution succeeds, there are four steps we need to
take to transform this request into something that
mod_proxy will handle. The first two are
obvious, but the others are less so. First, we need to set the proxy
request flag to a true value by calling
$r->proxyreq(1). Next, we change the requested
URI to the proxied URI by calling the request object's
uri() method. In the third step, we set the
request filename to the string proxy: followed by
the URI, as in proxy:http://www.perl.com/CPAN/.
This is a special filename format recognized by
mod_proxy, and as such is somewhat arbitrary.
The last step is to set the content handler to
proxy-server, so that the request is passed to
mod_proxy to handle the response phase.
If we turned the local path into a proxy request, we return
OK from the translation handler. Otherwise, we
return DECLINED.
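The mapping logic can be tried out without a running server. The following is a minimal sketch under our own assumptions: parse_mappings() and map_uri() are hypothetical helper names, and the sketch adds two refinements that the handler itself does not have, namely quoting the local path with \Q...\E and trying longer prefixes first.

```perl
use strict;
use warnings;

# Same split as the handler: commas and "=>" both act as separators.
sub parse_mappings {
    my ($config) = @_;
    $config =~ s/^\s+|\s+$//g;    # trim so that no key begins with a space
    return split /\s*(?:,|=>)\s*/, $config;
}

# Hypothetical helper: returns the proxied URL, or undef when no
# local path matches the start of the URI.
sub map_uri {
    my ($uri, %mappings) = @_;
    # Longest prefix first, so overlapping prefixes resolve predictably
    for my $src (sort { length($b) <=> length($a) } keys %mappings) {
        # \Q...\E keeps characters such as "." in the prefix literal
        return $uri if $uri =~ s/^\Q$src\E/$mappings{$src}/;
    }
    return;
}

my %mappings = parse_mappings(
    '/CPAN/ => http://www.perl.com/, /search/ => http://www.altavista.digital.com/'
);

my $mapped   = map_uri('/CPAN/modules/index.html', %mappings);
my $unmapped = map_uri('/index.html', %mappings);

print defined $mapped   ? "$mapped\n" : "no match\n";
print defined $unmapped ? "$unmapped\n" : "no match\n";
```

Inside a real translation handler, a defined result corresponds to the branch that sets the proxy flag and returns OK, and an undefined result corresponds to returning DECLINED.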
Example 7.10. Invoking Apache's Proxy Request Mechanism from Within a Translation Handler
package Apache::PassThru;
# file: Apache/PassThru.pm
use strict;
use Apache::Constants qw(:common);
sub handler {
my $r = shift;
return DECLINED if $r->proxyreq;
my $uri = $r->uri;
my %mappings = split /\s*(?:,|=>)\s*/, $r->dir_config('PerlPassThru');
for my $src (keys %mappings) {
next unless $uri =~ s/^$src/$mappings{$src}/;
$r->proxyreq(1);
$r->uri($uri);
$r->filename("proxy:$uri");
$r->handler('proxy-server');
return OK;
}
return DECLINED;
}
1;
__END__
7.9.2 An Anonymizing Proxy
As public concern about the ability of
web servers to track people's surfing sessions grows,
anonymizing proxies are becoming more popular. An anonymizing proxy
is similar to an ordinary web proxy, except that certain HTTP headers
that provide identifying information, such as the Referer, Cookie,
User-Agent, and From fields, are quietly stripped from the request
before it is forwarded to the remote server. Not only is this
identifying information removed, but
the identity of the requesting host is obscured. The remote server
knows only the hostname and IP address of the proxy machine, not the
identity of the machine the user is browsing from.
You can write a simple anonymizing proxy in the Apache Perl API in
all of 18 lines (including comments). The source code listing is
shown in Example 7.11. Like the previous example, it
uses Apache's mod_proxy, so that module
must be installed before this example will run correctly.
The module defines a package global named @Remove
containing the names of all the request headers to be stripped from
the request. In this example, we remove
User-Agent,
Cookie, Referer, and the infrequently
used From field. The handler() subroutine begins by fetching the Apache request object
and checking whether the current request uses the proxy protocol.
However, unlike the previous example where we wanted the existence of
the proxy to be secret, here we expect the user to explicitly
configure his browser to use our anonymizing proxy. So here we return
DECLINED if proxyreq()
returns false.
If proxyreq() returns true, we know that we are
in the midst of a proxy request. We loop through each of the fields
to be stripped and delete them from the incoming headers table by
using the request object's header_in()
method to set the field to undef. We then return
OK to signal Apache to continue processing the
request. That's all there is to it.
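The stripping step can be illustrated on an ordinary Perl hash. This is only a sketch: strip_identifying() is a hypothetical stand-in for the handler's loop, and because a plain hash is case-sensitive (unlike Apache's header table) the sketch lowercases each field name before comparing it.

```perl
use strict;
use warnings;

my @Remove = qw(user-agent cookie from referer);

# Hypothetical stand-in for the handler's loop: returns a copy of the
# headers with the identifying fields removed.
sub strip_identifying {
    my (%headers) = @_;
    my %drop = map { $_ => 1 } @Remove;
    # HTTP field names are case-insensitive, so compare in lowercase
    delete @headers{ grep { $drop{ lc $_ } } keys %headers };
    return %headers;
}

my %clean = strip_identifying(
    'Accept'     => '*/*',
    'User-Agent' => 'Mozilla/2.01 (WinNT; I)',
    'Cookie'     => 'POMIS=10074',
    'Referer'    => 'http://prego/',
    'Host'       => 'www.modperl.com:80',
);

print join(', ', sort keys %clean), "\n";
```

Only the Accept and Host fields survive, which matches what the remote server sees once the anonymizing handler has run.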
To activate the anonymizing proxy, install it as a URI translation
handler as before:
PerlTransHandler Apache::AnonProxy
An alternative that works just as well is to call the module during
the header parsing phase (see the discussion of this phase earlier).
In some ways, this makes more sense because we aren't doing any
actual URI translation, but we are modifying the HTTP header. Here is
the appropriate directive:
PerlHeaderParserHandler Apache::AnonProxy
The drawback to using PerlHeaderParserHandler
like this is that, unlike PerlTransHandler, the
directive is allowed in directory configuration sections and
.htaccess files. But directory configuration
sections are irrelevant in proxy requests, so the directive will
silently fail if placed in one of these sections. The directive
should go in the main part of one of the configuration files or in a
virtual host section.
Example 7.11. A Simple Anonymizing Proxy
package Apache::AnonProxy;
# file: Apache/AnonProxy.pm
use strict;
use Apache::Constants qw(:common);
my @Remove = qw(user-agent cookie from referer);
sub handler {
my $r = shift;
return DECLINED unless $r->proxyreq;
foreach (@Remove) {
$r->header_in($_ => undef);
}
return OK;
}
1;
__END__
In order to test that this handler was actually working, we set up a
test Apache server as the target of the proxy requests and added the
following entry to its configuration file:
CustomLog logs/nosy_log "%h %{Referer}i %{User-Agent}i %{Cookie}i %U"
This created a "nosy" log that contains entries for the
Referer, User-Agent, and
Cookie fields. Before installing the anonymizing
proxy module, entries in this log looked like this (the lines have
been wrapped to fit on the page):
192.168.2.5 http://prego/ Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686)
- /tkdocs/tk_toc.ht
192.168.2.5 http://prego/ Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686)
POMIS=10074 /perl/hangman1.pl
In contrast, after installing the anonymizing proxy module, all the
identifying information was stripped out, leaving only the IP address
of the proxy machine:
192.168.2.5 - - - /perl/hangman1.pl
192.168.2.5 - - - /icons/hangman/h0.gif
192.168.2.5 - - - /cgi-bin/info2www
7.9.3 Handling the Proxy Process on Your Own
As long
as you only need to monitor or modify the request half of a proxy
transaction, you can use Apache's
mod_proxy module directly as we did in the
previous two examples. However, if you also want to intercept the
response so as to modify the information returned from the remote
server, then you'll need to handle the proxy request on your
own.
In this
section, we present Apache::AdBlocker. This
module replaces Apache's mod_proxy with a
specialized proxy that filters the content of certain URLs.
Specifically, it looks for URLs that are likely to be banner
advertisements and replaces their content with a transparent GIF
image that says "Blocked Ad." This can be used to
"lower the volume" of commercial sites by removing
distracting animated GIFs and brightly colored banners. Figure 7.3 shows what the AltaVista search site looks like
when fetched through the Apache::AdBlocker
proxy.

The code for Apache::AdBlocker is given in Example 7.12. It is a bit more complicated than the other
modules we've worked with in this chapter but not much more.
The basic strategy is to install two handlers. The first handler is
activated during the URI translation phase. It doesn't actually
alter the URI or filename in any way, but it does inspect the
transaction to see if it is a proxy request. If this is the case, the
handler installs a custom content handler to actually go out and do
the request. In this respect, the translation handler is similar to
Apache::Checksum3, which also installs a custom
content handler for certain URIs.
Later on, when its content handler is called, the module uses the
Perl LWP library to fetch the remote document. If the document does
not appear to be a banner ad, the content handler forwards it on to
the waiting client. Otherwise, the handler does a little switcheroo,
replacing the advertisement with a custom GIF image of exactly the
same size and shape as the ad. This bit of legerdemain is completely
invisible to the browser, which goes ahead and renders the image as
if it were the original banner ad.
In addition to the LWP library, this module requires the
GD and Image::Size
libraries for creating and manipulating images. They are available on
CPAN if you do not already have them installed.
Turning to the code, after the familiar preamble we create a new
LWP::UserAgent object that we will use to make
all our requests for documents from remote servers:
@ISA = qw(LWP::UserAgent);
$VERSION = '1.00';
my $UA = __PACKAGE__->new;
$UA->agent(join "/", __PACKAGE__, $VERSION);
We actually subclass LWP::UserAgent, using the
@ISA global to create an inheritance relationship
between LWP::UserAgent and our own package.
Although the only LWP::UserAgent method we override is
redirect_ok(), which appears at the end of the module, making our
module a subclass of LWP::UserAgent allows us to cleanly customize
other methods at a later date should we need to.
We now create a new instance of the
LWP::UserAgent subclass, using the special token
__PACKAGE__, which evaluates at
compile time to the name of the current package. In this case,
__PACKAGE__->new is equivalent to
Apache::AdBlocker->new (or
new Apache::AdBlocker if you
prefer Smalltalk syntax). Immediately afterward we call the
object's agent() method with a string
composed of the package name and version number. This is the calling
card that LWP sends to the remote hosts' web servers as the
HTTP User-Agent field. The method we use for
constructing the User-Agent field creates the
string Apache::AdBlocker/1.00.
my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};
The last initialization step is to define a package global named
$Ad that defines a pattern match that picks up
many (but certainly not all) banner advertisement URIs. Most ads
contain variants on the words "ad,"
"advertisement," "banner," or
"promotion" somewhere in the URI, although this may have
changed by the time you read this!
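It is worth seeing what this pattern does and does not catch. The sketch below applies the same expression to a handful of made-up URIs; the example.com addresses and the looks_like_ad() helper are illustrative assumptions. Note that the module itself applies the test only to documents whose content type is an image.

```perl
use strict;
use warnings;

my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

# Hypothetical helper wrapping the same match the content handler uses
sub looks_like_ad {
    my ($uri) = @_;
    return $uri =~ /\b($Ad)\b/i ? 1 : 0;
}

my @samples = (
    'http://example.com/ads/banner1.gif',        # "ads" as a whole word
    'http://example.com/Banners/top.jpg',        # match is case-insensitive
    'http://example.com/downloads/file.gif',     # "ads" inside "downloads"
    'http://example.com/img/banner_468x60.gif',  # "_" counts as a word character
    'http://example.com/images/logo.gif',        # nothing ad-like
);

for my $uri (@samples) {
    printf "%-45s %s\n", $uri, (looks_like_ad($uri) ? 'blocked' : 'passed');
}
```

The \b anchors keep the pattern from firing on words such as "downloads," but they also let a name like banner_468x60.gif slip through, because the underscore is a word character and so no word boundary follows "banner."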
sub handler {
my $r = shift;
return DECLINED unless $r->proxyreq;
$r->handler("perl-script"); #ok, let's do it
$r->push_handlers(PerlHandler => \&proxy_handler);
return OK;
}
The next part of the module is the definition of the
handler() subroutine, which in this case will
be run during the URI translation phase. It simply checks whether the
current transaction is a proxy request and declines the transaction
if not. Otherwise, it calls the request object's
handler() method to set the content handler to
perl-script and calls push_handlers() to make the module's proxy_handler() subroutine the callback for the response phase of the
transaction. handler() then returns
OK to flag that it has handled the URI translation
phase.
Most of the work is done in proxy_handler().
Its job is to use LWP's object-oriented methods to create an
HTTP::Request object. The
HTTP::Request is then forwarded to the remote
host by the LWP::UserAgent, returning an
HTTP::Response. The response must then be
returned to the waiting browser, possibly after replacing the
content. The only subtlety here is the need to copy the request
headers from the incoming Apache request's
headers_in() table to the
HTTP::Request and, in turn, to copy the response
headers from the HTTP::Response into the Apache
request headers_out() table. If this copying
back and forth isn't performed, then documents that rely on the
exact values of certain HTTP fields, such as CGI scripts, will fail
to work correctly across the proxy.
sub proxy_handler {
my $r = shift;
my $request = HTTP::Request->new($r->method, $r->uri);
proxy_handler() starts by recovering the Apache
request object. It then uses the request object's
method() and uri()
methods to fetch the request method and the URI. These are used to
create and initialize a new HTTP::Request. We
now feed the incoming header fields from the Apache request object
into the corresponding fields in the outgoing HTTP::Request:
$r->headers_in->do(sub {
$request->header(@_);
});
We use a little trick to accomplish the copy. The
headers_in()
method (as opposed to the
header_in() method that we have seen before)
returns an instance of the
Apache::Table
class. This class, described in more
detail in Section 9.1 (see Section 9.2.5), implements methods for manipulating
Apache's various table-like structures, including the incoming
and outgoing HTTP header fields. One of these methods is
do(), which when passed a CODE reference
invokes the code once for each header field, passing to the routine
the header's name and value each time. In this case, we call
do() with an anonymous subroutine that passes
the header keys and values on to the
HTTP::Request object's header() method. It is important to use
headers_in->do() here rather than copying the headers into a hash because
certain headers, particularly Cookie, can be
multivalued.
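The following sketch shows why a hash is the wrong intermediate container for headers. It uses a plain list of name/value pairs in place of a real Apache::Table, and table_do() is a hypothetical stand-in for the table's do() method; the cookie values are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical header table: the same field name may appear more than once.
my @table = (
    [ 'Accept' => 'image/gif'   ],
    [ 'Cookie' => 'session=abc' ],
    [ 'Cookie' => 'theme=dark'  ],
);

# Stand-in for the table's do() method: one callback call per entry
sub table_do {
    my ($callback, @entries) = @_;
    $callback->(@$_) for @entries;
    return;
}

# Copying through a hash: the second Cookie silently replaces the first
my %as_hash;
table_do(sub { my ($name, $value) = @_; $as_hash{$name} = $value }, @table);

# Copying entry by entry: both Cookie values are kept
my @as_pairs;
table_do(sub { push @as_pairs, [@_] }, @table);

my @cookies = map { $_->[1] } grep { $_->[0] eq 'Cookie' } @as_pairs;
print "hash copy kept:  $as_hash{Cookie}\n";
print "pairs copy kept: @cookies\n";
```

One caveat, stated as an assumption to check against the HTTP::Headers documentation: as far as we know, header() replaces any existing value for a field while push_header() appends to it, so a handler that must preserve repeated fields may want to call push_header() inside the callback instead.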
# copy POST data, if any
if($r->method eq 'POST') {
my $len = $r->header_in('Content-length');
my $buf;
$r->read($buf, $len);
$request->content($buf);
}
The next block of code checks whether the request method is POST. If
so, we must copy the POSTed data from the incoming request to the
HTTP::Request object. We do this by calling the
request object's read() method to read
the POST data into a temporary buffer. The data is then copied into
the HTTP::Request by calling its
content() method. Request methods other than
POST may include a request body, but this example does not cope with
these rare cases.
The HTTP::Request object is now complete, so we
can actually issue the request:
my $response = $UA->request($request);
We pass the HTTP::Request object to the user
agent's request() method. After a delay
for the network fetch, the call returns an
HTTP::Response object, which we copy into a
variable named $response.
$r->content_type($response->header('Content-type'));
$r->status($response->code);
$r->status_line(join " ", $response->code, $response->message);
Now the process of copying the headers is reversed. Every header in
the LWP HTTP::Response object must be copied to
the Apache request object. First, we handle a few special cases. We
call the HTTP::Response object's
header() method to fetch the content type of
the returned document and immediately pass the result to the Apache
request object's content_type() method.
Next, we set the numeric HTTP status code and the human-readable HTTP
status line. We call the HTTP::Response
object's code() and message() methods to return the numeric code and human-readable
messages, respectively, and copy them to the Apache request object,
using the status() and status_line() methods to set the values.
When the special case headers are done, we copy all the other header
fields, using the HTTP::Response object's
scan() method:
$response->scan(sub {
$r->header_out(@_);
});
scan() is similar to the
Apache::Table do() method:
it loops through each of the header fields, invoking an anonymous
callback routine for each one. The callback sets the corresponding
field in the Apache request object using the header_out() method.
if ($r->header_only) {
$r->send_http_header();
return OK;
}
The outgoing header is complete at this point, so we check whether
the current transaction is a HEAD request. If so, we emit the HTTP
header and exit with an OK status code.
my $content = \$response->content;
if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
block_ad($content);
$r->content_type("image/gif");
}
Otherwise, the time has come to deal with potential banner ads. To
identify likely ads, we require that the document be an image and
that its URI satisfy the regular expression match defined at the top
of the module. We retrieve the document contents by calling the
HTTP::Response object's content() method, and store a reference to the contents in a local
variable named $content. We now check whether the
document's MIME type is one of the image variants and that the
URI satisfies the advertisement pattern match. If both of these are
true, we call block_ad() to replace the content
with a customized image. We also set the document's content
type to image/gif, since this is what
block_ad() produces.
$r->content_type('text/html') unless $$content;
$r->send_http_header;
$r->print($$content || $response->error_as_HTML);
We send the HTTP header, then print the document contents. Notice
that the document content may be empty, which can happen when LWP
connects to a server that is down or busy. In this case, instead of
printing an empty document, we return the nicely formatted error
message returned by the HTTP::Response
object's error_as_HTML() method.
return OK;
}
Our work is done, so we return an OK status code.
The block_ad() subroutine is short and sweet.
Its job is to take an image in any of several possible formats and
replace it with a custom GIF of exactly the same dimensions. The GIF
will be transparent, allowing the page background color to show
through, and will have the words "Blocked Ad" printed in
large friendly letters in the upper lefthand corner.
sub block_ad {
my $data = shift;
my($x, $y) = imgsize($data);
my $im = GD::Image->new($x,$y);
To get the width and height of the image, we call imgsize(),
a function imported from the
Image::Size module. imgsize()
recognizes most web image formats, including GIF, JPEG,
XBM, and PNG. Using these values, we create a new blank
GD::Image object and store it in a variable
named $im.
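To give a feel for the kind of work imgsize() does, here is a sketch that recovers the dimensions of a GIF only, by reading the fixed-size header at the start of the data. The gif_size() helper is our own illustration and is not how Image::Size is necessarily implemented; like imgsize() as used above, it takes a reference to the image data.

```perl
use strict;
use warnings;

# Hypothetical helper, GIF only: a GIF begins with a six-byte signature
# followed by the width and height as 16-bit little-endian integers.
sub gif_size {
    my ($data) = @_;                      # reference to the image bytes
    return unless length($$data) >= 10;
    my ($signature, $width, $height) = unpack 'a6 v v', $$data;
    return unless $signature eq 'GIF87a' || $signature eq 'GIF89a';
    return ($width, $height);
}

# Build just enough of a GIF header to exercise the helper
my $fake_gif = pack 'a6 v v', 'GIF89a', 468, 60;
my ($x, $y) = gif_size(\$fake_gif);
print "width=$x height=$y\n";

# Data that is not a GIF yields an empty list
my $junk = 'not an image';
my @none = gif_size(\$junk);
print "unrecognized data\n" unless @none;
```

A production handler would do well to check that the size lookup succeeded before creating the replacement image, since a failed lookup leaves the width and height undefined.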
my $white = $im->colorAllocate(255,255,255);
my $black = $im->colorAllocate(0,0,0);
my $red = $im->colorAllocate(255,0,0);
We call the image object's colorAllocate()
method three
times to allocate color table entries for white, black, and red. Then
we declare that the white color is transparent, using the
transparent() method:
$im->transparent($white);
$im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
$im->rectangle(0,0,$x-1,$y-1,$black);
$$data = $im->gif;
}
The routine calls the string() method to draw
the message starting at coordinates (5,5) and finally frames the
whole image with a black rectangle. The custom image is now converted
into GIF format with the gif() method and
copied into $$data, overwriting whatever was there
before.
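The reason block_ad() can replace the document without returning anything is that it receives a reference to the content rather than a copy. A minimal sketch of the idiom, with the image handling left out and replace_in_place() as a hypothetical stand-in for block_ad():

```perl
use strict;
use warnings;

# Hypothetical stand-in for block_ad(): overwrite the caller's data
# through the reference and report how many bytes were replaced.
sub replace_in_place {
    my ($data) = @_;                 # reference to the caller's scalar
    my $old_length = length $$data;
    $$data = 'PLACEHOLDER';          # assignment goes through the reference
    return $old_length;
}

my $body     = 'original image bytes';
my $content  = \$body;
my $replaced = replace_in_place($content);

print "replaced $replaced bytes; body is now '$body'\n";
```

Because the caller and the subroutine share one scalar, the content handler can go on to print $$content and send whatever block_ad() left there.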
sub redirect_ok {return undef;}
The last detail is to define a redirect_ok()
method to override the default LWP::UserAgent
method. By returning undef this method tells LWP
not to handle redirects internally but to pass them on to the browser
to handle. This is the correct behavior for a proxy server.
Activating this module is just a matter of adding the following line
to one of the configuration files:
PerlTransHandler Apache::AdBlocker
Users who wish to make use of this filtering service should configure
their browsers to proxy their requests through your server.
Example 7.12. A Banner Ad Blocking Proxy
package Apache::AdBlocker;
# file: Apache/AdBlocker.pm
use strict;
use vars qw(@ISA $VERSION);
use Apache::Constants qw(:common);
use GD ();
use Image::Size qw(imgsize);
use LWP::UserAgent ();
@ISA = qw(LWP::UserAgent);
$VERSION = '1.00';
my $UA = __PACKAGE__->new;
$UA->agent(join "/", __PACKAGE__, $VERSION);
my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};
sub handler {
my $r = shift;
return DECLINED unless $r->proxyreq;
$r->handler("perl-script"); #ok, let's do it
$r->push_handlers(PerlHandler => \&proxy_handler);
return OK;
}
sub proxy_handler {
my $r = shift;
my $request = HTTP::Request->new($r->method, $r->uri);
$r->headers_in->do(sub {
$request->header(@_);
});
# copy POST data, if any
if($r->method eq 'POST') {
my $len = $r->header_in('Content-length');
my $buf;
$r->read($buf, $len);
$request->content($buf);
}
my $response = $UA->request($request);
$r->content_type($response->header('Content-type'));
#feed response back into our request_rec*
$r->status($response->code);
$r->status_line(join " ", $response->code, $response->message);
$response->scan(sub {
$r->header_out(@_);
});
if ($r->header_only) {
$r->send_http_header();
return OK;
}
my $content = \$response->content;
if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
block_ad($content);
$r->content_type("image/gif");
}
$r->content_type('text/html') unless $$content;
$r->send_http_header;
$r->print($$content || $response->error_as_HTML);
return OK;
}
sub block_ad {
my $data = shift;
my($x, $y) = imgsize($data);
my $im = GD::Image->new($x,$y);
my $white = $im->colorAllocate(255,255,255);
my $black = $im->colorAllocate(0,0,0);
my $red = $im->colorAllocate(255,0,0);
$im->transparent($white);
$im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
$im->rectangle(0,0,$x-1,$y-1,$black);
$$data = $im->gif;
}
sub redirect_ok {return undef;}
1;
__END__