
Handling Proxy Requests




Book: Writing Apache Modules with Perl and C
Section: Chapter 7. Other Request Phases



7.9 Handling Proxy Requests

The HTTP proxy protocol was originally designed to allow users unfortunate enough to be stuck behind a firewall to access external web sites. Instead of connecting to the remote server directly, an action forbidden by the firewall, users point their browsers at a proxy server located on the firewall machine itself. The proxy goes out and fetches the requested document from the remote site and forwards the retrieved document to the user.

Nowadays most firewall systems have a web proxy built right in so there's no need for dedicated proxying servers. However, proxy servers are still useful for a variety of purposes. For example, a caching proxy (of which Apache is one example) will store frequently requested remote documents in a disk directory and return the cached documents directly to the browser instead of fetching them anew. Anonymizing proxies take the outgoing request and strip out all the headers that can be used to identify the user or his browser. By writing Apache API modules that participate in the proxy process, you can achieve your own special processing of proxy requests.

The proxy request/response protocol is nearly the same as vanilla HTTP. The major difference is that instead of requesting a server-relative URI in the request line, the client asks for a full URL, complete with scheme and host. In addition, a few optional HTTP headers beginning with Proxy- may be added to the request. For example, a normal (nonproxy) HTTP request sent by a browser might look like this:

GET /foo/index.html HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Pragma: no-cache
Connection: Keep-Alive
User-Agent: Mozilla/2.01 (WinNT; I)
Host: www.modperl.com:80

In contrast, the corresponding HTTP proxy request will look like this:

GET http://www.modperl.com/foo/index.html HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Pragma: no-cache
User-Agent: Mozilla/2.01 (WinNT; I)
Host: www.modperl.com:80
Proxy-Connection: Keep-Alive

Notice that the URL in the request line of an HTTP proxy request includes the scheme and hostname. This information enables the proxy server to initiate a connection to the distant server. To generate this type of request, the user must configure his browser so that HTTP and, optionally, FTP requests are proxied to the server. This usually involves setting values in the browser's preference screens. An Apache server will be able to respond to this type of request if it has been compiled with the mod_proxy module. This module is part of the core Apache distribution but is not compiled in by default.
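The difference between the two request lines can be checked mechanically. The following is a minimal sketch in plain Perl, independent of Apache; parse_request_line() is a hypothetical helper written for this illustration, and it assumes that any request target beginning with a scheme and "://" is a proxy request.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Apache or mod_perl): split an HTTP
# request line and report whether its target is an absolute URL.
sub parse_request_line {
    my ($line) = @_;
    my ($method, $target, $protocol) = split ' ', $line;
    # An absolute URL begins with a scheme followed by "://"
    my $is_proxy = $target =~ m{^[A-Za-z][A-Za-z0-9+.-]*://} ? 1 : 0;
    return {
        method   => $method,
        target   => $target,
        protocol => $protocol,
        is_proxy => $is_proxy,
    };
}

for my $line ('GET /foo/index.html HTTP/1.0',
              'GET http://www.modperl.com/foo/index.html HTTP/1.0') {
    my $req = parse_request_line($line);
    printf "%-55s %s\n", $line, $req->{is_proxy} ? 'proxy' : 'ordinary';
}
```

Inside a handler there is no need for this kind of parsing, because the request object's proxyreq() method already reports whether the request is a proxy request.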

You can interact with Apache's proxy mechanism at the translation handler phase. There are two types of interventions you can make. You can take an ordinary (nonproxy) request and change it into one so that it will be handled by Apache's standard proxy module, or you can take an incoming proxy request and install your own content handler for it so that you can examine and possibly modify the response from the remote server.

7.9.1 Invoking mod_proxy for Nonproxy Requests

We'll look first at Apache::PassThru, an example of how to turn an ordinary request into a proxy request.[9] Because this technique uses Apache's mod_proxy module, this module will have to be compiled and installed in order for this example to run on your system.

[9] There are several third-party Perl API modules on CPAN that handle proxy requests, including one named Apache::ProxyPass and another named Apache::ProxyPassThru. If you are looking for the functionality of Apache::PassThru, you should examine one of these more finished products before using this one as the basis for your own module.

The idea behind the example is simple. Requests for URIs beginning with a certain path will be dynamically transformed into a proxy request. For example, we might transform requests for URLs beginning with /CPAN/ into a request for http://www.perl.com/CPAN/. The request to www.perl.com will be done completely behind the scenes; nothing will reveal to the user that the directory hierarchy is being served from a third-party server rather than our own. This functionality is the same as the ProxyPass directive provided by mod_proxy itself. You can also achieve the same effect by providing an appropriate rewrite rule to mod_rewrite.

The configuration for this example uses a PerlSetVar to set a variable named PerlPassThru. A typical entry in the configuration file will look like this:

PerlTransHandler Apache::PassThru
PerlSetVar PerlPassThru '/CPAN/   => http://www.perl.com/CPAN/,\
                         /search/ => http://www.altavista.digital.com/'

The PerlPassThru variable contains a string representing a series of URI=>proxy pairs, separated by commas. A backslash at the end of a line can be used to split the string over several lines, improving readability (the ability to use backslash as a continuation character is actually an Apache configuration file feature, though not a well-publicized one). In this example, we map the URI /CPAN/ to http://www.perl.com/CPAN/ and /search/ to http://www.altavista.digital.com/. For the mapping to work correctly, local directory names should end with a slash in the manner shown in the example.
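To see how such a string breaks down into pairs, here is a small standalone sketch. The paths and hostnames are made up for illustration, and the string is only an approximation of what dir_config() would hand to the handler; the split pattern is the one the module itself uses.

```perl
use strict;
use warnings;

# Hypothetical value of the kind dir_config('PerlPassThru') might return.
my $value = '/mirror/ => http://mirror.example.com/pub/, '
          . '/find/   => http://search.example.com/';

# Split on commas and on "=>", discarding surrounding whitespace. The
# resulting flat list (path, URL, path, URL, ...) initializes the hash.
my %mappings = split /\s*(?:,|=>)\s*/, $value;

for my $path (sort keys %mappings) {
    print "$path maps to $mappings{$path}\n";
}
```

Because the pattern treats every comma and every => as a separator, this simple format cannot represent a proxy URL that itself contains either of those sequences.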

The code for Apache::PassThru is given in Example 7.10. The handler() subroutine begins by retrieving the request object and calling its proxyreq() method to determine whether the current request is a proxy request:

sub handler {
    my $r = shift;
    return DECLINED if $r->proxyreq;

If this is already a proxy request, we don't want to alter it in any way, so we decline the transaction. Otherwise, we retrieve the value of PerlPassThru, split it into its key/value components with a pattern match, and store the result in a hash named %mappings:

    my $uri = $r->uri;
    my %mappings = split /\s*(?:,|=>)\s*/, $r->dir_config('PerlPassThru');

We now loop through each of the local paths, looking for a match with the current request's URI. If a match is found, we perform a string substitution to replace the local path with the corresponding proxy URI. Otherwise, we continue to loop:

    for my $src (keys %mappings) {
       next unless $uri =~ s/^$src/$mappings{$src}/;
       $r->proxyreq(1);
       $r->uri($uri);
       $r->filename("proxy:$uri");
       $r->handler('proxy-server');
       return OK;
    }
    return DECLINED;
}

If the URI substitution succeeds, there are four steps we need to take to transform this request into something that mod_proxy will handle. The first two are obvious, but the others are less so. First, we need to set the proxy request flag to a true value by calling $r->proxyreq(1). Next, we change the requested URI to the proxied URI by calling the request object's uri() method. In the third step, we set the request filename to the string proxy: followed by the URI, as in proxy:http://www.perl.com/CPAN/. This is a special filename format recognized by mod_proxy, and as such is somewhat arbitrary. The last step is to set the content handler to proxy-server, so that the request is passed to mod_proxy to handle the response phase.

If we turned the local path into a proxy request, we return OK from the translation handler. Otherwise, we return DECLINED.
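The rewriting step can be tried outside of Apache. The sketch below is a variation on the loop above, not the module's own code: proxied_url() is a hypothetical helper, the mapping is made up, and two details are our own additions. It quotes the local path with \Q...\E so that it is matched literally, and it tries longer paths first, since the order in which keys() returns hash entries is not defined.

```perl
use strict;
use warnings;

# Hypothetical helper: return the proxied URL for $uri, or undef when
# no configured path prefix matches.
sub proxied_url {
    my ($uri, %mappings) = @_;
    # Try longer prefixes first so that overlapping paths resolve
    # predictably.
    for my $src (sort { length($b) <=> length($a) } keys %mappings) {
        # \Q...\E makes the path match literally rather than as a regex.
        return $uri if $uri =~ s/^\Q$src\E/$mappings{$src}/;
    }
    return;
}

# Made-up mapping for illustration
my %mappings = ('/mirror/' => 'http://mirror.example.com/pub/');

for my $uri ('/mirror/README', '/index.html') {
    my $url = proxied_url($uri, %mappings);
    if (defined $url) {
        # The values the handler would pass to uri() and filename()
        print "$uri: uri=$url filename=proxy:$url\n";
    }
    else {
        print "$uri: no mapping, decline\n";
    }
}
```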

Example 7.10. Invoking Apache's Proxy Request Mechanism from Within a Translation Handler
package Apache::PassThru;
# file: Apache/PassThru.pm
use strict;
use Apache::Constants qw(:common);

sub handler {
    my $r = shift;
    return DECLINED if $r->proxyreq;
    my $uri = $r->uri;
    my %mappings = split /\s*(?:,|=>)\s*/, $r->dir_config('PerlPassThru');
    for my $src (keys %mappings) {
       next unless $uri =~ s/^$src/$mappings{$src}/;
       $r->proxyreq(1);
       $r->uri($uri);
       $r->filename("proxy:$uri");
       $r->handler('proxy-server');
       return OK;
    }
    return DECLINED;
}
1;
__END__

7.9.2 An Anonymizing Proxy

As public concern about the ability of web servers to track people's surfing sessions grows, anonymizing proxies are becoming more popular. An anonymizing proxy is similar to an ordinary web proxy, except that certain HTTP headers that provide identifying information, such as the Referer, Cookie, User-Agent, and From fields, are quietly stripped from the request before it is forwarded to the remote server. Not only is this identifying information removed, but the identity of the requesting host is obscured. The remote server knows only the hostname and IP address of the proxy machine, not the identity of the machine the user is browsing from.

You can write a simple anonymizing proxy in the Apache Perl API in all of 18 lines (including comments). The source code listing is shown in Example 7.11. Like the previous example, it uses Apache's mod_proxy, so that module must be installed before this example will run correctly.

The module defines a package global named @Remove containing the names of all the request headers to be stripped from the request. In this example, we remove User-Agent, Cookie, Referer, and the infrequently used From field. The handler() subroutine begins by fetching the Apache request object and checking whether the current request uses the proxy protocol. However, unlike the previous example where we wanted the existence of the proxy to be secret, here we expect the user to explicitly configure his browser to use our anonymizing proxy. So here we return DECLINED if proxyreq() returns false.

If proxyreq() returns true, we know that we are in the midst of a proxy request. We loop through each of the fields to be stripped and delete them from the incoming headers table by using the request object's header_in() method to set the field to undef. We then return OK to signal Apache to continue processing the request. That's all there is to it.
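The effect of the stripping loop can be sketched with an ordinary Perl hash standing in for the incoming headers table. This is only an analogy: anonymize() is a hypothetical helper, and because a plain hash is case-sensitive (unlike HTTP field names), the sketch lowercases each name before comparing it with the list.

```perl
use strict;
use warnings;

# Field names to strip, in lowercase as in the module's @Remove list.
my @Remove = qw(user-agent cookie from referer);

# Hypothetical helper: return a copy of a header hash without the
# identifying fields, comparing names case-insensitively.
sub anonymize {
    my (%headers) = @_;
    my %strip = map { $_ => 1 } @Remove;
    delete @headers{ grep { $strip{ lc($_) } } keys %headers };
    return %headers;
}

my %clean = anonymize(
    'Accept'     => '*/*',
    'User-Agent' => 'Mozilla/2.01 (WinNT; I)',
    'Referer'    => 'http://prego/',
    'Host'       => 'www.modperl.com:80',
);
print join(', ', sort keys %clean), "\n";
```

Only the Accept and Host fields survive in this run; everything named in @Remove is gone regardless of how the client capitalized it.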

To activate the anonymizing proxy, install it as a URI translation handler as before:

PerlTransHandler Apache::AnonProxy

An alternative that works just as well is to call the module during the header parsing phase (see the discussion of this phase earlier). In some ways, this makes more sense because we aren't doing any actual URI translation, but we are modifying the HTTP header. Here is the appropriate directive:

PerlHeaderParserHandler Apache::AnonProxy

The drawback to using PerlHeaderParserHandler like this is that, unlike PerlTransHandler, the directive is allowed in directory configuration sections and .htaccess files. But directory configuration sections are irrelevant in proxy requests, so the directive will silently fail if placed in one of these sections. The directive should go in the main part of one of the configuration files or in a virtual host section.

Example 7.11. A Simple Anonymizing Proxy
package Apache::AnonProxy;
# file: Apache/AnonProxy.pm
use strict;
use Apache::Constants qw(:common);

my @Remove = qw(user-agent cookie from referer);

sub handler {
    my $r = shift;
    return DECLINED unless $r->proxyreq;
    foreach (@Remove) {
       $r->header_in($_ => undef);
    }
    return OK;
}

1;
__END__

In order to test that this handler was actually working, we set up a test Apache server as the target of the proxy requests and added the following entry to its configuration file:

CustomLog logs/nosy_log "%h %{Referer}i %{User-Agent}i %{Cookie}i %U"

This created a "nosy" log that contains entries for the Referer, User-Agent, and Cookie fields. Before installing the anonymous proxy module, entries in this log looked like this (the lines have been wrapped to fit on the page):

192.168.2.5 http://prego/ Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686)
       - /tkdocs/tk_toc.ht
192.168.2.5 http://prego/ Mozilla/4.04 [en] (X11; I; Linux 2.0.33 i686)
       POMIS=10074 /perl/hangman1.pl

In contrast, after installing the anonymizing proxy module, all the identifying information was stripped out, leaving only the IP address of the proxy machine:

192.168.2.5 - - -  /perl/hangman1.pl 
192.168.2.5 - - -  /icons/hangman/h0.gif 
192.168.2.5 - - -  /cgi-bin/info2www











7.9.3 Handling the Proxy Process on Your Own

As long as you only need to monitor or modify the request half of a proxy transaction, you can use Apache's mod_proxy module directly as we did in the previous two examples. However, if you also want to intercept the response so as to modify the information returned from the remote server, then you'll need to handle the proxy request on your own.

In this section, we present Apache::AdBlocker. This module replaces Apache's mod_proxy with a specialized proxy that filters the content of certain URLs. Specifically, it looks for URLs that are likely to be banner advertisements and replaces their content with a transparent GIF image that says "Blocked Ad." This can be used to "lower the volume" of commercial sites by removing distracting animated GIFs and brightly colored banners. Figure 7.3 shows what the AltaVista search site looks like when fetched through the Apache::AdBlocker proxy.

Figure 7.3. The AltaVista search engine after filtering by Apache::AdBlocker


The code for Apache::AdBlocker is given in Example 7.12. It is a bit more complicated than the other modules we've worked with in this chapter but not much more. The basic strategy is to install two handlers. The first handler is activated during the URI translation phase. It doesn't actually alter the URI or filename in any way, but it does inspect the transaction to see if it is a proxy request. If this is the case, the handler installs a custom content handler to actually go out and do the request. In this respect, the translation handler is similar to Apache::Checksum3, which also installs a custom content handler for certain URIs.

Later on, when its content handler is called, the module uses the Perl LWP library to fetch the remote document. If the document does not appear to be a banner ad, the content handler forwards it on to the waiting client. Otherwise, the handler does a little switcheroo, replacing the advertisement with a custom GIF image of exactly the same size and shape as the ad. This bit of legerdemain is completely invisible to the browser, which goes ahead and renders the image as if it were the original banner ad.

In addition to the LWP library, this module requires the GD and Image::Size libraries for creating and manipulating images. They are available on CPAN if you do not already have them installed.

Turning to the code, after the familiar preamble we create a new LWP::UserAgent object that we will use to make all our requests for documents from remote servers:

@ISA = qw(LWP::UserAgent);
$VERSION = '1.00';

my $UA = __PACKAGE__->new;
$UA->agent(join "/", __PACKAGE__, $VERSION);

We actually subclass LWP::UserAgent, using the @ISA global to create an inheritance relationship between LWP::UserAgent and our own package. The only LWP::UserAgent method we override is redirect_ok(), defined at the end of the module, but making our module a subclass of LWP::UserAgent allows us to cleanly customize other methods at a later date should we need to.

We now create a new instance of the LWP::UserAgent subclass, using the special token __PACKAGE__, which evaluates at compile time to the name of the current package. In this case, __PACKAGE__->new is equivalent to Apache::AdBlocker->new (or new Apache::AdBlocker if you prefer Smalltalk syntax). Immediately afterward we call the object's agent() method with a string composed of the package name and version number. This is the calling card that LWP sends to the remote hosts' web servers as the HTTP User-Agent field. The method we use for constructing the User-Agent field creates the string Apache::AdBlocker/1.00.

my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

The last initialization step is to define a package global named $Ad that defines a pattern match that picks up many (but certainly not all) banner advertisement URIs. Most ads contain variants on the words "ad," "advertisement," "banner," or "promotion" somewhere in the URI, although this may have changed by the time you read this!
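It is worth trying the pattern against a few URIs to get a feel for what it catches. The sketch below wraps the same regular expression that the module uses later in a hypothetical helper named looks_like_ad(); the URLs are made up for illustration.

```perl
use strict;
use warnings;

my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

# Hypothetical helper wrapping the module's URI test.
sub looks_like_ad {
    my ($uri) = @_;
    return $uri =~ /\b($Ad)\b/i ? 1 : 0;
}

# Made-up URLs for illustration only
for my $uri (qw(
    http://www.example.com/ads/top.gif
    http://www.example.com/images/banner.gif
    http://www.example.com/images/logo.gif
    http://www.example.com/img/banner_top.gif
)) {
    printf "%-45s %s\n", $uri, looks_like_ad($uri) ? 'blocked' : 'passed';
}
```

The first two URLs match and the last two do not. The last case shows a limit of the \b anchors: an underscore counts as a word character, so banner_top is treated as a single word and slips through.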

sub handler {
    my $r = shift;
    return DECLINED unless $r->proxyreq;
    $r->handler("perl-script"); #ok, let's do it
    $r->push_handlers(PerlHandler => \&proxy_handler);
    return OK;
}

The next part of the module is the definition of the handler() subroutine, which in this case will be run during the URI translation phase. It simply checks whether the current transaction is a proxy request and declines the transaction if not. Otherwise, it calls the request object's handler() method to set the content handler to perl-script and calls push_handlers() to make the module's proxy_handler() subroutine the callback for the response phase of the transaction. handler() then returns OK to flag that it has handled the URI translation phase.

Most of the work is done in proxy_handler(). Its job is to use LWP's object-oriented methods to create an HTTP::Request object. The HTTP::Request is then forwarded to the remote host by the LWP::UserAgent, returning an HTTP::Response. The response must then be returned to the waiting browser, possibly after replacing the content. The only subtlety here is the need to copy the request headers from the incoming Apache request's headers_in() table to the HTTP::Request and, in turn, to copy the response headers from the HTTP::Response into the Apache request headers_out() table. If this copying back and forth isn't performed, then documents that rely on the exact values of certain HTTP fields, such as CGI scripts, will fail to work correctly across the proxy.

sub proxy_handler {
    my $r = shift;

    my $request = HTTP::Request->new($r->method, $r->uri);

proxy_handler() starts by recovering the Apache request object. It then uses the request object's method() and uri() methods to fetch the request method and the URI. These are used to create and initialize a new HTTP::Request. We now feed the incoming header fields from the Apache request object into the corresponding fields in the outgoing HTTP::Request:

    $r->headers_in->do(sub {
       $request->header(@_);
    });

We use a little trick to accomplish the copy. The headers_in() method (as opposed to the header_in() method that we have seen before) returns an instance of the Apache::Table class. This class, described in more detail in Section 9.1 (see Section 9.2.5), implements methods for manipulating Apache's various table-like structures, including the incoming and outgoing HTTP header fields. One of these methods is do(), which when passed a CODE reference invokes the code once for each header field, passing to the routine the header's name and value each time. In this case, we call do() with an anonymous subroutine that passes the header keys and values on to the HTTP::Request object's header() method. It is important to use headers_in->do() here rather than copying the headers into a hash because certain headers, particularly Cookie, can be multivalued.
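The reason a hash is a poor container for header fields can be shown in plain Perl. The sketch below is an analogy only and does not use the Apache::Table API: @table is a stand-in for a header table, and each_pair() is a hypothetical iterator that, like do(), invokes a callback once per name/value pair.

```perl
use strict;
use warnings;

# A plain-Perl stand-in for a header table: an ordered list of
# name/value pairs in which a field name may repeat.
my @table = (
    [ 'Accept' => '*/*' ],
    [ 'Cookie' => 'a=1' ],
    [ 'Cookie' => 'b=2' ],
);

# Hypothetical analogue of a do()-style iterator: call $code once
# for each pair, passing the name and the value.
sub each_pair {
    my ($code, @pairs) = @_;
    $code->(@$_) for @pairs;
    return;
}

my @seen;
each_pair(sub { my ($name, $value) = @_; push @seen, "$name: $value" }, @table);

# Flattening into a hash keeps only the last value for a repeated name.
my %flat;
$flat{ $_->[0] } = $_->[1] for @table;

print scalar(@seen), " fields seen by the callback\n";
print scalar(keys %flat), " fields left in the hash\n";
```

The callback sees all three fields, while the hash ends up with two entries because the second Cookie value overwrites the first.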

    # copy POST data, if any
    if($r->method eq 'POST') {
       my $len = $r->header_in('Content-length');
       my $buf;
       $r->read($buf, $len);
       $request->content($buf);
    }

The next block of code checks whether the request method is POST. If so, we must copy the POSTed data from the incoming request to the HTTP::Request object. We do this by calling the request object's read() method to read the POST data into a temporary buffer. The data is then copied into the HTTP::Request by calling its content() method. Request methods other than POST may include a request body, but this example does not cope with these rare cases.

The HTTP::Request object is now complete, so we can actually issue the request:

my $response = $UA->request($request);

We pass the HTTP::Request object to the user agent's request() method. After a delay for the network fetch, the call returns an HTTP::Response object, which we copy into a variable named $response.

    $r->content_type($response->header('Content-type'));
    $r->status($response->code);
    $r->status_line(join " ", $response->code, $response->message);

Now the process of copying the headers is reversed. Every header in the LWP HTTP::Response object must be copied to the Apache request object. First, we handle a few special cases. We call the HTTP::Response object's header() method to fetch the content type of the returned document and immediately pass the result to the Apache request object's content_type() method. Next, we set the numeric HTTP status code and the human-readable HTTP status line. We call the HTTP::Response object's code() and message() methods to return the numeric code and human-readable messages, respectively, and copy them to the Apache request object, using the status() and status_line() methods to set the values.

When the special case headers are done, we copy all the other header fields, using the HTTP::Response object's scan() method:

    $response->scan(sub {
       $r->header_out(@_);
    });

scan() is similar to the Apache::Table do() method: it loops through each of the header fields, invoking an anonymous callback routine for each one. The callback sets the corresponding field in the Apache request object using the header_out() method.

    if ($r->header_only) {
       $r->send_http_header();
       return OK;
    }

The outgoing header is complete at this point, so we check whether the current transaction is a HEAD request. If so, we emit the HTTP header and exit with an OK status code.

    my $content = \$response->content;
    if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
       block_ad($content);
       $r->content_type("image/gif");
    }

Otherwise, the time has come to deal with potential banner ads. To identify likely ads, we require that the document be an image and that its URI satisfy the regular expression match defined at the top of the module. We retrieve the document contents by calling the HTTP::Response object's content() method, and store a reference to the contents in a local variable named $content.[10] We now check whether the document's MIME type is one of the image variants and that the URI satisfies the advertisement pattern match. If both of these are true, we call block_ad() to replace the content with a customized image. We also set the document's content type to image/gif, since this is what block_ad() produces.

[10] In this example, we call the response object's content() method to slurp the document content into a scalar. However, it can be more efficient to use the three-argument form of LWP::UserAgent's request() method to read the content in fixed-size chunks. See the LWP::UserAgent manual page for details.

    $r->content_type('text/html') unless $$content;
    $r->send_http_header;
    $r->print($$content || $response->error_as_HTML);

We send the HTTP header, then print the document contents. Notice that the document content may be empty, which can happen when LWP connects to a server that is down or busy. In this case, instead of printing an empty document, we return the nicely formatted error message returned by the HTTP::Response object's error_as_HTML() method.

    return OK;
}

Our work is done, so we return an OK status code.

The block_ad() subroutine is short and sweet. Its job is to take an image in any of several possible formats and replace it with a custom GIF of exactly the same dimensions. The GIF will be transparent, allowing the page background color to show through, and will have the words "Blocked Ad" printed in large friendly letters in the upper lefthand corner.

sub block_ad {
    my $data = shift;
    my($x, $y) = imgsize($data);

    my $im = GD::Image->new($x,$y);

To get the width and height of the image, we call imgsize(), a function imported from the Image::Size module. imgsize() recognizes most web image formats, including GIF, JPEG, XBM, and PNG. Using these values, we create a new blank GD::Image object and store it in a variable named $im.

    my $white = $im->colorAllocate(255,255,255);
    my $black = $im->colorAllocate(0,0,0);
    my $red = $im->colorAllocate(255,0,0);

We call the image object's colorAllocate() method three times to allocate color table entries for white, black, and red. Then we declare that the white color is transparent, using the transparent() method:

    $im->transparent($white);
    $im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
    $im->rectangle(0,0,$x-1,$y-1,$black);

    $$data = $im->gif;
}

The routine calls the string() method to draw the message starting at coordinates (5,5) and finally frames the whole image with a black rectangle. The custom image is now converted into GIF format with the gif() method and copied into $$data, overwriting whatever was there before.

sub redirect_ok {return undef;}

The last detail is to define a redirect_ok() method to override the default LWP::UserAgent method. By returning undef this method tells LWP not to handle redirects internally but to pass them on to the browser to handle. This is the correct behavior for a proxy server.

Activating this module is just a matter of adding the following line to one of the configuration files:

PerlTransHandler Apache::AdBlocker

Users who wish to make use of this filtering service should configure their browsers to proxy their requests through your server.

Example 7.12. A Banner Ad Blocking Proxy
package Apache::AdBlocker;
# file: Apache/AdBlocker.pm

use strict;
use vars qw(@ISA $VERSION);
use Apache::Constants qw(:common);
use GD ();
use Image::Size qw(imgsize);
use LWP::UserAgent ();

@ISA = qw(LWP::UserAgent);
$VERSION = '1.00';

my $UA = __PACKAGE__->new;
$UA->agent(join "/", __PACKAGE__, $VERSION);

my $Ad = join "|", qw{ads? advertisements? banners? adv promotions?};

sub handler {
    my $r = shift;
    return DECLINED unless $r->proxyreq;
    $r->handler("perl-script"); #ok, let's do it
    $r->push_handlers(PerlHandler => \&proxy_handler);
    return OK;
}

sub proxy_handler {
    my $r = shift;

    my $request = HTTP::Request->new($r->method, $r->uri);

    $r->headers_in->do(sub {
       $request->header(@_);
    });

    # copy POST data, if any
    if($r->method eq 'POST') {
       my $len = $r->header_in('Content-length');
       my $buf;
       $r->read($buf, $len);
       $request->content($buf);
    }

    my $response = $UA->request($request);
    $r->content_type($response->header('Content-type'));

    #feed response back into our request_rec*
    $r->status($response->code);
    $r->status_line(join " ", $response->code, $response->message);
    $response->scan(sub {
       $r->header_out(@_);
    });

    if ($r->header_only) {
       $r->send_http_header();
       return OK;
    }

    my $content = \$response->content;
    if($r->content_type =~ /^image/ and $r->uri =~ /\b($Ad)\b/i) {
       block_ad($content);
       $r->content_type("image/gif");
    }

    $r->content_type('text/html') unless $$content;
    $r->send_http_header;
    $r->print($$content || $response->error_as_HTML);

    return OK;
}

sub block_ad {
    my $data = shift;
    my($x, $y) = imgsize($data);

    my $im = GD::Image->new($x,$y);

    my $white = $im->colorAllocate(255,255,255);
    my $black = $im->colorAllocate(0,0,0);
    my $red = $im->colorAllocate(255,0,0);

    $im->transparent($white);
    $im->string(GD::gdLargeFont(),5,5,"Blocked Ad",$red);
    $im->rectangle(0,0,$x-1,$y-1,$black);

    $$data = $im->gif;
}

sub redirect_ok {return undef;}

1;
__END__




