Level: Intermediate
Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM
28 Aug 2007
Create your own 404 error-message handler to provide useful links and redirects
for the contents of your site. Use metaphone matching and a simple weighted score file
to make typographical, spelling, and bad-link redirect suggestions. Customize the
suggestions based solely on your Web site's content and preferred redirection
locations. Catch multiple errors in incoming URL requests and process them for
corrections in directory, script, and HTML page names.
You can find many tutorials that show you how to create an effective format for your
404 page. Most suggest that 404 pages contain static, suggested links that point to
common areas on your site, such as the front page, downloads page, and your site's
search engine, if you have one. The problem with generic 404 pages is that they do not
reflect why the visitor came to the site. This article shows you how to build a
suggestion-maker and a method of providing more useful redirect links that are based on
the content of your Web site.
Current 404 handlers allow you to provide a few suggested links for all errors, such as
pointing the users to the site directory. Spelling correctors, such as mod_speling (yes,
it has one "l"), can correct capitalization and single-character mistakes in a requested URL, which may lead a user
to the right page. The code here will help you build a suggestion-making engine to
handle nondictionary words and directory links based on the content of your Web site.
Suppose, for example, that you hear a Web page name during a teleconference, so you try a
link to blegs/DavSmath.html. Current spelling correction modules would be unable to
provide a useful link for this case. Using the code in this article, you'll be able to
generate a 404 page with a suggestion for the valid page at /blogs/DaveSmith.html.
Requirements
Any modern PC manufactured after 2000 should provide plenty of horsepower for compiling
and running the code in this article. You may need RAM-rich, high-powered hardware or
patience if your Web site contains more than 10,000 or so distinct pages.
The Perl and CGI scripts provided work on a variety of UNIX® and
Windows® flavors (see Download). Although this
article uses Apache and a CGI script for the suggestion engine, the tools built should
function with most Web servers. For metaphone matching, this article references the
Text::Metaphone module by Michael Schwern. Install the Text::Metaphone module from
your favorite CPAN mirror and you'll be ready to start. See Resources for downloads.
The sample files referred to in this article are available in Download.
Web server pages and metaphone codes
The primary method for suggesting alternatives to typographical and spelling errors
will be metaphone matching. Metaphones, like Soundex and other algorithms, use an
alphanumeric code to represent the pronunciation of a word. Unlike Soundex,
however, metaphone codes are built to match the linguistic variabilities of
pronunciation in the English language. The average metaphone code is, therefore, a much
more accurate representation of a given word, and provides an ideal basis for building a suggestion library.
Consider the following list of files in a sample Web server directory.
Listing 1. Web server files
./index.html
./survey.html
./search_tips.html
./about.html
./how.html
./why.html
./who.html
./NathanHarrington.html
./blogs/NathanHarrington.html
./blogs/DaveSmith.html
./blogs/MarkCappel.html
With this set of static HTML files, we'll use the buildMetaphoneList.pl program to
create metaphones for each filename with an .html extension.
Listing 2. buildMetaphoneList.pl
#!/usr/bin/perl -w
# buildMetaphoneList.pl - / split filename, 0 score, metaphones
use strict;
use File::Find;
use Text::Metaphone;
find(\&htmlOnly,".");
sub htmlOnly
{
  # only process names that end in .html
  if( $File::Find::name =~ /\.html$/ )
  {
    my $clipFname = $File::Find::name;
    $clipFname =~ s/\.html$//;            # strip the extension only
    my @slParts = split '/', $clipFname;
    shift(@slParts);                      # drop the leading "."
    print "$File::Find::name ### 0 ### ";
    for( @slParts ){ print Metaphone($_) . " " }
    print "\n";
  }#if a matching .html file
}#htmlOnly sub
The buildMetaphoneList.pl program processes files with an .html extension only, removes
the .html from the filename, then generates metaphones for each part of the full path
name. Copy the buildMetaphoneList.pl program to your Web server root directory and run
the command perl buildMetaphoneList.pl > metaphonesScore.txt. For the files shown in
Listing 1, the corresponding metaphonesScore.txt file contents are shown below.
Listing 3. metaphonesScore.txt
./index.html ### 0 ### INTKS
./survey.html ### 0 ### SRF
./search_tips.html ### 0 ### SRXTPS
./about.html ### 0 ### ABT
./how.html ### 0 ### H
./why.html ### 0 ### H
./who.html ### 0 ### H
./NathanHarrington.html ### 0 ### N0NHRNKTN
./blogs/NathanHarrington.html ### 0 ### BLKS N0NHRNKTN
./blogs/DaveSmith.html ### 0 ### BLKS TFSM0
./blogs/MarkCappel.html ### 0 ### BLKS MRKKPL
Each line in Listing 3 shows the actual link in the filesystem under the
webserver root directory, default score, and metaphone code. Note how how.html,
why.html, and who.html all resolve to the same metaphone code. To deal with this
ambiguity, modify the score field to have the link-suggestion program provide links to
your pages in the desired order. For example, change the "H" metaphone entries to be:
./how.html ### 100 ### H
./why.html ### 50 ### H
./who.html ### 0 ### H
This creates a straightforward reordering of the links, with room for further
modification of the scores. Widely spaced scores are preferable because they leave room
for later insertion of files with the same metaphone but a different score. For example,
a newly added hoo.html file could be given a score of 25 so that it appears above the
who.html entry and below the why.html entry.
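To find which entries need their scores adjusted, one option is to scan the score file for links that share a metaphone code. The following is a minimal sketch, not part of the downloadable code: the input lines are hard-coded copies of entries from Listing 3, and the variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

# A few lines in the metaphonesScore.txt layout, taken from Listing 3.
my @lines = (
    "./survey.html ### 0 ### SRF \n",
    "./about.html ### 0 ### ABT \n",
    "./how.html ### 0 ### H \n",
    "./why.html ### 0 ### H \n",
    "./who.html ### 0 ### H \n",
);

# Group links by their full metaphone string.
my %byCode = ();
for my $line ( @lines )
{
    my ( $link, $score, $codes ) = split /\s*###\s*/, $line;
    $codes =~ s/\s+$//;                  # drop trailing space and newline
    push @{ $byCode{$codes} }, $link;
}

# Report every code that more than one link shares.
my @shared = grep { @{ $byCode{$_} } > 1 } sort keys %byCode;
for my $code ( @shared )
{
    print $code, ': ', join( ' ', @{ $byCode{$code} } ), "\n";
}
```

For this input, the only shared code is H, which belongs to how.html, why.html, and who.html.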
You can also use the score field for differentiation between files of the same name
from differing directories. Modify the ./NathanHarrington.html line score to be 100,
for example, and requests for pages like nathenHorrington.html will list the
./NathanHarrington.html link before the ./blogs/NathanHarrington.html page.
When choosing how to score your files, consider the statistical and logical access
components of your Web site. Users may request the why.html page more frequently
according to the log files, but if you know it's more important that they see the
how.html page, simply assign scores that produce that ordering.
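The effect of the score field on pages that share a metaphone can be sketched in a few lines. This is an illustration under stated assumptions rather than the article's own code: orderByScore is a hypothetical helper name, and hoo.html is the later addition described above.

```perl
use strict;
use warnings;

# Entries in the metaphonesScore.txt layout: link ### score ### metaphones.
# hoo.html is the hypothetical later addition with a score of 25.
my @lines = (
    './who.html ### 0 ### H',
    './hoo.html ### 25 ### H',
    './how.html ### 100 ### H',
    './why.html ### 50 ### H',
);

# orderByScore (illustrative name) returns the links, highest score first.
sub orderByScore
{
    my @entries = map { [ split /\s*###\s*/, $_ ] } @_;
    return map { $_->[0] } sort { $b->[1] <=> $a->[1] } @entries;
}

my @ordered = orderByScore( @lines );
print "$_\n" for @ordered;
```

The links come out as how.html, why.html, hoo.html, who.html, which is the order in which a suggestion page would list them.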
Building the CGI 404 handler
With the appropriate metaphones generated along with their associated scores, we can
now build the actual suggestion-maker. A typical 404 error is caused by a
typographical error in the link or by a bad link. Suggestions made by the code
listed below will be created by running three main tests: matching given a directory
structure, matching with a combined metaphone, and "contains" matching when all else
fails. These three tests are designed to handle the majority of 404 errors. The
beginning of the MetaphoneSuggest CGI Perl script is shown below.
Listing 4. MetaphoneSuggest CGI Part 1
#!/usr/bin/perl -w
# MetaphoneSuggest - suggest links for typographical and other errors from 404s
use strict;
use CGI::Pretty ':standard'; #standard cgi stuff
use Text::Metaphone;
my @suggestLinks = (); # suggested link list
my %mt = (); # filename, score, metaphone code hash
# REDIRECT_URL is only set when the script runs as an error document
my $origLink = defined $ENV{REDIRECT_URL} ? $ENV{REDIRECT_URL} : '';
$origLink =~ s/^\///;      # remove leading /
$origLink =~ s/\.html$//;  # remove trailing .html
open(MPH,'metaphonesScore.txt') or die "can't open metaphones: $!";
while( my $line = <MPH> )
{
  chomp $line;
  my @slPart = split '###', $line;
  next if( @slPart < 3 );  # skip blank or malformed lines
  $slPart[0] =~ s/ //g;    # remove the space before the first ###
  $mt{$slPart[0]}{ score } = $slPart[1];
  $mt{$slPart[0]}{ metaphones } = $slPart[2];
}
close(MPH);
After the usual library includes and variable declarations, the code reads the requested
URL from the REDIRECT_URL environment variable, which Apache sets when it hands a request
to an error document, and loads the metaphones created using the buildMetaphoneList.pl
program. Now we're ready for the main program logic, as shown below.
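Before moving on to the main logic, the request-cleanup step at the top of Listing 4 can be isolated into a small helper. The following is a minimal sketch: cleanRequest is a hypothetical name used for illustration and is not part of the MetaphoneSuggest script.

```perl
use strict;
use warnings;

# cleanRequest is a hypothetical helper name, not part of MetaphoneSuggest.
# It turns a REDIRECT_URL value into the path the matching tests work on.
sub cleanRequest
{
    my ($url) = @_;
    return '' unless defined $url;   # not called through ErrorDocument
    $url =~ s{^/}{};                 # remove the leading /
    $url =~ s/\.html$//;             # remove only a trailing .html
    return $url;
}

print cleanRequest('/blegs/DavSmath.html'), "\n";
print cleanRequest('/docs.html/page.html'), "\n";
```

Anchoring the second substitution with $ matters for a request such as /docs.html/page.html, where only the final extension should be removed.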
Listing 5. Main program logic
push @suggestLinks, sortResults( directorySplitTest( $origLink ) );
push @suggestLinks, sortResults( combinedTest( $origLink ) );
push @suggestLinks, sortResults( containsTest( $origLink ) );
# from the book - unique-ify the array
my %seen = ();
@suggestLinks = grep{ ! $seen{$_}++ } @suggestLinks ;
# escape the requested URL before echoing it into the page
my $shownUrl = escapeHTML( defined $ENV{REDIRECT_URL} ? $ENV{REDIRECT_URL} : '' );
print header;
print qq{Error 404: The file requested [$shownUrl] is unavailable.<br>};
# nothing to suggest, so stop after the error message
exit if( @suggestLinks == 0 );
print qq{Please try one of the following pages:<br>};
for my $link( @suggestLinks ){
  # keep everything after the leading "." of "./path"
  $link = substr($link,index($link,'./')+1);
  print qq{<a href="$link">$link</a><br>};
}
The output of each section of match test code is sorted, then added to the overall
suggestion link list. After sorting and unique-ifying the link list, printing out the
suggested links is straightforward.
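The interaction between test priority and de-duplication can be seen in isolation. The following sketch uses made-up result lists in place of real test output, so the three input arrays and their names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Made-up result lists in the 'score ## filename' form used by the script,
# already sorted by score within each test.
my @fromDirectory = ( '0 ## ./blogs/NathanHarrington.html' );
my @fromCombined  = ();
my @fromContains  = ( '100 ## ./NathanHarrington.html',
                      '0 ## ./blogs/NathanHarrington.html' );

# Concatenate in priority order, then keep the first occurrence of each entry.
my %seen = ();
my @suggestLinks = grep { !$seen{$_}++ }
                   ( @fromDirectory, @fromCombined, @fromContains );

print "$_\n" for @suggestLinks;
```

The blogs page stays first even though its score is lower, because a match from an earlier test outranks any match from a later one. Scores only order the entries within a single test.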
The three sorted result lists are pushed onto a single results array to create an
ordered and numerically sorted suggestion list. When a 404 comes in, the presence of
directory delimiters most likely indicates that the desired Web page is at least one
level down the directory tree. Take, for example, a page request like
bloggs/nathenherringtoon.html. The directorySplitTest as called above will create a
sorted list of pages that have a metaphone match for both BLKS and N0NHRNKTN in
subsequent directories. This strategy provides the necessary distinction between files
in the root directory, such as a blogs.html and nathanharrington.html, and pages with
the full path name match like blogs/nathanharrington.html. The listing below shows the
contents of the directorySplitTest subroutine.
Listing 6. directorySplitTest subroutine
sub directorySplitTest
{
  my @matchRes = ();
  my $inLink = $_[0];
  # process each chunk of the request as a directory (once, not per file)
  my @inLinkMetas = ();
  for my $inP ( split '\/', $inLink ){ push @inLinkMetas, Metaphone($inP) }
  for my $fileName ( keys %mt )
  {
    my @metaList = split ' ', $mt{$fileName}{metaphones};
    # the request and the file must have the same number of chunks
    next if( @metaList != @inLinkMetas );
    my $pos = 0;
    my $totalMatch = 0;
    for( @metaList )
    {
      $totalMatch++ if( $metaList[$pos] =~ /(\b$inLinkMetas[$pos]\b)/i );
      $pos++;
    }#for metalist
    # make sure there is a match in each metaphone chunk
    next if( $totalMatch != @metaList );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";
  }#for keys in metaphone hash
  return( @matchRes );
}#directorySplitTest
Following the directorySplitTest, the combined test will check for matches where the
metaphones are smooshed together, disregarding any directory structure. This
is useful for correcting a class of 404s that involve space, slash, backslash, colon,
and other nonpronounced characters in their filenames. For example, if a 404 request
comes in for blogs_nathanherrington.html, the directorySplitTest will return zero
results, but the combinedTest will find that the metaphones produced by that 404 are an
exact match with those of the blogs/NathanHarrington.html page when combined. Again,
these suggestions are lower priority than a directory match, so their sorted results
are pushed onto the suggestLinks array after the directorySplitTest. The listing below
shows the combinedTest subroutine.
Listing 7. combinedTest subroutine
sub combinedTest
{
  my @matchRes = ();
  my $inLink = $_[0];
  my $inLinkMeta = Metaphone($inLink);  # computed once for the whole request
  # an empty code would match every file, so give up early
  return( @matchRes ) if( !defined $inLinkMeta or $inLinkMeta eq '' );
  for my $fileName ( keys %mt )
  {
    # smoosh all of the keys together, removing spaces and trailing newline
    my $metaList = $mt{$fileName}{metaphones};
    $metaList =~ s/( |\n)//g;
    next if( $metaList !~ /(\b$inLinkMeta\b)/i );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";
  }#for filename keys in metaphone hash
  return(@matchRes);
}#combinedTest
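The difference between the comparisons in Listings 6 and 7 can be shown without the Text::Metaphone module by starting from codes that are already computed. This is a minimal sketch under stated assumptions: chunksMatch and combinedMatch are hypothetical helper names, and the request codes are the ones the text above attributes to each request rather than values computed here.

```perl
use strict;
use warnings;

# True when both lists have the same number of chunks and each pair is equal.
sub chunksMatch
{
    my ( $reqChunks, $fileChunks ) = @_;
    return 0 if @$reqChunks != @$fileChunks;
    for my $i ( 0 .. $#$reqChunks )
    {
        return 0 if $reqChunks->[$i] ne $fileChunks->[$i];
    }
    return 1;
}

# True when the chunks, joined with nothing between them, are identical.
sub combinedMatch
{
    my ( $reqChunks, $fileChunks ) = @_;
    return join( '', @$reqChunks ) eq join( '', @$fileChunks ) ? 1 : 0;
}

# Stored codes for ./blogs/NathanHarrington.html, from Listing 3.
my @file = split ' ', "BLKS N0NHRNKTN\n";
# Codes as described in the text for two mistyped requests.
my @withSlash      = ( 'BLKS', 'N0NHRNKTN' );   # bloggs/nathenherringtoon
my @withUnderscore = ( 'BLKSN0NHRNKTN' );       # blogs_nathanherrington

my $splitSlash = chunksMatch( \@withSlash, \@file );
my $splitUnder = chunksMatch( \@withUnderscore, \@file );
my $combUnder  = combinedMatch( \@withUnderscore, \@file );
print "split/slash=$splitSlash split/underscore=$splitUnder combined/underscore=$combUnder\n";
```

The request with a slash matches chunk by chunk. The request with an underscore has only one chunk, so it fails the split comparison and is caught by the combined comparison instead.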
After the combinedTest, the final attempt is made to match based on a broad-ranging
contains search. If the metaphone of the current 404 link is anywhere in any of the
available metaphones from metaphonesScore.txt, we will add that page to the suggestion list.
The contains search is designed to pick up on severely incomplete URLs. The page
nathan.html is nowhere to be found, but a good suggestion would be
/NathanHarrington.html and /blogs/NathanHarrington.html, and these are sorted on score
and added to the suggestLinks array. Note that this approach
will also produce suggestions of NathanHarrington.html for one-letter metaphone 404s
like whoo.html. Because the NathanHarrington.html metaphone contains an "H," it will be
added to the suggestion list. Consider creating minimum lengths of metaphones to be
matched or providing a total limit to the number of contains matches to modify this
behavior. Listing 8 shows the containsTest and sortResults subroutines.
Listing 8. sortResults and containsTest subroutines
sub sortResults
{
  # simple procedure to sort an array of 'score ## filename' entries
  my @scored = @_;
  my @idx = (); #temporary index for sorting
  for my $entry( @scored ){
    # create an index of scores
    my $item = substr($entry,0,index($entry,'##'));
    push @idx, $item;
  }
  # sort the index of scores, highest score first
  my @sorted = @scored[ sort { $idx[$b] <=> $idx[$a] } 0 .. $#idx ];
  return( @sorted );
}#sortResults
sub containsTest
{
  my @matchRes = ();
  my $inLink = $_[0];
  my $inLinkMeta = Metaphone($inLink);  # computed once for the whole request
  # an empty code would match every file, so give up early
  return( @matchRes ) if( !defined $inLinkMeta or $inLinkMeta eq '' );
  for my $fileName ( keys %mt )
  {
    my $metaList = $mt{$fileName}{metaphones};
    next if( $metaList !~ /$inLinkMeta/i );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";
  }#for filename keys in metaphone hash
  return(@matchRes);
}#containsTest
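The two mitigations suggested above, a minimum metaphone length and a cap on the number of contains matches, could look something like the following. This is only a sketch: limitedContains, $MIN_META_LEN, and $MAX_CONTAINS are hypothetical names, the threshold values are arbitrary, and the code N0N is a prefix of the NathanHarrington code from Listing 3 used as a stand-in for what nathan.html would reduce to.

```perl
use strict;
use warnings;

my $MIN_META_LEN = 3;   # assumed threshold, tune for your site
my $MAX_CONTAINS = 5;   # assumed cap on contains suggestions

# A few entries in the shape of the %mt hash, with codes from Listing 3.
my %mt = (
    './NathanHarrington.html'       => { score => 100, metaphones => 'N0NHRNKTN' },
    './blogs/NathanHarrington.html' => { score => 0,   metaphones => 'BLKS N0NHRNKTN' },
    './who.html'                    => { score => 0,   metaphones => 'H' },
);

# limitedContains (hypothetical name) takes an already computed metaphone code.
sub limitedContains
{
    my ( $inLinkMeta, $table ) = @_;
    # very short codes match almost everything, so skip them
    return () if length( $inLinkMeta ) < $MIN_META_LEN;
    my @hits = grep { index( $table->{$_}{metaphones}, $inLinkMeta ) >= 0 }
               keys %$table;
    # highest score first, file name as a tie-breaker for stable output
    @hits = sort { $table->{$b}{score} <=> $table->{$a}{score} or $a cmp $b } @hits;
    # drop everything past the cap
    splice( @hits, $MAX_CONTAINS ) if @hits > $MAX_CONTAINS;
    return @hits;
}

my @short = limitedContains( 'H',   \%mt );   # one-letter code, as for whoo.html
my @long  = limitedContains( 'N0N', \%mt );   # stand-in code for nathan.html
print scalar( @short ), " suggestion(s) for H\n";
print "$_\n" for @long;
```

With these settings the one-letter code yields no contains suggestions, while the longer code still finds both NathanHarrington pages, with the higher-scored root page first.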
Modifying the Apache httpd.conf file
The MetaphoneSuggest script as designed above is a straightforward cgi-bin script to be
called from Apache. You'll need to modify your httpd.conf file to run the
MetaphoneSuggest script instead of displaying a 404 error page. For example, if your
default httpd.conf file has the section:
Listing 9. Default httpd.conf section
# Customizable error responses come in three flavors:
# 1) plain text 2) local redirects 3) external redirects
#
# Some examples:
#ErrorDocument 500 "The server made a boo boo."
#ErrorDocument 404 /missing.html
#ErrorDocument 404 "/cgi-bin/missing_handler.pl"
#ErrorDocument 402 http://www.example.com/subscription_info.html
Insert the following line after the commented-out ErrorDocument lines:
ErrorDocument 404 "/cgi-bin/MetaphoneSuggest"
Make sure the MetaphoneSuggest script and the metaphonesScore.txt file are both in the
<document_root>/cgi-bin/ directory on your Web server. Issue a server restart
command as root (for example, /usr/local/apache2/bin/apachectl restart),
and you're ready to start serving smart suggestions instead of dead-end errors.
Implementation options and usability considerations
Keep in mind when using the tools described in the MetaphoneSuggest program that a 404
page is an error condition. Consider providing just a few suggested alternatives and
keeping the design simple. Consult the big names in Web design for information on why
they do not provide automatic link suggestions, or usability studies for how best to
implement a link suggestion tool into your site.
This article provides the tools and code necessary to create options for useful link
suggestions from 404s. However you choose to implement them, you now have the
ability to provide more than simple directory links or spelling suggestions. With
results tailored for specific sites and content, the dead-end 404 can be a thing of the past.
Download
| Description | Name | Size | Download method |
|---|---|---|---|
| Code | os-metaphone.web404MetaphoneSuggest.zip | 2KB | HTTP |
Resources
Learn
- To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
- Stay current with developerWorks' Technical events and webcasts.
- Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.
- Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
- Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
Get products and technologies
- Download Michael Schwern's Text::Metaphone module from CPAN.
- Check out Apache.org for the best in Web servers. You can download a version of Apache HTTP Server for almost any operating system.
- If your system doesn't have Perl, you can download it for almost any operating system.
- Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
- Innovate your next open source development project with IBM trial software, available for download or on DVD.
About the author
Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.