This is an archived cached-text copy of the developerWorks article. Please consider viewing the original article at: IBM developerWorks



Make your 404 pages smarter with metaphone matching

Don't let typos and bad referrers get between your visitors and your Web site's content

Level: Intermediate

Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM 

28 Aug 2007

Create your own 404 error-message handler to provide useful links and redirects for the contents of your site. Use metaphone matching and a simple weighted score file to make typographical, spelling, and bad-link redirect suggestions. Customize the suggestions based solely on your Web site's content and preferred redirection locations. Catch multiple errors in incoming URL requests and process them for corrections in directory, script, and HTML page names.

You can find many tutorials that show you how to create an effective format for your 404 page. Most suggest that 404 pages contain static, suggested links that point to common areas on your site, such as the front page, downloads page, and your site's search engine, if you have one. The problem with generic 404 pages is that they do not reflect why the visitor came to the site. This article shows you how to build a suggestion-maker and a method of providing more useful redirect links that are based on the content of your Web site.

Current 404 handlers allow you to provide a few suggested links for all errors, such as pointing the users to the site directory. Spelling correctors, such as mod_speling (yes, it has one "l"), can be used to correct errors in dictionary words that may lead a user to the right page. The code here will help you build a suggestion-making engine to handle nondictionary words and directory links based on the content of your Web site.

Suppose, for example, that you hear a Web page name during a teleconference and try a link to blegs/DavSmath.html. Current spelling correction modules would be unable to provide a useful link for this case. Using the code in this article, you'll be able to generate a 404 page with a suggestion for the valid page at /blogs/DaveSmith.html.

Requirements

Any modern PC manufactured after 2000 should provide plenty of horsepower for compiling and running the code in this article. You may need RAM-rich, high-powered hardware or patience if your Web site contains more than 10,000 or so distinct pages.

The Perl and CGI scripts provided work on a variety of UNIX® and Windows® flavors (see Download). Although this article uses Apache and a CGI script for the suggestion engine, the tools built should function with most Web servers. For metaphone matching, this article references the Text::Metaphone module by Michael Schwern. Install the Text::Metaphone module from your favorite CPAN mirror and you'll be ready to start. See Resources for downloads.

The sample files referred to in this article are available in Download.





Web server pages and metaphone codes

The primary method for suggesting alternatives to typographical and spelling errors will be metaphone matching. Metaphones, like Soundex and other algorithms, use an alphanumeric code to represent the verbal pronunciation of a word. Unlike Soundex, however, metaphone codes are built to match the linguistic variabilities of pronunciation in the English language. The average metaphone code is, therefore, a much more accurate representation of a given word, and provides an ideal basis for building a suggestion library.

Consider the following list of files in a sample Web server directory.


Listing 1. Web server files
                
./index.html
./survey.html
./search_tips.html
./about.html
./how.html
./why.html
./who.html
./NathanHarrington.html
./blogs/NathanHarrington.html
./blogs/DaveSmith.html
./blogs/MarkCappel.html

With this set of static HTML files, we'll use the buildMetaphoneList.pl program to create metaphones for each filename with an .html extension.


Listing 2. buildMetaphoneList.pl
                
#!/usr/bin/perl -w 
# buildMetaphoneList.pl - / split filename, 0 score, metaphones

use strict;
use File::Find;
use Text::Metaphone;

find(\&htmlOnly,".");

sub htmlOnly
{
  if( $File::Find::name =~ /\.html$/ )   # only names that end in .html
  {
    my $clipFname = $File::Find::name;
    $clipFname =~ s/\.html$//;           # strip the trailing extension

    my @slParts = split '/', $clipFname;
    shift(@slParts);

    print "$File::Find::name ### 0 ### ";
    for( @slParts ){ print Metaphone($_) . " " }
    print "\n";

  }#if a matching .html file

}#htmlOnly sub

The buildMetaphoneList.pl program processes files with an .html extension only, removes the .html from the filename, then generates metaphones for each part of the full path name. Copy the buildMetaphoneList.pl program to your Web server root directory and run the command perl buildMetaphoneList.pl > metaphonesScore.txt. For the files shown in Listing 1, the corresponding metaphonesScore.txt file contents are shown below.


Listing 3. metaphonesScore.txt
                
./index.html ### 0 ### INTKS 
./survey.html ### 0 ### SRF 
./search_tips.html ### 0 ### SRXTPS 
./about.html ### 0 ### ABT 
./how.html ### 0 ### H 
./why.html ### 0 ### H 
./who.html ### 0 ### H 
./NathanHarrington.html ### 0 ### N0NHRNKTN 
./blogs/NathanHarrington.html ### 0 ### BLKS N0NHRNKTN 
./blogs/DaveSmith.html ### 0 ### BLKS TFSM0 
./blogs/MarkCappel.html ### 0 ### BLKS MRKKPL 
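
The path handling in Listing 2 can be tried on its own, without File::Find or Text::Metaphone. The following is a minimal sketch; the path_parts name is hypothetical and not part of the article's download. It returns the path components that Listing 2 would hand to Metaphone() one at a time.

```perl
use strict;
use warnings;

# Split a File::Find-style name such as "./blogs/DaveSmith.html" into the
# path components that Listing 2 converts to metaphone codes.
# (path_parts is an illustrative helper name, not from the original code.)
sub path_parts {
    my ($name) = @_;
    (my $clipped = $name) =~ s/\.html$//;   # drop the extension
    my @parts = split m{/}, $clipped;       # ".", "blogs", "DaveSmith"
    shift @parts;                           # discard the leading "."
    return @parts;
}

print join( ' | ', path_parts($_) ), "\n"
    for './index.html', './blogs/DaveSmith.html';
```

Each component produces one code, which is why the ./blogs/ entries in Listing 3 carry two codes and the root-level entries carry one.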

Each line in Listing 3 shows the actual link in the filesystem under the Web server root directory, the default score, and the metaphone codes. Note how how.html, why.html, and who.html all resolve to the same metaphone code. To deal with this ambiguity, modify the score field so that the link-suggestion program lists your pages in the desired order. For example, change the "H" metaphone entries to:

./how.html ### 100 ### H 
./why.html ### 50 ### H 
./who.html ### 0 ### H 

This creates a straightforward reordering of the links, with room for further modification of the scores. Widely spaced scores are preferable because they leave room for later insertion of files with the same metaphone but a different score. For example, a newly added hoo.html file could be given a score of 25 so that it appears above the who.html entry and below the why.html entry.

You can also use the score field for differentiation between files of the same name from differing directories. Modify the ./NathanHarrington.html line score to be 100, for example, and requests for pages like nathenHorrington.html will list the ./NathanHarrington.html link before the ./blogs/NathanHarrington.html page.

When choosing how to score your files, consider both the access statistics and the logical structure of your Web site. The log files may show that users request the why.html page more frequently, but if you know it's more important that they see how.html, simply assign scores that produce that order.
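
Editing scores by hand works for a handful of pages. For larger sites, a small helper can rewrite the score field. The following is a minimal sketch under the assumption that every line follows the "file ### score ### metaphones" layout of Listing 3; the set_score name and the %wanted table are hypothetical.

```perl
use strict;
use warnings;

# Return a copy of one "file ### score ### metaphones" line with a new score.
# Lines for other files, or lines that do not parse, come back unchanged.
# (set_score is an illustrative helper, not part of the article's download.)
sub set_score {
    my ($line, $file, $score) = @_;
    my ($name, $old, $codes) = split /\s*###\s*/, $line, 3;
    return $line unless defined $codes && $name eq $file;
    return "$name ### $score ### $codes";
}

my @lines = (
    "./how.html ### 0 ### H \n",
    "./why.html ### 0 ### H \n",
    "./who.html ### 0 ### H \n",
);
my %wanted = ( './how.html' => 100, './why.html' => 50 );   # scores used in the text

for my $line (@lines) {
    for my $file ( keys %wanted ) {
        $line = set_score( $line, $file, $wanted{$file} );
    }
    print $line;
}
```

Files that are not named in %wanted keep their existing score, so the helper can be run repeatedly as pages are added.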





Building the CGI 404 handler

With the metaphones and their associated scores generated, we can now build the actual suggestion-maker. A typical 404 error results from a typographical error in the link or from a bad link. The code listed below creates suggestions by running three main tests: matching on directory structure, matching on a combined metaphone, and "contains" matching when all else fails. These three tests are designed to handle the majority of 404 errors. The beginning of the MetaphoneSuggest CGI Perl script is shown below.


Listing 4. MetaphoneSuggest CGI Part 1
                
#!/usr/bin/perl -w
# MetaphoneSuggest - suggest links for typographical and other errors from 404s
use strict;
use CGI::Pretty ':standard';  #standard cgi stuff
use Text::Metaphone;
  
my @suggestLinks = (); # suggested link list
my %mt = ();           # filename, score, metaphone code hash

my $origLink = substr($ENV{REDIRECT_URL},1); # remove leading /
$origLink  =~ s/\.html//g;                   # remove trailing .html

open(MPH,'metaphonesScore.txt') or die "can't open metaphones: $!";
  while( my $line = <MPH> )
  {
    chomp $line;
    my @slPart = split /\s*###\s*/, $line; # file name, score, metaphones
    next if( @slPart < 3 );                # skip blank or malformed lines
    $mt{$slPart[0]}{ score } = $slPart[1];
    $mt{$slPart[0]}{ metaphones } = $slPart[2];
  }
close(MPH);

After the usual library includes and variable declarations, the code will load the reported 404 text, as well as the metaphones created using the buildMetaphoneList.pl program. Now we're ready for the main program logic, as shown below.
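
To make the shape of the loaded data concrete, here is a minimal sketch that parses two lines from Listing 3 held in memory. The parse_scores name is hypothetical, and unlike Listing 4 it stores the metaphone codes as an array reference rather than a single string; that difference is an illustrative choice, not the article's design.

```perl
use strict;
use warnings;

# Parse "file ### score ### metaphones" lines into a hash keyed by file name.
# (parse_scores is an illustrative helper; Listing 4 does this work inline.)
sub parse_scores {
    my (@lines) = @_;
    my %table;
    for my $line (@lines) {
        chomp( my $copy = $line );
        my ($file, $score, $codes) = split /\s*###\s*/, $copy, 3;
        next unless defined $codes;          # skip blank or malformed lines
        $table{$file} = {
            score      => $score,
            metaphones => [ split ' ', $codes ],   # one code per path component
        };
    }
    return %table;
}

my %table = parse_scores(
    "./about.html ### 0 ### ABT \n",
    "./blogs/DaveSmith.html ### 0 ### BLKS TFSM0 \n",
);
for my $file ( sort keys %table ) {
    print "$file => score $table{$file}{score}, codes @{ $table{$file}{metaphones} }\n";
}
```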


Listing 5. Main program logic
                
push @suggestLinks, sortResults( directorySplitTest( $origLink ) );
push @suggestLinks, sortResults( combinedTest( $origLink ) );
push @suggestLinks, sortResults( containsTest( $origLink ) );

# from the book - unique-ify the array
my %seen = ();
@suggestLinks = grep{ ! $seen{$_}++ } @suggestLinks ;

print header;
print qq{Error 404: The file requested [$ENV{REDIRECT_URL}] is unavailable.<BR >};
exit if( @suggestLinks == 0 ); # nothing to suggest, so stop after the error line

print qq{Please try one of the following pages:<BR >};
for my $link( @suggestLinks ){
  $link = substr($link,index($link,'./')+1);
  print qq{<a href="$link">$link</a><BR >};
}

The output of each section of match test code is sorted, then added to the overall suggestion link list. After sorting and unique-ifying the link list, printing out the suggested links is straightforward.
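
The effect of pushing the tiers in order and then removing duplicates can be seen with a few hand-written entries. The following is a minimal sketch; merge_tiers is a hypothetical name, and the three result lists are made-up sample data rather than output captured from the script.

```perl
use strict;
use warnings;

# Concatenate result tiers in priority order and keep only the first
# occurrence of each "score ## file" entry.
# (merge_tiers is an illustrative helper; Listing 5 does this inline.)
sub merge_tiers {
    my (@tiers) = @_;
    my %seen;
    return grep { !$seen{$_}++ } map { @$_ } @tiers;
}

# Made-up sample results for the three tests, highest priority first.
my @directory = ( '0 ## ./blogs/NathanHarrington.html' );
my @combined  = ();
my @contains  = ( '100 ## ./NathanHarrington.html',
                  '0 ## ./blogs/NathanHarrington.html' );

print "$_\n" for merge_tiers( \@directory, \@combined, \@contains );
```

Because the first occurrence wins, an entry found by a higher-priority test keeps its position even when a later test finds it again with a different neighbor order.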

Pushing the three sorted result sets onto a single array, in this order, creates a suggestion list that is ordered by test priority and sorted numerically by score within each test. When a 404 comes in, the presence of directory delimiters strongly suggests that the visitor wants a Web page at least one level down the directory tree. Take, for example, a page request like bloggs/nathenherringtoon.html. The directorySplitTest as called above will create a sorted list of pages that have a metaphone match for both BLKS and N0NHRNKTN in successive path components. This strategy provides the necessary distinction between files in the root directory, such as blogs.html and nathanharrington.html, and pages whose full path name matches, like blogs/nathanharrington.html. The listing below shows the contents of the directorySplitTest subroutine.


Listing 6. directorySplitTest subroutine
                
sub directorySplitTest
{ 
  my @matchRes = ();
  my $inLink = $_[0];
  for my $fileName ( keys %mt )
  { 
    my @inLinkMetas = ();
    # process each metaphone chunk as a directory
    for my $inP ( split '\/', $inLink ){ push @inLinkMetas, Metaphone($inP) }

    my @metaList = split ' ', $mt{$fileName}{metaphones};
    next if( @metaList != @inLinkMetas );

    my $pos = 0;
    my $totalMatch = 0;
    for( @metaList )
    { 
      $totalMatch++ if( $metaList[$pos] =~ /(\b$inLinkMetas[$pos]\b)/i );
      $pos++;
    }#for metaList

    # make sure there is a match in each metaphone chunk
    next if( $totalMatch != @metaList );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";

  }#for keys in metaphone hash

  return( @matchRes );

}#directorySplitTest
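
The comparison at the heart of Listing 6 can be checked against the codes in Listing 3 without installing Text::Metaphone. The following is a minimal sketch; components_match is a hypothetical name, and it uses plain case-insensitive string equality, which is an assumption about the intent of the word-boundary regular expression in Listing 6.

```perl
use strict;
use warnings;

# True when the request and the stored entry have the same number of path
# components and every pair of metaphone codes is equal, ignoring case.
# (components_match is an illustrative helper, not the article's code.)
sub components_match {
    my ($request_codes, $stored) = @_;
    my @stored_codes = split ' ', $stored;
    return 0 unless @stored_codes == @$request_codes;
    for my $i ( 0 .. $#stored_codes ) {
        return 0 unless lc( $stored_codes[$i] ) eq lc( $request_codes->[$i] );
    }
    return 1;
}

# Codes the text gives for the request bloggs/nathenherringtoon.html.
my @request = ( 'BLKS', 'N0NHRNKTN' );

# Stored codes copied from Listing 3.
for my $stored ( 'BLKS N0NHRNKTN ', 'N0NHRNKTN ', 'BLKS TFSM0 ' ) {
    printf "%-18s %s\n", $stored, components_match( \@request, $stored ) ? 'match' : 'no match';
}
```

The root-level N0NHRNKTN entry is rejected because it has one component where the request has two, which is the distinction the directory test exists to make.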

Following the directorySplitTest, the combinedTest checks for matches where the metaphones are smooshed together, disregarding any directory structure. This is useful for correcting a class of 404s that involve space, slash, backslash, colon, and other nonpronounced characters in their filenames. For example, if a 404 request comes in for blogs_nathanherrington.html, the directorySplitTest will return zero results, but the combinedTest will find that the metaphones produced by that 404 are an exact match with those of the blogs/NathanHarrington.html page when combined. Again, these suggestions are lower priority than a directory match, so their sorted results are pushed onto the suggestLinks array after those of the directorySplitTest. The listing below shows the combinedTest subroutine.


Listing 7. combinedTest subroutine
                
sub combinedTest
{ 
  my @matchRes = ();
  my $inLink = $_[0];
  for my $fileName ( keys %mt )
  { 
    my $inLinkMeta = Metaphone($inLink);

    # smoosh all of the keys together, removing spaces and trailing newline
    my $metaList =  $mt{$fileName}{metaphones};
    $metaList =~ s/( |\n)//g;

    next if( $metaList !~ /(\b$inLinkMeta\b)/i );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";
  }#for filename keys in metaphone hash

  return(@matchRes);

}#combinedTest
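
The following minimal sketch shows why a one-component request falls through the directory test but can still succeed in the combined test. The smoosh name is hypothetical, and the $incoming value is an assumed code for the whole request, based on the article's statement that the combined codes match exactly; it was not produced by running Text::Metaphone.

```perl
use strict;
use warnings;

# Join a stored metaphone list ("BLKS N0NHRNKTN \n") into a single code by
# removing all whitespace. (smoosh is an illustrative helper name.)
sub smoosh {
    my ($codes) = @_;
    (my $joined = $codes) =~ s/\s+//g;
    return $joined;
}

my $stored   = "BLKS N0NHRNKTN \n";   # blogs/NathanHarrington.html in Listing 3
my $incoming = 'BLKSN0NHRNKTN';       # assumed code for the one-part request

my @stored_parts = split ' ', $stored;
print "components: ", scalar(@stored_parts), " stored vs 1 requested\n";
print "combined:   ", ( smoosh($stored) eq $incoming ? 'match' : 'no match' ), "\n";
```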

After the combinedTest, a final attempt is made to match based on a broad-ranging contains search. If the metaphone of the current 404 link is anywhere in any of the available metaphones from metaphonesScore.txt, we will add it to the suggestion list. The contains search is designed to pick up on severely incomplete URLs. The page nathan.html is nowhere to be found, but good suggestions would be /NathanHarrington.html and /blogs/NathanHarrington.html, and these are sorted on score and added to the suggestLinks array. Note that this approach will also produce suggestions of NathanHarrington.html for one-letter metaphone 404s like whoo.html. Because the NathanHarrington.html metaphone contains an "H," it will be added to the suggestion list. To modify this behavior, consider requiring a minimum length for the metaphones to be matched or limiting the total number of contains matches. Listing 8 shows the containsTest and sortResults subroutines.


Listing 8. sortResults and containsTest subroutines
                
sub sortResults
{
  # simple procedure to sort an array of 'score ## filename' entries
  my @scored = @_;
  my @idx = (); #temporary index for sorting
  for my $entry( @scored ){
    # create an index of scores
    my $item =  substr($entry,0,index($entry,'##'));
    push @idx, $item;
  } 
  
  # sort the index of scores
  my @sorted = @scored[ sort { $idx[$b] <=> $idx[$a] } 0 .. $#idx ];
  
  return( @sorted );
  
}#sortResults

sub containsTest
{
  my @matchRes = ();
  my $inLink = $_[0];
  for my $fileName ( keys %mt )
  {
    my $inLinkMeta = Metaphone($inLink);
    
    my $metaList =  $mt{$fileName}{metaphones};
    
    next if( $metaList !~ /$inLinkMeta/i );
    push @matchRes, "$mt{$fileName}{score} ## $fileName";

  }#for filename keys in metaphone hash
  return(@matchRes); 
  
}#containsTest
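
The two safeguards suggested before Listing 8, a minimum metaphone length and a cap on the number of contains matches, might be sketched as follows. The limited_contains name, its parameters, and the sample table are all hypothetical; the request code 'N0N' for nathan.html is an assumption, not output from Text::Metaphone.

```perl
use strict;
use warnings;

# Contains-match with two guards: ignore request codes shorter than
# $min_len, and return at most $max_hits file names, highest score first.
# (limited_contains is an illustrative variant of containsTest.)
sub limited_contains {
    my ($code, $table, $min_len, $max_hits) = @_;
    return () if length($code) < $min_len;
    my @hits = grep { index( uc( $table->{$_}{metaphones} ), uc($code) ) >= 0 }
               keys %$table;
    @hits = sort { $table->{$b}{score} <=> $table->{$a}{score} or $a cmp $b } @hits;
    splice @hits, $max_hits if @hits > $max_hits;
    return @hits;
}

# Sample entries with codes from Listing 3 and the example score of 100.
my %sample = (
    './who.html'                    => { score => 0,   metaphones => 'H' },
    './NathanHarrington.html'       => { score => 100, metaphones => 'N0NHRNKTN' },
    './blogs/NathanHarrington.html' => { score => 0,   metaphones => 'BLKS N0NHRNKTN' },
);

print "H:   ", join( ', ', limited_contains( 'H',   \%sample, 2, 5 ) ), "\n";
print "N0N: ", join( ', ', limited_contains( 'N0N', \%sample, 2, 5 ) ), "\n";
```

With a minimum length of 2, the one-letter code no longer pulls in every page whose metaphone happens to contain an "H", while the longer code still finds both NathanHarrington pages.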





Modifying the Apache httpd.conf file

The MetaphoneSuggest script as designed above is a straightforward cgi-bin script to be called from Apache. You'll need to modify your httpd.conf file to run the MetaphoneSuggest script instead of displaying a 404 error page. For example, if your default httpd.conf file has the section:


Listing 9. Default httpd.conf section
                
# Customizable error responses come in three flavors:
# 1) plain text 2) local redirects 3) external redirects
#
# Some examples:
#ErrorDocument 500 "The server made a boo boo."
#ErrorDocument 404 /missing.html
#ErrorDocument 404 "/cgi-bin/missing_handler.pl"
#ErrorDocument 402 http://www.example.com/subscription_info.html

Insert the following line after the commented-out ErrorDocument lines: ErrorDocument 404 "/cgi-bin/MetaphoneSuggest". Make sure the MetaphoneSuggest and metaphonesScore.txt files are in the <document_root>/cgi-bin/ directory on your Web server. Issue a server restart command as root, for example /usr/local/apache2/bin/apachectl restart, and you're ready to start serving smart suggestions instead of dead-end errors.





Implementation options and usability considerations


Keep in mind when using the MetaphoneSuggest program that a 404 page is an error condition. Consider providing just a few suggested alternatives and keeping the design simple. Consult the big names in Web design for information on why they do not provide automatic link suggestions, or usability studies for how best to implement a link suggestion tool into your site.

This article provides the tools and code necessary to create useful link suggestions from 404s. However you choose to implement them, you now have the ability to provide more than simple directory links or spelling suggestions. With results tailored for specific sites and content, the dead-end 404 can be a thing of the past.






Download

Description   Name                                      Size   Download method
Code          os-metaphone.web404MetaphoneSuggest.zip   2KB    HTTP


Resources

Learn
  • To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.

  • Stay current with developerWorks' Technical events and webcasts.

  • Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.

  • Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.

  • Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.


Get products and technologies
  • Download Michael Schwern's Text::Metaphone module from CPAN.

  • Check out Apache.org for the best in Web servers. You can download a version of Apache HTTP Server for almost any operating system.

  • If your system doesn't have Perl, you can download it for almost any operating system.

  • Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

  • Innovate your next open source development project with IBM trial software, available for download or on DVD.



About the author

Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.



