This is an archived cached-text copy of the developerWorks article. Please consider viewing the original article at: IBM developerWorks

Country/region [select]

Home

Products

Services & industry solutions

Support & downloads

My IBM

developerWorks

More in this series: LDAP search engines

In this article:
	Requirements
	Scoring the search results
	Modifying the code
	Smart suggestions using Text-Metaphone
	Summary
	Download
	Resources
	About the author
	Rate this page

Related links
	Open source technical library

developerWorks > Open source >

LDAP search engines, Part 2: Adding a scoring system

Develop a custom Web-search style engine for your LDAP data

Document options

Document options requiring JavaScript are not displayed

Sample code

Rate this page

Help us improve this content

Level: Intermediate

Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM

20 Mar 2007

Learn how to add a scoring system to the search engine described in Part 1 of this "LDAP search engines" series. Develop your own metaphone-matching techniques for spelling corrections, query suggestions, and effective display of search results.

Many organizations implement some form of Lightweight Directory Access Protocol (LDAP) service for storing enterprise directory information. Existing search options allow for a range of lookups based on where certain data is stored in the directory. The articles in this series allow you to combine the power of regular expressions with the grep tool to create your own custom LDAP search capability. In the spirit of successful search engines, such as Google, we'll change the search format from a LDAP-style query string to simple and powerful keyword matching and results display.

For a complete introduction to the data architecture strategies and search code referenced here, be sure to read Part 1 of this series.

Requirements

Hardware

Any modern PC manufactured after year 2000 should provide plenty of horsepower for compiling and running the code here. Revisions of this code generate subsecond response times for complex searches with 200 MB or more of information on systems with 500-MHz processors and more than 1 GB of RAM. The components that need to be fast -- grep and Perl -- are very fast, indeed, and the algorithm and display code stays out of the way enough to keep things fast.

Software

For metaphone matching, you'll need to use the Text-Metaphone module by Michael Schwern. Install the module from your favorite CPAN mirror, and you'll be ready to get started.

Back to top

Scoring the search results

In Part 1, we create a simple print out of results matching the query words specified. Unsorted, this printout provided the basic display required for a useful return of information. Scoring of match results is a complex subject with dozens of brilliant minds working on a variety of relevant problems to increase the quality of search results. The good news is that LDAP search is bounded by a relatively small number of well-defined constraints, and we can use some simple criteria and tuning to deliver highly relevant results with a minimum of program complexity and processing time.

Scoring criteria file

The first step is to define the fields to be scored. The simple approach taken here is to only score fields considered more important than other fields. Consider:

Listing 1. Example scoring criteria file


name 600
workloc 400
mail 300
jobresponsibilities 50
coverage_state 50
additional 40

The name field is given ultimate priority in our score weighting scheme. If a query word is found in the name, we want to give it a very high score, vs. a much lower score when found in the jobresponsibilities field, and no score at all when found in any of the other fields not defined in Listing 1. You'll need to tune these scoring parameters for your data environment. For example, to match primarily on physical address, tune the score weighting definitions to emphasize the physical address field. So even if a match is found in the Person Name field, you can use the appropriate weighting definitions to place the matching physical address records first in the list. For example, if you search on "Orchard," you may be searching for the IBM® corporate address. With an appropriately tuned score weight file, it will place the "IBM Mailing Address" record at "101 Orchard Drive" ahead of individual records like "John Orchard" or "Orchard Industries."

Note how the scores are separated by orders of magnitude and divisions of 10. Use broad spacing in your initial scoring setups to allow maximum flexibility as your scoring criteria changes.

Back to top

Modifying the code

With the scoring criteria file in place (fieldWeights), we need to load the various scoring parameters and build our match test subroutines. The Resources section contains a link to the complete code, or follow the modification instructions below as we transmogrify Part 1 into the completed Part 2 program. The first step is to add the required hash to the variable declaration section. Make sure to insert the line my %fieldWeights = (); at the beginning of the variable declaration section. Then use the following subroutine to load the field weights into memory.

Listing 2. loadMainFieldWeight subroutine


sub loadMainFieldWeight{
  open( FLDWT, "fieldWeights") or die "can't open field weights file";
  while( my $line = <FLDWT> )
  {
    my($key, $val ) = split " ", $line;
    $fieldWeights{ $key } = $val;
  }#while field weights file
}#loadTwoWordFieldWeight

After the grep stage, Part 1 printed out the results returned in sequential order. Part 2 injects the scoreSortResults subroutine into the main program logic to score each matched record, sort the resulting list, and pass it on to the display step.

Listing 3. scoreSortResults subroutine


sub scoreSortResults
{
  my @scored = ();

  for my $oneRec ( @_ )
  {
    chomp($oneRec);
    my @delRecs = split "##", $oneRec;
    shift(@delRecs); # first field is empty
    my $localScore = 0;

    for my $fld ( @delRecs  )
    { 
      $localScore += match( "contains", $fld );
      $localScore += match( "isExact",  $fld );
    }#for each field in the linex

    push @scored, "$localScore" . "$oneRec";
  }#for each line 

  my @idx = (); #temporary index for sorting
  for( @scored ){
    # create an index of scores
    my $item =  substr($_,0,index($_,'##'));
    push @idx, $item;
  }

  # sort the index of scores
  my @sorted = @scored[ sort { $idx[$b] <=> $idx[$a] } 0 .. $#idx ];

  return( @sorted );
}#scoreResults

The main for loop iterates over every record that matched from the flat file. For each of these newline-delimited records, the fields are extracted and checked for a match with any of the query words. The separate check for a contains match and an isExact match effectively doubles the score for a precise match of a word. This was the first simple yet moderately robust method I came upon to deliver reliable results that differentiate "John" from "Johnny" in a useful way. With the score of a record inserted at the beginning of the record line, the loop repeats.

When every record that matched from the flat file has been scored, the sorting process is set up. The first step is to isolate the actual score by selecting the first record in the ##-delimited list. With the index array of scores created, we use the standard sort command as described in section 4 of the perlfaq to create the @sorted array of ordered search results.

Let's go back and flesh out the match subroutine:

Listing 4. match subroutine


sub match
{
  my $matchType = $_[0];
  my $input = $_[1];
  my( $field, $value ) = split ':', $input;
  my $retScore = 0;

  for my $qPiece ( @queryWords )
  {
    if( $matchType eq "contains" )
    { 
      if( $value =~ /($qPiece)/i )
      { 
        next unless exists($fieldWeights{$field});
        $retScore .= $fieldWeights{$field};
      }#if word match found
    }#if contains

    else
    { 
      if( $value =~ /(\($qPiece\))/i   ||   $value =~ /(\b$qPiece\b)/i  )
      { 
        next unless exists($fieldWeights{$field});
        $retScore .= $fieldWeights{$field};
      }#if exact word match found
    }#if isexact

  }#for each query word
  return($retScore);
}#match

For each word specified in the query, check for results that contain the word and results that are exact matches for the word. One particular modification for the LDAP data is to consider '(' and ')' as word boundary indicators. This will help match entries such as "smith,dave" and "smith, david (dave)."

For easy tuning with your particular data set, consider modifying the scoreSortResults subroutine to only accept exact matches.

With the simple scoring procedures in place and a sorted array of result text, we can print the results. We'll need to modify the buildHashPrintResults subroutine to store the print output in a variable. We'll use this for some simple results checking later on. Add the line my $outStr = ""; to the main program variable declaration section. Change the print lines in the buildHashPrintResults subroutine from those in Listing 5 to what those in Listing 6.

Listing 5. Part 1 buildHashPrintResults print statments


my @delRecs = split "##", $oneRec;
...
print getSelectedFields();
%fieldHash = ();

Listing 6. Part 2 buildHashPrintResults print statments


my @delRecs = split "##", $oneRec;
$outStr .= "Score: $delRecs[0]\n";
...
$outStr .= getSelectedFields();
%fieldHash = ();

All that remains to complete the scoring code upgrade is to modify the main program logic flow from Listing 7 to Listing 8.

Listing 7. Part 1 main program logic


@queryWords = split " ", $searchQuery;
  
buildHashPrintResults( alg_N_Word( @queryWords ) );

Listing 8. Part 2 main program logic


@queryWords = split " ", $searchQuery;

loadMainFieldWeight();

buildHashPrintResults
( 
  scoreSortResults
  ( 
    alg_N_Word( @queryWords )
  )
);

print "$outStr";

Go ahead and give it a try with your previous data set. For my example data set, if I do a query like perl cmdSearch.pl devel chri, I would get a result like that shown in Listing 9. Note how the first record has a higher score because "chris" matches in the mail field, as well as the name field.

Listing 9. Scored query results


Score: 950
mail:  nchristo@us.ibm.com
telephonenumber:  1-522-223-2214
physicaldeliveryofficename:  1P-027
co:  USA
cn:  Christopher Q Public 
buildingname:  007
jobresponsibilities:  Senior Software Engineer, \
IBM Developer Skills Program, developerWorks
givenname:  Christopher,  Chris,  Kris,  Christian,  Christine,  Cristiane
primaryuserid:  NCHRIS
name:  Public, Christopher

Score: 650
mail:  crothemooi@us.ibm.com
telephonenumber:  1-822-223-2215
physicaldeliveryofficename:  HOME
co:  USA
cn:  Christine D. Public
buildingname:  311
jobresponsibilities:  developerWorks WebSphere Editor: Wireless, Web Services, Voice
givenname:  Christine D.,  Christine,  Chris,  Kris,  Christian,  Christopher,  Cristiane
name:  Public, Christine D. (Chris)
preferredfirstname:  Christine

You now have in place a fully functional, weighted-scoring, free-form query search engine. Although tailored to LDAP-type data, you can use the code above for any number of text-related queries. Keep in mind as you tune your search engine that major modifications to the performance and results of the algorithms can be acquired by modifying just a few sections of the code. For example, if you remove the ability of a query word to match more than once, or more than once in the same field, the distribution of the scores within a record will change dramatically.

Back to top

Smart suggestions using Text-Metaphone

Metaphone matching is an algorithmic method for generating keys based on the phonetic pronunciation of words. For example, the metaphone for "developerWorks" is "TFLPRWRKS." The same metaphone is generated for devoloperworks, devaloperworks, devaloperwerks, etc. Used in a wide variety of spelling checker and suggestion-making programs, we'll use a simple version of the metaphone-matching approach to provide some useful suggestions based on our particular data.

Building the metaphone database

The approach used in this article is to extract all of the individual words from a specified field in the entire flat-file database. A frequency count is then created for each of these words, along with their associated metaphone keys. Use the program buildMetaphones.pl in Listing 10 to create your own metaphone databases.

Listing 10. buildMetaphones.pl script


#!/usr/bin/perl -w
use strict;
use Text::Metaphone;

if( @ARGV != 1 ){ die "specify a fieldname" };

my $selectField = $ARGV[0];
my %metaPhones;

# build the metaphone frequency counts
while( my $line = <STDIN>){
  my @allRecs = split '##', $line;
  shift( @allRecs ); #first on is empty

  for my $entry ( @allRecs )
  {
    next unless $entry =~ /\:/;
    my( $field, $value ) = split ':', $entry;
    next unless $field =~ $selectField;

    # remove characters that inhibit space delimiting
    $value =~ s/\W/ /g;

    for my $word ( split ' ', lc($value) )
    { 
      $metaPhones{ Metaphone($word) }{ $word }++;
    }
  }#for each field entry
}#while input

# now print them out
for my $metaKey( keys %metaPhones )
{
  print "$metaKey##";

  # build a sorted list of actual word counts associated with a metaphone
  my @countKeys = sort
  {
    $metaPhones{$metaKey}{$b} <=> $metaPhones{$metaKey}{$a}
  } keys %{ $metaPhones{$metaKey} };

  for my $sortKey ( @countKeys )
  {
    print "$sortKey $metaPhones{$metaKey}{$sortKey}##";
  }

  print "\n";
}#for metakey

Run the script with the command cat <data_file> | perl buildMetaphones.pl name > metaphones.name, and it will produce output like that shown in Listing 11. The data_file entry should be your newline-delimited flat file generated from your LDAP data extract as described in Part 1.

Listing 11. Example metaphones.name file


XKR##shekerow 1##
JLBR0##gilbreath 1##
JM##jim 103##
MKKMN##mccommon 12##
MRKN##morgan 33##
TNS##denise 3##
JKFLF##jakovlev 1##
JN##john 18##jeanne 12##jenni 10##jon 1##

You also want to create a metaphones.jobresponsibilities file with the command cat <data_file> | perl buildMetaphones.pl jobresponsibilities > metaphones.jobresponsibilities.

To use these files, we'll need to update the cmdSearch.pl code with the loading subroutine.

Listing 12. loadMetaphones subroutine


sub loadMetaPhones{

  open( FH, "metaphones.name" ) or \
  die "can't load name metaphones";
    while(<FH>){
      my( $key, $val ) =split '##';
      $nameMP{$key} = $val;
    }#while FH
  close(FH);

  open( FH, "metaphones.jobresponsibilities" ) or \
  die "can't load jobresponsibilities metaphones";
    while(<FH>){
      my( $key, $val ) = split '##';
      $jobMP{$key} = $val;
    }#while FH
  close(FH);

}#loadMetaPhones

The loadMetaPhones subroutine is straightforward code that creates two hashes of metaphone keys and values. In our example, a 1-frequency of metaphone match in the name field is considered more desirable than a 50-frequency match in the jobresponsibilities field. Instead of creating another scoring system to take this into account, the code will simply search for a name metaphone match first, then move on to the jobresponsibilities metaphones if none is available. The actual code to make the metaphone matches takes place in the getMetaphoneMatch subroutine:

Listing 13. getMetaphonematch subroutine


sub getMetaphoneMatch{
  my $retStr = "";
  my @notFound = ();
  #HRNKTN ##harrington 41##herrington 4 - metaphone file format

  for my $qPiece ( @_ )
  {
    my $word = $qPiece;
    my $phone = Metaphone($word);
    my @mParts = ();

    if( exists($nameMP{$phone}) )
    { 
      @mParts = split '##', $nameMP{$phone};
    }
    elsif( exists($jobMP{$phone}) )
    { 
      @mParts = split '##', $jobMP{$phone};
    }

    next unless @mParts ne "";

    # print the most common match only
    for my $metaWord ( @mParts )
    { 
      if( $metaWord !~ /$qPiece/i )
      { 
        $retStr .= "$metaWord, ";
        last;
      }
    }#for metaword check
  }#for each query word

  return($retStr);
}#getMetaphoneMatch

After the local variable declaration, the subroutine will make a pass through every query word specified searching for a metaphone match in the name or jobresponsibilities hashes. If a metaphone match has been found, just the first matched word with the appropriate metaphone is selected for printing. For example, a query for "Horington" has a metaphone of "HRNKTN" and will print a suggestion for "Harrington," even though "Herrington," "Herringdon," and others may be metaphone matches. If none of the name metaphones are a match, the job responsibilities metaphones are searched. The choice to print the most common match (as determined in the buildMetaphones.pl script) is based on a desire to give the users a simple level of suggestion. You can print all of the matching metaphones, or even search the database automatically based on the first few characters in the metaphone match. Choose a strategy that works best for your application.

To complete the modifications, add my %jobMP = (); my %nameMP = (); to the variable declaration section and modify the main program logic from Listing 14 to Listing 15.

Listing 14. Previous main program logic


@queryWords = split " ", $searchQuery;
loadMainFieldWeight();

buildHashPrintResults
(
  scoreSortResults
  (
    alg_N_Word( @queryWords )
  )
);

print "$outStr";

Listing 15. Final Part 2 main program logic


@queryWords = split " ", $searchQuery;
loadMainFieldWeight();
loadMetaPhones();

buildHashPrintResults
(
  scoreSortResults
  (
    alg_N_Word( @queryWords )
  )
);

print "$outStr";
if( $outStr eq "" ){ print "Try: ", getMetaphoneMatch( @queryWords ), "\n" }

Now run the cmdSearch.pl application with a query that you know is not in the dataset. For example, if you do a search on "jaff devaloperWerks," the cmdSearch.pl program output will be Try: jeff 114, developerworks 50,.

Back to top

Summary

Share this...


	Digg this story

	Post to del.icio.us

	Slashdot it!

With the LDAP search engine described above, complete with effective scoring and simple search suggestions, you have many options for integration within the enterprise. One typical approach is to migrate the interface to a HTML-generating environment and allow users to search your LDAP database just like they would the rest of the Internet. You can create a Service-Oriented Architecture (SOA) application programming interface (API) for integration with other applications, including suggestion support to give your various applications smarter people lookups.

The general algorithms and text-searching approach are portable and easily integrated across the enterprise, as well. Use the metaphone-matching and regular-expression generator to create a suggestion link based on 404 hits to your Web site. Insert a call to the Aspell or Ispell API for another avenue of search query suggestion, or replace the metaphone matching with the best modern soundex for your application.

Back to top

Download

Description	Name	Size	Download method
Source code	os-ldap2Search.zip	2KB	HTTP


	Information about download methods

Resources

Learn

This article builds on "LDAP search engines, Part 1."
Check out the source of Perl and LDAP.
Paul Dwerryhouse wrote an article titled "An Introduction to perl-ldap" about Perl and LDAP.
There's plenty of background and implementation knowledge available at OpenLDAP.org.
Expand your PHP skills by checking out IBM developerWorks' PHP project resources.
To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
Stay current with developerWorks' Technical events and webcasts.
Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
Visit Safari Books Online for a wealth of resources for open source technologies.

Get products and technologies

Grab Michael Schwern's Text-Metaphone module at CPAN.
Innovate your next open source development project with IBM trial software, available for download or on DVD.

Discuss

Participate in developerWorks blogs and get involved in the developerWorks community.
Participate in the developerWorks PHP Developer Forum.

About the author


		Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.

Rate this page

Back to top

About IBM

Privacy

Contact