This is an archived cached-text copy of the developerWorks article. Please consider viewing the original article at: IBM developerWorks

Country/region [select]

Home

Products

Services & industry solutions

Support & downloads

My IBM

developerWorks

In this article:
	Requirements
	Modification of the Search::VectorSpace.pm module
	LDAP data files
	vectorSearch.pl program
	Running vectorSearch.pl, search examples
	Conclusion and further modification options
	Download
	Resources
	About the author
	Rate this page

Related links
	Open source technical library

developerWorks > Open source >

Search structured LDAP data with a vector-space engine

Find people in your enterprise directory faster while compensating for typographical and spelling errors automatically

Document options

Document options requiring JavaScript are not displayed

Sample code

Rate this page

Help us improve this content

Level: Intermediate

Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM　

18 Sep 2007

Use Perl and a vector-space search engine to search and display records from your Lightweight Directory Access Protocol (LDAP) database. Use inflected letters and numbers to create a useful vector space from structured LDAP data. And compensate for typographical and spelling errors automatically while showing the most appropriate match for any query entered.

Articles describing vector-space searching usually begin with a description of vector spaces and how to project a specified query into a term space. Let's work backward, instead, with the following example: With a specified query of Nathen, we want to match data entries of Nathan and Jonathan, in that order. Existing approaches might involve building a regular expression based on stems of a word, or metaphone, and other linguistic derivatives of a search term. In our case, effective search results can be obtained by creating a vector for each letter in a word and returning results based on the closest match in vector space. In this case, the Nathan result will be printed first because it has five of six letters (vectors) in common, and Jonathan will be printed second because it only has five of eight letters in common.

The code and descriptions in this article are a highly simplified view of vector spaces and how to search them effectively. Fortunately, we have Maciej Ceglowski's excellent Search::VectorSpace module to encapsulate the logic, mathematics, and overhead of working with vector spaces efficiently. Consult Resources for information and code for all you ever wanted to know about the theory and practice of vector spaces.

Vector-space search approaches are generally suited to free-form text that include a certain number of words that help distinguish a block of text from others in the vector space. Web pages, scientific research reports, and other forms of natural language writing are best associated with the vector-space search method.

LDAP data usually comprises a set of highly redundant and a set of unique data components. The uniqueness of surnames (harrington vs. herrington, for example), phone numbers, and other fields can make for ambiguous search results, as the number of vectors in common between query and dataset are very low. Traditional vector-space approaches will treat a word like harrington as having a single vector. However, the approach used in this article is to treat each letter in a word as a vector. Capable of disregarding spelling mistakes (common when searching on names), impervious to typographical errors, such as 919-41 5-2042, splitting each alphanumeric character into a word inflective is a fast and easily manageable way to use LDAP data in a vector-space search engine.

Requirements

Hardware

Any modern PC manufactured after the year 2000 should provide plenty of horsepower for compiling and running the code in this article. However, large datasets will require substantial amounts of RAM and processing power if subsecond response times are desired. Initial vector-space creation times can be long, especially for large data sets. Actual search is rapid, though, and scales linearly for increases in data size.

Software

The code relies on the excellent VectorSpace module from Maciej Ceglowski as a basis for its search code. You'll need to pull down the Search::VectorSpace module from CPAN or your favorite mirror. Note that you don't need to install the VectorSpace module. Simply download the archive containing the VectorSpace.pm module and extract it to your working directory. You'll also need the File::Find module (standard with most Perl distributions) for working with LDAP data files on your system.

Back to top

Modification of the Search::VectorSpace.pm module

Once you have the VectorSpace.pm module, a small change needs to be made to support the expansion of all alphanumeric characters into words in the vector space. Modify the get_words subroutine from the VectorSpace.pm module, as shown below.

Listing 1. VectorSpace.pm get_words subroutine

                
sub get_words { 
  
  # Splits on whitespace and strips some punctuation    
  my ( $self, $text ) = @_;
  my %doc_words;  
  my @words = map { stem($_) }
        grep { !( exists $self->{'stop_list'}->{$_} ) }
        map { lc($_) } 
        map {  $_ =~/([a-z\-']+)/i} 
        split /\s+/, $text;
  do { $_++ } for @doc_words{@words};
  return %doc_words;
}

Remove the grep { !( exists $self->{'stop_list'}->{$_} ) } line from the map block in the subroutine to disable the stop list checking for our purposes.

Back to top

LDAP data files

This article makes use of LDAP data extracted from our company's enterprise directory. The vectorSearch.pl program described below expects each data file to have one LDAP record. Listing 2 shows an example LDAP data file in the ldapData directory. Note that the example code in this article is designed to give you the appropriate file name associated with a given query. Further programs can be written to process this file and display the relevant user contact information, job title, or other desired fields.

Listing 2. Example LDAP data file

                
objectclass: person
alternatelocalityname: Research Triangle Park
notesemail: CN=Nathan J Harrington/OU=Raleigh/O=IBM@ibmus
jobresponsibilities: System Administrator, Applications Developer
phone: 919-415-2042

Back to top

vectorSearch.pl program

We've eliminated the stop list and grabbed some LDAP data. Now we're ready to start searching with the vectorSearch.pl program. Listing 3 shows the variable declaration and subroutine calls from the program header.

Listing 3. vectorSearch.pl main program header

                
#!/usr/bin/perl -w 
# vectorSearch.pl - search ldap data using Search::VectorSpace
use strict;
use VectorSpace;
use File::Find;
die "specify a directory of files, maximum matches" unless @ARGV == 2;

my %fileHash = ();
my %phHash = ();
my $maxMatch = $ARGV[1];

loadInflectives();
find(\&loadVectorSpace,"$ARGV[0]");

Let's take a closer look at those subroutines, starting with loadInflectives. Given a data file of the format shown in Listing 4, the loadInflectives subroutine builds an alphanumeric phonetic hash. We'll use this hash later to change words into a string of their phonetic alphabet equivalents.

Listing 4. Example phonetic alphabet data file

                
A Alpha
B Bravo
C Charlie
...
7 Seven
8 Eight
9 Nine

The loadInflectives subroutine is shown in Listing 5, as simple as can be.

Listing 5. loadInflectives subroutine

                
sub loadInflectives
{
  open(INF,"alphaNumericInflectives" ) or die "can't open phonetic alph bet";
    while(my $inLine = <INF> )
    { 
      chomp($inLine);
      my ($letter, $name ) = split " ", lc($inLine);
      $phHash{$letter} = $name;
    }
  close(INF);
}#loadInflectives

loadVectorSpace is more complicated and is where the bulk of the program's time will be spent. The call to the loadVectorSpace subroutine in the main program code find(\&loadVectorSpace,"$ARGV[0]"); uses the File::Find module's callback structure to run the loadVectorSpace code for every file in the directory specified. As shown below, each line in every file provided will be transformed into an inflected alphanumeric string. This string is then passed into the VectorSpace module, and a data hash data structure is keyed on the filename for the list of vectors within a file. Searching this data structure for query results will be performed in the main program logic.

Listing 6. loadInflectives subroutine

                
sub loadVectorSpace
{
  return if( $File::Find::name eq $File::Find::dir ); # next file if a directory
  my @docs = ();

  open( INF, "$ENV{PWD}/$File::Find::name") or
    die "can't open file $ENV{PWD}/$File::Find::name";

    while(my $line = <INF>)
    { 
      next unless( $line =~ /:/ );          # ignore lines with no data
      my( undef, $line) = split ':', $line; # disregard fieldname
      push @docs, buildInflectString($line);

    }#while file read

  close(INF);

  $fileHash{$File::Find::name} = Search::VectorSpace->new(
                                   docs => \@docs,threshold => .04
                                 );
  $fileHash{$File::Find::name}->build_index();
  print "loaded: $File::Find::name\n";

}#loadVectorSpace

Note that the inflected alphanumeric string is created with a call to the buildInflectString subroutine. Shown in Listing 7, the buildInflectString subroutine uses the phonetic hash built earlier to transform a series of characters like nathan to an inflected string of the format "November Alpha Tango Hotel Alpha November." This will fulfill our strategy for creating a vector on each letter in the data field.

Listing 7. buildInflectString subroutine

                
sub buildInflectString
{
  my $data = "";
  # empty split is for each character in the string 
  for my $oneChar ( split //, lc("@_") ) 
  {
    # skip character if not a-z or 0-9 case insensitive
    next if( $oneChar !~ /([a-z]|[0-9])/i );
    $data .= $phHash{$oneChar} . " ";
  }#for each line in file
  return( $data );
}#buildInflectString

We now have a fully built vector space for searching that will return filenames containing the relevant LDAP data record where the query matches.

The main program holds the query, sorting, and display logic for returning the matches. It also supplies a simple loop for handling queries on the command line. Listing 8 shows the main program logic.

Listing 8. vectorSearch.pl main program logic

                
print "Enter a query: \n";
while ( my $query = <STDIN> ) {
    
  $query = buildInflectString( $query );
  my %resHash = ();
      
  for my $fKey( keys %fileHash )
  {
    my %results = $fileHash{$fKey}->search( $query );

    # return top result only from each file
    my ($topRes) = ( sort { $results{$b} <=> $results{$a} } keys %results );
    $resHash{ sprintf("%0.2f", $results{$topRes}) } = $fKey;
 
  }#for each filename
  
  my $count = 0;
  foreach my $resultKey (  reverse sort  keys %resHash )
  { 
    $count++;
    print "Result: $resultKey $resHash{$resultKey}\n";
    last if( $count == $maxMatch );         
  }#for each result

  print "Enter a query: \n";

}#while stdin

buildInflectString is called with the user's query to build a alphanumeric inflected string for searching the vector space. Each file has its own vector space to search, and the results are numerically sorted within each file. We want just the highest match from each file, so we pop the top off the list of sorted key value pairs and place that in the %resHash variable. After each file has been processed, we'll need to sort the results hash again to get the best of all matches across all the available files.

Back to top

Running vectorSearch.pl, search examples

Open a command prompt and run perl vectorSearch.pl ldapData/ 5. Every file from the ldapData directory will be loaded, and you will see a prompt to Enter a query. Try different queries and variations on those queries to get a feel for how the system works. For example, you can see that 415-2042 and 41#5 2-042 provide the same ordered results with the Nathan Harrington record at the top of the sorted list.

To check the automatic spelling derivatives by-product of this inflection into vector-space approach, try a query like nathen harrington or linda horrington. Due to the large number of vectors in common between the query and results, the appropriate files are listed at the top of the search results.

Back to top

Conclusion and further modification options

The code as presented above will enable you to efficiently query structured data, such as LDAP-based employee information, quickly. Note how retrieving the top result from each files' scores is just one way to process the results. You may want to retrieve all of the results from all of the files and run a second pass of sorting, or score based on the presence of the query word in the file. Adding punctuation inflectives, such as "Dash" for "-" or "AtSign" for "@" may prove useful when processing certain types of data, such as employee serial numbers or postal addresses.

Maciej Ceglowski's Search::VectorSpace module and some inflective word creation rules give you the options you need to explore the best scenario for your particular dataset.

Share this...


	Digg this story

	Post to del.icio.us

	Slashdot it!

Back to top

Download

Description	Name	Size	Download method
Source code	os-vectorldap-Space_0.1.zip	1.3KB	HTTP


	Information about download methods

Resources

Learn

Maciej Ceglowski's article "Building a Vector Space Search Engine in Perl" is hosted at Perl.com.
You can find the Search::VectorSpace module on CPAN.
Wikipedia has great information about vector spaces.
If your Windows®, Linux®, or Mac OS X system doesn't have Perl, grab it from Perl.org.
"Enable C++ applications for Web services using XML-RPC" is a step-by-step guide to exposing C++ methods as services.
To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
Stay current with developerWorks' Technical events and webcasts.
Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
Watch and learn about IBM and open source technologies and product functions with the no-cost developerWorks On demand demos.

Get products and technologies

Innovate your next open source development project with IBM trial software, available for download or on DVD.
Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

Participate in developerWorks blogs and get involved in the developerWorks community.
Participate in the developerWorks PHP Forum: Developing PHP applications with IBM Information Management products (DB2, IDS).

About the author


		Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.

Rate this page

Back to top

About IBM

Privacy

Contact