| Level: Intermediate Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM
18 Sep 2007 Use Perl and a vector-space search engine to search and display records from
your Lightweight Directory Access Protocol (LDAP) database. Use inflected letters and
numbers to create a useful vector space from structured LDAP data. And compensate for
typographical and spelling errors automatically while showing the most appropriate
match for any query entered.
Articles describing vector-space searching usually begin with a description of vector
spaces and how to project a specified query into a term space. Let's work backward,
instead, with the following example: With a specified query of Nathen, we want to
match data entries of Nathan and Jonathan, in that order. Existing
approaches might involve building a regular expression based on stems of a word, or
metaphone, and other linguistic derivatives of a search term. In our case, effective
search results can be obtained by creating a vector for each letter in a word and
returning results based on the closest match in vector space. In this case, the
Nathan result will be printed first because it has five of six letters (vectors) in
common, and Jonathan will be printed second because it only has five of eight letters
in common.
The code and descriptions in this article are a highly simplified view of vector spaces
and how to search them effectively. Fortunately, we have Maciej Ceglowski's excellent
Search::VectorSpace module to encapsulate the logic, mathematics, and overhead of
working with vector spaces efficiently. Consult Resources for
information and code for all you ever wanted to know about the theory and practice of
vector spaces.
Vector-space search approaches are generally suited to free-form text that include a
certain number of words that help distinguish a block of text from others in the vector
space. Web pages, scientific research reports, and other forms of natural language
writing are best associated with the vector-space search method.
LDAP data usually comprises a set of highly redundant and a set of unique data
components. The uniqueness of surnames (harrington vs. herrington, for example), phone
numbers, and other fields can make for ambiguous search results, as the number of
vectors in common between query and dataset are very low. Traditional vector-space
approaches will treat a word like harrington as having a single vector. However,
the approach used in this article is to treat each letter in a word as a vector.
Capable of disregarding spelling mistakes (common when searching on names), impervious
to typographical errors, such as 919-41 5-2042, splitting each alphanumeric character
into a word inflective is a fast and easily manageable way to use LDAP data in a
vector-space search engine.
Requirements
Hardware
Any modern PC manufactured after the year 2000 should provide plenty of horsepower for
compiling and running the code in this article. However, large datasets will require
substantial amounts of RAM and processing power if subsecond response times are
desired. Initial vector-space creation times can be long, especially for large data
sets. Actual search is rapid, though, and scales linearly for increases in data size.
Software
The code relies on the excellent VectorSpace module from Maciej Ceglowski as a basis
for its search code. You'll need to pull down the Search::VectorSpace
module from CPAN or your favorite mirror. Note that you don't need to install the VectorSpace
module. Simply download the archive containing the VectorSpace.pm module and extract it
to your working directory. You'll also need the File::Find module (standard with most
Perl distributions) for working with LDAP data files on your system.
Modification of the Search::VectorSpace.pm module
Once you have the VectorSpace.pm module, a small change needs to be made to support the
expansion of all alphanumeric characters into words in the vector space. Modify the
get_words subroutine from the VectorSpace.pm module, as shown
below.
Listing 1. VectorSpace.pm get_words subroutine
sub get_words {
# Splits on whitespace and strips some punctuation
my ( $self, $text ) = @_;
my %doc_words;
my @words = map { stem($_) }
grep { !( exists $self->{'stop_list'}->{$_} ) }
map { lc($_) }
map { $_ =~/([a-z\-']+)/i}
split /\s+/, $text;
do { $_++ } for @doc_words{@words};
return %doc_words;
}
|
Remove the grep { !( exists $self->{'stop_list'}->{$_} )
} line from the map block in the subroutine to disable the stop list checking for
our purposes.
LDAP data files
This article makes use of LDAP data extracted from our company's enterprise directory.
The vectorSearch.pl program described below expects each data file to have one LDAP
record. Listing 2 shows an example LDAP data file in the ldapData directory. Note
that the example code in this article is designed to give you the appropriate file name
associated with a given query. Further programs can be written to process this file
and display the relevant user contact information, job title, or other desired fields.
Listing 2. Example LDAP data file
objectclass: person
alternatelocalityname: Research Triangle Park
notesemail: CN=Nathan J Harrington/OU=Raleigh/O=IBM@ibmus
jobresponsibilities: System Administrator, Applications Developer
phone: 919-415-2042
|
vectorSearch.pl program
We've eliminated the stop list and grabbed some LDAP data. Now we're ready to start
searching with the vectorSearch.pl program. Listing 3 shows the variable declaration
and subroutine calls from the program header.
Listing 3. vectorSearch.pl main program header
#!/usr/bin/perl -w
# vectorSearch.pl - search ldap data using Search::VectorSpace
use strict;
use VectorSpace;
use File::Find;
die "specify a directory of files, maximum matches" unless @ARGV == 2;
my %fileHash = ();
my %phHash = ();
my $maxMatch = $ARGV[1];
loadInflectives();
find(\&loadVectorSpace,"$ARGV[0]");
|
Let's take a closer look at those subroutines, starting with loadInflectives . Given a data file of the format shown in Listing 4,
the loadInflectives subroutine builds an alphanumeric
phonetic hash. We'll use this hash later to change words into a string of their
phonetic alphabet equivalents.
Listing 4. Example phonetic alphabet data file
A Alpha
B Bravo
C Charlie
...
7 Seven
8 Eight
9 Nine
|
The loadInflectives subroutine is shown in Listing 5, as
simple as can be.
Listing 5. loadInflectives subroutine
sub loadInflectives
{
open(INF,"alphaNumericInflectives" ) or die "can't open phonetic alph bet";
while(my $inLine = <INF> )
{
chomp($inLine);
my ($letter, $name ) = split " ", lc($inLine);
$phHash{$letter} = $name;
}
close(INF);
}#loadInflectives
|
loadVectorSpace is more complicated and is where the bulk of
the program's time will be spent. The call to the loadVectorSpace subroutine in the main program code find(\&loadVectorSpace,"$ARGV[0]"); uses the File::Find module's
callback structure to run the loadVectorSpace code for every
file in the directory specified. As shown below, each line in every file provided
will be transformed into an inflected alphanumeric string. This string is then passed
into the VectorSpace module, and a data hash data structure is keyed on the filename
for the list of vectors within a file. Searching this data structure for query results
will be performed in the main program logic.
Listing 6. loadInflectives subroutine
sub loadVectorSpace
{
return if( $File::Find::name eq $File::Find::dir ); # next file if a directory
my @docs = ();
open( INF, "$ENV{PWD}/$File::Find::name") or
die "can't open file $ENV{PWD}/$File::Find::name";
while(my $line = <INF>)
{
next unless( $line =~ /:/ ); # ignore lines with no data
my( undef, $line) = split ':', $line; # disregard fieldname
push @docs, buildInflectString($line);
}#while file read
close(INF);
$fileHash{$File::Find::name} = Search::VectorSpace->new(
docs => \@docs,threshold => .04
);
$fileHash{$File::Find::name}->build_index();
print "loaded: $File::Find::name\n";
}#loadVectorSpace
|
Note that the inflected alphanumeric string is created with a call to the buildInflectString subroutine. Shown in Listing 7, the buildInflectString subroutine uses the phonetic hash built earlier to
transform a series of characters like nathan to an inflected string of the
format "November Alpha Tango Hotel Alpha November." This will fulfill our strategy for
creating a vector on each letter in the data field.
Listing 7. buildInflectString subroutine
sub buildInflectString
{
my $data = "";
# empty split is for each character in the string
for my $oneChar ( split //, lc("@_") )
{
# skip character if not a-z or 0-9 case insensitive
next if( $oneChar !~ /([a-z]|[0-9])/i );
$data .= $phHash{$oneChar} . " ";
}#for each line in file
return( $data );
}#buildInflectString
|
We now have a fully built vector space for searching that will return filenames
containing the relevant LDAP data record where the query matches.
The main program holds the query, sorting, and display logic for returning the matches.
It also supplies a simple loop for handling queries on the command line. Listing 8
shows the main program logic.
Listing 8. vectorSearch.pl main program logic
print "Enter a query: \n";
while ( my $query = <STDIN> ) {
$query = buildInflectString( $query );
my %resHash = ();
for my $fKey( keys %fileHash )
{
my %results = $fileHash{$fKey}->search( $query );
# return top result only from each file
my ($topRes) = ( sort { $results{$b} <=> $results{$a} } keys %results );
$resHash{ sprintf("%0.2f", $results{$topRes}) } = $fKey;
}#for each filename
my $count = 0;
foreach my $resultKey ( reverse sort keys %resHash )
{
$count++;
print "Result: $resultKey $resHash{$resultKey}\n";
last if( $count == $maxMatch );
}#for each result
print "Enter a query: \n";
}#while stdin
|
buildInflectString is called with the user's query to build
a alphanumeric inflected string for searching the vector space. Each file has its own
vector space to search, and the results are numerically sorted within each file. We
want just the highest match from each file, so we pop the top off the list of sorted
key value pairs and place that in the %resHash variable.
After each file has been processed, we'll need to sort the results hash again to get
the best of all matches across all the available files.
Running vectorSearch.pl, search examples
Open a command prompt and run perl vectorSearch.pl ldapData/
5 . Every file from the ldapData directory will be loaded, and you will see a
prompt to Enter a query. Try different queries and variations on those queries
to get a feel for how the system works. For example, you can see that 415-2042
and 41#5 2-042 provide the same ordered results with the Nathan Harrington
record at the top of the sorted list.
To check the automatic spelling derivatives by-product of this inflection into
vector-space approach, try a query like nathen harrington or linda
horrington. Due to the large number of vectors in common between the query and
results, the appropriate files are listed at the top of the search results.
Conclusion and further modification options
The code as presented above will enable you to efficiently query structured data, such
as LDAP-based employee information, quickly. Note how retrieving the top result from
each files' scores is just one way to process the results. You may want to retrieve all
of the results from all of the files and run a second pass of sorting, or score based
on the presence of the query word in the file. Adding punctuation inflectives, such as
"Dash" for "-" or "AtSign" for "@" may prove useful when processing certain types of
data, such as employee serial numbers or postal addresses.
Maciej Ceglowski's
Search::VectorSpace module and some inflective word creation rules give you the options
you need to explore the best scenario for your particular dataset.
Download Description | Name | Size | Download method |
---|
Source code | os-vectorldap-Space_0.1.zip | 1.3KB | HTTP |
---|
Resources Learn
Get products and technologies
-
Innovate your next open source development project with IBM trial software, available for download or on DVD.
-
Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
About the author | | | Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies. |
Rate this page
| |