This is an archived cached-text copy of the developerWorks article. Please consider viewing the original article at: IBM developerWorks

Country/region [select]

Home

Products

Services & industry solutions

Support & downloads

My IBM

developerWorks

More in this series: LDAP search engines

In this article:
	Requirements
	Flat-file database selection
	Building the flat-file database
	Searching the database with cmdSearch.pl
	Basics in place
	Download
	Resources
	About the author
	Rate this page

Related links
	Open source technical library

developerWorks > Open source >

LDAP search engines, Part 1: Use Perl and a regular-expression generator to search for and display LDAP database records

Document options

Document options requiring JavaScript are not displayed

Sample code

Rate this page

Help us improve this content

Level: Intermediate

Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM

20 Feb 2007

Find out how to use Perl and a regular-expression generator to search and display records from your Lightweight Directory Access Protocol (LDAP) database using simple keyword-type searches. Search and processes your LDAP data without knowing precisely which field the data is in or how it is formatted. Part 2 of this "LDAP search engines" series introduces scoring and metaphone suggestions to the code.

Many organizations implement some form of LDAP service for storing enterprise directory information. Existing search options allow for a range of lookups based on where certain data is stored in the directory. This article allows you to combine the power of regular expressions with the grep tool to create your own custom LDAP search capability. In the spirit of successful search engines, such as Google, we'll change the search format from a LDAP-style query string to simple and powerful keyword matching and results display.

In this article, we will cover building the flat-file database, regular-expression creation, and basic search and display. Part 2 covers the scoring and metaphone matching topics to help complete your search capabilities.

Requirements

Hardware

Any modern PC manufactured after year 2000 should provide plenty of horsepower for compiling and running the code here. Revisions of this code generate subsecond response times for complex searches with 200 MB or more of information on systems with 500-MHz processors and more than 1 GB of RAM. The components that need to be fast -- grep and Perl -- are very fast, indeed, and the algorithm and display code stays out of the way enough to keep things fast.

Software

No special packages are needed for this project. If you've got Perl and grep installed on Linux®, you're ready to start.

Back to top

Flat-file database selection

Existing search tools for LDAP queries require the searcher to know -- or at least specify -- the correct fields for which they are searching for data. Regular-expression support is minimal at best and can often be unrecognized or unprocessable for complex queries that regular grep can handle easily. Note that the code in this article does not search the LDAP data directly, but requires an export into a flat-file database before processing can occur. The existing data store search options provided by LDAP do not have the speed and flexibility grep users are accustomed to. We will use an extract of the LDAP data and create our own free-form search engine for finding exactly the information we want.

The processing, searching, and displaying of data is a simple process compared to the free text or structured document searches. We are generally only searching directory information: an individual's name, address, phone number, etc. Using the tools developed here, you'll be able to build your own search engine that will provide much faster results than existing LDAP lookups. A critical component of this speed is to use the highly refined, efficient algorithms contained within grep. The most efficient and portable way of utilizing the speed and regular-expression capabilities of grep is to create a newline-delimited flat file from the contents of your LDAP directory.

Back to top

Building the flat-file database

In our organization, it is comparatively easy to extract the LDAP data into a format like:

Listing 1. LDAP -- one record, multiple lines


dn: uid=123456897,c=us,ou=bluepages,o=ibm.com
objectclass: person
objectclass: organizationalPerson
objectclass: ibmPerson
objectclass: ePerson
objectclass: top
internalmaildrop: MARKET ST
personaltitle: Ms.
mail: developerWorks@us.ibm.com
uid: 123456897
...

Each of the individual records from the LDAP database will be separated by a blank line. What we need is a newline-delimited flat file, where each full record is contained on one line. If your LDAP data is in a format like that shown above, use the compileLineByLine.pl script to create the one-record-per-line flat file.

The compileLineByLine.pl script is listed below or you can download it. You may need to modify the code to work with the particulars of your LDAP data.

Listing 2. compileLineByLine.pl script


#!/usr/bin/perl -w
# compileLineByLine.pl - build a line by line record of LDAP data
use strict;

my $lastFlip = 0; # flipper for end of record output, start of record

while( my $ldapLine = <STDIN>)
{
  # ignore meta-data "object" lines
  next if( $ldapLine =~ / objects\n/ );
  next if( $ldapLine =~ / Ok\n/      );
  next if( $ldapLine =~ / object\n/  );

  if( $ldapLine eq "\n" )
  { 
    # if a blank line and "in-record", print out the end of record
    print "##\n" if $lastFlip == 1;
    $lastFlip = 0;
  }else
  { 
    # if not a blank line, print out the field, set "in-record" 
    chomp($ldapLine);
    print qq{##$ldapLine};
    $lastFlip = 1;
  }#if newline

}#while stdin

Run the script with the command cat <data_file> | perl compileLineByLine.pl, and it will produce output like:

Listing 3. Line-by-line LDAP records


##alternatenode: RHQVM19##internalmaildrop: MD 1329##pdif: 1##tieline: 82...
##pdif: 1##tieline: 930-5888##preferredfirstname: Christine##mail: crothe...
##additional: Mobex:274571##alternatenode: UKVM##internalmaildrop: 135##p...
##alternatenode: RALVM14##internalmaildrop: 1034##pdif: 1##tieline: 793-0...
##additional: MAIL-ADDR:(PO Box 12195, 3039 Cornwallis Road RTP, NC 27709...
##alternatenode: RALVM17##internalmaildrop: 007-1S023##pdif: 1##tieline: ...
##alternatenode: RALVM14##tieline: 273-3598##pdif: 1##preferredfirstname:...
##additional: MAIL-ADDR:(PO Box 12195, 3039 Cornwallis Road RTP, NC 27709...
##additional: MAIL-ADDR:(PO Box 12195, 3039 Cornwallis Road RTP, NC 27709...
...

Database mavens will no doubt be aghast at this apparent retrograde into mainframe history. Have no fear, as the algorithms and regular-expression building code here are easily adaptable to your relational database environment. This ## field delimited and \n record delimited flat file is simply the most elegant and portable approach for our purposes. Again, by using grep, this approach builds on the brilliant hackers who have optimized grep for speed and efficiency. In addition, many versions of Linux will load the entire flat file into memory during the grep run, vastly enhancing the performance on subsequent runs.

The example data above shows delimiter (##), field name: data value, delimiter, field name, etc. Including the field names as part of the data itself may seem redundant, but is an excellent method for searching for the presence of fields in certain records. For example, some records have IPTelephoneNumber as a field name, and some do not. Allowing the free-text search for IPTelephoneNumber will list every person in the company with IP telephone access. Including the field name in the record with the data is a great way to track changes. Simply append the date of insertion onto the field name, and you can have a timeline of changes on a per-record basis. To see what sort of field-name distributions you have among your records, try this:

cat <data_file> | perl -lane '@a=split " ";$h{$a[0]}++;END{for(keys %h)\
{print "$h{$_} $_"}}' | sort -nr

where data_file is your LDAP directory output of one field per line and record separated by an empty line. The above line will give you a sorted frequency distribution of your field names, similar to the following.

Listing 4. LDAP field-name frequency distributions


505 objectclass:
491 cn:
357 givenname:
113 mail:
101 serialnumber:
99 div:
98 buildingname:
97 worklocation:
97 workloc:
97 physicaldeliver

Back to top

Searching the database with cmdSearch.pl

With our flat file in place, we can begin the creation of our searching code. The cmdSearch.pl program contained in Download is designed to be run on the command line with a series of words specified as the search string. We will start with an overview of the program, followed by a sectional description of each component.

The first step is to create a compatible regular expression out of each query word, although modification is only required if the word contains a wildcard character: *. With the regular expression-compatible words in place, the first grep is performed on the flat file itself, followed by a refining of the results with Perl's grep. When all of the query words have been checked, the final records are processed and displayed to the user in a simplified format suitable for contact lookups.

Let's get started with the code in the declaration section.

Listing 5. cmdSearch.pl declaration section


#!/usr/bin/perl -w
#cmdSearch.pl - return LDAP data from command line query
use strict;

die "usage cmdSearch.pl 'query w*rds here'" if ( @ARGV == 0 );

my $initQuery = "@ARGV";
my $searchQuery = $initQuery;
my $grepFile = "data/10head";
my @queryWords = ();
my %fieldHash = ();

Skipping ahead to the main program body:

Listing 6. cmdSearch.pl main program


# remove extraneous spaces, + signs
$searchQuery =~ s/\s+$//g;
$searchQuery =~ s/^\s+//g;
$searchQuery =~ s/\+//g;

@queryWords = split " ", $searchQuery;

buildHashPrintResults( alg_N_Word( @queryWords ) );

We can see that any leading or trailing spaces are removed, then the + signs are escaped to remove potential interference with the regular-expression search. After passing in the query words array to the N word algorithm, we'll print out the results.

Let's get started with the alg_N_Word subroutine.

Listing 6. alg_N_Word subroutine


sub alg_N_Word
{
  my @regexpWords = ();

  for my $qPiece ( @_ )
  { 
    push @regexpWords, createRegexp( $qPiece );
  }
  
  my @step1Recs = `grep -i -E "$regexpWords[0]" $grepFile`;
  for my $rWord ( @regexpWords )
  { 
    next if $rWord eq $regexpWords[0];
    my @step2Recs = ();
    @step2Recs = grep( /($rWord)/i, @step1Recs );
    @step1Recs = ();
    @step1Recs = @step2Recs;
  }#for each regexp word
  
  return( @step1Recs );
}#alg_N_Word

While leaving the original query words array intact, we build a regular expression for each word. Starting with the first regular expression, create an array of records from the data file, then loop through the remaining regular expressions to refine the results. Finish by returning the final array of matching records from the flat file. The first subroutine call in the alg_N_Word function is to createRegexp. Let's take a look.

Listing 8. createRegexp subroutine, Section 1


sub createRegexp
{
  my $inQuery = $_[0];
  my $localQuery = "";
  my $returnStr = "";
  my $astCount = ($inQuery =~ tr/\*// );
  my $longPart = "";
  my @wordParts = split '\*', $inQuery;
  my $breakString = '\b';
  
  $inQuery =~ s/\./\\\./g;  # replace . with \.

  # if no wildcards, return the plain words
  return( $inQuery ) if ( $astCount == 0 );

We create some variables and count the number of asterisks in the incoming "word." The Web search engine-style interface used for processing the queries is designed to handle multiple asterisks in the query. After escaping the '.' character, the subroutine will return an otherwise unmodified string if there were no asterisks found. If one or more asterisks are present, the subroutine will continue its processing to create the regular expressions.

Listing 9. createRegexp subroutine, Section 2


  # determine the longest part of the string to search for
  for( @wordParts )
  {
    next if ( length($_) < length($longPart) );
    $longPart = $_;
  }

  # if an asterisk is present in the query, build the regular expression
  if( $astCount == 1 )
  {
    # examples: *sam, sam*, sam*l
    if(     substr( $inQuery, 0, 1 ) eq '*' )
    {

# this is a (any word char) one or more times, (query), word boundary - *sam
      $localQuery = "(" . '\w' . ")+($longPart)" . '\b';

    }elsif( substr($inQuery, length($inQuery)-1,1) eq '*' ){

# this is word boundary, query, (any word char) one or more times - sam*
      $localQuery = '\b' . "($longPart)(" . '\w' . ")+";

    }else{

# word boundary, query, (any word char) one or more times,  query, word boundary sam*l
      $localQuery = '\b' . "($wordParts[0])(" . '\w' . ")+($wordParts[1])" . '\b';

    }#if a single asterisk is at beginning, end or middle

This code is focused on creating regular expressions where only one asterisk is present. Examples of a one-asterisk query include *sam, sam* and sam*l. The first section of the code finds the longest part of the query word, which is used for insertion into the regular expression. If the asterisk appears at the beginning of a query word, we want to consider that to mean that the word needs to end in the specified query. For example, if the query is *sam, we want to make sure it matches iamsam, but not iamsamson.

The regular expression built requires a word boundary at the end of the built expression to ensure this behavior. Similarly, if the asterisk appears at the end of the query word, the program will treat that as requiring the word to begin with the specified query. In this manner, sam* will match samson, but not iamsamson.

Finally, if the query word has an asterisk in the middle, require the query word to match at the beginning and end. sam*l will, therefore, match samuel, but not iamsamuel or samueliam. Listing 10 details the steps to follow if there is more than one asterisk per-query word.

Listing 10. createRegexp subroutine, Section 3


  }elsif( $astCount > 1 ){
    # examples: s*m*l, *am*l, sa*m*

    for my $partChunk( @wordParts ) {
      next if( $partChunk eq "" );
      $breakString .= "($partChunk)(". '\w' . ")+";
    }#for each part of the query

    if( substr($inQuery, length($inQuery)-1,1) ne '*' ){

      # if the last characters is not a *, remove the (any word char) section 
      # from the end, and replace with a word boundary
      $breakString = substr($breakString,0,length($breakString)-5);
      $breakString .= '\b';

    }#if not an asterisk at the end

    if( substr($inQuery, 0, 1)  eq '*' ){

      # if beginning is a asterisk, remove the word starting boundary
      $breakString = substr($breakString,2);

    }#if asterisk at the beginning

    $localQuery = $breakString;

  }#count asterisks in the query 

  return($localQuery);
}#createRegexp

Regardless of the content of each piece of the current word query, create a regular expression that searches for that chunk followed by a word character one or more times. This process occurs for any multiasterisk query, then is post-processed to ensure modification of the start and end of the regular expression to match the expected criteria. For example, with a query, s*m*l, after the first for loop, the breakString variable would contain the value (s)(\w)+(m)(\w)+(l)(\w)+. The next if statement will remove the trailing "any word character section" to produce (s)(\w)+(m)(\w)+(l). The last if statement is false in this case (as the query word did not begin with an asterisk), so the built regular expression is returned to the alg_N_Word subroutine. After the alg_N_Word subroutine has completed, and returned the searched and refined records, the buildHashPrintResults subroutine is called.

Let's take a look at the subroutine and describe how it works.

Listing 11. buildHashPrintResults


sub buildHashPrintResults
{
  for my $oneRec ( @_ )
  {
    chomp($oneRec);
    my @delRecs = split "##", $oneRec;
    shift(@delRecs); # first field is empty
    
    for my $fld ( @delRecs  )
    { 
      #example data: additional: MAIL-ADDR:(PO...
      my $key = substr($fld, 0, index($fld,':') );
      my $val = substr($fld, index($fld,':')+1 );
     
      $fieldHash{$key} .= ", " if( exists($fieldHash{$key}) );
      $fieldHash{$key} .= "$val";
    }
    
    print getSelectedFields();
    %fieldHash = ();
  }#for each line 

}#buildHashPrintResults

The @_ variable is a list of all records that have matched the search criteria. For each of these records, we will extract the fields by name and store each of the values in a hash keyed on the field name. Recall that the fields are delimited by ##, and the subroutine above puts all of the fields for one record into the delRecs array. For each of these fields, gather the field name as everything before the first colon and the value as everything after. Append the value to the hash value keyed on the field name if it already exists, as this will enable the simple print out of every combination of the field no matter which of the fields matched. For example, you can match ju*y in the cn field if the name is jane, but the nickname also stored in the cn field is judy or july. Combining multiple field names into one variable for printing allows you to see the actual name and the nicknames with one search.

Now that the hash is built for the current record, print out the selected fields. The subroutine getSelectedFields below shows what these selected fields are for our example.

Listing 12. getSelectedFields


sub getSelectedFields
{
  my @desiredFields = split " ",
    "mail telephonenumber physicaldeliveryofficename co cn buildingname";
  my $returnStr = "";

  # print the desired fields
  for my $key ( @desiredFields ){ $returnStr .= "$key: $fieldHash{$key}\n"; }

  # find other fields where search terms exist
  for my $searchWord( @queryWords )
  {
    for my $searchKey ( keys %fieldHash )
    {

      if( $fieldHash{$searchKey} =~ /$searchWord/i )
      {
        # word found in value, make sure it's not already printed as part 
        # of the desiredFields
        next if( "@desiredFields" =~ /$searchKey/i );
        $returnStr .= "$searchKey: $fieldHash{$searchKey}\n";
      }#if match found

    }#for each key in the field hash
  }#for each search word

  $returnStr .= "\n";
  return($returnStr);
}#getSelectedFields

For each of the field names specified in the desiredFields array, search the built fieldhash for any of the words specified in the search criteria. If a match is found, check to ensure that a match found for the same search word in a different field has not already been printed. If it's the first match found, add it to the string to be printed. For example, with the same data shown in Listing 3, a search for 930-5* will produce a search result of:

Listing 13. 930-5* search output


mail:  chrisQDevel1@us.ibm.com
telephonenumber:  1-877-848-8888
physicaldeliveryofficename:  HOME
co:  USA
cn:  Christine Q. Micham,  Chris D. Micham
buildingname:  131
tieline:  930-5888

Note how the getSelectedFields subroutine prints the specified fields, as well as any that match the search criteria -- if they have not already been matched. You can also see from the above example the printout of all the cn: (command name) values.

Back to top

Basics in place

Share this...


	Digg this story

	Post to del.icio.us

	Slashdot it!

You now have the basic LDAP search engine in place. You can use the simple regular-expression building tools here in many avenues besides LDAP search. Consider tuning the regular-expression generation or wildcard syntax to your specific environment. The same algorithms and free-form text searching described here will also work on customer databases and other small-scope data repositories.

In Part 2 of this series, weighted scoring and sorting of the search results are introduced, along with metaphone matching to correct common spelling mistakes for your dataset.

Back to top

Download

Description	Name	Size	Download method
Source code	os-ldap.SearchPart1.zip	2KB	HTTP


	Information about download methods

Resources

Learn

Check out the Perl and LDAP source.
Paul Dwerryhouse wrote an article for Linux Journal about Perl and LDAP titled "An Introduction to perl-ldap."
There's plenty of background and implementation knowledge available at OpenLDAP.
Browse all the open source content on developerWorks.
To listen to interesting interviews and discussions for software developers, check out developerWorks' podcasts.
Stay current with developerWorks' technical events and webcasts.
Check out upcoming conferences, trade shows, webcasts, and other Events around the world that are of interest to IBM open source developers.
Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
Visit Safari Books Online for a wealth of resources for open source technologies.

Get products and technologies

Innovate your next open source development project with IBM trial software, available for download or on DVD.

Discuss

Participate in developerWorks blogs and get involved in the developerWorks community.

About the author


		Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.

Rate this page

Back to top

About IBM

Privacy

Contact