This is an archived cached-text copy of the developerWorks article. Please consider viewing the original article at: IBM developerWorks




Create automated verbal conversation annotations for phone numbers, acronyms, and other spoken words

Use Sphinx-4, a custom dictionary, and text-processing tools to extract relevant data from conversations


Level: Intermediate

Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM 

13 Nov 2007

Use the open source Sphinx-4 speech-recognition package to capture letters and numbers from spoken conversations in near real time to create notes. Employ a custom Sphinx-4 dictionary file to extract likely matches to spoken letters and numbers. Process the text for higher order values, such as phone numbers and acronyms, and create a meeting annotator through search-engine lookups and local databases.

The Carnegie Mellon University Sphinx project creates open source speech-recognition tools for developers and users. This article uses the Sphinx-4 code base to provide automatic recognition of a very small dictionary of common letters and numbers. Converting this spoken information to text and processing the strings for certain data structures, such as phone numbers and acronyms, allows for the creation of an automated descriptive annotation of verbal conversations.

One of the more useful places to apply this project is a teleconference annotation application. The next time you join a development meeting, fire up your conversation annotator, and you can have individuals looked up automatically from their phone numbers as they are spoken, or see what the acronym of the day means according to a Web search engine. You won't have to stop what you are doing to type in the latest acronym or employee serial number mentioned in the meeting to find the associated data. Sphinx-4 and the conversation annotator we build here can take care of a large portion of the drudgery for you.

Requirements

Hardware

Sphinx is very resource-intensive, and, as a result, you will need fast hardware to make the software useful. A large heap of dedicated memory is required for useful performance, so plan on running the Sphinx application on an Intel® Pentium® 4-class machine with at least 1 GB of RAM. By contrast, the text-processing hardware requirements are negligible and can be run on the same machine without affecting the performance of the speech-recognition processing.

Software

You can run the applications we create in this article on hardware running Linux® or Microsoft® Windows®. Sphinx-4 depends on a recent JDK and Apache Ant to create a custom grammar processor. We need Perl and the associated lookup modules of your choice. See the Resources section for links to learn more about and download the software packages mentioned.





Installing Sphinx-4

Sphinx comes in many forms for various types and capabilities of speech recognition. This article makes use of the Sphinx-4 package, which is the most user- and developer-friendly of the recent releases. Installing Sphinx-4 can be intimidating, so consider the following steps highlighted from the installation instructions:

  1. Download and extract Apache Ant.
  2. Download and extract the Sun JDK (as of this writing, V1.6.0_02 appears to be the current release).
  3. Download and extract the Sphinx-4 source package because we'll be modifying one of the demo programs to suit our purposes.
  4. Set up your environment variables with the following commands:
    export ANT_HOME=${PWD}/apache-ant-1.7.0
    export JAVA_HOME=${PWD}/jdk1.6.0_02
    export PATH=${PATH}:${ANT_HOME}/bin
    

    On Windows, you may need to set up your environment variables under Control Panel > System > Advanced > Environment variables.
  5. Change to the sphinx4-beta directory, then to the lib subdirectory.
  6. Activate the JSAPI binary license by running the jsapi.sh shell script. Sphinx-4 provides support for JSAPI with a binary license, so you'll need to accept the agreement.
  7. You may be asked to install uudecode to unpack the components that JSAPI requires. Most Linux distributions have a package that includes uudecode in some form, so consider checking your available packages first if a uudecode installation is required. On Windows, double-click the jsapi.exe file and accept the license agreement.
  8. Back out and change to the main Sphinx4 directory.
  9. Run the command ant, and the build process should begin.

The status message of "BUILD SUCCESSFUL" means you've got your environment set up correctly and you're ready to move on to modification steps. If you receive a different message, check your build directory and environment variables or consult the Apache Ant and Sphinx-4 documentation for detailed installation instructions for your environment.





Strategy for extracting letters and numbers from speaker-independent voices

Speech recognition is a technology that always seems to be two to ten years away from speaker-independent recognition of a large vocabulary. Annotating a meeting with multiple voices, overlapping speech, globally influenced accents, and a broad range of technical and colloquial vocabularies is nearly impossible for any consumer-level software on the market. Sphinx, and specifically the Sphinx-4 package, delivers all the options we need to reliably recognize a very small (yet useful) vocabulary in a speaker-independent context.

We've already specified our limited vocabulary: the letters A-Z and numbers 0-9. Our strategy is to simply extract any location where these letters or numbers are uttered. A common description for this approach is word spotting. Although Sphinx-4 does not currently support word spotting, we can still achieve useful results by forcing all utterances to match at least one of the words in the grammar. Once we have this list of best-guess letters and numbers, we can apply standard text-processing tools and informational lookups to extract useful information.





Custom dictionary, modification of Hello World example

Creation of dictionary file

The first step in creating the pseudo word-spotting setup is to build the desired dictionary file. In the Sphinx-4 directory tree, there is a directory called bld/models/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/. This directory contains the alpha.dict and digit.dict dictionary files. At first glance, it appears that combining these two dictionary files will produce the desired file. This is not the case, however, as we'll need to build our dictionary files from the cmudict.0.6d file in the same directory.

Change to the bld/models/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/ directory and issue these commands to build the desired dictionary file:

perl -ne 'print if( /^[A-Z]\ / )'                       cmudict* >  alN.dict
perl -ne 'print if(/^(ZERO|ONE|TWO|THREE|FOUR)[ (]/)'   cmudict* >> alN.dict
perl -ne 'print if(/^(FIVE|SIX|SEVEN|EIGHT|NINE)[ (]/)' cmudict* >> alN.dict

Listing 1 shows the alN.dict file as created with simple letters and numbers as the sole part of the dictionary.


Listing 1. Snippet from alN.dict dictionary file
                
...
W                    D AH B AH L Y UW
X                    EH K S
Y                    W AY
Z                    Z IY
FOUR                 F AO R
ONE                  HH W AH N
ONE(2)               W AH N
THREE                TH R IY
...
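
Before moving on, it can help to confirm that the new dictionary covers every word the grammar will use. The following is a minimal sketch, not part of the article's sample code: the script name checkDict.pl and the subroutine missingWords are hypothetical, and it assumes only the "WORD  PHONES" line layout seen in Listing 1, with alternate pronunciations marked by a suffix such as (2).

```perl
#!/usr/bin/perl
# checkDict.pl - hypothetical helper: report grammar words that have no entry
#                in a Sphinx-style dictionary ("WORD  PHONES..." per line)
use strict;
use warnings;

# Words the grammar in this article relies on: A-Z plus ZERO..NINE.
my @wanted = ( 'A' .. 'Z',
               qw(ZERO ONE TWO THREE FOUR FIVE SIX SEVEN EIGHT NINE) );

# Return the wanted words that have no entry in the given dictionary lines.
sub missingWords
{
  my @lines = @_;
  my %seen;
  foreach my $line ( @lines )
  {
    # The first field is the word and must be followed by at least one phone.
    next unless $line =~ /^(\S+)\s+\S/;
    # Strip an alternate-pronunciation suffix such as "(2)".
    ( my $word = uc $1 ) =~ s/\(\d+\)$//;
    $seen{$word} = 1;
  }
  return grep { !$seen{$_} } @wanted;
}

# When a file name is given on the command line, check that file.
if( @ARGV )
{
  my @lines = <>;
  my @missing = missingWords( @lines );
  print @missing ? "Missing: @missing\n" : "All 36 words present\n";
}
```

Run it as perl checkDict.pl alN.dict from the dict directory. Any word it reports as missing would be one the grammar references without a pronunciation in the dictionary.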

Modification of Hello World example

Sphinx-4 provides many configuration options to meet almost any need in the field of speech recognition. For our purposes, the most efficient approach is to simply modify the existing Hello World example. Under the Sphinx-4 root directory, change to the demo/sphinx/helloworld directory and edit the helloworld.config.xml file. Listing 2 shows the one line of change required to use the alN.dict dictionary file we built.


Listing 2. helloworld.config.xml changes
                
original (line 114):
        <property name="dictionaryPath"
   value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>

new:
        <property name="dictionaryPath"
   value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/alN.dict"/>

Modifications are also necessary to the hello.gram grammar file in the same directory. Listing 3 shows the changes required to pick up just the letters and numbers in our dictionary file.


Listing 3. hello.gram changes
                

original:
public <greet> = (Good morning | Hello) 
( Bhiksha | Evandro | Paul | Philip | Rita | Will );

new:
public <greet> = ( zero | one | two | three | four | five | six | seven | eight | nine |
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v |
w | x | y | z ) * ;

You'll also need to make a cosmetic change to the HelloWorld.java file, as shown below.


Listing 4. HelloWorld.java change
                
original (line 59):
    System.out.println
        ("Say: (Good morning | Hello) " +
                     "( Bhiksha | Evandro | Paul | Philip | Rita | Will )");

new:
    System.out.println
        ("Listening for letters and numbers");

With the above changes in place, you can build and run the modified example. Change to the Sphinx-4 home directory and issue the command ant (the same "BUILD SUCCESSFUL" message will let you know if your changes were correct). On Linux, run the updated example with the command $JAVA_HOME/bin/java -mx312m -jar bin/HelloWorld.jar. On Windows, the command is java -mx312m -jar bin/HelloWorld.jar. Speak this sentence: "The phone number for IBM tech support is one eight zero zero four two six seven three seven eight," and you should see output like that shown below:

f o nine r four i b m x a four t one eight zero zero four two six seven three seven eight





Text processing

As you can see, the sentence uttered will be processed for letters and numbers semi-correctly. The letters "IBM" and the numbers in the phone number are recognized correctly, but the remainder of the words are incorrectly categorized as various letters and numbers that are the best match for a particular sound.

You may be asking yourself: Why not simply use a multithousand-word dictionary to recognize those incorrect best guesses? After all, Sphinx-4 provides large vocabulary dictionaries and language models. Why not simply configure the demonstration example to recognize the remaining: "The phone number for tech support is" and any other words that may be uttered?

The answer is that Sphinx-4 is good, but not perfect. Expanding the dictionary file to recognize hundreds of thousands of words will drastically reduce the effectiveness of the simple number and letter matching. You can test this yourself by checking some of the other programs in the Sphinx-4 "demo" directory or by modifying the existing examples to use large dictionary files and expanded grammar lists. Post-processing the text of only letters and numbers for higher-order data is a much easier way to develop a useful annotation system with available open source systems.

Two simple rules make it straightforward to extract acronyms and phone numbers from the output text: any three consecutive letters are considered an acronym, and any run of five or more digits is considered a phone number. Listings 5, 6, and 7 show the components of the annotateAcrNum.pl program that perform these extractions and lookups:


Listing 5. annotateAcrNum.pl part 1 — Main program logic
                
#!/usr/bin/perl -w
# annotateAcrNum.pl - extract and lookup acronyms and numbers from speech 
#                     recognition text output
use strict;
use Yahoo::Search;
use Net::Dict;
$|=1;  # non buffered output for better user feedback

my %numHash =
("zero" => "0",
"one"   => "1",
"two"   => "2",
"three" => "3",
"four"  => "4",
"five"  => "5",
"six"   => "6",
"seven" => "7",
"eight" => "8",
"nine"  => "9" );

while( my $line = <STDIN> )
{
  print "$line" if( $line =~ /(Start|You said:)/ );

  next unless ( $line =~ /You said:/ );
  my @words = split " ", substr($line,9);  # ignore the "You said:" prefix

  my @numArr = ();
  my @letArr = ();

  foreach my $chunk ( @words )
  {
    if( length($chunk) == 1 )
    { 
      phoneNmSearch(@numArr) if( @numArr > 4 );
      @numArr = ();

      push @letArr, $chunk;

      if( @letArr > 2 )
      { 
        acronymSearch( @letArr );
        shift( @letArr );
      }

    }elsif( length($chunk) > 1 )
    { 
      push @numArr, $numHash{$chunk};
      @letArr = ();
    }#if length greater
  }#for each word

  phoneNmSearch( @numArr ) if( @numArr > 4 );
  acronymSearch( @letArr ) if( @letArr > 2 );

}#while stdin

The main program logic above searches for letter and number strings matching our simplistic criteria. For each line of speech-recognition text output by the modified Hello World code, the program builds separate arrays of consecutive letters and consecutive numbers. The letters array is searched using the acronymSearch subroutine described below. Note that the letters array is shifted after each acronym lookup in order to search for both "ibm" and "bmx" from the string "i b m x." The numbers array does not perform this same position shift. Instead, the program takes the longest run of digits it can find and performs a Web search on it.
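
To see which candidates these rules produce without contacting any lookup service, the scanning logic can be exercised on its own. The sketch below is an offline approximation rather than the article's program: the subroutine name findCandidates is hypothetical, it collects candidates instead of calling acronymSearch and phoneNmSearch, and it silently ignores any multi-character word that is not a digit name, which Listing 5 does not do.

```perl
#!/usr/bin/perl
# findCandidates.pl - offline sketch of the Listing 5 scanning rules
use strict;
use warnings;

my %digitFor = ( zero => 0, one => 1, two => 2, three => 3, four => 4,
                 five => 5, six => 6, seven => 7, eight => 8, nine => 9 );

# Given recognizer words, return ( \@acronyms, \@numbers ) with no lookups.
sub findCandidates
{
  my @words = @_;
  my ( @acronyms, @numbers, @letters, @digits );

  foreach my $word ( @words )
  {
    if( exists $digitFor{$word} )
    {
      push @digits, $digitFor{$word};
      @letters = ();                       # a digit breaks a letter run
    }
    elsif( length($word) == 1 )
    {
      push @numbers, join( "", @digits ) if( @digits > 4 );
      @digits = ();                        # a letter ends a digit run
      push @letters, $word;
      if( @letters > 2 )
      {
        push @acronyms, join( "", @letters );
        shift @letters;                    # slide the three-letter window
      }
    }
  }
  # a digit run may extend to the end of the line
  push @numbers, join( "", @digits ) if( @digits > 4 );
  return ( \@acronyms, \@numbers );
}

my $said = "f o nine r four i b m x a four t one eight zero zero four "
         . "two six seven three seven eight";
my ( $acr, $num ) = findCandidates( split " ", $said );
print "acronyms: @$acr\n";   # prints: acronyms: ibm bmx mxa
print "numbers:  @$num\n";   # prints: numbers:  18004267378
```

For the sample sentence, the short digit runs ("nine", "four") fall below the five-digit threshold and are dropped, while the sliding window yields "ibm", "bmx", and "mxa" as acronym candidates.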


Listing 6. annotateAcrNum.pl part 2 — acronymSearch
                
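sub acronymSearch
{
  my $str = join "", @_;   # e.g. ("i","b","m") becomes "ibm"

  # connect to the dictionary server; skip the lookup if it is unreachable
  my $dict = Net::Dict->new('dict.org') or return;

  my $eref = $dict->define($str);
  return unless( $eref && @$eref );   # no definitions found
  foreach my $entry (@$eref)
  {   
      my ($db, $definition) = @$entry;
      next if (   !(defined($definition)) || !(defined($db))  );
      # print only the databases that give relatively terse descriptions
      if( $db =~ /(wn|vera|gazetteer|foldoc)/ ){ print "$db: $definition\n" }
  }#for each definition

}#acronymSearch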

Subroutine acronymSearch makes use of the helpful Net::Dict module. Simply specify a dictionary server and a query to look up in the large variety of databases available. The regular expression /(wn|vera|gazetteer|foldoc)/ limits the printout to those databases that provide relatively terse descriptions. You may find that your acronym space is better represented by other databases available at dict.org, in which case you can adjust or remove this regular-expression filter.


Listing 7. annotateAcrNum.pl part 3 — phoneNmSearch
                
sub phoneNmSearch
{
  my $str = join "", @_;   # e.g. (1,8,0,0,...) becomes "1800..."

  # add dashes to common phone number lengths for better search results
  if( length($str) == 11 )
  {
    $str =~ /(\d)(\d\d\d)(\d\d\d)(\d\d\d\d)/;
    $str = "$1-$2-$3-$4";
  }elsif( length($str) == 10 )
  {
    $str =~ /(\d\d\d)(\d\d\d)(\d\d\d\d)/;
    $str = "$1-$2-$3";
  }elsif( length($str) == 7 )
  {
    $str =~ /(\d\d\d)(\d\d\d\d)/;
    $str = "$1-$2";
  }
  print "Results for: $str\n";

  my @results = Yahoo::Search->Results(Doc => "$str", AppId => "PhNmLookup" );
  warn $@ if $@; # report any errors
  
  my $recCount = 0;
  for my $res (@results)
  {   
      print "Title: ", $res->Title, " \n";
      print $res->Summary, "\n";
      print $res->Url, "\n";
      print "\n";
      last if( $recCount > 1 ); # print first 3 results only
      $recCount++;
  }#for each result

}#phoneNmSearch

For certain search engines, formatting the phone number digits produces drastically more accurate search results. The first portion of the phoneNmSearch subroutine performs this formatting, for example changing 18004267378 to 1-800-426-7378 or 4152042 to 415-2042. The formatted phone number is then used as the query in a Yahoo! search using Jeffrey Friedl's handy Yahoo::Search Perl module.
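
The formatting step is easy to try in isolation. The following sketch pulls it into a standalone subroutine with the hypothetical name formatPhone. It applies the same three length-based groupings and, as an assumption of this sketch, returns any other length unchanged.

```perl
#!/usr/bin/perl
# formatPhone.pl - sketch of the digit-grouping step on its own
use strict;
use warnings;

# Group a digit string by its length; other lengths are returned unchanged.
sub formatPhone
{
  my $digits = shift;
  return "$1-$2-$3-$4" if( $digits =~ /^(\d)(\d{3})(\d{3})(\d{4})$/ );  # 11 digits
  return "$1-$2-$3"    if( $digits =~ /^(\d{3})(\d{3})(\d{4})$/ );      # 10 digits
  return "$1-$2"       if( $digits =~ /^(\d{3})(\d{4})$/ );             #  7 digits
  return $digits;
}

foreach my $number ( qw(18004267378 4152042 12345) )
{
  print "$number -> ", formatPhone( $number ), "\n";
}
# prints:
# 18004267378 -> 1-800-426-7378
# 4152042 -> 415-2042
# 12345 -> 12345
```

Anchoring each pattern with ^ and $ means the length test and the grouping happen in one step, so a string of an unexpected length can never be partially matched.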

With your custom Sphinx-4 speech recognition and the annotateAcrNum Perl program, you're ready to start annotating spoken conversations. Run the annotator with the command $JAVA_HOME/bin/java -mx312m -jar bin/HelloWorld.jar | perl annotateAcrNum.pl (on Linux). For Windows, the command is java -mx312m -jar bin/HelloWorld.jar | perl annotateAcrNum.pl.

Figure 1 shows the output of the annotator setup in "Terminal" on Vector Linux. Note the underlined link text available to launch pages based on the Web search results.


Figure 1. Conversation annotator screenshot in Terminal on Vector Linux




Conclusion, further examples


The queries and databases chosen in this article are just general examples of useful annotation. You may find that using Google for your Web search lookups is more effective, or you can link your phone number lookups to your employer's address book. Similar options are available for the methods chosen to extract higher-order data from the recognized letters and numbers. Perhaps your conversations focus more on IP addresses or employee serial numbers. Using some of the techniques described, you can extract your dotted quads and unique identifiers, and link the lookups to your own databases.
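
As one illustration of such an extension, the sketch below turns recognized words into dotted quads. It rests on assumptions that go beyond this article: a word "dot" would have to be added to both the dictionary file and the grammar, speakers would have to say each digit individually, and the subroutine name findDottedQuads is hypothetical.

```perl
#!/usr/bin/perl
# findDottedQuads.pl - sketch: extract IP-style dotted quads from recognizer
#                      words, assuming "dot" has been added to the vocabulary
use strict;
use warnings;

my %digitFor = ( zero => 0, one => 1, two => 2, three => 3, four => 4,
                 five => 5, six => 6, seven => 7, eight => 8, nine => 9 );

sub findDottedQuads
{
  # Map digit names to digits and "dot" to "."; any other word becomes a
  # space so that it breaks up a candidate address.
  my $text = join "", map { $_ eq "dot"            ? "."
                          : exists $digitFor{$_}   ? $digitFor{$_}
                          :                          " " } @_;
  my @quads;
  while( $text =~ /(?<![\d.])(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?![\d.])/g )
  {
    # keep the match only if every octet is in the range 0-255
    push @quads, "$1.$2.$3.$4" unless grep { $_ > 255 } $1, $2, $3, $4;
  }
  return @quads;
}

my @words = qw(i p one nine two dot one six eight dot zero dot one z);
print "$_\n" for findDottedQuads( @words );   # prints: 192.168.0.1
```

The resulting strings could then be passed to whatever lookup suits your environment, in the same way the article passes phone numbers to a Web search.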

Sphinx-4 also provides many options for enhancing the effectiveness of speech recognition. Consider training your own acoustic models for yourself and members of your team to achieve a much higher accuracy rate. Expand the dictionary file to include tens of thousands of commonly spoken words and test Sphinx-4's real-time transcription qualities.






Download

Description    Name                                     Size    Download method
Sample code    os-sphinxspeechrecAnnotations_0.1.zip    4KB     HTTP


Resources

Learn

Get products and technologies

Discuss


About the author

Nathan Harrington

Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.



