Level: Intermediate
Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM
13 Nov 2007
Use the open source Sphinx-4 speech-recognition package to capture letters and
numbers from spoken conversations in near real time to create notes. Employ a custom
Sphinx-4 dictionary file to extract likely matches to spoken letters and numbers.
Process the text for higher order values, such as phone numbers and acronyms, and create
a meeting annotator through search-engine lookups and local databases.
The Carnegie Mellon University Sphinx project creates open source
speech-recognition tools for developers and users. This article uses the Sphinx-4 code
base to provide automatic recognition of a very small dictionary of common
letters and numbers. Converting this spoken information to text and processing the
strings for certain data structures, such as phone numbers and acronyms, allows for the
creation of an automated descriptive annotation of verbal conversations.
One of the more useful areas in which to implement this project is in a teleconference
annotation application. Next time you join a development meeting, fire up your
conversation annotator, and you can have automatic lookups of individuals based on their
phone numbers when spoken in the meeting, or see what the acronym of the day is
according to a Web search engine. You won't have to stop what you are doing to enter in
the latest acronym or employee serial number mentioned in the meeting to find out the
associated data. Sphinx-4 and the conversation annotator we build here can take care of
a large portion of the drudgery for you.
Requirements
Hardware
Sphinx is very resource-intensive, and, as a result, you will need fast hardware to
make the software useful. A large heap of dedicated memory is required for useful
performance, so plan on running the Sphinx application on an Intel® Pentium® 4-class
machine with at least 1 GB of RAM. By contrast, the text-processing hardware requirements are
negligible and can be run on the same machine without affecting the performance of the
speech-recognition processing.
Software
You can run the applications we create in this article on hardware running Linux® or
Microsoft® Windows®. Sphinx-4 depends on a recent JDK and Apache Ant
to create a custom grammar processor. We need Perl and the associated lookup modules of
your choice. See the Resources section for links to learn more
about and download the software packages mentioned.
Installing Sphinx-4
Sphinx comes in many forms for various types and capabilities of speech recognition.
This article makes use of the Sphinx-4 package, which is the most user- and
developer-friendly of the recent releases. Installing Sphinx-4 can be intimidating, so
consider the following steps highlighted from the installation instructions:
- Download and extract Apache Ant.
- Download and extract the Sun JDK (as of this writing,
V1.6.0_02 appears to be the current release).
- Download and extract the Sphinx-4 source package because we'll be modifying one of the demo programs to suit our
purposes.
- Set up your environment variables with the following commands:
export ANT_HOME=${PWD}/apache-ant-1.7.0
export JAVA_HOME=${PWD}/jdk1.6.0_02
export PATH=${PATH}:${ANT_HOME}/bin
On Windows, you may need to set up your environment variables under Control Panel
> System > Advanced > Environment variables.
- Change to the sphinx4-beta directory, then to the lib subdirectory.
- Activate the JSAPI binary license by running the jsapi.sh shell script. Sphinx-4
provides support for JSAPI with a binary license, so you'll need to accept the
agreement.
- You may be asked to install uudecode to unpack the components that JSAPI
requires. Most Linux distributions have a package that includes uudecode in some form,
so consider checking your available packages first if a uudecode installation is
required. On Windows, double-click the jsapi.exe file and accept the license
agreement.
- Back out to the main sphinx4-beta directory.
- Run the command ant, and the build process should begin.
The status message of "BUILD SUCCESSFUL" means you've got your environment set up correctly and you're ready
to move on to modification steps. If you receive a different message, check your build
directory and environment variables or consult the Apache Ant and Sphinx-4
documentation for detailed installation instructions for your environment.
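If the build fails, a quick check of the environment can narrow down the cause. The following Perl sketch is an illustration only: the helper names missing_vars and missing_tools are hypothetical, and the bin/ant and bin/java paths assume a Linux layout (on Windows the executables would be ant.bat and java.exe).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the names of required variables that are unset or empty
# in the given environment hash.
sub missing_vars {
    my ($env, @names) = @_;
    return grep { !defined $env->{$_} || $env->{$_} eq '' } @names;
}

# Return the expected executables (given as [variable, relative path] pairs)
# that cannot be found under the directory named by the variable.
sub missing_tools {
    my ($env, @tools) = @_;
    my @missing;
    for my $tool (@tools) {
        my ($var, $rel) = @$tool;
        my $dir = $env->{$var};
        next unless defined $dir && $dir ne '';   # already reported by missing_vars
        push @missing, "$dir/$rel" unless -e "$dir/$rel";
    }
    return @missing;
}

my @unset = missing_vars(\%ENV, qw(ANT_HOME JAVA_HOME));
print "Not set: @unset\n" if @unset;

my @absent = missing_tools(\%ENV, [ANT_HOME => 'bin/ant'], [JAVA_HOME => 'bin/java']);
print "Not found: $_\n" for @absent;
print "Environment looks ready for the ant build\n" unless @unset || @absent;
```

Run the script from the same shell session in which you exported the variables, since variables exported in one session are not visible in another.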
Strategy for extracting letters and numbers from speaker-independent voices
Speech recognition is a technology that always seems two to 10 years away from
speaker-independent recognition of a large vocabulary. Annotating a meeting with
multiple voices, including overlapping speech, globally influenced accents, and a broad
range of technical and colloquial vocabularies, is nearly impossible for any
consumer-level software available on the market. Sphinx, and specifically the Sphinx-4
package, delivers all the options we need to reliably recognize a very small (yet
useful) vocabulary in a speaker-independent context.
We've already specified our limited vocabulary: the letters A-Z and numbers
0-9. Our strategy is to simply extract any location where these letters or numbers are
uttered. A common description for this approach is word spotting. Although
Sphinx-4 does not currently support word spotting, we can still achieve useful results
by forcing all utterances to match at least one of the words in the grammar. Once we
have this list of best-guess letters and numbers, we can apply standard text-processing
tools and informational lookups to extract useful information.
Custom dictionary, modification of Hello World example
Creation of dictionary file
The first step in creating the pseudo word-spotting setup is to build the desired
dictionary file. In the Sphinx-4 directory tree, there is a directory called
bld/models/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/ .
This directory contains the alpha.dict and digit.dict dictionary files. At first
glance, it appears that combining these two dictionary files will produce the desired
file. This is not the case, however, as we'll need to build our dictionary files from
the cmudict.0.6d file in the same directory.
Change to the bld/models/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/
directory and issue these commands to build the desired dictionary file:
perl -ne 'print if( /^[A-Z]\ / )' cmudict* > alN.dict
perl -ne 'print if(/^(ZERO|ONE|TWO|THREE|FOUR)[ (]/)' cmudict* >> alN.dict
perl -ne 'print if(/^(FIVE|SIX|SEVEN|EIGHT|NINE)[ (]/)' cmudict* >> alN.dict
Listing 1 shows a snippet of the resulting alN.dict file, which contains only the
letters and numbers.
Listing 1. Snippet from alN.dict dictionary file
...
W D AH B AH L Y UW
X EH K S
Y W AY
Z Z IY
FOUR F AO R
ONE HH W AH N
ONE(2) W AH N
THREE TH R IY
...
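Before moving on, it can be worth confirming that the new dictionary covers all 36 words the grammar will use. The sketch below is one possible check, not part of the original sample code: the helper name missing_words is hypothetical, and the script assumes alN.dict is in the current directory.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The 36 words the grammar expects: letters A-Z plus the digit names.
my @EXPECTED = ('A' .. 'Z', qw(ZERO ONE TWO THREE FOUR FIVE SIX SEVEN EIGHT NINE));

# Given dictionary lines, return the expected words that have no entry.
# Alternate pronunciations such as "ONE(2)" count toward the base word.
sub missing_words {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        next unless $line =~ /^(\S+)\s+\S/;   # a word followed by at least one phone
        (my $word = $1) =~ s/\(\d+\)$//;       # strip the "(2)" variant marker
        $seen{$word} = 1;
    }
    return grep { !$seen{$_} } @EXPECTED;
}

# Check alN.dict when it is present in the current directory.
if (open my $fh, '<', 'alN.dict') {
    my @missing = missing_words(<$fh>);
    close $fh;
    print @missing ? "Missing entries: @missing\n" : "All 36 words present\n";
} else {
    print "alN.dict not found in the current directory\n";
}
```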
Modification of Hello World example
Sphinx-4 provides many configuration options to meet almost any need in the field of
speech recognition. For our purposes, the most efficient approach is to simply modify
the existing Hello World example. Under the Sphinx-4 root directory, change to the
demo/sphinx/helloworld directory and edit the
helloworld.config.xml file. Listing 2 shows the one line of change required to use the
alN.dict dictionary file we built.
Listing 2. helloworld.config.xml changes
original (line 114):
<property name="dictionaryPath"
value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>
new:
<property name="dictionaryPath"
value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/alN.dict"/>
Modifications are also necessary to the hello.gram grammar file in the same directory.
Listing 3 shows the changes required to pick up just the letters and numbers in our
dictionary file.
Listing 3. hello.gram changes
original:
public <greet> = (Good morning | Hello)
( Bhiksha | Evandro | Paul | Philip | Rita | Will );
new:
public <greet> = ( zero | one | two | three | four | five | six | seven | eight | nine |
a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v |
w | x | y | z ) * ;
You'll also need to make a cosmetic change to the HelloWorld.java file, as shown
below.
Listing 4. HelloWorld.java change
original (line 59):
System.out.println
("Say: (Good morning | Hello) " +
"( Bhiksha | Evandro | Paul | Philip | Rita | Will )");
new:
System.out.println
("Listening for letters and numbers");
With the above changes in place, you can build and run the modified example. Change to the
Sphinx-4 home directory and issue the command ant (the same
"BUILD SUCCESSFUL" message will let you know if your changes were correct). Run the
updated example with the command $JAVA_HOME/bin/java -mx312m -jar
bin/HelloWorld.jar (on Linux). The command for Windows is: java -mx312m
-jar bin/HelloWorld.jar. Speak this sentence: "The phone
number for IBM tech support is one eight zero zero four two six seven three seven
eight," and you should see output like that shown below:
f o nine r four i b m x a four t one eight zero zero four two six seven three seven eight
Text processing
As you can see, the sentence uttered will be processed for letters and numbers
semi-correctly. The letters "IBM" and the numbers in the phone number are recognized
correctly, but the remainder of the words are incorrectly categorized as various
letters and numbers that are the best match for a particular sound.
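To make the mix of correct and incorrect matches easier to see, the recognized words can be collapsed into runs of letters and runs of digits. The following Perl sketch is an illustration under stated assumptions: the helper name group_runs is hypothetical, and the input is the sample output line shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %DIGIT = (zero => 0, one => 1, two => 2, three => 3, four => 4,
             five => 5, six => 6, seven => 7, eight => 8, nine => 9);

# Collapse a line of recognized words into runs: consecutive single letters
# are joined into one string, consecutive number words into one digit string.
sub group_runs {
    my ($text) = @_;
    my @runs;
    my $prev = '';                       # kind of the previous token
    for my $word (split ' ', lc $text) {
        my ($kind, $value);
        if (exists $DIGIT{$word}) {
            ($kind, $value) = ('num', $DIGIT{$word});
        } elsif ($word =~ /^[a-z]$/) {
            ($kind, $value) = ('let', $word);
        } else {
            $prev = '';                  # an unknown word ends the current run
            next;
        }
        if ($kind eq $prev) { $runs[-1] .= $value }
        else                { push @runs, $value }
        $prev = $kind;
    }
    return @runs;
}

print join(' ', group_runs(
    'f o nine r four i b m x a four t one eight zero zero four two six seven three seven eight'
)), "\n";
# prints "fo 9 r 4 ibmxa 4 t 18004267378"
```

The long digit run at the end and the "ibm" inside "ibmxa" are the parts worth looking up; the short runs are the misrecognized filler words.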
You may be asking yourself: Why not simply use a multithousand-word dictionary to
recognize those incorrect best guesses? After all, Sphinx-4 provides large vocabulary
dictionaries and language models. Why not simply configure the demonstration example to
recognize the remaining words ("The phone number for tech support is") and any other words that may be uttered?
The answer is that Sphinx-4 is good, but not perfect. Expanding the dictionary file
to recognize hundreds of thousands of words will drastically reduce the effectiveness
of the simple number and letter matching. You can test this yourself by checking some
of the other programs in the Sphinx-4 "demo" directory or by modifying the existing
examples to use large dictionary files and expanded grammar lists. Post-processing the
text of only letters and numbers for higher-order data is a much easier method of
developing a useful annotation system with available open source systems.
With two simple rules, extracting acronyms and phone numbers from the output text
becomes relatively simple: Any three consecutive letters are considered an acronym, and
any five or more digits together are considered a phone number. Listings 5, 6, and 7
show the components of the annotateAcrNum.pl program that perform these extractions and lookups:
Listing 5. annotateAcrNum.pl part 1 — Main program logic
#!/usr/bin/perl -w
# annotateAcrNum.pl - extract and lookup acronyms and numbers from speech
# recognition text output
use strict;
use Yahoo::Search;
use Net::Dict;
$|=1; # non buffered output for better user feedback
my %numHash =
("zero" => "0",
"one" => "1",
"two" => "2",
"three" => "3",
"four" => "4",
"five" => "5",
"six" => "6",
"seven" => "7",
"eight" => "8",
"nine" => "9" );
while( my $line = <STDIN> )
{
  print "$line" if( $line =~ /(Start|You said:)/ );
  next unless ( $line =~ /You said:/ );
  my @words = split " ", substr($line,9); # ignore the "You said:" prefix
  my @numArr = ();
  my @letArr = ();
  foreach my $chunk ( @words )
  {
    if( length($chunk) == 1 )
    {
      # a single letter ends any run of digits collected so far
      phoneNmSearch(@numArr) if( @numArr > 4 );
      @numArr = ();
      push @letArr, $chunk;
      if( @letArr > 2 )
      {
        acronymSearch( @letArr );
        shift( @letArr ); # slide the three-letter window forward by one
      }
    }elsif( length($chunk) > 1 )
    {
      # a longer word ends any run of letters; keep it only if it is a number word
      push @numArr, $numHash{$chunk} if( exists $numHash{$chunk} );
      @letArr = ();
    }#if length greater
  }#for each word
  phoneNmSearch( @numArr ) if( @numArr > 4 );
  acronymSearch( @letArr ) if( @letArr > 2 );
}#while stdin
The main program logic above searches for letter and number strings matching our
simplistic criteria. For each line of speech-recognition text output by the modified
Hello World code, it builds separate arrays of letters and numbers. The letters array is
searched using the acronymSearch subroutine described below. Note that the
letters array is shifted after each acronym lookup in order to search for both "ibm"
and "bmx" from the string "i b m x". The numbers array does not perform this same
position shift; instead, the longest run of digits found is used for a single Web search.
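The shifting behavior amounts to a sliding window over the letters. The standalone Perl sketch below shows the same idea in isolation; the helper name letter_windows is hypothetical, and unlike the main loop it computes all windows at once rather than one letter at a time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every window of $size consecutive letters, in order.
sub letter_windows {
    my ($size, @letters) = @_;
    my @windows;
    for my $start (0 .. @letters - $size) {
        push @windows, join('', @letters[$start .. $start + $size - 1]);
    }
    return @windows;
}

print join(', ', letter_windows(3, qw(i b m x))), "\n";   # prints "ibm, bmx"
```

Each window costs one dictionary lookup, so a long run of misrecognized letters produces many queries; that is the price of not knowing in advance where an acronym starts.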
Listing 6. annotateAcrNum.pl part 2 — acronymSearch
sub acronymSearch
{
  my $dict = Net::Dict->new('dict.org');
  my $str = join( "", @_ ); # join the letters into one query string
  my $eref = $dict->define($str);
  return unless ( defined($eref) ); # nothing to print without a result list
  foreach my $entry (@$eref)
  {
    my ($db, $definition) = @$entry;
    next if ( !(defined($definition)) || !(defined($db)) );
    if( $db =~ /(wn|vera|gazetteer|foldoc)/ ){ print "$db: $definition\n" }
  }#for each definition
}#acronymSearch
Subroutine acronymSearch makes use of the helpful Net::Dict module.
Simply specify a dictionary server and a query to look up in the large variety of
databases available. Regular expression /(wn|vera|gazetteer|foldoc)/ limits the printout to those databases
that provide relatively terse descriptions. You may find that your acronym space is
better represented by other databases available at dict.org, requiring removal of this
regular expression limiter.
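If you want to change which databases are printed, a lookup table can be easier to maintain than the regular expression. The sketch below is a swapped-in alternative, not the article's method: the helper name filter_entries is hypothetical, the sample entries are hand-written stand-ins for a server response, and the match is exact, whereas the regular expression also accepts any database name that merely contains one of the listed strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Databases whose definitions are printed; names taken from the regular expression above.
my %TERSE_DB = map { $_ => 1 } qw(wn vera gazetteer foldoc);

# Keep only [database, definition] pairs from an allowed database,
# skipping entries with a missing database name or definition.
sub filter_entries {
    my ($allowed, @entries) = @_;
    return grep {
        defined $_->[0] && defined $_->[1] && $allowed->{ $_->[0] }
    } @entries;
}

# Hand-written sample data for illustration only, not real server output.
my @sample = (
    [ 'vera',   'IBM: International Business Machines' ],
    [ 'jargon', 'a long entry that would be skipped' ],
    [ 'wn',     undef ],
);
print "$_->[0]: $_->[1]\n" for filter_entries(\%TERSE_DB, @sample);
```

With this layout, adding or removing a database is a one-word change to the qw() list.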
Listing 7. annotateAcrNum.pl part 3 — phoneNmSearch
sub phoneNmSearch
{
  my $str = join( "", @_ ); # join the digits into one string
  # add dashes for the common 11-, 10-, and 7-digit lengths
  if( length($str) == 11 )
  {
    $str =~ /(\d)(\d\d\d)(\d\d\d)(\d\d\d\d)/;
    $str = "$1-$2-$3-$4";
  }elsif( length($str) == 10 )
  {
    $str =~ /(\d\d\d)(\d\d\d)(\d\d\d\d)/;
    $str = "$1-$2-$3";
  }elsif( length($str) == 7 )
  {
    $str =~ /(\d\d\d)(\d\d\d\d)/;
    $str = "$1-$2";
  }
  print "Results for: $str\n";
  my @results = Yahoo::Search->Results(Doc => "$str", AppId => "PhNmLookup" );
  warn $@ if $@; # report any errors
  my $recCount = 0;
  for my $res (@results)
  {
    print "Title: ", $res->Title, " \n";
    print $res->Summary, "\n";
    print $res->Url, "\n";
    print "\n";
    last if( $recCount > 1 ); # print first 3 results only
    $recCount++;
  }#for each result
}#phoneNmSearch
For certain search engines, drastically more accurate search results can be attained by
the addition of formatting to the phone number digits. For example, changing
18004267378 to 1-800-426-7378 or 4152042 into 415-2042 is performed by the first portion
of the phoneNmSearch subroutine. This slightly modified phone number is then used as
the query in a Yahoo! search parameter using Jeffrey Friedl's handy Yahoo::Search Perl module.
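The formatting step can be tried on its own, without a network connection. The following sketch isolates it in a hypothetical helper named format_phone; it uses anchored patterns instead of separate length checks, and returns any other digit string unchanged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Insert dashes into a digit string of a recognized length (11, 10, or 7);
# any other input is returned unchanged.
sub format_phone {
    my ($digits) = @_;
    return "$1-$2-$3-$4" if $digits =~ /^(\d)(\d{3})(\d{3})(\d{4})$/;
    return "$1-$2-$3"    if $digits =~ /^(\d{3})(\d{3})(\d{4})$/;
    return "$1-$2"       if $digits =~ /^(\d{3})(\d{4})$/;
    return $digits;
}

print format_phone($_), "\n" for qw(18004267378 4152042 12345);
# prints 1-800-426-7378, 415-2042, and 12345 on separate lines
```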
With your custom Sphinx-4 speech recognition and the annotateAcrNum Perl program,
you're ready to start annotating spoken conversations. Run the annotator with the
command $JAVA_HOME/bin/java -mx312m -jar bin/HelloWorld.jar | perl
annotateAcrNum.pl (on Linux). For Windows, the command is
java -mx312m -jar bin/HelloWorld.jar | perl
annotateAcrNum.pl.
Figure 1 shows the output of the annotator setup in "Terminal" on Vector Linux. Note
the underlined link text available to launch pages based on the Web search results.
Figure 1. Conversation annotator screenshot in Terminal on Vector Linux
Conclusion, further examples
The types of queries and databases chosen to search in this article are just general
examples of useful annotation. You may find that using Google for your Web search
lookups is more effective, or you can link your phone number lookups to your employer's
address book. The same flexibility applies to the methods chosen to extract
higher-order data from the recognized letters and numbers. Perhaps your conversations focus
more on IP addresses or employee serial numbers. Using some of the techniques
described, you can extract your dotted quads and unique identifiers, and link the lookups to your own databases.
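As one example of such an extension, the sketch below assembles a dotted quad from recognized words. It rests on an assumption that goes beyond this article's setup: the word "dot" would have to be added to both the dictionary file and the grammar. The helper name words_to_dotted_quad is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %DIGIT = (zero => 0, one => 1, two => 2, three => 3, four => 4,
             five => 5, six => 6, seven => 7, eight => 8, nine => 9);

# Convert recognized words into a dotted-quad string. Returns undef when the
# words do not form four octets in the range 0-255.
# Assumes "dot" has been added to the recognizer's vocabulary.
sub words_to_dotted_quad {
    my @words = @_;
    my $text = '';
    for my $word (@words) {
        if    ($word eq 'dot')       { $text .= '.' }
        elsif (exists $DIGIT{$word}) { $text .= $DIGIT{$word} }
        else                         { return }
    }
    my @octets = split /\./, $text, -1;   # -1 keeps empty trailing fields
    return unless @octets == 4;
    for my $octet (@octets) {
        return unless $octet =~ /^\d{1,3}$/ && $octet <= 255;
    }
    return $text;
}

my $ip = words_to_dotted_quad(qw(one nine two dot one six eight dot zero dot one));
print defined $ip ? "$ip\n" : "no dotted quad found\n";   # prints "192.168.0.1"
```

Validating the octet range discards many misrecognized sequences before any database lookup is attempted.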
Sphinx-4 also provides many options for enhancing the effectiveness of speech
recognition. Consider creating your own trained acoustic models for you and members of
your team to provide a much higher accuracy rate. Expand the dictionary file to include
tens of thousands of commonly spoken words and test Sphinx-4's real-time transcription qualities.
Download
Description | Name | Size | Download method
Sample code | os-sphinxspeechrecAnnotations_0.1.zip | 4KB | HTTP
Resources Learn
Get products and technologies
Discuss
About the author
Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.