Level: Intermediate
Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM
15 Apr 2008

Use sndpeek and custom algorithms to match voices to a pre-recorded library.
Create applications to let you know who is speaking in teleconferences, podcasts, and
live media events. Build basic assistance programs to help the hearing-impaired identify
speakers in a bandwidth-limited context.
Reliable authentication of an individual through voice-print analysis is complex and
difficult. However, sndpeek and some custom algorithms can provide a voice-print
matching configuration of considerably reduced complexity while retaining a great deal
of usefulness. This article demonstrates the tools and code you'll need to modify
sndpeek to record individual voice-print files for a given speaker. Each of these files
is then compared to the incoming real-time audio stream to provide best-guess matches
and visualizations for the current speaker.
Requirements
Hardware
You need a system capable of processing sound input, preferably from an
external microphone. The code in this article was developed and tested on an
IBM® ThinkPad T42p sporting an 1,800-MHz processor and 1 GB of RAM.
Less-powerful systems should be capable of running the code presented here, as sndpeek
is the primary resource consumer and is an efficient program.
Software
You need an operating system that supports sound processing and a microphone, which
current versions of Mac OS X, Windows®, and Linux® do today. Although sound
configuration and troubleshooting are beyond the scope of this article, it may be useful
to test this code on a Vector Linux Live CD, which has most of the drivers and
components necessary for a functional setup on a diverse range of sound hardware.
Hardware 3-D acceleration is also required for useful levels of detail on the display.
The sndpeek application (see Resources)
is designed to work on Windows, Mac OS X, and Linux. Make sure you have a functional
audio environment before proceeding with the modifications described here.
Building a library of sound files to match against
Voice reference file requirements
Accurately matching a voice requires something to compare the current sound against.
You'll need a long-duration sample of the person speaking to create a reliable template
to match against. The preferred amount is 5 minutes of normal speech, including
silences, pauses between words, and so on.
Many sounds common in recordings of normal human conversation should be avoided, such
as coughs and keyboard clacking, as well as excessive line or ambient noise. A
relatively noise-free environment is required because any sound other than the
speaker's voice can adversely affect the reference voice print.
From the available recorded voice materials, you'll need to use your favorite audio
editing program (such as Audacity) to splice together a single-voice audio file. For
example, I used both recorded conference calls and IBM developerWorks podcasts
as the source material for single-voice audio files used in the development of this article.
Note that you may need more or substantially less source data depending on the inherent
differences among the speakers to be matched. Consider Figure 1 and the average
differences between small portions of the speakers' voices. The graph was generated in
real time using baudline, another excellent audio-processing tool.
Figure 1. Example average voice waveforms using baudline
Modifications to sndpeek
Download and extract the sndpeek source code (see Resources).
Building an average of a voice print's spectral components requires modifications to
the sndpeek.cpp file. Begin by adding some library includes and variable declarations
starting at line 284.
Listing 1. Library includes, variable declarations
// for reading *.vertex* entries in the current directory
#include <dirent.h>
// voice matching function prototypes
void initialize_vertices( );
void build_match_number( int voices_index );
// for voiceprint matching
int g_voice_spectrum[200]; // human voice useful data in 0-199 range
int g_total_sample_size = 0; // current size, or number of data points
float g_loudness_threshold = -0.8f; // what is a loud enough sample
Next, add the code shown below to begin the monitoring process starting at line 1339.
Listing 2. display_func variable declaration, sample size incrementer
// simple VU meter, sample incrementer
int loud_total = 0;
g_total_sample_size++;
Finish the monitoring process by adding the code shown in Listing 3 directly below the
glVertex3f function call on line 1519. Listing 3 increments the
current spectrum count only when the first waveform in the waterfall display is being
processed and the current data point falls within positions 0 through 199 of the
overall spectrum. The most useful data points for the human voice, especially when it
is band-limited on a phone line, are in this 0-199 range.
Listing 3. Recording of voice spectrum data
// record spectrum of vertices for storing or analysis
// only for the most significant portion of the
// current waveform
if( i == 0 && j < 200 )
{
    if( pt->y > g_loudness_threshold )
    {
        g_voice_spectrum[j]++;
        loud_total++;
    }
}// if current waveform and significant position
Once the specified audio file has been read in its entirety, we'll need to print out
the stored spectrum information for the created voice print. Starting at line 720,
change the code shown in Listing 4 to that shown in Listing 5.
Listing 4. Original end of file section
else
    memset( buffer, 0, 2 * buffer_size * sizeof(SAMPLE) );
Listing 5. Write vertex file and exit after WAV file processed
else
{
    memset( buffer, 0, 2 * buffer_size * sizeof(SAMPLE) );
}
// write the voice print: sample count, name, then one line per frequency bin
fprintf( stdout, "Vertex freq. count in %s.vertex \n", g_filename );
FILE *out_file;
static char str[1024];
sprintf( str, "%s.vertex", g_filename );
out_file = fopen( str, "w" );
fprintf( out_file, "%d\n", g_total_sample_size );
fprintf( out_file, "%s\n", str );
for( int i = 0; i < 200; i++ )
    fprintf( out_file, "%03d %08d\n", i, g_voice_spectrum[i] );
fclose( out_file );
exit( 0 );
After a successful build using make linux-alsa, you can
build as many vertex files as you like with the command sndpeek
personVoice.wav. Each vertex file will be created in the current directory with
the name personVoice.wav.vertex. You may find it useful to
edit the .vertex file directly to change the speaker's name to something more legible.
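For reference, a hypothetical personVoice.wav.vertex file produced by Listing 5 would
look something like the following. The counts shown here are made-up values; the real
file contains the total sample count, the name line (the one you can edit by hand to
change the displayed speaker name), and then one "bin count" line for each of the 200
frequency bins:

8254
personVoice.wav.vertex
000 00003712
001 00003645
002 00002981
...
199 00000042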
Matching algorithm with average approximation
Strategy
As you can see in Figure 1, the human voice produces a distinct average of
characteristic frequencies when speaking. Although this attribute may change for
languages with a strong tonal component, empirical evidence shows that an English
speaker's average spectrum is very similar regardless of the words actually spoken. The
modifications shown below use this fact to create a simple count of the deviations
between the current waveform and each of the stored voice prints from the *.vertex files.
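Before wiring this into sndpeek, here is the core of the comparison as a minimal
standalone sketch. The function name count_deviations and the fixed 200-bin arrays are
illustrative only, and the sketch is simplified (the real code also rescales the stored
counts; see Listing 10 below):

#include <cstdlib> // for abs

// count the bins where a stored voice print and the live spectrum
// differ by at least the threshold; the print with the fewest
// deviations is the best guess for the current speaker
int count_deviations( const int stored[200], const int current[200],
                      int threshold )
{
    int deviations = 0;
    for( int i = 0; i < 200; i++ )
        if( abs( stored[i] - current[i] ) >= threshold )
            deviations++;
    return deviations;
}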
Further modification of sndpeek — Matching implementation
To complete the matching, we'll need to set up some additional variables and data
structures. Place the contents of Listing 6 in sndpeek.cpp, starting at line 296.
Listing 6. Matching-variable declarations
struct g_vprint
{
    char name[50];        // voice name
    int sample_size;      // number of data samples from vertex file
    int freq_count[200];  // spectrum data from vertex file
    int draw_frame[48];   // render memory, one flag per waterfall frame
    int match_number;     // running average of the last 3 deviation counts
    int average_match[3]; // last 3 data points
};
int g_total_voices = 0;   // number of vertex files read
int g_maximum_voices = 5; // max first 5 vertex files in ./
g_vprint g_voices[5];
int g_average_match_count = 3; // running average of match number
int g_dev_threshold = 20; // deviation between voiceprint and current
struct g_text_characteristics
{
    float x; // screen position of the match text
    float y;
    float r; // text color components
    float g;
    float b;
};
static g_text_characteristics g_text_attr[5];
With the variables in place, loading the vertex data into the appropriate structures is
performed by the initialize_vertices function. Add the code
in Listing 7, starting at line 399.
Listing 7. initialize_vertices function
//-----------------------------------------------------------------------------
// Name: initialize_vertices
// Desc: load "voiceprint" data from *.vertex
//-----------------------------------------------------------------------------
void initialize_vertices()
{
    DIR *current_directory;
    struct dirent *dir;
    current_directory = opendir(".");
    if (current_directory)
    {
        while ((dir = readdir(current_directory)) != NULL)
        {
            if( strstr( dir->d_name, ".vertex" ) && g_total_voices < g_maximum_voices )
            {
                FILE * in_file;
                char * line = NULL;
                size_t len = 0;
                ssize_t read;
                int line_pos = 0;
                in_file = fopen( dir->d_name, "r");
                if (in_file == NULL)
                    exit(EXIT_FAILURE);
                // file format is sample size, file name, then data all on separate lines
                while ((read = getline(&line, &len, in_file)) != -1)
                {
                    if( line_pos == 0 )
                    {
                        g_voices[g_total_voices].sample_size = atoi(line);
                        // initialize structure variables
                        g_voices[g_total_voices].match_number = -1;
                        for( int j = 0; j < g_average_match_count; j++ )
                            g_voices[g_total_voices].average_match[j] = -1;
                    }else if( line_pos == 1 )
                    {
                        // note: the name keeps the trailing newline from getline,
                        // which later terminates the match line printed to stdout
                        snprintf( g_voices[g_total_voices].name,
                                  sizeof(g_voices[g_total_voices].name),
                                  "%s", line );
                    }else
                    {
                        // read "index count" pairs for frequency bins 0-199
                        int bin = 0, count = 0;
                        if( sscanf( line, "%d %d", &bin, &count ) == 2
                            && bin >= 0 && bin < 200 )
                            g_voices[g_total_voices].freq_count[bin] = count;
                    }
                    line_pos++;
                }
                free(line);
                fclose(in_file);
                g_total_voices++;
            }// if vertex file
        }// while files left
        closedir(current_directory);
    }// if directory exists
}
The initialize_vertices function needs to be called in the
main program, so add the function call in the main subroutine at line 597.
Listing 8. Loading vertices in main program call
// load vertices if not building new ones
if( !g_filename ) initialize_vertices();
The code shown below initializes the data to default values and specifies locations
for the match text to be rendered. Starting at line 955, inside the initialize_analysis
function, add the code in Listing 9.
Listing 9. Initialize analysis data structures,
text-display attributes
// initialize the spectrum buckets for voice, plus text color and
// position attributes for the first four voice prints
for( int i = 0; i < 200; i++ )
    g_voice_spectrum[i] = 0;
g_text_attr[0].x = -0.2f;
g_text_attr[0].y = -0.35f;
g_text_attr[1].x = 0.8f;
g_text_attr[1].y = 0.35f;
g_text_attr[2].x = 0.8f;
g_text_attr[2].y = -0.35f;
g_text_attr[3].x = 0.2f;
g_text_attr[3].y = -0.35f;
g_text_attr[0].r = 1.0f;
g_text_attr[0].g = 0.0f;
g_text_attr[0].b = 0.0f;
g_text_attr[1].r = 0.0f;
g_text_attr[1].g = 0.0f;
g_text_attr[1].b = 1.0f;
g_text_attr[2].r = 0.0f;
g_text_attr[2].g = 1.0f;
g_text_attr[2].b = 0.0f;
g_text_attr[3].r = 0.01f;
g_text_attr[3].g = 1.0f;
g_text_attr[3].b = 0.0f;
build_match_number is the final function to add. Place the
contents of Listing 10 on line 1449 under the pre-existing compute_log_function.
The first part of build_match_number checks whether the current value in the recorded
voice-print spectrum is outside the accepted deviation. If it is, a variable is
incremented to count all deviations for the current sample. After ensuring that at
least three data points are available, the current match number is set to the average
of the three most recent deviation counts. For more accuracy, consider increasing the
number of samples required or the number of data points to average.
Listing 10. build_match_number function
//-----------------------------------------------------------------------------
// Name: build_match_number
// Desc: compute the current deviation and average of last 3 deviations
//-----------------------------------------------------------------------------
void build_match_number(int voices_index)
{
    int total_dev = 0;
    int temp_match = 0;
    for( int i = 0; i < 200; i++ )
    {
        // scale the stored count to the 100-sample window used for the
        // live spectrum, so prints of any length are comparable
        // (assumes the print was built from at least 100 samples)
        int orig = g_voices[voices_index].freq_count[i] /
            (g_voices[voices_index].sample_size / 100);
        if( abs( orig - g_voice_spectrum[i] ) >= g_dev_threshold )
            total_dev++;
    }// for each spectrum frequency count
    // walk the average back in time
    for( int i = 2; i > 0; i-- )
    {
        g_voices[voices_index].average_match[i] =
            g_voices[voices_index].average_match[i-1];
        if( g_voices[voices_index].average_match[i] == -1 )
            temp_match = -1;
    }
    g_voices[voices_index].average_match[0] = total_dev;
    // if all 3 historical values have been recorded
    if( temp_match != -1 )
    {
        g_voices[voices_index].match_number =
            (g_voices[voices_index].average_match[0] +
             g_voices[voices_index].average_match[1] +
             g_voices[voices_index].average_match[2]) / 3;
    }
}
Matching voice entries is nearly complete; the final step is an addition directly
below line 1712. Add the contents of Listing 11 to perform the matching. Note how the
most recent match is determined regardless of the render state. This ensures an
accurate rendering of the text waterfall display back in time when sufficient match
data is unavailable. If a full sample of data has been read, the current match state is
created using the build_match_number subroutine, and the voice
print with the best match is printed on stdout and set to be
rendered. Next, the data is reinitialized to prepare for the next run.
If a full sample of data has not been read, only the most recent text is printed to the
screen. If there is enough loudness on the line, which usually indicates someone
speaking, the render variable is set. This ensures that the most recent match is
continuously rendered while another sample is gathered.
Listing 11. Main match processing
// run the voice match if a filename is not specified
if( !g_filename )
{
    // compute most recent match
    int lowestIndex = 0;
    for( int vi = 0; vi < g_total_voices; vi++ )
    {
        if( g_voices[vi].match_number < g_voices[lowestIndex].match_number )
            lowestIndex = vi;
    }// for voice index vi
    if( g_total_sample_size == 100 )
    {
        g_total_sample_size = 0;
        for( int j = 0; j < g_total_voices; j++ )
            build_match_number( j );
        // decide if first frame is renderable
        if( g_voices[lowestIndex].match_number != -1 &&
            g_voices[lowestIndex].match_number < 20 )
        {
            fprintf(stdout, "%d %s",
                g_voices[lowestIndex].match_number,
                g_voices[lowestIndex].name );
            fflush(stdout);
            g_voices[lowestIndex].draw_frame[0] = 1;
        }
        // reset the current spectrum
        for( int i = 0; i < 200; i++ )
            g_voice_spectrum[i] = 0;
    }else
    {
        // fill in render frame if virtual VU meter is active
        if( loud_total > 50 && g_voices[lowestIndex].match_number < 20 )
        {
            if( g_voices[lowestIndex].match_number != -1 )
                g_voices[lowestIndex].draw_frame[0] = 1;
        }//if enough signal
    }// if sample size reached
    // move frames back in time
    for( int vi = 0; vi < g_total_voices; vi++ )
    {
        for( i = (g_depth-1); i > 0; i-- )
            g_voices[vi].draw_frame[i] = g_voices[vi].draw_frame[i-1];
        // i is 0 after the loop: clear the newest frame for the next pass
        g_voices[vi].draw_frame[i] = 0;
    }//shift back in time
}// if not a g_filename
Visualization of matches
With matching complete and the render states updated, all that's left is to actually
draw the match text on the screen. Consistency with sndpeek's visualization
conventions is maintained by rendering the text in its own waterfall display. Listing 12
shows the rendering process, with custom colors depending on which voice
prints matched. Insert the code in Listing 12 at line 1893 in sndpeek.cpp.
Listing 12. Rendering of match names
// draw the renderable voice match text
if( !g_filename )
{
    for( int vi = 0; vi < g_total_voices; vi++ )
    {
        for( i = 0; i < g_depth; i++ )
        {
            if( g_voices[vi].draw_frame[i] == 1 )
            {
                // fade the text as it recedes into the waterfall
                fval = (g_depth - i) / (float)(g_depth);
                fval = fval / 10;
                // copy the name as data, not as a format string
                sprintf( str, "%s", g_voices[vi].name );
                if( vi == 0 )
                {
                    glColor3f( g_text_attr[vi].r * fval,
                        g_text_attr[vi].g, g_text_attr[vi].b );
                }else if( vi == 1 )
                {
                    glColor3f( g_text_attr[vi].r,
                        g_text_attr[vi].g, g_text_attr[vi].b * fval );
                }else if( vi == 2 )
                {
                    glColor3f( g_text_attr[vi].r,
                        g_text_attr[vi].g * fval, g_text_attr[vi].b );
                }else if( vi == 3 )
                {
                    glColor3f( g_text_attr[vi].r,
                        g_text_attr[vi].g * fval, g_text_attr[vi].b );
                }
                draw_string( g_text_attr[vi].x, g_text_attr[vi].y,
                    -i, str, 0.5f );
            }// draw frame check
        }//for depth i
    }//for voice index vi
}
Usage
Test your changes with make linux-alsa; sndpeek. If there
are no errors in the build process, you should see a normal sndpeek window with the
usual waveform and Lissajous visualizations. One of the easiest ways to test the program
is by splicing together various 10- to 30-second intervals of speech from your
single-speaker source files. Depending on the size of your voice-print files, available
speakers, and various other factors, you may have to tweak some of the options
implemented in the changes above. As various people begin speaking, you should see the
associated names appear as part of the sndpeek visualizations, as well as the match
score and speaker name printed to stdout. Consult the demonstration video link in the
Resources section for an example of what this can look like.
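As a quick end-to-end check, the whole workflow looks something like the following (the
WAV file names are illustrative only):

# build sndpeek with the voice-print modifications
make linux-alsa

# create one voice print per speaker from single-voice recordings
sndpeek alice.wav   # writes alice.wav.vertex
sndpeek bob.wav     # writes bob.wav.vertex

# run with no file name to match the live audio stream against
# every *.vertex file in the current directory
sndpeek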
Conclusion and further examples
As you can see from the demonstration video, the results are not 100-percent accurate,
but they do provide a useful aid in identifying speakers. With the tools and
modifications to sndpeek described here, you can begin to isolate specific individuals
from audio recordings based on their voice prints.
Consider hooking up the real-time voice monitor to your next conference call to get a
better idea of who is speaking when. Create an additional widget on your
Web-conferencing page to help automatically identify speakers to new members of your
team. Track certain voices in televised programs, so you know when the news anchors
break into your favorite show. Move beyond Caller ID and identify who left you a
message, not just what number they called from.
Download

Description | Name | Size | Download method
--- | --- | --- | ---
Sample code | os-sndpeek.voicePrint_0.1.zip | 15KB | HTTP
Resources

Get products and technologies

- Download sndpeek, the real-time audio visualization program, which is hosted by
  Princeton University.
- Innovate your next open source development project with IBM trial software,
  available for download or on DVD.
- Download IBM product evaluation versions, and get your hands on application
  development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®,
  and WebSphere®.
About the author

Nathan Harrington is a programmer at IBM currently working with Linux and
resource-locating technologies.