This is an archived cached-text copy of the developerWorks article. Please consider viewing the original article at: IBM developerWorks




Identify speakers with sndpeek

Let your computer tell you who is speaking in teleconferences, podcasts, and live media events

Level: Intermediate

Nathan Harrington (harrington.nathan@gmail.com), Programmer, IBM 

15 Apr 2008

Use sndpeek and custom algorithms to match voices to a pre-recorded library. Create applications to let you know who is speaking in teleconferences, podcasts, and live media events. Build basic assistance programs to help the hearing-impaired identify speakers in a bandwidth-limited context.

Reliable authentication of an individual through voice-print analysis is complex and difficult. However, sndpeek and some custom algorithms can provide a voice-print matching configuration of considerably reduced complexity while retaining a great deal of usefulness. This article demonstrates the tools and code you'll need to modify sndpeek to record individual voice-print files for a given speaker. Each of these files is then compared with the incoming real-time audio stream to provide best-guess matches and visualizations for the current speaker.

Requirements

Hardware

You need a system capable of processing sound input, preferably from an external microphone. The code in this article was developed and tested on an IBM® ThinkPad T42p sporting an 1,800-MHz processor and 1 GB of RAM. Less-powerful systems should be capable of using the code presented here, as sndpeek, the primary resource consumer, is an efficient program.

Software

You need an operating system that supports sound processing and a microphone, which current versions of Mac OS X, Windows®, and Linux® all do. Although sound configuration and troubleshooting are beyond the scope of this article, it may be useful to test this code on a Vector Linux Live CD, which has most of the drivers and components necessary for a functional setup on a diverse range of sound hardware. Hardware 3-D acceleration is also required for useful levels of detail on the display.

The sndpeek application (see Resources) is designed to work on Windows, Mac OS X, and Linux. Make sure you have a functional audio environment before proceeding with the modifications described here.





Building a library of sound files to match against

Voice reference file requirements

Matching a voice accurately requires something to compare the current sound against. You'll need a long-duration sample of the person speaking to create a reliable template to match against. The preferred amount is 5 minutes of normal speech, including silences, pauses between words, and so on.

Many artifacts of normal human conversation should be avoided, such as coughs and keyboard clacking, as well as excessive line or ambient noise. A relatively noise-free environment is required because any sound other than the speaker's voice can adversely affect the reference voice print.

From the available recorded voice materials, you'll need to use your favorite audio editing program (such as Audacity) to splice together a single-voice audio file. For example, I used both recorded conference calls and IBM developerWorks podcasts as the source material for single-voice audio files used in the development of this article.

Note that you may need more, or substantially less, source data depending on the inherent differences between the speakers to be matched. Consider Figure 1 and the average differences between small portions of two speakers' voices. The graph was generated in real time using baudline, another excellent audio-processing tool.


Figure 1. Example average voice waveforms using baudline

Modifications to sndpeek

Download and extract the sndpeek source code (see Resources). Building an average of a voice print's spectral components requires modifications to the sndpeek.cpp file. Begin by adding some library includes and variable declarations starting at line 284.


Listing 1. Library includes, variable declarations
                
// for reading *.vertex* entries in the current directory
#include <dirent.h>

// voice matching function prototypes
void initialize_vertices( );
void build_match_number( int voices_index );

// for voiceprint matching
int g_voice_spectrum[200];          // human voice useful data in 0-199 range
int g_total_sample_size = 0;        // current size, or number of data points
float g_loudness_threshold = -0.8;  // what is a loud enough sample

Next, add the code shown below to begin the monitoring process starting at line 1339.


Listing 2. display_func variable declaration, sample size incrementer
                
    // simple vU meter, sample incrementer
    int loud_total = 0;
    g_total_sample_size++; 

Finish the monitoring process by adding the code shown in Listing 3 directly below the glVertex3f function call on line 1519. Listing 3 updates the current spectrum count only if the first waveform in the waterfall display is being processed and the current data point falls in the 0-199 portion of the overall spectrum. The most useful data points for the human voice, especially when band-limited on a phone line, are in this range.


Listing 3. Recording of voice spectrum data
                
                        // record spectrum of vertices for storing or analysis
                        //  only for the most significant portion of the 
                        //  current waveform
                        if( i == 0 && j < 200 )
                        {
                          if( pt->y > g_loudness_threshold )
                          {
                            g_voice_spectrum[j]++;
                            loud_total++; 
                          }
                        }// if current waveform and significant position

Once the specified audio file has been read in its entirety, we'll need to print out the stored spectrum information for the created voice print. Starting at line 720, change the code shown in Listing 4 to that shown in Listing 5.


Listing 4. Original end of file section
                
            else
                memset( buffer, 0, 2 * buffer_size * sizeof(SAMPLE) );


Listing 5. Write vertex file and exit after WAV file processed
                
            else
            {
                memset( buffer, 0, 2 * buffer_size * sizeof(SAMPLE) );

                // end of the WAV file reached: write the accumulated
                // spectrum counts to a .vertex file, then exit
                fprintf( stdout, "Vertex freq. count in %s.vertex \n", g_filename);
                FILE *out_file;
                static char str[1024];
                sprintf( str, "%s.vertex", g_filename);

                out_file = fopen(str, "w");
                fprintf(out_file, "%d\n", g_total_sample_size);
                fprintf(out_file, "%s\n", str);
                for( int i = 0; i < 200; i++ )
                    fprintf( out_file, "%03d  %08d\n", i, g_voice_spectrum[i]);
                fclose(out_file);
                exit( 0 );
            }

After a successful build using make linux-alsa, you can build as many vertex files as you like with the command sndpeek personVoice.wav. Each vertex file is created in the current directory with the name personVoice.wav.vertex. You may find it useful to edit the .vertex file directly to change the speaker's name to something more legible.
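
Based on the fprintf format strings in Listing 5, the first few lines of a hypothetical personVoice.wav.vertex file would look something like the following: the total sample count on the first line, the display name on the second line, and then one frequency-count line per spectrum bin (the counts here are invented for illustration).

3000
personVoice.wav.vertex
000  00002731
001  00002650
002  00002389
...
199  00000042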





Matching algorithm with average approximation

Strategy

As you can see in Figure 1, the human voice produces a distinct average of characteristic frequencies when speaking. Although this attribute may change for languages with a strong tonal component, empirical evidence suggests that an English speaker's averaged spectrum is very similar regardless of the words actually spoken. The modification shown below exploits this fact to create a simple count of the number of deviations between the current waveform and each of the stored voice prints from the *.vertex files.
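
Before looking at the implementation, here is a minimal, self-contained sketch of that deviation count (the function name and parameters are illustrative only and are not part of sndpeek; the real version appears in Listing 10 below). Each stored frequency count is scaled down to the same 100-sample basis as the live spectrum, and any bin differing by at least a threshold counts as one deviation:

#include <cstdlib>   // abs()

// stored:      a voice print's 200 per-bin frequency counts
// sample_size: total number of samples in that voice print (assumed >= 100)
// current:     the live 100-sample spectrum counts
// threshold:   allowed per-bin difference before it counts as a deviation
int count_deviations( const int stored[200], int sample_size,
                      const int current[200], int threshold )
{
    int deviations = 0;
    for( int i = 0; i < 200; i++ )
    {
        // scale the stored count down to a per-100-sample basis
        int expected = stored[i] / (sample_size / 100);
        if( abs( expected - current[i] ) >= threshold )
            deviations++;
    }
    return deviations;  // a lower total indicates a closer match
}

The implementation below additionally smooths this number with a running average over the three most recent 100-sample windows.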

Further modification of sndpeek — Matching implementation

To complete the matching, we'll need to set up some additional variables and data structures. Place the contents of Listing 6 in sndpeek.cpp, starting at line 296.


Listing 6. Matching-variable declarations
                
struct g_vprint
{ 
  char name[50];        // voice name
  int sample_size;      // number of data samples from vertex file
  int freq_count[200];  // spectrum data from vertex file
  int draw_frame[48];   // render memory
  int match_number;     // last 3 running average
  int average_match[3]; // last 3 data points
} ;

int g_total_voices = 0;        // number of vertex files read
int g_maximum_voices = 5;      // max first 5 vertex files in ./
g_vprint g_voices[5];
int g_average_match_count = 3; // running average of match number
int g_dev_threshold = 20;      // deviation between voiceprint and current

struct g_text_characteristics
{ 
  float x;
  float y;
  float r;
  float g;
  float b;
} ;

static g_text_characteristics g_text_attr[5];

With the variables in place, loading the vertex data into the appropriate structures is performed by the initialize_vertices function. Add the code in Listing 7, starting at line 399.


Listing 7. initialize_vertices function
                
//-----------------------------------------------------------------------------
// Name: initialize_vertices
// Desc: load "voiceprint" data from *.vertex
//-----------------------------------------------------------------------------
void initialize_vertices()
{
  DIR           *current_directory;
  struct dirent *dir;
  current_directory = opendir(".");
  if (current_directory)
  {
    while ((dir = readdir(current_directory)) != NULL)
    { 
      if( strstr( dir->d_name, ".vertex" ) && g_total_voices < g_maximum_voices )
      { 
        FILE * in_file;
        char * line = NULL;
        size_t len = 0;
        ssize_t read;
        int line_pos = 0;

        in_file = fopen( dir->d_name, "r");
        if (in_file == NULL)
             exit(EXIT_FAILURE);

        // file format is sample size, file name, then data all on separate lines
        while ((read = getline(&line, &len, in_file)) != -1)
        { 
          if( line_pos == 0 )
          { 
            g_voices[g_total_voices].sample_size = atoi(line);
            // initialize structure variables
            g_voices[g_total_voices].match_number = -1;
            for( int j=0; j< g_average_match_count; j++ )
              g_voices[g_total_voices].average_match[j] = -1;
          }else if( line_pos == 1 )
          { 
            sprintf( g_voices[g_total_voices].name, "%s", line );
          }else
          { 
            // parse fixed-width "NNN  CCCCCCCC" lines: bin index, then count
            static char temp_str[1024] ;
            g_voices[g_total_voices].freq_count[
              atoi( (strncpy(temp_str, line, 4))) ] =
                atoi( (strncpy(temp_str, line+5, 8)) );
          }
          line_pos++;
        }

        fclose(in_file);

        g_total_voices++;
      }// if vertex file
    }// while files left

    closedir(current_directory);
  }// if directory exists

}

The initialize_vertices function needs to be called in the main program, so add the function call in the main subroutine at line 597.


Listing 8. Loading vertices in main program call
                
    // load vertices if not building new ones
    if( !g_filename ) initialize_vertices();

Next, initialize the data structures to default values and specify screen positions for the match text to be rendered. Starting at line 955, inside the initialize_analysis function, add the code in Listing 9.


Listing 9. Initialize analysis data structures, text-display attributes
                
    // initialize the spectrum buckets for voice, text color and position attr.
    for( int i=0; i < 200; i++ )
      g_voice_spectrum[i]= 0;
      
    g_text_attr[0].x = -0.2f;
    g_text_attr[0].y = -0.35f;
    
    g_text_attr[1].x = 0.8f;
    g_text_attr[1].y = 0.35f;
    
    g_text_attr[2].x = 0.8f;
    g_text_attr[2].y = -0.35f;
    
    g_text_attr[3].x = 0.2f;
    g_text_attr[3].y = -0.35f;
    

    g_text_attr[0].r = 1.0f;
    g_text_attr[0].g = 0.0f;
    g_text_attr[0].b = 0.0f;
    g_text_attr[1].r = 0.0f;
    g_text_attr[1].g = 0.0f;
    g_text_attr[1].b = 1.0f;
    g_text_attr[2].r = 0.0f;
    g_text_attr[2].g = 1.0f;
    g_text_attr[2].b = 0.0f;
    g_text_attr[3].r = 0.01f;
    g_text_attr[3].g = 1.0f;
    g_text_attr[3].b = 0.0f;

The build_match_number function is the final one to add. Place the contents of Listing 10 on line 1449, below the pre-existing compute_log_function. The first part of build_match_number checks whether the current value in the recorded voice-print spectrum is outside the accepted deviation. If it is, a variable is incremented to count all deviations for the current sample. Once at least three data points are available, the current match number is set to the average of the three most recent deviation counts. For more accuracy, consider increasing the number of samples required or the number of data points to average.


Listing 10. build_match_number function
                
//-----------------------------------------------------------------------------
// Name: build_match_number
// Desc: compute the current deviation and average of last 3 deviations
//-----------------------------------------------------------------------------
void build_match_number(int voices_index)
{ 

  int total_dev = 0;
  int temp_match = 0;

  for( int i=0; i < 200; i++ )
  {  
    int orig =  g_voices[voices_index].freq_count[i] /
                  (g_voices[voices_index].sample_size/100);
    if( abs( orig - g_voice_spectrum[i]) >= g_dev_threshold)
      total_dev ++;

  }// for each spectrum frequency count

  // walk the average back in time
  for( int i=2; i > 0; i-- )
  { 
    g_voices[voices_index].average_match[i] =
      g_voices[voices_index].average_match[i-1];
    if( g_voices[voices_index].average_match[i] == -1 )
      temp_match = -1;
  }
  g_voices[voices_index].average_match[0] = total_dev;

  // if all 3 historical values have been recorded
  if( temp_match != -1 )
  { 
    g_voices[voices_index].match_number =
      (g_voices[voices_index].average_match[0] +
        g_voices[voices_index].average_match[1] +
        g_voices[voices_index].average_match[2]) / 3;
  }
  
} 

Matching voice entries is nearly complete; the final step is an addition directly below line 1712. Add the contents of Listing 11 to perform the matching. Note how the most recent match is determined regardless of the render state. This ensures an accurate rendering of the text waterfall display back in time when sufficient match data is unavailable. If a full sample of data has been read, the current match state is created using the build_match_number subroutine, and the voice print with the best match is printed on stdout and set to be rendered. Next, the data is reinitialized to prepare for the next run.

If a full sample of data has not been read, only the most recent text is printed to the screen. If there is enough loudness on the line, which usually indicates someone speaking, the render variable is set. This ensures that the most recent match is continuously rendered while another sample is gathered.


Listing 11. Main match processing
                
        // run the voice match if a filename is not specified
        if( !g_filename )
        {
          
          // compute most recent match
          int lowestIndex = 0;
          for( int vi=0; vi < g_total_voices; vi++ )
          { 
            if( g_voices[vi].match_number < g_voices[lowestIndex].match_number )
              lowestIndex = vi;
          }// for voice index vi
          
          if( g_total_sample_size == 100 )
          { 
            g_total_sample_size = 0;
            for( int j =0; j < g_total_voices; j++ )
              build_match_number( j );
            
            // decide if first frame is renderable
            if( g_voices[lowestIndex].match_number != -1 &&
                  g_voices[lowestIndex].match_number < 20 )
            { 
              fprintf(stdout, "%d %s", 
                        g_voices[lowestIndex].match_number,
                        g_voices[lowestIndex].name );
              fflush(stdout);
              g_voices[lowestIndex].draw_frame[0] = 1;
            }
            
            // reset the current spectrum
            for( int i=0; i < 200; i++ )
              g_voice_spectrum[i]= 0;
          
          }else
          { 
            // fill in render frame if virtual vU meter active
            if( loud_total > 50 && g_voices[lowestIndex].match_number < 20 )
            { 
              if(  g_voices[lowestIndex].match_number != -1 )
                g_voices[lowestIndex].draw_frame[0] =1;
            
            }//if enough signal
          
          }// if sample size reached
          
          // move frames back in time
          for( int vi = 0; vi < g_total_voices; vi++ )
          { 
            for( i= (g_depth-1); i > 0; i-- )
              g_voices[vi].draw_frame[i] = g_voices[vi].draw_frame[i-1];
            g_voices[vi].draw_frame[i] = 0;
          
          }//shift back in time
        
        }// if not a g_filename





Visualization of matches

With matching complete and the render states updated, all that's left is to draw the match text on the screen. To stay consistent with sndpeek's visualization conventions, the match text is rendered in its own waterfall display. Listing 12 shows the rendering process, with custom colors depending on which voice prints matched. Insert the code in Listing 12 at line 1893 in sndpeek.cpp.


Listing 12. Rendering of match names
                
        // draw the renderable voice match text
        if( !g_filename ) 
        {       
          for( int vi=0; vi < g_total_voices; vi++ )
          { 

            for( i=0; i < g_depth; i++ )
            {
            
              if( g_voices[vi].draw_frame[i] == 1 )
              {
                fval = (g_depth -  i) / (float)(g_depth);
                fval = fval /10;
                sprintf( str, "%s", g_voices[vi].name );
              
                if( vi == 0 )
                {
                  glColor3f( g_text_attr[vi].r * fval,
                              g_text_attr[vi].g, g_text_attr[vi].b );
                }else if( vi == 1 )
                {
                  glColor3f( g_text_attr[vi].r, 
                              g_text_attr[vi].g, g_text_attr[vi].b * fval);
                }else if( vi == 2 )
                {
                  glColor3f( g_text_attr[vi].r, 
                              g_text_attr[vi].g * fval , g_text_attr[vi].b);
                }else if( vi == 3 )
                {
                  glColor3f( g_text_attr[vi].r,
                              g_text_attr[vi].g * fval, g_text_attr[vi].b);
                }
                draw_string( g_text_attr[vi].x, g_text_attr[vi].y,
                              -i, str, 0.5f );
              }// draw frame check

            }//for depth i

          }//for voice index vi
        }





Usage

Test your changes with make linux-alsa; sndpeek. If there are no errors in the build process, you should see a normal sndpeek window with the usual waveform and Lissajous visualizations. One of the easiest ways to test the program is by splicing together various 10- to 30-second intervals of speech from your single-speaker source files. Depending on the size of your voice-print files, available speakers, and various other factors, you may have to tweak some of the options implemented in the changes above. As various people begin speaking, you should see the associated names appear as part of the sndpeek visualizations, as well as the match number and speaker name printed to stdout. Consult the demonstration video link in the Resources section for an example of what this can look like.
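
For example, a complete session with two speakers might look like the following (the WAV file names are illustrative; the commands are the ones described above). Each file-based invocation writes its .vertex file and exits; the final invocation starts live matching against every *.vertex file in the current directory:

make linux-alsa
sndpeek alice.wav     # writes alice.wav.vertex, then exits
sndpeek bob.wav       # writes bob.wav.vertex, then exits
sndpeek               # live matching against the *.vertex files found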





Conclusion and further examples

As you can see from the demonstration video, the results are not 100-percent accurate, but they do provide a useful aid in identifying speakers. With the tools and modifications to sndpeek described here, you can begin to isolate specific individuals in audio recordings based on their voice prints.

Consider hooking up the real-time voice monitor to your next conference call and have a better idea of who is speaking when. Create an additional widget on your Web-conferencing page to help automatically identify speakers to new members of your team. Track certain voices in televised programs, so you know when the news anchors break into your favorite show. Move beyond Caller ID and identify who left you a message, not just what number they called from.






Download

Description: Sample code
Name: os-sndpeek.voicePrint_0.1.zip
Size: 15 KB
Download method: HTTP


Resources


Get products and technologies
  • Download sndpeek, the real-time audio visualization program, which is hosted by Princeton University.

  • Innovate your next open source development project with IBM trial software, available for download or on DVD.

  • Download IBM product evaluation versions, and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.



About the author

Nathan Harrington

Nathan Harrington is a programmer at IBM currently working with Linux and resource-locating technologies.



