slides - Pace University

Voiceprint System Development
Design, implement, and test a unique voiceprint biometric system
Research Day Presentation, May 3rd 2013
Rahul Raj (Team Lead), Geeta Bothe,
Mahesh Sooryambylu, Ravi Ray, Sreeram Vancheeswaran
IBM India
Customer: Jonathan Leet (DPS 2013)
Instructor: Dr. Charles Tappert
Common Passphrase
• Background: four possible types of passphrases
  1. User-specified phrase, like the user's name
  2. Specified phrase common to all users
     • “My name is” from phrase “My name is user’s name”
  3. Random phrase displayed on the computer screen
  4. Random phrase that can vary at the user's discretion
• Advantages of a Common Passphrase
  • Simplifies the segmentation problem
  • Allows for careful selection of common phrase to optimize variety of phonetic units for their authentication value
  • Facilitates testing for imposters
  • Permits the measurement of true voice authentication biometric performance
  • Avoids potential experimental flaws
Software Used: Audacity & Matlab
• Audacity
  • Open source audio editing software that supports sound recording and editing
  • Supports resampling and stereo-to-mono conversion
  • Available on all platforms: Windows, Linux, Mac
• Matlab
  • Signal Processing Toolbox provides industry-standard algorithms and apps for analog and digital signal processing
  • Supports visualizing signals in time and frequency domains, FFT computation for spectral analysis, resampling, and other signal processing techniques
System Architecture
• Collection and management of speech samples in repository
• Preprocessing and spectrogram generation
• Mel filter banks and MFCC calculation
• Feature vector extraction
• Automatic segmentation of phonemes using DTW
• Automatic segmentation of the “My name is” portion
Pace’s Biometric Authentication System will obtain performance results from the feature vectors
Voice Sample → Spectrogram using Matlab
Input speech sample (mono, 44100 samples/sec)
• Voice stream collected into frames of 1024 samples
• Samples are read by sliding the stream by 512 bytes, maintaining overlap
• Represent samples of a frame
• One frame ~ 23 ms, since
  • Frames per second = 44100/1024 ≈ 43
  • Length of one frame = 1000 ms / frames per second ≈ 23 ms
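The frame arithmetic on this slide can be checked with a short sketch. It is written in Python rather than Matlab and is not the project's code. It assumes the 512 step is counted in samples (the slide says bytes), which would give 50% overlap between 1024-sample frames. The names `frame_signal` and `frame_duration_ms` are illustrative.

```python
SAMPLE_RATE = 44100   # samples per second, from the slide
FRAME_LEN = 1024      # samples per frame, from the slide
HOP = 512             # ASSUMPTION: step counted in samples (50% overlap)


def frame_duration_ms(frame_len=FRAME_LEN, sample_rate=SAMPLE_RATE):
    """Duration of one frame in milliseconds."""
    return 1000.0 * frame_len / sample_rate


def frame_signal(samples, frame_len=FRAME_LEN, hop=HOP):
    """Split a sequence into overlapping frames.

    A trailing partial frame is dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE
    print(round(frame_duration_ms(), 2), "ms per frame")
    print(len(frame_signal(one_second)), "overlapping frames in one second")
```

With a 512-sample step, one second of audio yields roughly twice as many frames as the non-overlapping count of about 43, because each sample (apart from the edges) falls into two frames.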
Voice Sample → Spectrogram using Matlab
• Represent component frequencies of a frame after applying FFT
• Frequency vs. time data: represent the complete spectral data available for processing
• Spectrogram constructed out of the above values
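As a rough illustration of how per-frame spectra stack into a spectrogram, the sketch below computes a magnitude spectrum for each overlapping frame. It is an assumption-laden stand-in, not the project's Matlab code: it uses a naive DFT from the Python standard library instead of an FFT routine, applies no window function, and runs on a toy sinusoid with a small frame size so that it finishes quickly. The function names are illustrative.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 for one frame (naive O(N^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_len, hop):
    """One magnitude spectrum per overlapping frame.

    Rows are time (frames), columns are frequency bins.
    """
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        rows.append(magnitude_spectrum(samples[start:start + frame_len]))
    return rows


if __name__ == "__main__":
    # Toy signal: a sinusoid with exactly 4 cycles per 32-sample frame.
    frame_len, hop = 32, 16
    signal = [math.sin(2 * math.pi * 4 * t / frame_len) for t in range(128)]
    spec = spectrogram(signal, frame_len, hop)
    peak_bins = [row.index(max(row)) for row in spec]
    print(len(spec), "frames; peak bin per frame:", peak_bins)
```

For the slide's parameters, each row would hold the bins of a 1024-sample frame, with bin k corresponding to k * 44100 / 1024 Hz.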
Voiceprint Systems CS692 2013 Spring Batch
Mel-Frequency Bands Space Filters Appropriately

Corresponds to the frequency transform performed by the cochlea of the human ear.

Mel filters are shown below; the 13 lower bands are used for processing.
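The spacing of mel filters can be sketched as follows. The slide does not give the mel formula or the total number of filters, so both are assumptions here: the code uses the widely quoted conversion mel = 2595 * log10(1 + f/700), a hypothetical bank of 26 filters spanning 0 Hz to the Nyquist frequency (22050 Hz for 44100 samples/sec), and keeps the 13 lowest center frequencies. Function names are illustrative.

```python
import math


def hz_to_mel(f_hz):
    """Common Hz-to-mel conversion (assumed; not stated on the slide)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of n_filters bands spaced evenly in mel."""
    mel_min, mel_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (mel_max - mel_min) / (n_filters + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_filters)]


if __name__ == "__main__":
    # ASSUMPTION: 26 filters in total; only the 13 lowest are kept.
    centers = mel_center_frequencies(26, 0.0, 22050.0)
    print([round(c) for c in centers[:13]])
```

Because the bands are evenly spaced in mel rather than in Hz, the center frequencies sit close together at low frequencies and spread out at high frequencies.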
Segmenting “My Name Is”
• Speech waveform indicating the voiced and unvoiced segments
• Energy vs. zero crossing plotted for the same speech sample
• Unvoiced segments show high zero-crossing rate (red) and low energy (green) values
• Voiced segments show low zero-crossing rate and high energy values
• Higher-frequency components of the ‘z’ sound have higher energy compared to the other phonemes
• Diagram shows the automatically marked spectrum in Matlab
• Vertical lines demarcate the beginning of speech and the end of ‘z’
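The two measurements named on this slide, short-time energy and zero-crossing rate, can be sketched per frame as below. This is a minimal illustration, not the project's Matlab routine. The thresholds and the two toy frames are hypothetical values chosen only to make the contrast visible, and the slide does not say how thresholds were actually set.

```python
import math


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_voiced(frame, energy_threshold, zcr_threshold):
    """Voiced frames: high energy and low zero-crossing rate."""
    return (short_time_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)


if __name__ == "__main__":
    # Toy frames (hypothetical): a strong low-frequency tone versus a
    # weak signal that changes sign at every sample.
    voiced_like = [math.sin(2 * math.pi * 2 * t / 64) for t in range(64)]
    unvoiced_like = [0.05 * (-1) ** t for t in range(64)]
    for name, frame in (("voiced-like", voiced_like),
                        ("unvoiced-like", unvoiced_like)):
        print(name,
              "energy=%.4f" % short_time_energy(frame),
              "zcr=%.2f" % zero_crossing_rate(frame),
              "voiced=%s" % is_voiced(frame, 0.01, 0.3))
```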
Seven sound units of “My name is”
Dynamic Time Warping (DTW) Algorithm
Segments a Sample into Seven Sounds
• DTW operates on spectrographic data: amplitude x frequency x time
• To segment a speech sample into the seven sound units, the sample’s time sequence is "warped" non-linearly against a manually segmented sample.
Figure: sample warp path, showing the cost matrix and the warp path for the two time series represented along the axes
The decision to find the next point in the warp path W(i, j) is: [formula not captured in transcript]
If the warp path passes through D(i, j) then the sample Xi is warped to the point Yj. If there is a vertical section in the warp path, a single point in series X is warped to multiple points of series Y.
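Since the slide's own decision rule is not legible in this transcript, the sketch below uses the standard textbook DTW recurrence, D(i, j) = d(Xi, Yj) + min(D(i-1, j), D(i, j-1), D(i-1, j-1)), as an assumption about what the project used. It aligns two toy 1-D sequences; real input would be per-frame spectral vectors with a vector distance. The helper `transfer_boundaries`, which maps boundaries marked on a manually segmented reference onto a new sample, is a hypothetical illustration of how the warp path could drive segmentation.

```python
def dtw(x, y, dist=lambda a, b: abs(a - b)):
    """Return (total cost, warp path) aligning sequences x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # D[i][j] = minimum cumulative cost of aligning x[:i+1] with y[:j+1]
    D = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = dist(x[i], y[j])
            if i == 0 and j == 0:
                D[i][j] = cost
            else:
                D[i][j] = cost + min(
                    D[i - 1][j] if i > 0 else inf,
                    D[i][j - 1] if j > 0 else inf,
                    D[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
    # Backtrack from the last cell to recover the warp path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((D[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((D[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((D[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return D[n - 1][m - 1], path


def transfer_boundaries(path, reference_boundaries):
    """Map boundary indices in the reference y to indices in x.

    Each boundary maps to the first x index aligned with it.
    """
    return [min(i for i, j in path if j == b) for b in reference_boundaries]


if __name__ == "__main__":
    reference = [0, 0, 1, 1, 2, 2]      # manually segmented (toy data)
    sample = [0, 0, 0, 1, 2, 2, 2, 2]   # new utterance (toy data)
    cost, path = dtw(sample, reference)
    print("cost:", cost)
    print("boundaries in sample:", transfer_boundaries(path, [2, 4]))
```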
Feature Extraction
• Feature measurements reduce data and characterize the speaker
• The features extracted:
  • Energy mean and variance in each frequency band over the entire utterance (~13*2 = 26 features)
  • Energy mean in each frequency band within each of the 7 phonetic sounds (~13*7 = 91 features)
  • Voice Fundamental Frequency (F0) – not completed
  • Voice Formant Frequencies (F1-F3) – not completed
• Feature extractor output is a fixed-length vector appropriate as input to the Pace University Biometric Authentication System
Note: 13 is the number of frequency bands
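The two completed feature groups can be sketched as follows, to show how the counts 26 and 91 arise and how they combine into one fixed-length vector. This is an illustration under assumptions: the input layout (a list of frames, each holding 13 band energies), the use of population variance, the ordering of features in the vector, and the function names are all hypothetical, and the toy data is synthetic.

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance (ASSUMPTION: the slide does not say which)."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def extract_features(band_energies, segments):
    """Build a fixed-length feature vector.

    band_energies: list of frames, each a list of per-band energies.
    segments: list of (start, end) frame indices, end exclusive.
    """
    n_bands = len(band_energies[0])
    features = []
    # Mean and variance of each band over the entire utterance.
    for b in range(n_bands):
        column = [frame[b] for frame in band_energies]
        features.append(mean(column))
        features.append(variance(column))
    # Mean of each band within each segment.
    for start, end in segments:
        for b in range(n_bands):
            features.append(
                mean([frame[b] for frame in band_energies[start:end]])
            )
    return features


if __name__ == "__main__":
    # Synthetic data: 70 frames x 13 bands, 7 segments of 10 frames each.
    energies = [[float(t + b) for b in range(13)] for t in range(70)]
    segments = [(10 * k, 10 * (k + 1)) for k in range(7)]
    vector = extract_features(energies, segments)
    print(len(vector), "features")  # 13*2 + 13*7
```

The vector length depends only on the number of bands and segments, not on the utterance duration, which is what makes it usable as fixed-length input.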
System Performance
• Performance was measured on 20 sample utterances from each of 30
speakers, manually segmented into the seven sounds.
• Receiver Operating Characteristic (ROC) curves were obtained to find the
Equal Error Rate (EER) and system performance from two feature sets.
Feature Set                    Performance
Features from entire phrase    98.05%
Features from seven sounds     98.95%
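For readers unfamiliar with the Equal Error Rate, the sketch below shows one simple way to estimate it from match scores: scan candidate thresholds and pick the one where the false reject rate and false accept rate are closest. The scores are made up for illustration and are not project data. The slides do not describe how the Pace University Biometric Authentication System computes its ROC curves, so this should not be read as its method.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FRR, FAR) when scores >= threshold are accepted."""
    frr = sum(1 for s in genuine if s < threshold) / len(genuine)
    far = sum(1 for s in impostor if s >= threshold) / len(impostor)
    return frr, far


def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as a threshold.

    Returns (estimated EER, threshold at which it occurs).
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        frr, far = error_rates(genuine, impostor, t)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical similarity scores, for illustration only.
    genuine_scores = [0.9, 0.8, 0.7, 0.6, 0.35]
    impostor_scores = [0.1, 0.2, 0.3, 0.4, 0.65]
    eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
    print("EER ~ %.2f at threshold %.2f" % (eer, threshold))
```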