COMBINED SPEECH RECOGNITION AND SPEAKER …

Secure-Access System via Fixed
and Mobile Telephone Networks
using Voice Biometrics
Authors:
Anastasis Kounoudes, Anixi Antonakoudi,
Vasilis Kekatos
Introduction
We propose a double-digit voice biometric system for secure access to
telephone services. The system combines text-dependent speaker
authentication with text validation.
Main System Characteristics:
– Feature Extraction based on Perceptual Linear Prediction (PLP)
coefficients and Mel Frequency Cepstral Coefficients (MFCC)
– Concatenated phoneme HMMs for both speech recognition and
user authentication
– Operates in a sound-prompted mode.
• Speech recognition and speaker verification performance was
evaluated against:
– The length of the training data,
– The number of embedded re-estimations and Gaussian
mixtures in training of the HMMs,
– The use of world models and bootstrapping,
– User-dependent thresholds.
A. Kounoudes
System Overview
• User is voice-prompted for utterances to create speech samples.
• A front-end feature extractor calculates the voice features.
• Input speech is validated against the prompted utterance.
• Successful validation leads to verification.
• During the verification phase, the system verifies that the captured
speech matches the models of the enrolled user.
- The accumulated log-likelihood of the input speech frames against the
registered user's model is compared with a threshold to decide
whether to accept or reject the speaker.
• The system accepts or rejects the speaker.
The enrolment procedure is used by the system to create speaker-specific
phoneme HMMs for each user.
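The two-stage decision described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt string, the per-frame log-likelihood values and the threshold are all hypothetical, and the recognizer and HMM scoring that would produce them are assumed to exist elsewhere.

```python
def accumulated_log_likelihood(frame_log_likelihoods):
    """Sum the per-frame log-likelihoods of the input speech under one speaker model."""
    return sum(frame_log_likelihoods)


def secure_access(recognized_text, prompted_text, frame_log_likelihoods, threshold):
    """Text validation first, then speaker verification against a threshold."""
    # Stage 1: the recognized utterance must match what the user was prompted to say.
    if recognized_text != prompted_text:
        return "reject: text validation failed"
    # Stage 2: compare the accumulated score of the claimed speaker's model
    # with the decision threshold.
    score = accumulated_log_likelihood(frame_log_likelihoods)
    if score >= threshold:
        return "accept"
    return "reject: score below threshold"


# Hypothetical values, for illustration only.
prompt = "42 17"
genuine_frames = [-4.1, -3.8, -4.4, -3.9]
impostor_frames = [-7.2, -6.9, -8.1, -7.5]
threshold = -20.0

print(secure_access("42 17", prompt, genuine_frames, threshold))
print(secure_access("42 71", prompt, genuine_frames, threshold))
print(secure_access("42 17", prompt, impostor_frames, threshold))
```

Running validation before verification means a speaker model is scored only when the prompted text was actually spoken.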
System Architecture
Data Collection
• In-house database:
– Comprises data collected over a period of four
months over the GSM and PSTN networks.
– Contains speech samples from 23 speakers, which are
categorized for enrolment and verification purposes.
• YOHO-PSTN:
– Replica of the YOHO corpus recorded over the PSTN network
(using an analogue modem).
• YOHO-GSM Database:
– Replica of the YOHO corpus recorded over the GSM network
(using an analogue modem).
• The YOHO databases were used for initial training of the HMMs.
Text Validation
(Speech Recognition)
• Aim: evaluate the performance of the text validation over
the two telephone channels.
• The text validation performance is evaluated against:
– The number of embedded re-estimations used in training,
– The use of bootstrapping in training,
– The number of Gaussian mixtures in the HMMs,
– The use of PLP and MFCC coefficients.
Embedded Re-estimations
Evaluation: effect of the number of embedded re-estimations of the Baum-Welch algorithm on double-digit (DD) recognition performance.
Models used:
• 12 MFCC + Normalized Energy + Delta + Delta-Delta Coefficients
• Continuous density single Gaussian mono-phone HMMs (18)
• 3 left-to-right states
Results:
• 4 embedded re-estimations suffice.
• Performance converges asymptotically to the maximum for this
single-Gaussian-mixture (1 GM) system.
[Figure: % DD recognition (0 to 90) vs. embedded re-estimations (1 to 8); legend entries: 29, 52, 76, 81, All]
A. Antonakoudi
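The feature layout listed under "Models used" (12 MFCC plus normalized energy, with delta and delta-delta coefficients) gives 13 x 3 = 39 values per frame. The sketch below shows one common way to build such a vector. The regression formula, the window of 2 frames and the edge padding are assumptions, since the slides do not state how the dynamic coefficients were computed, and the input frames are synthetic.

```python
def deltas(frames, window=2):
    """Regression-based dynamic coefficients.

    d_t = sum_n n * (c_{t+n} - c_{t-n}) / (2 * sum_n n^2), n = 1..window.
    Edges are handled by repeating the first/last frame (an assumption).
    """
    denom = 2 * sum(n * n for n in range(1, window + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, last)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Append delta and delta-delta coefficients to each static frame."""
    d = deltas(static)
    dd = deltas(d)
    return [s + a + b for s, a, b in zip(static, d, dd)]


# Synthetic static frames: 12 MFCCs + 1 normalized energy = 13 values per frame.
static = [[float(t + k) for k in range(13)] for t in range(6)]
features = add_dynamic_features(static)
print(len(features), len(features[0]))
```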
Gaussian Mixtures
Evaluation against: number of Gaussian Mixtures (GM) per HMM state,
while keeping the number of embedded re-estimations at 4.
Results:
• Recognition performance increases with the number of GMs.
• Computational complexity grows with the number of GMs.
• The increase in performance from 4 to 8 GMs does not compensate for the
computational complexity, which almost doubles.
[Figure: % DD recognition (40 to 100) vs. Gaussian mixtures per state (1, 2, 4, 8); legend entries: 29, 52, 76, 81, All]
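To make the cost of additional mixtures concrete, here is a minimal sketch of the emission log-likelihood of one feature frame under a Gaussian mixture. Diagonal covariances are an assumption (the slides do not specify the covariance type), and all names and values are illustrative. Each component is evaluated once per frame, so in this sketch going from 4 to 8 components doubles the work of this step.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_mixture_likelihood(x, weights, means, variances):
    """log sum_k w_k N(x; mean_k, var_k), computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Illustrative 2-dimensional frame and a 2-component mixture.
frame = [0.5, -0.2]
weights = [0.6, 0.4]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
print(log_mixture_likelihood(frame, weights, means, variances))
```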
Use of YOHO-trained HMMs
Evaluation:
- Whether a pre-trained HMM prototype can result in better performance.
- Whether additional training will adapt the models to the Greek accent
and pronunciation of the speakers in the in-house database.
Experiment Setup:
• YOHO-PSTN trained HMM models for bootstrapping
• Additional training using the enrolment files of the In-house database
• Testing using the verification files of the database.
Results:
• Recognition performance increased by 2-4%.
[Figure: % DD recognition (80 to 100) vs. embedded re-estimations (1 to 4); series: 2 GM, 4 GM and 8 GM, each with and without bootstrap]
PLP and MFCC coefficients
Experiment Setup:
• YOHO-GSM and YOHO-PSTN databases were used to bootstrap additional
HMM training on 80% of the In-house database.
• The remaining 20% of the database was used for testing.
Results:
• PLP coefficients outperform MFCC (2-3% increase in performance).
• Cepstral Mean Subtraction (CMS) improves performance by
approximately 2%.
• The 8 GM DD recognizer with PLP+CMS (10 embedded re-estimations)
achieves 98.4% sentence recognition performance.
[Figure: Overall Performance. % sentence recognition (78 to 98) vs. embedded re-estimations (1, 2, 4, 6, 8, 10); series: PLPs, PLPs_CMS, MFCCs and MFCCs_CMS, each with 1, 2, 4 and 8 GM]
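Cepstral Mean Subtraction can be sketched in a few lines. The version below subtracts the per-utterance mean of each cepstral coefficient; whether the original system normalized per utterance or per session is not stated, so that choice is an assumption. The example shows that a constant offset added to every frame, a simple stand-in for a fixed channel effect in the cepstral domain, disappears after CMS.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean, computed over the whole utterance."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(dims)]
    return [[f[k] - means[k] for k in range(dims)] for f in frames]


# Synthetic 3-coefficient frames and the same frames with a constant channel offset.
clean = [[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [3.0, 3.0, 2.0]]
offset = [0.7, -1.2, 0.3]
shifted = [[c + o for c, o in zip(f, offset)] for f in clean]

norm_clean = cepstral_mean_subtraction(clean)
norm_shifted = cepstral_mean_subtraction(shifted)
print(norm_clean)
```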
Speaker Verification
Evaluation of the speaker verification performance of the system
against various parameters:
– The use of MFCC and PLP coefficients.
– The number of utterances used for training speaker-specific
HMMs.
– The selection of the speaker authentication decision threshold.
– The normalization of HMM scores through the use of a world
model.
MFCC and PLP coefficients
Experiment Setup:
• Single GM HMMs were trained for each speaker using the five
enrolment sessions (each session contains 10 DD utterances) of the In-house database.
• Each speaker is authenticated against all 23 HMM speaker models
using his/her 150 DD authentication utterances.
[Figure: 3-D plot of summed score (-1500 to 1500) over impostors and models, with the EER threshold plane]
• X axis: impostor speakers attacking each model
• Y axis: speaker-dependent HMM models
• Z axis: averaged HMM scores
• Horizontal plane: threshold for which FAR=FRR
• Main diagonal: speaker identification for the 23 sets of the In-house database speakers
SA using PSTN and GSM
Enrolment Data
Using 30 authentication sessions from each speaker, tests were performed to
evaluate speaker authentication performance in terms of False Acceptance Rate
(FAR), False Rejection Rate (FRR) and Equal Error Rate (EER).
[Figure: two bar-chart panels, "Enrolment using 5 Sessions" and "Enrolment using 5 + 1 Mobile Sessions", showing EER, FAR and FRR (%) for the MFCC, MFCC-CMS, PLP and PLP-CMS feature sets]
– CMS can improve speaker authentication performance when applied
either on MFCC or PLP feature sets.
– The use of PLP coefficients was found to improve the speaker
verification performance by 1-4% when compared to the MFCCs.
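For reference, the three error measures used on this slide can be computed from verification scores as sketched below. The scores are synthetic and the convention "accept when score >= threshold" is an assumption. The EER is approximated by scanning the observed scores as candidate thresholds and taking the point where FAR and FRR are closest.

```python
def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostor trials accepted. FRR: fraction of genuine trials rejected."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) at the candidate threshold where |FAR - FRR| is smallest."""
    best = None
    for th in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, th)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, th)
    return best[1], best[2]


# Synthetic scores, for illustration only.
genuine_scores = [2.0, 3.0, 4.0, 5.0]
impostor_scores = [0.0, 1.0, 2.5, 3.5]
eer, eer_threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, eer_threshold)
```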
HMM SA Decision Threshold
• Applying the threshold that corresponds to FAR=FRR, the individual FAR
and FRR for each speaker can be estimated.
• Observation: FAR is not equal to FRR for every speaker, and in some cases the
deviation is considerable.
• Repeating the tests using PLPs and CMS, but calculating the EER as the mean
of the individual EERs of each speaker, showed that:
- The EER dropped significantly, from 3.52% to 1.14%.
- The decision threshold estimated as the average of the individual thresholds
produced a much better EER than the one estimated by averaging the
utterance scores.
[Figure: two per-speaker bar charts for speakers 1 to 23, "Speaker FRR - 5 + Mobile Sessions" and "Speaker FAR - 5 + Mobile Sessions" (%), each comparing MFCC, MFCC-CMS, PLP and PLP-CMS]
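The effect of speaker-dependent thresholds can be illustrated with a toy example. The sketch compares an EER computed from pooled scores with a single global threshold against the mean of per-speaker EERs, each found with its own threshold. The two speakers and their scores are invented and deliberately exaggerated; they do not reproduce the 3.52% and 1.14% figures reported above.

```python
def eer_at_best_threshold(genuine, impostor):
    """Approximate EER: scan observed scores, keep the point where |FAR - FRR| is smallest."""
    best_gap, best_eer = None, None
    for th in sorted(set(genuine) | set(impostor)):
        far = sum(s >= th for s in impostor) / len(impostor)
        frr = sum(s < th for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Hypothetical speakers whose models produce scores on very different scales.
trials = {
    "speaker_a": {"genuine": [10.0, 11.0, 12.0], "impostor": [5.0, 6.0, 7.0]},
    "speaker_b": {"genuine": [0.0, 1.0, 2.0], "impostor": [-5.0, -4.0, -3.0]},
}

# One global threshold: pool all scores before computing the EER.
pooled_genuine = [s for t in trials.values() for s in t["genuine"]]
pooled_impostor = [s for t in trials.values() for s in t["impostor"]]
global_eer = eer_at_best_threshold(pooled_genuine, pooled_impostor)

# Speaker-dependent thresholds: compute each speaker's EER, then average.
per_speaker = [eer_at_best_threshold(t["genuine"], t["impostor"]) for t in trials.values()]
mean_individual_eer = sum(per_speaker) / len(per_speaker)

print(global_eer, mean_individual_eer)
```

In this toy data each speaker is perfectly separable on their own scale, but no single threshold separates both, which is the situation the per-speaker FAR and FRR deviations point to.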
Normalization using World Model
• We investigated whether using a world model to normalize
the HMM scores of each individual improves the overall SA performance.
• The world model is a universal speaker model built
from a pool of speech utterances produced by various speakers.
• In the present evaluations, a world model was pre-trained using all the
speakers in the enrolment data of the In-house and the two YOHO
databases, and speaker authentication was evaluated using the verification
part of the In-house database.
• Tests showed that:
- The EER calculated as the mean of the individual EERs of each speaker was
0.094% with a world model, compared with 1.14% for identical tests
without a world model.
- Using the world model to normalize verification scores can significantly
improve speaker authentication performance.
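One common form of world-model normalization is a log-likelihood ratio between the claimed speaker's model and the world model. The slides do not give the exact formula used, so the ratio, the division by the number of frames and the threshold of 0 in this sketch are assumptions, and all scores are invented.

```python
def world_normalized_score(speaker_log_likelihood, world_log_likelihood, num_frames):
    """Per-frame log-likelihood ratio between the speaker model and the world model."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return (speaker_log_likelihood - world_log_likelihood) / num_frames


def accept(speaker_log_likelihood, world_log_likelihood, num_frames, threshold=0.0):
    """Accept when the speaker model explains the speech better than the world model."""
    score = world_normalized_score(speaker_log_likelihood, world_log_likelihood, num_frames)
    return score >= threshold


# Hypothetical accumulated log-likelihoods for 40-frame utterances.
print(world_normalized_score(-160.0, -200.0, 40))  # genuine speaker, clean channel
print(world_normalized_score(-260.0, -300.0, 40))  # genuine speaker, degraded channel
print(world_normalized_score(-230.0, -200.0, 40))  # impostor
```

Because a degraded channel lowers both likelihoods, the ratio stays comparable across calls even when the raw speaker-model score does not.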
Conclusions
Voice Biometric System:
• Text-dependent concatenated-phoneme HMM-based speaker verifier
• Concatenated-phoneme HMM-based speech recognizer
• Sound-prompted operation over the PSTN and GSM
• Evaluation using a custom In-house database + 2 versions of YOHO.
Text Validation Evaluation:
• 4 GMs offer a good trade-off between accuracy and complexity.
• DD speech recognition performance converges asymptotically after 4 embedded re-estimations of the Baum-Welch algorithm.
• Bootstrapping initial HMM training results in an improvement of performance.
• CMS improves double-digit recognition performance by approximately 2%.
• PLP coefficients outperform MFCCs when speech is recorded via different channels.
Speaker Verification Evaluation:
• CMS increases HMM speaker authentication performance (MFCC and PLP).
• PLP coefficients produce approximately 2% better performance than MFCCs.
• Speaker dependent thresholds and the use of a world model further improve speaker
verification performance resulting in EER=0.094%.