Data Mining and Bioinformatics: Some Challenges

Age and Gender Recognition
from Speech Patterns Based on
Supervised Non-Negative Matrix Factorization
Mohamad Hasan Bahari
Hugo Van hamme
July 2011
1
Outline
• Introduction and Motivations
• Age and Gender Recognition
• Corpora
• Supervised Non-negative Matrix Factorization
• Proposed Method
• Results
• Conclusions and Future Research
2
Introduction
• Confirming the identity of individuals
• Biometric characteristics
  • Fingerprint
  • Face
  • Iris
  • Hand geometry
  • Ear shape
  • Voice pattern
  • …
• Choosing a characteristic
  • Availability
  • Reliability
3
Motivation
• In many real-world cases, only speech patterns are available
  (kidnapping, threatening calls, …)
• Speech patterns can carry a lot of useful information
  • Gender
  • Age
  • Dialect (original or previous regions)
  • Membership of a particular social group
  • …
• To help identify a criminal
• To narrow down the number of suspects
4
Goal
Goal:
To extract different physical and psychological characteristics of
the speaker from his/her voice patterns (Speaker Profiling).
Physical:
1. Gender
2. Age
3. Accent
4. …
Psychological:
1. Anxiousness
2. Stress
3. Confidence
4. …
5
Age and Gender Recognition
Three approaches:
I. Directly from speech signal.
II. Modeling the speech generation system.
III. Modeling the hearing system.
6
Age and Gender Recognition
I. Directly from speech signal.
• Different acoustic features vary with age:
  1) Fundamental frequency
  2) Speech rate
  3) Sound pressure level
  4) …
• Approach: find all acoustic features that vary with age and their exact relation
  to the speaker's age.
• Advantage: conceptually simple and computationally inexpensive.
• Drawback: these features are also affected by many other parameters, such as weight,
  height, voice quality, emotional condition, …
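
The first feature listed above, fundamental frequency, can be estimated in several ways. The following is a minimal sketch using the autocorrelation peak of a single frame; the function name, the 50 to 400 Hz search range, and the synthetic test tone are illustrative assumptions, not part of the presented method.

```python
import numpy as np

def estimate_f0(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency (Hz) of one frame from its autocorrelation peak."""
    frame = frame - frame.mean()
    # Keep the non-negative lags 0, 1, 2, ... of the full autocorrelation.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f_max)   # shortest period searched
    lag_max = int(sample_rate / f_min)   # longest period searched
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

sample_rate = 16000
t = np.arange(int(0.1 * sample_rate)) / sample_rate
frame = np.sin(2 * np.pi * 120.0 * t)    # synthetic 120 Hz tone standing in for voiced speech
f0 = estimate_f0(frame, sample_rate)
```

On real speech the estimate also depends on voicing, noise, and the other speaker parameters mentioned above, which is exactly why a single feature is not enough to infer age.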
7
Age and Gender Recognition
Effect of Age and Gender on speech (Fundamental frequency) [1]
Age is only one of the inputs affecting speech and, consequently, the acoustic
features.
It is impossible to estimate age without considering the other inputs.
Perceptions of gender and age have a significant mutual impact on each
other.
[1] W. S. Brown, R. J. Morris, H. Hollien, and E. Howell, Journal of Voice, vol. 5, pp. 310–315, 1991.
8
Age and Gender Recognition
II. Modeling the speech generation system.
• It is an input estimation problem.
• Drawback: modeling the speech generation system of the speaker is very
  difficult.
9
Age and Gender Recognition
III. Modeling the hearing system
• To solve the speech recognition problem, the hearing system is
  modeled using Hidden Markov Models (HMMs).
• This approach reuses the tools applied in speech recognition problems (HMMs).
• Advantage: well established.
• Advantage: accurate in recognizing content.
• Drawback: there is a difference between the age of a speaker as perceived
  and their actual age.
• Drawback: computationally complex.
10
Corpora
• 555 speakers from the N-best evaluation corpus [1]
• The corpus contains live and read commentaries, news, interviews, and
  reports broadcast in Belgium
• Different age groups and genders

Category Name    Age      Number of Speakers
Young Male       18-35    85
Young Female     18-35    53
Middle Male      36-45    160
Middle Female    36-45    41
Senior Male      46-81    191
Senior Female    46-81    25
[1] D. A. Van Leeuwen, J. Kessens, E. Sanders, and H. van den Heuvel, In proc. Interspeech, pp. 2571-2574, 2009.
11
SNMF
• Non-negative Matrix Factorization (NMF) is a popular machine
  learning algorithm [1]
• It can be used in supervised or unsupervised mode.
• Supervised NMF (SNMF) is a pattern recognition method [1]
• It is very effective when the input space is high-dimensional.
• It is a generative classifier.
• It can directly classify patterns into multiple classes (no need to
  decompose the problem into multiple binary classifications).
[1] H. Van hamme, In proc. Interspeech, Australia, pp. 2554-2557, 2008.
12
SNMF
Problem Statement:
Given a training data set:
$S^{tr} = \{(x_1, y_1), \ldots, (x_n, y_n), \ldots, (x_N, y_N)\}$
$x_n$ is a vector of observed characteristics for the n-th data item
$y_n$ denotes a label vector which represents the class that $x_n$ belongs to
Goal:
To approximate a classifier function $g$ such that $\hat{y} = g(x^{tst})$ is as
close as possible to the true label.
$x^{tst}$ is an unseen observation
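
As a minimal sketch of how such a training set might be encoded, the example below assumes that each label vector is a one-hot indicator over six age-gender classes; the slides do not state the exact encoding, and the class abbreviations, function name, and toy observations are illustrative.

```python
import numpy as np

CLASSES = ["YM", "YF", "MM", "MF", "SM", "SF"]  # six age-gender categories (assumed order)

def one_hot_labels(labels, classes=CLASSES):
    """Return a (num_classes, N) matrix whose n-th column is the label vector y_n."""
    index = {name: i for i, name in enumerate(classes)}
    Y = np.zeros((len(classes), len(labels)))
    for n, name in enumerate(labels):
        Y[index[name], n] = 1.0
    return Y

# Toy example: three speakers with hypothetical 4-dimensional non-negative observations.
labels = ["YM", "SF", "MM"]
X = np.abs(np.random.default_rng(0).normal(size=(4, len(labels))))  # columns are x_n
Y = one_hot_labels(labels)                                          # columns are y_n
```

Keeping both the observations and the labels non-negative is what allows them to be factorized jointly.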
13
SNMF
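SNMF in Training Phase:
First step:
\[
V_S^{tr} = [\, y_1 \;\cdots\; y_N \,], \qquad
V_B^{tr} = [\, x_1 \;\cdots\; x_N \,], \qquad
V^{tr} = \begin{bmatrix} V_S^{tr} \\ V_B^{tr} \end{bmatrix}
\]
Second step:
\[
V^{tr} \approx W^{tr} H^{tr}
\quad\Longleftrightarrow\quad
\begin{bmatrix} V_S^{tr} \\ V_B^{tr} \end{bmatrix}
\approx
\begin{bmatrix} W_S^{tr} \\ W_B^{tr} \end{bmatrix} H^{tr}
\]
Extended Kullback-Leibler divergence:
\[
D_{KL}\left( V^{tr} \,\|\, W^{tr} H^{tr} \right)
= \sum_{m,n} \left( V_{mn}^{tr} \log \frac{V_{mn}^{tr}}{(W^{tr} H^{tr})_{mn}} - V_{mn}^{tr} + (W^{tr} H^{tr})_{mn} \right)
+ \lambda \sum_{z,n} H_{zn}^{tr}
\]
Multiplicative updating formula ($\odot$ and the fraction bars denote element-wise multiplication and division):
\[
W^{tr} \leftarrow W^{tr} \odot
\frac{\dfrac{V^{tr}}{W^{tr} H^{tr}} \, (H^{tr})^T}{1_{M \times N} \, (H^{tr})^T}
\]
\[
H^{tr} \leftarrow H^{tr} \odot
\frac{(W^{tr})^T \, \dfrac{V^{tr}}{W^{tr} H^{tr}}}{(W^{tr})^T \, 1_{M \times N} + \lambda}
\]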
14
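
The training phase can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it uses the standard multiplicative updates for the extended Kullback-Leibler divergence, applies no sparsity penalty or normalization, and the function names, rank, iteration count, and random toy data are all hypothetical.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Extended Kullback-Leibler divergence D(V || WH), summed over all entries."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

def snmf_train(V_S, V_B, rank, n_iter=200, seed=0, eps=1e-12):
    """Factorize the stacked matrix [V_S; V_B] ~ [W_S; W_B] H with multiplicative updates."""
    V = np.vstack([V_S, V_B])            # labels on top, observations below
    M, N = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((M, rank)) + 0.1      # strictly positive initialization
    H = rng.random((rank, N)) + 0.1
    ones = np.ones((M, N))
    history = []
    for _ in range(n_iter):
        W *= ((V / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
        H *= (W.T @ (V / (W @ H + eps))) / (W.T @ ones + eps)
        history.append(kl_divergence(V, W @ H))
    n_labels = V_S.shape[0]
    return W[:n_labels], W[n_labels:], H, history

rng = np.random.default_rng(1)
V_B = rng.random((20, 30))                    # hypothetical features, one column per speaker
V_S = np.eye(3)[:, rng.integers(0, 3, 30)]    # one-hot labels for 3 hypothetical classes
W_S, W_B, H, history = snmf_train(V_S, V_B, rank=5)
```

Because the label rows and the observation rows share the same activation matrix H, the learned label part W_S ties each basis vector in W_B to class evidence.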
SNMF
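SNMF in Testing Phase:
\[
\hat{y} = g(x^{tst}) = W_S^{tr} \, \arg\min_{H^{tst}} D_{KL}\left( x^{tst} \,\|\, W_B^{tr} H^{tst} \right)
\]
First step:
\[
x^{tst} \approx W_B^{tr} H^{tst}
\]
Second step:
\[
\hat{y} = g(x^{tst}) = W_S^{tr} H^{tst}
\]
Extended Kullback-Leibler divergence:
\[
D_{KL}\left( x^{tst} \,\|\, W_B^{tr} H^{tst} \right)
= \sum_{m} \left( x_m^{tst} \log \frac{x_m^{tst}}{(W_B^{tr} H^{tst})_m} - x_m^{tst} + (W_B^{tr} H^{tst})_m \right)
+ \lambda \sum_{z} H_z^{tst}
\]
Multiplicative updating formula ($\odot$ and the fraction bars denote element-wise multiplication and division):
\[
H^{tst} \leftarrow H^{tst} \odot
\frac{(W_B^{tr})^T \, \dfrac{x^{tst}}{W_B^{tr} H^{tst}}}{(W_B^{tr})^T \, 1_{M \times 1} + \lambda}
\]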
15
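
The testing phase can be sketched in the same way. In this minimal example the trained matrices are replaced by small hand-made stand-ins, no sparsity penalty is applied, and the function name and all numeric values are hypothetical; only the activation vector is updated while the basis stays fixed.

```python
import numpy as np

def snmf_classify(x, W_S, W_B, n_iter=200, eps=1e-12):
    """Estimate h for x ~ W_B h with W_B fixed, then return scores y_hat = W_S h."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    h = np.full((W_B.shape[1], 1), 1.0)          # positive initial activations
    col_sums = W_B.T @ np.ones_like(x)           # (W_B)^T 1
    for _ in range(n_iter):
        h *= (W_B.T @ (x / (W_B @ h + eps))) / (col_sums + eps)
    y_hat = (W_S @ h).ravel()
    return y_hat, int(np.argmax(y_hat))

# Hypothetical trained matrices: two basis vectors, each tied to one class.
W_B = np.array([[0.9, 0.1],
                [0.8, 0.2],
                [0.1, 0.9],
                [0.2, 0.8]])
W_S = np.eye(2)
x_tst = 3.0 * W_B[:, 1]                          # observation built from the second basis vector
y_hat, predicted = snmf_classify(x_tst, W_S, W_B)
```

Taking the largest entry of the estimated label vector as the decision is one simple choice; the slides do not say how the final class is picked.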
Proposed Method
1. Feature selection
2. Acoustic modeling
3. Supervector making procedure
4. Training phase
5. Testing phase
16
Proposed Method
1. Feature selection
• Mel spectra
• Mean normalization
• Vocal tract length normalization
• Augmented with their first- and second-order time derivatives
[Diagram: Speech Signal → Feature selection → Feature Vectors]
17
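
Two of the feature-processing steps named on this slide, mean normalization and augmentation with time derivatives, can be sketched as below. This is only an illustration: the derivative is approximated with `numpy.gradient`, vocal tract length normalization is left out, and the frame count and number of Mel bins are assumed values, since the slides give no such details.

```python
import numpy as np

def add_deltas(features):
    """Mean-normalize (frames x bins) features and append first and second time derivatives."""
    normalized = features - features.mean(axis=0, keepdims=True)
    delta = np.gradient(normalized, axis=0)    # first-order time derivative
    delta2 = np.gradient(delta, axis=0)        # second-order time derivative
    return np.hstack([normalized, delta, delta2])

# Hypothetical input: 100 frames of 22 Mel-spectral coefficients.
mel = np.abs(np.random.default_rng(0).normal(size=(100, 22)))
feats = add_deltas(mel)
```

Each frame then carries three times as many coefficients as the static Mel spectrum alone.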
Proposed Method
2. Acoustic modeling
[Diagram: Speaker Independent Model → Speaker Adaptation Method → Model of the Speaker]
Speaker-independent model:
• An HMM with a shared pool of 49740 Gaussians to model the observations in 3873 cross-word
  context-dependent tied triphone states.
Adaptation method:
• The speaker-dependent mixture weights for each speaker result from a re-estimation of the
  speaker-independent weights, based on a forced alignment of that speaker's training data
  using a speaker-independent acoustic model.
The result of this step is 555 speaker-adapted models.
18
Proposed Method
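3. Supervector making procedure
The Gaussian Mixture Model (GMM) of each speaker-adapted HMM is:
\[
f_s(o_t) = \sum_{j=1}^{J_s} w_j^s \, \mathcal{N}(o_t, \mu_j^s, \Sigma_j^s)
\]
Three types of supervectors:
1. Means
2. Variances
3. Weights
Weights supervector: for each s, the mixture weights are collected in the vector
\[
\mathrm{fr}_s \, [\, w_1^s \;\cdots\; w_q^s \;\cdots\; w_Q^s \,]^T
\]
and the supervector of speaker n stacks these vectors for s = 1, …, S.
The result of this step is 555 supervectors, one for each of the 555 speakers.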
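
The stacking of per-state mixture weights into one vector per speaker can be sketched as follows. The function name, the toy weights, and the state sizes are hypothetical, and the per-state factor is set to 1 because the slides do not define fr_s.

```python
import numpy as np

def weight_supervector(state_weights, state_factors):
    """Stack scaled per-state mixture-weight vectors into one supervector."""
    parts = [f * np.asarray(w, dtype=float) for w, f in zip(state_weights, state_factors)]
    return np.concatenate(parts)

# Hypothetical speaker model with 3 states holding 2, 3 and 2 Gaussians.
state_weights = [[0.7, 0.3], [0.2, 0.5, 0.3], [0.6, 0.4]]
state_factors = [1.0, 1.0, 1.0]   # stand-in for fr_s, whose definition the slides do not give
supervector = weight_supervector(state_weights, state_factors)
```

Mixture weights are non-negative by construction, so a supervector built this way can be fed to a non-negative factorization without further transformation.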
19
Proposed Method
4. Training phase
5. Testing phase
20
Results
Evaluation Methodology
• 5-fold cross-validation (five independent runs)
• In each of the five runs:
  • Training set: speech data of 444 speakers
  • Testing set: speech data of 111 speakers

Database partitions (TR = training, TST = testing):
Run 1:  TST  TR   TR   TR   TR
Run 2:  TR   TST  TR   TR   TR
...
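
A minimal sketch of such a split over speaker indices is given below. It assumes a random, unstratified partition; the slides do not say how speakers were assigned to folds, and the function name and seed are illustrative.

```python
import numpy as np

def five_fold_splits(n_speakers=555, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each speaker is tested exactly once."""
    order = np.random.default_rng(seed).permutation(n_speakers)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[i] for i in range(n_folds) if i != k])
        yield train_idx, test_idx

splits = list(five_fold_splits())
```

With 555 speakers and five folds, every run trains on 444 speakers and tests on the remaining 111.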
21
Results
Gender recognition accuracy is 96%.

Relative confusion matrix (rows: CL, columns: AC)
CL \ AC   YM    YF    MM    MF    SM    SF
YM        13    02    06    0     03    0
YF        03    77    01    54    01    2
MM        58    04    44    02    19    08
MF        0     11    01    24    0     28
SM        26    057   47    17    76    28
SF        0     0     0     02    0     16

Age group recognition
Category Name    Prior (%)    Accuracy (%)
Young Male       15           13
Young Female     10           77
Middle Male      29           44
Middle Female    7            24
Senior Male      34           76
Senior Female    4            16
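
The relation between the speaker counts of the corpus, the class priors, and a chance-level baseline can be sketched as below. The class abbreviations are shorthand for the six categories, and the overall figure assumes that the Accuracy column is the per-class recognition rate in percent, which the slides do not state explicitly.

```python
# Speaker counts per category, as listed for the corpus.
counts = {"YM": 85, "YF": 53, "MM": 160, "MF": 41, "SM": 191, "SF": 25}
# Per-class accuracy in percent, read from the age group recognition table.
accuracy = {"YM": 13, "YF": 77, "MM": 44, "MF": 24, "SM": 76, "SF": 16}

total = sum(counts.values())
prior = {c: 100.0 * n / total for c, n in counts.items()}   # class priors in percent
majority_baseline = max(prior.values())                      # always guess the largest class
# Count-weighted mean of the per-class accuracies (an assumption-based summary).
overall = sum(counts[c] * accuracy[c] for c in counts) / total
```

Comparing `overall` with `majority_baseline` is one way to quantify how far the result lies above chance.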
22
Conclusions and Future Research
Conclusions:
1. A new age-gender recognition method based on SNMF
2. Supervectors of GMM weights were used
3. Evaluated on the N-best corpus
4. Gender recognition accuracy is 96%
5. Age group recognition accuracy is significantly higher than chance level
Future Research:
1. Age estimation instead of age group recognition
2. Using supervectors of GMM means and variances and combining these features
23
Thank You for Your Attention
24