DTW for Speech Recognition

J.-S. Roger Jang (張智星)
http://www.cs.nthu.edu.tw/~jang
MIR Lab (Multimedia Information Retrieval Lab)
CS, Tsing Hua Univ. (Department of Computer Science, Tsing Hua University)
Dynamic Time Warping (DTW)
Characteristics:
Pattern-matching-based approach
Requires less memory and computation
Suitable for speaker-dependent recognition
Suitable for small to medium vocabulary
Suitable for microprocessor/chip implementation
Applications
Speaker identification and verification for surveillance
Voice commands for mobile phones and toys
Dynamic Time Warping: Type 1
t: input MFCC matrix
(Each column is a frame’s feature.)
r: reference MFCC matrix
Local paths: 27-45-63 degrees
DTW recurrence:
$$
D(i, j) = \|t(i) - r(j)\| + \min\left\{
\begin{array}{l}
D(i-1, j-2) \\
D(i-1, j-1) \\
D(i-2, j-1)
\end{array}
\right\}
$$
[Figure: DTW grid with input frames t(i-1), t(i) along the i axis, reference frames r(j-1), r(j) along the j axis, and cell D(i, j).]
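As a minimal sketch of the type-1 recurrence above, the following Python function fills the full table with both corners fixed. The names `frame_dist` and `dtw_type1`, the list-of-frames input layout, and the Euclidean frame distance are assumptions made for illustration and do not come from the slides.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_type1(t, r):
    """Type-1 DTW (27-45-63 local paths) with both corners fixed.

    t, r: sequences of frames; each frame is a sequence of floats.
    Returns the total distance, or math.inf if no valid path exists.
    """
    m, n = len(t), len(r)
    INF = math.inf
    D = [[INF] * n for _ in range(m)]
    D[0][0] = frame_dist(t[0], r[0])          # path must start at (0, 0)
    for i in range(1, m):
        for j in range(1, n):
            best = D[i - 1][j - 1]            # 45-degree predecessor
            if j >= 2:
                best = min(best, D[i - 1][j - 2])   # 63-degree predecessor
            if i >= 2:
                best = min(best, D[i - 2][j - 1])   # 27-degree predecessor
            if best < INF:
                D[i][j] = frame_dist(t[i], r[j]) + best
    return D[m - 1][n - 1]                    # path must end at the far corner
```

With these local paths, a reference more than about twice as long (or less than half as long) as the input cannot be reached at all, so the function returns infinity in that case.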
Dynamic Time Warping: Type 2
t: input MFCC matrix
(Each row is a frame’s feature.)
r: reference MFCC matrix
Local paths: 0-45-90 degrees
DTW recurrence:
$$
D(i, j) = \|t(i) - r(j)\| + \min\left\{
\begin{array}{l}
D(i, j-1) \\
D(i-1, j-1) \\
D(i-1, j)
\end{array}
\right\}
$$
[Figure: DTW grid with input frames t(i-1), t(i) along the i axis, reference frames r(j-1), r(j) along the j axis, and cell D(i, j).]
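The type-2 recurrence above can be sketched in Python as follows. The function name `dtw_type2`, the helper `frame_dist`, the Euclidean frame distance, and the list-of-frames input layout are illustrative assumptions. The table carries one extra row and column of infinite cost so that the boundary cells need no special cases.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_type2(t, r):
    """Type-2 DTW (0-45-90 local paths) with both corners fixed.

    t, r: sequences of frames; each frame is a sequence of floats.
    """
    m, n = len(t), len(r)
    INF = math.inf
    # D[i][j] holds the cost of aligning t[:i] with r[:j]; row 0 and
    # column 0 are padding.
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = frame_dist(t[i - 1], r[j - 1]) + min(
                D[i][j - 1],      # 90-degree step: advance r only
                D[i - 1][j - 1],  # 45-degree step: advance both
                D[i - 1][j],      # 0-degree step: advance t only
            )
    return D[m][n]
```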
Local Path Constraints
Type 1: 27-45-63 local paths
Predecessors of D(i, j): D(i-1, j-2), D(i-1, j-1), D(i-2, j-1)
$$
D(i, j) = \|t(i) - r(j)\| + \min\left\{
\begin{array}{l}
D(i-1, j-2) \\
D(i-1, j-1) \\
D(i-2, j-1)
\end{array}
\right\}
$$
Type 2: 0-45-90 local paths
Predecessors of D(i, j): D(i, j-1), D(i-1, j-1), D(i-1, j)
$$
D(i, j) = \|t(i) - r(j)\| + \min\left\{
\begin{array}{l}
D(i, j-1) \\
D(i-1, j-1) \\
D(i-1, j)
\end{array}
\right\}
$$
Path Penalty for Type-1 DTW
Path penalty
No penalty for the 45-degree path
Some penalty for paths that deviate from 45 degrees
$$
D(i, j) = \|t(i) - r(j)\| + \min\left\{
\begin{array}{l}
D(i-1, j-2) + \eta \\
D(i-1, j-1) \\
D(i-2, j-1) + \eta
\end{array}
\right\}
$$
Here $\eta \ge 0$ denotes the path penalty.
[Figure: local paths into D(i, j) from D(i-2, j-1), D(i-1, j-1), and D(i-1, j-2); the 45-degree path carries penalty 0.]
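A minimal sketch of the penalized type-1 recurrence, under the assumption that the same penalty applies to both off-diagonal paths. The parameter name `eta`, the function name `dtw_type1_penalized`, and the Euclidean `frame_dist` helper are hypothetical names chosen for this example.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_type1_penalized(t, r, eta=0.0):
    """Type-1 DTW where the 27- and 63-degree paths each cost an extra eta."""
    m, n = len(t), len(r)
    INF = math.inf
    D = [[INF] * n for _ in range(m)]
    D[0][0] = frame_dist(t[0], r[0])
    for i in range(1, m):
        for j in range(1, n):
            candidates = [D[i - 1][j - 1]]                # 45 degrees: no penalty
            if j >= 2:
                candidates.append(D[i - 1][j - 2] + eta)  # 63 degrees
            if i >= 2:
                candidates.append(D[i - 2][j - 1] + eta)  # 27 degrees
            D[i][j] = frame_dist(t[i], r[j]) + min(candidates)
    return D[m - 1][n - 1]
```

Setting `eta=0.0` reduces this to the unpenalized type-1 recurrence; larger values favor paths that stay close to the diagonal.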
DTW Paths of “Match Corners”
We assume the speed of a user’s acoustic input falls between 1/2 and 2 times that of the intended sentence.
Both corners are fixed. (Endpoint detection is critical.)
Suitable for voice command applications
[Figure: feasible DTW paths in the (i, j) plane with both corners fixed.]
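The speed assumption above translates into a simple length check before running DTW. The sketch below assumes type-1 local paths (steps of (1, 2), (1, 1), and (2, 1)) and both corners fixed; the function name `corners_reachable` is hypothetical.

```python
def corners_reachable(m, n):
    """Return True if a type-1 path can join (0, 0) to (m - 1, n - 1).

    m: number of input frames, n: number of reference frames.
    Each step advances one axis by 1 and the other by 1 or 2, so the
    total advance along one axis can be at most twice the other.
    """
    x, y = m - 1, n - 1
    return x <= 2 * y and y <= 2 * x


# Example: a 3-frame input can match a 5-frame reference but not a 6-frame one.
print(corners_reachable(3, 5), corners_reachable(3, 6))
```

Rejecting infeasible length pairs up front avoids filling a table whose far corner can never be reached.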
DTW Paths of “Match Anywhere”
No fixed anchor positions
Suitable for retrieval of personal spoken documents
[Figure: DTW paths in the (i, j) plane with no fixed start or end points.]
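One possible reading of "match anywhere" is that the whole query must be matched but may start and end at any frame of the reference. The sketch below implements that reading with type-2 local paths; this interpretation, the function name `dtw_match_anywhere`, and the Euclidean `frame_dist` helper are all assumptions, since the slides do not spell out the boundary conditions.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_match_anywhere(query, ref):
    """Match all of `query` against any contiguous part of `ref`.

    Returns (total distance, index of the reference frame where the
    best match ends).
    """
    m, n = len(query), len(ref)
    INF = math.inf
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        D[0][j] = 0.0                 # free start: begin at any reference frame
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = frame_dist(query[i - 1], ref[j - 1]) + min(
                D[i][j - 1], D[i - 1][j - 1], D[i - 1][j]
            )
    # Free end: pick the cheapest cell in the last query row.
    end = min(range(1, n + 1), key=lambda j: D[m][j])
    return D[m][end], end - 1
```

For example, `dtw_match_anywhere([[5.0], [6.0]], [[0.0], [1.0], [5.0], [6.0], [9.0]])` finds the query at reference frames 2 to 3 with zero distance.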
Other Variants
Local constraints
Start/ending area
Implementation Issues
To save memory
Use 2-column table for type-1 DTW
Use 1-column table for type-2 DTW
To avoid too many if-then statements
Pad type-1 DTW with two-layer padding
Pad type-2 DTW with one-layer padding
To find a suitable path
Minimizing total distance
Minimizing average distance
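The memory-saving and padding ideas can be combined. The sketch below computes the type-1 total distance while keeping only two columns, each with two padding cells of infinite cost so that the inner loop needs no boundary tests. The function name `dtw_type1_two_columns`, the Euclidean `frame_dist` helper, and the in-place update order are assumptions for illustration, and only the total-distance criterion is covered.

```python
import math


def frame_dist(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_type1_two_columns(t, r):
    """Type-1 DTW total distance using a two-column table.

    Row j of a column is stored at index j + 2; indices 0 and 1 are
    padding cells that always stay at infinity.
    """
    m, n = len(t), len(r)
    INF = math.inf
    older = [INF] * (n + 2)              # column i - 2 (all padding at first)
    newer = [INF] * (n + 2)              # column i - 1
    newer[2] = frame_dist(t[0], r[0])    # column 0: only cell (0, 0) is valid
    for i in range(1, m):
        # Overwrite the older column in place, highest row first, so that
        # older[j + 1] still holds the column i - 2 value when it is read.
        for j in range(n - 1, -1, -1):
            best = min(newer[j],         # D(i-1, j-2)
                       newer[j + 1],     # D(i-1, j-1)
                       older[j + 1])     # D(i-2, j-1)
            older[j + 2] = frame_dist(t[i], r[j]) + best
        older, newer = newer, older      # `newer` now holds column i
    return newer[n + 1]                  # cell (m - 1, n - 1)
```

Only the total distance survives this scheme; recovering the path itself would require storing back-pointers for the full table.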
DTW Path of “Match Corners”
DTW Path of “Match Anywhere”
DTW total distance = 304.957
[Figure: “Match Anywhere” DTW path aligning the utterance “清華大學” (“Tsing Hua University”) with the sentence “我今天很高興來到清華大學進行演講” (“I am very happy to come to Tsing Hua University today to give a talk”).]
-13-
DTW for Spoken Document Retrieval
Applications
Voice-based audio/video retrieval
Issues in SDR using DTW
Speaker normalization
Vocal tract length normalization (VTLN)
Frequency warping
Efficiency
DTW for Speaker-independent Voice Command Recognition
Applications
Digit recognition
Technical highlights
Extensive recordings
Clustering within each command
Some indexing methods for DTW
Suitable for small-vocabulary applications