Automatic Speaker Recognition for Series 60 Mobile Devices


Specom’2004, Sep 20, 2004
Automatic Speaker
Recognition for Series 60
Mobile Devices
Juhani Saastamoinen, Evgeny Karpov,
Ville Hautamäki, and Pasi Fränti
University of Joensuu,
Department of Computer Science
University of Joensuu
Dept. of Computer Science
P.O. Box 111
FIN- 80101 Joensuu
Tel. +358 13 251 7959
fax +358 13 251 7955
www.cs.joensuu.fi
Background
• Project in National FENIX programme
– New Methods and Applications in Speech
Technology
• 7 research institutes
• Project partners: NRC, Lingsoft, National
Bureau of Investigation, etc.
• Joensuu: Speaker Recognition
• http://cs.joensuu.fi/pages/pums
Research Group
PUMS project
Pasi Fränti
Professor
Ismo Kärkkäinen
Clustering algorithms
Juhani Saastamoinen
Project manager
Evgeny Karpov
Project researcher
Tomi Kinnunen
Researcher
Ville Hautamäki
Project researcher
Application Scenarios
Speaker Recognition
• Speaker Verification: Is this Bob’s voice? (voice sample + identity claim; a false claim is rejected: Imposter!)
• Speaker Identification: Whose voice is this?
Project Goal
Port speaker recognition
to Series 60 mobile phone
Symbian Phones
• Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– ARM processor
– No floating-point unit!
• Symbian user interface platforms pictured: Series 60, UIQ, Series 80
Symbian OS
• Defined by Symbian consortium
• Based on EPOC
• Operating system for mobile phones
– Real-time system
– Long uptime required
• Multitasking, multithreading
Problems of Porting
• Usual considerations when porting to phone
– Event-driven GUI programming
– Platform specific programming model
– Real-time system, exceptions
• Application specific porting problems
– Number crunching without floating point unit!!!
– Signal processing numerically challenging
Identification System
• Speech audio → Signal Processing (Feature Extraction) → Feature Vectors
• Speaker Modelling: create speaker profile
– Add speaker profiles to the Speaker Profile Database during training
• Speaker Recognition: classify input speech based on existing profiles
– Read and use all profiles from the Speaker Profile Database during recognition
– Output: Decision
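A minimal sketch of the training and recognition paths, assuming (as the later slides on VQ/GLA suggest) that a speaker profile is a vector-quantization codebook trained with the generalized Lloyd algorithm. The function names, the deterministic initialization from the first vectors, and the toy 2-D feature vectors are illustrative assumptions, not the project's code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_codebook(vectors, size, iterations=10):
    """Speaker modelling: build a codebook with a basic GLA (Lloyd) loop.

    Assumption: the codebook is seeded with the first `size` vectors so the
    sketch is deterministic; a random initial solution is the usual choice.
    """
    codebook = [list(v) for v in vectors[:size]]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in vectors:
            nearest = min(range(len(codebook)),
                          key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        for i, cell in enumerate(cells):
            if cell:  # keep the old code vector if its cell is empty
                codebook[i] = [sum(col) / len(cell) for col in zip(*cell)]
    return codebook


def avg_distortion(vectors, codebook):
    """Average quantization distortion of the vectors against one profile."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify(vectors, profiles):
    """Speaker recognition: read all profiles, return the best-matching name."""
    return min(profiles, key=lambda name: avg_distortion(vectors, profiles[name]))


# Toy "database" with two hypothetical speakers.
profiles = {
    "alice": train_codebook([[0, 0], [0, 1], [1, 0], [1, 1]], size=2),
    "bob": train_codebook([[10, 10], [10, 11], [11, 10], [11, 11]], size=2),
}
decision = identify([[0.2, 0.1], [0.9, 0.8]], profiles)
```

Here `decision` is the name of the profile with the smallest average distortion for the unknown feature vectors.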
MFCC Signal Processing
• Processing chain: digital speech signal frame → Pre-emphasis → Time windowing → DFT → Abs → Filter bank → Log → DCT → Feature vector
• Parameters: pre-emphasis coefficient 0.97, Hamming window, 30 triangular mel filters, base-2 logarithm, output 12 MFCCs
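The chain can be sketched in floating point as follows. This is an illustrative reference version, not the fixed-point code of the project. Assumptions not stated on the slide: a 16 kHz sampling rate, the common mel formula 2595·log10(1 + f/700), a naive O(n²) DFT, a small floor before the logarithm, and dropping the 0th DCT coefficient. The name `mfcc_frame` is hypothetical.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mfcc_frame(frame, sample_rate=16000, n_filters=30, n_ceps=12, preemph=0.97):
    n = len(frame)
    # Pre-emphasis: y[t] = x[t] - 0.97 * x[t-1] (first sample kept as is)
    emph = [frame[0]] + [frame[t] - preemph * frame[t - 1] for t in range(1, n)]
    # Time windowing with a Hamming window
    win = [s * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
           for t, s in enumerate(emph)]
    # DFT + Abs: magnitude spectrum for bins 0..n/2
    mags = [abs(sum(win[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]
    # Filter bank: triangular filters equally spaced on the mel scale;
    # edges are expressed in (fractional) DFT bin units
    mel_max = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(mel_max * i / (n_filters + 1)) * n / sample_rate
             for i in range(n_filters + 2)]
    energies = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = edges[j - 1], edges[j], edges[j + 1]
        e = 0.0
        for k, m in enumerate(mags):
            if lo < k <= mid:
                e += m * (k - lo) / (mid - lo)
            elif mid < k < hi:
                e += m * (hi - k) / (hi - mid)
        energies.append(e)
    # Log: base-2 logarithm, floored to avoid log(0)
    logs = [math.log2(max(e, 1e-12)) for e in energies]
    # DCT-II: keep coefficients 1..n_ceps as the feature vector
    return [sum(l * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, l in enumerate(logs))
            for c in range(1, n_ceps + 1)]


# Usage: one 256-sample frame of a 440 Hz tone
frame = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(256)]
features = mfcc_frame(frame)
```

Each call maps one speech frame to a 12-dimensional feature vector.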
Fixed-Point Implementation
• Numerical analysis needed for fixed-point arithmetic implementation
• Truncation and re-scaling to avoid
overflows in the converted algorithm
• Minimize information loss caused by
computation in fixed-point arithmetic
– Minimize relative error
FFT, Fixed-Point
• Frequency spectrum of speech
– Biggest source of numerical error
– Butterflies have multiplications
– Layers repeat truncation errors
• Fixed number of bits per element
– 32, native integer size in many systems
• Reference implementation: FFTGEN
– http://www.jjj.de/fft/fftgen.tgz
FFTGEN (16/16)
• Multiplication: the 32 x 32-bit result must fit in 32 bits, so the inputs are truncated
• FFTGEN: truncate inputs to 16/16 bits
– FFT layer input: 16-bit integer
– FFT twiddle factor: 16-bit integer
– 32-bit multiplication result: 16 used bits, 16 crop-off bits
– FFT layer output (part of it): 16-bit integer
– Crop-off for next layer: 16 bits!
Info Preserving FFT (22/10)
• Approximate DFT operator F with G
• Accept a larger operator error ||F-G|| in order to preserve more signal information
– Choose the scale of the sine values to minimize their maximum relative error; scale 980 is good for FFT sizes up to 1024
– Truncate multiplication inputs to 22/10 bits (signal/operator)
– FFT layer input: 32-bit integer, 22 bits used
– FFT twiddle factor: 16-bit integer, 10 bits used
– 32-bit multiplication result: 22 used bits, 10 crop-off bits
– FFT layer output (part of it): 32-bit integer
– Crop-off for next layer: 10 bits
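The trade-off between the 16/16 and 22/10 bit splits can be illustrated on a single multiplication. The sketch below is a simplification under stated assumptions: both operands are non-negative (sign handling omitted), both are scaled by powers of two rather than by the scale 980 mentioned on the slide, and the helper names are hypothetical.

```python
import math


def quantized_product(s, w, sig_bits, op_bits):
    """Approximate s * w for real s, w in [0, 1) with integer arithmetic.

    The signal value gets sig_bits bits and the operator (twiddle) value
    gets op_bits bits, so the product fits in sig_bits + op_bits bits.
    """
    sig_scale = 1 << sig_bits
    x = round(s * sig_scale)            # quantized signal value
    c = round(w * (1 << op_bits))       # quantized operator value
    y = (x * c) >> op_bits              # crop off the low op_bits bits
    return y / sig_scale                # back to a real number for comparison


def relative_error(s, w, sig_bits, op_bits):
    exact = s * w
    return abs(quantized_product(s, w, sig_bits, op_bits) - exact) / exact


# A small-magnitude signal value multiplied by a sine value
s, w = 0.001, math.sin(0.3)
err_16_16 = relative_error(s, w, sig_bits=16, op_bits=16)
err_22_10 = relative_error(s, w, sig_bits=22, op_bits=10)
```

For this small signal value the coarser operator in the 22/10 split costs less accuracy than the coarser signal in the 16/16 split, which is the effect the slide describes as preserving more signal information.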
FFT Spectrum, Fixed-Point
• x-axis: fixed-point FFT element absolute values
• y-axis: correct FFT element absolute values
[Figure: plots of 16/16 and 22/10 absolute values, for the original TIMIT signal and for the TIMIT signal x 4]
Scale of Error in Proposed FFT
Log10 of relative error in FFT elements

                      16/16     22/10
average              -0.775    -2.118
standard deviation    0.797     0.590

[Figure: panels for 16/16 and 22/10]
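As a rough illustration of how such a statistic can be computed, the sketch below takes fixed-point and reference FFT element values and summarizes log10 of the relative error. The function names are hypothetical, and two details are assumptions: elements with zero reference value or zero error are skipped, and the population standard deviation is used.

```python
import math
import statistics


def log10_relative_errors(approx, exact):
    """log10(|a - e| / |e|) for each element pair, skipping e == 0 and a == e."""
    return [math.log10(abs(a - e) / abs(e))
            for a, e in zip(approx, exact)
            if e != 0 and a != e]


def summarize(approx, exact):
    logs = log10_relative_errors(approx, exact)
    return statistics.mean(logs), statistics.pstdev(logs)


# Toy data: every element is off by 1 % of the reference value
mean_log, std_log = summarize([101, 99, 1010], [100, 100, 1000])
```

On this scale an average of -2 corresponds to a typical relative error of about 1 %, and -1 to about 10 %.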
Magnitude Spectrum, Fixed-Point
• Compute complex absolute values using the maximum coordinate and the coordinate ratio
• Suppose |x| > |y| for z = x + iy, then
$$|z| = \sqrt{x^2 + y^2} = |x| \sqrt{1 + (y/x)^2}$$
• Denote the squared ratio (y/x)^2 by t
• Approximate the square root \sqrt{1 + t} by a polynomial P(t)
• Constant time algorithm (vs. Newton)
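A minimal integer-only sketch of this idea follows. The polynomial here is an assumption chosen for illustration: a quadratic that interpolates sqrt(1 + t) at t = 0, 0.5 and 1, with coefficients stored in a hypothetical Q15 format. The slides do not give the degree or coefficients actually used.

```python
Q = 15                      # fractional bits of the fixed-point format
ONE = 1 << Q

# P(t) = C0 + C1*t + C2*t^2 approximates sqrt(1 + t) on [0, 1]
# (interpolation at t = 0, 0.5, 1; illustrative coefficients)
C0 = ONE
C1 = round(0.484766 * ONE)
C2 = round(-0.070552 * ONE)


def fixed_magnitude(re, im):
    """Approximate |re + i*im| for integers with a fixed number of operations."""
    a, b = abs(re), abs(im)
    big, small = max(a, b), min(a, b)
    if big == 0:
        return 0
    ratio = (small << Q) // big          # small/big in Q15, range 0..ONE
    t = (ratio * ratio) >> Q             # squared ratio in Q15
    # Horner evaluation of P(t); >> on negative values floors in Python
    p = C0 + ((t * (C1 + ((C2 * t) >> Q))) >> Q)
    return (big * p) >> Q


magnitude = fixed_magnitude(3000, 4000)
```

Unlike Newton iteration, the work per element does not depend on the input value. In a 32-bit implementation the product `big * p` would need its operands truncated first; Python integers hide that overflow.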
Logarithm, Fixed-Point
• Use base 2 instead of base 10
– changing the base only multiplies the output by a constant
• Standard technique:
– Reduce the problem to the interval [1,2)
– Use linear interpolation between values stored in a look-up table
– 8 bits used for indexing the look-up table values
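The standard technique can be sketched as below. The Q16 output format, the 24-bit mantissa, and the function name `fixed_log2` are assumptions made for the example. The table is generated here with floating-point math for brevity; on a device without a floating-point unit it would be stored as precomputed constants.

```python
import math

FRAC = 16                    # fractional bits of the result (Q16)
TABLE_BITS = 8               # 8 bits index the look-up table
# log2(1 + i/256) in Q16 for i = 0..256 (one extra entry for interpolation)
TABLE = [round(math.log2(1 + i / (1 << TABLE_BITS)) * (1 << FRAC))
         for i in range((1 << TABLE_BITS) + 1)]


def fixed_log2(x):
    """Base-2 logarithm of a positive integer, returned in Q16."""
    if x <= 0:
        raise ValueError("x must be positive")
    e = x.bit_length() - 1               # integer part: x / 2^e lies in [1, 2)
    mant = (x << 24) >> e                # mantissa in [2^24, 2^25)
    frac = mant - (1 << 24)              # 24 fractional bits of the mantissa
    idx = frac >> 16                     # top 8 bits select the table interval
    rem = frac & 0xFFFF                  # low 16 bits drive the interpolation
    lo, hi = TABLE[idx], TABLE[idx + 1]
    return (e << FRAC) + lo + (((hi - lo) * rem) >> 16)


log2_of_8 = fixed_log2(8) / (1 << FRAC)
```

A base-10 logarithm, if needed, is the same result multiplied by the constant log10(2).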
Rest of System, Fixed-Point
• No improvement needed in VQ/GLA
• Should apply a similar technique as with FFT to the other signal processing steps
– Pre-emphasis: utilize full 32 bits
– Time windowing: use fewer bits in the windowing function
– Filter bank (FB): use fewer bits in frequency responses
– DCT: use fewer bits for the cosines
Effect of Signal Processing
• TIMIT data sets, varying number of speakers (N)
• For each N, repeat train/recognize cycles (6x, 5x, 2x) to reduce the effect of the random GLA initial solution
• FFTGEN: FFT with 16/16 multiplication
• Fixed-point: use proposed 22/10 FFT
• Mixed: floating-point DSP, fixed-point GLA/VQ
                 N=10 (6x)   N=20 (5x)   N=100 (2x)
FFTGEN             93,3%       68,0%       59,5%
Fixed-point        98,3%       95,0%       82,5%
Mixed             100,0%      100,0%      100,0%
Floating-point    100,0%      100,0%      100,0%
Effect of Signal Quality
• GSM/PC data: 16 aligned dual recordings
• All computations in floating-point arith.
• Signal recorded with laptop and PC mic
gives average recognition rate 100%
• Signal recorded with Nokia 3660 results in
average recognition rate 84,9%
                 13/16   14/16   15/16   16/16
Symbian audio      1       3       3      10
PC audio           0       0       0      17
Conclusion
• Speaker identification was ported to
Symbian Series 60 mobile phone
• 22/10 bit usage in multiplication
proposed instead of “standard” 16/16
• Experiments indicate that recognition
accuracy improves from 68% to 95%