Introduction to Multimedia Systems

Speech in Multimedia
Hao Jiang
Computer Science Department
Boston College
Oct. 9, 2007
Outline
 Introduction
 Topics in speech processing
– Speech coding
– Speech recognition
– Speech synthesis
– Speaker verification/recognition
 Conclusion
Introduction
 Speech is our basic communication tool.
 We have long hoped to communicate with machines
using speech.
[Figure: C-3PO and R2-D2]
Speech Production Model
[Figure: anatomical structure and mechanical model of speech production]
Characteristics of Digital Speech
[Figure: a speech waveform (amplitude vs. time) and its spectrogram (frequency vs. time)]
Voiced and Unvoiced Speech
[Figure: a speech waveform with silence, unvoiced, and voiced segments labeled]
Short-time Parameters
[Figure: a speech waveform and its short-time power, waveform envelope, zero-crossing rate, and pitch period]
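The short-time parameters above can be computed with a few lines of code. This is a minimal sketch, not the lecture's implementation; the frame length, hop size, and test signals are illustrative:

```python
import numpy as np

def short_time_params(x, frame_len=200, hop=100):
    """Compute short-time power and zero-crossing rate per frame."""
    powers, zcrs = [], []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        # Short-time power: mean squared amplitude of the frame.
        powers.append(np.mean(frame ** 2))
        # Zero-crossing rate: fraction of adjacent sample pairs that change sign.
        zcrs.append(np.mean(np.signbit(frame[:-1]) != np.signbit(frame[1:])))
    return np.array(powers), np.array(zcrs)

# A voiced-like segment (low-frequency sine) followed by an
# unvoiced-like segment (low-amplitude noise):
t = np.arange(2000) / 8000.0
voiced = 0.5 * np.sin(2 * np.pi * 120 * t)
rng = np.random.default_rng(0)
unvoiced = 0.05 * rng.standard_normal(2000)
p, z = short_time_params(np.concatenate([voiced, unvoiced]))
```

As the figure suggests, voiced frames show high power and a low zero-crossing rate, while unvoiced frames show the opposite.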
Speech Coding
 Similar to images, we can also compress speech to
make it smaller and easier to store and transmit.
 General compression methods such as DPCM can
also be used.
 More compression can be achieved by taking
advantage of the speech production model.
 There are two classes of speech coders:
– Waveform coder
– Vocoder
LPC Speech Coder
[Diagram: speech → speech buffer → speech analysis, which extracts per frame
(frame n, frame n+1, …) the vocal tract parameters, pitch, voiced/unvoiced
decision, and energy; these pass through a quantizer and code generation
to produce the code stream]
LPC and Vocal Track
 Mathematically, speech can be modeled with the
following generation model:
x(n) = Σ (p = 1 to k) ap x(n − p) + e(n)
 {a1, a2, …, ak} are called Linear Prediction
Coefficients (LPC), which can be used to model the
shape of the vocal tract.
 e(n) is the excitation that generates the speech.
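The coefficients can be estimated by fitting the prediction equation above to a stretch of speech. This sketch uses plain least squares for clarity; production coders typically use the autocorrelation method with Levinson-Durbin recursion instead, and the test signal here is synthetic:

```python
import numpy as np

def lpc_coefficients(x, k=10):
    """Estimate LPC coefficients {a_1..a_k} by least squares,
    fitting x(n) ~= sum_p a_p x(n-p)."""
    # Each row holds the k previous samples [x(n-1), ..., x(n-k)].
    rows = np.array([x[n - k:n][::-1] for n in range(k, len(x))])
    target = x[k:]
    a, *_ = np.linalg.lstsq(rows, target, rcond=None)
    return a

# A signal generated by a known 2-tap recursion is recovered closely.
rng = np.random.default_rng(0)
e = 0.01 * rng.standard_normal(500)
x = np.zeros(500)
for n in range(2, 500):
    x[n] = 1.3 * x[n - 1] - 0.6 * x[n - 2] + e[n]
a = lpc_coefficients(x, k=2)
```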
Decoding and Speech Synthesis
[Diagram: an impulse train generator (driven by the pitch period) and a
random noise generator are selected by the U/V switch; the chosen
excitation passes through the glottal pulse generator, a gain, the vocal
tract model, and the radiation model to produce speech]
An Example for Synthesizing Speech
1. Start from a glottal pulse (the excitation).
2. Pass it through the vocal tract filter with gain control.
3. Pass the result through the radiation filter.
4. Blend adjacent frames in the overlapping (blending) region.
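The core of the decoder, running an excitation through the all-pole vocal tract filter, can be sketched as follows. The coefficients, gain, and pitch period are illustrative, and the radiation filter is omitted:

```python
import numpy as np

def lpc_synthesize(excitation, a, gain=1.0):
    """All-pole synthesis: x(n) = gain*e(n) + sum_p a_p x(n-p)."""
    k = len(a)
    x = np.zeros(len(excitation))
    for n in range(len(excitation)):
        # Weighted sum of the previously synthesized samples.
        past = sum(a[p] * x[n - 1 - p] for p in range(min(k, n)))
        x[n] = gain * excitation[n] + past
    return x

# Voiced excitation: an impulse train at the pitch period
# (80 samples, i.e. 100 Hz at 8 kHz sampling -- illustrative values).
e = np.zeros(800)
e[::80] = 1.0
speech = lpc_synthesize(e, a=[1.3, -0.6], gain=0.1)
```

Each pitch pulse excites the same decaying filter response, which is what gives the synthesized waveform its periodic, voiced character.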
LPC10 (FS1015)
 LPC10 was the DoD speech coding standard for
voice communication at 2.4 kbps.
 LPC10 works on speech sampled at 8 kHz, using a
22.5 ms frame and 10 LPC coefficients.
[Audio: original speech vs. LPC-decoded speech]
Mixed Excitation LP
 For real speech, the excitation is usually not a pure
pulse train or pure noise but a mixture of the two.
 The new 2.4kbps standard (MELP) addresses this
problem.
[Diagram: a pulse train and noise are each bandpass filtered, weighted by
w and 1 − w respectively, summed, scaled by the gain, and passed through
the vocal tract model and radiation model to produce speech]
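The mixing idea can be illustrated in a few lines. This is a toy sketch: real MELP applies the mix per frequency band through bandpass filters, whereas here a single weight w is applied full-band, and all values are made up for illustration:

```python
import numpy as np

def mixed_excitation(n, pitch_period, w, rng=None):
    """Blend a pulse train (weight w) with noise (weight 1 - w)."""
    rng = rng or np.random.default_rng(0)
    pulses = np.zeros(n)
    pulses[::pitch_period] = 1.0          # voiced component
    noise = 0.1 * rng.standard_normal(n)  # unvoiced component
    return w * pulses + (1.0 - w) * noise

# Mostly-voiced excitation: strong pulses with a little noise mixed in.
e = mixed_excitation(800, pitch_period=80, w=0.7)
```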
[Audio: original speech vs. MELP-decoded speech]
Hybrid Speech Codecs
 At higher bit rates, hybrid speech codecs have an
advantage over vocoders.
[Diagram: analysis-by-synthesis loop. Model parameter generation drives
speech synthesis; the synthesized speech is "perceptually" compared with
the input speech, and the best-matching parameters become the code]
 FS1016: CELP (Code-Excited Linear Prediction).
 G.723.1: a dual-rate codec (5.3 kbps and 6.3 kbps) for
multimedia communication over the Internet.
[Audio: sound at 5.3 kbps and at 6.3 kbps]
 G.729: a CELP-based codec at 8 kbps.
[Audio: sound at 8 kbps]
Speech Recognition
 Speech recognition is the foundation of human
computer interaction using speech.
 Speech recognition in different contexts:
– Speaker-dependent or speaker-independent
– Discrete words or continuous speech
– Small vocabulary or large vocabulary
– Quiet environment or noisy environment
[Diagram: speech → parameter analyzer → comparison and decision
algorithm → words; the decision stage consults stored reference patterns
and a language model]
How does Speech Recognition Work?
Words: grey whales
Phonemes: g r ey w ey l z
Each phoneme has different characteristics (for example, its power
distribution), so frame-by-frame analysis yields a phoneme sequence
such as: g g r ey ey ey ey w ey ey l l z
How do we “match” the word when there are time and other variations?
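One classical answer is dynamic time warping (DTW), which aligns two feature sequences despite stretching and compression in time. A toy sketch, using scalar features with an absolute-difference cost (both "feature sequences" below are made up for illustration):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Allow a diagonal match, or stretching either sequence.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

# A time-stretched version of the same "word" matches better
# than a different word does.
word = [1, 1, 3, 3, 2]
stretched = [1, 1, 1, 3, 3, 3, 3, 2, 2]
other = [2, 4, 4, 1, 1]
d_same = dtw_distance(word, stretched)
d_diff = dtw_distance(word, other)
```

The hidden Markov model, introduced next, is the statistical generalization of this alignment idea.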
Hidden Markov Model
[Diagram: a three-state HMM with states S1 → S2 → S3; P12 labels the
transition from S1 to S2, and each state emits observation symbols
from {a, b, c, …}]
Dynamic Programming in Decoding
[Diagram: a trellis of states vs. time]
By dynamic programming over the trellis, we can find the path that
corresponds to the most probable phoneme sequence generating the
observed feature sequence (one feature vector extracted per speech frame).
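The trellis search described above is the Viterbi algorithm. A minimal sketch with a toy two-state model (the states, probabilities, and observation symbols are invented for illustration; real recognizers work with log-probabilities of acoustic feature vectors):

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most probable state path for a discrete observation sequence."""
    n_states = len(start_p)
    T = len(obs)
    prob = np.zeros((T, n_states))   # best path probability ending in each state
    back = np.zeros((T, n_states), dtype=int)  # backpointers for the path
    prob[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            scores = prob[t - 1] * trans_p[:, s]
            back[t, s] = np.argmax(scores)
            prob[t, s] = scores[back[t, s]] * emit_p[s, obs[t]]
    # Trace the best path backwards from the most probable final state.
    path = [int(np.argmax(prob[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two states that tend to persist; state s mostly emits symbol s.
start = np.array([0.6, 0.4])
trans = np.array([[0.8, 0.2], [0.2, 0.8]])
emit = np.array([[0.9, 0.1], [0.1, 0.9]])
path = viterbi([0, 0, 1, 1, 1], start, trans, emit)
```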
HMM for a Unigram Language Model
[Diagram: from a start state s0, transitions with unigram probabilities
p1, p2, …, pn lead into the per-word HMMs: HMM1 (word1), HMM2 (word2), …,
HMMn (wordn)]
Speech Synthesis
 Speech synthesis generates (arbitrary) speech with
desired properties (pitch, speed, loudness,
articulation mode, etc.).
 Speech synthesis is widely used in text-to-speech
systems and various telephone services.
 The easiest and most commonly used speech synthesis
method is waveform concatenation.
[Audio: increasing the pitch without changing the speed]
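The simplest form of waveform concatenation joins two stored segments with a short crossfade over a blending region, so the splice does not click. A minimal sketch (segment contents and the overlap length are illustrative):

```python
import numpy as np

def concatenate_with_blend(a, b, overlap=50):
    """Join two waveform segments, linearly crossfading over `overlap` samples."""
    fade = np.linspace(1.0, 0.0, overlap)
    # Fade out the tail of `a` while fading in the head of `b`.
    blended = a[-overlap:] * fade + b[:overlap] * (1.0 - fade)
    return np.concatenate([a[:-overlap], blended, b[overlap:]])

# Two tones standing in for stored speech units (8 kHz sampling assumed).
t = np.arange(400) / 8000.0
seg1 = np.sin(2 * np.pi * 200 * t)
seg2 = np.sin(2 * np.pi * 300 * t)
out = concatenate_with_blend(seg1, seg2)
```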
Speaker Recognition
 Identifying or verifying a speaker's identity is an
application where computers can exceed humans.
 Vocal tract parameters can be used as features for
speaker recognition.
[Figure: LPC covariance features for speaker one vs. speaker two]
Applications
[Diagram: the application areas (speech recognition, speaker recognition,
speech coding, text-to-speech synthesis) surrounded by example
applications: call routing, document input, operator services, voice
commands, directory assistance, voice over Internet, fraud control,
wireless telephone, personalized service, document correction, speech
interface]