Sound and Speech Recognition
What is Sound?
Acoustics is the study of sound.
Physical - sound as a disturbance in the air
Psychophysical - sound as perceived by the ear
Sound as stimulus (physical event) & sound as a sensation.
Pressure changes (in the band from 20 Hz to 20 kHz)
Physical terms
Amplitude
Frequency
Spectrum
Sound Waves
In a free field, an ideal source of acoustical energy sends out sound of uniform intensity in all directions
=> sound propagates as a spherical wave.
Intensity of sound is inversely proportional to the square of the distance;
sound pressure is inversely proportional to the distance itself (inverse distance law).
=> 6 dB decrease of sound pressure level per doubling of the distance.
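As an illustration (not from the slides), a minimal Python sketch of this relationship; the 20·log10 rule for sound pressure gives the quoted 6 dB per doubling of distance:

import math

def spl_drop_db(r1, r2):
    # Drop in sound pressure level (dB) between distances r1 and r2 from an
    # ideal point source in a free field (pressure falls off as 1/r).
    return 20.0 * math.log10(r2 / r1)

print(spl_drop_db(1.0, 2.0))   # ~6.02 dB per doubling of distance
print(spl_drop_db(1.0, 4.0))   # ~12.04 dB after two doublings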
How we hear
Ear connected to the brain (left brain: speech; right brain: music)
Ear's sensitivity to frequency is logarithmic
Varying frequency response
Dynamic range is about 120 dB (at 3-4 kHz)
Frequency discrimination 2 Hz (at 1 kHz)
Intensity change of 1 dB can be detected.
Digitizing Sound
Digitally Sampling
Undersampling
Clipping
Quantization
Digital Sampling
• Sampling is dictated by the Nyquist sampling
theorem which states how quickly samples must be
taken to ensure an accurate representation of the
analog signal.
f_s ≥ 2·f   or, equivalently,   T_s ≤ T/2
• The Nyquist sampling theorem states that the sampling frequency must be at least twice the highest frequency in the original analog signal.
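A minimal Python sketch of the sampling criterion (the frequencies are illustrative values, not prescribed by the slides):

import numpy as np

f_max = 6000.0    # Hz, highest frequency present in the analog signal
f_s = 16000.0     # Hz, chosen sampling rate

# Nyquist criterion: f_s must be at least 2 * f_max, otherwise content
# above f_s / 2 folds back (aliases) into the sampled band.
assert f_s >= 2 * f_max

t = np.arange(0, 0.01, 1.0 / f_s)       # 10 ms of sample times
x = np.sin(2 * np.pi * f_max * t)       # this tone is representable at 16 kHz
# Sampled at 8 kHz instead, the same 6 kHz tone would alias down to |8000 - 6000| = 2000 Hz.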
Dithering a Sampled Signal
• A low-level noise signal is added to the analog signal before quantization to remove the artifacts of quantization error.
• Dither causes the audio signal to always move between
quantization levels.
• Otherwise, a low level signal would be encoded as a square wave
=> granulation noise.
• Dithered, the A/D converter output is signal + noise
=> perceptually preferred,
since noise is better tolerated than distortion.
• Amplitude of dither signal:
high dither amplitudes more easily remove quantization artifacts
too much dither decreases the signal-to-noise ratio
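A minimal Python sketch of dithered quantization, assuming NumPy and illustrative parameter values: without dither, a tone far below one quantization step collapses to a single level (granulation); with dither it is spread across neighbouring levels, i.e. signal + noise.

import numpy as np

def quantize(x, bits=8, dither=True):
    # Uniform quantizer for signals in [-1, 1]; optionally adds ~1 LSB of
    # triangular-PDF dither before rounding.
    levels = 2 ** (bits - 1)
    lsb = 1.0 / levels
    if dither:
        x = x + (np.random.uniform(-0.5, 0.5, x.shape) +
                 np.random.uniform(-0.5, 0.5, x.shape)) * lsb
    return np.clip(np.round(x * levels) / levels, -1.0, 1.0 - lsb)

t = np.arange(0, 0.01, 1.0 / 44100.0)
quiet = 0.002 * np.sin(2 * np.pi * 1000.0 * t)         # well below one 8-bit step
print(np.unique(quantize(quiet, dither=False)).size)   # 1 level: square-wave / granulation
print(np.unique(quantize(quiet, dither=True)).size)    # several levels: signal + noise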
Common Sound Sampling Parameters
Common Sampling Rates
• 8 kHz (phone) or 8.012820513 kHz (phone, NeXT)
• 11.025 kHz (1/4 CD std)
• 16 kHz (G.722 std)
• 22.05 kHz (1/2 CD std)
• 44.1 kHz (CD, DAT)
• 48 kHz (DAT)
Bits per Sample
• 8 or 16
Number of Channels
• mono/stereo/quad/etc.
Audio Data Rates
Quality                  Format                        Transfer Rate   Disk Space (1 hour)   Disk Space (100,000 hours)
Netcasting               RealAudio                     20 Kbit/s       8.8 MByte             0.9 TByte
Preview                  RealAudio                     80 Kbit/s       35.2 MByte            3.5 TByte
Preview                  MPEG Layer 3 (MP3)            192 Kbit/s      84.4 MByte            8.4 TByte
Broadcasting or Editing  MPEG Layer 2                  384 Kbit/s      168.8 MByte           16.9 TByte
Archive                  Waveform (uncompressed) PCM   1538 Kbit/s     675.9 MByte           67.6 TByte
Space/Storage Requirements for 1 Minute of Sound (KBytes)
Sampling Rate   Mono 8-bit   Mono 16-bit   Stereo 8-bit   Stereo 16-bit
44.1 kHz        2646         5292          5292           10584
22.05 kHz       1323         2646          2646           5292
11.025 kHz      661.5        1323          1323           2646
8 kHz           480          960           960            1920
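The uncompressed figures above follow directly from rate × resolution × channels × time; a minimal Python sketch of the arithmetic (illustrative, not part of the slides):

def pcm_bytes(seconds, sample_rate, bits_per_sample, channels):
    # Uncompressed PCM storage in bytes.
    return int(seconds * sample_rate * (bits_per_sample // 8) * channels)

# One minute of 44.1 kHz, 16-bit stereo: 10,584,000 bytes (the 10584k entry above).
print(pcm_bytes(60, 44100, 16, 2))
# One minute of 8 kHz, 8-bit mono: 480,000 bytes (the 480k entry above).
print(pcm_bytes(60, 8000, 8, 1))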
Many (!) Sound File Formats
• Mulaw (Sun, NeXT) .au
• RIFF (Resource Interchange File Format)
• MS WAV and .AVI
• MPEG Audio Layer (MPEG) .mpa .mp3
• AIFC (Apple, SGI) .aiff .aif
• HCOM (Mac) .hcom
• SND (Sun, NeXT) .snd
• VOC (Soundblaster card proprietary standard) .voc
• AND MANY OTHERS!
What’s in a Sound File Format
Header Information
• Magic Cookie
• Sampling Rate
• Bits/Sample
• Channels
• Byte Order / Endianness
• Compression type
Data
Example File Format (NIST SPHERE)
NIST_1A
1024
sample_rate -i 16000
channel_count -i 1
sample_n_bytes -i 2
sample_byte_format -s2 10
sample_sig_bits -i 16
sample_count -i 594400
sample_coding -s3 pcm
sample_checksum -i 20129
end_head
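A minimal Python sketch of how such a header could be parsed; the field handling is an assumption based only on the example above, not on the full SPHERE specification:

def read_sphere_header(path):
    fields = {}
    with open(path, "rb") as f:
        magic = f.readline().strip()              # b'NIST_1A'
        header_size = int(f.readline())           # e.g. 1024 bytes
        for raw in f.read(header_size).splitlines():
            line = raw.decode("ascii", "ignore").strip()
            if line == "end_head":
                break
            parts = line.split(None, 2)           # name, type flag (-i / -sN), value
            if len(parts) == 3:
                name, type_flag, value = parts
                fields[name] = int(value) if type_flag == "-i" else value
    return fields

# read_sphere_header("utt.sph")["sample_rate"]  ->  16000   (hypothetical file)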
WAV file format (Microsoft) RIFF
A collection of data chunks.
Each chunk has a 32-bit Id
followed by a 32-bit chunk length
followed by the chunk data.
Offset  Field
0x00    chunk id 'RIFF'
0x04    chunk size (32 bits)
0x08    wave chunk id 'WAVE'
0x0C    format chunk id 'fmt '
0x10    format chunk size (32 bits)
0x14    format tag (currently PCM)
0x16    number of channels (1 = mono, 2 = stereo)
0x18    sample rate in Hz
0x1C    average bytes per second
0x20    number of bytes per sample (1 = 8-bit mono; 2 = 8-bit stereo or 16-bit mono; 4 = 16-bit stereo)
0x22    number of bits in a sample
0x24    data chunk id 'data'
0x28    length of data chunk (32 bits)
0x2C    sample data
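A minimal Python sketch that reads exactly the 44-byte canonical PCM layout above; real files may contain extra chunks, which Python's standard wave module handles more robustly:

import struct

def read_wav_header(path):
    with open(path, "rb") as f:
        hdr = f.read(44)
    (riff, riff_size, wave, fmt, fmt_size, fmt_tag, channels, sample_rate,
     byte_rate, block_align, bits, data, data_len) = struct.unpack("<4sI4s4sIHHIIHH4sI", hdr)
    assert riff == b"RIFF" and wave == b"WAVE" and fmt == b"fmt "
    return {"format_tag": fmt_tag, "channels": channels,
            "sample_rate": sample_rate, "bits_per_sample": bits,
            "data_bytes": data_len}

# read_wav_header("example.wav")   (hypothetical file)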
Digital Audio Today
Analog elements in the audio chain are replaced with digital
elements.
16-bit wordlength, 32/44.1/48 kHz sampling rates.
Mostly linear signal processing.
Wide range of digital formats and storage media.
Rapid development of technology
=> better SNR, phase and linearity.
Rapid increase of signal processing power
=> possibility to implement new, complex features.
Soon: Digital radio (satellite), HDTV
Digital (CD) vs Analog (LP or cassette tape)
Information is stored digitally.
The length of its data pits represents a series
of 1s and 0s.
Both audio channels are stored along the
same pit track.
Data is read using a laser beam.
Information density is about 100 times greater than on an LP.
CD player can correct disc errors.
Benefits of Digital Representation (CD)
Robust
No degradation from repeated playings because data is read
by the laser beam.
Error correction
Transport’s performance does not affect the quality of audio
reproduction.
Digital circuitry more immune to aging and temperature
problems
Data conversion is independent of variations in disc rotational
speed, hence wow and flutter are negligible.
SNR over 90 dB.
Subcode for display, control and user information
CD Format
• Sampling
  44.1 kHz => 10% margin with respect to the Nyquist frequency (audible frequencies below 20 kHz)
  16-bit linear => theoretical SNR about 98 dB (for a sinusoidal signal with maximum amplitude)
  audio bit rate 1.41 Mbit/s (44.1 kHz * 16 bits * 2 channels)
  Cross Interleaved Reed-Solomon Code (CIRC) for error correction
  Subcode
• Original Specifications
  Playing time max. 74.7 min
  Disc diameter 120 mm
  Disc thickness 1.2 mm
  One-sided medium, rotates clockwise
  Signal is recorded from inside to outside
  Pit is about 0.5 µm wide
  A pit edge represents a 1; all other areas, whether inside or outside a pit, are 0s
Speech Recognition in Brief
Acoustic Origins
• A wave for the words “speech lab” looks like:
[Waveform segmented into the phones s, p, ee, ch, l, a, b, with a close-up of the "l" to "a" transition]
Graphs from Simon Arnfield’s web tutorial on speech, Sheffield:
http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/
Speech Recognition Knowledge Sources
Acoustic Modeling: describes the sounds that make up speech
Lexicon: describes which sequences of speech sounds make up valid words
Language Model: describes the likelihood of various sequences of words being spoken
Speech Recognition
THE FUNDAMENTAL EQUATION
O is an acoustic 'observation'
W is the word sequence we are trying to recognize
Maximize: W* = argmax_W P(W | O)
P(W | O) is unknown, so by Bayes' rule:
P(W | O) = P(O | W) · P(W) / P(O)
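A toy Python sketch of the decision rule (all probabilities are made-up illustrative values): P(O) is the same for every hypothesis, so only P(O | W) · P(W) needs to be compared.

hypotheses = {
    # word sequence W:     (acoustic score P(O|W), language-model score P(W))
    "recognize speech":    (1e-5, 3e-4),
    "wreck a nice beach":  (3e-5, 1e-6),
}
best = max(hypotheses, key=lambda w: hypotheses[w][0] * hypotheses[w][1])
print(best)    # 'recognize speech'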
Mechanism of state-of-the-art speech recognizers
Speech in → acoustic analysis → feature vectors x_1 ... x_T
Recognition: maximize P(x_1 ... x_T | w_1 ... w_k) · P(w_1 ... w_k) → recognized sentence
P(x_1 ... x_T | w_1 ... w_k) comes from the acoustic models and the pronunciation lexicon
P(w_1 ... w_k) comes from the language model
Acoustic Sampling
• 10 ms frame (ms = millisecond = 1/1000 second)
• ~25 ms window around frame to smooth signal
processing
[Diagram: overlapping 25 ms windows advancing in 10 ms steps]
Result: acoustic feature vectors a1, a2, a3, ...
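A minimal Python sketch of this framing scheme, assuming NumPy and a 16 kHz signal:

import numpy as np

def frames(x, sample_rate=16000, step_ms=10, window_ms=25):
    # One 25 ms analysis window every 10 ms.
    step = int(sample_rate * step_ms / 1000)       # 160 samples at 16 kHz
    width = int(sample_rate * window_ms / 1000)    # 400 samples at 16 kHz
    return np.array([x[i:i + width]
                     for i in range(0, len(x) - width + 1, step)])

x = np.random.randn(16000)       # one second of placeholder audio
print(frames(x).shape)           # (98, 400): roughly 100 frames per second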
Spectral Analysis
• Frequency gives pitch; amplitude gives volume
• sampling at ~8 kHz phone, ~16 kHz mic (kHz=1000
cycles/sec)
[Spectrogram of "speech lab", labeled with the phones s, p, ee, ch, l, a, b]
• Fourier transform of wave yields a spectrogram
• darkness indicates energy at each frequency
• hundreds to thousands of frequency samples
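A minimal Python sketch of such a spectrogram using NumPy's FFT; the window and step sizes are assumptions matching the 25 ms / 10 ms framing described above:

import numpy as np

def spectrogram(x, window=400, step=160):
    # Magnitude spectrum of each Hann-windowed frame; rows are time frames,
    # columns are frequency bins from 0 up to half the sampling rate.
    w = np.hanning(window)
    frames = np.array([x[i:i + window] * w
                       for i in range(0, len(x) - window + 1, step)])
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 16000
x = np.sin(2 * np.pi * 440.0 * np.arange(fs) / fs)    # one second of a 440 Hz tone
S = spectrogram(x)
print(S.shape)                    # (98, 201): time frames x frequency bins
print(S[0].argmax() * fs / 400)   # ~440.0: the energy peak sits at the tone's frequency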
Features for Speech Recognition
Coding scheme (typical)
• 10 millisecond step size; 25 millisecond window
• ~39 coefficients each step:
  • mel-scale cepstra derived from the frequency representation
  • delta and delta-delta coefficients
  • power
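As a concrete illustration of such a coding scheme, a sketch using the third-party librosa package (an assumption; the course does not prescribe any particular toolkit), with a hypothetical input file:

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)          # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # 25 ms window, 10 ms step
features = np.vstack([mfcc,
                      librosa.feature.delta(mfcc),             # delta coefficients
                      librosa.feature.delta(mfcc, order=2)])   # delta-delta coefficients
print(features.shape)    # (39, number_of_frames)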
The Markov Assumption
• Only immediately preceding history matters
P(X_1, X_2, ..., X_n) = ∏_{i=1..n} P(X_i | X_1, ..., X_{i-1})
Markov assumption: P(X_i | X_1, ..., X_{i-1}) ≈ P(X_i | X_{i-1})
Therefore: P(X_1, X_2, ..., X_n) ≈ ∏_{i=1..n} P(X_i | X_{i-1})
Hidden Markov Models
• In speech recognition the number of states is very large; we can simplify by factoring the model into two components:
  a transition probability p(s_2 | s_1) and an observation probability q(y_1 | s_2, s_1)
[State diagram: S1 → S2 → S3]
Searching the Speech Signal Trellis
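The trellis figure is not reproduced here. As a minimal sketch of the search it depicts, a toy Viterbi decoder over a two-state HMM with made-up probabilities:

def viterbi(obs, states, start_p, trans_p, emit_p):
    # Best-path search through the trellis of states x observations.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev = V[-1]
        col, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] * trans_p[r][s])
            col[s] = prev[best_prev] * trans_p[best_prev][s] * emit_p[s][o]
            ptr[s] = best_prev
        V.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return path

states = ["S1", "S2"]
start_p = {"S1": 0.8, "S2": 0.2}
trans_p = {"S1": {"S1": 0.6, "S2": 0.4}, "S2": {"S1": 0.1, "S2": 0.9}}
emit_p = {"S1": {"a": 0.7, "b": 0.3}, "S2": {"a": 0.2, "b": 0.8}}
print(viterbi(["a", "a", "b"], states, start_p, trans_p, emit_p))   # ['S1', 'S1', 'S2']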
Lexicon - links words to phones
in acoustic model
Aaron        EH R AX N
Aaron(2)     AE R AX N
abandon      AX B AE N D AX N
abandoned    AX B AE N D AX N DD
abandoning   AX B AE N D AX N IX NG
abandonment  AX B AE N D AX N M AX N TD
abated       AX B EY DX IX DD
abatement    AX B EY TD M AX N TD
abbey        AE B IY
Abbott       AE B AX TD
Abboud       AA B UW DD
abby         AE B IY
abducted     AE BD D AH KD T IX DD
Abdul        AE BD D UW L
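A minimal Python sketch of loading such a lexicon into a word-to-phones dictionary; the file name and format handling are assumptions based only on the excerpt above:

def load_lexicon(path):
    lexicon = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                lexicon[parts[0]] = parts[1:]   # word (or word(2)) -> phone list
    return lexicon

# load_lexicon("dict.txt")["abandon"]  ->  ['AX', 'B', 'AE', 'N', 'D', 'AX', 'N']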
When Language Modeling Goes Wrong
When P(w) is incorrect
Language Modeling
Language Models
A language model is a probability distribution over word sequences
p(W) = p(w_1, ..., w_n) = ∏_{i=1..n} p(w_i | w_0, ..., w_{i-1})
• n = 3, 4, 5: keep only the last n-1 words of context and lose the rest
• Hard to estimate large contexts: with a 64,000-word vocabulary there are 64,000^3 possible trigrams
Need large collections of text
Smoothing of P(w_i | w_{i-2}, w_{i-1}) is necessary
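A toy Python sketch of the idea, using a bigram model with add-one smoothing on a two-sentence corpus (the smoothing method and data are illustrative assumptions, not the course's recipe):

from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        words = ["<s>"] + words
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    vocab_size = len(unigrams)
    def p(word, prev):
        # Add-one smoothing keeps unseen bigrams from getting probability zero.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return p

p = train_bigram([["we", "recognize", "speech"],
                  ["we", "wreck", "a", "nice", "beach"]])
print(p("recognize", "we"))   # 0.2 - bigram seen in training
print(p("beach", "we"))       # 0.1 - unseen bigram, non-zero only because of smoothing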
Creating models for recognition
Speech data → Transcribe* → Train acoustic models
Text data → Train language models
Continual Progress in Speech Recognition
Increasingly Difficult Tasks, Steadily Declining Error Rates
[Chart: word error rate (%) vs. year, 1988-1998, for NIST benchmark tasks: read speech with 1,000-word, 5,000-word, and 20,000-word vocabularies (standard and varied microphones, noisy environment), unlimited-vocabulary broadcast news, and conversational speech (English and non-English). All results are speaker-independent. Source: NSA/Wayne/Doddington.]
References
• Speech Recognition resource links can be found at:
  http://svr-www.eng.cam.ac.uk/comp.speech/Section2/speechlinks.html
• An excellent tutorial on speech recognition by Wayne Ward:
  http://www-2.cs.cmu.edu/~roni/11761-s01/Presentations/whw%20hmm's%20in%20speech%20recognition%203.0.pdf
© Copyright 2002 Michael G. Christel and Alexander G. Hauptmann
Sound + Speech Recognition
That’s all for today