Voice over Packet

V
O
P
YJS
Speech production mechanisms
Y(J) Stein VoP2
Speech Production Organs
[Figure: speech production organs, labeled: brain, hard palate, nasal cavity, velum, teeth, lips, mouth cavity, uvula, pharynx, tongue, esophagus, larynx, trachea, lungs]
Speech Production Organs - cont.
Air from lungs is exhaled into trachea (windpipe)

Vocal cords (folds) in the larynx can produce periodic pulses of air
by opening and closing; the opening between them is the glottis

Throat (pharynx), mouth, tongue and nasal cavity modify air flow

Teeth and lips can introduce turbulence

Epiglottis separates esophagus (food pipe) from trachea
Voiced vs. Unvoiced Speech
When the vocal cords are held open, air flows unimpeded
When the laryngeal muscles stretch them, the glottal flow comes in bursts
When the glottal flow is periodic, the speech is called voiced
The basic interval (or frequency) is called the pitch
Pitch period is usually between 2.5 and 20 milliseconds
Pitch frequency is between 50 and 400 Hz
You can feel the vibration of the larynx
Vowels are always voiced (unless whispered)
Consonants come in voiced/unvoiced pairs
for example: B/P G/K D/T V/F J/CH TH/th W/WH Z/S ZH/SH
Excitation spectra
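Voiced speech
Pulse train is not sinusoidal, so it is rich in harmonics
[Figure: excitation spectrum of voiced speech versus frequency f]
Unvoiced speech
Common assumption: white noise
[Figure: excitation spectrum of unvoiced speech versus frequency f]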
Effect of vocal tract
Mouth and nasal cavities have resonances

Resonant frequencies depend on geometry
Effect of vocal tract - cont.
Sound energy at these resonant frequencies is amplified
Frequencies of peak amplification are called formants
[Figure: frequency response versus frequency, with formant peaks F1 to F4, for voiced speech (pitch F0) and unvoiced speech]
Formant frequencies
Peterson - Barney data (note the “vowel triangle”)
Sonograms
Cylinder model(s)
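Rough model of throat and mouth cavity
[Figure: chain of cylinders driven by the voice excitation; far end labeled open]
With nasal cavity
[Figure: chain of cylinders with a nasal branch, driven by the voice excitation; ends labeled open and open/closed]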
Phonemes
The smallest acoustic unit that can change meaning
Different languages have different phoneme sets
Types:
(notations: phonetic, CVC, ARPABET)
– Vowels
• front (heed, hid, head, hat)
• mid (hot, heard, hut, thought)
• back (boot, book, boat)
• diphthongs (buy, boy, down, date)
– Semivowels
• liquids (w, l)
• glides (r, y)
Phonemes - cont.
– Consonants
• nasals (murmurs) (n, m, ng)
• stops (plosives)
– voiced (b,d,g)
– unvoiced (p, t, k)
• fricatives
– voiced (v, th as in that, z, zh)
– unvoiced (f, th as in think, s, sh)
• affricates (j, ch)
• whispers (h, wh as in what)
• gutturals (ע, ח)
• clicks, etc. etc. etc.
Basic LPC Model
[Block diagram: a pulse generator and a white noise generator feed the U/V switch, whose output drives the LPC synthesis filter]
Basic LPC Model - cont.
Pulse generator produces a harmonic-rich periodic impulse train (with pitch period and gain)
White noise generator produces a random signal (with gain)
U/V switch chooses between voiced and unvoiced speech
LPC filter amplifies formant frequencies (all-pole or AR IIR filter)
The output will resemble true speech to within residual error
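The model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 8 kHz sampling rate, the 80-sample pitch period, and the single 500 Hz resonance (pole radius 0.9) are not from the slides and were chosen just to make the example concrete.

```python
import math
import random

def excitation(n, voiced, pitch_period=80, gain=1.0, seed=0):
    """Return n excitation samples: impulse train if voiced, white noise otherwise."""
    if voiced:
        return [gain if i % pitch_period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n)]

def lpc_synthesis(e, a):
    """All-pole (AR) filter: s[n] = e[n] + sum_m a[m] * s[n-m], with a = [a_1, a_2, ...]."""
    s = []
    for n, e_n in enumerate(e):
        acc = e_n
        for m, a_m in enumerate(a, start=1):
            if n - m >= 0:
                acc += a_m * s[n - m]
        s.append(acc)
    return s

# Assumed example: one "formant" at 500 Hz for fs = 8000 Hz (pole radius 0.9).
fs = 8000
r, theta = 0.9, 2 * math.pi * 500 / fs
a = [2 * r * math.cos(theta), -r * r]

voiced = lpc_synthesis(excitation(400, voiced=True), a)     # periodic, buzzy
unvoiced = lpc_synthesis(excitation(400, voiced=False), a)  # noise-like
```

The U/V switch is represented here by the `voiced` flag; a real synthesizer would switch frame by frame and use ten or so coefficients rather than two.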
Cepstrum
Another way of thinking about the LPC model
Speech spectrum is obtained by multiplication:
spectrum of the (pitch) pulse train times
the vocal tract (formant) frequency response
So the log of this spectrum is obtained by addition:
log spectrum of the pitch train plus
log of the vocal tract frequency response
Consider this log spectrum to be the spectrum of some new signal,
called the cepstrum
The cepstrum is the sum of two components:
excitation plus vocal tract
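A small sketch of this idea, assuming the real cepstrum (inverse DFT of the log magnitude spectrum) and a synthetic signal: an impulse train with a 40-sample period passed through a one-pole filter standing in for the vocal tract. The names, the signal, and the small `floor` that avoids log(0) are assumptions for illustration.

```python
import cmath
import math

def dft(x, inverse=False):
    """Plain O(N^2) discrete Fourier transform (the inverse includes the 1/N factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def real_cepstrum(x, floor=1e-3):
    """Inverse DFT of the log magnitude spectrum; `floor` avoids log(0)."""
    log_mag = [math.log(abs(v) + floor) for v in dft(x)]
    return [v.real for v in dft(log_mag, inverse=True)]

# Assumed test signal: impulse train (period 40 samples) through a one-pole filter.
N, period = 240, 40
s, prev = [], 0.0
for n in range(N):
    prev = (1.0 if n % period == 0 else 0.0) + 0.9 * prev
    s.append(prev)

c = real_cepstrum(s)
# The smooth vocal tract part sits at low quefrency; the excitation shows up
# as a peak at the pitch period, so search only the plausible pitch range.
pitch_quefrency = max(range(20, 61), key=lambda q: c[q])
```

Because the two components add in the cepstrum, they can be separated simply by looking at different quefrency ranges.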
Cepstrum - cont.
Cepstral processing has its own language (anagrams of the spectral terms)
– Cepstrum (note that this is really a signal in the time domain)
– Quefrency (its units are seconds)
– Liftering (filtering)
– Alanysis (analysis)
– Saphe (phase)
Several variants:
– complex cepstrum
– power cepstrum
– LPC cepstrum
Do we know enough?
Standard speech model (LPC)
(used by most speech processing/compression/recognition systems)
is a model of speech production
Unfortunately, speech production and speech perception systems
are not matched
So next we’ll look at the biology of the hearing (auditory) system
and some psychophysics (perception)
Speech Hearing & Perception Mechanisms
Hearing Organs
Hearing Organs - cont.
Sound waves impinge on the outer ear and enter the auditory canal
Amplified waves cause eardrum to vibrate
Eardrum separates outer ear from middle ear
The Eustachian tube equalizes air pressure of middle ear
Ossicles (hammer, anvil, stirrup) amplify vibrations
Oval window separates middle ear from inner ear
Stirrup excites oval window which excites liquid in the cochlea
The cochlea is curled up like a snail
The basilar membrane runs along middle of cochlea
The organ of Corti transduces vibrations to electric pulses
Pulses are carried by the auditory nerve to the brain
Function of Cochlea
Cochlea has 2 1/2 to 3 turns;
were it straightened out it would be 3 cm in length
The basilar membrane runs down the center of the cochlea,
as does the organ of Corti
15,000 cilia (hairs) contact the vibrating basilar membrane
and release neurotransmitter, stimulating 30,000 auditory neurons
Cochlea is wide (1/2 cm) near the oval window and tapers towards the apex;
it is stiff near the oval window and flexible near the apex
Hence high frequencies cause the section near the oval window to vibrate,
and low frequencies cause the section near the apex to vibrate
The result is a frequency decomposition by a bank of overlapping filters
Psychophysics - Weber’s law
Ernst Weber: professor of physiology at Leipzig in the early 1800s
Just Noticeable Difference (JND):
minimal stimulus change that can be detected by the senses
Discovery:
$\Delta I = K I$
Example
Tactile sense: place coins in each hand
subject could discriminate between 10 coins and 11,
but not between 20 and 21, but could between 20 and 22!
Similarly for vision (lengths of lines), taste (saltiness), sound (frequency)
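The coin example can be checked numerically. The sketch below assumes a Weber fraction of K = 1/10, which is what the 10/11 and 20/22 cases imply; the function name is hypothetical.

```python
from fractions import Fraction

def noticeable(i_old, i_new, k=Fraction(1, 10)):
    """Weber's law: a change is detectable when |delta I| >= K * I."""
    return abs(i_new - i_old) >= k * i_old

for pair in [(10, 11), (20, 21), (20, 22)]:
    print(pair, "noticeable" if noticeable(*pair) else "not noticeable")
```

Exact fractions are used so that the boundary cases (a change of exactly K times I) are not affected by floating-point rounding.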
Weber’s law - cont.
This makes a lot of sense
Bill Gates
Psychophysics - Fechner’s law
Weber’s law is not a true psychophysical law:
it relates stimulus threshold to stimulus (both physical entities),
not internal representation (feelings) to physical entity
Gustav Theodor Fechner
student of Weber; studied medicine, physics, philosophy
Simplest assumption: the JND is a single internal unit
Using Weber’s law we find:
$Y = A \log I + B$
Fechner Day (October 22 1850)
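One way to fill in the step from Weber’s law to the logarithm (a sketch of the standard argument, which assumes the JND can be treated as an infinitesimal step):

```latex
% Each JND adds one unit of sensation: \Delta Y = 1 when \Delta I = K I
\frac{\Delta Y}{\Delta I} = \frac{1}{K I}
\quad\Longrightarrow\quad
Y = \int \frac{dI}{K I} = \frac{1}{K} \ln I + B
% so A = 1/K when the natural logarithm is used
```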
Fechner’s law - cont.
Log is very compressive
Fechner’s law explains the fantastic ranges of our senses
Sight: single photon to direct sunlight, $10^{15}$
Hearing: eardrum moving by 1 H atom to jet plane, $10^{12}$
Bel defined to be $\log_{10}$ of power ratio
decibel (dB) one tenth of a Bel
$d(\mathrm{dB}) = 10 \log_{10} (P_1 / P_2)$
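As a quick arithmetic check, the decibel formula can be applied to the two ranges quoted above. Treating those ranges as power ratios is an assumption of this sketch, and the function name is hypothetical.

```python
import math

def decibels(p1, p2):
    """Power ratio P1/P2 expressed in dB."""
    return 10.0 * math.log10(p1 / p2)

hearing_range_db = decibels(1e12, 1.0)  # the 10^12 hearing range
sight_range_db = decibels(1e15, 1.0)    # the 10^15 sight range
doubling_db = decibels(2.0, 1.0)        # doubling the power
```

A factor of ten in power is always 10 dB, so twelve orders of magnitude compress to 120 dB.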
Fechner’s law - sound amplitudes
Companding
adaptation of logarithm to positive/negative signals
μ-law and A-law are piecewise linear approximations
Equivalent to linear sampling at 12-14 bits
(8 bit linear sampling is significantly more noisy)
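A sketch of why companding helps, using the continuous μ-law curve with μ = 255 (the value used in G.711 μ-law). The standard codecs use piecewise linear approximations of this curve, so the code below is an idealization; the uniform 8-bit quantizer and the function names are assumptions for illustration.

```python
import math

MU = 255.0  # mu-law parameter used in G.711

def mu_compress(x, mu=MU):
    """Map x in [-1, 1] to [-1, 1] logarithmically, keeping the sign."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize8(y):
    """Uniform 8-bit quantizer on [-1, 1] (256 levels, reconstruct at level centers)."""
    level = min(255, int((y + 1.0) * 128))
    return (level + 0.5) / 128 - 1.0

quiet = 0.01  # a low-amplitude sample
err_linear = abs(quantize8(quiet) - quiet)
err_mu = abs(mu_expand(quantize8(mu_compress(quiet))) - quiet)
```

For quiet samples the companded path gives a much smaller error than plain 8-bit linear quantization, which is the sense in which it behaves like a finer linear quantizer.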
Fechner’s law - sound frequencies
Octaves, well-tempered scale: semitone ratio $\sqrt[12]{2}$
Critical bands
Frequency warping
Mel (from “melody”): 1 KHz = 1000 mel, JND steps afterwards
$M \approx 1000 \log_2 (1 + f_{KHz})$
Bark (from Barkhausen): tones that can be simultaneously heard
excite different basilar membrane regions
$B \approx 25 + 75\,(1 + 1.4 f_{KHz}^2)^{0.69}$
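The two approximations on this slide are easy to tabulate. The sketch below takes the formulas as given; reading B as the critical bandwidth in Hz is an interpretation, and the function names are hypothetical.

```python
import math

SEMITONE = 2.0 ** (1.0 / 12.0)  # well-tempered semitone ratio; 12 of them make an octave

def mel(f_khz):
    """Mel approximation: M ~ 1000 log2(1 + f_kHz)."""
    return 1000.0 * math.log2(1.0 + f_khz)

def critical_bandwidth_hz(f_khz):
    """Approximation B ~ 25 + 75 (1 + 1.4 f_kHz^2)^0.69, read here as bandwidth in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * f_khz ** 2) ** 0.69

for f in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(f"{f:4.1f} kHz  {mel(f):7.1f} mel  B ~ {critical_bandwidth_hz(f):6.1f}")
```

Both curves are roughly linear at low frequencies and compressive at high frequencies, which is the frequency-domain counterpart of the logarithmic amplitude response.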
Psychophysics - changes
Our senses respond to changes
[Diagram labels: Inverse Filter, E]
Psychophysics - masking
Masking: strong tones block weaker ones at nearby frequencies
narrowband noise blocks tones (up to critical band)
Speech DSP
Some Speech DSP
Simplest processing
– Gain
– AGC
– VAD
More complex processing
– pitch tracking
– U/V decision
– computing LPC
– other features
Simple Speech DSP
Gain (volume) Control
In analog processing (electronics) gain requires an amplifier
Great care must be taken to ensure linearity!
In digital processing (DSP) gain requires only multiplication
y=Gx
Need enough bits!
Automatic Gain Control (AGC)
Can we set the gain automatically?
Yes, based on the signal’s energy!
$E = \int x^2(t)\,dt = \sum_n x_n^2$
All we have to do is apply gain until we attain the desired energy
Assume we want the energy to be Y
Then
$y = \sqrt{Y/E}\; x = G x$
has exactly this energy
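A minimal sketch of this fixed-gain normalization; the function names and sample values are assumptions for illustration.

```python
import math

def energy(x):
    """E = sum of squared samples."""
    return sum(v * v for v in x)

def agc(x, target_energy):
    """Scale x by G = sqrt(Y / E) so that the output energy is Y."""
    e = energy(x)
    if e == 0.0:
        return list(x), 1.0  # silent input: nothing to normalize
    g = math.sqrt(target_energy / e)
    return [g * v for v in x], g

x = [0.1, -0.2, 0.05, 0.3]
y, g = agc(x, target_energy=1.0)
```

The square root appears because energy scales with the square of the gain.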
AGC - cont.
What if the input isn’t stationary (gets stronger and weaker over time)?
The energy is defined over all times ($-\infty < t < \infty$), so it can’t help!
So we define “energy in window” E(t)
and continuously vary gain G(t)
This is Adaptive Gain Control
We don’t want gain to jump from window to window
so we smooth the instantaneous gain
$G(t) \leftarrow a\,G(t) + (1-a)\sqrt{Y/E(t)}$
IIR filter
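The smoothed update can be sketched as follows. The window length, the smoothing coefficient, and the choice to hold the gain during silent windows are assumptions of this example, not part of the slides.

```python
import math

def adaptive_agc(x, target_energy, window=160, a=0.9, floor=1e-12):
    """Per-window AGC with smoothed gain: G <- a*G + (1-a)*sqrt(Y/E(t))."""
    g = 1.0
    out, gains = [], []
    for start in range(0, len(x), window):
        frame = x[start:start + window]
        e = sum(v * v for v in frame)  # energy in window E(t)
        if e > floor:                  # hold the gain during silence
            g = a * g + (1.0 - a) * math.sqrt(target_energy / e)
        gains.append(g)
        out.extend(g * v for v in frame)
    return out, gains

# Constant-amplitude input: the instantaneous gain is 2, and G creeps toward it.
x = [0.5] * 1600
y, gains = adaptive_agc(x, target_energy=160.0)
```

With a = 0.9 the gain closes one tenth of the remaining gap per window; a smaller a reacts faster but lets the gain jump more.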
AGC - cont.
The a coefficient determines how fast G(t) can change
In more complex implementations we may separately control
integration time, attack time, release time
What is involved in the computation of G(t)?
– Squaring of input value
– Accumulation
– Square root (or Pythagorean sum)
– Inversion (division)
Square root and inversion are hard for a DSP processor
but algorithmic improvements are possible (and often needed)
Simple VAD
Sometimes it is useful to know whether someone is talking (or not)
– Save bandwidth
– Suppress echo
– Segment utterances
We might be able to get away with “energy VOX”
Normally need Noise Riding Threshold / Signal Riding Threshold
However, there are problems with energy VOX
since it doesn’t differentiate between speech and noise
What we really want is a speech-specific activity detector
Voice Activity Detector
Simple VAD - cont.
VADs operate by recognizing that speech is different from noise
– Speech is low-pass while noise is white
– Speech is mostly voiced and so has pitch in a given range
– Average noise amplitude is relatively constant
A simple VAD may use:
– zero crossings
– zero crossing “derivative”
– spectral tilt filter
– energy contours
– combinations of the above
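A toy combination of two of these cues, energy against a slowly adapting noise estimate plus zero-crossing rate. The thresholds, the adaptation constant, and the assumption that the first frame contains only noise are illustrative choices; practical VADs are considerably more elaborate.

```python
import math

def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for p, q in zip(frame, frame[1:]) if (p >= 0) != (q >= 0))

def frame_energy(frame):
    return sum(v * v for v in frame)

def simple_vad(frames, energy_factor=4.0, max_zc_rate=0.25, noise_adapt=0.95):
    """Flag a frame as speech when its energy is well above the noise estimate
    and its zero-crossing rate is low (speech is low-pass, white noise is not)."""
    noise = frame_energy(frames[0])  # assume the first frame is noise only
    decisions = []
    for frame in frames:
        e = frame_energy(frame)
        zc_rate = zero_crossings(frame) / (len(frame) - 1)
        speech = e > energy_factor * noise and zc_rate < max_zc_rate
        if not speech:               # noise-riding threshold: track the noise floor
            noise = noise_adapt * noise + (1.0 - noise_adapt) * e
        decisions.append(speech)
    return decisions

# Assumed test frames at 8 kHz: weak alternating "noise" and a strong 200 Hz tone.
noise_frame = [0.01 * (-1) ** n for n in range(160)]
speech_frame = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
decisions = simple_vad([noise_frame, noise_frame, speech_frame, noise_frame])
```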
Other “simple” processes
Simple = not significantly dependent on details of speech signal
Speed change of recorded signal
Speed change with pitch compensation
Pitch change with speed compensation
Sample rate conversion
Tone generation
Tone detection
Dual tone generation
Dual tone detection (need high reliability)
Complex Speech DSP
Correlation
One major difference between simple and complex processing
is the computation of correlations (related to LPC model)
Correlation is a measure of similarity
Shouldn’t we use squared difference to measure similarity?
$D^2 = \langle (x(t) - y(t))^2 \rangle$
No, since squared difference is sensitive to
– gain
– time shifts
Correlation - cont.
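$D^2 = \langle (x(t) - y(t))^2 \rangle = \langle x^2 \rangle + \langle y^2 \rangle - 2 \langle x(t)\,y(t) \rangle$
So when $D^2$ is minimal, $C(0) = \langle x(t)\,y(t) \rangle$ is maximal
and arbitrary gains don’t change this
To take time shifts into account, define
$C(\tau) = \langle x(t)\,y(t+\tau) \rangle$
and look for the $\tau$ that maximizes it!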
We can even find out how much a signal resembles itself
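A sketch of using the correlation to recover a time shift. The toy signals and function names are assumptions; the averaging is replaced by a plain sum over the samples where both signals exist.

```python
def cross_correlation(x, y, lag):
    """C(tau) = sum_t x[t] * y[t + tau], over the samples where both exist."""
    return sum(x[t] * y[t + lag] for t in range(len(x)) if 0 <= t + lag < len(y))

def best_lag(x, y, max_lag):
    """Return the lag tau in [-max_lag, max_lag] that maximizes C(tau)."""
    return max(range(-max_lag, max_lag + 1), key=lambda lag: cross_correlation(x, y, lag))

# Assumed toy data: y is x delayed by 3 samples and scaled by an arbitrary gain.
x = [0, 0, 1, 3, -2, 4, 1, -1, 0, 0, 0, 0, 0]
delay, gain = 3, 0.25
y = [0.0] * delay + [gain * v for v in x[:len(x) - delay]]

found = best_lag(x, y, max_lag=5)
```

The gain of 0.25 scales every C(tau) equally, so it does not move the location of the maximum.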
Autocorrelation
Crosscorrelation $C_{xy}(\tau) = \langle x(t)\,y(t+\tau) \rangle$
Autocorrelation $C_x(\tau) = \langle x(t)\,x(t+\tau) \rangle$
$C_x(0)$ is the energy!
Autocorrelation helps find hidden periodicities!
Much stronger than looking in the time representation
Wiener-Khintchine theorem:
Autocorrelation $C(\tau)$ and power spectrum $S(f)$ are an FT pair
So autocorrelation contains the same information as the power spectrum
… and can itself be computed by FFT
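The Fourier-pair relation can be checked numerically on a short sequence. This sketch uses a plain O(N²) DFT instead of an FFT to stay self-contained, and it compares against the circular autocorrelation, which is what the DFT route produces (zero-padding would be needed for the linear one). The names and sample values are assumptions.

```python
import cmath
import math

def dft(x, inverse=False):
    """Plain O(N^2) discrete Fourier transform (the inverse includes the 1/N factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circular_autocorrelation(x):
    """C(tau) = sum_t x[t] * x[(t + tau) mod N]."""
    n = len(x)
    return [sum(x[t] * x[(t + tau) % n] for t in range(n)) for tau in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.5]
power_spectrum = [abs(v) ** 2 for v in dft(x)]
c_from_spectrum = [v.real for v in dft(power_spectrum, inverse=True)]
c_direct = circular_autocorrelation(x)
```

The lag-zero value of either result equals the energy of x.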
Pitch tracking
How can we measure (and track) the pitch?
We can look for it in the spectrum
– but it may be very weak
– may not even be there (filtered out)
– need high resolution spectral estimation
Correlation based methods
The pitch periodicity should be seen in the autocorrelation!
Sometimes computationally simpler is the
Average Magnitude Difference Function (AMDF)
$\langle |x(t) - x(t+\tau)| \rangle$
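A sketch of pitch estimation with the magnitude difference function, which needs no multiplications: the pitch period is the lag where the function dips toward zero. The 8 kHz rate, the synthetic signal, and the function names are assumptions.

```python
def amdf(x, lag, length):
    """Average magnitude difference <|x[t] - x[t + lag]|> over `length` samples."""
    return sum(abs(x[t] - x[t + lag]) for t in range(length)) / length

def amdf_pitch_period(x, min_lag, max_lag):
    """Lag with the smallest AMDF among the valid pitch periods.
    min() returns the smallest lag when several multiples of the period tie."""
    length = len(x) - max_lag  # same number of terms for every lag
    return min(range(min_lag, max_lag + 1), key=lambda lag: amdf(x, lag, length))

# Assumed test signal at 8 kHz: exactly periodic with a 50-sample (6.25 ms) period.
fs, period = 8000, 50
x = [0.9 ** (n % period) for n in range(400)]
# Pitch periods of 2.5 ms to 20 ms correspond to 20..160 samples at 8 kHz.
lag = amdf_pitch_period(x, min_lag=20, max_lag=160)
pitch_hz = fs / lag
```

On real speech the dips are not exactly zero and multiples of the period compete, which is one source of the pitch-doubling errors that post-processing has to fix.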
Pitch tracking - cont.
Sondhi’s algorithm for autocorrelation-based pitch tracking :
– obtain window of speech
– determine if the segment is voiced (see U/V decision below)
– low-pass filter and center-clip
to reduce formant induced correlations
– compute autocorrelation lags corresponding to valid pitch intervals
• find lag with maximum correlation OR
• find lag with maximal accumulated correlation in all multiples
Post processing
Pitch trackers rarely make small errors (usually double pitch)
So correct outliers based on neighboring values
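A simplified sketch in the spirit of these steps, not Sondhi's actual algorithm: it center-clips a frame, picks the lag of maximum autocorrelation, and smooths a pitch track with a 3-point median. The low-pass filter and the voicing decision are omitted, and the clipping fraction, the median filter, and the test frame are assumptions.

```python
import math

def center_clip(x, fraction=0.3):
    """Zero the samples below a threshold and shift the rest toward zero."""
    c = fraction * max(abs(v) for v in x)
    return [v - c if v > c else v + c if v < -c else 0.0 for v in x]

def autocorrelation(x, lag):
    return sum(x[t] * x[t + lag] for t in range(len(x) - lag))

def pitch_period(x, min_lag, max_lag):
    """Lag of maximum autocorrelation of the center-clipped frame."""
    clipped = center_clip(x)
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(clipped, lag))

def fix_outliers(track):
    """Replace each interior value by the median of itself and its two neighbors."""
    fixed = list(track)
    for i in range(1, len(track) - 1):
        fixed[i] = sorted(track[i - 1:i + 2])[1]
    return fixed

# Assumed test frame at 8 kHz: 64-sample pitch pulses exciting a decaying 500 Hz resonance.
frame = [0.9 ** (n % 64) * math.cos(2 * math.pi * 500 * (n % 64) / 8000) for n in range(512)]
period = pitch_period(frame, min_lag=20, max_lag=160)
smoothed = fix_outliers([64, 64, 128, 64, 63])
```

The median removes the isolated double-pitch value (128) while leaving the small genuine change at the end of the track alone.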
Other Pitch Trackers
Miller’s data-reduction & Gold and Rabiner’s parallel processing methods
Zero-crossings, energy, extrema of waveform
Noll’s cepstrum based pitch tracker
Since the pitch and formant contributions are separated in cepstral domain
Most accurate for clean speech, but not robust in noise
Methods based on LPC error signal
LPC technique breaks down at pitch pulse onset
Find periodicity of error by autocorrelation
Inverse filtering method
Remove formant filtering by low-order LPC analysis
Find periodicity of excitation by autocorrelation
Sondhi-like methods are the best for noisy speech
U/V decision
Between VAD and pitch tracking
– Simplest U/V decision is based on energy and zero crossings
– More complex methods are combined with pitch tracking
– Methods based on pattern recognition
Is voicing well defined?
– Degree of voicing (buzz)
– Voicing per frequency band (interference)
– Degree of voicing per frequency band
LPC Coefficients
How do we find the vocal tract filter coefficients?
System identification problem
[Diagram: known input → unknown filter → known output]
All-pole (AR) filter
Connection to prediction
$s_n = G e_n + \sum_m a_m s_{n-m}$
Can find G from energy (so let’s ignore it)
LPC Coefficients
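For simplicity let’s assume three a coefficients
$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$
Need three equations!
$s_n = e_n + a_1 s_{n-1} + a_2 s_{n-2} + a_3 s_{n-3}$
$s_{n+1} = e_{n+1} + a_1 s_n + a_2 s_{n-1} + a_3 s_{n-2}$
$s_{n+2} = e_{n+2} + a_1 s_{n+1} + a_2 s_n + a_3 s_{n-1}$
In matrix form
$\begin{pmatrix} s_n \\ s_{n+1} \\ s_{n+2} \end{pmatrix} = \begin{pmatrix} e_n \\ e_{n+1} \\ e_{n+2} \end{pmatrix} + \begin{pmatrix} s_{n-1} & s_{n-2} & s_{n-3} \\ s_n & s_{n-1} & s_{n-2} \\ s_{n+1} & s_n & s_{n-1} \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ a_3 \end{pmatrix}$
where the column vectors are named s, e, and a, and the matrix is named S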
LPC Coefficients - cont.
$s = e + S a$
so by simple algebra
$a = S^{-1} (s - e)$
and we have reduced the problem to matrix inversion
Toeplitz matrix so the inversion is easy (Levinson-Durbin algorithm)
Unfortunately noise makes this attempt break down!
Move to next time and the answer will be different.
Need to somehow average the answers
The proper averaging is before the equation solving
autocorrelation vs. autocovariance
LPC Coefficients - cont.
Can’t just average over time - all equations would be the same!
Let’s take the input to be zero
$s_n = \sum_m a_m s_{n-m}$
multiply by $s_{n-q}$ and sum over n
$\sum_n s_n s_{n-q} = \sum_m a_m \sum_n s_{n-m} s_{n-q}$
we recognize the autocorrelations
$C_s(q) = \sum_m C_s(|m-q|)\, a_m$
Yule-Walker equations
autocorrelation method: $s_n$ outside window are zero (Toeplitz)
autocovariance method: use all needed $s_n$ (no window)
Also - pre-emphasis!
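A sketch of the autocorrelation method: compute the autocorrelations of a windowed signal and solve the Yule-Walker equations with the Levinson-Durbin recursion. The sign convention follows the prediction equation used here (s_n is predicted as the sum of a_m s_{n-m}); the test signal, a decaying impulse response of a known two-coefficient filter, is an assumption chosen so the answer can be checked.

```python
def autocorrelation(s, max_lag):
    """C_s(q) for q = 0..max_lag, treating samples outside the window as zero."""
    return [sum(s[n] * s[n - q] for n in range(q, len(s))) for q in range(max_lag + 1)]

def levinson_durbin(c, order):
    """Solve C_s(q) = sum_m C_s(|m-q|) a_m for q = 1..order.
    Returns ([a_1, ..., a_order], final prediction error energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = c[0]
    for i in range(1, order + 1):
        k = (c[i] - sum(a[j] * c[i - j] for j in range(1, i))) / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

# Assumed test signal: impulse response of a known all-pole filter (a1 = 1.2, a2 = -0.5).
true_a = [1.2, -0.5]
s = []
for n in range(200):
    e_n = 1.0 if n == 0 else 0.0
    s.append(e_n + sum(a_m * s[n - m] for m, a_m in enumerate(true_a, start=1) if n - m >= 0))

a_est, err = levinson_durbin(autocorrelation(s, 2), 2)
```

The intermediate values k are the reflection coefficients, so the recursion yields that alternative feature set as a by-product.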
Alternative features
The a coefficients aren’t the only set of features
– Reflection coefficients (cylinder model)
– log-area coefficients (cylinder model)
– pole locations
– LPC cepstrum coefficients
– Line Spectral Pair frequencies
All theoretically contain the same information (algebraic transformations)
Euclidean distance in LPC cepstrum space ~ Itakura-Saito measure
so these are popular in speech recognition
LPC (a) coefficients don’t quantize or interpolate well
so these aren’t good for speech compression
LSP frequencies are best for compression
LSP coefficients
a coefficients are not statistically equally weighted
pole positions are better (geometric)
but radius is sensitive near unit circle
Is there an all-angle representation?
Theorem 1: Every real polynomial with all roots on the unit circle
is palindromic (e.g. $1 + 2t + t^2$) or antipalindromic (e.g. $1 + t - t^2 - t^3$)
Theorem 2: Every polynomial can be written as the sum of
palindromic and antipalindromic polynomials
Consequence: Every polynomial can be represented by roots
on the unit circle, that is, by angles
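A sketch of the decomposition behind Theorem 2 as it is usually applied to LPC: the inverse filter polynomial A(z) of order p is extended to degree p + 1 and split into a palindromic part P and an antipalindromic part Q with A = (P + Q) / 2. The function name and the example coefficients are assumptions; finding the root angles of P and Q (the LSP frequencies) is not shown.

```python
def lsp_polynomials(a_poly):
    """Given the coefficients [1, c_1, ..., c_p] of A(z), return (P, Q) with
    P palindromic, Q antipalindromic, and A = (P + Q) / 2."""
    ext = list(a_poly) + [0.0]  # treat A as a polynomial of degree p + 1
    rev = ext[::-1]
    p = [u + v for u, v in zip(ext, rev)]
    q = [u - v for u, v in zip(ext, rev)]
    return p, q

# Assumed example: predictor a_1 = 1.2, a_2 = -0.5, so A(z) = 1 - 1.2 z^-1 + 0.5 z^-2.
a_poly = [1.0, -1.2, 0.5]
p, q = lsp_polynomials(a_poly)
```

In this example P factors as (1 + z^-1)(1 - 1.7 z^-1 + z^-2) and Q as (1 - z^-1)(1 - 0.7 z^-1 + z^-2), so apart from the fixed roots at -1 and +1 each contributes one pair of roots on the unit circle, described by a single angle.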
LPC - based Compression
We learned that from
– gain
– pitch
– a small number of LPC coefficients
we could synthesize speech
It is easy to find the energy of a speech signal
We have seen methods to find pitch
We saw how to extract LPC coefficients from speech
So do we know how to compress speech?