Analysis and Digital Implementation
of the Talk Box Effect
Yuan Chen
Advisor: Professor Paul Cuff
Introduction
What is a talk box?
Motivation?
Allows a musician to add diction and intelligibility to an instrument’s sound
Popular as an analog device
Application of signal processing
Goals?
Analyze output
Digital implementation
Figure 1 – Talk Box
Background – Speech and Intelligibility
Human speech production modeled as a convolution between source and filter (eq. 1):
$s(n) = e(n) * \theta(n)$
e(n): source excitation, θ(n): vocal tract impulse response
Not really time invariant
Only valid for voiced speech
Frequencies of formant peaks account for intelligibility of
speech (Lingard, McLoughlin)
Most important are F2, F3 formants which occur in frequency
band 800 Hz – 3 kHz
Complex Cepstrum
Formant peaks arise from the vocal tract response θ(n), so we need a way to “deconvolve” it from the source
Intuitively source excitation varies quickly in frequency,
vocal tract response varies slowly in frequency (Deller)
Complex Cepstrum (eq. 2) (Deller):
$\hat{s}(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega)\, e^{j\omega n}\, d\omega$
Apply a low quefrency lifter to separate source and filter
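The reason the cepstrum helps with deconvolution is that the logarithm turns the product of source and filter spectra into a sum. The sketch below illustrates this in Python with NumPy (the slides use MATLAB). It is a simplification under stated assumptions: it uses the real cepstrum (log magnitude only) rather than the complex cepstrum, and `source` and `vocal_tract` are synthetic stand-ins, not speech data.

```python
import numpy as np

def real_cepstrum(x, n_fft):
    """Inverse DFT of the log magnitude spectrum (phase is discarded)."""
    spectrum = np.fft.fft(x, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small floor avoids log(0)
    return np.fft.ifft(log_mag).real

rng = np.random.default_rng(0)
n_fft = 256

# Hypothetical stand-ins: a noisy "source" and a smooth, decaying "filter".
source = rng.standard_normal(64)
vocal_tract = 0.9 ** np.arange(32)

# Linear convolution has length 64 + 32 - 1 = 95 <= n_fft, so the DFT of the
# result equals the product of the two zero-padded DFTs (no wrap-around).
speech = np.convolve(source, vocal_tract)

c_s = real_cepstrum(speech, n_fft)
c_sum = real_cepstrum(source, n_fft) + real_cepstrum(vocal_tract, n_fft)

# Convolution in time becomes addition in the cepstral domain.
print(np.max(np.abs(c_s - c_sum)))
```

Because the two contributions add, a lifter that keeps only the low-quefrency part can isolate the slowly varying component.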
Analysis Results – Vowel Sounds
Talk box most successfully impresses F2, F3 peaks
Relative error in peak frequency: F1 – 19.6%, F2 – 9.33%, F3 – 6.22%
Error due to inability to replicate sound
For voice, ~90% of energy in 0 Hz – 1000 Hz
For talk box, ~10% of energy in 0 Hz – 1000 Hz
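A band-energy figure like the ones above can be estimated from the power spectrum. The following is a minimal Python/NumPy sketch, not the analysis code behind the slides; the function name `band_energy_fraction`, the sampling rate, and the two-tone test signal are all assumptions chosen for illustration.

```python
import numpy as np

def band_energy_fraction(x, fs, f_lo, f_hi):
    """Fraction of signal energy in [f_lo, f_hi] Hz, from the one-sided power spectrum."""
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= f_lo) & (freqs <= f_hi)
    return power[in_band].sum() / power.sum()

# Hypothetical test signal: a strong 300 Hz tone plus a weaker 2500 Hz tone.
fs = 16000
t = np.arange(fs) / fs                      # one second of samples
low_tone = np.sin(2 * np.pi * 300 * t)
high_tone = np.sin(2 * np.pi * 2500 * t)
mix = 3.0 * low_tone + 1.0 * high_tone

frac = band_energy_fraction(mix, fs, 0, 1000)
print(frac)
```

Energy scales with amplitude squared, so the 3:1 amplitude ratio places 9/10 of the energy in the 0 Hz – 1000 Hz band.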
Design Overview
Problem definition:
Implement in MATLAB
Vocal Tract Impulse Response Extraction
Calculate cepstrum (eq. 3):
Lifter: eliminate all quefrency components above the cutoff n_c (eq. 4)
From liftered cepstrum, invert to calculate
impulse/frequency response (eq. 5):
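The three steps above (cepstrum, lifter, inversion) can be sketched as follows in Python with NumPy. This is an illustration under assumptions, not a reproduction of eqs. 3–5, which are not preserved in this transcript: it uses the real cepstrum, a rectangular lifter, and a zero-phase inversion, and the function name `vocal_tract_response` and the synthetic test frame are hypothetical.

```python
import numpy as np

def vocal_tract_response(frame, n_fft, n_cutoff):
    """Estimate a smooth spectral envelope by low-quefrency liftering.

    Assumption: real cepstrum (log magnitude only), rectangular lifter.
    Returns the magnitude response and a zero-phase impulse response.
    """
    spectrum = np.fft.fft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # floor avoids log(0)
    cepstrum = np.fft.ifft(log_mag).real

    # Keep quefrencies 0..n_cutoff and their mirror images
    # (the real cepstrum of a real signal is symmetric).
    lifter = np.zeros(n_fft)
    lifter[: n_cutoff + 1] = 1.0
    lifter[n_fft - n_cutoff:] = 1.0
    liftered = cepstrum * lifter

    # Back to frequency: smoothed log magnitude -> magnitude response.
    envelope = np.exp(np.fft.fft(liftered).real)
    # Zero-phase impulse response, rotated so its peak sits mid-buffer.
    impulse = np.fft.fftshift(np.fft.ifft(envelope).real)
    return envelope, impulse

# Hypothetical voiced frame: 125 Hz pulse train through a resonance near 1500 Hz.
fs = 8000
n = np.arange(200)
r, w0 = 0.97, 2 * np.pi * 1500 / fs
resonator = r ** n * np.sin((n + 1) * w0) / np.sin(w0)
pulses = np.zeros(512)
pulses[::64] = 1.0
frame = np.convolve(pulses, resonator)[:512] * np.hamming(512)

envelope, impulse = vocal_tract_response(frame, n_fft=512, n_cutoff=30)
```

The cutoff is chosen below the pitch period (64 samples here) so that the periodic source peaks in the cepstrum are removed while the slowly varying envelope is kept.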
Impulse Response Preprocessing
Calculated impulse response has excessive magnitude at low frequencies (0 – 1000 Hz)
Different frames of speech have different energy levels
Speech input should not directly determine output amplitude
Normalize, preprocess in frequency domain (eq. 6):
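One possible way to address the three points above is to attenuate the low band and rescale each frame's response to a fixed peak. The Python/NumPy sketch below is only that: eq. 6 is not preserved in this transcript, so the function name `normalize_response`, the unit-peak normalization, and the `low_band_gain` value are assumptions, not the author's preprocessing.

```python
import numpy as np

def normalize_response(envelope, fs, low_band_hz=1000.0, low_band_gain=0.5):
    """Attenuate the low band of a two-sided magnitude response, then scale to unit peak.

    low_band_gain is a hypothetical value chosen for illustration.
    """
    envelope = np.asarray(envelope, dtype=float)
    n_fft = len(envelope)
    # |f| so that mirrored (negative-frequency) bins are treated like their positive twins.
    freqs = np.abs(np.fft.fftfreq(n_fft, d=1.0 / fs))
    shaped = np.where(freqs <= low_band_hz, low_band_gain * envelope, envelope)
    return shaped / shaped.max()

# Two frames with the same spectral shape but different loudness.
fs = 8000
shape = 1.0 + np.cos(2 * np.pi * np.arange(512) / 512) ** 2
quiet = normalize_response(0.1 * shape, fs)
loud = normalize_response(10.0 * shape, fs)
```

After normalization the quiet and loud frames yield the same response, so the speech level no longer sets the output amplitude.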
Synthesis
50% overlap between successive frames
Define system response to be linear interpolation of vocal tract impulse responses in overlapping region (eq. 7):
α: relative index (eq. 8)
p: frame index (eq. 9)
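The interpolation described above can be sketched in Python with NumPy as follows. Eqs. 7–9 are not preserved in this transcript, so this is one plausible reading: each frame's response is anchored at the start of its hop, `p` is the integer part of `n / hop`, and `alpha` is the fractional position within the hop. The function name and the rule of holding the last response are assumptions.

```python
import numpy as np

def interpolated_response(responses, n, hop):
    """Impulse response at sample n, linearly blended between neighbouring frames.

    responses: array of shape (num_frames, M), one impulse response per frame.
    hop: samples between frame starts (half the frame length for 50% overlap).
    """
    responses = np.asarray(responses, dtype=float)
    last = len(responses) - 1
    p = min(n // hop, last)          # frame index
    p_next = min(p + 1, last)        # hold the final response past the end
    alpha = (n % hop) / hop          # relative index in [0, 1)
    return (1.0 - alpha) * responses[p] + alpha * responses[p_next]

# Three toy two-tap responses, hop of 4 samples.
responses = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
start = interpolated_response(responses, 0, hop=4)     # exactly frame 0
halfway = interpolated_response(responses, 2, hop=4)   # midway between frames 0 and 1
```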
Synthesis
From causality, output at time n0 depends only on input
occurring no later than n0
From finite-length impulse response (length M), output at time n0 depends only on input occurring no earlier than n0 – M + 1
Closed-form expression for y(n) (eq. 11):
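Taken together, the two constraints above mean the output is a length-M sum over past inputs with a response that changes per sample, which in general form is y(n) = Σ_{k=0}^{M-1} h_n(k) x(n − k). The Python/NumPy sketch below implements that general form with linearly interpolated responses. It is not a reproduction of eq. 11, which is not preserved in this transcript; the function name and the frame-anchoring convention are assumptions.

```python
import numpy as np

def time_varying_fir(x, responses, hop):
    """y(n) = sum_{k=0}^{M-1} h_n(k) x(n-k), with h_n interpolated between frames."""
    x = np.asarray(x, dtype=float)
    responses = np.asarray(responses, dtype=float)
    m = responses.shape[1]
    last = len(responses) - 1
    y = np.zeros(len(x))
    for n in range(len(x)):
        # Linear interpolation between the two neighbouring frame responses.
        p = min(n // hop, last)
        p_next = min(p + 1, last)
        alpha = (n % hop) / hop
        h_n = (1.0 - alpha) * responses[p] + alpha * responses[p_next]
        # Only inputs from n - M + 1 to n contribute; x is zero before n = 0.
        k_max = min(m - 1, n)
        past = x[n - k_max: n + 1][::-1]     # x(n), x(n-1), ..., x(n-k_max)
        y[n] = np.dot(h_n[: k_max + 1], past)
    return y

# Sanity check: if every frame has the same response, this is ordinary FIR filtering.
rng = np.random.default_rng(1)
x = rng.standard_normal(40)
h = np.array([0.5, 0.3, 0.2])
y_const = time_varying_fir(x, np.tile(h, (5, 1)), hop=8)
y_ref = np.convolve(x, h)[: len(x)]
```

A per-sample loop like this is slow but makes the dependence on n0 – M + 1 through n0 explicit.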
Design Summary
Performance
F2, F3 peaks on vowel speech inputs:
Static implementation relative error: 3.0% F2, 3.5% F3
Dynamic implementation relative error: 3.7% F2, 3.2% F3
Qualitatively, output has similar intelligibility to analog talk
box
Dynamic implementation can produce voiced non-vowel
phonemes and whole words
Not always consistent, depends on alignment in time
Performance Issues
Even with linearly-interpolated system impulse response,
noticeable transitions between frames
Computationally expensive: 2 FFTs, 2 IFFTs per frame
In MATLAB, computation for each frame takes longer than the duration of the frame
Performance dependent on alignment of input signals
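Whether a frame-based design can keep up in real time comes down to comparing per-frame compute time with the time between frame starts. The Python/NumPy sketch below shows one way to measure that; it is not the author's benchmark. Which transforms make up the "2 FFTs, 2 IFFTs" is an assumption (a cepstrum round trip is used here), and the function names and parameters are hypothetical.

```python
import time
import numpy as np

def frame_processing_time(n_fft, repeats=50):
    """Median wall-clock time for 2 FFTs + 2 IFFTs of length n_fft."""
    rng = np.random.default_rng(0)
    x = rng.standard_normal(n_fft)
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        spectrum = np.fft.fft(x)                                   # FFT 1
        cepstrum = np.fft.ifft(np.log(np.abs(spectrum) + 1e-12))   # IFFT 1
        log_env = np.fft.fft(cepstrum)                             # FFT 2
        np.fft.ifft(np.exp(log_env.real))                          # IFFT 2
        times.append(time.perf_counter() - start)
    return float(np.median(times))

def realtime_factor(compute_seconds, frame_len, fs, overlap=0.5):
    """Compute time divided by the time between frame starts.

    Values above 1 mean processing cannot keep up with the input.
    """
    hop_seconds = frame_len * (1.0 - overlap) / fs
    return compute_seconds / hop_seconds

elapsed = frame_processing_time(512)
print(realtime_factor(elapsed, frame_len=512, fs=16000))
```

With 50% overlap a new frame starts every half frame, so the budget per frame is the hop duration rather than the full frame duration.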
Conclusions and Further Considerations
Dynamic implementation closely models performance of
analog talk box:
Can produce vowels and voiced phonemes
Real-time setup
Demonstrate possibility of fully digital implementation of
talk box using speech input
Further considerations:
Improve transitions between frames
Decrease calculation time
Physical implementation