VAD - Southern Oregon University

Voice Activity Detection (VAD)
• Problem: Determine if voice is present in a particular
audio signal.
• Issues: loud noise classified as speech and soft speech
classified as noise
• Applications
– Speech Recognition
– Speech transmission
– Speech enhancement
• Increases performance of speech applications more than
any other single component
• Goal: extract features from a signal that emphasize
differences between speech and background noise
General Signal Characteristics
• Energy compared to long term noise estimates
– K. Srinivasan, A. Gersho, “Voice activity detection for
cellular networks,” Proc. of the IEEE Speech Coding
Workshop, Oct. 1993, pp. 85-86
• Likelihood ratio based on statistical methods
– Y.D. Cho, K. Al-Naimi, A. Kondoz, “Improved voice activity
detection based on a smoothed statistical likelihood ratio,”
Proceedings ICASSP, 2001, IEEE Press
• Compute the kurtosis
– R. Goubran, E. Nemer and S. Mahmoud, “SNR estimation of
speech signals using subbands and fourth-order statistics,”
IEEE Signal Processing Letters, vol. 6, no. 7, pp. 171-174,
1999
Extract Features in Speech Model
• Presence of pitch
– “Digital cellular telecommunication system (phase 2+);
voice activity detector for adaptive multi-rate (amr) speech
traffic channels,” ETSI Report, DEN/SMG-110694Q7, 2000
• Formant shape
– J.D. Hoyt, H. Wechsler, “Detection of human speech in
structured noise,” Proc. IEEE International Conference on
Acoustics, Speech, and Signal Processing, 1994, pp. 237-240
• Cepstrum
– J.A. Haigh, J.S. Mason, “Robust voice activity detection
using cepstral features,” IEEE TEN-CON, pp. 321-324, 1993
Multi-channel Algorithms
• Utilize additional information provided by additional
sensors
– P. Naylor, N. Doukas, T. Stathaki, “Voice activity
detection using source separation techniques,” Proc.
Eurospeech, 1997, pp. 1099-1102
– J.F. Chen, W. Ser, “Speech detection using microphone
array,” Electronics Letters, vol. 36, no. 2, pp. 181-182, 2000
– Q. Zou, X. Zou, M. Zhang, Z. Lin, “A robust speech
detection algorithm in a microphone array
teleconferencing system,” Proc. ICASSP, 2001, IEEE Press
Statistics: Mean
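• First moment - Mean or average value: \(\mu = \frac{1}{N}\sum_{i=1}^{N} s_i\)
• Second central moment - Variance or spread: \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(s_i - \mu)^2\)
• Standard deviation – typical distance of a sample from the mean: \(\sigma = \sqrt{\sigma^2}\)
• 3rd standardized moment - Skewness: \(\gamma_1 = \frac{1}{N}\sum_{i=1}^{N}(s_i - \mu)^3 / \sigma^3\)
– Negative: longer tail to the left
– Positive: longer tail to the right
• 4th standardized moment – Kurtosis (excess form, so a Gaussian scores 0): \(\gamma_2 = \frac{1}{N}\sum_{i=1}^{N}(s_i - \mu)^4 / \sigma^4 - 3\)
– Positive: relatively peaked
– Negative: relatively flat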
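A minimal Python sketch of these four statistics for one frame of samples. The function name `moments` and the sample frame are illustrative, not from the slides. Kurtosis is returned in its excess form (fourth standardized moment minus 3), so a Gaussian scores 0 and the positive/negative reading above applies.

```python
import math

def moments(samples):
    """Return (mean, variance, skewness, excess kurtosis) of a list of samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n      # population variance
    std = math.sqrt(var)
    skew = sum((s - mean) ** 3 for s in samples) / n / std ** 3
    # Subtracting 3 gives excess kurtosis: 0 for a Gaussian.
    kurt = sum((s - mean) ** 4 for s in samples) / n / std ** 4 - 3.0
    return mean, var, skew, kurt

frame = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical frame of samples
mean, var, skew, kurt = moments(frame)
```

For this symmetric, evenly spread frame the skewness is 0 and the excess kurtosis is negative (relatively flat).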
VAD General approaches
• Noise
– Level estimated during periods of low energy
– Adaptive estimate: on non-speech frames the noise floor estimate
drops quickly and rises slowly
• Energy: Speech energy significantly exceeds the noise level
• Cepstrum Analysis
– Voiced speech contains F0 plus harmonics, which show up
as a cepstral peak at the quefrency of the pitch period.
– Flat cepstra can result from a door slam or clap
• Kurtosis: Linear predictive coding (LPC) residuals of clean
voiced speech have a large kurtosis
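The noise and energy bullets can be sketched together as below. This is only an illustration: the constants `fall`, `rise`, and `ratio` are hypothetical tuning values, and the sketch assumes the first frame contains noise only.

```python
def energy_vad(frame_energies, fall=0.5, rise=0.05, ratio=3.0):
    """Per-frame speech decisions from an adaptive noise floor.

    fall, rise, and ratio are illustrative constants, not values from the slides.
    """
    noise = frame_energies[0]          # assumption: first frame is noise only
    decisions = []
    for energy in frame_energies:
        is_speech = energy > ratio * noise
        if not is_speech:
            # Follow drops quickly, follow increases slowly.
            rate = fall if energy < noise else rise
            noise += rate * (energy - noise)
        decisions.append(is_speech)
    return decisions

decisions = energy_vad([1.0, 1.1, 0.9, 10.0, 12.0, 1.0])
```

The noise estimate is frozen while a frame is judged to be speech, so loud speech does not raise the floor.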
Likelihood Ratio Test (LRT)
• J. Sohn, N.S. Kim, W. Sung, “A statistical model-based voice
activity detection,” IEEE Signal Processing Letters, vol. 6, no. 1,
pp. 1-3, Jan. 1999
• J. Ramirez, J.C. Segura, et al., “Statistical voice activity
detection using a multiple observation likelihood ratio test,”
IEEE Signal Processing Letters, vol. 12, no. 10, pp. 689-692, Oct.
2005
• Utilizes the geometric mean:
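GM = \(\left(\prod_{i=1}^{n} a_i\right)^{1/n} = e^{\frac{1}{n}\sum_{i=1}^{n} \ln a_i}\)
log(GM) = \(\log\left(\prod_{i=1}^{n} a_i\right)^{1/n} = \frac{1}{n}\sum_{i=1}^{n} \log a_i\)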
Geometric Mean
• Arithmetic mean: applicable when using numeric quantities
– Annual growth: 2.5, 3, and 3.5 million dollars
• Geometric mean: applicable when using percentages
– Company grows annually by 2.5, 3, and 3.5%
• Example: A company starts with $1,000,000
– Assets grow by 2.5, 3, and 3.5 percent over three years
– Arithmetic mean: \(\frac{1}{N}\sum_{i=1}^{N} g_i = (1.025 + 1.03 + 1.035)/3 = 1.03\)
– Geometric mean: \(\left(\prod_{i=1}^{N} g_i\right)^{1/N} = (1.025 \cdot 1.03 \cdot 1.035)^{1/3} \approx 1.02999191\)
– Actual increase: $1,000,000 × 1.025 × 1.03 × 1.035 = $1,092,701.25
– Use arithmetic mean: $1,000,000 × (1.03)^3 = $1,092,727
– Use geometric mean: $1,000,000 × (1.02999191)^3 ≈ $1,092,701.25
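The growth example can be checked with a few lines of Python. The geometric mean is computed as the exponential of the mean logarithm, the form that carries over to products of many small likelihoods; the variable names are illustrative.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # exp(mean(log)) avoids overflow or underflow in long products.
    return math.exp(sum(math.log(v) for v in values) / len(values))

growth = [1.025, 1.03, 1.035]          # yearly growth factors
start = 1_000_000

actual = start * math.prod(growth)                  # true compounded value
via_am = start * arithmetic_mean(growth) ** 3       # slightly too high
via_gm = start * geometric_mean(growth) ** 3        # matches the true value
```

Compounding with the arithmetic mean overshoots by about $26, while the geometric mean reproduces the actual result.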
LRT Algorithm
• Perform a discrete Fourier transform (DFT) of the audio frame
• For each DFT bin k, compute the likelihood of its magnitude under
the speech model, p(k|speech), and the non-speech model, p(k|non-speech)
• Take the log of the geometric mean of the per-bin likelihood ratios:
\(\frac{1}{K}\sum_{k=0}^{K-1} \log\frac{p(k \mid \text{speech})}{p(k \mid \text{non-speech})}\)
• Mark as speech or non-speech
– If > upper threshold, mark as speech
– If < lower threshold, mark as non-speech
– If in between, use an HMM or mark based on surrounding
frames (multiple observation)
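A minimal sketch of the decision rule follows. The slides do not fix the statistical model, so this assumes the complex Gaussian model common in the LRT literature, where the per-bin log likelihood ratio is γξ/(1+ξ) − ln(1+ξ), with γ the a posteriori SNR and ξ the a priori SNR. The crude estimate ξ = max(γ − 1, 0), the thresholds, and the input spectra are all hypothetical.

```python
import math

def frame_log_lrt(power_spectrum, noise_power):
    """Mean per-bin log likelihood ratio (log of the geometric mean of the ratios)."""
    total = 0.0
    for p, n in zip(power_spectrum, noise_power):
        gamma = p / n                    # a posteriori SNR of this bin
        xi = max(gamma - 1.0, 0.0)       # assumed a priori SNR estimate
        total += gamma * xi / (1.0 + xi) - math.log(1.0 + xi)
    return total / len(power_spectrum)

def classify(score, lower=0.2, upper=1.0):
    """Two-threshold decision; lower and upper are illustrative values."""
    if score > upper:
        return "speech"
    if score < lower:
        return "non-speech"
    return "undecided"                   # defer to an HMM or neighboring frames

noise = [1.0, 1.0, 1.0, 1.0]
quiet = classify(frame_log_lrt([1.0, 0.5, 1.2, 0.8], noise))
loud = classify(frame_log_lrt([10.0, 8.0, 1.0, 12.0], noise))
middle = classify(frame_log_lrt([2.5, 2.5, 2.5, 2.5], noise))
```

Frames that land between the thresholds are returned as "undecided" so that a later stage can resolve them from context.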
Statistical Modeling of Noise
[Figure: probability density curves for the Gamma (k determines shape, θ determines spread), Laplacian, and Gaussian distributions]
Probability Distribution Formulas
• Gamma: \(f(x;k,r) = \frac{x^{k-1}\, r^{k}\, e^{-rx}}{(k-1)!}\)
– k: shape, r: rate of arrival
– Mean: \(k/r\)
– Variance: \(k/r^2\)
– Skew: \(2/\sqrt{k}\)
– Kurtosis: \(6/k\)
• Laplacian: \(f(x;\mu,b) = \frac{1}{2b}\, e^{-|x-\mu|/b}\)
– μ: location, b: scale
– Mean: μ
– Variance: \(2b^2\)
– Skew: 0
– Kurtosis: 3
• Gaussian: \(f(x;\mu,\sigma) = \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x-\mu)^2/(2\sigma^2)}\)
– Mean: μ
– Variance: \(\sigma^2\)
– Skew: 0
– Kurtosis: 0
VAD: Determine which distribution most matches the noise
Example: if Kurtosis ≠ 0, can’t be Gaussian
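One hedged way to act on this is to measure the excess kurtosis of a noise segment and pick the model whose reference value is closest. The sketch below compares only the Gaussian (0) and Laplacian (3), since the Gamma value depends on k. The helper names and the synthetic noise are illustrative.

```python
import random

def excess_kurtosis(samples):
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / var ** 2 - 3.0

# Reference excess kurtosis values from the distribution table.
REFERENCE_KURTOSIS = {"gaussian": 0.0, "laplacian": 3.0}

def closest_model(samples):
    k = excess_kurtosis(samples)
    return min(REFERENCE_KURTOSIS, key=lambda name: abs(REFERENCE_KURTOSIS[name] - k))

rng = random.Random(0)
gaussian_noise = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
# The difference of two unit exponentials follows a Laplacian distribution.
laplacian_noise = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(50_000)]

gaussian_guess = closest_model(gaussian_noise)
laplacian_guess = closest_model(laplacian_noise)
```

Sample kurtosis is noisy on short segments, so a real detector would want a long noise history before trusting this choice.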
Harmonic Frequencies
• Background
– Voiced speech energy clusters around the formants
– More of the energy falls in the frequency bins near the formants
• Algorithm
– If voice is present (high energy level compared to noise)
• Determine the fundamental frequency (F0) using cepstral analysis
or some other method
• Determine the harmonics of F0
• Decide if speech by the geometric mean of the DFT bins in
the vicinity of the harmonics
– Else mark speech based on the geometric mean of all DFT bins
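A rough sketch of the harmonic branch, assuming F0 has already been estimated elsewhere and that `magnitudes` holds one frame of DFT bin magnitudes. The function names, the `width` parameter, and the synthetic spectrum are all hypothetical.

```python
import math

def geometric_mean(values, floor=1e-12):
    # floor guards against log(0) on empty bins
    return math.exp(sum(math.log(max(v, floor)) for v in values) / len(values))

def harmonic_bins(f0_hz, sample_rate, fft_size, width=1):
    """Bin indices within `width` bins of each harmonic of F0 below Nyquist."""
    bin_hz = sample_rate / fft_size
    n_bins = fft_size // 2 + 1
    bins = set()
    h = 1
    while h * f0_hz < sample_rate / 2:
        center = round(h * f0_hz / bin_hz)
        for b in range(center - width, center + width + 1):
            if 0 <= b < n_bins:
                bins.add(b)
        h += 1
    return sorted(bins)

def harmonic_score(magnitudes, f0_hz, sample_rate, fft_size, width=1):
    picked = [magnitudes[b] for b in harmonic_bins(f0_hz, sample_rate, fft_size, width)]
    return geometric_mean(picked)

# Synthetic spectrum: strong peaks every 4th bin (F0 = 125 Hz at 8 kHz, 256-point DFT).
magnitudes = [10.0 if b % 4 == 0 else 1.0 for b in range(129)]
score_harmonics = harmonic_score(magnitudes, 125.0, 8000, 256, width=0)
score_all_bins = geometric_mean(magnitudes)
```

The score near the harmonics is much larger than the all-bin score for a harmonic spectrum, which is the contrast the decision relies on.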
Auto Correlation
• Remove the DC offset and apply pre-emphasis
\(x_f[i] = (s_f[i] - \mu_f) - \alpha\,(s_f[i-1] - \mu_f)\)
where f = frame, \(\mu_f\) = frame mean, α typically 0.96
• Apply the auto-correlation formula to estimate pitch
\(R_f[z] = \sum_{i=1}^{n-z} x_f[i]\, x_f[i+z] \,\Big/\, \sum_{i=1}^{n} x_f[i]^2\), where n is the frame length
\(M[f] = \max_z R_f[z]\)
• Expectation: Voiced speech should produce a higher
M[f] than unvoiced speech, silence, or noise frames
• Notes:
– We can do the same thing with the cepstrum
– Auto-correlation complexity improved by limiting the Rf[z]
values that we bother to compute
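The steps above can be sketched in plain Python. The lag range (`min_lag`, `max_lag`) implements the note about limiting which lags are computed; the function names, the 8 kHz rate, and the test signals are illustrative assumptions.

```python
import math
import random

def preemphasize(frame, alpha=0.96):
    """Remove the DC offset, then apply first-order pre-emphasis."""
    mean = sum(frame) / len(frame)
    centered = [s - mean for s in frame]
    return [centered[0]] + [centered[i] - alpha * centered[i - 1]
                            for i in range(1, len(centered))]

def max_autocorrelation(frame, min_lag, max_lag, alpha=0.96):
    """Return (peak normalized autocorrelation, lag of the peak)."""
    x = preemphasize(frame, alpha)
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return 0.0, 0
    best, best_lag = 0.0, 0
    for z in range(min_lag, max_lag + 1):   # only lags in the plausible pitch range
        r = sum(x[i] * x[i + z] for i in range(len(x) - z)) / energy
        if r > best:
            best, best_lag = r, z
    return best, best_lag

# 100 Hz tone at an assumed 8 kHz sampling rate: pitch period of 80 samples.
voiced = [math.sin(2 * math.pi * 100 * i / 8000) for i in range(400)]
rng = random.Random(1)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]

voiced_peak, voiced_lag = max_autocorrelation(voiced, 20, 160)
noise_peak, _ = max_autocorrelation(noise, 20, 160)
```

The periodic frame yields a peak near its pitch period, while the noise frame yields a much smaller peak.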
Zero Crossing
• The effectiveness of auto correlation
decreases as SNR approaches zero
• Enhancement to Auto Correlation method
when SNR values are low
• Algorithm
– Eliminate the pre-emphasis step (preserve the
original pitch)
– Assume every two zero crossings is a pitch period
– Auto correlate each period with its predecessor
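One possible reading of this algorithm, sketched under stated assumptions: a period starts at every second zero crossing, and each period is compared with its predecessor by normalized correlation over their common length. The function names and the test tone are hypothetical.

```python
import math

def zero_crossings(frame):
    """Indices i where the sign changes between samples i-1 and i."""
    return [i for i in range(1, len(frame))
            if (frame[i - 1] < 0) != (frame[i] < 0)]

def period_correlations(frame):
    """Normalized correlation of each candidate period with its predecessor."""
    bounds = zero_crossings(frame)[::2]      # every two crossings = one period
    periods = [frame[a:b] for a, b in zip(bounds, bounds[1:])]
    scores = []
    for prev, cur in zip(periods, periods[1:]):
        n = min(len(prev), len(cur))         # compare over the common length
        num = sum(prev[i] * cur[i] for i in range(n))
        den = math.sqrt(sum(v * v for v in prev[:n]) * sum(v * v for v in cur[:n]))
        scores.append(num / den if den > 0 else 0.0)
    return scores

# Periodic test tone with an 80-sample period (no pre-emphasis applied).
tone = [math.sin(2 * math.pi * i / 80 + 0.3) for i in range(400)]
scores = period_correlations(tone)
```

Consecutive periods of a periodic signal correlate almost perfectly; irregular crossings in noise give periods of varying length and lower scores.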
Use of Entropy as VAD Metric
FOR each frame
  Decompose the signal into 24 Bark (or Mel) scale bands
  Compute the energy in each frequency band
  FOR each band of frequencies
    energy[band] = \(\sum_{i=b_{start}}^{b_{end}} |x[i]|^2\)
    IF an initial or low energy frame, noise[band] = energy[band]
    ELSE speech[band] = energy[band] – noise[band]
  Sort speech[band] and select the subset of bands with the largest speech[band] values
  Compute the probability P(energy[band]) = energy[band]/totalEnergy
  Compute entropy = \(-\sum_{\text{useful bands}} P(\text{energy[band]}) \log P(\text{energy[band]})\)
Note: We expect higher entropy in noise; a speech signal should be more organized
Adaptive noise adjustment: for frame f and 0 < α < 1
  noise[band] = α · energy[band]_{f-1} + (1 − α) · energy[band]_f
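A minimal Python sketch of the entropy steps, assuming the per-band energies have already been computed. The adaptive update is written as a weighted sum of the previous and current band energies; `keep` and `alpha` are illustrative parameters.

```python
import math

def useful_bands(energy, noise, keep):
    """Indices of the `keep` bands with the largest noise-subtracted energy."""
    speech = [max(e - n, 0.0) for e, n in zip(energy, noise)]
    order = sorted(range(len(speech)), key=lambda b: speech[b], reverse=True)
    return sorted(order[:keep])

def spectral_entropy(band_energies):
    """Entropy of the normalized band-energy distribution (natural log)."""
    total = sum(band_energies)
    if total <= 0:
        return 0.0
    probs = [e / total for e in band_energies if e > 0]
    return -sum(p * math.log(p) for p in probs)

def update_noise(prev_energy, energy, alpha=0.9):
    """Weighted sum of the previous and current frame's band energies."""
    return [alpha * p + (1 - alpha) * e for p, e in zip(prev_energy, energy)]

flat = spectral_entropy([1.0] * 24)                 # noise-like: energy spread evenly
peaked = spectral_entropy([100.0] + [1.0] * 23)     # speech-like: energy concentrated
```

A flat band profile reaches the maximum entropy ln(24), while a profile dominated by a few bands scores lower.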
Unvoiced Speech Detector
Bark Scale Decomposition
• \(E_{L,0}\) = sum of all level 5 energy bands
• \(E_{L,1}\) = sum of the first four level 4 energy bands
• \(E_{L,2}\) = sum of the last five level 4 energy bands + the first level 3 energy band
• IF \(E_{L,2} > E_{L,1} > E_{L,0}\) and \(E_{L,0}/E_{L,2} < 0.99\), THEN the frame is unvoiced speech
G.729 VAD Algorithm
• Importance: An industry standard and a reference to
compare new algorithm proposals
• Overview
– A VAD decision is made every 10 ms
– Features: full band energy, low band energy, zero-crossing rate, and a spectral measure.
– A Long term calculation averages frames judged to not
contain voice
– VAD decision:
• compute differences between a frame and noise estimate.
• Average values from predecessor frames to prevent
eliminating non-voiced speech
• IF differences > threshold, return true; ELSE return false
Other algorithms
1. Analyze a larger window and classify based
on the percentage of time voice appears to
be present or absent
2. Focus on the change of signal peak energy at
the onset and termination of speech.
a. Onset: drastic increase in peak energy
b. Continuous speech: intermittent peak spikes
c. Termination: absence of peak spikes and energy
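The first approach can be sketched as a sliding-window vote over per-frame decisions. The window length and the required fraction are hypothetical values chosen for illustration.

```python
def smooth_decisions(frame_flags, window=3, min_fraction=0.6):
    """Mark a frame as speech when enough frames in its window look like speech."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_flags)):
        segment = frame_flags[max(0, i - half): i + half + 1]
        smoothed.append(sum(segment) / len(segment) >= min_fraction)
    return smoothed

F, T = False, True
raw = [F, F, T, F, F, T, T, F, T, T, F, F]
smoothed = smooth_decisions(raw)
```

The isolated detection at index 2 is discarded and the one-frame gap at index 7 is filled, which is the effect a larger analysis window is meant to achieve.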
Evaluating ASR Performance
• Importance of VAD
– The VAD component impacts speech recognition the most
– Without VAD, ASR accuracy degrades to less than 30% in
noisy environments
• Evaluation standards
– Without an objective standard, researchers will not be
able to evaluate how various algorithms impact ASR accuracy
– H.G. Hirsch, D. Pearce, “The Aurora experimental
framework for the performance evaluation of speech
recognition systems under noisy conditions,” Proc. ISCA
ITRW ASR2000, vol. ASSP-32, pp. 181-188, Sep. 2000