Advanced Speech Recognition

Advanced Speech Enhancement in
Noisy Environments
Qiming Zhu
Supervisor: Prof. John Soraghan
Centre for Excellence in Signal and Image Processing
Dept. of Electronic and Electrical Engineering
Presentation structure
• Introduction
• Speech Enhancement
– Improved Minima Controlled Recursive Averaging (IMCRA)
• Robust Voice Activity Detection (VAD)
– 1-D Local Binary Pattern (LBP)
– 1-D LBP of energy based VAD
– Performance Evaluation
• Improved IMCRA
– Performance Evaluation
• Discussion & Conclusion
Introduction
• Automatic speech recognition (ASR)
– A speech recognition system aims to create intelligent
machines that can ‘hear’, ‘understand’ and ‘comply with’
speech input.
– Speech enhancement and VAD are integral parts of an
ASR system.
• Aim of current research
– Improve recognition system performance against a
babble-noise background.
IMCRA
• IMCRA:
IMCRA Processing
* Israel Cohen, ‘Noise spectrum estimation in adverse environments: improved minima
controlled recursive averaging’ (IEEE Trans. on Speech and Audio Processing, 2003)
IMCRA with babble
• IMCRA Performance
– Clean Signal:
– Noisy Signal at 0 dB:
– Enhanced by IMCRA:
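A noisy test signal at a chosen SNR, such as the 0 dB case listed above, can be produced by scaling the noise before adding it to the clean speech. The following is a minimal sketch, not the procedure used for these slides: the function name `mix_at_snr` is hypothetical, SNR is assumed to be measured over the whole utterance, and the noise recording is assumed to be at least as long as the speech.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so the overall speech-to-noise power ratio is snr_db."""
    speech = np.asarray(speech, dtype=float)
    # Assumption: the noise is at least as long as the speech; truncate it to match.
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Choose the gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Calling `mix_at_snr(clean, babble, 0.0)` gives a mixture in which speech and noise have equal power.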
1-D LBP
• 2-D LBP
– Extensively used in 2-D image processing
• 1-D LBP
– Used for 1-D signal processing (Navin Chatlani, EUSIPCO
2010, Qiming Zhu, EUSIPCO 2012)
– LBP Code Calculation:
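$$\mathrm{LBP}_P(x[i]) = \sum_{r=0}^{P/2-1} \left\{ S\big[x[i+r-P/2] - x[i]\big]\,2^{r} + S\big[x[i+r+1] - x[i]\big]\,2^{r+P/2} \right\}$$
where P is the number of neighbouring samples used. The sign function S[·] is:
$$S[x] = \begin{cases} 1, & \text{for } x \ge 0 \\ 0, & \text{for } x < 0 \end{cases}$$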
– Onset detection of myoelectric signals (Paul McCool,
EUSIPCO 2012)
1-D LBP code
• 1-D LBP calculates the LBP code by thresholding the
neighbouring samples against the centre sample.
LBP code calculation for P = 8
*Navin Chatlani et al, ‘Local binary patterns for 1-D signal processing’, (EUSIPCO 2010)
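The code calculation can be sketched directly from the LBP formula: each of the P/2 samples to the left of the centre contributes a bit with weight 2^r, and each of the P/2 samples to the right contributes a bit with weight 2^(r+P/2). This is an illustrative sketch only; the function name `lbp_1d` and the 0-based indexing are assumptions, and codes are produced only for samples that have a full set of neighbours.

```python
import numpy as np

def lbp_1d(x, P=8):
    """Return the 1-D LBP code of every sample with P/2 neighbours on each side."""
    x = np.asarray(x, dtype=float)
    half = P // 2
    codes = np.zeros(max(0, len(x) - 2 * half), dtype=int)
    for k, i in enumerate(range(half, len(x) - half)):
        code = 0
        for r in range(half):
            # Left neighbour x[i + r - P/2]: bit weight 2^r.
            code += int(x[i + r - half] - x[i] >= 0) << r
            # Right neighbour x[i + r + 1]: bit weight 2^(r + P/2).
            code += int(x[i + r + 1] - x[i] >= 0) << (r + half)
        codes[k] = code
    return codes
```

For P = 8 the codes lie between 0 (every neighbour is below the centre sample) and 255 (every neighbour is at or above it).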
1-D LBP histogram
• The distribution of the LBP codes over a window of N samples
forms a histogram that describes the signal x[i]:
$$H_b = \sum_{P/2 \le i \le N - P/2} \delta\big(\mathrm{LBP}_P(x[i]),\, b\big)$$
where b = 0, 1, ⋯, B and B is the number of histogram bins. δ(i, j) is the Kronecker delta
function.
1-D LBP histogram computed from the windowed data
Overview of 1-D LBP procedure on a histogram
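The histogram step amounts to counting how often each LBP code occurs inside one analysis window. The sketch below is a hedged illustration rather than the original implementation: the names `lbp_codes` and `lbp_histogram` are hypothetical, indexing is 0-based, and one bin per possible code (2^P bins) is assumed.

```python
import numpy as np

def lbp_codes(x, P):
    """Vectorised 1-D LBP codes for samples with P/2 neighbours on each side."""
    x = np.asarray(x, dtype=float)
    half = P // 2
    centre = x[half:len(x) - half]
    n = len(centre)
    codes = np.zeros(n, dtype=int)
    for r in range(half):
        left = x[r:r + n]                          # x[i + r - P/2]
        right = x[half + r + 1:half + r + 1 + n]   # x[i + r + 1]
        codes += (left >= centre).astype(int) << r
        codes += (right >= centre).astype(int) << (r + half)
    return codes

def lbp_histogram(window, P=8):
    """Count occurrences of each LBP code in one window (assumed 2**P bins)."""
    return np.bincount(lbp_codes(window, P), minlength=2 ** P)
```

Entry `b` of the returned array plays the role of H_b for that window.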
1-D LBP of energy
• Short-time energy and the histogram
Speech Signals and the Short-time Energy
a) energy of clean speech signal, b) energy of noisy speech signal, c) histogram of
clean speech energy, d) histogram of noisy speech energy.
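Short-time energy can be sketched as the sum of squared samples in each frame. The example assumes non-overlapping frames and uses the 16 kHz sampling rate and 5 ms energy window quoted in the experimental setup of this deck; the function name `short_time_energy` and the framing scheme are assumptions, since the slides do not state the hop size.

```python
import numpy as np

def short_time_energy(signal, fs=16000, frame_ms=5.0):
    """Energy E[m] of consecutive, non-overlapping frames (assumed framing)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))   # 80 samples at 16 kHz, 5 ms
    n_frames = len(signal) // frame_len
    # Drop the incomplete final frame, then reshape to (frames, samples).
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sum(frames ** 2, axis=1)
```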
1-D LBP of energy with offset value
• LBP code with offset values 𝜶
$$\mathrm{LBP}'_P(E[m]) = \sum_{r=0}^{P/2-1} \left\{ S\big[E[m+r-P/2] - E[m] - \alpha\big]\,2^{r} + S\big[E[m+r+1] - E[m] - \alpha\big]\,2^{r+P/2} \right\}$$
H_0 of the energy for different offset values α
a) E[m] of noisy signal, b) H_0 with α = 0.01, c) H_0 with α = 0.02,
d) H_0 with α = 0.03, e) H_0 with α = 0.04, f) H_0 with α = 0.05
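The effect of the offset can be illustrated with a short sketch: a neighbour only sets its bit when it exceeds the centre energy by at least α, so a flat energy contour yields code 0 and raises the count in bin 0. The names `lbp_offset_codes` and `h0` are hypothetical, P = 2 is taken from the experimental setup quoted later in the deck, and how the histogram window slides over E[m] is left to the caller because the slides do not specify it.

```python
import numpy as np

def lbp_offset_codes(E, alpha, P=2):
    """1-D LBP codes of an energy sequence with offset alpha in the threshold."""
    E = np.asarray(E, dtype=float)
    half = P // 2
    centre = E[half:len(E) - half]
    n = len(centre)
    codes = np.zeros(n, dtype=int)
    for r in range(half):
        left = E[r:r + n]                          # E[m + r - P/2]
        right = E[half + r + 1:half + r + 1 + n]   # E[m + r + 1]
        codes += (left - centre - alpha >= 0).astype(int) << r
        codes += (right - centre - alpha >= 0).astype(int) << (r + half)
    return codes

def h0(E_window, alpha, P=2):
    """Number of code-0 occurrences (histogram bin 0) in one window of energies."""
    return int(np.sum(lbp_offset_codes(E_window, alpha, P) == 0))
```

With a constant energy window, `h0` equals the number of coded frames for any α > 0 and drops to zero for α = 0, which is the sensitivity to α that the panels above explore.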
1-D LBP of energy based VAD
• System block diagram
VAD block diagram
VAD performance
• Experimental background
– The test speech is sampled at 16 kHz. The test set is 73 seconds
long and is mixed with babble noise at SNRs from 0 to 20 dB. α is set
to 0.03.
– VAD 1: 1-D LBP of energy based VAD.
– VAD 0: VAD proposed by Navin Chatlani.
– G.729: G.729 B Standard VAD.
– HR0: Speech absence hit-rate:
$$\mathrm{HR0} = \frac{N_{0,0}}{N_0^{\mathrm{ref}}}$$
– FAR0: Speech absence false alarm rate:
$$\mathrm{FAR0} = 1 - \frac{N_{1,1}}{N_1^{\mathrm{ref}}}$$
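The two rates can be computed from frame-level labels as in the sketch below. The slides do not define the counts, so the reading used here is an assumption: N_{0,0} is the number of reference non-speech frames also detected as non-speech, N_{1,1} is the number of reference speech frames also detected as speech, and N_0^ref and N_1^ref are the reference totals. The function name `vad_rates` is hypothetical.

```python
import numpy as np

def vad_rates(reference, decision):
    """Return (HR0, FAR0) from per-frame labels, where True/1 means speech."""
    ref = np.asarray(reference, dtype=bool)
    dec = np.asarray(decision, dtype=bool)
    n0_ref = np.sum(~ref)        # reference non-speech frames
    n1_ref = np.sum(ref)         # reference speech frames
    n00 = np.sum(~ref & ~dec)    # non-speech correctly detected (assumed N_{0,0})
    n11 = np.sum(ref & dec)      # speech correctly detected (assumed N_{1,1})
    # Both classes must be present in the reference, otherwise the ratios are undefined.
    hr0 = float(n00 / n0_ref)
    far0 = float(1.0 - n11 / n1_ref)
    return hr0, far0
```

Under this reading a good detector has HR0 close to 1 and FAR0 close to 0.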
VAD performance
• VAD performance
Improved IMCRA
• Experimental background
– 198 samples from the VoxForge database, from 9 speakers
(6 male, 3 female), sampled at 16 kHz.
– Babble noise from the NOISEX-92 database added at SNRs from
-10 dB to 10 dB.
– Energy window size set to 5 ms, P = 2, histogram size set
to 30 ms.
– Segmental SNR and weighted spectral slope (WSS) are
used to compare the performance.
*D. H. Klatt, ‘Prediction of perceived phonetic distance from critical-band spectra’,
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1982
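Segmental SNR is commonly computed as the average of per-frame SNRs between the clean and the enhanced signal. The sketch below follows that common definition and is not taken from the slides: the function name `segmental_snr`, the 20 ms frame length and the clamping of each frame to the range -10 dB to 35 dB are all assumptions.

```python
import numpy as np

def segmental_snr(clean, enhanced, fs=16000, frame_ms=20.0,
                  min_db=-10.0, max_db=35.0, eps=1e-12):
    """Mean per-frame SNR in dB between a clean signal and its enhanced version."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))
    n_frames = min(len(clean), len(enhanced)) // frame_len
    used = n_frames * frame_len
    s = clean[:used].reshape(n_frames, frame_len)
    e = (clean[:used] - enhanced[:used]).reshape(n_frames, frame_len)
    # Per-frame ratio of signal energy to error energy, in dB.
    snr = 10.0 * np.log10((np.sum(s ** 2, axis=1) + eps) /
                          (np.sum(e ** 2, axis=1) + eps))
    # Clamp each frame so silent or near-perfect frames do not dominate the mean.
    return float(np.mean(np.clip(snr, min_db, max_db)))
```

Higher values indicate that the enhanced signal is closer to the clean reference.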
Improved IMCRA with babble noise
• Performance
– Clean signal:
– Noisy signal (SNR at 0 dB):
– IMCRA:
– Improved IMCRA:
Improved IMCRA with babble noise
• Performance
Segmental SNR
Improved IMCRA with babble noise
• Performance
Weighted spectral slope
Discussion
• Conclusions from the results
– 1-D LBP in the energy domain can distinguish the voiced and
unvoiced components of noisy speech signals.
– LBP in the energy domain is shown to be superior to the
G.729 VAD and Navin’s LBP VAD.
– Improved IMCRA is superior to IMCRA, with improved
segmental SNR and higher likelihood.
• Future work
– Apply this algorithm as the pre-processing stage of an ASR
system.
Acknowledgements
• Thanks to Prof. John Soraghan for the idea of babble noise
reduction.
• Thanks to Paul and Navin for their previous work on 1-D LBP.
Thank you!
Any questions?