
Biologically Inspired Noise Robust Speech Recognition for Both Man and Machine

Mark D. Skowronski Ph.D. Proposal University of Florida Gainesville, FL, USA

Outline

• Introduction • Biologically inspired algorithms – Speech: Energy Redistribution – Features: Human Factor Cepstral Coefficients – Classifier: Nonlinear dynamic systems • Future work


Biological Inspiration

Example of read speech in AWGN at 10 dB SNR (Wall Street Journal/Broadcast News readings): untrained human listeners vs. the Cambridge HTK LVCSR system.

• Introduction • Biologically inspired algorithms – Speech: Energy Redistribution – Features: Human Factor Cepstral Coefficients – Classifier: Nonlinear dynamic systems • Future work

Speech Enhancement

Motivations:
• Noisy cell phone conversations
• Power-constrained transducers
• Public address systems in noisy environments

What can you do when turning up the volume is not an option?

Biology: Lombard Effect

The Lombard Effect

Lombard Effect: changes in vocal characteristics, produced by a speaker in the presence of background noise.

• Amplitude increases.

• Duration increases.

• Pitch increases.

• Formant frequencies increase.

• High-freq to low-freq energy ratio increases.

• Intelligibility increases.

Psychoacoustic Experiments

Speech contains regions of relatively high information content, and emphasis of these regions increases perceived intelligibility.

• Fletcher (1953): under LPF or HPF, phonemes varied in robustness to the filtering process, with vowels being the most robust.

• Miller and Nicely (1955): AWGN added to speech affects place of articulation and frication most, less so voicing and nasality.

• Furui (1986): truncated vowels in consonant-vowel pairs dramatically decreased in intelligibility beyond a certain point of truncation. These points correspond to spectrally dynamic regions.

Solution: Energy Redistribution

We redistribute energy from regions of low information content to regions of high information content while conserving overall energy across words.

We partition speech into voiced/unvoiced regions using the Spectral Flatness Measure (SFM), the ratio of the geometric mean to the arithmetic mean of the magnitude spectrum:

SFM_j = [ ∏_{k=1}^{N} X_j(k) ]^{1/N} / [ (1/N) ∑_{k=1}^{N} X_j(k) ]

where X_j(k) is the magnitude of the short-term Fourier transform of the j-th speech window of length N. [SFM of “clarification” shown as an example.]
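As a sketch, the SFM above can be computed per window with NumPy; the window length and the example signals below are illustrative assumptions, not settings from the proposal:

```python
import numpy as np

def sfm(window):
    """Ratio of geometric to arithmetic mean of the magnitude spectrum."""
    X = np.abs(np.fft.rfft(window))
    X = np.maximum(X, 1e-12)          # floor to avoid log(0)
    geo = np.exp(np.mean(np.log(X)))  # geometric mean
    ari = np.mean(X)                  # arithmetic mean
    return geo / ari

# A flat (noise-like) spectrum yields an SFM near 1 (unvoiced);
# a peaky (harmonic) spectrum yields a lower SFM (voiced).
rng = np.random.default_rng(0)
noise = rng.standard_normal(512)
tone = np.sin(2 * np.pi * 0.1 * np.arange(512))
print(sfm(noise) > sfm(tone))  # True
```

Thresholding this per-window value then gives the voiced/unvoiced partition used by energy redistribution.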

Listening Test

Confusable set test, from Junqua:
  I: f, s, x, yes
  II: a, h, k, 8
  III: b, c, d, e, g, p, t, v, z, 3
  IV: m, n
• 500 trials, forced decision
• 3 algorithms (control, ERVU, HPF)
• 0 dB and -10 dB SNR, AWGN
• unlimited playback over headphones
• 26 participants, 30-45 minutes

Listening Test Results

-10 dB SNR, white noise: errors decreased 20% compared to control. [Confusion results shown for “S”, “A”, “E”, “M”.]

Energy Redistribution Summary

• Biologically inspired:
  – Lombard Effect says how to modify.
  – Psychoacoustic experiments say where to modify.

• Increases intelligibility while maintaining naturalness and conserving energy.

• Naturalness elegantly preserved by retaining spectral and temporal cues.

• Effective because everyday speech is not clearly enunciated.

• Introduction • Biologically inspired algorithms – Speech: Energy Redistribution – Features: Human Factor Cepstral Coefficients – Classifier: Nonlinear dynamic systems • Future work

ASR Introduction

Automatic Speech Recognition (ASR) is the extraction of linguistic information from an utterance of speech (speech-to-text).

• Isolated / continuous speech
• Speaker-dependent / speaker-independent operation
• Word / phoneme recognition unit
• Vocabulary size and perplexity

Pipeline: input → feature extraction → classification → “seven”

Information in the input: phonetic, gender, age, emotion, pitch, accent, physical state, additive/channel noise.

Feature Extraction

Goal: emphasize phonetic information over other characteristics.

• Acoustic: formant frequencies, bandwidths
• Model based: linear prediction
• Filter-bank based: mel frequency cepstral coefficients (mfcc)

Provides dimensionality reduction on quasi-stationary windows. [Feature trajectories over time for “seven” shown.]

Hidden Markov Model

[HMM of “one”: time domain, state space, and feature space views.]

MFCC Algorithm

MFCC is the most widely used speech feature extractor.

Pipeline: “seven” x(t) → mel-scaled filter bank → log energy → DCT → cepstral domain.

DCT vs Eigenvectors

Spectra of DCT basis vectors vs. spectra of eigenvectors computed from the log energy of filtered speech: average spectral difference < 15%. [Plots over basis # and frequency shown.]

MFCC Filter Bank

• Design parameters: filter bank frequency range, number of filters.

• Center frequencies equally spaced in mel frequency.

• Triangle endpoints set by center frequencies of adjacent filters.

Although filter spacing is determined by the perceptual mel frequency scale, bandwidth is set more for convenience than by biological motivation.
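A minimal NumPy sketch of such a mel-spaced triangular filter bank inside the MFCC pipeline (power spectrum → filter-bank energies → log → DCT); the sample rate, filter count, and coefficient count below are illustrative assumptions, not the settings used in the experiments:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs=8000, n_filters=20, n_ceps=13):
    """MFCCs for one quasi-stationary window of speech."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    # Center freqs equally spaced in mel; triangle endpoints at the
    # center freqs of adjacent filters, as on the slide.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0),
                                  n_filters + 2))
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, cf, hi = edges[i], edges[i + 1], edges[i + 2]
        up = np.clip((freqs - lo) / (cf - lo), 0.0, None)    # rising edge
        down = np.clip((hi - freqs) / (hi - cf), 0.0, None)  # falling edge
        tri = np.minimum(up, down)                           # triangle
        energies[i] = max(tri @ power, 1e-12)
    logE = np.log(energies)
    # DCT-II of log filter-bank energies -> cepstral coefficients
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filters)
    return basis @ logE

frame = np.sin(2 * np.pi * 440 * np.arange(256) / 8000)
ceps = mfcc_frame(frame)
print(ceps.shape)  # (13,)
```

HFCC reuses this structure but replaces the triangle widths implied by adjacent center frequencies with widths set from critical bandwidth (ERB).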

Human Factor Cepstral Coefficients

• Decouple filter bandwidth from filter bank design parameters.

• Set filter width according to the critical bandwidth of the human auditory system.

• Use Moore and Glasberg approximation of critical bandwidth, defined in Equivalent Rectangular Bandwidth (ERB).

ERB = 6.23 f_c² + 93.39 f_c + 28.52 (Hz)

where f_c is the critical-band center frequency (kHz).
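The approximation is easy to evaluate directly; this tiny helper just restates the slide's formula (f_c in kHz, ERB in Hz):

```python
# Moore & Glasberg critical-bandwidth approximation from the slide:
# ERB (Hz) as a function of critical-band center frequency f_c (kHz).
def erb(fc_khz):
    return 6.23 * fc_khz**2 + 93.39 * fc_khz + 28.52

# At 1 kHz: 6.23 + 93.39 + 28.52 = 128.14 Hz
print(erb(1.0))
```

HFCC uses this value (optionally scaled by a linear E-factor) as the filter bandwidth, decoupled from the number of filters and the frequency range.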

ASR Experiments Review

• Isolated English digits “zero” through “nine” from TI-46 corpus, 8 male speakers
• HMM word models, 8 states per model, diagonal covariance matrix
• Three mfcc versions (different filter banks)
• Several degrees of freedom
• Linear ERB scale factor

ASR Results

White noise (local SNR), hfcc vs D&M

ASR Results

White noise (global SNR), hfcc vs D&M, Linear ERB scale factor (E-factor).

HFCC Conclusions

• Added biologically inspired bandwidth to the filter bank of a popular speech feature extractor.

• Decoupled bandwidth from other filter bank design parameters.

• Demonstrated superior noise-robust performance of the new feature extractor.

• Demonstrated advantages of wider filters.

• Introduction • Biologically inspired algorithms – Speech: Energy Redistribution – Features: Human Factor Cepstral Coefficients – Classifier: Nonlinear dynamic systems • Future work

HMM Limitations

• HMMs are piecewise-stationary, while speech is continuous and nonstationary.

• Assumes frames of speech are i.i.d.

• State pdf estimates are data-driven.

HMMs make no claim of modeling biology.

Novel Classifiers

• Deng's trended HMM.

• Rabiner's autoregression HMM.

• Morgan's HMM/neural network hybrid.

• Robinson's recurrent neural network.

• Wismüller's self-organizing map.

• Herrmann's transient attractor network.

• Maass' dynamic synapse MLP.

• Berger's dynamic synapse RNN.

Freeman's Chaotic Model

• Biologically inspired nonlinear dynamic model of cortical signal processing, from rabbit olfactory neo-cortex experiments.

• A hierarchical network of oscillators that are

locally stable

and

globally chaotic

.

• Demonstrated as a classifier of static patterns.

• Represents a radical departure from current classifier paradigms.

KI Model

• Smallest element in network hierarchy.

(1/(a·b)) d²x_i(t)/dt² + ((a+b)/(a·b)) dx_i(t)/dt + x_i(t) = ∑_{j≠i}^{N} W_ij Q(x_j(t), q_j) + I_i(t),   i = 1, …, N

• a, b: constants
• x_i(t): state variable (N states)
• W_ij: weight from state j to state i
• Q: asymmetric sigmoid
• I_i(t): input to state i
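To make the dynamics concrete, here is a sketch of a single KI element driven by a constant input, with the network sum and sigmoid Q dropped and the equation integrated by semi-implicit Euler; the values of a, b, the input, and the step size are illustrative assumptions:

```python
import numpy as np

def simulate_ki(a=0.22, b=0.72, I=1.0, dt=0.01, steps=5000):
    """One KI element: (1/ab)x'' + ((a+b)/ab)x' + x = I(t), network term dropped."""
    x, v = 0.0, 0.0              # state and its derivative
    out = np.empty(steps)
    for n in range(steps):
        # Multiply the slide's equation by a*b: x'' = ab*(I - x) - (a+b)*x'
        acc = a * b * (I - x) - (a + b) * v
        v += dt * acc            # semi-implicit Euler update
        x += dt * v
        out[n] = x
    return out

x = simulate_ki()
print(abs(x[-1] - 1.0) < 0.01)  # True: the element settles at the input level
```

In isolation the element is a damped second-order system; the interesting behavior appears only when elements are coupled into KII networks.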

Reduced KII Network

• Locally stable element is the KII network.
• m(t): excitatory mitral cell
• g(t): inhibitory granule cell
• N pairs in parallel
• Weights K_mg > 0, K_gm < 0
• Mitral cells fully connected
• Granule cells fully connected
• Input I(t) into the excitatory cell

KII Simulations

[Simulated m(t) and g(t) traces.] The reduced KII reaches a steady-state point attractor or a limit cycle, based on |K_mg · K_gm|.
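One way to see the role of the coupling product is to linearize a reduced KII pair (one mitral, one granule cell, sigmoid Q replaced by identity), stack the two second-order equations into a 4×4 state matrix, and inspect its eigenvalues; a, b, and the weights below are illustrative, and with this linearization the pair is a damped oscillator heading to the point attractor (the limit cycle requires the nonlinearity):

```python
import numpy as np

def kii_eigs(Kmg=2.0, Kgm=-2.0, a=0.22, b=0.72):
    """Eigenvalues of the linearized mitral-granule pair, state [m, m', g, g']."""
    ab, s = a * b, a + b
    A = np.array([[0.0,      1.0, 0.0,      0.0],
                  [-ab,      -s,  ab * Kgm, 0.0],   # granule inhibits mitral
                  [0.0,      0.0, 0.0,      1.0],
                  [ab * Kmg, 0.0, -ab,      -s]])   # mitral excites granule
    return np.linalg.eigvals(A)

eigs = kii_eigs()
# Negative coupling product K_mg*K_gm gives complex eigenvalues (oscillation);
# negative real parts mean the linearized pair decays to the point attractor.
print(np.all(eigs.real < 0), np.any(np.abs(eigs.imag) > 0))  # True True
```

Restoring the asymmetric sigmoid makes the oscillation self-sustaining for a sufficiently large |K_mg · K_gm|, which is the bifurcation the slide's simulations illustrate.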

• Introduction • Biologically inspired algorithms – Speech: Energy Redistribution – Features: Human Factor Cepstral Coefficients – Classifier: Nonlinear dynamic systems • Future work

Work Completed

1. Developed biologically inspired algorithms:
   – Energy redistribution: combines the Lombard Effect (how) with psychoacoustic experimental results (where) to increase speech intelligibility.
   – Human factor cepstral coefficients: combines an existing speech front end (mfcc) with critical bandwidth information (ERB).

2. Published 3 papers, and submitted 3 more, on the novel algorithms.

3. Literature survey on novel speech classifiers, and simulations of the nonlinear Freeman model.

Work Proposed

1. Compare hfcc to human speech recognition using a rhyming test in ASR experiments.

2. Measure effects of ERVU in ASR experiments.

3. Analyze the hfcc algorithm, accounting for the nonlinear log(·) function.

4. Experiment with other bandwidth functions besides ERB or scaled ERB.

5. Quantify the tradeoff between spectral resolution and noise smoothing for hfcc using synthetic data.

Work Proposed, Con't

6. Build on the reduced KII network results recently reported by CNEL suggesting the network can operate as a content-addressable memory (CAM).

7. Investigate alternative information storage strategies to CAM, focusing on the inherent time-varying nature of the dynamic system (coupling theory is intriguing).

8. Expand the literature search to areas outside speech recognition that use nonlinear dynamic (chaotic) systems for information processing/storage, with emphasis on applications with time-varying signals.

Work Proposed, Con't

9. Consider alternative roles for nonlinear dynamics: embedded extracted features for the hfcc/HMM system, trajectory tracking in the spirit of Deng’s trended HMM.

10. Demonstrate classification of static vowel patterns (vowel phonemes) with novel classifier, in presence of noise.

11. Demonstrate classification of time-varying signals (isolated English digits, rhyming test corpus), in noisy environments.