Transcript Part1

Introduction

C.V.

 Juan Arturo Nolazco-Flores
 Associate Professor, Computer Science Department, ITESM, Campus Monterrey, México
 Courses:
– Speech Processing, Computer Networks.
 Ph.D. in Speech Recognition
 M.Phil. in Speech and Language Processing
 M.Sc. in Control Engineering
 B.Sc. in Electronic Systems

Useful Information

 E-mail: [email protected]

 Office: VII-426
 Phone: 83-582000, ext. 4535, subext. 114

Plan of Work

 Fundamentals of Speech Science (8 hours)
– Speech Production System
– Acoustic-Phonetic Characterisation
 Modelling Speech Production
 Signal Processing and Analysis Methods (8 hours)
– Bank-of-filters
– Windowing
– LPC
– Cepstral Coefficients
– Vector Quantization

Plan of Work (continued)

 Speech Recognition (12 hours)
– Distances
– Time Alignment and Normalisation
– DTW
– Discrete HMM
– Continuous HMM
 Speech Coding

Speech Recognition Chapter 1 (Rabiner & Juang)

Introduction

 What is speech recognition?
– It is the identification of the words in an utterance (speech -> orthographic transcription).
– It is based on pattern-matching techniques.
– Knowledge is learnt from data, usually using stochastic techniques.
– It uses powerful algorithms to optimise a mathematical model for a given task.

Notes

 Do not confuse with speech understanding, which is the identification of the meaning of the utterance.
 Do not confuse with speaker recognition, which is the identification of a speaker within a set of speakers.
– Main problem: the speaker does not want to be recognised.
 Do not confuse with speaker verification, which verifies whether a speaker is who he or she claims to be.
– Main problem: the speaker may have a pharyngeal problem.

Word Speech Recognition

[Block diagram: User -> Speech -> Word Recognition System. Other labels: Set of valid words, Syntax.]

Model for Speech Understanding

[Block diagram labels: User, Speech, Word Recognition Model, Higher Level Processing, Voice Output; Syntax, Semantics, Pragmatics; Task Description.]

Speech Recognition Disciplines

 Signal Processing: spectral analysis.
 Physics (Acoustics): human hearing studies.
 Pattern Recognition: data clustering.
 Communication and Information Theory: statistical models, Viterbi algorithms, etc.
 Linguistics: grammar and language parsing.
 Physiology: knowledge-based systems.
 Computer Science: efficient algorithms, UNIX, C language.

History (50’s)

 Speaker-dependent isolated digit recognition system (Bell Labs, 1952).
 Phone recogniser (4 vowels and 9 consonants) (UCL, 1959).
– Statistical recognition.
 Speaker-independent recognition of 10 vowels (MIT, 1959).

History (60’s)

 Hardware vowel recogniser (Radio Research Lab., Tokyo, 1960).
 Hardware phoneme recogniser (Kyoto University, 1962).
 Realistic solution to the problem of nonuniformity of time scales in speech events (RCA Labs, 1964).
 DTW (Soviet Union, 1968). Rediscovered in the West in the 80’s.
 Continuous tracking of phonemes (CMU, 1966).

History (70’s)

 Research effort in isolated-word recognition.
 Dynamic programming methods successfully applied to speech recognition.
 Use of LPC in speech recognition.
 Work starts on speaker-independent speech recognition.

History (80’s)

 Research effort in connected-word recognition.
 Shift from the template-based approach to statistical modelling methods (especially HMMs).
 Applications of neural networks to speech recognition.
 Strong impetus to large-vocabulary, continuous speech recognition.
 DARPA (Defense Advanced Research Projects Agency) project, which sponsored a large research programme to obtain high recognition performance for a 1000-word database.

History (90’s)

 DARPA project.
 Emphasis on natural language.
– Spontaneous speech.
 Speech technology used within telephone networks.

Why is it difficult?

 Speech is a complex combination of information from different levels that is used to convey a message.
 Signal variability:
– Intra-speaker variability: emotional state, environment (Lombard effect).
– Inter-speaker variability: physiological differences, accent, dialect, etc.
– Acoustic channel: telephone channel, background noise/speech, etc.

Task classification

 Mode of speaking: isolated word, connected word, continuous.
 Speaker set: speaker dependent, multi-speaker, speaker independent.
 Environment: noise free, office, telephone, high noise.
 Vocabulary (number of words): small (<50), medium (<500), large (<5000), very large (>5000).
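
The vocabulary size classes in the task classification can be written down as a small lookup. This is only a sketch: the function name `vocabulary_class` is hypothetical, and because the slide does not say where exactly 5000 words falls, the code assumes it counts as very large.

```python
def vocabulary_class(num_words: int) -> str:
    """Map a vocabulary size (in words) to the class names used on the slide."""
    if num_words < 0:
        raise ValueError("vocabulary size cannot be negative")
    if num_words < 50:
        return "small"
    if num_words < 500:
        return "medium"
    if num_words < 5000:
        return "large"
    # Assumption: the slide leaves exactly 5000 unassigned; treat it as very large.
    return "very large"


# A digit recogniser (10 words) and a 1000-word task, as mentioned in the history slides.
print(vocabulary_class(10))
print(vocabulary_class(1000))
```

Under these thresholds, an isolated digit recogniser is a small-vocabulary task, while a 1000-word database falls in the large class.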