A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.
A brief overview of Speech Recognition and Spoken Language Processing
Advanced NLP Guest Lecture August 31 Andrew Rosenberg
Speech and NLP
• Communication in Natural Language
• Text:
– Carefully prepared
– Grammatical
– Machine readable (typos; sometimes OCR or handwriting issues)
Speech and NLP
• Communication in Natural Language
• Speech:
– Spontaneous
– Less Grammatical
– Machine readable, with > 10% error when using speech recognition.
NLP Tasks
• Parsing
• Name Tagging
• Sentiment Analysis
• Entity Coreference
• Relation Extraction
• Machine Translation
Speech Tasks
• Parsing
– Speech isn’t always grammatical
• Name Tagging
– If a name isn’t “in vocabulary” what do you do?
• Sentiment Analysis
– How the words are spoken helps.
• Entity Coreference
• Relation Extraction
• Machine Translation
– How can these handle misrecognition errors?
Speech Tasks
• Speech Synthesis
• Text Normalization
• Dialog Management
• Topic Segmentation
• Language Identification
• Speaker Identification and Verification
– Authorship and security
The traditional view
[Diagram: a Text Processing System (Named Entity Recognizer) is trained on Text Documents and applied to Text Documents]
The simplest approach
[Diagram: a Text Processing System (Named Entity Recognizer) is trained on Text Documents and applied to Transcribed Documents]
Speech is errorful text
[Diagram: a Text Processing System (Named Entity Recognizer) is trained on Transcribed Documents and applied to Transcribed Documents]
Speech signal can be used
[Diagram: a Text Processing System (Named Entity Recognizer) is trained on Transcribed Documents and applied to Transcribed Documents]
Hybrid speech signal and text
[Diagram: a Text Processing System (Named Entity Recognizer) is trained on Transcribed Documents and Text Documents and applied to Transcribed Documents]
Speech Recognition
• Standard HMM speech recognition.
• Front End
• Acoustic Model
• Pronunciation Model
• Language Model
• Decoding
Speech Recognition
[Diagram: Front End → Acoustic Feature Vector → Acoustic Model → Phone Likelihoods → Pronunciation Model → Word Likelihoods → Language Model → Word Sequence]
Speech Recognition
• Front End: Convert sounds into a sequence of observation vectors
• Language Model: Calculate the probability of a sequence of words
• Acoustic Model: The probability of a set of observations given a phone label
• Pronunciation Model: The probability of a pronunciation given a word
Front End
• How do we convert a wave form into a useful representation?
• We are looking for a vector of numbers that describes the acoustic content
• Assuming 22kHz 16bit sound. Modeling this directly is not feasible.
Discrete Cosine Transform
• Every wave can be decomposed into component sine or cosine waves.
• Fast Fourier Transform is used to do this efficiently.
Overlapping frames
• Spectrograms allow for visual inspection of spectral information.
• We are looking for a compact, numerical representation
[Figure: waveform divided into overlapping frames, each labeled 10ms]
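A minimal sketch of the framing step, in plain Python. The 25 ms window, 10 ms hop, Hamming window, 8 kHz sample rate and the test tone are all assumptions made for illustration (the slides assume 22kHz sound), and the naive DFT stands in for an FFT, which computes the same values faster.

```python
import cmath
import math

def frame_signal(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a list of samples into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop_len)]

def hamming(frame):
    """Taper the frame edges to reduce spectral leakage."""
    n = len(frame)
    return [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def magnitude_spectrum(frame):
    """Naive O(n^2) DFT magnitudes for bins 0 .. n/2."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame)))
            for k in range(n // 2 + 1)]

# Hypothetical input: 0.1 s of a 400 Hz sine sampled at 8 kHz.
sample_rate = 8000
signal = [math.sin(2 * math.pi * 400 * t / sample_rate) for t in range(800)]

frames = frame_signal(signal, sample_rate)
spectrum = magnitude_spectrum(hamming(frames[0]))
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * sample_rate / len(frames[0])
```

Each frame yields one spectrum, so the sequence of frames gives the columns of a spectrogram.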
Single Frame of FFT
Australian male /iː/ from “heed”, FFT analysis window 12.8ms
http://clas.mq.edu.au/acoustics/speech_spectra/fft_lpc_settings.html
Example Spectrogram
“Standard” Representation
• Mel Frequency Cepstral Coefficients – MFCC
• Pipeline: Pre Emphasis → window → FFT → Mel-Filter Bank → log → FFT⁻¹ → 12 MFCC (plus energy) → Deltas
• Resulting features: 12 MFCC, 12 ∆ MFCC, 12 ∆∆ MFCC, 1 energy, 1 ∆ energy, 1 ∆∆ energy
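A rough sketch of how the delta and delta-delta features could be appended to give 39 values per frame. The two-point difference used here is a simplification chosen for brevity (toolkits typically use a regression over a wider window), the input values are made up, and the output is laid out as 13 static, 13 ∆, 13 ∆∆ rather than in the order listed on the slide.

```python
def deltas(frames):
    """Central differences over time; the first and last frames are repeated as padding."""
    padded = [frames[0]] + frames + [frames[-1]]
    return [
        [(nxt - prv) / 2.0 for prv, nxt in zip(padded[t], padded[t + 2])]
        for t in range(len(frames))
    ]

def add_dynamic_features(static):
    """Concatenate static, delta, and delta-delta values for every frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Hypothetical static features: 12 MFCCs + 1 energy value for each of 5 frames.
static = [[float(t * (c + 1)) for c in range(13)] for t in range(5)]
features = add_dynamic_features(static)
```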
Language Model
• What is the probability of a sequence of words?
• Assume you have a vocabulary of V words.
• How many possible sequences of N words are there?
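One way to work out the question above: each of the N positions can hold any of the V words, and the probability of a full sequence factors by the chain rule. The numbers in the example are illustrative only.

```latex
\[
\#\{\text{sequences of } N \text{ words}\} = V^{N},
\qquad
P(w_1,\dots,w_N) = \prod_{i=1}^{N} P(w_i \mid w_1,\dots,w_{i-1})
\]
\[
\text{e.g. } V = 10^{4},\ N = 5:\quad V^{N} = 10^{20}
\]
```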
N-gram Language Modeling
• Simplify the calculation.
• Big simplifying assumption: Each word is only dependent on the previous N-1 words.
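A minimal sketch of this assumption for N = 2 (a bigram model), using maximum-likelihood counts with no smoothing. The toy corpus and the `<s>` / `</s>` boundary markers are assumptions made for the example.

```python
from collections import Counter

def train_bigram(sentences):
    """Count each context word and each (previous word, word) pair."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def sentence_probability(sentence, context_counts, bigram_counts):
    """Multiply P(word | previous word) over the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob

corpus = ["the cat sat", "the cat ran", "the dog sat"]
context_counts, bigram_counts = train_bigram(corpus)
p_seen = sentence_probability("the cat sat", context_counts, bigram_counts)
p_unseen = sentence_probability("the dog ran", context_counts, bigram_counts)
```

"the dog ran" gets probability zero because "dog ran" never occurs in the toy corpus, which is why practical models add smoothing.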
N-gram Language Modeling
• Same question. Assume a V word vocabulary, and an N word sequence. How many “counts” are necessary?
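A possible answer, reading N as the n-gram order from the previous slide (an assumption, since the slide also uses N for the sequence length): the model needs a count for every N-word tuple, however long the sequence being scored is. The numbers are illustrative.

```latex
\[
\text{counts for an } N\text{-gram model} \le V^{N},
\qquad
\text{e.g. } V = 20{,}000,\ N = 3:\quad V^{N} = 8 \times 10^{12}
\]
```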
General Language Modeling
• Any probability calculation can be used here.
• Class based language models.
• e.g. Recurrent neural networks
Pronunciation Modeling
• Identify the likelihood of a phone sequence given a word sequence.
• There are many simplifying assumptions in pronunciation modeling.
1. The pronunciation of each word is independent of the previous and following words.
Dictionary as Pronunciation Model
• Assume each word has a single pronunciation
I       AY
CAT     K AE T
THE     DH AH
HAD     H AE D
ABSURD  AH B S ER D
YOU     Y UH D
Weighted Dictionary as Pronunciation Model
• Allow multiple pronunciations and weight each by their likelihood
I    AY     .4
I    IH     .6
THE  DH AH  .7
THE  DH IY  .3
YOU  Y UH   .5
YOU  Y UW   .5
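A small sketch of a weighted dictionary as a lookup table. The weights are paired with the entries in the order they appear on the slide, which is an assumption about the original layout, and the function names are hypothetical.

```python
# Word -> list of (phone sequence, weight). Pairing of weights to entries is assumed.
PRONUNCIATIONS = {
    "I": [(("AY",), 0.4), (("IH",), 0.6)],
    "THE": [(("DH", "AH"), 0.7), (("DH", "IY"), 0.3)],
    "YOU": [(("Y", "UH"), 0.5), (("Y", "UW"), 0.5)],
}

def pronunciation_probability(word, phones):
    """P(pronunciation | word); zero if the pronunciation is not listed."""
    for candidate, weight in PRONUNCIATIONS.get(word, []):
        if candidate == tuple(phones):
            return weight
    return 0.0

def sequence_probability(words, phone_sequences):
    """Treat each word's pronunciation as independent of its neighbours."""
    prob = 1.0
    for word, phones in zip(words, phone_sequences):
        prob *= pronunciation_probability(word, phones)
    return prob

p = sequence_probability(["THE", "I"], [["DH", "IY"], ["AY"]])
```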
Grapheme to Phoneme conversion
• What about words that you have never seen before?
• What if you don’t think you’ve seen every possible pronunciation?
• How do you pronounce: “McKayla”? or “Zoomba”?
• Try to learn the phonetics of the language.
Letter to Sound Rules
• Manually written rules that are able to convert one or more letters to one or more sounds.
• T -> /t/
• H -> /h/
• TH -> /dh/
• E -> /e/
• These rules can get complicated based on the surrounding context.
– K is silent when word initial and followed by N.
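A sketch of how such hand-written rules might be applied, using only the four rules and the one context rule listed above. Trying longer letter sequences first and falling back to the lowercase letter for uncovered letters are assumptions made so the example runs on whole words.

```python
# Longer letter sequences are listed first so "TH" wins over "T" followed by "H".
RULES = [
    ("TH", ["dh"]),
    ("T", ["t"]),
    ("H", ["h"]),
    ("E", ["e"]),
]

def letters_to_sounds(word):
    """Convert a spelling to a list of sounds with the rules above."""
    word = word.upper()
    sounds = []
    i = 0
    while i < len(word):
        # Context rule: K is silent when word initial and followed by N.
        if i == 0 and word.startswith("KN"):
            i += 1
            continue
        for letters, phones in RULES:
            if word.startswith(letters, i):
                sounds.extend(phones)
                i += len(letters)
                break
        else:
            # No rule covers this letter: emit the letter itself as a placeholder.
            sounds.append(word[i].lower())
            i += 1
    return sounds

the_sounds = letters_to_sounds("THE")
knee_sounds = letters_to_sounds("KNEE")
```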
Automatic learning of Letter to Sound rules
• First: Generate an alignment of letters and sounds
Letters: T  E   X    T
Sounds:  T  EH  K S  T
Automatic learning of Letter to Sound rules
• Second: Try to learn the mapping automatically.
• Generate “Features” from the letter sequence
• Use these features to predict sounds
• Almost any machine learning technique can be used.
– We’ll use decision trees as an example.
Decision Trees example
• Context: L1, L2, p, R1, R2
[Figure: decision tree for the letter “p”]
R1 = “h”?
  Yes: L1 = “o”?
    Yes: P (loophole)
    No: F (physics, telephone, graph, photo)
  No: R1 = consonant?
    Yes: apple (P), psycho (ø), pterodactyl (ø), pneumonia (ø)
    No: P (peanut, pay)
Decision Trees example
• Context: L1, L2, p, R1, R2
[Figure: the same decision tree for the letter “p” as on the previous slide]
• Try “PARIS”
Decision Trees example
• Context: L1, L2, p, R1, R2
[Figure: the same decision tree for the letter “p” as on the previous slide]
• Now try “GOPHER”
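The tree on these slides can be sketched as nested conditions. The structure is read off the slide (questions on R1 = “h”, L1 = “o”, R1 = consonant); predicting silence at the mixed “R1 = consonant” leaf follows its majority label, and treating only a, e, i, o, u as vowels is an assumption.

```python
VOWELS = set("aeiou")

def predict_p(word, index):
    """Predict the sound of the letter 'p' at word[index]: 'P', 'F', or None (silent)."""
    left = word[index - 1] if index > 0 else ""
    right = word[index + 1] if index + 1 < len(word) else ""
    if right == "h":
        # L1 = "o" leaf holds only "loophole"; the other leaf holds the "ph" = F words.
        return "P" if left == "o" else "F"
    if right and right not in VOWELS:
        # Majority label at this leaf (psycho, pterodactyl, pneumonia); "apple" is the exception.
        return None
    return "P"

paris = predict_p("paris", 0)
gopher = predict_p("gopher", 2)
```

“PARIS” comes out as P, as expected. “GOPHER” also comes out as P, because it falls into the same leaf as “loophole”, which shows how a tree learned from few examples can generalize badly.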
Speech Recognition
• Front End: Convert sounds into a sequence of observation vectors
• Acoustic Model: The probability of a set of observations given a phone label
• Pronunciation Model: The probability of a pronunciation given a word
Acoustic Modeling
• Hidden Markov model.
– Used to model the relationship between two sequences.
Hidden Markov model
[Figure: hidden states q1, q2, q3 emitting observations x1, x2, x3]
• In a Hidden Markov Model the state sequence is unobserved.
• Only an observation sequence is available
Hidden Markov model
[Figure: hidden states q1, q2, q3 emitting observations x1, x2, x3]
• Observations are MFCC vectors
• States are phone labels
• Each state (phone) has an associated GMM modeling the MFCC likelihood
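A toy sketch of recovering the hidden phone sequence from observations with Viterbi decoding. To stay short it uses one-dimensional observations and a single Gaussian per state in place of MFCC vectors and GMMs; the phone set, means, and transition probabilities are invented for the example.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(observations, states, log_start, log_trans, emission):
    """Return the most likely state sequence for the observations."""
    best = [{s: log_start[s] + emission(s, observations[0]) for s in states}]
    back = []
    for x in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + emission(s, x)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical model: three phones with well separated 1-D emission means.
states = ["K", "AE", "T"]
means = {"K": 0.0, "AE": 5.0, "T": 10.0}
log_start = {"K": math.log(0.8), "AE": math.log(0.1), "T": math.log(0.1)}
stay, move = math.log(0.6), math.log(0.2)
log_trans = {r: {s: (stay if r == s else move) for s in states} for r in states}

def emission(state, x):
    return log_gaussian(x, means[state], 1.0)

observations = [0.2, -0.1, 4.8, 5.3, 9.7, 10.1]
path = viterbi(observations, states, log_start, log_trans, emission)
```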
Training acoustic models
• TIMIT
– close, manual phonetic transcription
– 2342 sentences
• Extract MFCC vectors from each frame within each phone
• For each phone, train a GMM using Expectation Maximization.
• These GMMs are the Acoustic Model.
– Common to use 8 or 16 Gaussian Mixture Components.
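A minimal sketch of Expectation Maximization for a Gaussian mixture. It is a simplification under stated assumptions: one-dimensional data instead of MFCC vectors, 2 components instead of 8 or 16, made-up data and starting values, and a small variance floor added for numerical safety.

```python
import math

def gaussian(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_step(data, weights, means, variances):
    """One EM iteration for a 1-D Gaussian mixture."""
    k = len(weights)
    # E-step: responsibility of each component for each data point.
    resp = []
    for x in data:
        joint = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M-step: re-estimate weights, means, and variances from the soft counts.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_v.append(max(var, 1e-3))  # variance floor (an added safeguard)
    return new_w, new_m, new_v

def log_likelihood(data, weights, means, variances):
    """Total log likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )

# Hypothetical 1-D "feature" values drawn from two clusters.
data = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.9, 3.1, 2.8]
weights, means, variances = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
history = [log_likelihood(data, weights, means, variances)]
for _ in range(20):
    weights, means, variances = em_step(data, weights, means, variances)
    history.append(log_likelihood(data, weights, means, variances))
```

The log likelihood recorded in `history` should not decrease from one iteration to the next, which is a quick sanity check on an EM implementation.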
Gaussian Mixture Model
HMM Topology for Training
• Rather than having one GMM per phone, it is common for acoustic models to represent each phone as a sequence of 3 states.
[Figure: HMM topology for /r/ with states S1 S2 S3 S4 S5]
Speech in Natural Language Processing
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
Speech in Natural Language Processing
Also, from the North Station... (I think the Orange Line runs by there too so you can also catch the Orange Line...) And then instead of transferring (um I- you know, the map is really obvious about this but) Instead of transferring at Park Street, you can transfer at (uh what’s the station name) Downtown Crossing and (um) that’ll get you back to the Red Line just as easily.
Spoken Language Processing
[Diagram: Speech Recognition → NLP system → IR, IE, QA, Summarization, Topic Modeling]
Spoken Language Processing
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
[Diagram: the transcript above → NLP system → IR, IE, QA, Summarization, Topic Modeling]
Dealing with Speech Errors
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
[Diagram: the transcript above → Robust NLP system → IR, IE, QA, Summarization, Topic Modeling]
Automatic Speech Recognition Assumption
• ASR produces a “transcript” of Speech.
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
Automatic Speech Recognition Assumption
• ASR produces a “transcript” of Speech.
• “Rich Transcription”:
Also, from the North Station... (I think the Orange Line runs by there too so you can also catch the Orange Line...) And then instead of transferring (um I- you know, the map is really obvious about this but) Instead of transferring at Park Street, you can transfer at (uh what’s the station name) Downtown Crossing and (um) that’ll get you back to the Red Line just as easily.
Speech as Noisy Text
[Diagram: Speech Recognition (Decrease WER) → Robust NLP system (Increase Robustness) → IR, IE, QA, Summarization, Topic Modeling]
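WER (word error rate) is the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch follows; the example sentences are made up, and the function assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "you can transfer at downtown crossing"
hypothesis = "you can transfer at down town crossing"
wer = word_error_rate(reference, hypothesis)
```

Here one substitution and one insertion against six reference words give a WER of 2/6.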
Other directions for improvement.
[Diagram: Prosodic Analysis and Speech Recognition → Robust NLP system → IR, IE, QA, Summarization, Topic Modeling]
• Use Lattices or N-Best lists
Prosody
• Variation in production properties that leads to changes in intended interpretation.
• Pitch
• Intensity
• Duration, Rhythm, Speaking Rate
• Spectral Emphasis
• Pausing
Tasks that can use prosody
• Part of Speech Tagging [Eidelman et al. 2010]
• Parsing [Huang, et al. 2010]
• Language Modeling [Su & Jelinek, 2008]
• Pronunciation Modeling [Rosenberg 2012]
• Acoustic Modeling [Chen, et al. 2006]
• Emotion Recognition [Lee, et al. 2009]
• Topic Segmentation [Rosenberg & Hirschberg, 2006, Rosenberg, et al. 2007]
• Speaker Identification/Verification [Leung, et al. 2008]
Symbolic vs. Direct Modeling
• Symbolic Modeling: Acoustic Features → Symbolic Prosodic Analysis → Task-Specific Classifier
– Modular
– Linguistically Meaningful
– Perceptually Salient
– Dimensionality Reduction
• Direct Modeling: Acoustic Features → Task-Specific Classifier
– Appropriate to the Task
– Lower information loss
– General
Interspeech 2011 Tutorial M1 - More Than Words Can Say
ToBI (Tones and Break Indices)
• Based on Pierrehumbert’s “intonational phonology”
Silverman et al. 1992
• Prosody is described by high (H) and low (L) tones that are associated with prosodic events (pitch accents, phrase accents, and boundary tones) and break indices which describe the degree of disjuncture between words.
– ToBI is inherently categorical in its description of prosody
• ToBI variants exist for at least American English, German, Japanese, Korean, Portuguese, Greek, Catalan
ToBI Accenting
• Words are labeled as containing a pitch accent or not.
• There are five possible pitch accent types (in SAE): H*, L*, L*+H, L+H*, H+!H*
• High tones can be produced in a compressed pitch range – catathesis, or “downstepping”.
ToBI Phrasing
• ToBI describes phrasing as a hierarchy of two levels.
– Intermediate phrases contain one or more words.
– Intonational phrases contain one or more intermediate phrases.
• Word boundaries are marked with a degree of disjuncture, or break index
– Break indices range from 0-4
– 3: intermediate phrase boundary
– 4: intonational phrase boundary.
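A short sketch of how break indices induce the two-level phrasing, reading break index 3 as an intermediate phrase boundary and 4 as an intonational phrase boundary. The break index labels below are hypothetical, not taken from an annotated corpus.

```python
def split_phrases(words, break_indices, threshold):
    """Close a phrase after every word whose break index is >= threshold."""
    phrases, current = [], []
    for word, index in zip(words, break_indices):
        current.append(word)
        if index >= threshold:
            phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    return phrases

words = ["instead", "of", "transferring", "at", "park", "street",
         "you", "can", "transfer", "at", "downtown", "crossing"]
# Hypothetical break index after each word.
breaks = [1, 1, 3, 1, 1, 4, 1, 1, 1, 1, 1, 4]

intermediate = split_phrases(words, breaks, 3)
intonational = split_phrases(words, breaks, 4)
```

Every intonational phrase boundary is also an intermediate phrase boundary, so the first segmentation always refines the second.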
ToBI Phrase Ending Types
• Intermediate Phrase boundaries have associated Phrase Accents describing the pitch movement from the last accent to the phrase boundary
– Phrase Accents: H-, !H- or L-
• Intonational phrase boundaries have Boundary Tones describing the pitch movement immediately before the boundary
– Boundary Tones: H% or L%
• Phrase ending types: L-L%, L-H%, H-H%, H-L%, !H-L%
ToBI Example (in Praat)
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label
– syllables or words
• Extract aggregate acoustic features based on the labeling region
• Train a supervised classifier
• Evaluate using cross-validation or a held-out test set.
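The steps above can be sketched end to end on toy data. Everything specific here is an assumption for illustration: the pitch tracks and accent labels are invented, the aggregate features are simple pitch statistics, a nearest-centroid rule stands in for the supervised classifier, and leave-one-out is used as the cross-validation scheme.

```python
def aggregate_features(pitch_track):
    """Aggregate frame-level pitch values (Hz) over one labeled region, e.g. a word."""
    mean = sum(pitch_track) / len(pitch_track)
    high, low = max(pitch_track), min(pitch_track)
    return [mean, high, low, high - low]

def train_centroids(examples):
    """Supervised training: the average feature vector for each label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(features, centroids):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=distance)

def leave_one_out_accuracy(examples):
    """Cross-validation with one held-out example per fold."""
    correct = 0
    for i, (features, label) in enumerate(examples):
        centroids = train_centroids(examples[:i] + examples[i + 1:])
        correct += classify(features, centroids) == label
    return correct / len(examples)

# Hypothetical word-level pitch tracks with pitch accent labels.
labeled_words = [
    ([180, 220, 260, 230], "accented"),
    ([190, 240, 270, 220], "accented"),
    ([170, 230, 250, 210], "accented"),
    ([150, 152, 149, 151], "unaccented"),
    ([160, 158, 161, 159], "unaccented"),
    ([145, 147, 146, 144], "unaccented"),
]
examples = [(aggregate_features(track), label) for track, label in labeled_words]
accuracy = leave_one_out_accuracy(examples)
```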
The Standard Corpus-Based Approach
• Identify labeled training data
– Can we use unlabeled data?
• Decide what to label
– syllables or words
• Extract aggregate acoustic features based on the labeling region
• Train a supervised model
• Evaluate using cross-validation or a held-out test set.
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label
– syllables or words
– Are these the only options? [Context and Region of analysis]
• Extract aggregate acoustic features based on the labeling region
• Train a supervised model
• Evaluate using cross-validation or a held-out test set.
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label
– syllables or words
• Extract aggregate acoustic features based on the labeling region
– There are always new features to explore [Shape Modeling]
• Train a supervised model
• Evaluate using cross-validation or a held-out test set.
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label
– syllables or words
• Extract aggregate acoustic features based on the labeling region
• Train a supervised model
– Unsupervised and Semi-supervised approaches
– Structured ensembles of classifiers
• Evaluate using cross-validation or a held-out test set.
The Standard Corpus-Based Approach
• Identify labeled training data
• Decide what to label
– syllables or words
• Extract aggregate acoustic features based on the labeling region
• Train a supervised model
• Evaluate using cross-validation or a held-out test set.
– Is this a reasonable approximation of generalization performance?
Processing Speech
• Processing speech is difficult
– There are errors in transcripts.
– It is not grammatical
– The style (genre) of speech is different from the available (text) training data.
• Processing speech is easy
– Speaker information
– Intention (sarcasm, certainty, emotion, etc.)
– Segmentation
Questions & Comments
• What topic was clearest?
– murkiest?
• What was the most interesting?
– least interesting?
• [email protected]
• http://speech.cs.qc.cuny.edu
• http://eniac.cs.qc.cuny.edu/andrew