Spoken Language Understanding for Conversational Dialog


Spoken Language Understanding
for Conversational Dialog Systems
Michael McTear
University of Ulster
IEEE/ACL 2006 Workshop on Spoken Language Technology
Aruba, December 10-13, 2006
Overview
Introductory definitions
  - Task-based and conversational dialog systems
  - Spoken language understanding
Issues for spoken language understanding
  - Coverage
  - Robustness
Overview of spoken language understanding
  - Hand-crafted approaches
  - Data-driven methods
Conclusions
Basic dialog system architecture
[Figure: pipeline diagram. Audio -> Speech Recognition (HMM acoustic model, n-gram language model) -> words -> Spoken Language Understanding -> semantic representation -> Dialogue Manager (connected to back end) -> concepts -> Language Generation -> words -> Text to Speech Synthesis -> audio]
Task-based Dialog Systems
- Mainly interact with databases to get information or support transactions
- SLU module creates a database query from user’s spoken input by extracting relevant concepts
- System initiative: constrains user input
  - Keyword / keyphrase extraction
- User initiative: less constrained input
  - Call-routing: call classification with named entity extraction
  - Question answering
Conversational Dialog
- AI (agent-based systems) e.g. TRIPS
  - User can take initiative, e.g. raise new topic, ask for clarification (TRIPS)
  - More complex interactions involving recognition of the user’s intentions, goals, beliefs or plans
  - Deep understanding of the user’s utterance, taking into account contextual information
  - Information State Theory, Planning Theory, User Modelling, Belief Modelling…
- Simulated conversation e.g. CONVERSE
  - Conversational companions, chatbots, help desk
  - Does not require deep understanding
  - SLU involves identifying system utterance type and determining a suitable response
Defining Spoken Language Understanding
- extracting the meaning from speech utterances
- a transduction of the recognition result to an interpretable representation
- Meaning (in human–computer interactive systems): a representation that can be executed by an interpreter in order to change the state of the system
(Bangalore et al. 2006)
SLU for task-based systems
a flight from Belfast to Malaga
uh I’d like uh um could you uh is there a flight from Bel- uh Belfast to um Gran- I mean Malaga
Topic: Flight
Origin: BFS
Destination: AGP
I would like to find a flight from Pittsburgh to Boston on Wednesday and I have to be in Boston by one so I would like a flight out of here no later than 11 a.m.
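The Belfast to Malaga example can be approximated with simple concept spotting. The sketch below is only an illustration: the filler list, the "I mean" repair rule, the airport lookup table and the function names are all assumptions, not part of any system described in the talk.

```python
import re

# Hypothetical lookup tables, invented for this illustration.
AIRPORT_CODES = {"belfast": "BFS", "malaga": "AGP"}
FILLERS = {"uh", "um"}
EDIT_PHRASES = ("i mean",)


def normalise(utterance):
    """Drop filled pauses, cut-off words such as 'Bel-' and editing phrases."""
    tokens = [t for t in utterance.lower().split()
              if t not in FILLERS and not t.endswith("-")]
    text = " ".join(tokens)
    for phrase in EDIT_PHRASES:
        text = text.replace(phrase + " ", "")
    return text


def extract_frame(utterance):
    """Spot the topic, origin and destination concepts."""
    text = normalise(utterance)
    frame = {}
    if re.search(r"\bflights?\b", text):
        frame["Topic"] = "Flight"
    match = re.search(r"\bfrom (\w+) to (\w+)", text)
    if match:
        origin, destination = match.groups()
        if origin in AIRPORT_CODES:
            frame["Origin"] = AIRPORT_CODES[origin]
        if destination in AIRPORT_CODES:
            frame["Destination"] = AIRPORT_CODES[destination]
    return frame


noisy = ("uh I'd like uh um could you uh is there a flight "
         "from Bel- uh Belfast to um Gran- I mean Malaga")
print(extract_frame(noisy))
print(extract_frame("a flight from Belfast to Malaga"))
```

Both the clean and the disfluent request yield the same frame. The longer Pittsburgh to Boston request would need more than this pattern, since its time constraints are spread over several clauses.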
SLU for advanced conversational systems (TRIPS)
Interpretation requires intention recognition
can we use a helicopter to get the people from:
  Abyss (request to modify plan)
  Barnacle (include sub-goal and suggest solution)
  Delta (extension of a solution)
- Six possible interpretations with only change of city name
- Requires reasoning about task and current context to identify most plausible interpretation
- Requires more than concept spotting to identify structure and meaning of utterance as basis for reasoning
Role of syntax in deep understanding
List all employees of the companies who/which are based in the city centre
- who: employees based in city centre
- which: companies based in city centre
I would like to know where to mail the check.
- I mail check to ?where
I would like you to mail the check to me
- You mail check to me
Stop word removal will transform both strings to “mail check”
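The loss of structure under stop word removal can be checked directly. This is a minimal sketch: the stop list is hypothetical and was chosen to cover the words in these two sentences, so a different list would give different residues.

```python
# Hypothetical stop list, chosen for this illustration only.
STOP_WORDS = {"i", "would", "like", "to", "know", "where", "the", "you", "me"}


def remove_stop_words(utterance):
    """Lower-case, strip the final full stop and drop stop words."""
    tokens = utterance.lower().rstrip(".").split()
    return [t for t in tokens if t not in STOP_WORDS]


question = "I would like to know where to mail the check."
request = "I would like you to mail the check to me"

print(remove_stop_words(question))
print(remove_stop_words(request))
```

Both calls leave only "mail" and "check", so the information about who mails the check, and to whom, is no longer recoverable.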
SLU for simulated dialog
C: Are you still a friend of XXX?
H: I am not sure if he has any real
friends. He has achieved a lot and has
left a lot of people behind.
C: You really don’t like the guy - well, at least
that’s definite - I thought most people were
just indifferent or numb by now.
?
Coverage
Possible requirement:
“The system should be able to understand everything the user might say”
Predicting user input
- Analysis of corpora and iterative design of hand-crafted grammars
- Use of carefully designed prompts to constrain user input
- Learning grammar from data
Robustness
- Characteristics of spontaneous spoken language
  - Disfluencies and filled pauses – not just errors, reflect cognitive aspects of speech production and interaction management
- Output from speech recognition component
  - Words and word boundaries not known with certainty
  - Recognition errors
- Approaches
  - Use of semantic grammars and robust parsing for concept spotting
  - Data-driven approaches – learn mappings between input strings and output structures
Developing the SLU component
- Hand-crafted approaches
  - Grammar development
  - Parsing
- Data-driven approaches
  - Learning from data
  - Statistical models rather than grammars
  - Efficient decoding
Hand-crafting grammars
- Traditional software engineering approach of design and iterative refinement
- Decisions about type of grammar required
  - Chomsky hierarchy
  - Flat v hierarchical representations
- Processing issues (parsing)
  - Dealing with ambiguity
  - Efficiency
[Figure: processing pipeline. ASR -> n-best list, word lattice, … -> Parsing -> parse tree -> Frame Generation -> semantic frame -> Discourse Processing -> frame in context -> DB Query -> SQL query]
Semantic Grammar and Robust Parsing: PHOENIX (CMU/CU)
The Phoenix parser maps input word strings on to a sequence of semantic frames.
- A frame is a named set of slots, where the slots represent related pieces of information.
- Each slot has an associated Context-Free Grammar that specifies word string patterns that match the slot
- Chart parsing with path pruning: e.g. path that accounts for fewer words is pruned
[Figure: ASR -> word string -> Semantic Parser -> meaning representation]
Deriving Meaning directly from ASR output: VoiceXML
Uses finite state grammars as language models for recognition and semantic tags in the grammars for semantic parsing
I would like a coca cola and three large pizzas with pepperoni and mushrooms
{
  drink: "coke",
  pizza: {
    number: "3",
    size: "large",
    topping: [
      "pepperoni",
      "mushrooms"
    ]
  }
}
[Figure: ASR -> meaning representation]
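The idea of attaching semantic tags to grammar alternatives can be imitated in a few lines. The sketch below is a Python stand-in, not VoiceXML grammar syntax: the vocabulary tables, the rule pattern and the function name are assumptions made for illustration.

```python
import re

# Hypothetical mini-grammar: each surface phrase carries a semantic tag.
DRINKS = {"coca cola": "coke", "coke": "coke", "lemonade": "lemonade"}
NUMBERS = {"a": "1", "one": "1", "two": "2", "three": "3"}
SIZES = ("small", "medium", "large")
TOPPINGS = ("pepperoni", "mushrooms", "olives")

# <number> <size> pizza(s) [with <toppings>]
PIZZA_RULE = re.compile(
    r"\b(?P<number>{}) (?P<size>{}) pizzas?\b(?: with (?P<toppings>.+))?".format(
        "|".join(NUMBERS), "|".join(SIZES)
    )
)


def interpret(utterance):
    """Return the tagged values for the parts of the order that match."""
    text = utterance.lower()
    result = {}
    for phrase, tag in DRINKS.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", text):
            result["drink"] = tag
            break
    match = PIZZA_RULE.search(text)
    if match:
        words = re.findall(r"\w+", match.group("toppings") or "")
        result["pizza"] = {
            "number": NUMBERS[match.group("number")],
            "size": match.group("size"),
            "topping": [w for w in words if w in TOPPINGS],
        }
    return result


order = ("I would like a coca cola and three large pizzas "
         "with pepperoni and mushrooms")
print(interpret(order))
```

For the order on the slide this produces the same drink and pizza structure as shown above. Because the rules only accept the listed phrases, anything outside the grammar is simply ignored rather than interpreted.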
Deep understanding
- Requirements for deep understanding
  - advanced grammatical formalisms
  - syntax-semantics issues
  - parsing technologies
- Example: TRIPS
  - Uses feature-based augmented CFG with agenda-driven best-first chart parser
  - Combined strategy: combining shallow and deep parsing (Swift et al.)
Combined strategies: TINA (MIT)
Grammar rules include mix of syntactic and semantic categories
- Context free grammar using probabilities trained from user utterances to estimate likelihood of a parse
- Parse tree converted to a semantic frame that encapsulates the meaning
Robust parsing strategy
- Sentences that fail to parse are parsed using fragments that are combined into a full semantic frame
- When all else fails, word spotting is used
Problems with hand-crafted approaches
Hand-crafted grammars are
- not robust to spoken language input
- require linguistic and engineering expertise to develop if grammar is to have good coverage and optimised performance
- time consuming to develop
- error prone
- subject to designer bias
- difficult to maintain
Statistical modelling for SLU
SLU as pattern matching problem
Given word sequence W, find semantic representation of meaning M that has maximum a posteriori probability P(M|W)
$$\hat{M} = \arg\max_{M} P(M \mid W) = \arg\max_{M} P(W \mid M)\, P(M)$$
P(M): semantic prior model – assigns probability to underlying semantic structure
P(W|M): lexicalisation model – assigns probability to word sequence W given the semantic structure
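The second form of the equation follows from Bayes' rule, since P(W) does not depend on M and can be dropped from the maximisation. A toy decoder makes the two factors concrete. Everything in the sketch is an assumption for illustration: the two candidate meanings, the probabilities, the unigram form of the lexicalisation model and the floor value for unseen words.

```python
import math

# Hypothetical toy parameters, invented for illustration.
PRIOR = {"FLIGHT": 0.6, "FARE": 0.4}  # semantic prior P(M)
LEXICAL = {  # lexicalisation model P(w | M), simplified to unigrams
    "FLIGHT": {"show": 0.2, "flights": 0.5, "to": 0.2, "boston": 0.1},
    "FARE": {"show": 0.2, "fares": 0.5, "to": 0.2, "boston": 0.1},
}
UNSEEN = 1e-4  # floor probability for words the model has not seen


def log_score(meaning, words):
    """log P(W | M) + log P(M) under the unigram simplification."""
    score = math.log(PRIOR[meaning])
    for word in words:
        score += math.log(LEXICAL[meaning].get(word, UNSEEN))
    return score


def decode(words):
    """Return the meaning with the highest posterior score."""
    return max(PRIOR, key=lambda meaning: log_score(meaning, words))


print(decode("show flights to boston".split()))
print(decode("show fares to boston".split()))
```

Working in log space avoids numerical underflow for longer utterances. Real systems replace the enumeration over two meanings with an efficient search over structured semantic representations.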
Early Examples
- CHRONUS (AT&T: Pieraccini et al, 1992; Levin & Pieraccini, 1995)
  - Finite state semantic tagger
  - ‘Flat-concept’ model: simple to train but does not represent hierarchical structure
- HUM (Hidden Understanding Model) (BBN: Miller et al, 1995)
  - Probabilistic CFG using tree structured meaning representations
  - Grammatical constraints represented in networks rather than rules
  - Ordering of constituents unconstrained - increases robustness
  - Transition probabilities constrain over-generation
  - Requires fully annotated treebank data for training
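A flat-concept tagger can be pictured as an HMM whose hidden states are concept labels, decoded with the Viterbi algorithm. The sketch below is a generic illustration of that idea, not a reconstruction of CHRONUS: the concept set, all probabilities and the unigram emission model are assumptions.

```python
import math

CONCEPTS = ["DUMMY", "FROMLOC", "TOLOC"]

# Hypothetical toy parameters, invented for illustration.
START = {"DUMMY": 0.8, "FROMLOC": 0.1, "TOLOC": 0.1}
TRANS = {
    "DUMMY": {"DUMMY": 0.6, "FROMLOC": 0.2, "TOLOC": 0.2},
    "FROMLOC": {"DUMMY": 0.2, "FROMLOC": 0.6, "TOLOC": 0.2},
    "TOLOC": {"DUMMY": 0.2, "FROMLOC": 0.2, "TOLOC": 0.6},
}
EMIT = {
    "DUMMY": {"show": 0.5, "flights": 0.5},
    "FROMLOC": {"from": 0.5, "denver": 0.25, "burbank": 0.25},
    "TOLOC": {"to": 0.5, "denver": 0.25, "burbank": 0.25},
}
FLOOR = 1e-6  # probability assigned to unseen (concept, word) pairs


def emit(concept, word):
    return math.log(EMIT[concept].get(word, FLOOR))


def viterbi(words):
    """Return the most probable concept label for each word."""
    # best[c] = (log-probability of best path ending in c, that path)
    best = {c: (math.log(START[c]) + emit(c, words[0]), [c]) for c in CONCEPTS}
    for word in words[1:]:
        step = {}
        for c in CONCEPTS:
            prev = max(CONCEPTS, key=lambda p: best[p][0] + math.log(TRANS[p][c]))
            score = best[prev][0] + math.log(TRANS[prev][c]) + emit(c, word)
            step[c] = (score, best[prev][1] + [c])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


print(viterbi("show flights to burbank".split()))
print(viterbi("show flights from denver".split()))
```

The city names are equally likely under FROMLOC and TOLOC, so it is the transition from the preceding "to" or "from" that decides their label. The output is a flat label sequence, which is why such a model cannot express nested structure.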
Using Hidden State Vectors (He & Young)
- Extends ‘flat-concept’ HMM model
- Represents hierarchical structure (right-branching) using hidden state vectors
  - Each state expanded to encode stack of a push down automaton
  - Avoids computational tractability issues associated with hierarchical HMMs
  - Can be trained using lightly annotated data
- Comparison with FST model and with hand-crafted SLU systems using ATIS test sets and reference parse results
Which flights arrive in Burbank from Denver on Saturday?
Problem with long-distance dependency between ‘Saturday’ and ‘arrive’
- In the flat model, ‘Saturday’ is associated with ‘FROMLOC’
- Hierarchical model allows ‘Saturday’ to be associated with ‘ARRIVE’
Also: more expressive, allows sharing of sub-structures
SLU Evaluation: Performance
- Statistical models competitive with approaches based on hand-crafted rules
- Hand-crafted grammars better for full understanding and for users familiar with system’s coverage, statistical model better for shallow and more robust understanding for naïve users
- Statistical systems more robust to noise and more portable
SLU Evaluation: Software Development
“Cost of producing training data should be less than cost of hand-crafting a semantic grammar” (Young, 2002)
- Issues
  - Availability of training data
  - Maintainability
  - Portability
  - Objective metrics? e.g. time, resources, lines of code, …
  - Subjective issues e.g. designer bias, designer control over system
- Few concrete results, except …
  - HVS model (He & Young) can be robustly trained from only minimally annotated corpus data
  - Model is robust to noise and portable to other domains
Additional technologies
- Named entity extraction
  - Rule-based methods: e.g. using grammars in form of regular expressions compiled into finite state acceptors (AT&T SLU system) – higher precision
  - Statistical methods e.g. HMIHY, learn mappings between strings and NEs – higher recall as more robust
- Call routing
- Question Answering
Additional Issues 1
- ASR/SLU coupling
- Post-processing results from ASR
  - noisy channel model of ASR errors (Ringger & Allen)
- Combining shallow and deep parsing
  - major gains in speed, slight gains in accuracy (Swift et al.)
- Use of context, discourse history, prosodic information
  - re-ordering n-best hypotheses
  - determining dialog act based on combinations of features at various levels: ASR and parse probabilities, semantic and contextual features (Purver et al, Lemon)
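Re-ordering an n-best list with features from several levels can be sketched as a weighted sum of scores. This is only one possible scheme and is not taken from the cited work: the feature set, the weights, the example hypotheses and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    words: str
    asr_logprob: float    # log-probability from the recogniser
    parse_logprob: float  # log-probability of the best parse
    fits_context: bool    # e.g. consistent with the discourse history


# Hypothetical feature weights; in practice these would be tuned on data.
WEIGHTS = {"asr": 1.0, "parse": 0.5, "context": 2.0}


def score(hyp):
    """Weighted combination of ASR, parse and contextual evidence."""
    return (WEIGHTS["asr"] * hyp.asr_logprob
            + WEIGHTS["parse"] * hyp.parse_logprob
            + WEIGHTS["context"] * (1.0 if hyp.fits_context else 0.0))


def rerank(nbest):
    """Return the hypotheses ordered from best to worst combined score."""
    return sorted(nbest, key=score, reverse=True)


nbest = [
    Hypothesis("a flight to Boston", -2.0, -3.0, False),
    Hypothesis("a flight to Austin", -2.5, -3.0, True),
]
for hyp in rerank(nbest):
    print(hyp.words, score(hyp))
```

Here the recogniser prefers the first hypothesis, but the contextual feature moves the second one to the top of the list.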
Additional Issues 2
- Methods for learning from sparse data or without annotation
  - e.g. AT&T system uses ‘active learning’ (Tur et al, 2005) to reduce effort of human data labelling – uses only those data items that improve classifier performance the most
- Development tools e.g. SGStudio (Wang & Acero) – build semantic grammar with little linguistic knowledge
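The active-learning idea above can be illustrated with a least-confidence selection rule, which sends to human labellers the utterances the current classifier is least sure about. Treating low confidence as a proxy for "improves the classifier the most" is an assumption of this sketch, as are the call types, probabilities and function names.

```python
def least_confident(pool, predict_proba, budget):
    """Pick the `budget` unlabelled items whose top class probability is lowest."""
    def confidence(item):
        return max(predict_proba(item).values())
    return sorted(pool, key=confidence)[:budget]


# Hypothetical classifier output for three unlabelled utterances.
TOY_PROBS = {
    "i want to pay my bill": {"Billing": 0.95, "Sales": 0.05},
    "it is about my account thing": {"Billing": 0.55, "Sales": 0.45},
    "tell me about new phones": {"Billing": 0.10, "Sales": 0.90},
}

selected = least_confident(list(TOY_PROBS), TOY_PROBS.get, budget=1)
print(selected)
```

Only the ambiguous second utterance is selected for labelling. The two confidently classified ones are left out, which is where the saving in annotation effort comes from.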
Additional Issues 3
Some issues addressed in poster session
Using SLU for:
- Dialog act tagging
- Prosody labelling
- User satisfaction analysis
- Topic segmentation and labelling
- Emotion prediction
Conclusions 1
SLU approach is determined by type of application
- finite state dialog with single word recognition
- frame based dialog with topic classification and named entity extraction
- advanced dialog requiring deep understanding
- simulated conversation, …
Conclusions 2
SLU approach is determined by type of output required
- syntactic / semantic parse trees
- semantic frames
- speech / dialog acts, …
- intentions, beliefs, emotions, …
Conclusions 3
SLU approach is determined by deployment and usability issues
- applications requiring accurate extraction of information
- applications involving complex processing of content
- applications involving shallow processing of content (e.g. conversational companions, interactive games)
Selected References
Bangalore, S., Hakkani-Tür, D., Tur, G. (eds), (2006)
Special Issue on Spoken Language Understanding in
Conversational Systems. Speech Communication 48.
Gupta, N., Tur, G., Hakkani-Tür, D., Bangalore, S., Riccardi,
G., Gilbert, M. (2006) The AT&T Spoken Language
Understanding System. IEEE Transactions on Speech
and Audio Processing 14:1, 213-222.
Allen, JF, Byron, DK, Dzikovska, O, Ferguson, G, Galescu, L, Stent, A. (2001) Towards conversational human-computer interaction. AI Magazine, 22(4):27–35.
Jurafsky, D. & Martin, J. (2000) Speech and Language
Processing, Prentice-Hall
Huang, X, Acero, A, Hon, H-W. (2001) Spoken Language
Processing: A Guide to Theory, Algorithm and System
Development. Prentice-Hall