Spoken Language Understanding for Conversational Dialog
Spoken Language Understanding
for Conversational Dialog Systems
Michael McTear
University of Ulster
IEEE/ACL 2006 Workshop on Spoken Language Technology
Aruba, December 10-13, 2006
Overview
Introductory definitions
Task-based and conversational dialog
systems
Spoken language understanding
Issues for spoken language understanding
Coverage
Robustness
Overview of spoken language understanding
Hand-crafted approaches
Data-driven methods
Conclusions
Basic dialog system architecture
Audio → Speech Recognition (HMM Acoustic Model, N-Gram Language Model) → Words → Spoken Language Understanding → Semantic representation → Dialogue Manager
Dialogue Manager ↔ Back end (Concepts)
Dialogue Manager → Language Generation → Words → Text to Speech Synthesis → Audio
Task-based Dialog Systems
Mainly interact with databases to get
information or support transactions
SLU module creates a database query from
user’s spoken input by extracting relevant
concepts
System-initiative: constrains user input
Keyword / keyphrase extraction
User-initiative: less constrained input
Call-routing: call classification with named entity
extraction
Question answering
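The concept-extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production SLU module: the slot patterns, the `flights` table, and the column names are all invented for the example.

```python
import re

# Hypothetical keyword/keyphrase-spotting rules: map an utterance to slot
# values, then assemble a database query from the filled slots.
SLOT_PATTERNS = {
    "origin": re.compile(r"\bfrom\s+([A-Z][a-z]+)"),
    "destination": re.compile(r"\bto\s+([A-Z][a-z]+)"),
    "day": re.compile(r"\bon\s+(Monday|Tuesday|Wednesday|Thursday|"
                      r"Friday|Saturday|Sunday)"),
}

def extract_slots(utterance):
    """Spot relevant concepts in the user's input, ignoring everything else."""
    slots = {}
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[name] = match.group(1)
    return slots

def to_query(slots):
    """Turn the filled slots into a database query."""
    where = " AND ".join(f"{k} = '{v}'" for k, v in sorted(slots.items()))
    return f"SELECT * FROM flights WHERE {where}"

slots = extract_slots("I need a flight from Belfast to Malaga on Wednesday")
print(to_query(slots))
```

Because only the spotted concepts matter, the same query is produced whether the input is the terse or the disfluent version of the request, which is exactly why this style of SLU tolerates unconstrained task-based input.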
Conversational Dialog
AI (agent-based systems) e.g. TRIPS
User can take initiative, e.g. raise new topic, ask for
clarification (TRIPS)
More complex interactions involving recognition of the
user’s intentions, goals, beliefs or plans
Deep understanding of the user’s utterance, taking into
account contextual information
Information State Theory, Planning Theory, User
Modelling, Belief Modelling…
Simulated conversation e.g. CONVERSE
Conversational companions, chatbots, help desk
Does not require deep understanding
SLU involves identifying system utterance type and
determining a suitable response
Defining Spoken Language
Understanding
extracting the meaning from speech
utterances
a transduction of the recognition result
to an interpretable representation
Meaning (in human–computer interactive
systems):
a representation that can be executed by
an interpreter in order to change the
state of the system
Bangalore et al. 2006
SLU for task based systems
a flight from Belfast to Malaga
uh I’d like uh um could you uh is there a flight
from Bel- uh Belfast to um Gran- I mean
Malaga
Topic: Flight
Origin: BFS
Destination: AGP
I would like to find a flight from Pittsburgh to
Boston on Wednesday and I have to be in
Boston by one so I would like a flight out of
here no later than 11 a.m.
SLU for advanced conversational
systems (TRIPS)
Interpretation requires intention recognition
“can we use a helicopter to get the people from …”, completed with:
Abyss (request to modify plan)
Barnacle (include sub-goal and suggest solution)
Delta (extension of a solution)
Six possible interpretations with only a change of city name
Requires reasoning about task and current
context to identify most plausible interpretation
Requires more than concept spotting to identify
structure and meaning of utterance as basis for
reasoning
Role of syntax in deep understanding
List all employees of the companies who/which are based in the city centre
who → employees based in city centre
which → companies based in city centre
I would like to know where to mail the check.
→ I mail check to ?where
I would like you to mail the check to me
→ You mail check to me
Stop word removal will transform both strings to “mail check”
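The collapse can be shown directly. In this minimal illustration the stop list is a hypothetical one, chosen just to cover these two sentences; the point is that naive stop word removal discards exactly the syntactic structure a deep parser would need to distinguish the two meanings.

```python
# Hypothetical stop list for the two example sentences.
STOP_WORDS = {"i", "would", "like", "to", "know", "where", "you", "the", "me"}

def remove_stop_words(sentence):
    """Drop stop words and punctuation, keeping only 'content' words."""
    words = sentence.lower().replace(".", "").split()
    return " ".join(w for w in words if w not in STOP_WORDS)

a = remove_stop_words("I would like to know where to mail the check.")
b = remove_stop_words("I would like you to mail the check to me")
print(a, "|", b)  # both utterances reduce to the same string
```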
SLU for simulated dialog
C: Are you still a friend of XXX?
H: I am not sure if he has any real
friends. He has achieved a lot and has
left a lot of people behind.
C: You really don’t like the guy - well, at least
that’s definite - I thought most people were
just indifferent or numb by now.
?
Coverage
Possible requirement:
“The system should be able to understand
everything the user might say”
Predicting user input
Analysis of corpora and iterative design of
hand-crafted grammars
Use of carefully designed prompts to constrain user input
Learning grammar from data
Robustness
Characteristics of spontaneous spoken
language
Output from speech recognition component
Disfluencies and filled pauses – not just errors,
reflect cognitive aspects of speech production
and interaction management
Words and word boundaries not known with
certainty
Recognition errors
Approaches
Use of semantic grammars and robust parsing for
concept spotting
Data-driven approaches – learn mappings
between input strings and output structures
Developing the SLU component
Hand-crafted approaches
Grammar development
Parsing
Data-driven approaches
Learning from data
Statistical models rather than grammars
Efficient decoding
Hand-crafting grammars
Traditional software
engineering approach of
design and iterative
refinement
Decisions about type of
grammar required
Chomsky hierarchy
Flat v hierarchical
representations
Processing issues
(parsing)
Dealing with ambiguity
Efficiency
ASR (n-best list, word lattice, …) → Parsing (parse tree) → Frame Generation (semantic frame) → Discourse Processing (frame in context) → DB Query (SQL query)
Semantic Grammar and Robust
Parsing: PHOENIX (CMU/CU)
The Phoenix parser maps input word strings on to a sequence of semantic frames
A frame is a named set of slots, where the slots represent related pieces of information
Each slot has an associated Context-Free Grammar that specifies the word string patterns that match the slot
Chart parsing with path pruning: e.g. a path that accounts for fewer words is pruned
ASR → word string → Semantic Parser → meaning representation
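A toy approximation of Phoenix-style slot filling can make the pruning idea concrete. The frame and patterns below are invented for illustration (real Phoenix slot grammars are full CFGs, not regular expressions), but the principle is the same: each slot carries word-string patterns, and pruning keeps the analysis that accounts for the most words.

```python
import re

# Hypothetical flight frame: each slot lists patterns that can fill it.
FLIGHT_FRAME = {
    "depart_loc": [r"(?:leaving|departing|from) (\w+)"],
    "arrive_loc": [r"(?:arriving in|going to|to) (\w+)"],
}

def parse(word_string):
    frame = {}
    for slot, patterns in FLIGHT_FRAME.items():
        best = None
        for p in patterns:
            for m in re.finditer(p, word_string):
                # Path pruning in miniature: keep the match that accounts
                # for more words, discarding weaker competing analyses.
                if best is None or len(m.group(0).split()) > len(best.group(0).split()):
                    best = m
        if best:
            frame[slot] = best.group(1)
    return frame

print(parse("i want to go from boston going to denver"))
```

Here “to go” also matches the arrive_loc patterns, but the longer match “going to denver” accounts for more words and survives the pruning.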
Deriving Meaning directly from ASR
output: VoiceXML
Uses finite state grammars as language
models for recognition and semantic
tags in the grammars for semantic
parsing
I would like a coca cola and three
large pizzas with pepperoni and
mushrooms
{
drink: "coke",
pizza: {
number: "3",
size: "large",
topping: [
"pepperoni",
"mushrooms"
]
}
}
ASR → meaning representation
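A rough Python analogue of the semantic-tag idea shows how the meaning representation above can be assembled as rules match. The rules here are invented for illustration; in VoiceXML the grammar is written in SRGS and the tags are small ECMAScript fragments attached to grammar rules.

```python
import re

NUMBER_WORDS = {"one": "1", "two": "2", "three": "3"}

def interpret(utterance):
    """As each phrase rule matches, its 'tag' writes part of the meaning."""
    meaning = {}
    if re.search(r"coca cola|coke", utterance):
        meaning["drink"] = "coke"  # tag normalises both phrasings to 'coke'
    m = re.search(r"(one|two|three) (small|medium|large) pizzas?", utterance)
    if m:
        meaning["pizza"] = {"number": NUMBER_WORDS[m.group(1)],
                            "size": m.group(2)}
        t = re.search(r"with (\w+)(?: and (\w+))?", utterance)
        if t:
            meaning["pizza"]["topping"] = [x for x in t.groups() if x]
    return meaning

print(interpret("i would like a coca cola and three large pizzas "
                "with pepperoni and mushrooms"))
```

Because the tags fire during recognition-grammar matching, the meaning structure falls out of the same finite state grammar that constrains what can be recognised, with no separate parsing stage.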
Deep understanding
Requirements for deep understanding
advanced grammatical formalisms
syntax-semantics issues
parsing technologies
Example: TRIPS
Uses feature-based augmented CFG with
agenda-driven best-first chart parser
Combined strategy: combining shallow
and deep parsing (Swift et al.)
Combined strategies: TINA (MIT)
Grammar rules include mix of syntactic and
semantic categories
Context free grammar using probabilities
trained from user utterances to estimate
likelihood of a parse
Parse tree converted to a semantic frame
that encapsulates the meaning
Robust parsing strategy
Sentences that fail to parse are parsed
using fragments that are combined into a
full semantic frame
When all things fail, word spotting is used
Problems with hand-crafted
approaches
Hand-crafted grammars are
not robust to spoken language input
require linguistic and engineering
expertise to develop if grammar is to
have good coverage and optimised
performance
time consuming to develop
error prone
subject to designer bias
difficult to maintain
Statistical modelling for SLU
SLU as pattern matching problem
Given word sequence W, find semantic
representation of meaning M that has
maximum a posteriori probability P(M|W)
M̂ = argmax_M P(M | W) = argmax_M P(W | M) P(M)
P(M): semantic prior model – assigns probability to
underlying semantic structure
P(W|M): lexicalisation model – assigns probability to
word sequence W given the semantic structure
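A toy decoder in the spirit of this equation can be written directly, with all probabilities invented for illustration: the semantic prior P(M) is approximated by a bigram model over concept tags (TRANS) and the lexicalisation model P(W|M) by per-tag word probabilities (EMIT); Viterbi search then finds the most probable tag sequence.

```python
from math import log

TAGS = ["ORIGIN", "DEST", "OTHER"]
TRANS = {  # P(tag | previous tag), with 'START' as the initial context
    "START":  {"ORIGIN": 0.2, "DEST": 0.1, "OTHER": 0.7},
    "ORIGIN": {"ORIGIN": 0.6, "DEST": 0.2, "OTHER": 0.2},
    "DEST":   {"ORIGIN": 0.1, "DEST": 0.5, "OTHER": 0.4},
    "OTHER":  {"ORIGIN": 0.3, "DEST": 0.3, "OTHER": 0.4},
}
EMIT = {  # P(word | tag); concept segments cover function words like 'from'
    "ORIGIN": {"from": 0.4, "boston": 0.3, "denver": 0.3},
    "DEST":   {"to": 0.4, "boston": 0.3, "denver": 0.3},
    "OTHER":  {"flights": 0.5, "show": 0.25, "me": 0.25},
}

def viterbi(words):
    # best[tag] = (log probability, tag sequence) of the best partial path
    best = {"START": (0.0, [])}
    for w in words:
        new_best = {}
        for tag in TAGS:
            emit = EMIT[tag].get(w, 1e-6)  # tiny floor for unseen words
            new_best[tag] = max(
                (score + log(TRANS[prev][tag]) + log(emit), seq + [tag])
                for prev, (score, seq) in best.items()
            )
        best = new_best
    return max(best.values())[1]

print(viterbi("flights from boston to denver".split()))
```

This is the ‘flat-concept’ style of model discussed below: simple to train, but with no way to represent hierarchical structure over the tag sequence.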
Early Examples
CHRONUS (AT&T: Pieraccini et al, 1992; Levin &
Pieraccini, 1995)
Finite state semantic tagger
‘Flat-concept’ model: simple to train but does not
represent hierarchical structure
HUM (Hidden Understanding Model) (BBN: Miller et al,
1995)
Probabilistic CFG using tree structured meaning
representations
Grammatical constraints represented in networks rather
than rules
Ordering of constituents unconstrained - increases
robustness
Transition probabilities constrain over-generation
Requires fully annotated treebank data for training
Using Hidden State Vectors (He & Young)
Extends ‘flat-concept’ HMM model
Represents hierarchical structure (right-branching) using hidden state vectors
Each state expanded to encode the stack of a push-down automaton
Avoids computational tractability issues associated with hierarchical HMMs
Can be trained using lightly annotated data
Comparison with FST model and with hand-crafted SLU systems using ATIS test sets and reference parse results
Which flights arrive in Burbank from Denver on Saturday?
Problem with long-distance dependency between ‘Saturday’ and ‘arrive’
Flat model: ‘Saturday’ wrongly associated with ‘FROMLOC’
Hierarchical model allows ‘Saturday’ to be associated with ‘ARRIVE’
Also: more expressive, allows sharing of sub-structures
SLU Evaluation: Performance
Statistical models competitive with
approaches based on handcrafted rules
Hand-crafted grammars better for full
understanding and for users familiar with
system’s coverage, statistical model better
for shallow and more robust understanding
for naïve users
Statistical systems more robust to noise
and more portable
SLU Evaluation: Software Development
“Cost of producing training data should be less than
cost of hand-crafting a semantic grammar”
(Young, 2002)
Issues
Availability of training data
Maintainability
Portability
Objective metrics? e.g. time, resources, lines of code, …
Subjective issues e.g. designer bias, designer control
over system
Few concrete results, except …
HVS model (He & Young) can be robustly trained from
only minimally annotated corpus data
Model is robust to noise and portable to other domains
Additional technologies
Named entity extraction
Rule-based methods: e.g. using grammars
in form of regular expressions compiled
into finite state acceptors (AT&T SLU
system) – higher precision
Statistical methods e.g. HMIHY, learn
mappings between strings and NEs – higher
recall as more robust
Call routing
Question Answering
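A minimal rule-based extractor in the style described above can be sketched as follows. The patterns are invented for illustration (the AT&T system compiles such regular expressions into finite state acceptors rather than running them directly); the characteristic trade-off is visible: only strings the rules anticipate are accepted, giving high precision at the cost of recall.

```python
import re

# Hypothetical named-entity rules, one regular expression per entity type.
NE_RULES = {
    "PHONE": re.compile(r"\b\d{3}[- ]\d{4}\b"),
    "DATE": re.compile(r"\b(?:January|February|March|April|May|June|July|"
                       r"August|September|October|November|December)"
                       r" \d{1,2}\b"),
}

def extract_entities(text):
    """Return (label, surface string) pairs for every rule match."""
    return [(label, m.group(0))
            for label, rule in NE_RULES.items()
            for m in rule.finditer(text)]

print(extract_entities("call 555-1234 before December 10"))
```

A statistical extractor trained on labelled strings would instead generalise past the exact patterns, trading some precision for the higher recall noted above.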
Additional Issues 1
ASR/SLU coupling
Post-processing results from ASR
noisy channel model of ASR errors (Ringger &
Allen)
Combining shallow and deep parsing
major gains in speed, slight gains in
accuracy (Swift et al.)
Use of context, discourse history, prosodic
information
re-ordering n-best hypotheses
determining dialog act based on
combinations of features at various levels:
ASR and parse probabilities, semantic and
contextual features (Purver et al, Lemon)
Additional Issues 2
Methods for learning from sparse data or
without annotation
e.g. AT&T system uses ‘active learning’ (Tur
et al, 2005) to reduce effort of human data
labelling – uses only those data items that
improve classifier performance the most
Development tools e.g. SGStudio (Wang &
Acero) – build semantic grammar with little
linguistic knowledge
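The active-learning selection step can be sketched in a few lines. The utterances and confidence scores below are invented for illustration; the idea is simply that, from a pool of unlabelled utterances, only those the current classifier is least confident about are sent for human labelling, since those are the items most likely to improve the model.

```python
def select_for_labelling(pool, confidence, budget):
    """pool: unlabelled utterances; confidence: maps an utterance to the
    classifier's confidence in its top hypothesis; budget: how many of the
    least confident items to send for human labelling."""
    ranked = sorted(pool, key=confidence)  # least confident first
    return ranked[:budget]

# Hypothetical classifier confidences over an unlabelled pool.
scores = {
    "call my bank": 0.95,
    "uh the thing about my bill": 0.41,
    "agent please": 0.88,
    "um yes the er charge": 0.35,
}
print(select_for_labelling(list(scores), scores.get, budget=2))
```

Confident items are left unlabelled, so labelling effort is concentrated where the classifier gains the most.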
Additional Issues 3
Some issues addressed in poster session
Using SLU for:
Dialog act tagging
Prosody labelling
User satisfaction analysis
Topic segmentation and labelling
Emotion prediction
Conclusions 1
SLU approach is determined by
type of application
finite state dialog with single word
recognition
frame based dialog with topic classification
and named entity extraction
advanced dialog requiring deep
understanding
simulated conversation, …
Conclusions 2
SLU approach is determined by
type of output required
syntactic / semantic parse trees
semantic frames
speech / dialog acts, …
intentions, beliefs, emotions, …
Conclusions 3
SLU approach is determined by
Deployment and usability issues
applications requiring accurate extraction
of information
applications involving complex
processing of content
applications involving shallow processing
of content (e.g. conversational
companions, interactive games)
Selected References
Bangalore, S., Hakkani-Tür, D., Tur, G. (eds), (2006)
Special Issue on Spoken Language Understanding in
Conversational Systems. Speech Communication 48.
Gupta, N., Tur, G., Hakkani-Tür, D., Bangalore, S., Riccardi,
G., Gilbert, M. (2006) The AT&T Spoken Language
Understanding System. IEEE Transactions on Speech
and Audio Processing 14:1, 213-222.
Allen, J.F., Byron, D.K., Dzikovska, O., Ferguson, G., Galescu, L., Stent, A. (2001) Towards conversational human-computer interaction. AI Magazine 22(4): 27–35.
Jurafsky, D. & Martin, J. (2000) Speech and Language
Processing, Prentice-Hall
Huang, X, Acero, A, Hon, H-W. (2001) Spoken Language
Processing: A Guide to Theory, Algorithm and System
Development. Prentice-Hall