Spoken Dialogue Systems
Julia Hirschberg
CS 4706
11/6/2015
Issues
• Error avoidance
• Error detection
– From the system side: how likely is it that the system made an error?
– From the user side: what cues does the user
provide to indicate an error?
• Error handling: what can the system do when it
thinks an error has occurred?
• Evaluation: how do you know what needs fixing
most?
Avoiding misunderstandings
• By imitating human performance
• Timing and grounding (Clark ’03)
Recognizing Problematic Dialogues
• Hastie et al, “What’s the Trouble?” ACL 2002.
Recognizing Problematic Utterances
(Hirschberg et al ’99--)
• Collect corpus from interactive voice response
system
• Identify speaker ‘turns’:
– that are incorrectly recognized
– where speakers first become aware of an error
– that correct misrecognitions
• Identify prosodic features of turns in each category and compare to other turns
• Use machine learning techniques to train a classifier to make these distinctions automatically (sketched below)
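As a rough illustration of that last step, here is a minimal sketch: a scikit-learn decision tree stands in for whatever learner the original study used, and the prosodic feature values and labels are invented placeholders, not TOOT data.

```python
# Minimal sketch (not the study's actual learner or data): classify turns
# as misrecognized vs. correctly recognized from per-turn prosodic features.
from sklearn.tree import DecisionTreeClassifier

# One row per turn; features are illustrative prosodic measures:
# [f0 max (Hz), f0 mean (Hz), RMS max, duration (s), tempo (syll/s), % silence]
X_train = [
    [310.0, 210.0, 0.82, 2.4, 4.1, 0.22],
    [190.0, 140.0, 0.40, 1.1, 5.0, 0.05],
    [305.0, 205.0, 0.79, 2.6, 3.9, 0.25],
    [185.0, 135.0, 0.38, 0.9, 5.2, 0.04],
]
y_train = [1, 0, 1, 0]  # 1 = misrecognized turn, 0 = correctly recognized

clf = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
new_turn = [[295.0, 200.0, 0.75, 2.0, 4.3, 0.18]]
print("misrecognized?", bool(clf.predict(new_turn)[0]))
```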
Turn Types
TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening. [misrecognition]
TOOT: Which city do you want to go to?
User: New York. [correction, aware site]
Results
• Reduced error in predicting misrecognized turns to 8.64%
• Error in predicting ‘awares’: 12%
• Error in predicting corrections: 18-21%
Evidence from Human Performance
• Users provide explicit positive and negative
feedback
• Corpus-based vs. laboratory experiments –
do these tell us different things?
– Bell & Gustafson ’00
• What do we learn from this?
• What functions does feedback serve?
– Krahmer et al
• ‘go on’ and ‘go back’ signals in grounding
situations (implicit/explicit verification)
– Positive (‘go on’) cues: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
– Negative (‘go back’) cues: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info
– Hypotheses supported but…
• Can these cues be identified automatically? (a heuristic sketch follows below)
• How might they affect the design of SDS?
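Purely as a toy answer to the first question, a surface-cue scorer like the following could flag likely ‘go back’ turns. The cue lexicons and thresholds are invented assumptions; the studies above relied on richer prosodic and structural evidence.

```python
# Toy heuristic (not Krahmer et al.'s method): score a user turn as a
# 'go on' (positive) vs. 'go back' (negative) signal from surface cues
# named on the slide. Lexicons and thresholds are invented for illustration.
POS_MARKERS = {"yes", "right", "correct", "okay"}  # confirmations
NEG_MARKERS = {"no", "not", "wrong", "incorrect"}  # disconfirmations

def feedback_score(turn: str, prev_user_turn: str = "") -> int:
    words = turn.lower().split()
    score = 1 if len(words) <= 4 else -1               # short vs. long turn
    score += sum(w in POS_MARKERS for w in words)      # confirmation cues
    score -= sum(w in NEG_MARKERS for w in words)      # disconfirmation cues
    overlap = set(words) & set(prev_user_turn.lower().split())
    if len(overlap) >= 3:                              # repetition of an earlier turn
        score -= 1
    return score  # > 0 suggests 'go on'; < 0 suggests 'go back'

print(feedback_score("no to New York", "I want to go to New York"))  # negative
```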
Error Handling Strategies
• Goldberg et al ’03: how should systems best
inform the user that they don’t understand?
– System rephrasing vs. repetitions vs.
statement of not understanding
– Apologies
• What behaviors might these produce?
– Hyperarticulation
– User frustration
– User repetition or rephrasing
• What lessons do we learn?
– What produces least frustration?
– Best recognized input?
Evaluating Dialogue Systems
• PARADISE framework (Walker et al ’00)
• “Performance” of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and by how it gets accomplished
[Diagram: performance derives from maximizing Task Success while minimizing Costs; Costs divide into Efficiency Measures and Qualitative Measures]
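PARADISE makes this tradeoff concrete as a performance function: a weighted sum of (normalized) task success minus (normalized) costs, with the weights fitted by regressing user satisfaction on those measures. A sketch with invented weights and per-dialogue values:

```python
# Sketch of a PARADISE-style performance function. Weights and per-dialogue
# values below are invented; in PARADISE the weights come from regressing
# user satisfaction on task success and cost measures.
from statistics import mean, stdev

def znorm(x, sample):
    """Normalize to zero mean / unit variance before weighting."""
    return (x - mean(sample)) / stdev(sample)

kappas = [0.5, 0.7, 0.9, 0.6]          # task success per dialogue
times = [120.0, 300.0, 90.0, 200.0]    # elapsed-time cost per dialogue (s)

def performance(kappa, elapsed, alpha=0.5, w_time=0.3):
    return alpha * znorm(kappa, kappas) - w_time * znorm(elapsed, times)

print(round(performance(0.9, 90.0), 3))
```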
Task Success
• Task goals seen as an Attribute-Value Matrix (AVM)
• ELVIS e-mail retrieval task (Walker et al ‘97): “Find the time and place of your meeting with Kim.”

Attribute             Value
Selection Criterion   Kim or Meeting
Time                  10:30 a.m.
Place                 2D516

• Task success defined by the match between AVM values at the end of the task and the “true” values for the AVM (sketched below)
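A minimal sketch of that comparison. PARADISE actually scores the match with the kappa statistic, which corrects agreement for chance; plain percent agreement is shown here, and the mismatched value is invented.

```python
# Compare the AVM reached at the end of a dialogue against the scenario key.
key = {
    "Selection Criterion": "Kim or Meeting",
    "Time": "10:30 a.m.",
    "Place": "2D516",
}
observed = dict(key, Place="2D519")  # invented mismatch for illustration

agreement = sum(observed[a] == v for a, v in key.items()) / len(key)
print(f"percent agreement: {agreement:.2f}")  # 0.67; PARADISE would report kappa
```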
Metrics
• Efficiency of the Interaction: User Turns, System Turns, Elapsed Time
• Quality of the Interaction: ASR rejections, Time Out Prompts, Help Requests, Barge-Ins, Mean Recognition Score (concept accuracy), Cancellation Requests
• User Satisfaction
• Task Success: perceived completion, information extracted
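A per-dialogue record covering these four families might look like the sketch below; the field names are assumptions for illustration, not an actual PARADISE logging schema.

```python
# Illustrative per-dialogue metrics record; field names are assumptions.
from dataclasses import dataclass

@dataclass
class DialogueMetrics:
    # Efficiency of the interaction
    user_turns: int
    system_turns: int
    elapsed_time_s: float
    # Quality of the interaction
    asr_rejections: int
    timeout_prompts: int
    help_requests: int
    barge_ins: int
    mean_recognition_score: float  # concept accuracy
    cancellation_requests: int
    # Task success
    perceived_completion: bool
    # User satisfaction (survey sum, collected separately)
    user_satisfaction: float
```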
Experimental Procedures
• Subjects given specified tasks
• Spoken dialogues recorded
• Cost factors, states, dialog acts automatically logged; ASR accuracy and barge-in hand-labeled
• Users specify task solution via web page
• Users complete User Satisfaction surveys
• Use multiple linear regression to model User Satisfaction as a function of Task Success and Costs; test for significant predictive factors (sketched below)
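A minimal sketch of that regression step, assuming scikit-learn and wholly invented data (five dialogues, three predictors):

```python
# Fit User Satisfaction as a linear function of task success and costs.
# All numbers are invented placeholders, not data from the studies.
import numpy as np
from sklearn.linear_model import LinearRegression

# Predictors per dialogue: [COMP (perceived completion), MRS, normalized ET]
X = np.array([
    [1.0, 0.92, 0.4],
    [0.0, 0.61, 1.2],
    [1.0, 0.85, 0.7],
    [0.0, 0.55, 1.5],
    [1.0, 0.78, 0.9],
])
user_sat = np.array([4.5, 2.0, 4.0, 1.5, 3.5])  # survey sums (invented)

model = LinearRegression().fit(X, user_sat)
print("weights (COMP, MRS, ET):", model.coef_.round(2))
print("R^2:", round(model.score(X, user_sat), 3))
```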
User Satisfaction: Sum of Many Measures
• Was Annie easy to understand in this conversation? (TTS Performance)
• In this conversation, did Annie understand what you said? (ASR Performance)
• In this conversation, was it easy to find the message you wanted? (Task Ease)
• Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
• In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
• How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
• Did Annie work the way you expected her to in this conversation? (Expected Behavior)
• From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)
Performance Functions from Three Systems
• ELVIS: User Sat. = .21*COMP + .47*MRS - .15*ET
• TOOT: User Sat. = .35*COMP + .45*MRS - .14*ET
• ANNIE: User Sat. = .33*COMP + .25*MRS + .33*Help
– COMP: User perception of task completion (task success)
– MRS: Mean recognition accuracy (cost)
– ET: Elapsed time (cost)
– Help: Help requests (cost)
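Plugging illustrative (invented) normalized values into the ELVIS function above shows how the fitted weights trade off completion, recognition quality, and time:

```python
# Evaluate the ELVIS performance function from the slide on invented inputs.
def elvis_user_sat(comp, mrs, et):
    return 0.21 * comp + 0.47 * mrs - 0.15 * et

print(elvis_user_sat(comp=1.0, mrs=0.9, et=0.5))  # 0.21 + 0.423 - 0.075 = 0.558
```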
Performance Model
• Perceived task completion and mean recognition
score are consistently significant predictors of
User Satisfaction
• Performance model useful for system
development
– Making predictions about system
modifications
– Distinguishing ‘good’ dialogues from ‘bad’
dialogues
• But can we also tell on-line when a dialogue is ‘going wrong’?
Next Week
• Speech summarization and data mining