Estimation Method of User Satisfaction Using N-gram


LREC2010: O3 - Dialogue and Evaluation
Estimation Method of User Satisfaction
Using N-gram-based Dialog History Model
for Spoken Dialog System
Sunao Hara, Norihide Kitaoka, Kazuya Takeda
{naoh, kitaoka, kazuya.takeda}@nagoya-u.jp
Graduate School of Information Science,
Nagoya University, Japan
Introduction
Outline
1. Introduction
2. Musicnavi2 database
3. N-gram modeling
4. Estimation experiment
5. Conclusion

• The aim of this study
– Construct an estimation model of user satisfaction for spoken dialog systems (SDSs) based on data collected in real PC environments
• Experiment
– Field experiment using an SDS for a music retrieval application
– Construct and evaluate an estimation model of user satisfaction using an N-gram-based dialog history model
LREC2010: Sunao HARA et al., Nagoya Univ., Japan.
May 19, 2010
Background (1/2)
• Use of speech input applications (e.g. Skype) by PC users is spreading
– More users may use Spoken Dialog Systems (SDSs) via the Internet
• The acoustic properties of PC environments differ among users
– e.g. microphones, noise conditions, etc.
→ Collect speech under realistic PC environments
• From a practical application standpoint
– Evaluating and predicting system performance (user satisfaction) are also important issues
→ Build an estimation model of user satisfaction
Background (2/2)
• Evaluation using automatically measured metrics
– Tune system parameters at the design stage
– Select the best dialog strategy for SDS applications
– PARADISE framework [Walker, et al. 1997]
• Detection of problematic dialogs for call-center Interactive Voice Response (IVR) systems
– Detect that “the conversation will break down” as soon as possible
– Problematic dialog predictor using SLU (Spoken Language Understanding) success features [Walker, et al. 2002]
– N-gram-based call quality monitoring system [Kim 2007]
→ Can we estimate the user satisfaction of an SDS by modeling the dialog context?
MusicNavi2 database
• Field experiment using a music retrieval system with a spoken dialog interface
1. Download the system through the Internet
2. Use it for a certain period
3. Fill in questionnaires on the web page
• Music retrieval system: MusicNavi2
– “Music retrieval application” + “Spoken dialog interface”
– A spoken dialog interface for retrieving and playing songs stored on the user’s PC
– Collects speech data in cooperation with a server program via the Internet
Example of a dialog (U = User, S = System)
U: Hello (ko-n-ni-chi-wa)
S: Hello
U: Da-i-to-ka-i
S: Do you want to retrieve the song “Da-i-to-ka-i?”
U: Yes (ha-i)
S: Now, playing the song “Da-i-to-ka-i” by “Crystal King.”
U: Stop (te-i-shi)
S: Now, stopping.
Data collection by the field test
• Large-scale field test through the Internet
– Subjects used MusicNavi2 on their own PCs
– Participants: 1369 subjects
– Total usage: 488 hours
• User’s task
– Listen to at least five songs
– Perform at least twenty Q&A dialogs, or use the system for over forty minutes
• Questionnaire (answered only by “task complete” users)
– Satisfaction level for the SDS from 1 to 5:
1: Extremely unsatisfied, 2: Unsatisfied, 3: Acceptable, 4: Satisfied, 5: Extremely satisfied
Distributions of the experimental subjects and the equipment used by them
• Subjects who answered questionnaires
– 449 subjects (278 males and 171 females)
– Total 34296 utterances
[Figure: bar chart of subjects by gender and age group (16–19, 20–29, 30–39 years old)]
• Microphone: Headset 48%, Pin or desktop 15%, Inside of PC 5%, Unknown 32%
• Loudspeaker / headphone: Headphone 52%, Loudspeaker 16%, Inside of PC 13%, Unknown 19%
Overview of the MusicNavi2 database
[Figure: histograms of the number of utterances per user, the word error rate [%], and the number of utterances per song played]
Pre-analysis of the MusicNavi2 database
• Classification of users by their satisfaction level
– “task complete” users: c = 1, 2, 3, 4, 5
– “task incomplete” users: c = ϕ
• Summary of data (total 518 subjects)

c                 ϕ      1      2      3      4      5
# of subjects     69     38     102    107    155    47
# of utterances   52.2   134.5  119.7  114.9  106.5  98.4
WER [%]           70.5   54.1   51.0   46.8   41.2   35.3
Utt. / song       10.7   7.21   5.34   5.12   4.22   3.43
Modeling method for the dialog context
• The dialog management of an SDS is designed by a dialog developer
– The management is not always satisfactory for users
• Assume that satisfaction appears in the dialog context
• Statistically learn the naturalness of the dialog
– Use N-grams to model the dialog context
– Construct a model for each class of users
– Estimate an unknown user’s satisfaction based on the likelihood under each N-gram model
Spoken dialog logs to dialog act symbols
• The vocabulary size of the recognition dictionary (i.e., the number of songs) differs between users
• Word-level information is informative, but it is too sparse to handle statistically
• Use dialog act symbols for the users’ and the system’s acts
– Defined 21 system dialog acts and 19 user dialog acts
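As a toy illustration of this encoding step, the sketch below maps recognized user utterances to user dialog act symbols. The symbol names follow the paper's encoded-dialog example; the mapping rules themselves are hypothetical, not the system's actual definitions of its 19 user dialog acts.

```python
# Hypothetical sketch of the encoding step: map recognized user
# utterances to user dialog act symbols. The symbol names follow the
# paper's encoded-dialog example; the mapping rules are illustrative.
def encode_user_utterance(text):
    commands = {
        "hello": "USR_CMD_HELLO",
        "yes": "USR_CMD_YES",
        "stop": "USR_CMD_STOP",
    }
    # Anything that is not a known command is treated as a request
    # for a song by title.
    return commands.get(text.lower(), "USR_REQUEST_BYMUSIC")

log = ["Hello", "Da-i-to-ka-i", "Yes", "Stop"]
print([encode_user_utterance(u) for u in log])
# ['USR_CMD_HELLO', 'USR_REQUEST_BYMUSIC', 'USR_CMD_YES', 'USR_CMD_STOP']
```

Working over a closed set of act symbols, instead of the open song vocabulary, keeps the sequence statistics dense enough to model.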
Example of an encoded dialog (U = User, S = System)
U: Hello (ko-n-ni-chi-wa) → x1 = USR_CMD_HELLO
S: Hello → x2 = SYS_INFO_GREETING
U: Da-i-to-ka-i → x3 = USR_REQUEST_BYMUSIC
S: Do you want to retrieve the song “Da-i-to-ka-i?” → x4 = SYS_CONFIRM_KEYWORD
U: Yes (ha-i) → x5 = USR_CMD_YES
S: Now, playing the song “Da-i-to-ka-i” by “Crystal King.” → x6 = SYS_PLAY_SONG
U: Stop (te-i-shi) → x7 = USR_CMD_STOP
S: Now, stopping. → x8 = SYS_INFO_STOPPED
Modeling the dialog act sequence by N-gram
• A dialog act sequence x = (x1, x2, …, xT): the dialog act symbols arranged in time order t
• N-gram probability (= likelihood) given the model for a user class c:
P(x | c) = ∏_{t=1}^{T} P(x_t | x_{t−N+1}, …, x_{t−1}, c)
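A minimal sketch of this class-conditional likelihood, assuming a bigram (N = 2) model and add-one smoothing in place of the Witten-Bell smoothing the paper actually uses via SRILM; one such model would be trained per user class from that class's dialog act sequences:

```python
# Bigram (N = 2) sketch of P(x | c) = prod_t P(x_t | x_{t-1}, c).
# Add-one smoothing stands in for the paper's Witten-Bell smoothing.
import math
from collections import Counter

def train_bigram(sequences):
    """Train a bigram model from one class's dialog act sequences."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + seq          # sentence-start symbol
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            unigrams[a] += 1
            bigrams[(a, b)] += 1
    return unigrams, bigrams, len(vocab)

def log_likelihood(seq, model):
    """log P(x | c) as a sum of smoothed bigram log-probabilities."""
    unigrams, bigrams, v = model
    padded = ["<s>"] + seq
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
        for a, b in zip(padded, padded[1:])
    )

# Toy example with act symbols from the encoded-dialog slide:
model = train_bigram([["USR_CMD_HELLO", "SYS_INFO_GREETING", "USR_CMD_STOP"]])
print(log_likelihood(["USR_CMD_HELLO", "SYS_INFO_GREETING"], model))
```

Training one model per class and comparing the resulting log-likelihoods of an unknown user's sequence is the basis of the detection experiments that the deck reports.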
Estimation experiment
• Detection of the user’s class using N-gram models
– Exp. 1: “task incomplete” users
– Exp. 2: “unsatisfied” users
• Experimental conditions
– N-gram: 1-gram, 2-gram, …, 8-gram
• Witten-Bell smoothing (using the SRILM toolkit)
– Input sequences: USR, SYS, SYSUSR
– Leave-one-out cross validation
Estimation experiment
• Detection method
– Model selection by thresholding the likelihood ratio
• Evaluation metrics
– ROC curve
– Area under the ROC curve (AUC)
[Figure: ROC curve — true detection rate (0 to 1) vs. false detection rate (0 to 1)]
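The AUC metric can be computed from per-user detection scores (e.g. log-likelihood ratios) with a simple threshold sweep. This is a generic sketch of the metric, not the authors' evaluation script:

```python
# Generic sketch: ROC sweep and AUC from per-user detection scores
# (e.g. log-likelihood ratios); label 1 = target class ("task incomplete").
def roc_auc(scores, labels):
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    auc = prev_fpr = prev_tpr = 0.0
    for _, y in pairs:
        if y:
            tp += 1
        else:
            fp += 1
        fpr, tpr = fp / neg, tp / pos
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid rule
        prev_fpr, prev_tpr = fpr, tpr
    return auc

print(roc_auc([3.1, 2.0, 0.5, -1.2], [1, 1, 0, 0]))  # perfect ranking -> 1.0
```

Each threshold setting yields one (false detection rate, true detection rate) point; the AUC summarizes the whole curve in a single number, which is how the following slides compare N-gram orders.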
AUC (Area under the ROC curve)

• “task incomplete” users:

N        SYS    USR    SYSUSR
1-gram   0.901  0.873  0.927
2-gram   0.948  0.929  0.977
3-gram   0.989  0.954  0.993
4-gram   0.995  0.952  0.997
5-gram   0.993  0.954  0.995
6-gram   0.989  0.951  0.995
7-gram   0.988  0.946  0.995
8-gram   0.987  0.936  0.994

• “unsatisfied” users:

N        SYS    USR    SYSUSR
1-gram   0.611  0.638  0.619
2-gram   0.628  0.644  0.724
3-gram   0.591  0.651  0.704
4-gram   0.583  0.681  0.739
5-gram   0.629  0.662  0.739
6-gram   0.632  0.639  0.761
7-gram   0.604  0.633  0.765
8-gram   0.592  0.622  0.756

→ High detection performance for “task incomplete” users when using the system dialog acts
→ Suggests the effectiveness of using both system and user dialog acts
Detection result of “task incomplete” users
• SYSUSR: the 4-gram model achieved a 100% true detection rate with a 6% false detection rate

N        AUC
1-gram   0.927
2-gram   0.977
3-gram   0.993
4-gram   0.997
5-gram   0.995
6-gram   0.995
7-gram   0.995
8-gram   0.994
Detection result of “unsatisfied” users
• SYSUSR: the larger N is, the lower the false detection rate becomes

N        AUC
1-gram   0.619
2-gram   0.724
3-gram   0.704
4-gram   0.739
5-gram   0.739
6-gram   0.761
7-gram   0.765
8-gram   0.756
Conclusion
• Estimation method of user satisfaction using an N-gram-based dialog history model for SDSs
– Constructed a database collected in real PC environments
– Achieved high performance in the detection of “task incomplete” users
• 100% true detection rate at a 6% false detection rate
– Performance was not sufficient in the detection of “unsatisfied” users
– Higher-order N-gram models were shown to be effective in comparison with the 1-gram model
– Using both system and user dialog acts was effective
• Future work
– N-gram model-based estimation of dialog failure (online detection)
– Analysis of the dialog context that affects user satisfaction
– An integrated method using acoustic features, prosodic features, dialog features, etc.
• Thanks for your kind attention!
Modeling the dialog act sequence by N-gram
• Encoded dialog logs into dialog act symbols automatically
– User’s dialog acts
• Obtained from the speech recognition results
• Defined in the recognition dictionary
– System’s dialog acts
• Obtained from the system prompts or responses
• Identical to the system’s internal acts
• A dialog act sequence x: the dialog act symbols arranged in time order t
• N-gram probability (= likelihood) given the model for a satisfaction level s:
P(x | s) = ∏_{t=1}^{T} P(x_t | x_{t−N+1}, …, x_{t−1}, s)
Detection by thresholding
• Model selection by an a posteriori odds classifier
• Introduce the a priori odds 1/α and the Bayes factor B = P(x | c1) / P(x | c2)
• Finally, decide class c1 if (1/α) · B ≥ 1, i.e., B ≥ α
* α = 1 means the ML classifier
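A minimal sketch of this thresholding rule, assuming the two class-conditional N-gram log-likelihoods have already been computed; α = 1 reduces to the maximum-likelihood classifier, and sweeping α traces out the ROC curve:

```python
# Sketch of the posterior-odds detector: decide the target class c1 when
# (1/alpha) * B >= 1, with Bayes factor B = P(x | c1) / P(x | c2)
# computed from the two class-conditional N-gram models (log domain here).
import math

def detect(loglik_c1, loglik_c2, alpha=1.0):
    log_bayes_factor = loglik_c1 - loglik_c2  # log B
    return log_bayes_factor >= math.log(alpha)

print(detect(-10.0, -12.0))            # True: the c1 model fits better
print(detect(-10.0, -9.5, alpha=5.0))  # False: evidence below threshold
```

Working in the log domain keeps the product of many small N-gram probabilities numerically stable.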
Estimation experiment with 6 classes of satisfaction
• Estimation of the user satisfaction class using N-gram models
• Experimental conditions
– For each evaluation subject, train a model per satisfaction level on the remaining 517 subjects (leave-one-out)
• Satisfaction level s = ϕ (task incomplete), 1 (unsatisfied), 2, 3, 4, 5 (satisfied)
– N-gram: 1-gram, 2-gram, …, 8-gram
– Input sequences
• User dialog acts only (USR)
• System dialog acts only (SYS)
• Both user and system dialog acts (USRSYS)
• Evaluation metric
– Classification accuracy
Estimation method for the 6 classes of satisfaction
• Maximum-likelihood model selection
– For a given user’s input x, compute the likelihood under each satisfaction model
– The model with the maximum likelihood gives the estimation result
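The maximum-likelihood selection above can be sketched as a simple argmax over the per-class scores; the log-likelihood values below are made-up illustrative numbers:

```python
# Sketch of the maximum-likelihood class selection: score the user's
# dialog act sequence under each satisfaction model, pick the argmax.
# The log-likelihood values below are made-up illustrative numbers.
def estimate_class(log_likelihoods):
    """log_likelihoods: dict mapping class label -> log P(x | class model)."""
    return max(log_likelihoods, key=log_likelihoods.get)

scores = {"phi": -120.3, "1": -98.7, "2": -95.2,
          "3": -97.1, "4": -99.8, "5": -110.4}
print(estimate_class(scores))  # "2" has the highest log-likelihood
```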
Detection result for the 6 classes of satisfaction
• Using the system dialog act sequence only, with a 3-gram model: 34.4% accuracy
Confusion matrix
• 3-gram of the SYS sequence (rows: actual class, columns: estimated class)

Actual \ Estimated   ϕ    1    2    3    4    5
ϕ                    43   5    7    5    6    3
1                    0    7    8    9    11   3
2                    1    8    31   16   35   11
3                    0    9    22   23   45   8
4                    0    8    34   29   66   18
5                    0    4    5    6    24   8

• “Task incomplete” users (ϕ) are identified with relatively high accuracy and few false detections
• Even for satisfied users, there are few cases in which the estimate differs greatly from the actual level
User satisfaction considering the dialog history
• The degree of satisfaction the user feels changes as the dialog with the system is repeated
– The “satisfaction level” is surveyed at the end of these successive changes
[Figure: satisfaction (unsatisfied ← → satisfied) vs. number of dialog turns, with trajectories for users satisfied with the performance, users unsatisfied with the performance, and users who stopped using the system]
Modeling the N-gram
• Encoded dialog logs into dialog act symbols automatically
– User’s dialog acts
• Obtained from the speech recognition results
• Defined in the recognition dictionary
– System’s dialog acts
• Obtained from the system responses or acts
• Identical to the system’s internal acts
• A dialog act sequence x: the dialog act symbols arranged in time order t
• Created an N-gram model for each of the 6 satisfaction classes
– Witten-Bell smoothing, using the SRILM toolkit
Example of a dialog (U = User, S = System)
U: Hello (ko-n-ni-chi-wa)
S: Hello
U: Da-i-to-ka-i
S: Do you want to retrieve the song “Da-i-to-ka-i?”
U: Yes (ha-i)
S: Now, playing the song “Da-i-to-ka-i” by “Crystal King.”
U: Stop (te-i-shi)
S: Now, stopping.