Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System
LREC2010: O3 - Dialogue and Evaluation
Sunao Hara, Norihide Kitaoka, Kazuya Takeda
{naoh, kitaoka, kazuya.takeda}@nagoya-u.jp
Graduate School of Information Science, Nagoya University, Japan
Outline
1. Introduction
2. Musicnavi2 database
3. N-gram modeling
4. Estimation experiment
5. Conclusion

Introduction
• The aim of this study
– Construct an estimation model of user satisfaction for spoken dialog systems (SDSs) based on data collected in real PC environments
• Experiment
– Field experiment using an SDS for a music retrieval application
– Construct and evaluate an estimation model of user satisfaction using an N-gram history model
LREC2010: Sunao HARA et al., Nagoya Univ., Japan.
May 19, 2010
Background (1/2)
• Use of speech input applications (e.g. Skype)
by PC users is spreading
– More users may use Spoken Dialog Systems (SDSs)
via the Internet
• The acoustic properties of PC environments differ among
users
– e.g. microphones, noise conditions, etc.
⇒ Collect speech under realistic PC environments
• From a practical application standpoint
– Evaluation and prediction of system performance (user satisfaction) are also important issues
⇒ Build an estimation model for user satisfaction
Background (2/2)
• Evaluation using automatically measured metrics
– Tune system parameters at the design stage
– Select the best dialog strategy for SDS applications
– PARADISE framework [Walker, et al. 1997]
• Detection of problematic dialogs for call-center Interactive Voice Response (IVR) systems
– Detect as soon as possible that "the conversation will break down"
– Problematic dialog predictor using a Spoken Language Understanding (SLU) success feature [Walker, et al. 2002]
– N-gram-based call quality monitoring system [Kim 2007]
Can we estimate the user satisfaction of an SDS by modeling the dialog context?
MusicNavi2 database
• Field experiment using a music retrieval system with a spoken dialog interface
1. Download the system through the Internet
2. Use it for a certain period
3. Fill in questionnaires on the web page
• Music retrieval system: MusicNavi2
– "Music retrieval application" + "Spoken dialog interface"
– A spoken dialog interface for retrieving and playing songs stored on the user's PC
– Collects speech data in cooperation with a server program via the Internet
Example of a dialog (U = User, S = System)

U: Hello (ko-n-ni-chi-wa)
S: Hello.
U: Da-i-to-ka-i
S: Do you want to retrieve the song "Da-i-to-ka-i"?
U: Yes (ha-i)
S: Now playing the song "Da-i-to-ka-i" by "Crystal King."
U: Stop (te-i-shi)
S: Now stopping.
Data collection by the field test
• Large-scale field test through the Internet
– Subjects used MusicNavi2 on their own PCs
– Participants: 1369 subjects
– Total usage: 488 hours
• User's task
– Listen to at least five songs
– Perform at least twenty Q&A dialogs, or use the system for over forty minutes
• Questionnaire (answered only by "task complete" users)
– Satisfaction level for the SDS from 1 to 5
1: Extremely unsatisfied  2: Unsatisfied  3: Acceptable  4: Satisfied  5: Extremely satisfied
Distributions of the experimental subjects and the equipment used by them
• Subjects who answered questionnaires
– 449 subjects (278 males and 171 females)
– Total of 34,296 utterances
• [Bar chart: numbers of male and female subjects by age group (16~19, 20~29, 30~39 years old)]
• Microphone: headset 48%, pin or desktop 15%, inside of PC 5%, unknown 32%
• Loudspeaker / headphone: headphone 52%, loudspeaker 16%, inside of PC 13%, unknown 19%
Overview of the MusicNavi2 database
• [Histograms: frequency distributions of the number of utterances per subject, the word error rate [%], and the number of utterances per song played]
Pre-analysis of the MusicNavi2 database
• Classification of users by their satisfaction level
– “task complete” users : c = 1, 2, 3, 4, 5
– “task incomplete” users: c = ϕ
• Summary of data (total 518 subjects)

  c                       ϕ      1      2      3      4      5
  # of subjects           69     38     102    107    155    47
  # of utterances (avg.)  52.2   134.5  119.7  114.9  106.5  98.4
  WER [%]                 70.5   54.1   51.0   46.8   41.2   35.3
  Utterances / song       107    7.21   5.34   5.12   4.22   3.43
Modeling method
for the dialog context
• The dialog management of an SDS is designed by a dialog developer
– The management is not always satisfactory for users
• Assume that satisfaction appears in the dialog context
• Statistically learn the naturalness of the dialog
– Use N-grams to model the dialog context
– Construct a model for each class of users
– Estimate an unknown user's satisfaction based on the likelihood under each N-gram model
Spoken dialog logs to dialog act symbols
• The vocabulary size of the recognition dictionary
– That is, the number of songs
– Differs between users
• Word-level information is informative, but it is too sparse to treat statistically
• Use dialog act symbols for the user's/system's acts
– Defined 21 system dialog acts and 19 user dialog acts
Example of an encoded dialog (U = User, S = System)

U: Hello (ko-n-ni-chi-wa)                                    x1 = USR_CMD_HELLO
S: Hello.                                                    x2 = SYS_INFO_GREETING
U: Da-i-to-ka-i                                              x3 = USR_REQUEST_BYMUSIC
S: Do you want to retrieve the song "Da-i-to-ka-i"?          x4 = SYS_CONFIRM_KEYWORD
U: Yes (ha-i)                                                x5 = USR_CMD_YES
S: Now playing the song "Da-i-to-ka-i" by "Crystal King."    x6 = SYS_PLAY_SONG
U: Stop (te-i-shi)                                           x7 = USR_CMD_STOP
S: Now stopping.                                             x8 = SYS_INFO_STOPPED
Modeling the dialog act sequence by N-gram
• A dialog act sequence: x
– The dialog act symbols arranged in time order t
• N-gram probability (= likelihood) of the sequence given a model for a user class c
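The formula itself appears as an image in the original slides and is lost in this transcript; under the standard N-gram factorization it would read (a reconstruction, with x = x1 … xT the dialog act sequence and c the user class):

```latex
P(\mathbf{x} \mid c) = \prod_{t=1}^{T} P\left(x_t \mid x_{t-N+1}, \ldots, x_{t-1}, c\right)
```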
Estimation experiment
• Detection of the user's class using N-gram models
– Exp. 1: "task incomplete" users
– Exp. 2: "unsatisfied" users
• Experimental conditions
– N-gram: 1-gram, 2-gram, …, 8-gram
• Witten-Bell smoothing (using the SRILM toolkit)
– Input sequence: USR, SYS, SYSUSR
– Leave-one-out cross-validation
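The paper trains its models with the SRILM toolkit; purely as an illustration, here is a minimal from-scratch sketch of an interpolated Witten-Bell bigram model over dialog act symbols, exposing the per-sequence log-likelihood that a per-class detector would threshold. The class interface, the `<s>` padding symbol, and the uniform base distribution are assumptions, not details from the slides.

```python
from collections import Counter, defaultdict
import math

class WittenBellBigram:
    """Bigram model over dialog act symbols with Witten-Bell smoothing
    (an illustrative re-implementation; SRILM's exact discounting
    details may differ)."""

    def __init__(self, sequences, vocab):
        self.vocab = set(vocab)
        self.uni = Counter()
        self.bi = defaultdict(Counter)
        for seq in sequences:
            padded = ["<s>"] + list(seq)
            for h, w in zip(padded, padded[1:]):
                self.uni[w] += 1
                self.bi[h][w] += 1

    def p_uni(self, w):
        # Unigram interpolated with a uniform distribution over the vocab:
        # P(w) = (c(w) + T * (1/V)) / (N + T), T = # distinct observed types
        n = sum(self.uni.values())
        t = len(self.uni)
        return (self.uni[w] + t / len(self.vocab)) / (n + t)

    def p(self, w, h):
        # P(w|h) = (c(h,w) + T(h) * P_uni(w)) / (c(h) + T(h)),
        # T(h) = # distinct types observed after history h
        c_h = sum(self.bi[h].values())
        t_h = len(self.bi[h])
        if c_h == 0:
            return self.p_uni(w)
        return (self.bi[h][w] + t_h * self.p_uni(w)) / (c_h + t_h)

    def log_likelihood(self, seq):
        padded = ["<s>"] + list(seq)
        return sum(math.log(self.p(w, h)) for h, w in zip(padded, padded[1:]))
```

One such model would be trained per user class; at test time the class whose model assigns the input sequence the highest likelihood (or a likelihood-ratio threshold, for detection) decides the output.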
Estimation experiment
• Detection method
– Model selection by thresholding the likelihood ratio
• Evaluation metrics
– ROC curve (true detection rate vs. false detection rate)
– Area under the ROC curve (AUC)
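The AUC can be computed without explicitly tracing the ROC curve. A minimal sketch (not the paper's tooling) using the rank-statistic equivalence: AUC equals the probability that a randomly chosen positive-class score outranks a randomly chosen negative-class one.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the Mann-Whitney U statistic,
    equivalent to sweeping the decision threshold over all values.
    Ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Here the scores would be the per-user likelihood ratios; perfectly separated classes give AUC = 1.0, chance-level scores give about 0.5.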
AUC (Area under the ROC curve)

• "task incomplete" users

  N        SYS    USR    SYSUSR
  1-gram   0.901  0.873  0.927
  2-gram   0.948  0.929  0.977
  3-gram   0.989  0.954  0.993
  4-gram   0.995  0.952  0.997
  5-gram   0.993  0.954  0.995
  6-gram   0.989  0.951  0.995
  7-gram   0.988  0.946  0.995
  8-gram   0.987  0.936  0.994

• "unsatisfied" users

  N        SYS    USR    SYSUSR
  1-gram   0.611  0.638  0.619
  2-gram   0.628  0.644  0.724
  3-gram   0.591  0.651  0.704
  4-gram   0.583  0.681  0.739
  5-gram   0.629  0.662  0.739
  6-gram   0.632  0.639  0.761
  7-gram   0.604  0.633  0.765
  8-gram   0.592  0.622  0.756

• High detection performance for "task incomplete" users when using the system dialog acts
• Suggests the effectiveness of using both system and user dialog acts
Detection result of "task incomplete" users
• SYSUSR: the 4-gram model achieved a 100% true detection rate with a 6% false detection rate

  N        AUC
  1-gram   0.927
  2-gram   0.977
  3-gram   0.993
  4-gram   0.997
  5-gram   0.995
  6-gram   0.995
  7-gram   0.995
  8-gram   0.994
Detection result of "unsatisfied" users
• SYSUSR: the larger N is, the lower the false detection rate becomes

  N        AUC
  1-gram   0.619
  2-gram   0.724
  3-gram   0.704
  4-gram   0.739
  5-gram   0.739
  6-gram   0.761
  7-gram   0.765
  8-gram   0.756
Conclusion
• Estimation method of user satisfaction using an N-gram-based dialog history model for SDSs
– Constructed a database collected in real PC environments
– Achieved high performance in the detection of "task incomplete" users
• 100% true detection rate at a 6% false detection rate
– Performance was not sufficient for the detection of "unsatisfied" users
– N-gram models (N > 1) were shown to be effective by comparison with the 1-gram model
– Using both system and user dialog acts was effective
• Future work
– N-gram model-based estimation of dialog failure (online detection)
– Analysis of the dialog contexts that affect user satisfaction
– An integrated method using acoustic, prosodic, dialog, and other features
• Thanks for your kind attention!
Modeling the dialog act sequence by N-gram
• Dialog logs are encoded into dialog act symbols automatically
– User's dialog acts: derived from the speech recognition results; they are defined in the recognition dictionary
– System's dialog acts: derived from the system prompts or responses; they are the same as the system's internal acts
• A dialog act sequence: x
– The dialog act symbols arranged in time order t
• N-gram probability (= likelihood) of the sequence given a model with a satisfaction level s
Detection by thresholding
• Model selection by an a posteriori odds classifier
• Introduce a priori odds 1/α and a Bayes factor B
• α = 1 yields the maximum-likelihood (ML) classifier
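The classifier's equations appear as images in the original slides and are lost here; a reconstruction consistent with the bullet points (with c the target class, c̄ its complement, and x the dialog act sequence, all symbols assumed rather than quoted) is:

```latex
\frac{P(c \mid \mathbf{x})}{P(\bar{c} \mid \mathbf{x})}
  = \underbrace{\frac{P(c)}{P(\bar{c})}}_{1/\alpha}
    \cdot
    \underbrace{\frac{P(\mathbf{x} \mid c)}{P(\mathbf{x} \mid \bar{c})}}_{B}
\;\gtrless\; 1
\quad\Longleftrightarrow\quad
B \;\gtrless\; \alpha
```

Setting α = 1 removes the prior term, leaving a pure likelihood-ratio (ML) decision; sweeping α traces out the ROC curve.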
Satisfaction estimation experiment with 6 classes
• Estimation of the user satisfaction class using N-gram models
• Experimental conditions
– A model for each satisfaction level was trained on the 517 subjects remaining after excluding the one evaluation subject (leave-one-out)
• Satisfaction level s = ϕ (task incomplete), 1 (unsatisfied), 2, 3, 4, 5 (satisfied)
– N-gram: 1-gram, 2-gram, …, 8-gram
– Input sequences
• User dialog acts only (USR)
• System dialog acts only (SYS)
• Both user and system dialog acts (USRSYS)
• Evaluation metric
– Classification accuracy
Estimation method for satisfaction (6 classes)
• Selection of the maximum-likelihood model
– For a given user input x, compute the likelihood under each satisfaction model
– The model with the maximum likelihood gives the estimation result
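The maximum-likelihood selection above can be sketched in a few lines. The `models` interface here is hypothetical, standing in for the per-class N-gram models: a mapping from class label to a function returning log P(sequence | class).

```python
def estimate_satisfaction(models, sequence):
    """Pick the satisfaction class whose model assigns the observed
    dialog act sequence the highest (log-)likelihood.

    models   -- dict mapping class labels (e.g. "phi", 1, ..., 5) to
                callables returning log P(sequence | class)
    sequence -- list of dialog act symbols
    """
    return max(models, key=lambda c: models[c](sequence))
```

With log-probabilities, ties aside, this is exactly argmax_s P(x | s); a prior term could be added per class to recover the odds classifier of the previous slide.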
Detection result for 6 classes of satisfaction
• Using only the system (SYS) sequence, the 3-gram model achieved 34.4% accuracy
Confusion matrix
• 3-gram of the SYS sequence (rows: actual, columns: estimated)

         ϕ    1    2    3    4    5
  ϕ     43    5    7    5    6    3
  1      0    7    8    9   11    3
  2      1    8   31   16   35   11
  3      0    9   22   23   45    8
  4      0    8   34   29   66   18
  5      0    4    5    6   24    8

• Task-incomplete users (ϕ) are identified with relatively high accuracy and few false detections
• Even for satisfied users, there are few cases where the estimate differs greatly from the actual level
User satisfaction considering the dialog history
• The degree of satisfaction a user feels changes as dialogs with the system are repeated
– The "satisfaction level" is surveyed only at the end of this gradual change
• [Figure: satisfaction (satisfied ↔ unsatisfied) plotted against the number of dialog turns, with trajectories for users satisfied with the performance, unsatisfied with the performance, and those who stopped using the system]
Modeling the N-gram
• Dialog logs are encoded into dialog act symbols automatically
– User's dialog acts: derived from the speech recognition results; they are defined in the recognition dictionary
– System's dialog acts: derived from the system responses or acts; they are the same as the system's internal acts
• A dialog act sequence: x
– The dialog act symbols arranged in time order t
• An N-gram model is built for each of the 6 satisfaction classes
– Witten-Bell smoothing, using the SRILM toolkit
Example of a dialog (U = User, S = System)

U: Hello (ko-n-ni-chi-wa)
S: Hello.
U: Da-i-to-ka-i
S: Do you want to retrieve the song "Da-i-to-ka-i"?
U: Yes (ha-i)
S: Now playing the song "Da-i-to-ka-i" by "Crystal King."
U: Stop (te-i-shi)
S: Now stopping.