Interactively Optimizing
Information Retrieval Systems as
a Dueling Bandits Problem
ICML 2009
Yisong Yue
Thorsten Joachims
Cornell University
Learning To Rank
• Supervised Learning Problem
– Extension of classification/regression
– Relatively well understood
– High applicability in Information Retrieval
• Requires explicitly labeled data
– Expensive to obtain
– Expert-judged labels == search user utility?
– Doesn’t generalize to other search domains.
Our Contribution
• Learn from implicit feedback (users’ clicks)
– Reduce labeling cost
– More representative of end user information needs
• Learn using pairwise comparisons
– Humans are more adept at making pairwise judgments
– Via Interleaving [Radlinski et al., 2008]
• On-line framework (Dueling Bandits Problem)
– We leverage users when exploring new retrieval functions
– Exploration vs exploitation tradeoff (regret)
Team-Game Interleaving
(u=thorsten, q=“svm”)

f1(u,q) → r1:
1. Kernel Machines
   http://svm.first.gmd.de/
2. Support Vector Machine
   http://jbolivar.freeservers.com/
3. An Introduction to Support Vector Machines
   http://www.support-vector.net/
4. Archives of SUPPORT-VECTOR-MACHINES ...
   http://www.jiscmail.ac.uk/lists/SUPPORT...
5. SVM-Light Support Vector Machine
   http://ais.gmd.de/~thorsten/svm light/

f2(u,q) → r2:
1. Kernel Machines
   http://svm.first.gmd.de/
2. SVM-Light Support Vector Machine
   http://ais.gmd.de/~thorsten/svm light/
3. Support Vector Machine and Kernel ... References
   http://svm.research.bell-labs.com/SVMrefs.html
4. Lucent Technologies: SVM demo applet
   http://svm.research.bell-labs.com/SVT/SVMsvt.html
5. Royal Holloway Support Vector Machine
   http://svm.dcs.rhbnc.ac.uk

Interleaving(r1,r2):
1. Kernel Machines (T2)
   http://svm.first.gmd.de/
2. Support Vector Machine (T1)
   http://jbolivar.freeservers.com/
3. SVM-Light Support Vector Machine (T2)
   http://ais.gmd.de/~thorsten/svm light/
4. An Introduction to Support Vector Machines (T1)
   http://www.support-vector.net/
5. Support Vector Machine and Kernel ... References (T2)
   http://svm.research.bell-labs.com/SVMrefs.html
6. Archives of SUPPORT-VECTOR-MACHINES ... (T1)
   http://www.jiscmail.ac.uk/lists/SUPPORT...
7. Lucent Technologies: SVM demo applet (T2)
   http://svm.research.bell-labs.com/SVT/SVMsvt.html

Invariant: for all k, in expectation the top k results contain the same number of team members from each team.

Interpretation: (r2 ≻ r1) ↔ clicks(T2) > clicks(T1)
[Radlinski, Kurup, Joachims; CIKM 2008]
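A round of team-draft interleaving like the one above can be sketched in a few lines. This is a minimal Python reading of the scheme, not the exact implementation from Radlinski et al. (2008); all function and variable names are our own. Each round, the team with fewer contributions so far (coin flip on ties) adds its highest-ranked result not yet shown.

```python
import random

def team_draft_interleave(r1, r2, rng=random.Random(0)):
    """Sketch of team-draft interleaving: merge rankings r1 and r2 into one
    list, tagging each result with the team (T1 or T2) that contributed it."""
    interleaved, teams = [], []
    pools = {"T1": list(r1), "T2": list(r2)}
    count = {"T1": 0, "T2": 0}

    def next_unseen(team):
        # Drop results already contributed by the other team.
        pool = pools[team]
        while pool and pool[0] in interleaved:
            pool.pop(0)
        return pool

    while next_unseen("T1") or next_unseen("T2"):
        if count["T1"] != count["T2"]:
            team = "T1" if count["T1"] < count["T2"] else "T2"
        else:
            team = rng.choice(["T1", "T2"])   # coin flip on ties
        if not next_unseen(team):             # chosen team exhausted
            team = "T1" if team == "T2" else "T2"
        interleaved.append(pools[team].pop(0))
        teams.append(team)
        count[team] += 1
    return interleaved, teams
```

The invariant from the slide follows from the lower-count-picks-next rule: until one team runs out, any top-k prefix holds (in expectation) the same number of results from each team.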
Dueling Bandits Problem
• Continuous space bandits F
– E.g., parameter space of retrieval functions (i.e., weight vectors)
• Each time step compares two bandits
– E.g., interleaving test on two retrieval functions
– Comparison is noisy & independent
• Choose pair (f_t, f_t') to minimize regret:
  Δ_T = Σ_{t=1..T} [ P(f* ≻ f_t) + P(f* ≻ f_t') − 1 ]
• (% of users who prefer the best bandit over the chosen ones)

• Example 1: P(f* ≻ f) = 0.9, P(f* ≻ f') = 0.8, incurred regret = 0.7
• Example 2: P(f* ≻ f) = 0.7, P(f* ≻ f') = 0.6, incurred regret = 0.3
• Example 3: P(f* ≻ f) = 0.51, P(f* ≻ f') = 0.55, incurred regret = 0.06
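The per-step regret in the examples above is just the sum of the two preference probabilities minus one; a one-line sketch (the function name is ours):

```python
def step_regret(p_star_vs_f, p_star_vs_fprime):
    """Per-time-step regret from the talk's definition:
    P(f* > f_t) + P(f* > f_t') - 1."""
    return p_star_vs_f + p_star_vs_fprime - 1.0
```

Note that regret is zero exactly when both chosen bandits duel the best bandit to a coin flip, i.e. both probabilities equal 0.5.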
Modeling Assumptions
• Each bandit f ∈ F has intrinsic value v(f)
– Never observed directly
– Assume v(f) is strictly concave (unique f*)
• Comparisons based on v(f)
– P(f ≻ f') = σ( v(f) − v(f') )
– P is L-Lipschitz
– For example: σ(x) = 1 / (1 + exp(−x))
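This comparison model is easy to simulate. A minimal sketch, assuming the logistic link from the slide (function names are ours):

```python
import math
import random

def logistic(x):
    """The example link function from the slide: sigma(x) = 1/(1+exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def noisy_compare(v_f, v_g, rng):
    """One noisy, independent comparison: returns True if g beats f,
    with P(g > f) = sigma(v(g) - v(f))."""
    return rng.random() < logistic(v_g - v_f)
```

Since σ(0) = 0.5, two bandits of equal value win equally often, and a large value gap makes the duel nearly deterministic.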
Probability Functions
[Figure: plot of example link functions σ]
Dueling Bandit Gradient Descent
• Maintain f_t
– Compare with f_t' (close to f_t, defined by the explore step size)
– Update if f_t' wins the comparison
• Expectation of the update is close to the gradient of P(f_t ≻ f')
– Builds on Bandit Gradient Descent [Flaxman et al., 2005]
[Figure: animation of one update; δ – explore step size, γ – exploit step size; current point, losing candidate, winning candidate]
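The update loop just described can be sketched as follows. This is a minimal reading of DBGD under stated assumptions, not the authors' exact implementation: `prob_win(f, g)` stands in for a live interleaving test, and all names are ours.

```python
import math
import numpy as np

def dbgd(d, delta, gamma, T, prob_win, f0=None, rng=None):
    """Dueling Bandit Gradient Descent (sketch). delta is the explore step
    size, gamma the exploit step size. prob_win(f, g) gives the probability
    that candidate g beats f in one noisy, independent comparison."""
    rng = rng or np.random.default_rng(0)
    f = np.zeros(d) if f0 is None else np.asarray(f0, dtype=float)
    for _ in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)              # random exploration direction
        g = f + delta * u                   # nearby candidate f_t'
        if rng.random() < prob_win(f, g):   # noisy duel: did f_t' win?
            f = f + gamma * u               # exploit step toward the winner
    return f
```

With a strictly concave value function such as v(x) = −xᵀx and a logistic link, the iterate drifts toward the optimum, because directions pointing uphill in value win the duel more often than they lose.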
Analysis (Sketch)
• Dueling Bandit Gradient Descent
– Sequence of partially convex functions ct(f) = P(ft > f)
– Random binary updates (expectation close to gradient)
• Bandit Gradient Descent [Flaxman et al., SODA 2005]
– Sequence of convex functions
– Use randomized update
(expectation close to gradient)
– Can be extended to our setting
(Assumes more information)
Analysis (Sketch)
• Convex functions satisfy c(x) − c(x*) ≤ ∇c(x) · (x − x*)
– Both additive and multiplicative error
– Depends on exploration step size δ
– Main analytical contribution: bounding the multiplicative error
Regret Bound
• Regret grows as O(T^(3/4)):
  E[Δ_T] ≤ 2 T^(3/4) √(10RdL)
• Average regret shrinks as O(T^(−1/4))
– In the limit, we do as well as knowing f* in hindsight
  Δ_T = Σ_{t=1..T} [ P(f* ≻ f_t) + P(f* ≻ f_t') − 1 ]
• δ = O(T^(−1/4)), γ = O(T^(−1/2))
Practical Considerations
• Need to set step size parameters
– Depends on P(f > f’)
• Cannot be set optimally
– We don’t know the specifics of P(f > f’)
– Algorithm should be robust to parameter settings
• Set parameters approximately in experiments
Regret Comparison DBGD vs BGD
[Figure: average regret (0 to 1) vs. iteration (10 to ~9530) for DBGD, BGD 1, and BGD 2]
• 50-dimensional parameter space
• Value function v(x) = −xᵀx
• Logistic transfer function
• A random point has regret of almost 1
• More experiments in the paper.
Web Search Simulation
• Leverage web search dataset
– 1000 Training Queries, 367 Dimensions
• Simulate “users” issuing queries
– Value function based on NDCG@10 (ranking measure)
– Use logistic to make probabilistic comparisons
• Use linear ranking function.
• Not intended to compete with supervised learning
– Feasibility check for online learning w/ users
– Supervised labels difficult to acquire “in the wild”
Web Simulation Results
[Figure: training NDCG@10 (0.48 to 0.62) vs. number of queries (0 to ~9.45M) for Sample 1, Sample 10, Sample 100, and Ranking SVM]
• Chose parameters with best final performance
• Curves basically identical for validation and test sets (no over-fitting)
• Sampling multiple queries makes no difference
What Next?
• Better simulation environments
– More realistic user modeling assumptions
• DBGD simple and extensible
– Incorporate pairwise document preferences
– Deal with ranking discontinuities
• Test on real search systems
– Varying scales of user communities
– Sheds insight on / guides future development
Extra Slides
Active vs Passive Learning
• Passive Data Collection (offline)
– Biased by current retrieval function
• Point-wise Evaluation
– Design retrieval function offline
– Evaluate online
• Active Learning (online)
– Automatically propose new rankings to evaluate
– Our approach
Relative vs Absolute Metrics
• Our framework based on relative metrics
– E.g., comparing pairs of results or rankings
– Relatively recent development
• Absolute Metrics
– E.g., absolute click-through rate
– More common in the literature
– Suffers from presentation bias
– Less robust to the many different sources of noise
What Results do Users View/Click?
[Figure: # times result selected (0 to 180) and mean time (s) spent in abstract vs. rank of result (1 to 11); both drop sharply with rank. Time spent in each result by frequency of doc selected.]
[Joachims et al., TOIS 2007]
Analysis (Sketch)
• Convex functions satisfy c(x) − c(x*) ≤ ∇c(x) · (x − x*)
– We have both multiplicative and additive error
– Depends on exploration step size δ
– Main technical contribution: bounding the multiplicative error
• Existing results yield sub-linear bounds on:
  E[ Σ_{t=1..T} P(f_t ≻ f_t') − P(f_t ≻ f*) ]
Analysis (Sketch)
T

• We know how to bound E  P( f t  f t )  P( f t  f *)
 t 1

T
• Regret: T   P( f *  f t )  P( f *  f t ' )  1
t 1
• We can show using Lipschitz and symmetry of σ:
T

ET   2E  P( f t  f t )  P( f t  f *)  LT
 t 1

More Simulation Experiments
• Logistic transfer function σ(x) = 1/(1+exp(-x))
• 4 choices of value functions
• δ, γ set approximately
NDCG
• Normalized Discounted Cumulative Gain
• Multiple levels of relevance
• DCG:
– contribution of the ith rank position: (2^(y_i) − 1) / log(i + 1)
– Ex: a ranking with relevance labels (1, 2, 1, 0, 1) has DCG score
  1/log(2) + 3/log(3) + 1/log(4) + 0/log(5) + 1/log(6) ≈ 5.45
• NDCG is normalized DCG
– the best possible ranking has score NDCG = 1
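The DCG example above can be checked directly. A minimal sketch, assuming the natural logarithm and relevance labels (1, 2, 1, 0, 1), which together reproduce the slide's 5.45 (function names are ours):

```python
import math

def dcg(labels):
    """DCG with the slide's per-position contribution
    (2^y - 1) / log(i + 1), natural log, ranks i starting at 1."""
    return sum((2 ** y - 1) / math.log(i + 1)
               for i, y in enumerate(labels, start=1))

def ndcg(labels):
    """NDCG: DCG normalized by the best possible ordering of the same
    labels, so a perfect ranking scores exactly 1."""
    best = dcg(sorted(labels, reverse=True))
    return dcg(labels) / best if best > 0 else 0.0
```

Note NDCG is a per-query quantity; NDCG@10 as used in the simulation simply truncates the label list to the top 10 positions before applying the same formulas.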
Considerations
• NDCG is discontinuous w.r.t. function parameters
– Try larger values of δ, γ
– Try sampling multiple queries per update
• Homogeneous user values
– NDCG@10
– Not an optimization concern
– Modeling limitation
• Not intended to compete with supervised learning
– Sanity check of feasibility for online learning w/ users