Online Search Evaluation with Interleaving
Filip Radlinski
Microsoft
Acknowledgments
• This talk involves joint work with
– Olivier Chapelle
– Nick Craswell
– Katja Hofmann
– Thorsten Joachims
– Madhu Kurup
– Anne Schuth
– Yisong Yue
Motivation
Baseline Ranking Algorithm
Proposed Ranking Algorithm
Which is better?
Retrieval evaluation
Two types of retrieval evaluation:
• Offline evaluation
Ask experts or users to explicitly evaluate your retrieval
system. This dominates evaluation research today.
• Online evaluation
See how normal users interact with your retrieval
system when just using it.
Most well known type: A/B tests
A/B testing
• Each user is assigned to one of two conditions
• They might see the left ranking (Ranking A) or the right ranking (Ranking B)
• Measure user interaction with their assigned ranking (e.g. clicks)
• Look for differences between the populations (a sketch of such a comparison follows below)
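As a concrete illustration, here is a minimal sketch of such a between-population comparison, assuming we log one binary "did the user click?" outcome per query impression in each condition. The two-proportion z-test and the function name are illustrative choices, not something prescribed in the talk.

```python
# A minimal sketch of an A/B comparison on click-through rate (illustrative, not from
# the talk). Assumes one binary click outcome is logged per query impression.
import numpy as np
from scipy import stats

def ab_click_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Two-proportion z-test on the click-through rates of the two populations."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    # Pooled standard error under the null hypothesis of equal click rates.
    p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_a - p_b) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided
    return p_a, p_b, p_value

# Example with made-up counts:
# ab_click_test(clicks_a=5200, impressions_a=100000, clicks_b=5350, impressions_b=100000)
```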
Online evaluation with interleaving
• A within-user online ranker comparison
– Presents results from both rankings to every user
[Figure: Ranking A and Ranking B are combined (randomized) into the single ranking shown to users]
• The ranking that gets more of the clicks wins
– Designed to be unbiased, and much more sensitive than A/B
Team draft interleaving

Ranking A
1. Napa Valley – The authority for lodging... (www.napavalley.com)
2. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries)
3. Napa Valley College (www.napavalley.edu/homex.asp)
4. Been There | Tips | Napa Valley (www.ivebeenthere.co.uk/tips/16681)
5. Napa Valley Wineries and Wine (www.napavintners.com)
6. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)

Ranking B
1. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley)
2. Napa Valley – The authority for lodging... (www.napavalley.com)
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...)
4. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com)
5. NapaValley.org (www.napavalley.org)
6. The Napa Valley Marathon (www.napavalleymarathon.org)

Presented Ranking (contributing team in brackets)
1. Napa Valley – The authority for lodging... (www.napavalley.com) [A]
2. Napa Country, California – Wikipedia (en.wikipedia.org/wiki/Napa_Valley) [B]
3. Napa: The Story of an American Eden... (books.google.co.uk/books?isbn=...) [B]
4. Napa Valley Wineries – Plan your wine... (www.napavalley.com/wineries) [A]
5. Napa Valley Hotels – Bed and Breakfast... (www.napalinks.com) [B]
6. Napa Valley College (www.napavalley.edu/homex.asp) [A]
7. NapaValley.org (www.napavalley.org) [B]

[Radlinski et al. 2008]
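The construction shown above can be written down compactly. Below is a minimal sketch of Team Draft interleaving in Python, following the description in Radlinski et al. (2008): the two rankers take turns contributing their highest-ranked result not yet shown, with ties in turn order broken by a coin flip, and each click is later credited to the team that contributed the clicked result. Function and variable names are illustrative.

```python
# A minimal sketch of Team Draft interleaving; names are illustrative.
import random

def team_draft_interleave(ranking_a, ranking_b, length):
    """Interleave two rankings; return the combined list and each result's team."""
    interleaved, teams = [], {}
    team_a, team_b = [], []  # results contributed by each team so far
    while len(interleaved) < length:
        a_next = next((d for d in ranking_a if d not in interleaved), None)
        b_next = next((d for d in ranking_b if d not in interleaved), None)
        if a_next is None and b_next is None:
            break  # both rankings exhausted
        # The team that has contributed fewer results picks next; ties are a coin flip.
        a_picks = (len(team_a) < len(team_b)) or \
                  (len(team_a) == len(team_b) and random.random() < 0.5)
        if a_picks and a_next is not None:
            interleaved.append(a_next); team_a.append(a_next); teams[a_next] = 'A'
        elif b_next is not None:
            interleaved.append(b_next); team_b.append(b_next); teams[b_next] = 'B'
        else:  # ranker B is exhausted, fall back to A
            interleaved.append(a_next); team_a.append(a_next); teams[a_next] = 'A'
    return interleaved, teams

def team_draft_winner(clicked_results, teams):
    """Credit each click to its team; the team with more credited clicks wins."""
    a = sum(1 for d in clicked_results if teams.get(d) == 'A')
    b = sum(1 for d in clicked_results if teams.get(d) == 'B')
    return 'A' if a > b else 'B' if b > a else 'Tie'
```

In the Napa Valley example above, a call like team_draft_interleave(ranking_a, ranking_b, length=7) produces one of the possible presented rankings, with team labels used for scoring the user's clicks.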
Team draft interleaving
The same example as above: each click on the presented ranking is credited to the team (A or B) that contributed the clicked result. In this example the clicks end up credited equally to both teams, so the outcome for this query is a Tie!
[Radlinski et al. 2008]
Why might mixing rankings help?
• Suppose results are worth money. For some query:
– Ranker A: the user clicks
– Ranker B: the user also clicks
• Users of A may not know what they're missing
– Difference in behaviour is small
• But if we can mix up results from A & B → strong preference for B
Comparison with A/B metrics
[Figure: probability of disagreement and p-value as a function of query set size, for two pairs of real Yahoo! rankers (Yahoo! Pair 1 and Yahoo! Pair 2)]
• Experiments with real Yahoo! rankers (very small differences in relevance)
[Chapelle et al. 2012]
The interleaving click model
• Click == Good
• Interleaving corrects for position bias
• Yet there are other sources of bias, such as bolding
[Yue et al. 2010a]
The interleaving click model
[Figure: click frequency on the bottom result, by rank of results]
• Bars should be equal if there was no effect of bolding
[Yue et al. 2010a]
Sometimes clicks aren’t even good
• Satisfaction of a click can be estimated
– Time spent on URLs is informative (see the sketch below)
– More sophisticated models also consider the query and document (some documents require more effort)
[Kim et al. WSDM 2014]
• Time before clicking is another efficiency metric
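For instance, a very simple dwell-time rule is often used as a first approximation. The sketch below is such a heuristic; the 30-second threshold is a common convention from the literature rather than a number given in this talk, and Kim et al. (2014) describe much richer query- and document-dependent models.

```python
# A crude dwell-time heuristic for click satisfaction (illustrative assumption;
# the 30-second threshold is a common convention, not a value from the talk).

def is_satisfied_click(dwell_time_seconds, threshold=30.0):
    """Treat a click as satisfied if the user stayed on the result long enough."""
    return dwell_time_seconds >= threshold

# e.g. is_satisfied_click(4.2) -> False (a quick bounce back to the results page)
```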
Newer A/B metrics
• Newer A/B metrics can incorporate these signals
– Time before clicking
– Time spent on result documents
– Estimated user satisfaction
– Bias in click signal, e.g. position
– Anything else the domain expert cares about
• Suppose I’ve picked an A/B metric and assume it to be my target
– I just want to measure it more quickly
– Can I use interleaving?
An A/B metric as a gold standard
• Does interleaving agree with these AB metrics?
AB Metric                        Team Draft Agreement
Is Page Clicked?                 63 %
Clicked @ 1?                     71 %
Satisfied Clicked?               71 %
Satisfied Clicked @ 1?           76 %
Time-to-click                    53 %
Time-to-click @ 1                45 %
Time-to-satisfied-click          47 %
Time-to-satisfied-click @ 1      42 %
[Schuth et al. SIGIR 2015]
An A/B metric as a gold standard
• Suppose we parameterize the clicks:
– Optimize to maximize agreement with our AB metric
• In particular (sketched in code below):
– Only include clicks where the predicted probability of satisfaction is above a threshold t:
  TDI_S = \sum_{c \in \mathrm{clicks}} \mathbf{1}[\, P(c \text{ is satisfied}) > t \,]
– Score clicks based on the time to satisfied click:
  TDI_{T,S} = \sum_{c \in \mathrm{clicks}} \mathrm{TimeToClick}(c) \times \mathbf{1}[\, c \text{ is satisfied} \,]
– Learn a linear weighted combination of these
[Schuth et al. SIGIR 2015]
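A minimal sketch of these parameterized scores, assuming each logged click carries its team ('A' or 'B'), a predicted probability of satisfaction, a satisfaction label, and the time from query to click. The dictionary field names and the per-team aggregation are illustrative, not the exact implementation of Schuth et al. (2015).

```python
# Parameterized team-draft click scoring (sketch; field names are illustrative).

def tdi_s(clicks, t=0.5):
    """Count only clicks whose predicted satisfaction probability exceeds t."""
    score = {'A': 0.0, 'B': 0.0}
    for c in clicks:
        if c['p_sat'] > t:
            score[c['team']] += 1.0
    return score

def tdi_ts(clicks):
    """Weight satisfied clicks by the time the user took to reach them."""
    score = {'A': 0.0, 'B': 0.0}
    for c in clicks:
        if c['satisfied']:
            score[c['team']] += c['time_to_click']
    return score

def tdi_learned(clicks, weights, t=0.5):
    """Linear weighted combination of per-click features, fit to an A/B metric."""
    score = {'A': 0.0, 'B': 0.0}
    for c in clicks:
        features = [1.0, float(c['p_sat'] > t),
                    c['time_to_click'] * float(c['satisfied'])]
        score[c['team']] += sum(w * f for w, f in zip(weights, features))
    return score
```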
An A/B metric as a gold standard
AB Metric                        Team Draft Agreement (1/80th size)   Learned (to each metric)   AB Self-Agreement on Subset (1/80th size)
Is Page Clicked?                 63 %        84 % +      63 %
Clicked @ 1?                     71 % *      75 % +      62 %
Satisfied Clicked?               71 % *      85 % +      61 %
Satisfied Clicked @ 1?           76 % *      82 % +      60 %
Time-to-click                    53 %        68 % +      58 %
Time-to-click @ 1                45 %        56 % +      59 %
Time-to-satisfied-click          47 %        63 % +      59 %
Time-to-satisfied-click @ 1      42 %        50 % +      60 %
The right parameters
AB Metric             Team Draft Agreement   Learned Combined   Learned (P(Sat) only)   Learned (Time to click × P(Sat))
Satisfied Clicked?    71 %                   85 % +             84 % +                  48 % –
Learned threshold                            P(Sat) > 0.76      P(Sat) > 0.26           P(Sat) > 0.5
• The optimal filtering parameter need not match the metric definition
• But having the right feature is essential
Statistical Power
Does this cost sensitivity?
[Figure: statistical power of Team Draft interleaving compared with the "Is Sat clicked" A/B metric]
What if you instead know how you value user actions?
• Suppose we don’t have an AB metric in mind
• Instead, suppose we know how to value users’ behavior on changed documents:
– If a user clicks on a document that moved up k positions, how much is it worth?
– If a user spends time t before clicking, how much is it worth?
– If a user spends time t’ on a document, how much is it worth?
[Radlinski & Craswell, WSDM 2013]
Example credit function
• The value of a click is proportional to how far the document moved between A and B:
  \delta_i^{\mathrm{Lin}} = \mathrm{rank}^*(l_i, B) - \mathrm{rank}^*(l_i, A)
• Example (see the sketch below):
– A: 1 2 3
– B: 2 3 1
– Any click on 1 gives credit +2
– Any click on 2 gives credit -1
– Any click on 3 gives credit -1
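A minimal sketch of this credit function, using the sign convention implied by the example (positive credit counts as evidence for ranker A). How to handle documents missing from one of the rankings is an assumption here, e.g. treating them as ranked just past the end of the list.

```python
# Linear rank-difference credit for a clicked document (sketch; sign convention:
# positive credit is evidence for ranker A, matching the example above).

def linear_credit(doc, ranking_a, ranking_b):
    """Credit = how many positions higher ranker A placed the document than ranker B."""
    def rank(ranking):
        # Assumption: documents missing from a ranking count as one past its end.
        return ranking.index(doc) + 1 if doc in ranking else len(ranking) + 1
    return rank(ranking_b) - rank(ranking_a)

A = [1, 2, 3]
B = [2, 3, 1]
for doc in [1, 2, 3]:
    print(doc, linear_credit(doc, A, B))  # 1 -> +2, 2 -> -1, 3 -> -1
```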
Interleaving (making the rankings)
[Figure: Ranker A and Ranker B each return a ranking over documents 1, 2, 3. The allowed interleaved rankings are shown with probabilities p1 ... p5; Team Draft corresponds to showing two of these rankings with probability 50% each.]
We generate a set of rankings that are similar to those returned by A and B in an A/B test.
We have an optimization problem!
• We have a set of allowed rankings
  \{\, L : \forall k,\ \exists i, j \text{ such that } L_{1..k} = A_{1..i} \cup B_{1..j} \,\}
• We specified how clicks translate to credit
• We solve for the probabilities (see the sketch below):
– The probabilities of showing the rankings add up to 1:
  \sum_i p_i = 1
– The expected credit given random clicking is zero:
  \sum_i p_i \times E[\mathrm{credit} \mid L_i] = 0
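To make the setup concrete, here is a minimal sketch of these two constraints as a linear program, using the small example from the credit-function slide and a deliberately simple random-click user (one who picks a cutoff k at random and clicks one of the top k results at random). The random-click model and the use of scipy's linprog are assumptions for illustration; the actual method in Radlinski & Craswell (2013) additionally optimizes sensitivity, which the next slide discusses.

```python
# Feasibility sketch for the optimized-interleaving constraints (illustrative only:
# simple random-click model, no sensitivity objective yet).
from itertools import permutations
import numpy as np
from scipy.optimize import linprog

A = [1, 2, 3]
B = [2, 3, 1]
docs = A  # both rankings cover the same documents in this toy example

def is_allowed(L):
    """Every prefix of L must be the union of a prefix of A and a prefix of B."""
    return all(any(set(A[:i]) | set(B[:j]) == set(L[:k])
                   for i in range(k + 1) for j in range(k + 1))
               for k in range(1, len(L) + 1))

def credit(doc):
    """Linear rank difference; positive credit is evidence for ranker A."""
    return (B.index(doc) + 1) - (A.index(doc) + 1)

def expected_credit(L):
    """Assumed random-click user: pick a cutoff k uniformly, then click one of the top k."""
    n = len(L)
    return sum(np.mean([credit(d) for d in L[:k]]) for k in range(1, n + 1)) / n

allowed = [L for L in permutations(docs) if is_allowed(L)]
exp_credit = [expected_credit(L) for L in allowed]

# Constraints: probabilities sum to 1, and expected credit under random clicking is 0.
A_eq = np.vstack([np.ones(len(allowed)), exp_credit])
b_eq = np.array([1.0, 0.0])
res = linprog(c=np.zeros(len(allowed)), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0.0, 1.0)] * len(allowed))

for L, p in zip(allowed, res.x):
    print(L, round(p, 3))
```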
Sensitivity
• The optimization problem so far is usually underconstrained (lots of possible rankings).
• What else do we want? Sensitivity!
• Intuition:
– When we show a particular ranking (i.e. something combining results from A and B), it is always biased (interleaving says that we should be unbiased on average)
– The more biased, the less informative the outcome
– We want to show individual rankings that are least biased
I’ll skip the maths here...
Illustrative optimized solution
[Figure: the allowed interleaved rankings of A and B, with the probabilities p_i assigned to each by different interleaving algorithms (e.g. Team Draft versus the optimized solution)]
Summary
• Interleaving is a sensitive online metric for evaluating rankings
– Very high agreement when reliable offline relevance metrics are available
– Agreement of simple interleaving algorithms with AB metrics can be poor for small / ambiguous relevance differences
• Solutions:
– Can de-bias user behaviour (e.g. presentation effects)
– Can optimize to a known AB metric (if one is trusted)
– Can optimize to a known user model
Thanks!
Questions?