Transcript
Beyond Position Bias: Examining Result Attractiveness as a Source of Presentation Bias in Clickthrough Data
WWW 2010
Yisong Yue Cornell Univ.
Rajan Patel Google Inc.
Hein Roehrig Google Inc.
User Feedback in Search Systems
• Cheap & representative feedback
  – Evaluation metrics
  – Optimization criterion
  – How to interpret feedback accurately?
• Clicks on (web) search results
  – Data plentiful
  – Important domain
Interpreting Clicks
• What does a click mean?
• Does a click mean the result was good?
• How good?
How Are Clicks Biased?
• In what ways do clicks not directly reflect user utility or preferences?
• Presentation bias
  – Users only click on what they pay attention to
  – E.g., position bias (more clicks at the top of the ranking)
• Understanding presentation bias is essential to interpreting feedback more accurately
• Maybe the 3rd result looked more relevant
  – i.e., judging a book by its cover
• Maybe the 3rd result attracted more attention
  – E.g., eye-catching
  – Many matching query terms (in bold)
Summary Attractiveness
• Goal: quantify the effect of summary attractiveness on click behavior
  – Web search context
• First study to conduct a rigorous statistical analysis of summary attractiveness bias
Controlling for Position
• Position bias is the largest biasing effect
• Need to control for it in order to analyze other biasing effects
• Use FairPairs randomization
  – [Radlinski & Joachims, 2006]
FairPairs Example
• Original: 1 2 3 4 5 6 7 8 9 10
• FairPair1: (1 2) (3 4) (5 6) (7 8) (9 10)
  – Swap: 2 1 3 4 6 5 8 7 9 10
• FairPair2: 1 (2 3) (4 5) (6 7) (8 9) 10
  – Swap: 1 2 3 5 4 7 6 9 8 10
• Randomly choose pairing scheme
• Randomly swap each intra-pair ordering independently
[Radlinski & Joachims, AAAI 2006]
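The randomization above can be sketched in a few lines. This is an illustrative implementation, not the authors' code; the function name `fairpairs` is mine.

```python
import random

def fairpairs(ranking, rng=random):
    """Randomize a ranking with the FairPairs scheme of
    Radlinski & Joachims (AAAI 2006).

    One of two pairing schemes is chosen uniformly at random:
      scheme 1 pairs positions (1,2), (3,4), (5,6), ...
      scheme 2 pairs positions (2,3), (4,5), (6,7), ...
    Each pair is then independently swapped with probability 1/2.
    """
    result = list(ranking)
    start = rng.choice([0, 1])  # 0 -> scheme 1, 1 -> scheme 2
    for i in range(start, len(result) - 1, 2):
        if rng.random() < 0.5:
            result[i], result[i + 1] = result[i + 1], result[i]
    return result

# e.g. fairpairs(range(1, 11)) might produce 2 1 3 4 6 5 8 7 9 10,
# matching the FairPair1 swap shown above
```

Because only adjacent pairs are ever swapped, no result moves more than one position, so position bias within a pair is balanced out across impressions.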
Interpreting FairPairs Clicks
             A on top   B on top
Click on A     55%        40%
Click on B     45%        60%

Conclusion: B > A. Clicks indicate a pairwise preference (relative quality).
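The inference in the table can be made concrete: averaging each result's click share over the two presentation orders cancels the position effect. A minimal sketch (the function name and return convention are mine, not from the talk):

```python
def fairpairs_preference(clicks_a_top, clicks_b_top):
    """Infer a pairwise preference from FairPairs click counts.

    clicks_a_top / clicks_b_top: (clicks on A, clicks on B) under the
    two presentation orders.  Each result's click share is averaged
    over both orders, which balances out the position effect; the
    result with the higher average share is preferred.
    """
    a1, b1 = clicks_a_top
    a2, b2 = clicks_b_top
    share_a = (a1 / (a1 + b1) + a2 / (a2 + b2)) / 2
    share_b = 1.0 - share_a
    return "A > B" if share_a > share_b else "B > A"

# With the table above: A's share is 55% on top and 40% below,
# averaging 47.5%, so the conclusion is B > A.
```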
Thought Experiment
• Two results A & B
  – Equally relevant for some query
  – Ranked adjacently in search results
• AB and BA shown equally often (FairPairs)
• A has an attractive title; B does not.
• Who gets more clicks, A or B?
Click Data
• Ran FairPairs randomization
  – A portion of Google US web search traffic
  – 8/1/2009 to 8/20/2009
  – 439,246 clicks collected
Human Judged Ratings
• Sampled a subset of 1,150 FairPairs.
• Asked human raters to explicitly judge which of the pair is more relevant.
  – 5 judgments for each
• Human raters must navigate to the landing page.
Measuring Attractiveness
• Relative measure of attractiveness
  – Difference in bolded query terms in title & abstract
• Example: the bottom result has +2 bolded terms in the title and +2 bolded terms in the abstract
Measuring Attractiveness
• Clearly, query/title similarity is informative.
• Good results should have titles that strongly match the query.
• But would blindly counting clicks cause us to over-value query/title similarity?
Rated Clicks Model
Null Hypothesis
• Title & abstract bolding have zero effect
• Position and relative (judged) quality are the only factors affecting click probability.
Fitted Model
Param       Mean        95% Conf. Interv.
Base         0.653 **   +/- 0.183
Title        0.150 **   +/- 0.120
Abstract     0.039      +/- 0.120
Swap        -0.435 **   +/- 0.209
Human       -0.360 **   +/- 0.215
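To see what the fitted coefficients imply, we can plug them into a click-probability model. The slide does not show the model's functional form, so the logistic link below is an assumption, and the function name `p_click_bottom` and its argument conventions are mine:

```python
import math

# Fitted coefficients from the table above (** = significant at 95%).
COEF = {"base": 0.653, "title": 0.150, "abstract": 0.039,
        "swap": -0.435, "human": -0.360}

def p_click_bottom(title_diff, abstract_diff, swapped, human_pref):
    """Click probability for one result of a FairPair, assuming a
    logistic link (an assumption; the slide omits the model form).

    title_diff / abstract_diff: relative difference in bolded query
    terms; swapped: 1 if the pair was shown in swapped order;
    human_pref: human-judged relative quality.
    """
    z = (COEF["base"] + COEF["title"] * title_diff
         + COEF["abstract"] * abstract_diff
         + COEF["swap"] * swapped + COEF["human"] * human_pref)
    return 1.0 / (1.0 + math.exp(-z))
```

Because the Title coefficient (0.150) is positive and significant while quality is held fixed, extra title bolding raises predicted click probability, which rejects the null hypothesis for title bolding.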
Leveraging All Clicks
• Previous model required human judgments
• We need to calibrate against relative quality
• How to do this on all 400,000+ clicks?
• Make independence assumptions!
Intuition
• Virtually all search engines predict rankings using many attributes (or features).
• Query/title similarity is only one component.
• Example: a document with low query/title similarity might achieve a high ranking due to very relevant body text.
Example
• Each document is described by two features: 1st = query/title similarity, 2nd = query/body similarity
• Preferred document on the left of each pair:
  (1.5, 1.0) > (1.0, 1.2)
  (1.2, 2.0) > (1.5, 0.9)
  (2.0, 0.5) > (1.0, 1.5)
  (1.4, 1.9) > (1.7, 1.0)
Assumption
• Take pairs of adjacent documents at random
• Collect relative relevance ratings
  – Human-rated preferences
• Should be independent of title bolding difference
• Can check using a statistical model
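One simple way to check an independence assumption like this is to compare rater-agreement rates between pairs with and without a title-bolding difference. The paper's actual check is the Rated Agreement Model below; the two-proportion z-test here is just an illustrative stand-in, and all names and counts are mine:

```python
import math

def two_proportion_ztest(k1, n1, k2, n2):
    """Two-sided two-proportion z-test (illustrative; the talk's
    actual check is the fitted Rated Agreement Model).

    k1/n1, k2/n2: successes and trials in the two groups.
    Returns (z, p_value) for H0: both groups share one proportion.
    """
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical counts: if agreement is 60/100 with bolding difference
# and 58/100 without, the large p-value means we cannot reject
# independence.
```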
Rated Agreement Model
Fitted Model
Param       Mean        95% Conf. Interv.
Base         0.258 **   +/- 0.062
Title        0.018      +/- 0.060
Abstract     0.058      +/- 0.060
Assumption approximately satisfied for query/title similarity.
Title Bias Effect (All Clicks)
• Bars should be equal if not biased
[Bar chart omitted; y-axis from 0 to 0.6]
All Clicks Model
Evaluation Metrics & Optimization
• Pairwise preferences common for evaluation
  – E.g., maximize FairPairs agreement
• Goal: maximize pairwise relevance agreement
  – Want to be aligned with click agreement
• Danger: might conclude the current system is undervaluing query/title similarity
• Down-weight clicks on results with more title bolding
  – E.g., weight clicks by exp(-w_T * x_T)
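The proposed correction is simple to state in code. A minimal sketch, assuming w_T is the fitted title coefficient (0.150 from the Rated Clicks Model) and x_T is the title-bolding difference; the function name is mine and this is not the authors' exact procedure:

```python
import math

def deweighted_click(title_bolding_diff, w_title=0.150):
    """Down-weight a click on a result with extra title bolding by
    exp(-w_T * x_T), the correction the slide proposes.

    w_title defaults to the fitted Title coefficient from the talk;
    title_bolding_diff is the result's relative bolding advantage.
    """
    return math.exp(-w_title * title_bolding_diff)

# A click on a result with +2 extra bolded title terms counts as
# exp(-0.3), roughly 0.74 of a click; a click with no bolding
# advantage keeps weight 1.
```

This removes the attractiveness advantage from the click signal before it is used as an optimization criterion.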
Directions to Explore
• Other ways to measure summary attractiveness
  – Use other summary content
• Other forms of presentation bias
  – Anything that draws people's attention
• Ways to interpret and adjust for bias
  – More accurate ways to quantify bias
  – More accurate evaluation metrics
Extra Slides
Fitted Model (All Clicks)
Param           Mean        95% Conf. Interv.
Base             0.184 **   +/- 0.007
Top Title        0.060 **   +/- 0.008
Bot Title        0.061 **   +/- 0.009
Top Abstract     0.007      +/- 0.009
Bot Abstract     0.014 **   +/- 0.008
Swap @ 1         0.561 **   +/- 0.011
Swap @ 2         0.390 **   +/- 0.012
Swap @ 3         0.372 **   +/- 0.016
Swap @ 4-5       0.198 **   +/- 0.014
Swap @ 6-9       0.009      +/- 0.014
Swap @ 10+       0.054 **   +/- 0.009