
Beyond Position Bias: Examining Result Attractiveness as a Source of Presentation Bias in Clickthrough Data

WWW 2010

Yisong Yue, Cornell University

Rajan Patel, Google Inc.

Hein Roehrig, Google Inc.

User Feedback in Search Systems

• Cheap & representative feedback
  – Evaluation metrics
  – Optimization criterion
  – How to interpret feedback accurately?

• Clicks on (web) search results
  – Data plentiful
  – Important domain

Interpreting Clicks

• What does a click mean?
• Does a click mean the result is good?
• If so, how good?

How Are Clicks Biased?

• In what ways do clicks not directly reflect user utility or preferences?

• Presentation Bias
  – Users only click on what they pay attention to
  – E.g., position bias (more clicks at the top of the ranking)
• Understanding presentation bias is essential to interpreting feedback more accurately

• Maybe the 3rd result looked more relevant
  – i.e., judging a book by its cover
• Maybe the 3rd result attracted more attention
  – E.g., eye-catching
  – Many matching query terms (in bold)

Summary Attractiveness

• Goal: quantify the effect of summary attractiveness on click behavior
  – Web search context
• First study to conduct a rigorous statistical analysis of summary attractiveness bias

Controlling for Position

• Position bias is the largest biasing effect
• Need to control for it in order to analyze other biasing effects
• Use FairPairs randomization
  – [Radlinski & Joachims, 2006]

FairPairs Example

• Original:   1 2 3 4 5 6 7 8 9 10
• FairPair1 pairing: (1 2) (3 4) (5 6) (7 8) (9 10)
  – After random swaps: 2 1 3 4 6 5 8 7 9 10
• FairPair2 pairing: 1 (2 3) (4 5) (6 7) (8 9) 10
  – After random swaps: 1 2 3 5 4 7 6 9 8 10
• Randomly choose the pairing scheme
• Randomly swap each intra-pair ordering independently (see the sketch below)

[Radlinski & Joachims, AAAI 2006]
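
As a concrete illustration, here is a minimal Python sketch of the FairPairs procedure described above (my own code, not from Radlinski & Joachims; names are illustrative):

```python
import random

def fairpairs(ranking, scheme=None):
    """Apply FairPairs randomization to a ranked list.

    Scheme 1 pairs ranks (1,2)(3,4)...; scheme 2 pairs (2,3)(4,5)...,
    leaving the first (and possibly last) result unpaired.
    Each pair's order is swapped independently with probability 1/2.
    """
    if scheme is None:
        scheme = random.choice([1, 2])   # randomly choose the pairing scheme
    result = list(ranking)
    start = 0 if scheme == 1 else 1      # scheme 2 leaves rank 1 unpaired
    for i in range(start, len(result) - 1, 2):
        if random.random() < 0.5:        # independent coin flip per pair
            result[i], result[i + 1] = result[i + 1], result[i]
    return result

# Example: randomize the top-10 ranking from the slide
print(fairpairs([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```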

Interpreting FairPairs Clicks

              A on top   B on top
Click on A      55%        40%
Click on B      45%        60%

Conclusion: B > A

Clicks indicate pairwise preference (relative quality).
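
A minimal sketch (my own illustration, not code from the paper) of how clicks on swapped pairs aggregate into a pairwise preference; averaging over both presentation orders cancels the position effect:

```python
def pairwise_preference(clicks):
    """clicks[(top, clicked)] = click count when `top` was shown first."""
    # A's click share when A is on top, and when B is on top
    a_share_a_top = clicks[('A', 'A')] / (clicks[('A', 'A')] + clicks[('A', 'B')])
    a_share_b_top = clicks[('B', 'A')] / (clicks[('B', 'A')] + clicks[('B', 'B')])
    # Position-balanced click share for A
    a_share = (a_share_a_top + a_share_b_top) / 2
    return 'A > B' if a_share > 0.5 else 'B > A'

# Counts matching the 55/45 and 40/60 split in the table above
clicks = {('A', 'A'): 55, ('A', 'B'): 45, ('B', 'A'): 40, ('B', 'B'): 60}
print(pairwise_preference(clicks))  # -> 'B > A'
```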

Thought Experiment

• Two results A & B
  – Equally relevant for some query
  – Ranked adjacently in search results
• AB and BA shown equally often (FairPairs)
• A has an attractive title; B does not.

• Who gets more clicks, A or B?

Click Data

• Ran FairPairs randomization
  – On a portion of Google US web search traffic
  – 8/1/2009 to 8/20/2009
  – 439,246 clicks collected

Human Judged Ratings

• Sampled a subset of 1150 FairPairs.
• Asked human raters to explicitly judge which result of the pair is more relevant.
  – 5 judgments for each pair
• Human raters must navigate to the landing page.

Measuring Attractiveness

• Relative measure of attractiveness
• Difference of bolded query terms in title & abstract
• In the pictured example, the bottom result has +2 bolded terms in the title and +2 bolded terms in the abstract
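
A minimal sketch of this relative measure, assuming snippets arrive as HTML with query matches wrapped in <b> tags (a common convention; the paper's exact extraction pipeline is not specified here):

```python
import re

BOLD = re.compile(r'<b>.*?</b>', re.IGNORECASE | re.DOTALL)

def bold_count(html):
    """Number of bolded (query-matching) terms in an HTML snippet."""
    return len(BOLD.findall(html))

def bolding_difference(top, bottom):
    """Relative attractiveness: bottom-minus-top bold counts,
    computed separately for title and abstract."""
    return (bold_count(bottom['title']) - bold_count(top['title']),
            bold_count(bottom['abstract']) - bold_count(top['abstract']))

# Hypothetical snippets mirroring the slide's "+2 / +2" example
top = {'title': 'Cheap <b>flights</b>',
       'abstract': 'Book <b>flights</b> online.'}
bottom = {'title': '<b>Cheap</b> <b>flights</b> to <b>Paris</b>',
          'abstract': '<b>Cheap</b> <b>flights</b>, <b>Paris</b> hotels.'}
print(bolding_difference(top, bottom))  # -> (2, 2)
```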

Measuring Attractiveness

• Clearly, query/title similarity is informative.
• Good results should have titles that strongly match the query.
• But would blindly counting clicks cause us to over-value query/title similarity?

Rated Clicks Model

Null Hypothesis

• Title & abstract bolding have zero effect
• Position and relative (judged) quality are the only factors affecting click probability
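
The slides do not reproduce the model equation. A plausible reading, consistent with the parameter names in the table below, is a logistic regression for the probability of clicking the top result of a FairPair; here is a minimal sketch on synthetic data, assuming that form (column names and data are mine, not the paper's):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in data: one row per FairPairs impression
rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.integers(-2, 3, n),  # Title: bolding difference (top minus bottom)
    rng.integers(-2, 3, n),  # Abstract: bolding difference
    rng.integers(0, 2, n),   # Swap: 1 if the pair was shown in swapped order
    rng.integers(-1, 2, n),  # Human: judged relative quality of top vs. bottom
])
X = sm.add_constant(X)       # Base: intercept term
y = rng.integers(0, 2, n)    # 1 if the top result was clicked

result = sm.Logit(y, X).fit(disp=0)
print(result.params)                # point estimates, cf. the "Mean" column
print(result.conf_int(alpha=0.05))  # 95% confidence intervals
```

Under the null hypothesis, the Title and Abstract coefficients would be indistinguishable from zero.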

Fitted Model

Param      Mean        95% Conf. Interv.
Base        0.653 **   +/- 0.183
Title       0.150 **   +/- 0.120
Abstract    0.039      +/- 0.120
Swap       -0.435 **   +/- 0.209
Human      -0.360 **   +/- 0.215

(** : 95% confidence interval excludes zero)

Leveraging All Clicks

• Previous model required human judgments
• We need to calibrate against relative quality
• How to do this on all 400,000+ clicks?
• Make independence assumptions!

Intuition

• Virtually all search engines predict rankings using many attributes (or features).

• Query/title similarity is only one component.
• Example: a document with low query/title similarity might achieve a high ranking due to very relevant body text.

Example

Each document is described by two features (1st: query/title similarity, 2nd: query/body similarity). Preferred result on the left:

(1.5, 1.0) > (1.0, 1.2)
(1.2, 2.0) > (1.5, 0.9)
(2.0, 0.5) > (1.0, 1.5)
(1.4, 1.9) > (1.7, 1.0)
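
To make the point concrete, a minimal check (my own illustration) that neither feature alone orders all four pairs correctly, while a weighted combination such as w = (1.0, 0.5) does; the weights are hypothetical:

```python
# (winner, loser) pairs from the slide; each doc = (title_sim, body_sim)
pairs = [((1.5, 1.0), (1.0, 1.2)),
         ((1.2, 2.0), (1.5, 0.9)),
         ((2.0, 0.5), (1.0, 1.5)),
         ((1.4, 1.9), (1.7, 1.0))]

def agreement(w, pairs):
    """Fraction of preference pairs where the weighted score
    ranks the winner above the loser."""
    score = lambda d: w[0] * d[0] + w[1] * d[1]
    return sum(score(a) > score(b) for a, b in pairs) / len(pairs)

print(agreement((1.0, 0.0), pairs))  # title similarity alone: 0.5
print(agreement((0.0, 1.0), pairs))  # body similarity alone:  0.5
print(agreement((1.0, 0.5), pairs))  # weighted combination:   1.0
```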


Assumption

• Take pairs of adjacent documents at random
• Collect relative relevance ratings
  – Human-rated preferences
• Should be independent of the title-bolding difference
• Can check using a statistical model
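
The check could look like the following sketch (synthetic data; column naming is mine): regress the rated preference on the bolding differences and see whether those coefficients are significant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1150                               # matches the number of sampled FairPairs
title_diff = rng.integers(-2, 3, n)    # bolding difference in titles
abstract_diff = rng.integers(-2, 3, n) # bolding difference in abstracts
rated_pref = rng.integers(0, 2, n)     # 1 if raters preferred the top result

X = sm.add_constant(np.column_stack([title_diff, abstract_diff]))
res = sm.Logit(rated_pref, X).fit(disp=0)
# If the assumption holds, the Title and Abstract coefficients should be
# statistically indistinguishable from zero (cf. the table below).
print(res.summary())
```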

Rated Agreement Model

Fitted Model

Param      Mean       95% Conf. Interv.
Base       0.258 **   +/- 0.062
Title      0.018      +/- 0.060
Abstract   0.058      +/- 0.060

(** : 95% confidence interval excludes zero)

Assumption approximately satisfied for query/title similarity.

Title Bias Effect (All Clicks)

• Bars should be equal if not biased

[Bar chart of click fractions, y-axis 0 to 0.6; the original figure compares click rates by title-bolding difference.]

All Clicks Model

Evaluation Metrics & Optimization

• Pairwise preferences common for evaluation
  – E.g., maximize FairPairs agreement
• Goal: maximize pairwise relevance agreement
  – Want to be aligned with click agreement
  – Danger: might conclude the current system is undervaluing query/title similarity
• Down-weight clicks on results with more title bolding
  – E.g., weight clicks by exp(-wᵀ x_T), where x_T is the title-bolding feature vector
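
A minimal sketch of that down-weighting (my own illustration; the weight value is hypothetical, loosely echoing the fitted Title coefficient above):

```python
import numpy as np

def click_weight(x_title, w):
    """Down-weight a click on a result with heavy title bolding:
    weight = exp(-w^T x_T), so weight = 1 when bolding is typical
    and < 1 when the title is unusually attractive."""
    return float(np.exp(-np.dot(w, x_title)))

w = np.array([0.15])                     # hypothetical bias-correction weight
print(click_weight(np.array([2.0]), w))  # +2 extra bolded title terms -> ~0.74
print(click_weight(np.array([0.0]), w))  # typical bolding -> 1.0
```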

Directions to Explore

• Other ways to measure summary attractiveness
  – Use other summary content
• Other forms of presentation bias
  – Anything that draws people’s attention
• Ways to interpret and adjust for bias
  – More accurate ways to quantify bias
  – More accurate evaluation metrics

Extra Slides

Fitted Model (All Clicks)

Param          Mean       95% Conf. Interv.
Base           0.184 **   +/- 0.007
Top Title      0.060 **   +/- 0.008
Bot Title      0.061 **   +/- 0.009
Top Abstract   0.007      +/- 0.009
Bot Abstract   0.014 **   +/- 0.008
Swap @ 1       0.561 **   +/- 0.011
Swap @ 2       0.390 **   +/- 0.012
Swap @ 3       0.372 **   +/- 0.016
Swap @ 4-5     0.198 **   +/- 0.014
Swap @ 6-9     0.009      +/- 0.014
Swap @ 10+     0.054 **   +/- 0.009

(** : 95% confidence interval excludes zero)