Transcript PPT

Please interrupt me at any point!
Online Advertising
Open lecture at Warsaw University
February 25/26, 2011
Ingmar Weber
Yahoo! Research Barcelona
[email protected]
Disclaimers & Acknowledgments
• This talk presents the opinions of the author. It
does not necessarily reflect the views of Yahoo!
Inc. or any other entity.
• Algorithms, techniques, features, etc. mentioned
here might or might not be in use by Yahoo! or
any other company.
• Many of the slides in this lecture are based on
tables/graphs from the referenced papers. Please
see the actual papers for more details.
Review from last lecture
• Lots of money
– Ads essentially pay for the WWW
• Mostly sponsored search and display ads
– Sp. search: sold using variants of GSP
– Disp. ads: sold in GD contracts or on the spot
• Many computational challenges
– Finding relevant ads, predicting CTRs,
new/tail content and queries, detecting fraud,
…
Plan for today and tomorrow
• So far
– Mostly introductory, “text book material”
• Now
– Mostly recent research papers
– Crash course in machine learning, information
retrieval, economics, …
Hopefully more “think-along” (not sing-along)
and not “shut-up-and-listen”
But first …
• Third party cookies
www.bluekai.com (many others …)
Efficient Online Ad Serving in a
Display Advertising Exchange
Keving Lang, Joaquin Delgado,
Dongming Jiang, et al.
WSDM’11
Not so simple landscape for D A
Advertisers
“Buy shoes at nike.com”
“Visit asics.com today”
“Rolex is great.”
Publishers
A running blog
The legend of Cliff Young
Celebrity gossip
Users
32m, likes running
50f, loves watches
16m, likes sports
Basic problem:
Given a (user, publisher) pair, find a good ad(vertiser)
Ad networks and Exchanges
• Ad networks
– Bring together supply (publishers) and
demand (advertisers)
– Have bilateral agreements via revenue
sharing to increase market fluidity
• Exchanges
– Do the actual real-time allocation
– Implement the bilateral agreements
Middle-aged, middle-income New Yorker visits the web site of Cigar Magazine (P1)
D only known at end.
User constraints:
no alcohol ads to minors
Supply constraints: conservative network doesn’t want left publishers
Demand constraints: Premium blogs don’t want spammy ads
Valid Paths & Objective Function
Depth-first search enumeration
Algorithm A
Worst case running time?
Typical running time?
Algorithm B
US pruning
Upper bound
Why?
Worst case running time?
Sum vs. product?
Optimizations?
D pruning
Reusable Precomputation
Cannot fully enforce D
Depends on reachable sink …
… which depends on U
What if space limitations?
How would you prioritize?
Experiments – artificial data
Experiments – real data
Competing for Users’ Attention: On
the Interplay between Organic and
Sponsored Search Results
Christian Danescu-NiculescuMizil, Andrei Broder, et al.
WWW’10
What would you investigate?
What would you suspect?
Things to look at
• General bias for near-identical things
– Ads are preferred (as further “North”)
– Organic results are preferred
• Interplay between ad CTR and result CTR
– Better search results, less ad clicks?
– Mutually reinforcing?
• Dependence on type
– Navigational query vs. informational query
– Responsive ad vs. incidental ad
Data
• One month of traffic for subset of Y!
search servers
• Only North ads, served at least 50 times
• For each query qi most clicked ad Ai* and
most clicked organic result Oi*
• 63,789 (qi, Oi*, Ai*) triples
• Bias?
(Non-)Commercial bias?
• Look at A* and O* with identical domain
• Probably similar quality …
• … but (North) ad is higher
• What do you think?
• In 52% ctrO > ctrA
av. ctrO
av. ctrA
Correlation
ctrO
For given (range of) ctrO bucket all ads.
ctrA
av. ctrO
av. ctrA
Navigational vs. non-navigational
ctrO
ctrA
Navigational: antagonistic effect
Non-navigational: (mild) reinforcement
Dependence on similarity
Bag of words for title terms
(“Free Radio”, “Pandora Radio – Listen to
Free Internet Radio, Find New Music”) = 2/9
av. ctrA
av. ctrA
Dependence on similarity
A simple model
Want to model
Also need:
A simple model
Explains basic (quadratic) shape of overlap
vs. ad click-through-rate
Improving Ad Relevance in
Sponsored Search
Dustin Hillard, Stefan Schroedl,
Eren Manavoglu, et al.
WSDM’10
Ad relevance  Ad attractiveness
• Relevance
– How related is the ad to the search query
– q=“cocacola”, ad=“Buy Coke Online”
• Attractiveness
– Essentially click-through rate
– q=“cocacola”, ad=“Coca Cola Company Job”
– q=*, ad=“Lose weight fast and easy”
Hope: decoupling leads to better (cold-start) CTR predictions
Basic setup
• Get relevance from editorial judgments
– Perfect, excellent, good, fair, bad
– Treat non-bad as relevant
• Machine learning approach
– Compare query to the ad
– Title, description, display URL
– Word overlap (uni- and bigram), character overlap
(uni- and bigram), cosine similarity, ordered bigram
overlap
– Query length
• Data
– 7k unique queries (stratified sample)
– 80k query-ad judged relevant pairs
Basic results – text only
Precision = (“said ‘yes’ and was ‘yes’”)/(“said ‘yes’”)
Recall = (“said ‘yes’ and was ‘yes’”)/(“was ‘yes’”)
Accuracy = (“said the right thing”)/(“said something”)
F1-score = 2/(1/P + 1/R) harmonic mean < arithmetic mean
What other features?
Incorporating user clicks
• Can use historic CTRs
– Assumes (ad,query) pair has been seen
• Useless for new ads
– Also evaluate in blanked-out setting
Translation Model
In search, translation models are common
Here D = ad
Good translation = ad click
Typical model
A query term
Maximum likelihood (for historic data)
Any problem with this?
An ad term
Digression on MLE
• Maximum likelihood estimator
– Pick the parameter that‘s most likely to
generate the observed data
Example: Draw a single number from a hat
with numbers {1, …, n}.
You observe 7.
Maximum likelihood estimator?
Underestimates size (c.f. # of species)
Underestimates unknown/impossible
Unbiased estimator?
Remove position bias
• Train one model as described before
– But with smoothing
• Train a second model using expected clicks
• Ratio of model for actual and expected clicks
• Add these as additional features for the learner
Filtering low quality ads
• Use to remove irrelevant ads
- Don‘t show ads below relevance threshold
Showing fewer ads gave more clicks per search!
Second part of Part 2
Estimating Advertisability of Tail
Queries for Sponsored Search
Sandeep Pandey, Kunal Punera,
Marcus Fontoura, et al.
SIGIR’10
Two important questions
• Query advertisability
– When to show ads at all
– How many ads to show
• Ad relevance and clickability
– Which ads to show
– Which ads to show where
Focus on first problem.
Predict: will there be an ad click?
Difficult for tail queries!
Word-based Model
Query q has words {wi}. Model q‘s click propensity as:
Good/bad?
Variant w/o bias for long queries:
Maximum likelihood attempt to learn these:
s(q) = # instances of q with an ad click
n(q) = # instances of q without an ad click
Word-based Model
Then give up …each q only one word
Linear regression model
Different model: words contribute linearly
Add regularization to avoid overfitting
of underdetermined problem
Problem?
Digression
Taken from: http://www.dtreg.com/svm.htm and
http://www.teco.edu/~albrecht/neuro/html/node10.html
Topical clustering
• Latent Dirichlet Allocation
– Implicitly uses co-occurrences patterns
• Incorporate the topic distributions as
features in the regression model
Evaluation
• Why not use the observed c(q) directly?
– “Ground truth” is not trustworthy – tail queries
• Sort things by predicted c(q)
– Should have included optimal ordering!
Learning Website Hierarchies for
Keyword Enrichment in
Contextual Advertising
Pavan Kumar GM, Krishna Leela,
Mehul Parsana, Sachin Garg
WSDM’11
The problem(s)
• Keywords extracted for contextual advertising
are not always perfect
• Many pages are not indexed – no keywords
available. Still have to serve ads
• Want a system that for a given URL (indexed or
not) outputs good keywords
• Key observation: use in-site similarity between
pages and content
Preliminaries
• Mapping URLs u to key-value pairs
• Represent webpage p as vector of keywords
– tf, df, and section where found
Goals:
1. Use u to introduce new kw and/or update existing weights
2. For unindexed pages get kw via other pages from same site
Latency constraint!
What they do
• Conceptually:
– Train a decision tree with keys K as attribute
labels, V as attribute values and pages P as
class labels
– Too many classes (sparseness, efficiency)
• What they do:
– Use clusters of web pages as labels
Digression: Large scale clustering
Syntactic clustering of the Web, Broder et al., 1997
• How (and why) to detect mirror pages?
– “ls man”
What would you do?
• Want a summarizing “fingerprint”?
– Direct hashing won’t work
Shingling
w-shingles of a document (say, w=4)
– “If you are lonely when you are alone, you are
in bad company.” (Sartre)
{(if you are lonely), (you are lonely when),
(are lonely when you), (lonely when you
are), …}
Resemblance
rw(A,B) = |S(A,w)ÅS(B,w)|/|S(A,w)[S(B,w)|
Works well, but how to compute efficiently?!
Obtaining a “sketch”
•
•
•
•
Fix shingle size w, shingle universe U.
Each indvidual shingle is a number (by hashing)
Let W be a set of shingles. Define:
MINs(W) = The set of s smallest elements in W, if |W|¸s
W otherwise
Theorem:
Let : U!U be a permutation of U chosen uniformly at random.
Let M(A) = MINs((S(A))) and M(B) = MINs((S(B))).
The value |MINs(M(A)[M(B))ÅM(A)ÅM(B)|/|MINs(M(A)[M(B))|
is an unbiased estimate of the resemblance of A and B.
Proof
Note: Mins(M(A)) has a fixed size (namely s).
Back to where we were
• They (essentially) use agglomerative
single-linkage clustering with a min
similarity stopping threshold
Do you know agglomerative clustering?
• Splitting criteria
– How would you do it?
Not the best criterion?
• IG prefers attributes with many values
– They claim: high generalization error
– They use: Gain Ratio (GR)
Take impressions into account
• So far (unweighted) pages
– Class probability = number of pages
Weight things by impressions.
More weight for recent visits:
Stopping criterion
• Stop splitting in tree construction when
– All children part of the same class
– Too few impressions under the node
– Statistically not meaningful (Chi-square test)
• Now we have a decision tree for URLs
(leaves)
– What about interior nodes?
Obtaining keywords for nodes
• Belief propagation – from leaves up …and
back down down
Now we have keywords for nodes.
Keywords for matching nodes are
used.
Evaluation
• Two state-of-the-art baselines
– Both use the content
– JIT uses only first 500 bytes, syntactical
– “Semantic” uses topical page hierarchies
– All used with cosine similarity to find ads
• Relevance evaluation
– Human judges evaluated ad relevance
(Some) Results
nDCG
… slide
Digression - nDCG
• Normalized Discounted Cumulative Gain
• CG: total relevance at positions 1 to p
• DCG: the higher the better
• nDCG: take problem difficulty into account
An Expressive Mechanism for
Auctions on the Web
Paul Dütting, Monika Henzinger,
Ingmar Weber
WWW’11
More general utility functions
• Usually
– ui,j(pj) = vi,j – pj
– Sometimes with (hard) budget bi
• We want to allow
– ui,j(pj) = vi,j – ci,j¢ pj, i.e. (i,j)-dependent slopes
– multiple slopes on different intervals
– non-linear utilities altogether
Why (i,j)-dependent slopes?
• Suppose mechanism uses CPC pricing …
• … but a bidder has CPM valuation
• Mechanism computes
• Guarantees
Why (i,j)-dependent slopes?
• Translating back to impressions …
Why different slopes over intervals?
• Suppose bidding on a car on ebay
– Currently only 1-at-a-time (or dangerous)!
– Utility depends on rates of loan
Why non-linear utilities?
• Suppose the drop is supra-linear
– The higher the price the lower the profit …
– … and the higher the uncertainty
– Maybe log(Ci,j-pj)
– “Risk-averse” bidders
• Will use piece-wise linear for approximation
– Give approximation guarantees
Input definition
Set of n bidder I, set of k items J.
Items contain a dummy item j0.
Each bidder i has an outside option oi.
Each item j has a reserve price rj.
Problem statement
• Compute an outcome
• Outcome is feasible if
• Outcome is envy free if for all i and (i,j) 2IxJ
• Bidder optimal if for all other envy free
and for all bidders i (strong!)
Bidder optimality vs. truthfulness
Two bidders i2{1,2} and two items j2{1,2}.
rj = 0 for j2{1,2}, and oi = 0 for i2{1,2}
What‘s a bidder optimal outcome?
What if bidder 1 underreports u1,1(¢)?
Note: “degenerate” input!
Theorem: General position => Truthfulness.
[See paper for definition of “general position”.]
Main Results
Definition:
Overdemand-preserving directions
• Basic idea
– Algorithm iteratively increases the prices
– Price increases required to solve overdemand
• Tricky bits
– preserve overdemand (will explain)
– show necessity (for bidder optimality)
– accounting for unmatching (for running time)
Overdemand-preserving directions
The simple case
1
10-p1
5-p3
Bidders
2
5
p1=1
9-p2
8-p1
7-p2
3-p3
Explain:
First choice graph
1
2 p2=0
4
Items
Explain:
Increase required
12-p1
11-p2
3
2-p3
3 p3=0
Explain:
Path augmentation
Overdemand-preserving directions
The not-so-simple case
1
11-2p1
5-p3
Bidders
2
1
p1=1
9-p2
8-4p1
4-3p2
2 p2=0
3-p3
Explain:
ci,j matter!
12-3p1
9-7p2
3
2-p3
3 p3=0
Items
Finding ov.d-preserv. directions
• Key observation (not ours!):
– minimize
– or equivalently
• No longer preserves full first choice graph
– But alternating tree
• Still allows path augmention
The actual mechanism
Effects of Word-of-Mouth Versus
Traditional Marketing: Findings from
an Internet Social Networking Site
Michael Trusov, Randolph Bucklin,
Koen Pauwels
Journal of Marketing, 2009
The growth of a social network
• Driving factors
– Paid event marketing (101 events in 36 wks)
– Media attention (236 in 36 wks)
– Word-of-Mouth (WOM)
• Can observe
– Organized marketing events
– Mentions in the media
– WOM referrals (through email invites)
– Number of new sign-ups
What could cause what?
•
•
•
•
Media coverage => new sign-ups?
New sign-ups => media coverage?
WOM referrals => new sign-ups?
….
Time series modeling
intercept
sign-ups
WOM referrals
Media appearances
Promo events
Up to 20 days
Lots of parameters
linear trend
holidays
day of week
Time series modeling
Overfitting?
Granger Causality
• Correlation  causality
– Regions with more storks have more babies
– Families with more TVs live longer
• Granger causality attempts more
– Works for time series
– Y and (possible) cause X
– First, explain (= linear regression) Y by lagged Y
– Explain the rest using lagged X
– Significant improvement in fit?
What causes what?
Response of Sign-Ups to Shock
• IRF: impulse
response function
New to me …
Digression: Bass diffusion
New “sales” at time t:
Ultimate market potential m is given.
Model comparison
• 197 train (= in-sample)
• 61 test (= out-of-sample)
Monetary Value of WOM
•
•
•
•
•
CPM about $.40 (per ad)
Impressions visitor/month about 130
Say 2.5 ads per impression
$.13 per month per user, or about $1.50/yr
IRF: 10 WOM = 5 new sign-ups over 3 wk
1 WOM worth approx $.75/yr