Applications of Machine Learning in Internet Advertising
庄宝童
Agenda
• Introduction
• Machine learning applications
– Common utility
– Advertiser
– Publisher
– User
• Summary
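Why do we need Internet advertising?
• Traffic (users) is a key asset of Internet companies
• Internet content is free to consume, so traffic must be monetized to sustain operations
• Advertising as a share of revenue:
– Google: 95% (2012, http://investor.google.com/financial/tables.html)
– Facebook: 83% (2011)
– Baidu: ?
– Alibaba: ?
• Characteristics: results are quantifiable and trackable, little involvement from operations and sales staff, low cost per impression
• For Internet advertising companies it is an ideal "money-printing machine" business model (吴军, 《浪潮之巅》)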
What kind of advertising do we need?
Find the best match between a given user in a
given context and a suitable advertisement
-- Andrei Broder and Dr. Vanja 2011
[Figure: ad ecosystem diagram. Labels: User, Page, Publisher, Ad Network, Advertisers, Ads, Bids, Statistical model, Response rates (click, conversion, ad-view), Auction (select argmax f(bid, rate)), Pick best ads, conversion]
Players in the ecosystem
• Publisher's utility: revenue, user engagement
• Advertiser's utility: ROI
• User's utility: relevance
mechanism design
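• Contract pricing (futures market), billed by CPM or CPT
• Auction pricing (spot market)
– GFP (generalized first price)
– GSP (generalized second price)
– VCG (Vickrey-Clarke-Groves)
• Pricing models
– CPM (cost per mille, i.e. per thousand impressions): lowest risk for the publisher, e.g. brand advertising on Yahoo and Sina
– CPC (cost per click): risk is shared between publisher and advertiser; Google AdWords, Baidu Fengchao (百度凤巢) and most others fall into this category
– CPA (cost per action): lowest risk for the advertiser, e.g. Taobaoke (淘宝客)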
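The auction formats listed above differ mainly in what each winner pays. The following is a minimal sketch, assuming bid-only ranking and a single auction; the function name, the advertiser names, the bids, and the `reserve` fallback are all hypothetical. Under GFP each winner pays its own bid, while under GSP it pays the bid of the advertiser ranked directly below it.

```python
def run_auction(bids, num_slots, reserve=0.0):
    """Allocate slots by descending bid.

    bids: dict mapping advertiser name -> bid per click (hypothetical values).
    reserve: price charged to the last winner when no lower bid exists
             (an assumption made for this sketch).
    Returns a list of (advertiser, gfp_price, gsp_price), one per filled slot.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for i, (name, bid) in enumerate(ranked[:num_slots]):
        gfp_price = bid  # generalized first price: pay your own bid
        # generalized second price: pay the next-highest bid
        gsp_price = ranked[i + 1][1] if i + 1 < len(ranked) else reserve
        results.append((name, gfp_price, gsp_price))
    return results


example = run_auction({"a": 3.0, "b": 2.0, "c": 1.0}, num_slots=2)
```

VCG is left out of the sketch because its payments depend on per-slot click rates, which the slide does not specify.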
Ranking functions for CPC
• Bid ranking: bid
– Originated at goto.com (the predecessor of Overture, later acquired by Yahoo)
• Revenue ranking: CTR * bid
– Pioneered by Google
– Core problem: CTR prediction
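To make the difference between the two ranking functions concrete, here is a small sketch with made-up ads. The field names (`id`, `bid`, `ctr`) and all numbers are assumptions for illustration; in a real system `ctr` would come from the prediction model.

```python
def rank_ads(ads, by="revenue"):
    """Return ad ids ordered by the chosen ranking function.

    ads: list of dicts with 'id', 'bid' (per click) and 'ctr' (predicted CTR).
    by:  'bid' for bid ranking, 'revenue' for CTR * bid ranking.
    """
    if by == "bid":
        key = lambda ad: ad["bid"]
    elif by == "revenue":
        key = lambda ad: ad["ctr"] * ad["bid"]  # expected revenue per impression
    else:
        raise ValueError("by must be 'bid' or 'revenue'")
    return [ad["id"] for ad in sorted(ads, key=key, reverse=True)]


ads = [
    {"id": "x", "bid": 2.0, "ctr": 0.01},
    {"id": "y", "bid": 1.0, "ctr": 0.05},
    {"id": "z", "bid": 1.5, "ctr": 0.02},
]
by_bid = rank_ads(ads, by="bid")
by_revenue = rank_ads(ads, by="revenue")
```

In this toy data the highest bidder `x` has the lowest expected revenue per impression, so the two orderings are exact opposites.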
model
P(click | user, ad, context)
• ad : creative, bid-terms, landing page, campaign,
advertiser, format (text/image/video), size, etc.
• user : cookie, demo, geo, behavioral, activity history
• context : query, publisher, page-content, session, time
algorithms
• Logistic regression + feature engineering (Google, Yahoo, Baidu, Facebook, etc.)
• Microsoft: Bayesian probit regression
• Google: boosting (http://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf)
• Taobao: mixture of logistic regressions
• Trends: big data + nonlinear models / feature learning
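As an illustration of the first item, the sketch below trains a logistic regression on sparse categorical features with plain SGD. It is a toy under stated assumptions, not any company's production setup: the hashing trick, the bucket count, the learning rate, and the example fields (`ad`, `query`) are all choices made here for illustration.

```python
import math
import zlib

NUM_BUCKETS = 2 ** 18  # size of the hashed feature space (an assumption)


def hash_features(raw):
    """Map a dict of categorical fields to sparse indices via the hashing trick."""
    return [zlib.crc32(f"{k}={v}".encode("utf-8")) % NUM_BUCKETS
            for k, v in sorted(raw.items())]


def predict(weights, indices):
    """P(click) = sigmoid(sum of weights of the active features)."""
    z = sum(weights.get(i, 0.0) for i in indices)
    z = max(min(z, 35.0), -35.0)  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, indices, label, lr=0.1):
    """One SGD update on the log loss; every active feature has value 1."""
    grad = predict(weights, indices) - label
    for i in indices:
        weights[i] = weights.get(i, 0.0) - lr * grad


def train(examples, epochs=20, lr=0.1):
    """examples: list of (raw_feature_dict, label) with label in {0, 1}."""
    weights = {}
    for _ in range(epochs):
        for raw, label in examples:
            sgd_step(weights, hash_features(raw), label, lr)
    return weights


toy = [
    ({"ad": "a1", "query": "shoes"}, 1),
    ({"ad": "a2", "query": "shoes"}, 0),
    ({"ad": "a1", "query": "books"}, 1),
    ({"ad": "a2", "query": "books"}, 0),
]
w = train(toy)
p_a1 = predict(w, hash_features({"ad": "a1", "query": "shoes"}))
p_a2 = predict(w, hash_features({"ad": "a2", "query": "shoes"}))
```

Storing weights in a dict keeps memory proportional to the features actually seen, which matters when the nominal feature space is far larger than any single example.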
challenges
• Sparsity: use natural hierarchies or auto-generated hierarchies
• Missing data
• Bias: position, ad category, etc.
• Dynamic/seasonal effects
• Spam/noisy data
features
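• Features:
– Click feedback features (COEC, clicks over expected clicks)
– Query features
– Query-ad text matching features
• Preprocessing:
– Discretization into buckets
– Feature crosses
– Hierarchical features to handle sparsity (bias-variance trade-off)
– Feature smoothing and transformation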
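COEC normalizes an ad's clicks by the clicks an average ad would have received in the same positions, which removes most of the position bias from raw CTR. A minimal sketch follows; the reference CTR table, the additive smoothing terms, and the neutral fallback value of 1.0 are assumptions for illustration.

```python
def coec(impressions, position_ctr, prior_clicks=0.0, prior_expected=0.0):
    """Clicks over expected clicks for one ad.

    impressions:  iterable of (position, clicked) pairs.
    position_ctr: dict position -> reference CTR of an average ad there.
    prior_clicks, prior_expected: additive smoothing terms (hypothetical),
        useful for ads with very few impressions.
    """
    clicks = prior_clicks
    expected = prior_expected
    for position, clicked in impressions:
        clicks += 1.0 if clicked else 0.0
        expected += position_ctr[position]
    if expected <= 0.0:
        return 1.0  # no evidence: fall back to "average" (an assumption)
    return clicks / expected


reference = {1: 0.10, 2: 0.05}
# Ten impressions in position 2, one click: raw CTR is 0.10, expected clicks 0.5.
log = [(2, i == 0) for i in range(10)]
raw_coec = coec(log, reference)
smoothed_coec = coec(log, reference, prior_clicks=1.0, prior_expected=1.0)
```

A COEC above 1 means the ad outperforms the positional average; the smoothed variant pulls sparse ads toward 1.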
training
• Training set
• Stratified sampling of positive and negative examples, to address the class-imbalance problem
• Instances: 1B
• Features: 10B
• Distributed training
– MPI (Baidu, Taobao)
– MapReduce (Google)
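One common way to handle the click/non-click imbalance is to keep all positives and only a fraction of the negatives. The sketch below shows that, plus a standard correction that maps predictions back to the original scale. The correction step is not on the slide and is included as an assumption about how such sampling is typically used; function names and the keep rate are hypothetical.

```python
import random


def downsample_negatives(examples, keep_rate, seed=0):
    """Keep every positive and each negative with probability keep_rate.

    examples: list of (features, label) with label in {0, 1}.
    """
    rng = random.Random(seed)
    return [(x, y) for x, y in examples if y == 1 or rng.random() < keep_rate]


def recalibrate(p, keep_rate):
    """Map a probability learned on downsampled data back to the full data.

    Dropping negatives inflates the odds by 1 / keep_rate, so the odds are
    scaled back: q = p / (p + (1 - p) / keep_rate).
    """
    return p / (p + (1.0 - p) / keep_rate)


data = [(i, 1 if i % 100 == 0 else 0) for i in range(10000)]  # 1% positives
sampled = downsample_negatives(data, keep_rate=0.1)
num_pos = sum(y for _, y in sampled)
num_neg = len(sampled) - num_pos
corrected = recalibrate(0.5, keep_rate=0.1)  # 0.5 on sampled data is about 0.09 on full data
```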
Evaluation
• Offline evaluation
– MSE, MAE
– AUC
• Online A/B test
– Layered experiment platform (Google, "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation")
– Hypothesis tests for normally or binomially distributed samples
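Two of the items above can be written down compactly. The sketch below computes AUC by direct pairwise comparison (fine for small samples, quadratic in general) and runs a two-proportion z-test on the CTRs of two A/B buckets using the normal approximation to the binomial. Function names and the example counts are hypothetical.

```python
import math


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return (z, two-sided p-value) for H0: CTR_A == CTR_B."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / views_a + 1.0 / views_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


offline_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
z, p_value = two_proportion_z_test(100, 10000, 150, 10000)
```

With these made-up counts (1.0% vs 1.5% CTR on 10,000 views each) the difference is significant at the 1% level; with smaller buckets the same relative lift often would not be.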
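Practice
• Real-time computation and performance issues
– Simple but effective candidate-set selection
– Precise scoring of the selected candidates
• Online learning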
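The two-step serving pattern above (cheap candidate selection, then precise scoring) can be sketched as follows. This is only an illustration under assumptions: the term-overlap filter, the top-by-bid cut, the field names, and the `predict_ctr` callback are all hypothetical stand-ins for whatever a real system uses.

```python
import heapq


def select_candidates(ads, query_terms, limit):
    """Cheap first pass: ads sharing a bid term with the query, top `limit` by bid."""
    matched = [ad for ad in ads if query_terms & ad["terms"]]
    return heapq.nlargest(limit, matched, key=lambda ad: ad["bid"])


def serve(ads, query_terms, predict_ctr, limit=100, slots=2):
    """Second pass: run the expensive CTR model only on the candidates."""
    candidates = select_candidates(ads, query_terms, limit)
    scored = [(predict_ctr(ad) * ad["bid"], ad["id"]) for ad in candidates]
    return [ad_id for _, ad_id in sorted(scored, reverse=True)[:slots]]


inventory = [
    {"id": "a", "terms": {"shoes"}, "bid": 1.0},
    {"id": "b", "terms": {"shoes", "boots"}, "bid": 2.0},
    {"id": "c", "terms": {"books"}, "bid": 5.0},
]
toy_ctr = {"a": 0.05, "b": 0.01, "c": 0.10}
shown = serve(inventory, {"shoes"}, lambda ad: toy_ctr[ad["id"]])
```

Ad `c` has the highest bid and CTR but never reaches the scoring step, because it does not match the query.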
Explore/Exploit
• Ads with a low mean but high variance should still be given impressions
• E.g. consider 2 ads (same bids)
– Goal: select the most popular
– CTR1 ~ (mean=.01, var=.1), CTR2 ~ (mean=.05, var~0)
[Figure: probability density of the CTR estimates for Ad 1 and Ad 2]
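Common E&E algorithms
• Upper confidence bound policy (UCB)
– Mean + uncertainty estimate
– mean + k * sd(estimator)
• Thompson sampling
– Draw a random sample from the posterior; well suited to Bayesian methods
• Issues
– The ad inventory is huge, so exploration is very costly
– Unlike the classical multi-armed bandit problem, the ad set is dynamic and several ads are selected on each request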
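Both policies can be sketched on top of a Beta posterior over each ad's CTR. The uniform Beta(1, 1) prior, the value k = 2, and the click counts below are assumptions for illustration; the sketch also picks a single ad per request, so it does not address the multi-slot and dynamic-inventory issues listed above.

```python
import math
import random


def posterior(clicks, views):
    """Beta(clicks + 1, non-clicks + 1) posterior under a uniform prior (an assumption)."""
    return clicks + 1.0, views - clicks + 1.0


def ucb_score(clicks, views, k=2.0):
    """Posterior mean plus k posterior standard deviations."""
    a, b = posterior(clicks, views)
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return mean + k * sd


def thompson_score(clicks, views, rng):
    """One random draw from the posterior; the ad with the largest draw is shown."""
    a, b = posterior(clicks, views)
    return rng.betavariate(a, b)


def choose(stats, score):
    """stats: dict ad_id -> (clicks, views). Return the ad with the highest score."""
    return max(stats, key=lambda ad_id: score(*stats[ad_id]))


# ad1 is new and uncertain; ad2 has a well-estimated CTR of about 5%.
stats = {"ad1": (0, 10), "ad2": (500, 10000)}
ucb_pick = choose(stats, ucb_score)
rng = random.Random(0)
ts_picks = [choose(stats, lambda c, v: thompson_score(c, v, rng))
            for _ in range(1000)]
```

UCB deterministically favors the uncertain ad until its bound shrinks, whereas Thompson sampling splits traffic between the two ads in proportion to how plausible it is that each one is the better ad.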
Advertiser’s perspective
• Keyword selection
• Bid optimization
• Smart pricing
• Anti-fraud
• Impression forecasting: time series
• Smooth delivery: allocation algorithms
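CVR prediction
• Uses:
– Smart pricing: external traffic varies widely in quality, and advertisers have neither the time nor the ability to bid separately for each publisher, so bids must be adjusted automatically according to click value (Google, "smart pricing grows the pie") to protect advertiser ROI
– DSP: real-time bidding
– Rank function under CPA: CTR * CVR * bid
• Approach: similar to CTR prediction, but harder
– Conversion data is hard to obtain and even sparser
– Different advertisers define conversions differently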
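The sketch below illustrates two of the uses listed above. The discount rule in `smart_price` (scale the bid by the publisher's conversion rate relative to a baseline, never above the original bid) is a hypothetical illustration of the idea of pricing by click value; it is not Google's actual formula, and all names and numbers are assumptions.

```python
def smart_price(bid, cvr_publisher, cvr_baseline):
    """Discount a CPC bid on traffic that converts worse than the baseline.

    Hypothetical rule: price = bid * min(1, cvr_publisher / cvr_baseline).
    """
    if cvr_baseline <= 0.0:
        return bid  # no baseline available: leave the bid unchanged
    return bid * min(1.0, cvr_publisher / cvr_baseline)


def cpa_rank_score(ctr, cvr, cpa_bid):
    """Expected revenue per impression under CPA: CTR * CVR * bid."""
    return ctr * cvr * cpa_bid


discounted = smart_price(1.0, cvr_publisher=0.01, cvr_baseline=0.02)
unchanged = smart_price(1.0, cvr_publisher=0.04, cvr_baseline=0.02)
score = cpa_rank_score(ctr=0.02, cvr=0.1, cpa_bid=50.0)
```

Here a click from a publisher that converts at half the baseline rate is charged half the bid, which keeps the advertiser's cost per conversion roughly constant across publishers.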
User’s perspective
• User fatigue
• User privacy
• Behavioral targeting / retargeting
• Query intent
• Low-quality ad detection (Google, "Detecting Adversarial Advertisements in the Wild")
Publisher’s perspective
• Revenue
• User engagement
Thank you
