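Applications of Machine Learning in Internet Advertising
庄宝童

Agenda
• Introduction
• Machine learning applications
  – Common utility
  – Advertiser
  – Publisher
  – User
• Summary

Why do we need Internet advertising?
• Traffic (users) is a key asset of an Internet company
• Internet content follows a free model, so traffic has to be monetized to sustain operations
• Advertising as a share of revenue:
  – Google: 95% (2012, http://investor.google.com/financial/tables.html)
  – Facebook: 83% (2011)
  – Baidu: ?
  – Alibaba: ?
• Characteristics: results are quantifiable and trackable, little involvement from operations and sales staff, low cost per impression
• For an Internet advertising company, it is an ideal "money-printing machine" business model (吴军, 《浪潮之巅》)

What kind of advertising do we need?
"Find the best match between a given user in a given context and a suitable advertisement" -- Andrei Broder and Dr. Vanja 2011

[Figure: ad-serving ecosystem. Labels: User, Page, Publisher, Ad Network, Advertisers, Ads, Bids, Auction, Statistical model, Response rates (click, conversion, ad-view), conversion. Selection rule: select argmax f(bid, rate), pick best ads.]

Players in the ecosystem
• Publisher's utility: revenue, user engagement
• Advertiser's utility: ROI
• User's utility: relevance

Mechanism design
• Contract pricing (futures market), billed by CPM or CPT
• Auction pricing (spot market)
  – GFP (generalized first price)
  – GSP (generalized second price)
  – VCG (Vickrey-Clarke-Groves)
• Billing models
  – CPM (cost per mille impressions): lowest risk for the publisher, e.g. brand advertising on Yahoo and Sina
  – CPC (cost per click): risk is shared between publisher and advertiser; Google AdWords, Baidu Fengchao (百度凤巢) and most other systems fall into this category
  – CPA (cost per action): lowest risk for the advertiser, e.g. Taobaoke (淘宝客)

Ranking functions for CPC
• Bid ranking: bid
  – Originated at goto.com (the predecessor of Overture, later acquired by Yahoo)
• Revenue ranking: CTR * bid
  – Pioneered by Google
  – Core problem: CTR prediction

Model
P(click | user, ad, context)
• ad: creative, bid terms, landing page, campaign, advertiser, format (text/image/video), size, etc.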
• user: cookie, demographics, geo, behavioral, activity history
• context: query, publisher, page content, session, time

Algorithms
• Logistic regression + feature engineering (Google, Yahoo, Baidu, Facebook, etc.)
• Microsoft: Bayesian probit regression
• Google: boosting, http://users.soe.ucsc.edu/~niejiazhong/slides/chandra.pdf
• Taobao: mixture of logistic regression
• Trends: big data + nonlinear models / feature learning

Challenges
• Sparsity: use natural hierarchies or autogenerated hierarchies
• Missing data
• Bias: position, ad category, etc.
• Dynamic / seasonal effects
• Spam / noisy data

Features
• Features:
  – Click feedback features (COEC, clicks over expected clicks)
  – Query features
  – Query-ad text matching features
• Preprocessing:
  – Discretization (binning)
  – Feature crossing
  – Hierarchical features to handle sparsity (bias-variance trade-off)
  – Feature smoothing and transformation

Training
• Training set
• Stratified sampling of positive and negative examples
  – the imbalanced-training problem
• Instances: 1B
• Features: 10B
• Distributed training
  – MPI (Baidu, Taobao)
  – MapReduce (Google)

Evaluation
• Offline evaluation
  – MSE, MAE
  – AUC
• Online A/B test
  – Layered experiment platform (Google, "Overlapping Experiment Infrastructure: More, Better, Faster Experimentation")
  – Hypothesis tests for normally / binomially distributed samples

In practice
• Real-time computation and performance issues
  – Simple, effective candidate-set selection
  – Exact computation
• Online learning

Explore/Exploit
• Ads with a low mean but high variance should still be given a chance to be shown
• E.g. consider 2 ads (same bids)
  – Goal: select the most popular
  – CTR1 ~ (mean=.01, var=.1), CTR2 ~ (mean=.05, var~0)

[Figure: probability density of CTR for Ad 1 and Ad 2]

Common E&E algorithms
• Upper confidence bound (UCB) policy
  – Mean + uncertainty estimate
  – mean + k * sd(estimator)
• Thompson sampling
  – Draw a random sample from the posterior; a natural fit for Bayesian algorithms
• Problems
  – The ad set is huge, so exploration is too costly
  – Unlike the classical multi-armed bandit problem, the ad set is dynamic and several ads are selected at a time

Advertiser's perspective
• Keyword selection
• Bid optimization
• Smart pricing
• Anti-fraud
• Impression forecasting: time series
• Smooth delivery: allocation algorithms

CVR prediction
• Uses:
  – Smart pricing: external traffic varies enormously, and advertisers have neither the time nor the ability to set bids per publisher, so bids need to be adjusted automatically according to click value (Google, "smart pricing grows the pie") to protect the advertiser's ROI
  – DSP: real-time bidding
  – Rank function under the CPA model: ctr * cvr * bid
• Approach: similar to CTR prediction, but harder
  – Conversion data is hard to obtain and even sparser
  – Different advertisers define conversion differently

User's perspective
• User fatigue
• User privacy
• Behavioral targeting / retargeting
• Query intent
• Low-quality ad detection (Google, "Detecting Adversarial Advertisements in the Wild")

Publisher's perspective
• Revenue
• User engagement

Thank you
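
Appendix sketch: ranking functions
The slides list three ranking rules: bid ranking (bid), revenue ranking (CTR * bid) and the CPA rank function (ctr * cvr * bid). The following is a minimal Python sketch of how they could order the same candidates differently. The `Ad` class, the helper names and all numbers are illustrative assumptions, not taken from any production system; the predicted CTR and CVR are supplied by hand here, where a real system would get them from a model.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    """Hypothetical ad record; the field names are assumptions for this sketch."""
    name: str
    bid: float        # advertiser's bid
    ctr: float        # predicted P(click | user, ad, context)
    cvr: float = 1.0  # predicted P(conversion | click)


def bid_rank(ad: Ad) -> float:
    # Bid ranking: order by bid alone.
    return ad.bid


def revenue_rank(ad: Ad) -> float:
    # Revenue ranking: expected revenue per impression under CPC.
    return ad.ctr * ad.bid


def cpa_rank(ad: Ad) -> float:
    # CPA rank function: expected revenue per impression when billing per action.
    return ad.ctr * ad.cvr * ad.bid


def pick_best(ads, score):
    # "Select argmax f(bid, rate)": return the ad with the highest score.
    return max(ads, key=score)


# Made-up candidates: "a" bids more, "b" is clicked more often.
ADS = [
    Ad("a", bid=2.0, ctr=0.01, cvr=0.10),
    Ad("b", bid=1.0, ctr=0.05, cvr=0.01),
]

if __name__ == "__main__":
    for rule in (bid_rank, revenue_rank, cpa_rank):
        print(rule.__name__, "->", pick_best(ADS, rule).name)
```

With these made-up numbers, bid ranking favors the higher bidder, revenue ranking favors the ad with the higher expected revenue per impression, and the CPA rule can reverse the order again once conversion rates differ. That is why the quality of the CTR and CVR estimates matters so much.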
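
Appendix sketch: explore/exploit
The two E&E policies named in the slides, UCB (mean + k * sd(estimator)) and Thompson sampling (draw from the posterior), can be sketched for a two-ad case as follows. This is a simplified illustration under stated assumptions: each ad's CTR gets a Beta posterior with a uniform Beta(1, 1) prior, the click and view counts are invented, and a single ad is chosen per request. The slides do not specify a posterior family, and they note that real ad sets are dynamic with several ads shown at once.

```python
import math
import random


def beta_mean_sd(clicks, views, a0=1.0, b0=1.0):
    """Posterior mean and standard deviation of CTR under a Beta(a0, b0) prior."""
    a = a0 + clicks
    b = b0 + (views - clicks)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


def ucb_pick(stats, k=2.0):
    """UCB policy: choose the ad maximizing mean + k * sd."""
    def score(name):
        mean, sd = beta_mean_sd(*stats[name])
        return mean + k * sd
    return max(stats, key=score)


def thompson_pick(stats, rng):
    """Thompson sampling: draw one CTR per ad from its posterior, take the argmax."""
    def draw(name):
        clicks, views = stats[name]
        return rng.betavariate(1.0 + clicks, 1.0 + views - clicks)
    return max(stats, key=draw)


# Invented counts as (clicks, views): ad1 is barely observed, ad2 is well measured.
STATS = {"ad1": (0, 20), "ad2": (500, 10000)}

if __name__ == "__main__":
    rng = random.Random(0)
    print("greedy (k=0):", ucb_pick(STATS, k=0.0))
    print("UCB (k=2):   ", ucb_pick(STATS, k=2.0))
    picks = [thompson_pick(STATS, rng) for _ in range(1000)]
    print("Thompson share of ad1:", picks.count("ad1") / len(picks))
```

With k = 0 the rule is purely greedy and keeps showing the well-measured ad. A positive k gives the uncertain ad a chance, and Thompson sampling splits traffic at random in proportion to how plausible it is that each ad is the better one.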