Template to create a scientific poster


User Modeling in Search Engine Logs
Hongning Wang, Advisor: ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
{wang296,czhai}@Illinois.edu
A Non-parametric Bayesian Approach [WSDM'14]
In this work, we study the problem of user modeling in the search log data and propose a generative model,
dpRank, within a non-parametric Bayesian framework. By postulating generative assumptions about a user's
search behaviors, dpRank identifies each individual user's latent search interests and his/her distinct result
preferences in a joint manner. Experimental results on a large-scale news search log data set validate the
effectiveness of the proposed approach, which not only provides in-depth understanding of a user's search
intents but also benefits a variety of personalized applications.
Methods
Priors: $\mu_{kt} \sim N(\mu_0, \sigma_0^2)$, $\sigma_{kt}^2 \sim \mathrm{Gamma}(\alpha_0, \beta_0)$, $\beta_{kv} \sim N(0, a_0^2)$

A Ranking Model Adaptation Approach [SIGIR'13]
In this work, we propose a general ranking model adaptation framework for personalized search. The
proposed framework quickly learns to apply a series of linear transformations, e.g., scaling and shifting,
over the parameters of the given global ranking model such that the adapted model can better fit each
individual user's search preferences. Extensive experimentation based on a large set of search logs from
a major commercial Web search engine confirms the effectiveness of the proposed method compared to
several state-of-the-art ranking model adaptation methods.
Methods
• Adjust the generic ranking model's parameters with respect to each individual user's
ranking preferences
Figure (dpRank model; axes: f1, f2): a Dirichlet Process Prior over latent user groups (Group 1, …, Group k, …), each with its own query distribution p(Q) and parameters $(\mu_1, \sigma_1^2, \beta_1)$, …, $(\mu_k, \sigma_k^2, \beta_k)$, …, $(\mu_c, \sigma_c^2, \beta_c)$.
Latent User Groups: $\{\pi_k\}_{k=1}^{\infty} \sim DP(\gamma, \eta)$
Modeling of search interest: $q_i \sim N(\mu_k, \sigma_k^2 I)$
Modeling of result preferences: $p(D \mid q_i) = \prod_{y_{is} > y_{it}} \frac{1}{1 + \exp(-\beta_k^{T}(d_{is} - d_{it}))}$
Individual level: characterize user's own interest

Figure (ranking model adaptation; axes: x, y; data: Clicks): the adapted ranking function is $f^u(x) = (A^u \tilde{w}^s)^{T} x$, with
$A^u = \begin{pmatrix} a^u_{g(1)} & 0 & \cdots & 0 & b^u_{g(1)} \\ 0 & a^u_{g(2)} & \cdots & 0 & b^u_{g(2)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & a^u_{g(V)} & b^u_{g(V)} \end{pmatrix}$
Number of adaptation parameters: $O(V^2)$ for a full matrix $A^u$, $O(V)$ for the structured form above.
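The structured matrix $A^u$ reduces to one scaling and one shifting coefficient per feature group. The following is a minimal Python sketch of that idea, not the authors' implementation: the function names `adapted_weights` and `score`, the group assignment `groups`, and all numeric values are hypothetical and chosen only for illustration.

```python
def adapted_weights(w_global, a, b, groups):
    """Per-feature adapted weight a[g(v)] * w_v + b[g(v)].

    This equals the product A^u (w^s, 1) for the structured A^u
    (diagonal scaling entries plus a final shifting column).
    """
    return [a[g] * w + b[g] for w, g in zip(w_global, groups)]


def score(weights, x):
    """Linear ranking score: inner product of weights and features."""
    return sum(w * xi for w, xi in zip(weights, x))


# Hypothetical global model with V = 3 features in two groups.
w_s = [0.5, -1.0, 2.0]
groups = [0, 0, 1]          # g(v): feature index -> group index (assumed)
a_u = [1.0, 2.0]            # per-group scaling for user u (assumed)
b_u = [0.0, 0.5]            # per-group shifting for user u (assumed)

w_u = adapted_weights(w_s, a_u, b_u, groups)   # [0.5, -1.0, 4.5]
x = [1.0, 1.0, 1.0]
global_score = score(w_s, x)                   # 1.5
user_score = score(w_u, x)                     # 4.0
```

Setting every scaling coefficient to 1 and every shifting coefficient to 0 recovers the global model, so in this sketch a user with no adaptation data simply falls back to the shared ranker.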
• Linear regression based model adaptation
  Global ranking model: $f(x) = w^{T} x$
  $\min_{A^u} L_{adapt}(A^u) = L(Q^u; f^u) + \lambda R(A^u)$, where $f^u(x) = (A^u \tilde{w}^s)^{T} x$ and $\tilde{w}^s = (w^s, 1)$
  • $L(Q^u; f^u)$: loss function from any linear learning-to-rank algorithm, e.g., RankNet, LambdaRank, RankSVM
  • $R(A^u)$: complexity of adaptation
  • Induced optimization problem in the same complexity as the original problem

Example search log of one user:
  Timestamp             Query
  5/29/2012 14:06:04    coney island Cincinnati
  5/30/2012 12:12:04    drive direction to coney island
  5/31/2012 19:40:38    motel 6 locations
  5/31/2012 19:45:04    Cincinnati hotels near coney island

Figure (dpRank model, continued; axis: f2): per-user group proportions $\pi^u_1$, $\pi^u_2$, $\pi^u_3$, … over the latent groups (Group c).
Aggregated level: information shared by all the users

• Instantiation of RankSVM
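To make the adaptation objective concrete, here is a minimal Python sketch of $L(Q^u; f^u) + \lambda R(A^u)$ with a RankSVM-style pairwise hinge loss. It is an illustration under stated assumptions, not the poster's exact formulation: the regularizer (a squared penalty on how far the scaling moves from 1 and the shifting from 0, weighted by a hypothetical `sigma`), the function name `adapt_objective`, and the toy data are all assumed.

```python
def adapt_objective(a, b, w_global, groups, pairs, lam=1.0, sigma=1.0):
    """Pairwise hinge loss of the adapted model plus an adaptation penalty.

    pairs: list of (x_preferred, x_other) feature vectors from one user's clicks.
    """
    # Adapted weights: a[g(v)] * w_v + b[g(v)] for every feature v.
    w_u = [a[g] * w + b[g] for w, g in zip(w_global, groups)]

    loss = 0.0
    for x_pos, x_neg in pairs:
        margin = sum(w * (p - n) for w, p, n in zip(w_u, x_pos, x_neg))
        loss += max(0.0, 1.0 - margin)   # hinge on the preference pair

    # ASSUMED regularizer: stay close to the global model (a = 1, b = 0).
    reg = 0.5 * sum((ak - 1.0) ** 2 for ak in a) \
        + 0.5 * sigma * sum(bk ** 2 for bk in b)
    return loss + lam * reg


# Toy example: the global model favors feature 0, this user prefers feature 1.
w_s = [1.0, 0.0]
groups = [0, 1]
pairs = [([0.0, 1.0], [1.0, 0.0])]

obj_global = adapt_objective([1.0, 1.0], [0.0, 0.0], w_s, groups, pairs)  # 2.0
obj_shift = adapt_objective([1.0, 1.0], [0.0, 1.0], w_s, groups, pairs)   # 1.5
```

In this toy case, shifting the second group's weight lowers the objective even after paying the regularization cost, which is the trade-off the $\lambda R(A^u)$ term is meant to control.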
A fully generative model for exploring users' search behaviors
1. Draw latent user groups from DP:
   $\mu_{kt} \sim N(\mu_0, \sigma_0^2)$, $\sigma_{kt}^2 \sim \mathrm{Gamma}(\alpha_0, \beta_0)$, $\beta_{kv} \sim N(0, a_0^2)$
2. Draw group membership for each user from DP:
   $\{\pi_k\}_{k=1}^{\infty} \sim DP(\gamma, \eta)$
3. To generate a query for user u:
   3.1 Draw a latent user group c: $c_i \sim \pi_u$
   3.2 Draw query $q_i$ for user u accordingly: $q_i \sim N(\mu_k, \sigma_k^2 I)$, with $k = c_i$
   3.3 Draw click preferences for $q_i$ accordingly: $p(D_i \mid q_i) = \prod_{y_{is} > y_{it}} \frac{1}{1 + \exp(-\beta_k^{T}(d_{is} - d_{it}))}$
Gibbs sampling for posterior inference
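The generative steps can be illustrated by forward sampling. The sketch below is a simplified Python illustration, not the dpRank implementation: it approximates the DP draw with truncated stick-breaking at a hypothetical truncation level `K`, treats $\beta_0$ as the scale parameter of the Gamma distribution, uses a separate variance per query dimension, and every hyperparameter value is an assumption.

```python
import math
import random


def stick_breaking(gamma, K, rng):
    """Truncated stick-breaking weights (an approximation to a DP draw)."""
    weights, remaining = [], 1.0
    for _ in range(K - 1):
        frac = rng.betavariate(1.0, gamma)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)  # last stick takes the leftover mass
    return weights


def draw_group(T, V, rng, mu0=0.0, sigma0=1.0, alpha0=2.0, beta0=1.0, a0=1.0):
    """Step 1: draw (mu_k, sigma_k^2, beta_k) for one latent group."""
    mu = [rng.gauss(mu0, sigma0) for _ in range(T)]
    var = [rng.gammavariate(alpha0, beta0) for _ in range(T)]  # beta0 as scale (assumed)
    beta = [rng.gauss(0.0, a0) for _ in range(V)]
    return mu, var, beta


def pref_prob(beta, d_s, d_t):
    """Logistic probability that document s is preferred over document t."""
    z = sum(b * (s - t) for b, s, t in zip(beta, d_s, d_t))
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
K, T, V = 5, 3, 4                                   # truncation and dimensions (assumed)
groups = [draw_group(T, V, rng) for _ in range(K)]  # step 1
pi_u = stick_breaking(1.0, K, rng)                  # step 2 (truncated)
c_i = rng.choices(range(K), weights=pi_u)[0]        # step 3.1
mu, var, beta = groups[c_i]
q_i = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]       # step 3.2
p = pref_prob(beta, [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])   # step 3.3, one pair
```

Posterior inference runs in the opposite direction: given observed queries and click preferences, Gibbs sampling would resample the group assignments and group parameters, which this forward sketch does not attempt.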
• Document ranking
  • $s(d_{jt}, q_j) = \frac{1}{|S|} \sum_{s \in S} \sum_{k} p^{s}(c_j = k \mid q_j)\, (\beta_k^{s})^{T} d_{jt}$

Experimental Results
• Yahoo! News search logs
  • May to July, 2011
  • 65 ranking features for each Query-Document pair
• Query distribution in latent user groups (top ranked queries, one line per group)
  today in history, nascar 2011 schedule, today history, this day in history
  miami heat, los angeles lakers, liverpool football club, arsenal football, nfl lockout
  los angeles lakers, arsenal football, the dark knight rises, transformers 3, manchester united
  the titanic, the bachelorette, cars 2, hangover 2, the voice
  tree of life, game of thrones, sonic the hedgehog, world of warcraft, mtv awards 2011
  casey anthony trial, casey anthony jurors, casey anthony, crude oil prices, air france flight 447
  fake tupac story, pbs hackers, alaska earthquake, southwest pilot, arizona wildfires
  selena gomez, lady gaga, britney spears, jennifer aniston, taylor swift
  iran, china, libya, vietnam, Syria
  joplin missing, apple icloud, sony hackers, google subpoena, ford transmission
• Click preferences in latent user groups
  Figure: click preferences per latent user group; axes: Feature ID, Group ID; labeled features: site authority, proximity in title, query match in title, document age.

  Method    MAP      P@1      P@3      MRR
  URSVM     0.487    0.298    0.220    0.501
  GRSVM     0.616    0.446    0.283    0.632
  TRSVM     0.622    0.459    0.283    0.638
  IRSVM     0.617    0.449    0.281    0.632
  dpRank    0.642    0.485    0.290    0.658

RankSVM instantiation of the adaptation framework
Pairwise ranking model:
  $\min_{w, \xi_{ijl}} \frac{1}{2}\|w\|^2 + C \sum_{q_i} \sum_{j,l} \xi_{ijl}$
  s.t. $w^{T} \Delta x_{ijl} \geq 1 - \xi_{ijl}$, $\forall q_i, x_{ij}, x_{il}$
       $\xi_{ijl} \geq 0$
  where $y_{ij} > y_{il}$ and $\Delta x_{ijl} = x_{ij} - x_{il}$
Margin rescaling and non-linear kernels:
  $\max_{\alpha} \sum_{t} (1 - f(x_t))\,\alpha_t - \frac{1}{2}\alpha^{T}\big(K_1(x, x) + K_2(x, x)\big)\alpha$
  s.t. $0 \leq \alpha_t \leq C$, $\forall t$
  where $K_1(x_t, x_r) = \sum_{k} \big(\sum_{g(v)=k} w_v^s x_{tv}\big)\big(\sum_{g(v)=k} w_v^s x_{rv}\big)$
  and $K_2(x_t, x_r) = \frac{1}{\sigma} \sum_{k} \big(\sum_{g(v)=k} x_{tv}\big)\big(\sum_{g(v)=k} x_{rv}\big)$

Experimental Results
• Bing query log: May 27, 2012 – May 31, 2012
• 1830 ranking features

                    # Users    # Queries    # Documents
  Annotation Set    -          49,782       2,320,711
  User Set          34,827     187,484      1,744,969

  User Class    Queries per user      % Population
  Heavy         [10, ∞) queries       6.8
  Medium        [5, 10) queries       14.9
  Light         (0, 5) queries        78.3

• Adaptation efficiency
  Figure: adaptation efficiency; labeled curves: Global model, per-user basis adaptation baseline.
• Query-level improvement against global model
• User-level improvement against global model

  User Class    Method    ΔMAP       ΔP@1       ΔP@3       ΔMRR
  Heavy         RA        0.1843     0.3309     0.0120     0.1832
  Heavy         Cross     0.1998     0.3523     0.0182     0.1994
  Medium        RA        0.1102     0.2129     0.0025     0.1103
  Medium        Cross     0.1494     0.2561     0.0208     0.1500
  Light         RA        0.0042     0.0575     -0.0221    0.0041
  Light         Cross     0.0403*    0.0894*    -0.0021    0.0406*
  * Indicates p-value<0.01
Use cross-training to determine feature grouping
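For readers unfamiliar with the reported metrics, the following Python sketch computes P@k, reciprocal rank, and average precision for one ranked result list; MRR and MAP are the means of the last two over all queries. These are the standard textbook definitions. Treating each click as a binary relevance label is an assumption here, and the posters do not state how relevance was derived for evaluation.

```python
def precision_at_k(rels, k):
    """Fraction of the top-k results that are relevant."""
    return sum(rels[:k]) / k


def reciprocal_rank(rels):
    """1 / rank of the first relevant result, or 0 if there is none."""
    for rank, r in enumerate(rels, start=1):
        if r:
            return 1.0 / rank
    return 0.0


def average_precision(rels):
    """Mean of precision values taken at each relevant position."""
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean(values):
    return sum(values) / len(values)


# Hypothetical ranked lists: 1 = clicked (assumed relevant), 0 = not clicked.
queries = [[0, 1, 0, 1], [1, 0, 0, 0]]

p_at_1 = mean([precision_at_k(r, 1) for r in queries])   # 0.5
mrr = mean([reciprocal_rank(r) for r in queries])        # 0.75
map_score = mean([average_precision(r) for r in queries])  # 0.75
```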