Collaborative Ordinal Regression
Shipeng Yu
Joint work with Kai Yu, Volker Tresp
and Hans-Peter Kriegel
University of Munich, Germany
Siemens Corporate Technology
[email protected]
Motivations
[Figure: a table of movies (Superman, The Pianist, Star Wars, The Matrix, The Godfather, American Beauty, …) with columns for features (Genre, Actors, Directors), text descriptions, and a user's ordinal ratings on a scale from Very Dislike to Very Like; some ratings are missing ('?'). Ordinal regression predicts the missing ratings from the features.]
2
Motivations (Cont.)
[Figure: the same movie table, now with rating columns for many users; each user has rated only a few movies, so most entries are '?'. Collaborative ordinal regression fills in the missing ratings by learning all users' rating functions jointly.]
3
Outline
Motivations
Ranking Problem
Bayesian Framework for Ordinal Regression
Collaborative Ordinal Regression
Learning and Inference
Experiments
Conclusion and Extensions
4
Ranking Problem
Goal: Assign ranks to objects
Ordinal Regression: map inputs x_1, x_2, …, x_n ∈ R^d to ranks y ∈ {1 < 2 < ⋯ < r}, from lowest rank to highest rank
Preference Learning: learn an ordering of the objects, e.g. x_n ≻ ⋯ ≻ x_1 ≻ x_2 ≻ ⋯, from a matrix of pairwise comparisons
Different from classification/regression problems
Binary classification: Has only 2 labels
Multi-class classification: Ignores ordering property
Regression: Only deals with real outputs
5
Ordinal Regression
Goal: Assign ordered labels to objects
A latent function f: R^d → R maps each input x_i to a score f(x_i); thresholds b_0 < b_1 < ⋯ < b_r partition the real line into r intervals, and x_i receives label j when b_{j−1} < f(x_i) ≤ b_j
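As a concrete illustration of the threshold rule, here is a minimal sketch in Python; the helper name, the thresholds, and the latent values are made up for the example, not taken from the talk.

```python
import numpy as np

# Minimal sketch of the threshold rule (names and values are illustrative):
# item x_i gets label j when b_{j-1} < f(x_i) <= b_j.
def ordinal_label(f_x, b):
    """Map a latent value to a 1-indexed rank via sorted inner thresholds b."""
    return int(np.searchsorted(b, f_x)) + 1

b = np.array([-1.0, 0.0, 1.0])             # inner thresholds b_1..b_{r-1}, r = 4
latent = np.array([-2.3, -0.5, 0.2, 1.7])  # example values of f(x_i)
labels = [ordinal_label(f, b) for f in latent]  # [1, 2, 3, 4]
```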
Applications
User preference prediction
Web ranking for search engines
…
6
One-task vs Multi-task
Multiple ranking functions f_1, f_2, …, f_m are defined on the same inputs x_1, x_2, …, x_n ∈ R^d
Each function only ranked part of the data
Different ranking functions are correlated
Common in real world problems
Collaborative filtering: preference learning for multiple users
Web ranking: ranking of web pages for different queries
Question: How to learn related ranking tasks jointly?
7
Outline
Motivations
Ranking Problem
Bayesian Framework for Ordinal Regression
Collaborative Ordinal Regression
Learning and Inference
Experiments
Conclusion and Extensions
8
Bayesian Ordinal Regression
Conditional model on ranking outputs
Ranking likelihood: conditional on the latent function
P(y | X, f, θ) = P(y | f(x_1), …, f(x_n), θ) = P(y | f, θ)
Prior: Gaussian Process prior for the latent function
f ∼ N(f; h, K)
Marginal ranking likelihood: integrate out the latent function values
P(y | X, θ, h, K) = ∫ P(y | f, θ) P(f | h, K) df
Ordinal regression likelihood
P(y | f, θ) = ∏_i P(y_i | f(x_i), θ)
9
Bayesian Ordinal Regression (1)
Need to define the ranking likelihood P(y_i | f(x_i), θ)
Example Model (1): GP Regression (GPR)
Assume a Gaussian form
Regression on the ranking label directly
P(y_i | f(x_i), θ) ∝ N(y_i; f(x_i), σ²)
[Figure: Gaussian likelihood curves centered on the rank labels.]
10
Bayesian Ordinal Regression (2)
Need to define the ranking likelihood P(y_i | f(x_i), θ)
Example Model (2): GP Ordinal Regression (GPOR) (Chu & Ghahramani, 2005)
A probit ranking likelihood
P(y_i | f(x_i), θ) = Φ((b_{y_i} − f(x_i)) / σ) − Φ((b_{y_i − 1} − f(x_i)) / σ)
Assign labels based on the surrounding area
[Figure: the real line partitioned by thresholds b_1, b_2, b_3, b_4 into regions labeled 1–5.]
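The GPOR probit likelihood is easy to compute directly; in the sketch below the thresholds and noise level are illustrative values, not taken from the slides.

```python
import math

# Sketch of the GPOR probit likelihood; thresholds and sigma are
# illustrative, not from the slides.
def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gpor_likelihood(y, f_x, b, sigma):
    """P(y | f(x)) = Phi((b_y - f_x)/sigma) - Phi((b_{y-1} - f_x)/sigma);
    b is padded with -inf and +inf so every rank has two thresholds."""
    return Phi((b[y] - f_x) / sigma) - Phi((b[y - 1] - f_x) / sigma)

b = [-math.inf, -1.0, 0.0, 1.0, math.inf]   # thresholds for r = 4 ranks
probs = [gpor_likelihood(y, 0.2, b, 0.5) for y in range(1, 5)]
# probs sums to 1; rank 3 is most likely since f(x) = 0.2 lies in (0, 1]
```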
11
Outline
Motivations
Ranking Problem
Bayesian Framework for Ordinal Regression
Collaborative Ordinal Regression
Learning and Inference
Experiments
Conclusion and Extensions
12
Multi-task Setting?
Naïve approach 1: Learn a GP model for each task (no sharing of information between tasks)
Naïve approach 2: Fit one parametric kernel jointly (a single parametric kernel is too restrictive to fit all tasks)
The collaborative effects
Common preferences:
Functions share similar regression labels on some items
Similar variabilities:
Functions tend to have the same predictability on similar items
13
Collaborative Ordinal Regression
Hierarchical GP model for multi-task ordinal regression
mean function: models common preferences
covariance matrix: models similar variabilities
Both the mean function and the (non-stationary) covariance matrix are learned from data
[Figure: graphical model. A GP prior (h, K) over inputs x_1, x_2, …, x_n generates latent functions f_1, f_2, ⋯, f_m; each f_j produces a column y_j of ordinal rank labels through the ordinal regression likelihood, filling the users-by-items rating matrix.]
14
COR: The Model
Hierarchical Bayes model on functions
All the latent functions are sampled from the same GP prior
f_j ∼ N(f_j; h, K)
Allow different parameter settings θ_j for different tasks
P(D | X, Θ, h, K) = ∏_{j=1}^m P(y_j | X, θ_j, h, K) = ∏_{j=1}^m ∫ P(y_j | f_j, θ_j) P(f_j | h, K) df_j
We may only observe part of rank labels for each function
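The shared prior f_j ∼ N(h, K) can be illustrated by drawing several correlated task functions from one GP; the RBF base kernel, the sizes, and the variable names below are assumptions for the sketch, not from the talk.

```python
import numpy as np

# Sketch: draw m = 3 latent ranking functions from one shared GP prior
# f_j ~ N(h, K); inputs, kernel choice, and sizes are assumptions.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 2))            # n = 8 items with d = 2 features

def rbf_kernel(X, length=1.0):
    """Squared-exponential base kernel between all pairs of rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length ** 2))

K = rbf_kernel(X) + 1e-8 * np.eye(len(X))  # jitter keeps K positive definite
h = np.zeros(len(X))                        # shared mean function values
F = rng.multivariate_normal(h, K, size=3)   # rows are f_1, f_2, f_3
```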
15
COR: The Key Points
The GP prior connects all ordinal regression tasks
The lower level features are incorporated naturally
More general than pure collaborative filtering
We don’t fix a parametric form for the kernel
Model the first and second sufficient statistics
Instead we assign the conjugate (Normal-inverse-Wishart) prior
P(h, K) = N(h; h_0, (1/π) K) · IW(K; τ, K_0)
We can make predictions for new input data and
new tasks
16
Toy Problem (GPR Model)
[Figure: toy problem results with the GPR model. Panels show the mean rank labels, the learned mean function, the learned covariance matrix, and new-task predictions with the base (RBF) kernel versus the learned kernel.]
17
Outline
Motivations
Ranking Problem
Bayesian Framework for Ordinal Regression
Collaborative Ordinal Regression
Learning and Inference
Experiments
Conclusion and Extensions
18
Learning
Variational lower bound
log P(D | X, Θ, h, K) ≥ ∑_{j=1}^m ∫ Q(f_j) log [ P(y_j | f_j, θ_j) P(f_j | h, K) / Q(f_j) ] df_j
EM Learning
E-step: Approximate each posterior Q(f_j) as a Gaussian
Q(f_j) = N(f_j; f̂_j, K̂_j)
Estimate the mean vector and covariance matrix using EP
M-step: Fix Q(f_j) and maximize w.r.t. θ_j and (h, K)
19
E-step
The true posterior distribution factorizes:
Q(f) ∝ ∏_i P(y_i | f(x_i), θ) · P(f | h, K)
Approximate each likelihood term with a Gaussian factor t_k(X)
EP procedures
Deletion: Delete factor tk (X) from the approximated Gaussian
Moments matching: Match moments by adding true likelihood
Update: Update the factor tk (X)
Can be done analytically for the example models
For GPR model the EP step is exact
20
M-step
Update GP prior:
ĥ = (1 / (π + m)) ( π h_0 + ∑_{j=1}^m f̂_j )
K̂ = (1 / (τ + m)) ( π (ĥ − h_0)(ĥ − h_0)ᵀ + τ K_0 + ∑_{j=1}^m [ (f̂_j − ĥ)(f̂_j − ĥ)ᵀ + K̂_j ] )
Does not depend on the form of ranking likelihood
The conjugate prior corresponds to a smoothing term
Update likelihood parameter µj
Do it separately for each task
Have the same update equation as the single-task case
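The two closed-form prior updates translate directly into code; this is a sketch with synthetic posterior statistics, and the function name and array shapes are our own choices.

```python
import numpy as np

# Sketch of the closed-form M-step updates for (h, K); the posterior
# statistics below are synthetic, and all names are ours.
def m_step(f_hat, K_hat, h0, K0, pi, tau):
    """f_hat: (m, n) posterior means; K_hat: (m, n, n) posterior covariances."""
    m = f_hat.shape[0]
    h = (pi * h0 + f_hat.sum(axis=0)) / (pi + m)           # updated mean
    d0 = (h - h0)[:, None]
    S = sum(np.outer(f - h, f - h) + C for f, C in zip(f_hat, K_hat))
    K = (pi * (d0 @ d0.T) + tau * K0 + S) / (tau + m)      # updated kernel
    return h, K

rng = np.random.default_rng(1)
n, m = 4, 3
f_hat = rng.standard_normal((m, n))
K_hat = np.stack([np.eye(n)] * m)
h, K = m_step(f_hat, K_hat, np.zeros(n), np.eye(n), pi=1.0, tau=1.0)
```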
21
Inference
Ordinal Regression prediction for a test input x*:
P(y_j* | x*, D, X, θ̂_j, ĥ, K̂) = ∫ P(y_j* | f_j*, θ̂_j) P(f_j* | x*, D, X, ĥ, K̂) df_j*
The non-stationary kernel on test data is unknown!
Solution: work in the dual space (Yu et al. 2005)
Posterior: f_j ∼ N(f_j; f̂_j, K̂_j)
By the constraint f_j = K α_j, the posterior of α_j is N(α_j; K⁻¹ f̂_j, K⁻¹ K̂_j K⁻¹)
For test data we have f_j* = k*ᵀ α_j ∼ N(f_j*; k*ᵀ K⁻¹ f̂_j, k*ᵀ K⁻¹ K̂_j K⁻¹ k*)
22
Outline
Motivations
Ranking Problem
Bayesian Framework for Ordinal Regression
Collaborative Ordinal Regression
Learning and Inference
Experiments
Conclusion and Extensions
23
Experiments
Predict user ratings in movie data
MovieLens: 591 movies, 943 users
EachMovie: 1,075 movies, 72,916 users
19 features from the “Genre” part of each movie (binary)
23,753 features from online database (TF-IDF)
Experimental Settings
Pick the 100 users with the most ratings as “tasks”
Randomly choose 10, 20, or 50 ratings from each user for training
Base kernel: cosine similarity
24
Comparison Metrics
Ordinal Regression Evaluation
Mean absolute error (MAE): MAE(R̂) = (1/t) ∑_{i=1}^t |R̂(i) − R(i)|
Mean 0-1 error (MZOE): MZOE(R̂) = (1/t) ∑_{i=1}^t 1_{R̂(i) ≠ R(i)}
Use Macro & Micro average over multiple tasks
Ranking Evaluation
Normalized Discounted Cumulative Gain (NDCG): NDCG(R̂) ∝ ∑_{k=1}^t (2^{r(k)} − 1) / log(1 + k)
NDCG@10: Only count the top 10 ranked items
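The three metrics can be sketched as follows; the toy ratings are made up, and since the slide defines NDCG only up to a constant, the sketch normalizes by the ideal DCG, as is standard.

```python
import math

# Sketch of the three evaluation metrics; toy ratings are made up.
def mae(pred, true):
    """Mean absolute error between predicted and true ranks."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mzoe(pred, true):
    """Mean zero-one error: fraction of exact mismatches."""
    return sum(p != t for p, t in zip(pred, true)) / len(true)

def ndcg_at(ranked_scores, n=10):
    """DCG of the top-n items divided by the DCG of the ideal ordering."""
    def dcg(r):
        return sum((2 ** s - 1) / math.log(1 + k) for k, s in enumerate(r[:n], 1))
    ideal = dcg(sorted(ranked_scores, reverse=True))
    return dcg(ranked_scores) / ideal if ideal > 0 else 0.0

pred, true = [4, 3, 5, 2], [5, 3, 4, 2]   # predicted vs. true ratings
# mae(pred, true) -> 0.5, mzoe(pred, true) -> 0.5
```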
25
Results - MovieLens
N: Number of training items for each user
MMMF: Maximum Margin Matrix Factorization (Srebro et al 2005)
State-of-the-art collaborative filtering model
26
Results - EachMovie
N: Number of training items for each user
MMMF: Maximum Margin Matrix Factorization (Srebro et al 2005)
State-of-the-art collaborative filtering model
27
New Ranking Functions
Test on the remaining users of MovieLens
Use different kernels
The more users we use for training, the better kernel we obtain!
28
Observations
Collaborative models are always better than individual models
We can learn a good non-stationary kernel from users
GPR & CGPR are fast in training and robust in testing, since there is no approximation
GPOR & CGPOR are slow and sometimes overfit, due to the numerical M-step
We can use other ranking likelihoods P(y_i | f(x_i), θ)
Then we may need to do numerical integration in EP step
29
Outline
Motivations
Ranking Problem
Bayesian Framework for Ordinal Regression
Collaborative Ordinal Regression
Learning and Inference
Experiments
Conclusion and Extensions
30
Conclusion
A Bayesian framework for multi-task ordinal regression
An efficient EM-EP learning algorithm
COR is better than individual OR algorithms
COR is better than pure collaborative filtering
Experiments show very encouraging results
31
Extensions
The framework is applicable to preference learning
Collaborative version of GP preference learning (Chu & Ghahramani, 2005)
A probabilistic version of RankNet (Burges et al. 2005)
P(y_i ≻ y_j | f(x_i), f(x_j)) ∝ exp(f(x_i) − f(x_j)) / (1 + exp(f(x_i) − f(x_j)))
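The RankNet-style likelihood is just a logistic sigmoid of the score difference; a minimal sketch (the scores are made up):

```python
import math

# Sketch of the RankNet-style pairwise likelihood: the probability that
# item i is preferred over item j is a logistic sigmoid of the score gap.
def pref_prob(f_i, f_j):
    """P(y_i > y_j) = exp(f_i - f_j) / (1 + exp(f_i - f_j)) = sigmoid(f_i - f_j)."""
    return 1.0 / (1.0 + math.exp(-(f_i - f_j)))

p = pref_prob(1.2, 0.2)   # score gap 1.0, so p = sigmoid(1.0) ≈ 0.73
```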
GP mixture model for multi-task learning
Assign a Gaussian mixture model to each latent function
Prediction uses a linear combination of learned kernels
Connection to Dirichlet Processes
32
Thanks!
Questions?