Collaborative Ordinal Regression
Shipeng Yu
Joint work with Kai Yu, Volker Tresp
and Hans-Peter Kriegel
University of Munich, Germany
Siemens Corporate Technology
[email protected]
Motivations
[Figure: movies such as Superman, The Pianist, and Star Wars, each described by features (genre, actors, directors) and a text description; a user rates movies such as The Matrix, The Godfather, and American Beauty on an ordinal scale from "very dislike" to "very like". Predicting the missing ratings ("?") from the features is an ordinal regression problem.]
Motivations (Cont.)
[Figure: the same movies rated by many users; each user's rating column is observed only partially (many "?" entries). Learning all users' rating functions jointly, sharing information across users, is collaborative ordinal regression.]
Outline
- Motivations
- Ranking Problem
- Bayesian Framework for Ordinal Regression
- Collaborative Ordinal Regression
- Learning and Inference
- Experiments
- Conclusion and Extensions
Ranking Problem
- Goal: assign ranks to objects $x_1, x_2, \dots, x_n \in \mathbb{R}^d$
- Ordinal regression: map each object to a rank label $y \in \{1 < 2 < \cdots < r\}$, from lowest rank to highest rank
- Preference learning: learn an ordering over the objects, e.g. $x_n \succ \cdots \succ x_1 \succ x_2 \succ \cdots$, from pairwise preferences
[Figure: a rank-label table and a pairwise preference matrix over $x_1, \dots, x_n$.]
- Different from classification/regression problems:
  - Binary classification: has only 2 labels
  - Multi-class classification: ignores the ordering property
  - Regression: only deals with real-valued outputs
Ordinal Regression
- Goal: assign ordered labels to objects
- A latent function $f: \mathbb{R}^d \to \mathbb{R}$ maps each object $x_i$ to a real score $f(x_i)$
- Thresholds $b_0 < b_1 < \cdots < b_r$ partition the real line: object $x_i$ receives label $k$ when $b_{k-1} < f(x_i) \le b_k$ (see the sketch below)
- Applications:
  - User preference prediction
  - Web ranking for search engines
  - …
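As a small illustration, a minimal numpy sketch of how thresholds turn latent scores into ordinal labels (the threshold values are invented for the example):

```python
import numpy as np

# Illustrative thresholds b_0 < b_1 < ... < b_r (values invented here);
# b_0 = -inf and b_r = +inf so that every score falls in some interval.
b = np.array([-np.inf, -1.0, 0.5, 2.0, np.inf])  # r = 4 ordinal labels

def ordinal_label(f_x):
    """Return the label k (1..r) with b_{k-1} < f(x) <= b_k."""
    return np.searchsorted(b, f_x, side="left")

scores = np.array([-2.3, 0.1, 1.7, 3.0])
print(ordinal_label(scores))  # -> [1 2 3 4]
```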
One-task vs Multi-task
- Multi-task setting: several ranking functions $f_1, f_2, \dots, f_m$ over the same input space $\mathbb{R}^d$
  - Each function has ranked only part of the data
  - The different ranking functions are correlated
- Common in real-world problems
  - Collaborative filtering: preference learning for multiple users
  - Web ranking: ranking of web pages for different queries
- Question: how can we learn related ranking tasks jointly?
Bayesian Ordinal Regression
- Conditional model on ranking outputs
- Ranking likelihood: conditional on the latent function
  $P(\mathbf{y} \mid X, f, \theta) = P(\mathbf{y} \mid f(x_1), \dots, f(x_n), \theta) = P(\mathbf{y} \mid \mathbf{f}, \theta)$
- Prior: Gaussian process prior for the latent function
  $\mathbf{f} \sim \mathcal{N}(\mathbf{f}; \mathbf{h}, K)$
- Marginal ranking likelihood: integrate out the latent function values
  $P(\mathbf{y} \mid X, \theta, \mathbf{h}, K) = \int P(\mathbf{y} \mid \mathbf{f}, \theta)\, P(\mathbf{f} \mid \mathbf{h}, K)\, d\mathbf{f}$
- The ordinal regression likelihood factorizes over items:
  $P(\mathbf{y} \mid \mathbf{f}, \theta) = \prod_i P(y_i \mid f(x_i), \theta)$
Bayesian Ordinal Regression (1)
- Need to define the ranking likelihood $P(y_i \mid f(x_i), \theta)$
- Example model (1): GP Regression (GPR)
  - Assume a Gaussian form, regressing on the rank label directly (see the sketch below):
    $P(y_i \mid f(x_i), \theta) \propto \mathcal{N}(y_i; f(x_i), \sigma^2)$
[Figure: Gaussian likelihood curves centered on the rank labels 1 to 5.]
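A minimal numpy sketch of this Gaussian ranking likelihood (not the authors' code; the noise level sigma is an illustrative value):

```python
import numpy as np

def gpr_likelihood(y, f_x, sigma=0.5):
    """Gaussian ranking likelihood N(y; f(x), sigma^2): the ordinal
    label y is treated as a noisy real-valued observation of f(x)."""
    return np.exp(-0.5 * (y - f_x) ** 2 / sigma ** 2) / (sigma * np.sqrt(2 * np.pi))

# A latent score near 3 makes label 3 the most likely rating.
print([round(gpr_likelihood(y, f_x=2.9), 3) for y in range(1, 6)])
```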
Bayesian Ordinal Regression (2)
- Need to define the ranking likelihood $P(y_i \mid f(x_i), \theta)$
- Example model (2): GP Ordinal Regression (GPOR) (Chu & Ghahramani, 2005)
  - A probit ranking likelihood (sketched below):
    $P(y_i \mid f(x_i), \theta) = \Phi\!\left(\frac{b_{y_i} - f(x_i)}{\sigma}\right) - \Phi\!\left(\frac{b_{y_i - 1} - f(x_i)}{\sigma}\right)$
  - Assigns labels based on the probability mass between the surrounding thresholds
[Figure: the real line partitioned by thresholds $b_1, \dots, b_4$ into regions for labels 1 to 5.]
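A minimal sketch of this probit likelihood in plain scipy/numpy (the thresholds and sigma below are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import norm

# Illustrative thresholds (invented); b_0 = -inf, b_5 = +inf for 5 labels.
b = np.array([-np.inf, -1.5, -0.5, 0.5, 1.5, np.inf])

def gpor_likelihood(y, f_x, sigma=0.3):
    """Probit ordinal likelihood of Chu & Ghahramani (2005):
    P(y | f(x)) = Phi((b_y - f(x))/sigma) - Phi((b_{y-1} - f(x))/sigma)."""
    return norm.cdf((b[y] - f_x) / sigma) - norm.cdf((b[y - 1] - f_x) / sigma)

# Probabilities over the 5 labels for one latent score; they sum to 1.
probs = np.array([gpor_likelihood(y, f_x=0.2) for y in range(1, 6)])
print(probs.round(3), probs.sum())
```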
Multi-task Setting?
- Naïve approach 1: learn a separate GP model for each task
  - No sharing of information between tasks
- Naïve approach 2: fit one parametric kernel jointly
  - A parametric kernel is too restrictive to fit all tasks
- The collaborative effects we want to capture:
  - Common preferences: functions share similar regression labels on some items
  - Similar variabilities: functions tend to have the same predictability on similar items
Collaborative Ordinal Regression
- Hierarchical GP model for multi-task ordinal regression
  - Mean function $\mathbf{h}$: models the common preferences
  - Covariance matrix $K$: models the similar variabilities
  - Both the mean function and the (non-stationary) covariance matrix are learned from data
[Figure: inputs $x_1, \dots, x_n$ enter a shared GP prior $(\mathbf{h}, K)$; latent functions $f_1, \dots, f_m$ are drawn from it, and each yields a partially observed column $\mathbf{y}_j$ of rank labels through the ordinal regression likelihood.]
COR: The Model
- Hierarchical Bayes model on functions
  - All latent functions are sampled from the same GP prior:
    $\mathbf{f}_j \sim \mathcal{N}(\mathbf{f}_j; \mathbf{h}, K)$
  - Different tasks are allowed different likelihood parameters $\theta_j$:
    $P(\mathcal{D} \mid X, \Theta, \mathbf{h}, K) = \prod_{j=1}^{m} P(\mathbf{y}_j \mid X, \theta_j, \mathbf{h}, K) = \prod_{j=1}^{m} \int P(\mathbf{y}_j \mid \mathbf{f}_j, \theta_j)\, P(\mathbf{f}_j \mid \mathbf{h}, K)\, d\mathbf{f}_j$
  - We may observe only part of the rank labels for each function
COR: The Key Points
- The GP prior connects all ordinal regression tasks
  - It models the first- and second-order sufficient statistics
- The lower-level features are incorporated naturally
  - More general than pure collaborative filtering
- We don't fix a parametric form for the kernel
  - Instead we assign the conjugate (normal-inverse-Wishart) prior (see the sampling sketch below):
    $P(\mathbf{h}, K) = \mathcal{N}(\mathbf{h}; \mathbf{h}_0, \tfrac{1}{\pi} K)\, \mathcal{IW}(K; \tau, K_0)$
- We can make predictions for new input data and for new tasks
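As an illustration, a minimal scipy sketch of drawing $(\mathbf{h}, K)$ from this normal-inverse-Wishart prior and then a task function from the shared GP prior (all parameter values here are invented for the toy example):

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

n = 5                    # number of items (toy size)
tau, pi = n + 2, 1.0     # prior strengths; invented illustrative values
K0 = np.eye(n)           # base kernel, e.g. built from item features
h0 = np.zeros(n)         # prior mean of the shared mean function

# K ~ IW(tau, K0), then h | K ~ N(h0, K / pi): the conjugate
# normal-inverse-Wishart prior on the GP prior's parameters.
K = invwishart.rvs(df=tau, scale=K0)
h = multivariate_normal.rvs(mean=h0, cov=K / pi)

# Each task's latent function is then a draw from the shared GP prior.
f_j = multivariate_normal.rvs(mean=h, cov=K)
print(f_j.shape)  # (5,)
```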
Toy Problem (GPR Model)
[Figure: toy results for the GPR model, showing the mean rank labels, the learned mean function, the learned covariance matrix, and new-task predictions with the base (RBF) kernel versus the learned kernel.]
Learning
- Variational lower bound (Jensen's inequality; tight when each $Q(\mathbf{f}_j)$ is the true posterior):
  $\log P(\mathcal{D} \mid X, \Theta, \mathbf{h}, K) \ge \sum_{j=1}^{m} \int Q(\mathbf{f}_j) \log \frac{P(\mathbf{y}_j \mid \mathbf{f}_j, \theta_j)\, P(\mathbf{f}_j \mid \mathbf{h}, K)}{Q(\mathbf{f}_j)}\, d\mathbf{f}_j$
- EM learning
  - E-step: approximate each posterior $Q(\mathbf{f}_j)$ as a Gaussian
    $Q(\mathbf{f}_j) = \mathcal{N}(\mathbf{f}_j; \hat{\mathbf{f}}_j, \hat{K}_j)$,
    estimating the mean vector and covariance matrix with EP
  - M-step: fix $Q(\mathbf{f}_j)$ and maximize w.r.t. $\theta_j$ and $(\mathbf{h}, K)$
E-step
- The true posterior distribution factorizes:
  $Q(\mathbf{f}) \propto \prod_i P(y_i \mid f(x_i), \theta)\, P(\mathbf{f} \mid \mathbf{h}, K)$
- Approximate each likelihood factor with a Gaussian factor $t_k(X)$
- EP procedure, iterated over the factors:
  - Deletion: remove factor $t_k(X)$ from the approximating Gaussian
  - Moment matching: add the true likelihood factor back and match moments
  - Update: update the factor $t_k(X)$
- Can be done analytically for the example models
  - For the GPR model the EP step is exact, since the likelihood is already Gaussian (see the sketch below)
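Because the GPR likelihood is Gaussian, the E-step posterior is available in closed form. A minimal numpy sketch (toy data, illustrative noise level):

```python
import numpy as np

def gpr_posterior(y, h, K, sigma=0.5):
    """Exact Gaussian posterior N(f; f_hat, K_hat) of the latent values f,
    given rank labels y, under the GPR likelihood N(y; f, sigma^2 I)
    and the GP prior N(f; h, K)."""
    n = len(y)
    K_hat = np.linalg.inv(np.linalg.inv(K) + np.eye(n) / sigma ** 2)
    f_hat = K_hat @ (np.linalg.solve(K, h) + y / sigma ** 2)
    return f_hat, K_hat

K = np.array([[1.0, 0.5], [0.5, 1.0]])
f_hat, K_hat = gpr_posterior(y=np.array([3.0, 1.0]), h=np.zeros(2), K=K)
print(f_hat, K_hat)
```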
M-step
- Update the GP prior (sketched in code below):
  $\hat{\mathbf{h}} = \frac{1}{\pi + m}\left(\pi \mathbf{h}_0 + \sum_{j=1}^{m} \hat{\mathbf{f}}_j\right)$
  $\hat{K} = \frac{1}{\tau + m}\left(\pi (\hat{\mathbf{h}} - \mathbf{h}_0)(\hat{\mathbf{h}} - \mathbf{h}_0)^\top + \tau K_0 + \sum_{j=1}^{m} \left[(\hat{\mathbf{f}}_j - \hat{\mathbf{h}})(\hat{\mathbf{f}}_j - \hat{\mathbf{h}})^\top + \hat{K}_j\right]\right)$
  - The updates do not depend on the form of the ranking likelihood
  - The conjugate prior contributes a smoothing term
- Update the likelihood parameters $\theta_j$:
  - Done separately for each task
  - Same update equations as in the single-task case
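A minimal numpy sketch of these two closed-form updates (the default value of tau is an invented illustrative choice):

```python
import numpy as np

def m_step_prior(f_hats, K_hats, h0, K0, pi=1.0, tau=None):
    """Closed-form M-step updates for the shared GP prior (h, K), given
    the per-task posterior means f_hats[j] and covariances K_hats[j]."""
    m, n = f_hats.shape
    tau = n + 2 if tau is None else tau  # invented illustrative default
    h_hat = (pi * h0 + f_hats.sum(axis=0)) / (pi + m)
    d = h_hat - h0
    S = sum(np.outer(f - h_hat, f - h_hat) + Kj
            for f, Kj in zip(f_hats, K_hats))
    K_hat = (pi * np.outer(d, d) + tau * K0 + S) / (tau + m)
    return h_hat, K_hat

m, n = 3, 4
h_hat, K_hat = m_step_prior(np.zeros((m, n)), [np.eye(n)] * m,
                            h0=np.zeros(n), K0=np.eye(n))
```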
Inference
- Ordinal regression prediction for a test item $x^*$:
  $P(y_j^* \mid x^*, \mathcal{D}, X, \hat{\theta}_j, \hat{\mathbf{h}}, \hat{K}) = \int P(y_j^* \mid f_j^*, \hat{\theta}_j)\, P(f_j^* \mid x^*, \mathcal{D}, X, \hat{\mathbf{h}}, \hat{K})\, df_j^*$
- Problem: the learned non-stationary kernel is unknown on test data!
- Solution: work in the dual space (Yu et al. 2005), as sketched below
  - Posterior over the latent values: $\mathbf{f}_j \sim \mathcal{N}(\mathbf{f}_j; \hat{\mathbf{f}}_j, \hat{K}_j)$
  - With the constraint $\mathbf{f}_j = K \boldsymbol{\alpha}_j$, the posterior over the dual weights is
    $\boldsymbol{\alpha}_j \sim \mathcal{N}(\boldsymbol{\alpha}_j; K^{-1} \hat{\mathbf{f}}_j, K^{-1} \hat{K}_j K^{-1})$
  - For test data we then have
    $f_j^* = \mathbf{k}^{*\top} \boldsymbol{\alpha}_j \sim \mathcal{N}(f_j^*; \mathbf{k}^{*\top} K^{-1} \hat{\mathbf{f}}_j, \mathbf{k}^{*\top} K^{-1} \hat{K}_j K^{-1} \mathbf{k}^*)$
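A minimal numpy sketch of the dual-space test prediction (toy inputs; not the authors' code):

```python
import numpy as np

def predict_test_score(k_star, K, f_hat_j, K_hat_j):
    """Dual-space prediction: writing f_j = K @ alpha_j, the test score
    f_j* = k_star^T alpha_j is Gaussian with the moments computed below."""
    v = np.linalg.solve(K, k_star)  # K^{-1} k_star
    mean = v @ f_hat_j              # k*^T K^{-1} f_hat_j
    var = v @ K_hat_j @ v           # k*^T K^{-1} K_hat_j K^{-1} k*
    return mean, var

K = np.array([[1.0, 0.5], [0.5, 1.0]])
mean, var = predict_test_score(np.array([0.8, 0.4]), K,
                               f_hat_j=np.array([1.0, -1.0]),
                               K_hat_j=0.1 * np.eye(2))
print(mean, var)
```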
Experiments
- Predict user ratings on movie data
  - MovieLens: 591 movies, 943 users
    - 19 binary features from the "Genre" part of each movie
  - EachMovie: 1,075 movies, 72,916 users
    - 23,753 features from an online database (TF-IDF)
- Experimental settings
  - Pick the 100 users with the most ratings as "tasks"
  - Randomly choose 10, 20, or 50 ratings per user for training
  - Base kernel: cosine similarity over the item features (see the sketch below)
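For illustration, a minimal numpy sketch of the cosine-similarity base kernel on item feature vectors (random toy features):

```python
import numpy as np

def cosine_kernel(X):
    """Base kernel K0: cosine similarity between item feature vectors,
    where X is (n_items, n_features), e.g. binary genre or TF-IDF features."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

X = np.random.rand(4, 19)  # e.g. 4 movies with 19 genre features (toy data)
K0 = cosine_kernel(X)
print(K0.shape, np.allclose(np.diag(K0), 1.0))
```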
Comparison Metrics
- Ordinal regression evaluation (sketched in code below)
  - Mean absolute error (MAE):
    $\mathrm{MAE}(\hat{R}) = \frac{1}{t} \sum_{i=1}^{t} |\hat{R}(i) - R(i)|$
  - Mean zero-one error (MZOE):
    $\mathrm{MZOE}(\hat{R}) = \frac{1}{t} \sum_{i=1}^{t} \mathbf{1}_{\hat{R}(i) \neq R(i)}$
  - Use macro & micro averages over the multiple tasks
- Ranking evaluation
  - Normalized discounted cumulative gain (NDCG):
    $\mathrm{NDCG}(\hat{R}) \propto \sum_{k=1}^{t} \frac{2^{r(k)} - 1}{\log(1 + k)}$
  - NDCG@10: only count the top 10 ranked items
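Minimal numpy sketches of the three metrics (toy values; the NDCG normalization by the ideal ordering is made explicit in the last line):

```python
import numpy as np

def mae(r_hat, r):
    """Mean absolute error between predicted and true rank labels."""
    return np.mean(np.abs(r_hat - r))

def mzoe(r_hat, r):
    """Mean zero-one error: the fraction of items labeled incorrectly."""
    return np.mean(r_hat != r)

def dcg_at(r_sorted, n=10):
    """DCG over the top-n items; r_sorted[k] is the true relevance of the
    item the model ranked (k+1)-th. Divide by the ideal DCG to get NDCG."""
    r = np.asarray(r_sorted[:n], dtype=float)
    return np.sum((2.0 ** r - 1.0) / np.log(1.0 + np.arange(1, len(r) + 1)))

r_true = np.array([5, 3, 4, 1])   # true labels, in the model's ranked order
r_pred = np.array([4, 3, 4, 2])   # predicted labels (toy values)
print(mae(r_pred, r_true), mzoe(r_pred, r_true))
print(dcg_at(r_true) / dcg_at(np.sort(r_true)[::-1]))  # NDCG@10
```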
Results - MovieLens
[Table: MovieLens results; not preserved in this transcript.]
- N: number of training items for each user
- MMMF: Maximum Margin Matrix Factorization (Srebro et al. 2005)
  - A state-of-the-art collaborative filtering model
Results - EachMovie
[Table: EachMovie results; not preserved in this transcript.]
- N: number of training items for each user
- MMMF: Maximum Margin Matrix Factorization (Srebro et al. 2005)
  - A state-of-the-art collaborative filtering model
New Ranking Functions
[Figure: predictions for the remaining MovieLens users (new tasks), using different kernels.]
- The more users we use for training, the better the kernel we obtain!
Observations
- Collaborative models are always better than individual models
- We can learn a good non-stationary kernel from the users
- GPR & CGPR are fast in training and robust in testing
  - Because there is no approximation: the Gaussian likelihood makes the EP step exact
- GPOR & CGPOR are slower and sometimes overfit
  - Due to the numerical M-step
- Other ranking likelihoods $P(y_i \mid f(x_i), \theta)$ can be plugged in
  - But then the EP step may require numerical integration
Conclusion
- A Bayesian framework for multi-task ordinal regression
- An efficient EM-EP learning algorithm
- COR is better than individual OR algorithms
- COR is better than pure collaborative filtering
- Experiments show very encouraging results
Extensions
- The framework is applicable to preference learning
  - A collaborative version of GP preference learning (Chu & Ghahramani, 2005)
  - A probabilistic version of RankNet (Burges et al. 2005), with the pairwise likelihood (see the sketch after this list):
    $P(y_i \succ y_j \mid f(x_i), f(x_j)) \propto \frac{\exp(f(x_i) - f(x_j))}{1 + \exp(f(x_i) - f(x_j))}$
- GP mixture model for multi-task learning
  - Assign a Gaussian mixture model to each latent function
  - Prediction uses a linear combination of the learned kernels
  - Connection to Dirichlet processes
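A minimal numpy sketch of this pairwise RankNet-style likelihood (illustrative scores only):

```python
import numpy as np

def preference_prob(f_xi, f_xj):
    """P(x_i preferred over x_j) = exp(f(x_i) - f(x_j)) / (1 + exp(...)),
    i.e. a logistic function of the latent score difference."""
    return 1.0 / (1.0 + np.exp(-(f_xi - f_xj)))

print(preference_prob(1.2, 0.3))  # > 0.5: x_i is likely preferred
print(preference_prob(0.3, 1.2))  # < 0.5: x_j is likely preferred
```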
Thanks!
Questions?