Transcript slides

A fast algorithm for learning large scale
preference relations
Vikas C. Raykar and Ramani Duraiswami
University of Maryland, College Park
Balaji Krishnapuram
Siemens Medical Solutions USA
AISTATS 2007
1
Learning
Many learning tasks can be
viewed as function estimation.
2
Learning from examples
Not all supervised learning procedures fit in the standard classification/regression framework.
[Diagram: training examples are fed to a learning algorithm.]
In this talk we are mainly concerned with ranking/ordering.
3
Ranking / Ordering
For some applications ordering is more important
Example 1: Information retrieval
Sort in the order of
relevance
4
Ranking / Ordering
For some applications ordering is more important
Example 2: Recommender systems
Sort in the order of
preference
5
Ranking / Ordering
For some applications ordering is more important
Example 3: Medical decision making
Decide over different treatment options
6
Plan of the talk
• Ranking formulation
• Algorithm
• Fast algorithm
• Results
7
Preference relations
Given a preference relation,
we can order/rank a set of instances.
Goal - Learn a preference relation
Training data – Set of pairwise preferences
8
Ranking function
A ranking function provides a numerical score; it is not unique.
Goal - Learn a preference relation
New goal - Learn a ranking function
Why not use a classifier/ordinal regressor as the ranking function?
9
Why is ranking different?
[Diagram: pairwise preference relations are fed to a learning algorithm; training minimizes pairwise disagreements.]
10
Training data, more formally
From these two we can get a set of pairwise preference relations.
11
Loss function
Minimize the fraction of pairwise disagreements, i.e., maximize the fraction of pairwise agreements:
WMW = (total # of pairwise agreements) / (total # of pairwise preference relations)
This is the generalized Wilcoxon-Mann-Whitney (WMW) statistic.
12
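As a concrete illustration (my own, not from the talk), the WMW statistic can be computed directly from a list of pairwise preferences; here `prefs` holds pairs (i, j) meaning instance i should be ranked above instance j (names are illustrative):

```python
import numpy as np

def wmw(scores, prefs):
    """Generalized WMW statistic: the fraction of pairwise preference
    relations (i, j) -- meaning i should rank above j -- that the
    given scores agree with."""
    agreements = sum(1 for i, j in prefs if scores[i] > scores[j])
    return agreements / len(prefs)

scores = np.array([2.5, 0.3, 1.7, 0.9])
prefs = [(0, 1), (0, 3), (2, 1), (3, 2)]  # the last preference is violated
print(wmw(scores, prefs))  # 3 of 4 pairs agree -> 0.75
```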
Consider a two class problem
[Figure: a two-class example with positive (+) and negative (-) points; here the WMW statistic is the fraction of (positive, negative) pairs ranked correctly, i.e., the area under the ROC curve.]
13
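For the two-class case, the WMW statistic is exactly the area under the ROC curve: the fraction of (positive, negative) pairs that the scores order correctly. A quick numerical check of this equivalence (example data mine), comparing the brute-force pairwise count against the rank-based Mann-Whitney U formula:

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, size=40)   # scores of positive examples
neg = rng.normal(0.0, 1.0, size=60)   # scores of negative examples

# WMW via brute-force pairwise comparison: O(m*n)
wmw_pairwise = np.mean(pos[:, None] > neg[None, :])

# Same quantity via ranks: U = R_pos - m(m+1)/2, AUC = U / (m*n)
m, n = len(pos), len(neg)
ranks = np.argsort(np.argsort(np.concatenate([pos, neg]))) + 1  # 1-based ranks
u = ranks[:m].sum() - m * (m + 1) / 2
auc = u / (m * n)
```

With continuous scores (no ties), the two computations coincide exactly.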
Function class: linear ranking function
Different algorithms use different function class
• RankNet – neural network
• RankSVM – RKHS
• RankBoost – boosted decision stumps
14
Plan of the talk
• Ranking formulation
– Training data – Pairwise preference relations
– Ideal Loss function – WMW statistic
– Function class – linear ranking functions
• Algorithm
• Fast algorithm
• Results
15
The likelihood
Maximizing the WMW statistic directly is a discrete optimization problem.
Instead, model each pairwise preference with a sigmoid [Burges et al.]:
Pr(x_i > x_j) = sigma(w^T(x_i - x_j)), where sigma(z) = 1/(1 + e^{-z})
Assumption: every pair is drawn independently.
Choose w to maximize the log-likelihood
L(w) = sum over preference pairs (i, j) of log sigma(w^T(x_i - x_j))
16
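A numerically stable sketch of this pairwise log-likelihood (function and variable names are mine; `diffs` stacks the difference vectors x_i - x_j, one row per preference "i ranked above j"):

```python
import numpy as np

def log_likelihood(w, diffs):
    """Log-likelihood of pairwise preferences under the sigmoid model.

    diffs: array of shape (num_pairs, d); row k is x_i - x_j for the
    k-th preference 'i ranked above j'.
    Uses log sigma(z) = -log(1 + exp(-z)), computed stably via logaddexp.
    """
    z = diffs @ w
    return -np.sum(np.logaddexp(0.0, -z))
```

At w = 0 every pair is a coin flip, so each preference contributes -log 2; a strongly satisfied pair contributes almost nothing.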
The MAP estimator
Place a prior on w; the MAP estimate maximizes the log-posterior, i.e., the log-likelihood plus a regularization term from the log-prior.
17
Another interpretation
What we want to maximize: the 0-1 indicator function (the WMW statistic).
What we actually maximize: the log-sigmoid.
The log-sigmoid is a lower bound for the indicator function.
18
Lower bounding the WMW
Log-likelihood ≤ WMW
19
Gradient based learning
• Use the nonlinear conjugate-gradient algorithm.
• Requires only gradient evaluations.
• No function evaluations.
• No second derivatives.
• The gradient is given by
  grad L(w) = sum over preference pairs (i, j) of [1 - sigma(w^T(x_i - x_j))] (x_i - x_j)
20
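A minimal end-to-end sketch of this training procedure, assuming the sigmoid pairwise likelihood from the preceding slides; SciPy's nonlinear conjugate-gradient (`method="CG"`) stands in for the authors' implementation, and all names and the synthetic data are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # the sigmoid

def neg_loglik(w, diffs):
    # negative log-likelihood: sum over pairs of -log sigma(w . d_k)
    return np.sum(np.logaddexp(0.0, -(diffs @ w)))

def neg_grad(w, diffs):
    # gradient of the negative log-likelihood: -sum_k sigma(-w . d_k) d_k
    return -diffs.T @ expit(-(diffs @ w))

# synthetic data: preferences generated by a known w_true
rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
scores = X @ w_true
i, j = rng.integers(0, 200, size=(2, 1000))
valid = scores[i] != scores[j]          # drop degenerate pairs
i, j = i[valid], j[valid]
keep = scores[i] > scores[j]            # orient each pair so i ranks above j
diffs = np.where(keep[:, None], X[i] - X[j], X[j] - X[i])

res = minimize(neg_loglik, np.zeros(3), args=(diffs,), jac=neg_grad, method="CG")
w_hat = res.x
```

The learned direction should agree with w_true on nearly all training preferences.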
RankNet
[Diagram: pairwise preference relations are fed to a learning algorithm, a backpropagation neural net trained with a cross-entropy loss.]
21
RankSVM
[Diagram: pairwise preference relations are fed to a learning algorithm, an SVM over an RKHS trained to minimize pairwise disagreements.]
22
RankBoost
[Diagram: pairwise preference relations are fed to a learning algorithm, boosted decision stumps trained to minimize pairwise disagreements.]
23
Plan of the talk
• Ranking formulation
– Training data – Pairwise preference relations
– Loss function – WMW statistic
– Function class – linear ranking functions
• Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
• Fast algorithm
• Results
24
Key idea
• Use an approximate gradient.
• Extremely fast: runs in linear time.
• Converges to the same solution.
• Requires a few more iterations.
25
Core computational primitive
Weighted summation of erfc functions
26
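Taking the primitive to be E(y_k) = sum_i q_i * erfc(y_k - x_i) (the exact scaling in the talk's formula is not visible in this transcript), direct evaluation at M target points against N sources costs O(MN):

```python
import numpy as np
from scipy.special import erfc

def erfc_sum_direct(x, q, y):
    """E(y_k) = sum_i q_i * erfc(y_k - x_i), evaluated directly: O(M*N)."""
    return erfc(y[:, None] - x[None, :]) @ q
```

This quadratic-cost sum, evaluated once per conjugate-gradient iteration, is the bottleneck the fast algorithm removes.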
Notion of approximation
The approximate sum is guaranteed to be within a user-specified accuracy epsilon of the exact sum, relative to the total absolute weight.
27
Example
28
1. Beaulieu's series expansion
Retain only the first few terms contributing to the desired accuracy.
Derive bounds for the truncation error to choose the number of terms.
29
2. Error bounds
30
3. Use truncated series
31
4. Regrouping
A and B depend only on the sources, not on y, and can be precomputed in O(pN).
Once A and B are precomputed, evaluation at all targets costs O(pM).
Total cost is reduced from O(MN) to O(p(M+N)).
32
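Putting the steps together, here is a sketch of the fast summation under my reading of the slides: a Beaulieu-style Fourier series, erfc(z) ≈ 1 - (4/pi) * sum over odd n of exp(-(nh)^2)/n * sin(2nhz), valid when |z| stays well below pi/(2h), truncated at n = 2p - 1, with the sine addition formula used to regroup source and target terms. The parameter choices below are illustrative, not the authors' automatic selection:

```python
import numpy as np

def erfc_sum_fast(x, q, y, h=0.3, p=10):
    """Approximate E(y_k) = sum_i q_i * erfc(y_k - x_i) in O(p(M+N)).

    Truncated series:
        erfc(z) ~= 1 - (4/pi) * sum_{n odd, n < 2p} exp(-(nh)^2)/n * sin(2nhz)
    Since sin(2nh(y-x)) = sin(2nhy)cos(2nhx) - cos(2nhy)sin(2nhx), the sums
    over the sources x factor out of the evaluation at the targets y.
    """
    n = np.arange(1, 2 * p, 2)                       # odd terms 1, 3, ..., 2p-1
    coef = (4.0 / np.pi) * np.exp(-(n * h) ** 2) / n
    A = np.cos(2.0 * h * np.outer(n, x)) @ q         # O(pN): sources only
    B = np.sin(2.0 * h * np.outer(n, x)) @ q
    S = np.sin(2.0 * h * np.outer(n, y))             # O(pM): targets only
    C = np.cos(2.0 * h * np.outer(n, y))
    return q.sum() - coef @ (S * A[:, None] - C * B[:, None])
```

With x and y in [0, 1], h = 0.3 keeps |y - x| far inside the series' validity window, and p = 10 terms already matches the direct sum to several digits.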
5. Other tricks
• Rapid saturation of the erfc function.
• Space subdivision
• Choosing the parameters to achieve the error bound
• See the technical report
33
Numerical experiments
34
Precision vs Speedup
35
Plan of the talk
• Ranking formulation
– Training data – Pairwise preference relations
– Loss function – WMW statistic
– Function class – linear ranking functions
• Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
• Fast algorithm
– Use fast approximate gradient
– Fast summation of erfc functions
• Results
36
Datasets
• 12 public benchmark datasets
• Five-fold cross-validation experiments
• CG tolerance 1e-3
• Accuracy for the gradient computation 1e-6
37
Direct vs Fast – WMW statistic
WMW is similar for both the exact and the fast approximate version.

Dataset   Direct   Fast
1         0.536    0.534
2         0.917    0.917
3         0.623    0.623
4         *        0.979
38
Direct vs Fast – Time taken

Dataset   Direct        Fast
1         1736 secs.    2 secs.
2         6731 secs.    19 secs.
3         2557 secs.    4 secs.
4         *             47 secs.
39
Effect of gradient approximation
40
Comparison with other methods
• RankNet - Neural network
• RankSVM - SVM
• RankBoost - Boosting
41
Comparison with other methods
• WMW is nearly the same for all the methods.
• The proposed method is faster than all the other methods.
• The next best time is achieved by RankBoost.
• Only the proposed method can handle large datasets.
42
Sample result
Dataset 8: N=950, d=10, S=5

Method              Time taken (secs)   WMW
RankNCG direct      333                 0.984
RankNCG fast        3                   0.984
RankNet linear      1264                0.951
RankNet two layer   2464                0.765
RankSVM linear      34                  0.984
RankSVM quadratic   1332                0.996
RankBoost           6                   0.958
43
Sample result
Dataset 11: N=4177, d=9, S=3

Method              Time taken (secs)   WMW
RankNCG direct      1736                0.536
RankNCG fast        2                   0.534
?                   63                  0.535
RankNet linear
RankNet two layer
RankSVM linear
RankSVM quadratic
RankBoost
44
Application to collaborative filtering
• Predict movie ratings for a user based on the
ratings provided by other users.
• MovieLens dataset (www.grouplens.org)
• 1 million ratings (1-5)
• 3592 movies
• 6040 users
• Feature vector for each movie – ratings provided by d other users
45
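A sketch of how such feature vectors might be assembled (my construction, not necessarily the authors' exact preprocessing): each movie is represented by the ratings given to it by d reference users, and the target user's own ratings induce the pairwise preferences.

```python
import numpy as np

rng = np.random.default_rng(3)
num_movies, num_users, d = 100, 50, 10

# ratings[m, u]: rating (1-5) of movie m by user u, with 0 meaning "not rated"
ratings = rng.integers(0, 6, size=(num_movies, num_users))

target_user = 0
reference_users = np.arange(1, d + 1)            # d other users supply features
X = ratings[:, reference_users].astype(float)    # one feature vector per movie

# pairwise preferences for the target user: movie a preferred over movie b
r = ratings[:, target_user]
rated = np.flatnonzero(r > 0)
prefs = [(a, b) for a in rated for b in rated if r[a] > r[b]]
```

The pairs in `prefs` and the features in `X` then feed directly into the pairwise training procedure described earlier.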
Collaborative filtering results
46
Collaborative filtering results
47
Plan/Conclusion of the talk
• Ranking formulation
– Training data – Pairwise preference relations
– Loss function – WMW statistic
– Function class – linear ranking functions
• Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
• Fast algorithm
– Use fast approximate gradient
– Fast summation of erfc functions
• Results
– Similar accuracy as other methods
– But much, much faster
48
Future work
• Ranking formulation
– Training data – Pairwise preference relations
– Loss function – WMW statistic
– Function class – linear ranking functions
• Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
• Fast algorithm
– Use fast approximate gradient
– Fast summation of erfc functions
• Results
– Similar accuracy as other methods
– But much, much faster
Other applications: neural networks, probit regression
Code coming soon
49
Future work
• Ranking formulation
– Training data – Pairwise preference relations
– Loss function – WMW statistic
– Function class – linear ranking functions (future: nonlinear/kernelized variants)
• Algorithm
– Maximize a lower bound on WMW
– Use conjugate-gradient
– Quadratic complexity
• Fast algorithm
– Use fast approximate gradient
– Fast summation of erfc functions
• Results (future: other applications, e.g., neural networks, probit regression)
– Similar accuracy as other methods
– But much, much faster
50
Thank You! | Questions?
51