Transcript slides
A fast algorithm for learning large scale preference relations. Vikas C. Raykar and Ramani Duraiswami, University of Maryland, College Park; Balaji Krishnapuram, Siemens Medical Solutions USA. AISTATS 2007. 1

Learning. Many learning tasks can be viewed as function estimation. 2

Learning from examples. Not all supervised learning procedures fit in the standard classification/regression framework. In this talk we are mainly concerned with ranking/ordering. 3

Ranking / Ordering. For some applications ordering is more important. Example 1: information retrieval, where documents are sorted in the order of relevance. 4

Ranking / Ordering. Example 2: recommender systems, where items are sorted in the order of preference. 5

Ranking / Ordering. Example 3: medical decision making, where one must decide among different treatment options. 6

Plan of the talk. • Ranking formulation • Algorithm • Fast algorithm • Results. 7

Preference relations. Given a preference relation we can order/rank a set of instances. Goal: learn a preference relation. Training data: a set of pairwise preferences. 8

Ranking function. A ranking function provides a numerical score; it is not unique. The goal of learning a preference relation becomes the new goal of learning a ranking function. Why not use a classifier/ordinal regressor as the ranking function? 9

Why is ranking different? The learning algorithm is trained on pairwise preference relations, and the natural loss is the number of pairwise disagreements. 10

Training data, more formally. From these two we can get a set of pairwise preference relations. 11

Loss function.
Minimize the fraction of pairwise disagreements, i.e., maximize the fraction of pairwise agreements: (total # of pairwise agreements) / (total # of pairwise preference relations). This is the generalized Wilcoxon-Mann-Whitney (WMW) statistic. 12

Consider a two-class problem. [Figure: a set of positive (+) and negative (-) instances.] 13

Function class: linear ranking functions. Different algorithms use different function classes: • RankNet – neural network • RankSVM – RKHS • RankBoost – boosted decision stumps. 14

Plan of the talk. • Ranking formulation – training data – pairwise preference relations – ideal loss function – WMW statistic – function class – linear ranking functions • Algorithm • Fast algorithm • Results. 15

The likelihood. Choose w to maximize the likelihood; maximizing the WMW statistic directly is a discrete optimization problem, so the log-likelihood is maximized instead. Assumption: every pair is drawn independently. The pairwise probabilities are modeled with a sigmoid [Burges et al.]. 16

The MAP estimator. 17

Another interpretation. What we want to maximize (the 0-1 indicator function) versus what we actually maximize (the log-sigmoid). The log-sigmoid is a lower bound for the indicator function. 18

Lower bounding the WMW: log-likelihood <= WMW. 19

Gradient based learning. • Use the nonlinear conjugate-gradient algorithm. • Requires only gradient evaluations. • No function evaluations. • No second derivatives. • The gradient is given by [equation on slide]. 20

RankNet. Learns from pairwise preference relations; learning algorithm: backpropagation neural net; training loss: cross entropy. 21

RankSVM. Learns from pairwise preference relations; learning algorithm: SVM in an RKHS; training loss: pairwise disagreements. 22

RankBoost. Learns from pairwise preference relations; learning algorithm: boosting with decision stumps; training loss: pairwise disagreements. 23

Plan of the talk. • Ranking formulation – training data – pairwise preference relations – loss function – WMW statistic – function class – linear ranking functions • Algorithm – maximize a lower bound on the WMW – use conjugate-gradient – quadratic complexity • Fast algorithm • Results. 24

Key idea. • Use an approximate gradient. • Extremely fast: runs in linear time. • Converges to the same solution. • Requires a few more iterations.
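To make the formulation above concrete, here is a small sketch (our own illustration, not the authors' code; all function names are invented) of the WMW statistic for a linear ranking function on a two-class problem, together with the log-sigmoid surrogate and its gradient, which is what a conjugate-gradient routine needs. Note that forming all pairwise differences is quadratic in the data size; this is exactly the cost the fast algorithm removes.

```python
import numpy as np

def wmw(w, X_pos, X_neg):
    """WMW statistic: fraction of (positive, negative) pairs ranked
    correctly by the linear ranking function f(x) = w . x."""
    s_pos = X_pos @ w                       # scores of preferred instances
    s_neg = X_neg @ w                       # scores of non-preferred instances
    diffs = s_pos[:, None] - s_neg[None, :] # all pairwise score gaps
    return float(np.mean(diffs > 0))

def log_sigmoid(z):
    # numerically stable log(sigmoid(z)) = -log(1 + exp(-z))
    return -np.logaddexp(0.0, -z)

def surrogate_and_grad(w, X_pos, X_neg):
    """Average log-sigmoid over all pairs (a lower bound on the WMW,
    since log sigmoid(z) <= indicator(z > 0)) and its gradient.
    Direct cost: O(M*N*d), i.e. quadratic in the number of instances."""
    d = X_pos.shape[1]
    D = (X_pos[:, None, :] - X_neg[None, :, :]).reshape(-1, d)
    z = D @ w
    value = log_sigmoid(z).mean()
    # d/dz log sigmoid(z) = sigmoid(-z), computed stably below
    weights = np.exp(-np.logaddexp(0.0, z))
    grad = weights @ D / len(z)
    return value, grad
```

The surrogate is smooth, so standard gradient-based optimizers apply where the raw WMW (a step function of w) does not.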
25

Core computational primitive: the weighted summation of erfc functions. 26

Notion of approximation. 27

Example. 28

1. Beaulieu's series expansion. Derive bounds for this expansion to choose the number of terms; retain only the first few terms contributing to the desired accuracy. 29

2. Error bounds. 30

3. Use the truncated series. 31

3. Regrouping. A and B do not depend on y and can be precomputed in O(pN); once A and B are precomputed, the result can be evaluated in O(pM). The cost is reduced from O(MN) to O(p(M+N)). 32

Other tricks. • Rapid saturation of the erfc function. • Space subdivision. • Choosing the parameters to achieve the error bound. • See the technical report. 33

Numerical experiments. 34

Precision vs. speedup. 35

Plan of the talk. • Ranking formulation – training data – pairwise preference relations – loss function – WMW statistic – function class – linear ranking functions • Algorithm – maximize a lower bound on the WMW – use conjugate-gradient – quadratic complexity • Fast algorithm – use a fast approximate gradient – fast summation of erfc functions • Results. 36

Datasets. • 12 public benchmark datasets. • Five-fold cross-validation experiments. • CG tolerance 1e-3. • Accuracy for the gradient computation 1e-6. 37

Direct vs. fast: WMW statistic. The WMW is similar for both the exact and the fast approximate version.

Dataset  Direct  Fast
1        0.536   0.534
2        0.917   0.917
3        0.623   0.623
4        *       0.979
38

Direct vs. fast: time taken.

Dataset  Direct       Fast
1        1736 secs.   2 secs.
2        6731 secs.   19 secs.
3        2557 secs.   4 secs.
4        *            47 secs.
39

Effect of the gradient approximation. 40

Comparison with other methods. • RankNet – neural network. • RankSVM – SVM. • RankBoost – boosting. 41

Comparison with other methods. • The WMW is almost the same for all the methods. • The proposed method is faster than all the other methods. • The next best time is achieved by RankBoost. • Only the proposed method can handle large datasets.
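Before the sample results, the fast summation primitive described above can be sketched as follows. This is our own illustrative reconstruction, not the authors' implementation: it uses a truncated Fourier-type series for erfc in the spirit of Beaulieu's expansion, and the regrouping trick that separates the x and y dependence to go from O(MN) to O(p(M+N)). The parameter choices (h, n_terms) are our assumptions, tuned for arguments in roughly [-2, 2].

```python
import math
import numpy as np

def erfc_sum_direct(q, x, y):
    """Direct O(M*N) evaluation of G(y_j) = sum_i q_i * erfc(y_j - x_i)."""
    return np.array([sum(qi * math.erfc(yj - xi) for qi, xi in zip(q, x))
                     for yj in y])

def erfc_sum_fast(q, x, y, h=0.25, n_terms=21):
    """O(p(M+N)) approximation using the truncated series
        erfc(z) ~ 1 - (4/pi) * sum_{n odd} exp(-(n h)^2)/n * sin(2 n h z),
    valid for |z| well inside (-pi/(2h), pi/(2h)).  Since
        sin(2nh(y - x)) = sin(2nh y)cos(2nh x) - cos(2nh y)sin(2nh x),
    the sums over x (A_n, B_n below) are precomputed once in O(pN)
    and reused for every y_j, giving O(pM) for the evaluation."""
    q, x, y = map(np.asarray, (q, x, y))
    ns = np.arange(1, n_terms + 1, 2)                 # odd series indices
    coef = (4.0 / math.pi) * np.exp(-(ns * h) ** 2) / ns
    A = np.cos(2 * h * np.outer(ns, x)) @ q           # A_n = sum_i q_i cos(2nh x_i)
    B = np.sin(2 * h * np.outer(ns, x)) @ q           # B_n = sum_i q_i sin(2nh x_i)
    sin_y = np.sin(2 * h * np.outer(ns, y))           # p x M
    cos_y = np.cos(2 * h * np.outer(ns, y))
    series = coef @ (sin_y * A[:, None] - cos_y * B[:, None])
    return q.sum() - series
```

The paper's full algorithm additionally exploits the rapid saturation of erfc and space subdivision, and chooses h and the truncation order automatically from a target error bound; those refinements are omitted here.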
42

Sample result. Dataset 8 (N=950, d=10, S=5):

Method             Time taken (secs)  WMW
RankNCG direct     333                0.984
RankNCG fast       3                  0.984
RankNet linear     1264               0.951
RankNet two layer  2464               0.765
RankSVM linear     34                 0.984
RankSVM quadratic  1332               0.996
RankBoost          6                  0.958
43

Sample result. Dataset 11 (N=4177, d=9, S=3):

Method             Time taken (secs)  WMW
RankNCG direct     1736               0.536
RankNCG fast       2                  0.534
RankNet linear     63                 0.535
RankNet two layer  -                  -
RankSVM linear     -                  -
RankSVM quadratic  -                  -
RankBoost          -                  -
44

Application to collaborative filtering. • Predict movie ratings for a user based on the ratings provided by other users. • MovieLens dataset (www.grouplens.org). • 1 million ratings (1-5). • 3592 movies. • 6040 users. • Feature vector for each movie: the ratings provided by d other users. 45

Collaborative filtering results. 46

Collaborative filtering results. 47

Plan/conclusion of the talk. • Ranking formulation – training data – pairwise preference relations – loss function – WMW statistic – function class – linear ranking functions • Algorithm – maximize a lower bound on the WMW – use conjugate-gradient – quadratic complexity • Fast algorithm – use a fast approximate gradient – fast summation of erfc functions • Results – similar accuracy to the other methods, but much, much faster. 48

Future work. [Plan outline repeated.] Other applications: neural networks, probit regression. Code coming soon. 49

Future work. [Plan outline repeated.] The function class of linear ranking functions could be extended to a nonlinear, kernelized variation. Other applications: neural networks, probit regression. 50

Thank You! | Questions? 51