Transcript Slide 1
Slide 1/12: Multiple Kernel Learning
Marius Kloft, Technische Universität Berlin / Korea University

Slide 2/12: Machine Learning
• Aim
▫ Learning the relation of two random quantities from observations
• Kernel-based learning
• Example: object detection in images

Slide 3/12: Multiple Views / Kernels (Lanckriet, 2004)
• Views of an image: space, shape, color
• How to combine the views? Weightings.

Slide 4/12: Computation of Weights?
• State of the art (Bach, 2008)
▫ Sparse weights: kernels / views are completely discarded
▫ But why discard information?

Slide 5/12: From Vision to Reality?
• State of the art: the sparse method is empirically ineffective (Gehler et al., Noble et al., Shawe-Taylor et al., NIPS 2008)
• My dissertation: a new methodology
▫ established as a standard
▫ effective in applications; more efficient and effective in practice
▫ breakthrough in learning bounds: O(M/n)

Slide 6/12: Non-sparse Multiple Kernel Learning: New Methodology (Kloft et al., ECML 2010, JMLR 2011)
• Computation of weights
▫ Model: the kernel is a weighted combination of the given kernels
▫ Mathematical program: optimization over the weights; a convex problem
• Generalized formulation
▫ arbitrary loss
▫ arbitrary norms, e.g. lp-norms; the 1-norm leads to sparsity

Slide 7/12: Theoretical Analysis
• Theoretical foundations
▫ Active research topic (NIPS workshop 2010)
• Corollaries (learning bounds)
▫ Upper bound with improved rate; best previously known rate: Cortes et al., ICML 2010
▫ We show: Theorem (Kloft & Blanchard). The local Rademacher complexity of MKL is bounded by … (formula not transcribed). In general, an improvement of two orders of magnitude. (Kloft & Blanchard, NIPS 2011, JMLR 2012)
• Proof (sketch)
1. Relating the original class to the centered class
2. Bounding the complexity of the centered class
3. Applying the Khintchine-Kahane and Rosenthal inequalities
4. Bounding the complexity of the original class
5. Relating the bound to the truncation of the spectra of the kernels

Slide 8/12: Optimization
• Algorithms (Kloft et al., JMLR 2011)
1. Newton method
2. Sequential, quadratically constrained programming with level-set projections
3. Block-coordinate descent. Sketch: alternate until convergence between solving (P) w.r.t. w and solving (P) w.r.t. the kernel weights; the weight step is analytical (proved)
• Implementation
▫ In C++ ("SHOGUN Toolbox"), with Matlab/Octave/Python/R support
▫ Runtime: the block-coordinate descent algorithm is ~1-2 orders of magnitude faster

Slide: Toy Experiment
• Design
▫ Two 50-dimensional Gaussians, with mean μ1 on the simplex and zero-mean irrelevant features
▫ Variance chosen so that the Bayes error stays constant
▫ 50 training examples; one linear kernel per feature
▫ Six scenarios: the percentage of irrelevant features (sparsity) varies over 0%, 44%, 64%, 82%, 92%, 98%
• Results
▫ The choice of p is crucial; optimality depends on the true sparsity
▫ The SVM fails in the sparse scenarios (50% test error)
▫ 1-norm MKL is best in the sparsest scenario only
▫ p-norm MKL proves robust in all scenarios
• Bounds
▫ The bounds can be minimized w.r.t. p
▫ The bounds reflect the empirical results well

Slide: Applications
• Lesson learned: the optimality of a method depends on the true underlying sparsity of the problem
• Applications studied: non-sparsity

Slide 9/12: Application Domain: Computer Vision
• Visual object recognition
▫ Aim: annotation of visual media (e.g., images): aeroplane, bicycle, bird
▫ Motivation: content-based image retrieval
• Multiple kernels, based on:
▫ color histograms
▫ shapes (gradients)
▫ local features (SIFT words)
▫ spatial features
• 32 kernels in total
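The alternating scheme from the Optimization slide above can be sketched in a few lines of Python. This is a minimal illustration, not the SHOGUN implementation: kernel ridge regression stands in for the SVM step, the function name `lp_mkl_krr` is hypothetical, and the analytical weight update theta_m ∝ ||w_m||^(2/(p+1)), renormalized to the lp-sphere, follows the lp-norm MKL papers cited above.

```python
import numpy as np

def lp_mkl_krr(kernels, y, p=2.0, lam=1.0, n_iter=20):
    """Sketch of lp-norm MKL by block-coordinate descent.

    Alternates between (1) solving for the dual variables alpha under the
    current kernel mixture (kernel ridge regression replaces the SVM step
    for brevity) and (2) the analytical weight update
    theta_m ~ ||w_m||^(2/(p+1)), renormalized so that ||theta||_p = 1.
    """
    M, n = len(kernels), len(y)
    theta = np.full(M, M ** (-1.0 / p))       # uniform start, ||theta||_p = 1
    alpha = np.zeros(n)
    for _ in range(n_iter):
        K = sum(t * Km for t, Km in zip(theta, kernels))
        alpha = np.linalg.solve(K + lam * np.eye(n), y)      # learner step
        # squared block norms: ||w_m||^2 = theta_m^2 * alpha' K_m alpha
        norms2 = np.array([t ** 2 * alpha @ Km @ alpha
                           for t, Km in zip(theta, kernels)])
        theta = norms2 ** (1.0 / (p + 1))                    # analytical update
        theta /= np.linalg.norm(theta, ord=p)                # back to lp-sphere
    return theta, alpha
```

For p close to 1 the weights concentrate on few kernels; for larger p they spread out, which is the non-sparse regime advocated here.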
• Datasets
▫ 1. VOC 2009 challenge: 7054 train / 6925 test images; 20 object categories (aeroplane, bicycle, …); one-vs.-rest classifier for each visual concept
▫ 2. ImageCLEF2010 challenge: 8000 train / 10000 test images, taken from Flickr; 93 concept categories (partylife, architecture, skateboard, …)
• Kernels
▫ 4 types, crossing bag of words vs. global histogram with gradients vs. color: BoW-SIFT, Ho(O)G, BoW-C, HoC
▫ varied over color-channel combinations and spatial tilings (levels of a spatial pyramid)

• Preliminary results:
           SVM     MKL
VOC2009    55.85   56.76
CLEF2010   36.45   37.02
• Challenge results
▫ Employed our approach in the ImageCLEF2011 Photo Annotation challenge and achieved the winning entries in 3 categories! (Binder et al., 2010, 2011)
▫ Using BoW-S only gives worse results → BoW-S alone is not sufficient
• Why can MKL help?
▫ Some images are better captured by certain kernels
• Experiment
▫ Disagreement of single-kernel classifiers
▫ Different images may have different kernels that capture them well
▫ The BoW-S kernels induce more or less the same predictions

Slide: Application Domain: Genetics (Kloft et al., NIPS 2009, JMLR 2011)
• Detection of transcription start sites, by means of kernels based on:
▫ sequence alignments
▫ the distribution of nucleotides downstream and upstream
▫ folding properties (binding energies and angles)
• Theoretical analysis
▫ impact of the lp-norm on the bound
▫ confirms the experimental results: stronger theoretical guarantees for the proposed approach (p > 1); empirical and theoretical results approximately equal for …
• Empirical analysis
▫ detection accuracy (AUC)
▫ higher accuracies than sparse MKL and ARTS
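The kernel-disagreement experiment mentioned above can be mimicked on synthetic data. The two helpers below are hypothetical (a ridge-regularized kernel classifier stands in for the single-kernel SVMs), but they show how the pairwise disagreement of single-kernel predictions would be measured.

```python
import numpy as np

def kernel_predict(K_train, K_test, y, lam=1.0):
    """Fit a ridge-regularized kernel classifier on one kernel and
    return sign predictions for the rows of the test kernel."""
    alpha = np.linalg.solve(K_train + lam * np.eye(len(y)), y)
    return np.sign(K_test @ alpha)

def disagreement(preds_a, preds_b):
    """Fraction of points on which two single-kernel classifiers disagree;
    high values suggest the two kernels capture different images."""
    return float(np.mean(preds_a != preds_b))
```

Low disagreement within a kernel family (as reported above for the BoW-S kernels) indicates redundancy, whereas high disagreement across families marks the cases where a learned combination can help.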
(ARTS: winner of an international comparison of 19 models; Abeel et al., 2009)

Slide: Application Domain: Pharmacology
• Protein fold prediction
▫ Prediction of the fold class of a protein
▫ The fold class is related to the protein's function, e.g., important for drug design
• Setup
▫ Data set and kernels from Y. Ying
▫ 27 fold classes; fixed train and test sets
▫ 12 biologically inspired kernels, e.g., hydrophobicity, polarity, van der Waals volume
• Results (accuracy)
▫ 1-norm MKL and SVM on par
▫ p-norm MKL performs best: 6% higher accuracy than the baselines

Slide: Further Applications
• non-sparsity

Slide: Conclusion: Non-sparse Multiple Kernel Learning
• Visual object recognition: established as a standard; winner of the ImageCLEF 2011 challenge
• Computational biology applications: a more accurate gene detector than the winner of the international comparison
• Training with > 100,000 data points and > 1,000 kernels
• Sharp learning bounds

Thank you for your attention. I will be pleased to answer any additional questions.

References
▫ Abeel, Van de Peer, Saeys (2009). Toward a gold standard for promoter prediction evaluation. Bioinformatics.
▫ Bach (2008). Consistency of the Group Lasso and Multiple Kernel Learning. Journal of Machine Learning Research (JMLR).
▫ Kloft, Brefeld, Laskov, Sonnenburg (2008). Non-sparse Multiple Kernel Learning. NIPS Workshop on Kernel Learning.
▫ Kloft, Brefeld, Sonnenburg, Laskov, Müller, Zien (2009). Efficient and Accurate Lp-norm Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS 2009).
▫ Kloft, Rückert, Bartlett (2010). A Unifying View of Multiple Kernel Learning. ECML.
▫ Kloft, Blanchard (2011). The Local Rademacher Complexity of Lp-Norm Multiple Kernel Learning. Advances in Neural Information Processing Systems (NIPS 2011).
▫ Kloft, Brefeld, Sonnenburg, Zien (2011). Lp-Norm Multiple Kernel Learning. Journal of Machine Learning Research (JMLR), 12(Mar):953-997.
▫ Kloft, Blanchard (2012). On the Convergence Rate of Lp-norm Multiple Kernel Learning. Journal of Machine Learning Research (JMLR), 13(Aug):2465-2502.
▫ Lanckriet, Cristianini, Bartlett, El Ghaoui, Jordan (2004). Learning the Kernel Matrix with Semidefinite Programming. Journal of Machine Learning Research (JMLR).
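As a closing sketch: several experiments above (the 20 VOC object categories, the 27 protein fold classes) use one-vs.-rest classification over combined kernels. The snippet below is a minimal illustration, assuming precomputed kernel matrices; a uniform kernel sum and kernel ridge scorers stand in for the learned p-norm MKL mixture and the SVMs of the actual experiments, and the function name is hypothetical.

```python
import numpy as np

def one_vs_rest_predict(kernels_train, kernels_test, labels, lam=1.0):
    """One-vs.-rest classification on a sum of precomputed kernels.

    The uniform kernel sum stands in for a learned p-norm MKL mixture.
    Each class gets its own binary kernel ridge scorer; the class with
    the highest score wins.
    """
    K = sum(kernels_train)                  # combined train kernel (n x n)
    K_test = sum(kernels_test)              # test-vs-train kernel (m x n)
    n = len(labels)
    classes = np.unique(labels)
    scores = []
    for c in classes:
        y = np.where(labels == c, 1.0, -1.0)            # one-vs.-rest targets
        alpha = np.linalg.solve(K + lam * np.eye(n), y)
        scores.append(K_test @ alpha)
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```

Replacing the uniform sum with learned per-task kernel weights is exactly where the non-sparse MKL of this talk plugs in.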