Transcript Slide 1
Dr. Georgios C. Anagnostopoulos
Series of Electrical & Computer Engineering Seminars – Spring 2011
Florida Institute of Technology

Dr. Georgios C. Anagnostopoulos, Associate Professor
Electrical & Computer Engineering Dept., Florida Institute of Technology
[email protected]
http://my.fit.edu/~georgio
Machine Learning Laboratory @ FIT

Multi-Task Learning (MTL) has recently amassed significant interest in the Machine Learning (ML) research community. Its focus is on computational approaches in which a single model simultaneously learns an ensemble of several, possibly conflicting, tasks, offering the option of gracefully trading off performance among them. MTL surfaces in scenarios where there are constraints on available computational resources and/or when task importance is non-stationary in time. On the other hand, the significance of Kernel-based Methods (KMs) in ML hinges on the use of a kernel function that induces an implicit mapping from a given, original feature space to a so-called Reproducing Kernel Hilbert Space. Often, the ML problem posed in the latter space is easier and/or more efficiently solved. Furthermore, through an appropriate choice of kernel functions, many standard ML algorithms can be easily adapted to handle a vast variety of data types.

This talk's aim is twofold. First, it provides a high-level exposition of KMs and MTL, giving the audience an appropriate context; relevant key concepts will be briefly illustrated. Secondly, it discusses a very recently developed MTL framework that employs a multiple-kernel approach and is suitable for accommodating a broad spectrum of tasks. Through this framework, some appropriate tasks are able to share a common, kernel-induced feature space, which may lead to more robust and trustworthy models participating in the framework. The framework's mathematical formulation will be exhibited along with a maxi-minimization algorithm to optimize it. Finally, some preliminary experimental results will also be presented to illustrate the framework's utility.

Major ML tasks
Pattern Recognition
▪ Classification (Bayes Risk Criterion)
▪ Detection (Neyman-Pearson Criterion)
▪ Robust Recognition (Min-Max Criterion)
Regression
▪ Linear & Non-linear
Clustering
▪ Density Estimation
And many more…

Learning refers to estimating the parameters of an ML model from the available data based on a loss function. Learning ⇒ Optimization.

Multi-Task Learning (MTL) refers to training an ensemble of models that share common parameters, each of which is dedicated to a different task. For example, a common input x is fed to a Task 1-specific model with parameters (θ1, θcommon) producing output y1, and to a Task 2-specific model with parameters (θ2, θcommon) producing output y2.
Note: Sharing of parameters between tasks leads to task interdependence.
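As a minimal illustration of the parameter-sharing picture above (a sketch of my own, not from the slides): two linear regression-style tasks on synthetic data share a common parameter vector θ_common while keeping task-specific vectors θ1 and θ2, and the two per-task losses are minimized jointly with importance weights λ1, λ2 (the scalarization idea revisited later in the talk). Only NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for two regression-style tasks on the same 5-dimensional input space.
X1, X2 = rng.normal(size=(30, 5)), rng.normal(size=(30, 5))
y1 = X1 @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=30)
y2 = X2 @ np.array([1.0, -1.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

# Shared parameters theta_common plus task-specific parameters theta_1, theta_2.
theta_common = np.zeros(5)
theta_1 = np.zeros(5)
theta_2 = np.zeros(5)

def task_loss(X, y, theta_task):
    """Squared-error loss of one task's model, which uses theta_common + theta_task."""
    residual = X @ (theta_common + theta_task) - y
    return 0.5 * np.mean(residual ** 2)

lam1, lam2 = 0.5, 0.5   # importance weights: scalarization of the two criteria
lr = 0.1
for _ in range(500):
    # Gradients of the weighted joint objective lam1*f1 + lam2*f2.
    r1 = X1 @ (theta_common + theta_1) - y1
    r2 = X2 @ (theta_common + theta_2) - y2
    g1 = X1.T @ r1 / len(y1)
    g2 = X2.T @ r2 / len(y2)
    theta_common -= lr * (lam1 * g1 + lam2 * g2)   # the shared part couples the tasks
    theta_1 -= lr * lam1 * g1
    theta_2 -= lr * lam2 * g2

print("task 1 loss:", task_loss(X1, y1, theta_1))
print("task 2 loss:", task_loss(X2, y2, theta_2))
```

Because θ_common appears in both losses, improving one task's fit can change the other task's loss, which is exactly the task interdependence noted above.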
Reasons for common parameters
Common feature transformations
▪ Learning one task may help in learning another task.
Constraints on available memory resources
Ability to trade off performance between tasks
▪ Especially useful when task importance is non-stationary in time.

MTL Examples
In nature: one brain for multiple tasks
Minimizing both the false-alarm rate and the miss rate in a detector
Reducing the feature space dimensionality (e.g. for computational advantages or visualization purposes), while optimizing performance
▪ A rather singular case: model regularization, i.e. reducing the "effective" model complexity while learning a task.
Optimizing performance while trying to learn invariances in the outputs.
And many more…

MTL is formulated as a multi-criterion optimization problem. An effort is made to learn all tasks (optimize all criteria, i.e. loss/cost/merit functions) simultaneously.
Notes:
▪ We lump all parameters together into a single parameter vector θ.
▪ We assume each task is associated with a loss (cost) function fi to be minimized.

Domination of solutions (illustrated in the (f1, f2) plane):
▪ θA strictly dominates θC.
▪ θB dominates θC.
▪ There is no dominance relationship between θA and θB.

We are particularly interested in the solutions that generate the Pareto Front, i.e. the set of all non-dominated solutions.
[Figure: searching for a Pareto-optimal solution; in panels (a) and (b), the negative gradients −g1 and −g2 delimit the beneficial descent directions.]

[Figure: Pareto path in parameter space (θ1, θ2) and the corresponding Pareto Front on the boundary of the feasible (f1, f2) values.]

[Figure: non-convex (a) vs. convex (b) feasible cost-function regions in the (f1, f2) plane.]
In the general, non-convex case, identifying the Pareto Front can be achieved via constrained optimization, but there are several challenges:
▪ Multiple solutions for the same cost-function values
▪ Dependence on the algorithm's starting point
When the cost/loss functions are convex, the search is much easier.

The convex case: each point on the Pareto Front corresponds to a unique solution, which can be found via scalarization of the problem, i.e. minimize λ1 f1(θ) + λ2 f2(θ) with arbitrary importance weights λ1, λ2 ≥ 0. [Figure: supporting lines λ1 f1 + λ2 f2 and λ'1 f1 + λ'2 f2 touching the Pareto Front in the (f1, f2) plane.]

Positive (semi-)definite kernels
Property: a positive (semi-)definite kernel k allows the construction of an inner-product kernel, i.e. an implicit mapping φ into a Hilbert space H such that k(x, y) = ⟨φ(x), φ(y)⟩_H.
H is called the Reproducing Kernel Hilbert Space.
φ is never used explicitly.
There are many other ways of constructing kernels, though.

An example of a kernel function: the polynomial kernel.

Why worry about kernels?
From: Schoelkopf et al., A Generalized Representer Theorem, NeuroCOLT2 Technical Report Series, NC2-TR-2000-81, May 2000.

In summary: out of an enormous class of well-behaved functions, many ML problems/tasks have solutions that admit a kernel expansion based on the available training set. Such problems include
▪ Classification
▪ Detection
▪ Regression
▪ Clustering
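As a tiny numerical check of the statement above that φ is never used explicitly (my own sketch; the talk's exact polynomial-kernel form is not reproduced in the transcript, so the homogeneous degree-2 kernel k(x, y) = ⟨x, y⟩² on 2-D inputs is assumed): the kernel value computed in the original space matches the inner product under the explicit feature map φ(x) = (x1², √2·x1·x2, x2²).

```python
import numpy as np

def poly_kernel(x, y, degree=2):
    """Homogeneous polynomial kernel k(x, y) = <x, y>^degree, evaluated in the original space."""
    return np.dot(x, y) ** degree

def phi(x):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(poly_kernel(x, y))        # kernel evaluation in the original 2-D space
print(np.dot(phi(x), phi(y)))   # inner product in the induced 3-D feature space: same value
```

For higher degrees and dimensions the explicit map grows combinatorially, while the kernel evaluation stays a single dot product and a power, which is the practical appeal of the kernel expansion discussed above.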
Many successful linear models that have been thoroughly studied in Statistics and Machine Learning involve the data only in the form of inner-product computations during parameter estimation (i.e. in the learning algorithm). Replacing the inner products with kernel function evaluations (the so-called "kernel trick") gives rise to non-linear models that encompass a kernel-induced feature transformation as a potentially highly useful pre-processing step; e.g. non-linearly separable data may become linearly separable (useful in classification).

Kernels can be interpreted as similarity measures of the data points involved. Kernels have been developed for a multitude of data types:
▪ Numeric
▪ Non-numeric (categorical), e.g. voting records, contents of documents, etc.
▪ Sequential, e.g. time series, images, etc.
▪ Mixed type, e.g. credit history, DNA sequences, graphs, trees, etc.
Therefore, classic linear methods, which relied on the numeric nature of the data, can now be adapted to work with virtually any data type, provided we supply a meaningful kernel for the particular data type at hand.

Translation-invariant kernels, i.e. kernels of the form k(x, y) = g(x − y), are very popular in regression and pattern recognition.

Support Vector Machines (SVMs): one of the most well-known 2-class recognition models.
Primal problem:
minimize over w, b, ξ:  (1/2) ‖w‖² + C Σi ξi
subject to:  yi (⟨w, xi⟩ + b) ≥ 1 − ξi,  ξi ≥ 0.

The benefit of using kernels with SVMs other than the linear one (i.e. the usual dot product in the original feature space): the more dimensions the induced feature space has, the higher the probability that the two classes become linearly separable.

L1-norm SVMs: finding sparse solutions.
[Figure: in the (θ1, θ2) plane, the L1-norm-constrained optimal solution is sparse, whereas the L2-norm-constrained optimal solution is non-sparse; c1 < c2 < c3 < c4 denote increasing loss contour levels.]

The Multiple Kernel Learning (MKL) [2] scheme finds the optimal kernel by combining several pre-defined kernel functions. A linear combination is most frequently used, and the combination coefficients are optimized. Here we will assume that the task is 2-class recognition by an SVM.
[2] Gert R. G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5:27-72, 2004.

Combining several SVM tasks via MTL [3] by using a shared feature representation (via MKL). All tasks are assumed equally important.
[3] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel methods. JMLR, 6:615-637, 2005.

A more general framework [4] introduces a shared-space coefficient and task-specific coefficients. Forcing all tasks to share a common feature space is tantamount to increasing the number of training samples, which can enhance classification performance.
[4] Lei Tang, Jianhui Chen, and Jieping Ye. On multiple kernel learning with multiple labels. In IJCAI, 2009.
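To make the linear-combination idea from the MKL slides above concrete, here is a minimal sketch of my own, not the semidefinite-programming algorithm of [2]: several pre-defined kernel matrices are combined with fixed, hand-chosen coefficients on the simplex and fed to a standard SVM as a precomputed Gram matrix. In true MKL the coefficients β would themselves be optimized; scikit-learn's kernel helpers and SVC are assumed.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# A small bank of pre-defined kernels evaluated on the training data.
kernel_bank = [
    linear_kernel(X),
    polynomial_kernel(X, degree=2),
    rbf_kernel(X, gamma=1.0),
]

# Fixed, hand-chosen combination coefficients on the simplex (in MKL these
# coefficients would be learned rather than fixed).
beta = np.array([0.2, 0.3, 0.5])
K = sum(b * Km for b, Km in zip(beta, kernel_bank))

clf = SVC(kernel="precomputed", C=1.0).fit(K, y)
print("training accuracy with the combined kernel:", clf.score(K, y))
```

Since a nonnegative combination of positive semi-definite matrices is again positive semi-definite, the combined K is a valid kernel matrix, which is what makes the linear-combination parameterization attractive.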
Comments: experimental results indicate that [4] significantly favors a Common Space approach. There may be some problems, however, for which sharing a common space is not the optimal choice. A multi-task learning problem may include a few complex tasks, but also some much simpler ones. In this situation, it may be difficult to find a common feature mapping such that the kernel machine performs well for all tasks in the mapped common feature space. It is reasonable to let the complex tasks be mapped to their task-specific spaces, while keeping the other tasks sharing a common space.

A novel Partially-Shared Common Space (PSCS) framework: it introduces a shared-space coefficient, task-specific coefficients, and inter-task coefficients; the inter-task coefficient is desired to be 0 for some tasks t.

p-norm and q-norm constraints are introduced to control the sparsity of the coefficients. The value of q is desired to be small, which gives sparse coefficients. Two special cases are of particular interest.

How do we solve this optimization problem? A practical approach: iteratively solve the outer maximization problem and the inner minimization problem. The maximization problem decomposes into T independent SVM problems, which can be addressed by any SVM solver. (A schematic code sketch of this alternation appears after the experimental setup below.)

How do we solve the inner minimization problem? Approach taken: separate this problem into three minimization problems. The solutions to the three individual problems are given by closed-form expressions [formulas shown on the corresponding slides].

Iris Dataset
• 3 classes
• 50 patterns in each class
• 2 attributes are employed from the original 4-attribute dataset
• Task 1 (class 1 vs. class 2)
• Task 2 (class 1 vs. class 3)
• Task 3 (class 2 vs. class 3)

Experimental setup
• One-vs-all for multiclass classification problems
• Ten kernels: linear kernel, polynomial kernel of degree 2, and Gaussian kernels with 8 different spread parameters in {2^0, …, 2^7}
• Fix p = 2, and choose parameters C and q by cross-validation: C is searched in {1/27, 1/9, 1/3, 1, 3, 9, 27}, and q is searched from 1.1 to 2.0 with a step size of 0.1
• Our approach is compared with the Common Space (CS) approach, obtained by fixing the coefficients so that all tasks share the common space
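The following is a schematic sketch of my own, not the talk's actual PSCS algorithm: it only mirrors the alternation structure described above (outer step: with the kernel-combination coefficients fixed, solve T independent SVM problems on the combined Gram matrix; inner step: update the coefficients). The kernel bank and the three pairwise Iris tasks loosely follow the experimental setup above; the coefficient update shown is a simple normalize-the-contributions placeholder standing in for the closed-form solutions referenced on the slides, and scikit-learn's SVC is assumed as the SVM solver.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

iris = load_iris()
X = iris.data[:, :2]                    # two of the four Iris attributes, as in the talk
tasks = [(0, 1), (0, 2), (1, 2)]        # the three pairwise binary tasks

# Kernel bank: linear, degree-2 polynomial, and Gaussians with spreads sigma in {2^0, ..., 2^7}.
spreads = [2.0 ** k for k in range(8)]
kernels = [linear_kernel(X), polynomial_kernel(X, degree=2)] + \
          [rbf_kernel(X, gamma=1.0 / (2.0 * s ** 2)) for s in spreads]
beta = np.full(len(kernels), 1.0 / len(kernels))   # combination coefficients on the simplex

for _ in range(10):
    contrib = np.zeros(len(kernels))
    # "Outer" step: with beta fixed, the problem decomposes into T independent SVMs.
    for a, b in tasks:
        idx = np.isin(iris.target, (a, b))
        y = np.where(iris.target[idx] == b, 1, -1)
        K = sum(bm * Km[np.ix_(idx, idx)] for bm, Km in zip(beta, kernels))
        svm = SVC(kernel="precomputed", C=1.0).fit(K, y)
        alpha = np.zeros(idx.sum())
        alpha[svm.support_] = np.abs(svm.dual_coef_[0])   # recover the dual variables
        v = alpha * y
        # Each base kernel's contribution to the quadratic term of this task's dual objective.
        for m, Km in enumerate(kernels):
            contrib[m] += v @ Km[np.ix_(idx, idx)] @ v
    # "Inner" step PLACEHOLDER: re-weight the kernels by their accumulated contributions and
    # re-normalize onto the simplex. This is NOT the closed-form update from the slides.
    beta = contrib / contrib.sum()

print("final kernel weights:", np.round(beta, 3))
```

In the actual PSCS framework the inner step would additionally distinguish shared-space, task-specific, and inter-task coefficients under the p-norm and q-norm constraints; the sketch only conveys the maxi-minimization alternation.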
Our PSCS framework is applicable to other types of learning tasks that can be expressed in a min-max form in which the inner objective has a finite maximum over its feasible set and is a linear function of the kernel-combination coefficients. How can our framework be applied to problems of such a form?

Examples that our framework is compatible with are Support Vector Domain Description (SVDD) and the One-class SVM for outlier-detection problems, and Kernel Ridge Regression (KRR) for regression problems. (A minimal kernel ridge regression sketch appears at the end of this transcript.)

Conclusions: we proposed a novel MKL-based MTL framework called PSCS, which allows some tasks to share a common feature space, while other tasks are appropriately associated with their task-specific feature spaces, potentially yielding robust models. The process is "automatic". We presented a practical iterative maxi-minimization algorithm to address the learning of our framework's tasks. Experimental results demonstrated the advantages and utility of our approach. The PSCS framework can be extended to tasks other than SVM classification, which adds to its utility.

Please feel free to ask me questions…

Thank you for attending my presentation!
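As a closing illustration of the extension mentioned above, here is a minimal kernel ridge regression sketch with a pre-combined, fixed-coefficient kernel; this is my own example using scikit-learn's KernelRidge with a precomputed Gram matrix on synthetic data, not the PSCS formulation itself.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

# The same multiple-kernel ingredient as before: a fixed convex combination of base kernels.
beta = np.array([0.5, 0.5])
K = beta[0] * linear_kernel(X) + beta[1] * rbf_kernel(X, gamma=0.1)

# Kernel ridge regression on the combined, precomputed Gram matrix.
krr = KernelRidge(alpha=1.0, kernel="precomputed").fit(K, y)
print("training R^2 of kernel ridge regression on the combined kernel:", round(krr.score(K, y), 3))
```

Because KRR, like the SVM dual, depends on the data only through the Gram matrix, swapping in a learned kernel combination is the natural way such a regression task would plug into a multiple-kernel framework.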