Transcript Slide 1

Series of Electrical & Computer Engineering Seminars – Spring 2011
Florida Institute of Technology
 Dr. Georgios C. Anagnostopoulos
Associate Professor
Electrical & Computer Engineering Dept.
Florida Institute of Technology
[email protected]
http://my.fit.edu/~georgio
Machine Learning Laboratory @ FIT

Multi-Task Learning (MTL) has recently amassed significant interest in the Machine
Learning (ML) research community. It focuses on computational approaches in which a
single model is responsible for simultaneously learning an ensemble of several, possibly
conflicting, tasks, offering the option of gracefully trading off performance among them.
MTL surfaces in scenarios where there are constraints on available computational
resources and/or when task importance is non-stationary in time.

On the other hand, the significance of Kernel-based Methods (KMs) in ML hinges on
the use of a kernel function that induces an implicit mapping from a given, original
feature space to a so-called Reproducing Kernel Hilbert Space. Often, the ML
problem posed in the latter space is easier and/or more efficiently solved.
Furthermore, through appropriate choice of kernel functions, many standard ML
algorithms can be easily adapted to handle a vast variety of data types.

This talk's aim is twofold. First, it provides a high-level exposition of KMs and MTL in
order to give the audience an appropriate context; relevant key concepts will be briefly
illustrated. Second, it discusses a very recently developed MTL framework that employs
a multiple-kernel approach and is suitable for accommodating a broad spectrum of tasks.
Through this framework, some appropriate tasks are able to share a common,
kernel-induced feature space, which may lead to more robust and trustworthy models
participating in the framework. The framework’s mathematical formulation will be
exhibited along with a maxi-minimization algorithm to optimize it. Finally, some
preliminary experimental results will also be presented to illustrate the framework’s utility.

Major ML tasks
 Pattern Recognition
▪ Classification (Bayes Risk Criterion)
▪ Detection (Neyman-Pearson Criterion)
▪ Robust Recognition (Min-Max Criterion)
 Regression
▪ Linear & Non-linear
 Clustering
▪ Density Estimation
 And many more…

Learning refers to estimating the parameters of an ML
model from the available data based on a loss function.
Learning ⇒ Optimization.

Multi-Task Learning (MTL) refers to training an ensemble of
models that share common parameters and each of which is
dedicated to a different task.
(Figure: a common input x feeds two models; the Task 1 specific model uses parameters θ1 and θcommon to produce output y1, while the Task 2 specific model uses θ2 and θcommon to produce output y2.)
Note: Sharing of parameters between tasks leads to task interdependence.
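As a minimal sketch of the diagram above (my own illustration, not code from the talk): two linear task "heads" share the common parameters θcommon through a joint feature transformation, while each keeps its own task-specific parameters, so updating the shared parameters for one task necessarily affects the other.

```python
# Sketch only: two task-specific outputs built on a shared feature transformation.
import numpy as np

rng = np.random.default_rng(0)

theta_common = rng.normal(size=(5, 3))   # shared parameters: map 5-dim input to a 3-dim shared space
theta_1 = rng.normal(size=3)             # Task 1 specific parameters
theta_2 = rng.normal(size=3)             # Task 2 specific parameters

def predict(x):
    z = x @ theta_common                 # common feature transformation used by both tasks
    y1 = z @ theta_1                     # Task 1 output
    y2 = z @ theta_2                     # Task 2 output
    return y1, y2

print(predict(rng.normal(size=5)))
```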

Reasons for common parameters

Common feature transformations
▪ Learning one task may help in learning another task.


Constraints on available memory resources
Ability to trade off performance between tasks
▪ Especially useful, when task importance is non-stationary in time

MTL Examples



In nature: one brain for multiple tasks
Minimizing both the false alarm and the miss rate in a detector
Reduce the feature space dimensionality (e.g. for computational
advantages or visualization purposes), while optimizing performance
▪ A rather singular case: model regularization, i.e. reducing the “effective”
model complexity, while learning a task.

Optimizing performance while trying to learn invariances in the
outputs.
 And many more…

MTL is formulated as a multi-criterion optimization problem. An
effort is made to learn all tasks (optimize all criteria,
loss/cost/merit functions) simultaneously.

Notes:
▪ We lump all parameters together in a single vector parameter θ.
▪ We assume each task is connected to a loss (cost) function fi to be minimized.

Domination of solutions (illustrated in the (f1, f2) objective plane for candidate solutions θA, θB, θC):
 θA strictly dominates θC
 θB dominates θC
 There is no dominance relationship between θA and θB

We are particularly interested in the solutions that generate the
Pareto Front, i.e. the set of all non-dominated solutions.
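To make the dominance relation and the Pareto Front concrete, here is a short illustrative sketch (my own toy example, not from the slides): it tests dominance between cost vectors (f1, f2) and keeps only the non-dominated candidates from a finite set.

```python
import numpy as np

def dominates(fa, fb):
    """True if cost vector fa dominates fb: no worse in every objective, strictly better in at least one."""
    fa, fb = np.asarray(fa), np.asarray(fb)
    return np.all(fa <= fb) and np.any(fa < fb)

def pareto_front(costs):
    """Return the non-dominated rows of an (n_solutions x n_objectives) cost array."""
    costs = np.asarray(costs)
    keep = [i for i, fi in enumerate(costs)
            if not any(dominates(fj, fi) for j, fj in enumerate(costs) if j != i)]
    return costs[keep]

# Mirrors the slide: theta_C is dominated, theta_A and theta_B are not comparable.
f_A, f_B, f_C = (1.0, 3.0), (3.0, 1.0), (4.0, 4.0)
print(dominates(f_A, f_C), dominates(f_B, f_C), dominates(f_A, f_B))  # True True False
print(pareto_front([f_A, f_B, f_C]))                                  # keeps f_A and f_B
```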

Searching for a Pareto-optimal solution
(Figure: panels (a) and (b) show the negative cost gradients -g1 and -g2 and the resulting beneficial search directions.)

Pareto Path & Pareto Front
(Figure: one panel shows the Pareto path in the (θ1, θ2) parameter space between the individual optima θ*1 and θ*2; the other shows the set of feasible f1 & f2 values, whose boundary contains the Pareto Front.)

Non-convex (a) vs. convex (b) feasible cost function regions
(Figure: panel (a) depicts a non-convex feasible region of (f1, f2) values; panel (b) depicts a convex one.)
In the general, non-convex case, identifying the Pareto Front can
be achieved via constrained optimization, but there are several
challenges:
 Multiple solutions for the same cost function values
 Dependence on the algorithm's starting point
When the cost/loss functions are convex, the search is much
easier.

The convex case
(Figure: in the (f1, f2) plane, the level lines of two weighted sums, λ1f1(θ)+λ2f2(θ) and λ'1f1(θ)+λ'2f2(θ), touch the Pareto Front at different points.)

Each point on the Pareto Front corresponds to a unique solution,
which can be found via scalarization of the problem

Minimize λ1f1 (θ)+λ2f2(θ) with arbitrary importance weights
λ1, λ2≥0
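As a toy sketch of scalarization (my own example, assuming SciPy is available; not from the slides): the weighted sum of two convex costs is minimized, and sweeping the weights traces out different Pareto-optimal trade-offs.

```python
import numpy as np
from scipy.optimize import minimize

# Two toy convex costs with conflicting minimizers (theta = [t1, t2]).
f1 = lambda th: (th[0] - 1.0) ** 2 + th[1] ** 2
f2 = lambda th: th[0] ** 2 + (th[1] - 1.0) ** 2

def scalarized_solution(lam1, lam2):
    """Minimize lam1*f1 + lam2*f2; each non-negative weight pair picks one Pareto-optimal point."""
    obj = lambda th: lam1 * f1(th) + lam2 * f2(th)
    return minimize(obj, x0=np.zeros(2)).x

for lam1 in (0.9, 0.5, 0.1):
    th = scalarized_solution(lam1, 1.0 - lam1)
    print(f"lam1={lam1:.1f}  theta={th.round(3)}  f1={f1(th):.3f}  f2={f2(th):.3f}")
```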

Positive (semi-)definite kernels

Property



Construction of inner product kernels

H is called the Reproducing Kernel Hilbert Space
φ is never used explicitly
There are many other ways, though
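The formulas for these bullets did not survive the transcript; for reference, the standard statements they refer to are given below (my reconstruction, not copied from the slide): the positive semi-definiteness property of a kernel and the inner-product construction of a kernel from a feature map φ into the RKHS H.

```latex
% Positive semi-definiteness of a kernel k (standard definition):
\[
  \sum_{i=1}^{n} \sum_{j=1}^{n} c_i\, c_j\, k(x_i, x_j) \;\ge\; 0
  \qquad \text{for all } n, \; x_1, \dots, x_n, \; c_1, \dots, c_n \in \mathbb{R}.
\]
% Construction of an inner-product kernel from a feature map \varphi into the RKHS H:
\[
  k(x, x') \;=\; \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}.
\]
```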

An example of kernel function: Polynomial kernel
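The slide's formula is not in the transcript; a common form of the polynomial kernel is k(x, y) = (⟨x, y⟩ + c)^d. The sketch below (illustrative, with made-up data) checks that for degree d = 2 and c = 1 on 2-D inputs the kernel value equals an explicit inner product in the induced feature space, so φ never has to be used in practice.

```python
import numpy as np

def poly_kernel(x, y, degree=2, c=1.0):
    """Polynomial kernel k(x, y) = (<x, y> + c)^degree."""
    return (np.dot(x, y) + c) ** degree

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel with c = 1 on 2-D inputs."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, y = np.array([0.5, -1.0]), np.array([2.0, 0.3])
print(poly_kernel(x, y), phi(x) @ phi(y))   # both print 2.89
```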

Why worry about kernels?
From: Schoelkopf et al., A Generalized Representer Theorem, NeuroCOLT2 Technical Report Series,
NC2-TR-2000-81, May 2000.

In summary:
 Out of an enormous class of well-behaved functions, many ML
problems/tasks have solutions that admit a kernel expansion,
which is based on the available training set.
 Such problems include
▪ Classification
▪ Detection
▪ Regression
▪ Clustering

Many successful linear models that have been thoroughly studied in
Statistics and Machine Learning involve the data only in the form of
inner product computations during parameter estimation (the learning
algorithm).

Replacing the inner products with kernel function evaluations (the
so-called “kernel trick”) gives rise to non-linear models that
encompass a kernel-induced feature transformation as a potentially
highly useful pre-processing step, e.g. non-linearly separable data
may become linearly separable (useful in classification).
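A small sketch of that last point (my own toy example, assuming scikit-learn is available): two concentric classes are not linearly separable in the original 2-D space, but an RBF-kernel SVM separates them easily.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D feature space.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)          # kernel-induced feature transformation

print("linear kernel accuracy:", linear_svm.score(X, y))   # roughly chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))      # close to 1.0
```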

Kernels can be interpreted as similarity measures of the data
points involved.

Kernels have been developed for a multitude of data types

Numeric
 Non-numeric (Categorical)
▪ E.g. Voting records, contents of documents, etc.

Sequential
▪ E.g. Time series, images, etc.

Mixed Type
▪ E.g. credit history, DNA sequences, graphs, trees, etc.

Therefore, classic linear methods, which relied on the numeric
nature of the data, can now be adapted to work with virtually any
data type, provided we supply a meaningful kernel for the particular
data type at hand.

Translation-invariant kernels

Very popular in regression and pattern recognition
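The slide's formula is missing from the transcript; as a reminder (standard definition, not copied from the slide), a translation-invariant kernel depends only on the difference of its arguments, the Gaussian (RBF) kernel being the usual example.

```latex
% A translation-invariant kernel depends only on x - x':
\[
  k(x, x') \;=\; \kappa(x - x'),
  \qquad \text{e.g. the Gaussian (RBF) kernel} \quad
  k(x, x') \;=\; \exp\!\left( -\frac{\|x - x'\|^{2}}{2\sigma^{2}} \right).
\]
```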

Support Vector Machines

One of the most well-known 2-class recognition models
Primal Problem
\[
  \min_{w,\, b,\, \xi} \;\; \tfrac{1}{2}\,\|w\|^{2} + C \sum_{i} \xi_i
  \qquad \text{s.t.} \quad y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \;\; \xi_i \ge 0.
\]
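Not shown on the slide, but useful for the kernel discussion that follows: the standard dual of this soft-margin problem involves the training data only through inner products ⟨x_i, x_j⟩, which is exactly where a kernel k(x_i, x_j) can be substituted.

```latex
% Standard soft-margin SVM dual; the data enter only through inner products.
\[
  \max_{\alpha} \;\; \sum_{i} \alpha_i
  \;-\; \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j\, y_i y_j\, \langle x_i, x_j \rangle
  \qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \;\; \sum_{i} \alpha_i y_i = 0.
\]
```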

The benefit of using kernels with SVMs other than the linear one
(i.e. the usual dot product in the original feature space):
 The more dimensions the induced feature space has, the higher the
probability that the 2 classes become linearly separable.

L1-norm SVMs

Finding Sparse Solutions
(Figure: contour levels c1 < c2 < c3 < c4 of the loss in the (θ1, θ2) plane; the L1-norm-constrained optimal solution is sparse, while the L2-norm-constrained optimal solution is non-sparse.)
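A quick sketch of the sparsity effect depicted above (toy data, assuming scikit-learn): with an L1 penalty, a linear SVM drives many weights exactly to zero, while the L2-penalized solution keeps essentially all of them non-zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# 50 features, only a handful of which are informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

l1_svm = LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=10000).fit(X, y)
l2_svm = LinearSVC(penalty="l2", dual=False, C=0.1, max_iter=10000).fit(X, y)

print("non-zero weights with L1 penalty:", np.count_nonzero(l1_svm.coef_))
print("non-zero weights with L2 penalty:", np.count_nonzero(l2_svm.coef_))
```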

The Multiple Kernel Learning (MKL) [2] scheme finds the optimal kernel by
combining several pre-defined kernel functions.

A linear combination is most frequently used and the
combination coefficients are optimized.

Here we will assume that the task is 2-class recognition by an
SVM
[2] Gert R.G. Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I. Jordan. Learning the
kernel matrix with semidefinite programming. JMLR, 5:27-72, 2004.
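A minimal sketch of the linear-combination idea (toy data, assuming scikit-learn; the combination coefficients here are fixed by hand, whereas an actual MKL solver such as [2] would optimize them):

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.svm import SVC

X = np.random.default_rng(0).normal(size=(100, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Pre-defined base kernels evaluated on the training data.
base_kernels = [linear_kernel(X), polynomial_kernel(X, degree=2), rbf_kernel(X, gamma=0.5)]

# Non-negative combination coefficients (fixed here; MKL would learn them).
mu = np.array([0.2, 0.3, 0.5])
K_combined = sum(m * K for m, K in zip(mu, base_kernels))

clf = SVC(kernel="precomputed").fit(K_combined, y)
print("training accuracy with the combined kernel:", clf.score(K_combined, y))
```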
Combining several SVM tasks via MTL [3] by using a shared feature
representation (via MKL). All tasks are assumed equally important.
[3] Theodoros Evgeniou, Charles A. Micchelli, and Massimiliano Pontil. Learning multiple tasks with kernel
methods. JMLR, 6:615-637, 2005.
A more general framework [4]:
: shared space coefficient
: task-specific coefficient
Forcing all tasks to share a common feature space is tantamount to
increasing the number of training samples, which can enhance
classification performance.
[4] Lei Tang, Jianhui Chen, and Jieping Ye. On multiple kernel learning with multiple labels. In IJCAI, 2009.

Comments

Experimental results indicate that [4] significantly favors a Common
Space approach.

There may be some problems, for which sharing a common space is
not the optimal choice.

A multi-task learning problem may include a few complex tasks, but
also some much simpler ones.

In this situation, it may be difficult to find a common feature mapping
such that the kernel machine performs well for all tasks in the mapped
common feature space.

It is reasonable to let the complex tasks be mapped to their task-specific spaces, while keeping the other tasks sharing a common space.
A novel Partially-Shared Common Space (PSCS) framework:
: shared space coefficient.
: task-specific coefficient.
: inter-task coefficient. It is desired to be 0 for some t’s.
p-norm and q-norm constraints are introduced to control the
sparsity of the coefficients.
The value of q is desired to be small, which gives sparse .
Two special cases:
and
.

How to solve this optimization problem?

A practical approach

Iteratively solve the outer maximization problem and the
inner minimization problem.
 The maximization problem consists of T independent SVM problems, which
can be addressed by any SVM solver.
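The skeleton below is only a schematic of this alternating scheme (the framework's actual coefficient update is not reproduced here; a simple normalized heuristic is substituted purely for illustration): given the current kernel-combination coefficients, each of the T tasks is solved as an independent SVM with a precomputed kernel, and the coefficients are then re-estimated from the resulting dual solutions.

```python
import numpy as np
from sklearn.svm import SVC

def alternating_maximin(task_data, base_kernels, n_iter=10, C=1.0):
    """Schematic of the alternating scheme. task_data is a list of (X, y) pairs,
    base_kernels a list of callables k(A, B); the inner update below is a
    stand-in heuristic, NOT the framework's actual coefficient update."""
    M = len(base_kernels)
    mu = np.full(M, 1.0 / M)                      # start from a uniform combination
    for _ in range(n_iter):
        # Outer step: T independent SVM problems with the current combined kernel.
        svms = []
        for X, y in task_data:
            K = sum(m * k(X, X) for m, k in zip(mu, base_kernels))
            svms.append(SVC(C=C, kernel="precomputed").fit(K, y))
        # Inner step (placeholder heuristic): weight each base kernel by the
        # quadratic form of the dual coefficients on its support vectors.
        scores = np.zeros(M)
        for (X, y), svm in zip(task_data, svms):
            a = np.abs(svm.dual_coef_).ravel()
            sv = svm.support_
            for m_idx, k in enumerate(base_kernels):
                scores[m_idx] += a @ k(X[sv], X[sv]) @ a
        mu = scores / scores.sum()
    return mu, svms
```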

How to solve the inner minimization problem?

Approach taken:
 Attempt to separate this problem into three minimization
problems.
The solutions to the three individual problems are:
where
.
Iris Dataset
• 3 classes
• 50 patterns in each class
• 2 attributes are employed from
the original 4-attribute dataset
• Task1 (class 1 vs. class 2)
• Task2 (class 1 vs. class 3)
• Task3 (class 2 vs. class 3)
• One-vs-all for multiclass classification
problems
• Ten kernels: Linear kernel, Polynomial kernel
with degree 2, Gaussian kernels with 8
different spread parameters: {2^0, …, 2^7}
• Fix p = 2, and choose parameters C and q by
cross-validation: C is searched in {1/27, 1/9, 1/3,
1, 3, 9, 27}, and q is searched from 1.1 to 2.0
with a step size of 0.1
• Our approach is compared with the Common
Space (CS) approach, by setting
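Assuming scikit-learn, the setup described above could be assembled along the lines of the sketch below (only the kernel list and the search grids are shown; the PSCS solver itself is not reproduced, and interpreting the "spread" as σ with gamma = 1/(2σ²) is my assumption).

```python
import numpy as np
from functools import partial
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Ten base kernels: linear, degree-2 polynomial, and Gaussians with spreads 2^0 ... 2^7.
spreads = [2.0 ** k for k in range(8)]
kernels = [linear_kernel, partial(polynomial_kernel, degree=2)]
kernels += [partial(rbf_kernel, gamma=1.0 / (2.0 * s ** 2)) for s in spreads]  # spread -> gamma (assumption)

# Cross-validation grids for C and q (p is fixed to 2), as described above.
C_grid = [1/27, 1/9, 1/3, 1, 3, 9, 27]
q_grid = np.arange(1.1, 2.0 + 1e-9, 0.1)

print(len(kernels), "kernels;", len(C_grid), "values of C;", len(q_grid), "values of q")
```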
Our PSCS framework is applicable to other types of learning tasks
that can be expressed as
where has a finite maximum in the feasible set and is a
linear function in .
How can our framework be applied to problems of this form?
Examples that our framework is compatible with are Support
Vector Domain Description (SVDD), One-class SVM for outlier
detection problems, and Kernel Ridge Regression (KRR) for
regression problems.
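For instance, Kernel Ridge Regression fits this mold; the snippet below (toy data, assuming scikit-learn) simply shows a single-task KRR with a Gaussian kernel, one of the compatible building blocks mentioned above, not the PSCS extension itself.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Toy 1-D regression problem: a noisy sine wave.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

# Kernel Ridge Regression with a Gaussian (RBF) kernel.
krr = KernelRidge(kernel="rbf", gamma=0.5, alpha=0.1).fit(X, y)
print("training R^2:", krr.score(X, y))
```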

We proposed a novel MKL-based MTL framework called PSCS,
which allows some tasks to share a common feature space, while
other tasks are appropriately associated to their task-specific
feature spaces, potentially yielding robust models. The process is
“automatic”.

We presented a practical iterative maxi-minimization algorithm
to address the learning of our framework’s tasks.

Experimental results demonstrated the advantages and utility of
our approach.

The PSCS framework can be extended to tasks other than SVM
classification, which adds to its utility.

Please feel free to ask me questions…
Thank you for attending my presentation!