Nonlinear Knowledge in Kernel Machines (PowerPoint)
Nonlinear Knowledge in Kernel
Machines
Data Mining and Mathematical Programming Workshop
Centre de Recherches Mathématiques
Université de Montréal, Québec
October 10-13, 2006
Olvi Mangasarian
UW Madison & UCSD La Jolla
Edward Wild
UW Madison
Objectives
Primary objective: Incorporate prior knowledge
over completely arbitrary sets into:
function approximation, and
classification
without transforming (kernelizing) the knowledge
Secondary objective: Achieve transparency of the
prior knowledge for practical applications
Graphical Example of
Knowledge Incorporation
[Figure: + data points and knowledge sets {x : h1(x) ≤ 0}, {x : g(x) ≤ 0}, and {x : h2(x) ≤ 0} around the nonlinear separating surface K(x′, B′)u = γ]
Similar approach for approximation
Outline
Kernels in classification and function approximation
Incorporation of prior knowledge
Previous approaches: require transformation of knowledge
New approach does not require any transformation of knowledge
• Knowledge given over completely arbitrary sets
Fundamental tool used
Theorem of the alternative for convex functions
Experimental results
Four synthetic examples and two examples related to breast cancer prognosis
Classifiers and approximations with prior knowledge more accurate than those
without prior knowledge
Theorem of the Alternative for Convex Functions:
Generalization of the Farkas Theorem
Farkas: For A ∈ R^{k×n} and b ∈ R^n, exactly one of the following must hold:
I. Ax ≤ 0, b′x > 0 has a solution x ∈ R^n, or
II. A′v = b, v ≥ 0 has a solution v ∈ R^k.
Let h: Γ ⊂ R^n → R^k and θ: Γ → R, with Γ a convex set, h and θ convex on Γ, and
h(x) < 0 for some x ∈ Γ. Then exactly one of the following must hold:
I. h(x) ≤ 0, θ(x) < 0 has a solution x ∈ Γ, or
II. ∃v ∈ R^k, v ≥ 0: v′h(x) + θ(x) ≥ 0 ∀x ∈ Γ.
To get II of Farkas:
v ≥ 0, v′Ax − b′x ≥ 0 ∀x ∈ R^n ⇔ v ≥ 0, A′v − b = 0.
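To make the alternative concrete, here is a small numerical sketch (not from the talk; it assumes NumPy and SciPy, and the matrix A and vector b are invented illustrative data). It tests system II with a feasibility linear program and otherwise exhibits a solution of system I:

    # Numeric illustration of the Farkas alternative: exactly one of
    #   I.  Ax <= 0, b'x > 0   has a solution x, or
    #   II. A'v = b, v >= 0    has a solution v.
    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[1.0, 0.0], [0.0, 1.0]])  # invented data, k = n = 2
    b = np.array([1.0, 1.0])

    # Try II: find v >= 0 with A'v = b (feasibility LP, zero objective).
    res = linprog(c=np.zeros(A.shape[0]), A_eq=A.T, b_eq=b,
                  bounds=[(0, None)] * A.shape[0])
    if res.success:
        print("II holds, v =", res.x)  # then I must be infeasible
    else:
        # Try I: maximize b'x subject to Ax <= 0; the box bounds only
        # normalize the homogeneous system.
        res = linprog(c=-b, A_ub=A, b_ub=np.zeros(A.shape[0]),
                      bounds=[(-1, 1)] * A.shape[1])
        print("I holds, x =", res.x, "with b'x =", b @ res.x)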
Classification and Function
Approximation
Given a set of m points in n-dimensional real space R^n with
corresponding labels
Labels in {+1, −1} for classification problems
Labels in R for approximation problems
Points are represented by rows of a matrix A ∈ R^{m×n}
Corresponding labels or function values are given by a
vector y
Classification: y ∈ {+1, −1}^m
Approximation: y ∈ R^m
Find a function f with f(A_i) ≈ y_i based on the given data points A_i
f: R^n → {+1, −1} for classification
f: R^n → R for approximation
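As a minimal illustration of this data layout (the values below are invented placeholders, not data from the talk):

    import numpy as np

    # m = 4 points in n = 2 dimensions, stored as the rows of A
    A = np.array([[0.0, 1.0],
                  [1.0, 0.0],
                  [2.0, 2.0],
                  [3.0, 1.0]])

    y_class  = np.array([+1, +1, -1, -1])      # labels for classification
    y_approx = np.array([0.5, 1.2, 3.3, 4.1])  # values for approximation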
Classification and Function
Approximation
Problem: utilizing only given data may
result in a poor classifier or approximation
Points may be noisy
Sampling may be costly
Solution: use prior knowledge to improve
the classifier or approximation
Adding Prior Knowledge
Standard approximation and classification: fit
function at given data points without knowledge
Constrained approximation: satisfy inequalities at
given points
Previous approaches (FMS 2001, FMS 2003, and
MSW 2004): satisfy linear inequalities over
polyhedral regions
Proposed new approach: satisfy nonlinear
inequalities over arbitrary regions without
kernelizing (transforming) knowledge
Kernel Machines
Approximate f by a nonlinear kernel function K
using parameters u ∈ R^k and γ ∈ R
A kernel function is a nonlinear generalization of
the scalar product
f(x) ≈ K(x′, B′)u − γ, x ∈ R^n, K: R^n × R^{n×k} → R^k
Gaussian kernel: K(x′, B′)_i = exp(−μ||x − B_i′||²), i = 1, …, k
B ∈ R^{k×n} is a basis matrix
Usually B = A ∈ R^{m×n}, the input data matrix
In Reduced Support Vector Machines, B is a small
subset of the rows of A
B may be any matrix with n columns
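A minimal NumPy sketch of such a Gaussian kernel matrix; the width parameter mu is an assumption, since the talk does not fix its value:

    import numpy as np

    def gaussian_kernel(A, B, mu=1.0):
        """K(A, B')_{ij} = exp(-mu*||A_i - B_j||^2) for rows A_i of A, B_j of B."""
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # (m, k) distances
        return np.exp(-mu * d2)

    # B may be the full data matrix A, or a reduced subset of its rows (RSVM)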
Kernel Machines
Introduce a slack variable s to measure the error in
classification or approximation
Error s in kernel approximation of the given data:
−s ≤ K(A, B′)u − eγ − y ≤ s, where e is a vector of ones in R^m
Function approximation: f(x) ≈ K(x′, B′)u − γ
Error s in kernel classification of the given data:
K(A+, B′)u − eγ + s+ ≥ e, s+ ≥ 0
K(A−, B′)u − eγ − s− ≤ −e, s− ≥ 0
More succinctly, let D = diag(y), the m×m matrix
with diagonal y of ±1's; then:
D(K(A, B′)u − eγ) + s ≥ e, s ≥ 0
Classifier: f(x) = sign(K(x′, B′)u − γ)
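Continuing the sketch above (gaussian_kernel as defined earlier; u and gamma come from the linear program on the next slide), the classification slacks and the classifier itself might look like:

    import numpy as np

    def classification_slacks(A, B, y, u, gamma, mu=1.0):
        """Smallest s >= 0 with D(K(A, B')u - e*gamma) + s >= e, D = diag(y)."""
        K = gaussian_kernel(A, B, mu)
        margin = y * (K @ u - gamma)           # rows of D(K(A, B')u - e*gamma)
        return np.maximum(0.0, 1.0 - margin)   # elementwise slack

    def classify(x, B, u, gamma, mu=1.0):
        """f(x) = sign(K(x', B')u - gamma)."""
        K = gaussian_kernel(np.atleast_2d(x), B, mu)
        return np.sign(K @ u - gamma)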
Kernel Machines in Approximation or Classification
Trade off solution complexity (||u||_1) against data fitting (||s||_1)
At the solution: e′a = ||u||_1 and e′s = ||s||_1,
where a is the auxiliary variable bounding u via −a ≤ u ≤ a
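The transcript does not reproduce the linear program itself; the sketch below is one standard 1-norm formulation consistent with the quantities above, written for the approximation case and solved with scipy.optimize.linprog. The trade-off weight nu is an assumed parameter, and gaussian_kernel is the helper sketched earlier:

    import numpy as np
    from scipy.optimize import linprog

    def fit_approximation(A, B, y, nu=1.0, mu=1.0):
        """Sketch of a 1-norm kernel approximation LP:
             min  nu*e's + e'a
             s.t. -s <= K(A, B')u - e*gamma - y <= s,  -a <= u <= a.
        Variable order: [u (k), gamma (1), a (k), s (m)]."""
        K = gaussian_kernel(A, B, mu)
        m, k = K.shape
        c = np.concatenate([np.zeros(k), [0.0], np.ones(k), nu * np.ones(m)])

        I_k, I_m = np.eye(k), np.eye(m)
        e = np.ones((m, 1))
        A_ub = np.vstack([
            np.hstack([ K, -e, np.zeros((m, k)), -I_m]),  #  Ku - e*g - s <= y
            np.hstack([-K,  e, np.zeros((m, k)), -I_m]),  # -Ku + e*g - s <= -y
            np.hstack([ I_k, np.zeros((k, 1)), -I_k, np.zeros((k, m))]),  #  u - a <= 0
            np.hstack([-I_k, np.zeros((k, 1)), -I_k, np.zeros((k, m))]),  # -u - a <= 0
        ])
        b_ub = np.concatenate([y, -y, np.zeros(2 * k)])
        bounds = ([(None, None)] * (k + 1)   # u and gamma are free
                  + [(0, None)] * (k + m))   # a and s are nonnegative
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        u, gamma = res.x[:k], res.x[k]
        return u, gamma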
Incorporating Nonlinear Prior Knowledge:
Previous Approaches
Cx ≤ d ⇒ w′x − γ ≥ h′x + β
Need to "kernelize" the knowledge from input space to the
transformed (feature) space of the kernel
Requires the change of variables x = A′t, w = A′u:
CA′t ≤ d ⇒ u′AA′t − γ ≥ h′A′t + β
K(C, A′)t ≤ d ⇒ u′K(A, A′)t − γ ≥ h′A′t + β
Use a linear theorem of the alternative in the t-space
The connection with the original knowledge is lost
Achieves good numerical results, but is not readily
interpretable in the original space
Nonlinear Prior Knowledge in Function Approximation:
New Approach
Start with an arbitrary nonlinear knowledge
implication:
g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀x ∈ Γ ⊂ R^n
g and h are arbitrary functions on Γ: g: Γ → R^k, h: Γ → R
The implication is implied by:
∃v ≥ 0: v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀x ∈ Γ,
which is linear in v, u, and γ
Theorem of the Alternative for
Convex Functions
Assume that g(x), K(x′, B′)u − γ, and −h(x) are convex functions
of x, that Γ is convex, and that g(x) < 0 for some x ∈ Γ. Then either:
I. g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has a solution x ∈ Γ, or
II. ∃v ∈ R^k, v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ.
But never both.
If we can find v ≥ 0: K(x′, B′)u − γ − h(x) + v′g(x) ≥ 0,
∀x ∈ Γ, then by the above theorem
g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0 has no solution x ∈ Γ, or
equivalently:
g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ h(x), ∀x ∈ Γ
Proof
¬I ⇒ II (not needed for the present application):
Follows from a fundamental theorem of Fan, Glicksberg, and Hoffman
for convex functions [1957] and the existence of an x ∈ Γ such that g(x) < 0.
II ⇒ ¬I (requires no assumptions on g, h, K, or
u whatsoever):
Suppose not; that is, suppose there exist x ∈ Γ and v ∈ R^k, v ≥ 0, such that both hold:
g(x) ≤ 0, K(x′, B′)u − γ − h(x) < 0, (I)
v ≥ 0, v′g(x) + K(x′, B′)u − γ − h(x) ≥ 0, ∀x ∈ Γ (II)
Then we have the contradiction:
0 ≤ v′g(x) + K(x′, B′)u − γ − h(x) < 0,
since v ≥ 0 and g(x) ≤ 0 give v′g(x) ≤ 0, while K(x′, B′)u − γ − h(x) < 0.
Incorporating Prior Knowledge
Linear semi-infinite program: an infinite number of
constraints
Discretize to obtain a finite linear program:
g(x_i) ≤ 0 ⇒ K(x_i′, B′)u − γ ≥ h(x_i), i = 1, …, k
Slacks allow the knowledge to be satisfied inexactly
Add a term to the objective function to drive the slacks to zero
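One way the discretized knowledge might be prepared for such a linear program is sketched below; the sample points X, the functions g and h, and the knowledge slacks z (penalized in the objective with an assumed weight sigma) are user-supplied assumptions, and gaussian_kernel is the helper sketched earlier:

    import numpy as np

    def knowledge_constraint_data(X, B, g, h, mu=1.0):
        """Data for the discretized prior-knowledge constraints
           v'g(x_i) + K(x_i', B')u - gamma - h(x_i) + z_i >= 0, v >= 0, z >= 0,
        which are linear in the LP variables u, gamma, v, z.
        Returns G (l, p), Kx (l, k), hx (l,) for l discretization points."""
        G  = np.array([g(x) for x in X])   # each g(x) is a vector in R^p
        Kx = gaussian_kernel(X, B, mu)     # rows K(x_i', B')
        hx = np.array([h(x) for x in X])
        # In <= form for linprog: -G v - Kx u + gamma*e - z <= -hx,
        # with sigma * e'z added to the objective.
        return G, Kx, hx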
Numerical Experience:
Approximation
Evaluate on three datasets
Two synthetic datasets
Wisconsin Prognostic Breast Cancer Database (WPBC)
194 patients × 2 histological features
• tumor size & number of metastasized lymph nodes
Compare approximation with prior knowledge to
approximation without prior knowledge
Prior knowledge leads to improved accuracy
General prior knowledge used cannot be handled
exactly by previous work (MSW 2004)
No kernelization needed on knowledge set
Two-Dimensional Hyperboloid
f(x1, x2) = x1x2
Two-Dimensional Hyperboloid
Given exact values only at 11 points along the
line x1 = x2, at x1 ∈ {−5, …, 5}
[Figure: the 11 data points in the (x1, x2) plane]
Two-Dimensional Hyperboloid
Approximation without Prior Knowledge
Two-Dimensional Hyperboloid
Add prior (inexact) knowledge:
x1x2 ≤ −1 ⇒ f(x1, x2) ≤ x1x2
The nonlinear term x1x2 cannot be handled
exactly by any previous approach
Discretization used only 11 points along the
line x1 = −x2, x1 ∈ {−5, −4, …, 4, 5}
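A sketch of this experimental setup, assuming the knowledge implication as reconstructed above (an upper-bound counterpart of the g(x) ≤ 0 ⇒ f(x) ≥ h(x) form):

    import numpy as np

    # 11 exact training values along the line x1 = x2
    t = np.arange(-5, 6, dtype=float)
    A_train = np.column_stack([t, t])
    y_train = t * t                     # f(x1, x2) = x1*x2 on that line

    # Knowledge x1*x2 <= -1  =>  f(x1, x2) <= x1*x2,
    # discretized at 11 points along the line x1 = -x2
    X_know = np.column_stack([t, -t])
    g = lambda x: np.array([x[0] * x[1] + 1.0])  # g(x) <= 0 iff x1*x2 <= -1
    h = lambda x: x[0] * x[1]                    # upper bound enforced on f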
Two-Dimensional Hyperboloid
Approximation with Prior Knowledge
Two-Dimensional Tower
Function
Two-Dimensional Tower Function (Misleading) Data
Given 400 points on the grid [−4, 4] × [−4, 4]
Values are min{g(x), 2}, where g(x) is the
exact tower function
Two-Dimensional Tower Function
Approximation without Prior Knowledge
Two-Dimensional Tower Function
Prior Knowledge
Add prior knowledge:
(x1, x2) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x)
The prior knowledge is the exact function value,
enforced at 2500 points on the grid
[−4, 4] × [−4, 4] through the above implication
The principal objective of the prior knowledge here
is to overcome the poor given data
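A sketch of this data generation; tower() below is a hypothetical stand-in for the exact tower function g, whose formula the transcript does not give:

    import numpy as np

    # 20 x 20 grid of 400 (misleading) training points, values clipped at 2
    xs = np.linspace(-4, 4, 20)
    X1, X2 = np.meshgrid(xs, xs)
    A_train = np.column_stack([X1.ravel(), X2.ravel()])
    y_train = np.minimum(tower(A_train), 2.0)   # tower(): hypothetical exact g

    # 50 x 50 grid of 2500 knowledge points enforcing f(x) = g(x)
    ks = np.linspace(-4, 4, 50)
    K1, K2 = np.meshgrid(ks, ks)
    X_know = np.column_stack([K1.ravel(), K2.ravel()])
    # Equality f(x) = g(x) can be imposed as the two one-sided constraints
    # f(x) >= g(x) and f(x) <= g(x) at each knowledge point.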
Two-Dimensional Tower Function
Approximation with Prior Knowledge
Breast Cancer Application:
Predicting Lymph Node Metastasis as a Function of
Tumor Size
Number of metastasized lymph nodes is an important
prognostic indicator for breast cancer recurrence
Determined by surgery in addition to the removal of the
tumor
Optional procedure, especially if the tumor size is small
Wisconsin Prognostic Breast Cancer (WPBC) data
Lymph node metastasis and tumor size for 194 patients
Task: predict the number of metastasized lymph nodes
given tumor size alone
Predicting Lymph Node
Metastasis
Split data into two portions
Past data: 20% used to find prior knowledge
Present data: 80% used to evaluate performance
Past data simulates prior knowledge
obtained from an expert
Prior Knowledge for Lymph Node
Metastasis as a Function of Tumor Size
Generate prior knowledge by fitting the past data:
h(x) := K(x′, B′)u − γ,
where B is the matrix of past data points
Use density estimation to decide where to enforce the
knowledge
p(x) is the empirical density of the past data
Prior knowledge imposed on the approximating function
f(x): the number of metastasized lymph nodes is at least the
value predicted on the past data, with a tolerance of 0.01:
p(x) ≥ 0.1 ⇒ f(x) ≥ h(x) − 0.01
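A sketch of the density-based enforcement region; scipy.stats.gaussian_kde is one plausible estimator, though the transcript does not name the one actually used:

    import numpy as np
    from scipy.stats import gaussian_kde

    def knowledge_region(past_sizes, grid):
        """Return the grid points where the empirical density p(x) >= 0.1,
        i.e. where the knowledge f(x) >= h(x) - 0.01 is enforced."""
        p = gaussian_kde(past_sizes)   # past_sizes: 1-D array of past tumor sizes
        return grid[p(grid) >= 0.1]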
Predicting Lymph Node
Metastasis: Results
Approximation                                      Error
Prior knowledge h(x), based on 20% past data       6.12 RMSE
f(x) without knowledge, based on 80% present data  5.92 LOO
f(x) with knowledge, based on 80% present data     5.04 LOO

RMSE: root-mean-squared error
LOO: leave-one-out error
Improvement due to knowledge: 14.9%
Incorporating Prior Knowledge in Classification
(Very Similar)
Implication for the positive region:
g(x) ≤ 0 ⇒ K(x′, B′)u − γ ≥ 1, ∀x ∈ Γ ⊂ R^n
Implied by: K(x′, B′)u − γ − 1 + v′g(x) ≥ 0, v ≥ 0, ∀x ∈ Γ
A similar implication holds for negative regions
Add the discretized constraints to the linear program
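Continuing the earlier sketches, the discretized classification-knowledge data might be assembled as follows; the sign argument, the region function g, and the new LP variables v ≥ 0 and z ≥ 0 are assumptions of this sketch:

    import numpy as np

    def region_constraint_data(X, B, g, sign, mu=1.0):
        """Data for discretized classification knowledge (sketch).
        Positive region (sign = +1):
            K(x_i', B')u - gamma - 1 + v'g(x_i) + z_i >= 0
        Negative region (sign = -1): the classifier output is pushed to <= -1,
        flipping the sign of the K(x_i', B')u - gamma term."""
        Kx = sign * gaussian_kernel(X, B, mu)  # note: gamma's sign flips too
        G = np.array([g(x) for x in X])
        return Kx, G   # rows for A_ub over the variables u, gamma, v, z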
Incorporating Prior Knowledge:
Classification
Numerical Experience:
Classification
Evaluate on three datasets
Two synthetic datasets
Wisconsin Prognostic Breast Cancer Database
Compare classifier with prior knowledge to one
without prior knowledge
Prior knowledge leads to improved accuracy
General prior knowledge used cannot be handled
exactly by previous work (FMS 2001, FMS 2003)
No kernelization needed on knowledge set
Checkerboard Dataset
Black & White Points in R^2
Prior Knowledge for 16-Point Checkerboard
Checkerboard Classifier Without Knowledge Using 16 Center Points
Checkerboard Classifier With Knowledge Using 16 Center Points
Spiral Dataset: 194 Points in R^2
Spiral Classifier Without Knowledge, Based on 100 Labeled Points
Labels given only at 100 points; note the many incorrectly classified +'s
(White ⇒ +, Gray ⇒ •)
Prior Knowledge Function for Spiral
Prior knowledge imposed at 291 points in each circled region
Spiral Classifier With Knowledge
100 correctly classified points; no misclassified points
Predicting Breast Cancer Recurrence
Within 24 Months
Wisconsin Prognostic Breast Cancer (WPBC) dataset
155 patients monitored for recurrence within 24 months
30 cytological features
2 histological features: number of metastasized lymph nodes and tumor size
Predict whether or not a patient remains cancer free after 24 months
82% of patients remain disease free
86% accuracy (Bennett, 1992): the best previously attained
Prior knowledge allows us to incorporate additional information to
improve accuracy
Generating WPBC Prior Knowledge
Simulate an oncological surgeon's advice about recurrence
Gray regions indicate areas where g(x) ≤ 0
Knowledge imposed at dataset points inside the given regions
[Figure: number of metastasized lymph nodes vs. tumor size in centimeters; + = recur within 24 months, • = cancer free within 24 months]
WPBC Results

Classifier          Misclassification Rate
Without Knowledge   18.1%
With Knowledge      9.0%

49.7% improvement due to knowledge
35.7% improvement over the best previous predictor
Conclusion
General nonlinear prior knowledge incorporated
into kernel classification and approximation
Implemented as linear inequalities in a linear
programming problem
Knowledge appears transparently
Demonstrated effectiveness
Four synthetic examples
Two real world problems from breast cancer prognosis
Future work
Prior knowledge with more general implications
User-friendly interface for knowledge specification
Generate prior knowledge for real-world datasets
Website Links to Talk & Papers
http://www.cs.wisc.edu/~olvi
http://www.cs.wisc.edu/~wildt