Nonlinear Knowledge in Kernel Approximation
Olvi Mangasarian
UW Madison & UCSD La Jolla
Edward Wild
UW Madison
Objectives
Primary objective: Incorporate prior knowledge
over completely arbitrary sets into function
approximation without transforming (kernelizing)
the knowledge
Secondary objective: Achieve transparency of the
prior knowledge for practical applications
Outline
Use kernels for function approximation
Incorporate prior knowledge
Previous approaches: require transformation of
knowledge
New approach does not require any transformation of
knowledge
• Knowledge given over completely arbitrary sets
Experimental results
Two synthetic examples and one real world example
related to breast cancer prognosis
Approximations with prior knowledge more accurate
than approximations without prior knowledge
Function Approximation
Given a set of m points in n-dimensional real space R^n and corresponding function values in R
Points are represented by the rows of a matrix A ∈ R^(m×n)
Exact or approximate function values for each point are given by a corresponding vector y ∈ R^m
Find a function f: R^n → R based on the given data such that
f(A_i′) ≈ y_i
Function Approximation
Problem: utilizing only given data may
result in a poor approximation
Points may be noisy
Sampling may be costly
Solution: use prior knowledge to improve
the approximation
Adding Prior Knowledge
Standard approach: fit function at given
data points without knowledge
Constrained approach: satisfy inequalities
at given points
2004 MSW Paper: satisfy linear
inequalities over polyhedral regions
Proposed new approach: satisfy nonlinear
inequalities over arbitrary regions
Kernel Approximation
Approximate f by a nonlinear kernel function K:
f(x) ≈ K(x′, A′)α + b
For the Gaussian kernel:
K(x′, A′)α = Σ_{i=1}^{m} α_i exp(−μ‖x − A_i′‖²)
Error in kernel approximation of the given data:
−s ≤ K(A, A′)α + be − y ≤ s
e is a vector of ones in R^m
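The kernel map above can be sketched in NumPy (μ below is an assumed width parameter; the slide does not give its value):

```python
import numpy as np

def gaussian_kernel(X, A, mu=0.5):
    """K(X', A')_ij = exp(-mu * ||X_i - A_j||^2) for row-vector points.

    X: (p, n) evaluation points, A: (m, n) data points -> (p, m) matrix.
    mu is an assumed kernel-width parameter (not specified on the slide).
    """
    # squared Euclidean distances between all row pairs, via broadcasting
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

A = np.array([[0.0], [1.0], [2.0]])   # m = 3 data points in R^1
K = gaussian_kernel(A, A)             # (3, 3), symmetric, ones on the diagonal
# the approximation at a point x is then K(x', A') @ alpha + b
```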
Kernel Approximation
Trade off between solution complexity (‖α‖₁) and data fitting (‖s‖₁)
Convert to a linear program by bounding |α| elementwise with a vector a ≥ 0
At the solution
e′a = ‖α‖₁ and e′s = ‖s‖₁
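A minimal sketch of this linear program, assuming SciPy's linprog (ν is the trade-off weight, and the auxiliary vector a bounds |α| elementwise as described above):

```python
import numpy as np
from scipy.optimize import linprog

def fit_kernel_lp(K, y, nu=0.1):
    """min nu*e'a + e's  s.t.  -a <= alpha <= a,  -s <= K@alpha + b*e - y <= s.

    Variables z = [alpha (m), a (m), b (1), s (m)]; a, s >= 0.
    A sketch of the 1-norm formulation on the slide; returns (alpha, b).
    """
    m = len(y)
    I, Z = np.eye(m), np.zeros((m, m))
    e = np.ones((m, 1))
    A_ub = np.block([
        [ I, -I,  np.zeros((m, 1)), Z],   # alpha - a <= 0
        [-I, -I,  np.zeros((m, 1)), Z],   # -alpha - a <= 0
        [ K,  Z,  e,               -I],   # K@alpha + b - s <= y
        [-K,  Z, -e,               -I],   # -K@alpha - b - s <= -y
    ])
    b_ub = np.concatenate([np.zeros(2 * m), y, -y])
    c = np.concatenate([np.zeros(m), nu * np.ones(m), [0.0], np.ones(m)])
    bounds = ([(None, None)] * m + [(0, None)] * m
              + [(None, None)] + [(0, None)] * m)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:m], res.x[2 * m]

# toy data: 5 points from y = x, Gaussian kernel with assumed mu = 0.5
A = np.linspace(-2.0, 2.0, 5).reshape(-1, 1)
y = A.ravel()
d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-0.5 * d2)
alpha, b = fit_kernel_lp(K, y)
pred = K @ alpha + b
```

With ν small, the fit is nearly exact; at the LP solution a = |α| and s = |residual| elementwise, which is the identity e′a = ‖α‖₁, e′s = ‖s‖₁ above.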
Incorporating Nonlinear Prior Knowledge
(MSW 2004)
Bx ≤ d ⇒ α′Ax + b ≥ h′x + β
Need to “kernelize” knowledge from input space to
feature space of kernel
Requires the change of variable x = A′t:
BA′t ≤ d ⇒ α′AA′t + b ≥ h′A′t + β
K(B, A′)t ≤ d ⇒ α′K(A, A′)t + b ≥ h′A′t + β
Motzkin’s theorem of the alternative gives an
equivalent linear system of inequalities which is
added to a linear program
Achieves good numerical results, but kernelization
is not readily interpretable in the original space
Incorporating Nonlinear Prior Knowledge:
New Approach
Start with arbitrary nonlinear knowledge implication
g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x), ∀x ∈ Γ ⊆ R^n
g, h are arbitrary functions
g: Γ → R^k, h: Γ → R
Problem: need to add this knowledge to the
optimization problem
Logically equivalent system:
g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ
Prior Knowledge as a System of
Linear Inequalities
Use a theorem of the alternative for convex functions:
Assume that g(x), K(x′, A′)α + b, −h(x) are convex functions
of x, that Γ is convex, and ∃ x ∈ Γ: g(x) < 0. Then either:
I. g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has a solution x ∈ Γ, or
II. ∃ v ∈ R^k, v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀x ∈ Γ
But never both.
If we can find v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0,
∀x ∈ Γ, then by the above theorem
g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ, or
equivalently:
g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x), ∀x ∈ Γ
Proof
¬I ⇒ II
Follows from OLM 1969, Corollary 4.2.2 and the existence
of an x ∈ Γ such that g(x) < 0.
I ⇒ ¬II
Suppose not. That is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0:
g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0, (I)
v ≥ 0, v′g(x) + K(x′, A′)α + b − h(x) ≥ 0, ∀x ∈ Γ (II)
Then, since v ≥ 0 and g(x) ≤ 0 give v′g(x) ≤ 0, we have the contradiction:
0 > v′g(x) + K(x′, A′)α + b − h(x) ≥ 0
This direction requires no assumptions on g, h, K, or Γ whatsoever
Example
g(x) = 1250 − x³, f(x) = x⁴, h(x) = x² + 5000
g(x) ≤ 0 ⇒ f(x) ≥ h(x)
¬I: the system
x³ ≥ 1250, x⁴ < x² + 5000
has no solution, so ∃ v ≥ 0:
v·g(x) + f(x) − h(x) ≥ 0, ∀x
⇒ II
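The example can be sanity-checked numerically on a grid (v = 16 below is an assumed multiplier chosen by hand, not taken from the slide):

```python
import numpy as np

g = lambda x: 1250.0 - x**3
f = lambda x: x**4
h = lambda x: x**2 + 5000.0

x = np.linspace(-50.0, 50.0, 200001)

# not-I: wherever the antecedent g(x) <= 0 holds, check f(x) >= h(x)
mask = g(x) <= 0
implication_holds = bool(np.all(f(x[mask]) >= h(x[mask])))

# II with an assumed multiplier v = 16: v*g(x) + f(x) - h(x) >= 0 on the grid
v = 16.0
certificate_holds = bool(np.all(v * g(x) + f(x) - h(x) >= 0))
```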
Incorporating Prior Knowledge
Linear semi-infinite program: infinite number of
constraints
Discretize: finite linear program
Slacks allow knowledge to be satisfied inexactly
Add term to objective function to drive slacks to zero
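The four bullets above can be sketched end to end with SciPy's linprog on an invented toy problem: learn f(x) = |x| from data on the positive axis only, with the knowledge implication g(x) = x ≤ 0 ⇒ f(x) ≥ −x discretized on [−2, 0]; a slack vector t with weight σ lets the knowledge be satisfied inexactly (all names and parameter values below are assumptions, not from the slide):

```python
import numpy as np
from scipy.optimize import linprog

mu, nu, sigma = 0.5, 0.1, 10.0

Ad = np.array([0.5, 1.0, 1.5, 2.0])        # data points (positive side only)
y = Ad.copy()                              # f(x) = |x| = x there
Xk = np.array([-2.0, -1.5, -1.0, -0.5])    # knowledge discretization of [-2, 0]
gk, hk = Xk, -Xk                           # g(x_j) = x_j, h(x_j) = -x_j

kern = lambda X, A: np.exp(-mu * (X[:, None] - A[None, :]) ** 2)
K, Kk = kern(Ad, Ad), kern(Xk, Ad)
m, p = len(Ad), len(Xk)
I, Z = np.eye(m), np.zeros((m, m))
e = np.ones((m, 1))

# variables z = [alpha (m), a (m), b (1), s (m), v (1), t (p)]
# rows: +/-alpha <= a ; +/-(K@alpha + b*e - y) <= s ;
#       -(Kk@alpha + b*e + v*g_k + t) <= -h_k     (discretized knowledge)
A_ub = np.block([
    [ I, -I,  np.zeros((m, 1)), Z,  np.zeros((m, 1)), np.zeros((m, p))],
    [-I, -I,  np.zeros((m, 1)), Z,  np.zeros((m, 1)), np.zeros((m, p))],
    [ K,  Z,  e,               -I,  np.zeros((m, 1)), np.zeros((m, p))],
    [-K,  Z, -e,               -I,  np.zeros((m, 1)), np.zeros((m, p))],
    [-Kk, np.zeros((p, m)), -np.ones((p, 1)), np.zeros((p, m)),
     -gk.reshape(-1, 1), -np.eye(p)],
])
b_ub = np.concatenate([np.zeros(2 * m), y, -y, -hk])
# objective: nu*||a||_1 + ||s||_1 + sigma*||t||_1 drives the slacks t to zero
c = np.concatenate([np.zeros(m), nu * np.ones(m), [0.0],
                    np.ones(m), [0.0], sigma * np.ones(p)])
bounds = ([(None, None)] * m + [(0, None)] * m + [(None, None)]
          + [(0, None)] * m + [(0, None)] + [(0, None)] * p)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
alpha, b = res.x[:m], res.x[2 * m]
pred_data = K @ alpha + b    # fit at the given data
pred_know = Kk @ alpha + b   # prediction at the knowledge points
```

Without the last constraint block the approximation has no reason to respect f(x) ≥ −x on the negative axis; with it, the predictions at the discretization points satisfy the knowledge up to the small slack t.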
Numerical Experience
Evaluate on three datasets
Two synthetic datasets
Wisconsin Prognostic Breast Cancer Database
Compare approximation with prior knowledge to
one without prior knowledge
Prior knowledge leads to an improved approximation
Prior knowledge used cannot be handled exactly by
previous work
No kernelization needed on knowledge set
Two-Dimensional Hyperboloid
f(x1, x2) = x1x2
Two-Dimensional Hyperboloid
Given exact values
only at 11 points along
line x1 = x2
At x1 ∈ {-5, …, 5}
Two-Dimensional Hyperboloid
Approximation without Prior Knowledge
Two-Dimensional Hyperboloid
Add prior knowledge:
x1x2 ≤ 1 ⇒ f(x1, x2) ≤ x1x2
The nonlinear term x1x2 cannot be handled
exactly by any previous approach
Discretization used only 11 points along the
line x1 = -x2, x1 ∈ {-5, -4, …, 4, 5}
Two-Dimensional Hyperboloid
Approximation with Prior Knowledge
Two-Dimensional Tower
Function
Two-Dimensional Tower Function Data
Given 400 points on the grid [-4, 4] × [-4, 4]
Values are min{g(x), 2}, where g(x) is the
exact tower function
Two-Dimensional Tower Function
Approximation without Prior Knowledge
Two-Dimensional Tower Function
Prior Knowledge
Add prior knowledge:
(x1, x2) ∈ [-4, 4] × [-4, 4] ⇒ f(x) = g(x)
Prior knowledge is the exact function value,
enforced at 2500 points on the grid
[-4, 4] × [-4, 4] through the above implication
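Assuming a uniform 50 × 50 grid (the slide gives only the count of 2500 points), the discretization can be generated as:

```python
import numpy as np

# 50 x 50 uniform grid over [-4, 4] x [-4, 4] -> 2500 knowledge points
u = np.linspace(-4.0, 4.0, 50)
X1, X2 = np.meshgrid(u, u)
grid = np.column_stack([X1.ravel(), X2.ravel()])   # shape (2500, 2)
```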
Principal objective of prior knowledge is to
overcome poor given data
Two-Dimensional Tower Function
Approximation with Prior Knowledge
Predicting Lymph Node
Metastasis
Number of metastasized lymph nodes is an
important prognostic indicator for breast cancer
recurrence
Determined by surgery in addition to the removal
of the tumor
Wisconsin Prognostic Breast Cancer (WPBC) data
Lymph node metastasis for 194 patients
30 cytological features from a fine-needle aspirate
Tumor size, obtained during surgery
Task: predict the number of metastasized lymph
nodes given tumor size alone
Predicting Lymph Node
Metastasis
Split data into two portions
Past data: 20% used to find prior knowledge
Present data: 80% used to evaluate performance
Simulates acquiring prior knowledge from
an expert’s experience
Prior Knowledge for Lymph
Node Metastasis
Use kernel approximation without knowledge on
the past data
f1(x) = K(x′, A1′)α1 + b1
A1 is the matrix of the past data points
Use density estimation to decide where to enforce
knowledge
p(x) is the empirical density of the past data
Knowledge: the number of metastasized lymph
nodes is at least the value predicted from the
past data, within a tolerance of 0.01:
p(x) ≥ 0.1 ⇒ f(x) ≥ f1(x) - 0.01
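How the enforcement region might be chosen by density estimation (the tumor-size values below are hypothetical, not the WPBC data, and gaussian_kde stands in for the unspecified density estimator):

```python
import numpy as np
from scipy.stats import gaussian_kde

# hypothetical past tumor sizes; the real setup uses the 20% "past" split
past = np.array([1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 4.0])
p = gaussian_kde(past)                 # empirical density p(x) of past data

grid = np.linspace(0.0, 6.0, 121)
region = grid[p(grid) >= 0.1]          # enforce f(x) >= f1(x) - 0.01 here
```

The threshold 0.1 restricts the knowledge to where past data is plentiful, so f1 is trusted only where it was well supported.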
Predicting Lymph Node
Metastasis: Results
Approximation          RMSE / LOO RMSE
f1(x) from past data   6.12
Without knowledge      5.92
With knowledge         5.04
Table shows root-mean-squared-error (RMSE) of past data
(20%) approximation f1(x) on present data (80%)
Leave-one-out (LOO) RMSE reported for approximations with and
without knowledge
Improvement due to knowledge: 14.8%
Conclusion
Added general nonlinear prior knowledge to
kernel approximation
Implemented as linear inequalities in a linear
programming problem
Knowledge incorporated transparently
Demonstrated effectiveness
Two synthetic examples
Real world problem from breast cancer prognosis
Future work
More general prior knowledge with inequalities
replaced by more general functions
Apply to classification problems
Questions
Websites linking to papers and talks:
http://www.cs.wisc.edu/~olvi/
http://www.cs.wisc.edu/~wildt/