Nonlinear Knowledge in Kernel Approximation

Nonlinear Knowledge in Kernel Approximation
Olvi Mangasarian
UW Madison & UCSD La Jolla
Edward Wild
UW Madison
Objectives
Primary objective: Incorporate prior knowledge
over completely arbitrary sets into function
approximation without transforming (kernelizing)
the knowledge
Secondary objective: Achieve transparency of the
prior knowledge for practical applications
Outline
Use kernels for function approximation
Incorporate prior knowledge
 Previous approaches: require transformation of
knowledge
 New approach does not require any transformation of
knowledge
• Knowledge given over completely arbitrary sets
Experimental results
 Two synthetic examples and one real world example
related to breast cancer prognosis
 Approximations with prior knowledge more accurate
than approximations without prior knowledge
Function Approximation
Given a set of m points in n-dimensional real space R^n and corresponding function values in R
 Points are represented by rows of a matrix A ∈ R^{m×n}
 Exact or approximate function values for each point are given by a corresponding vector y ∈ R^m
Find a function f: R^n → R based on the given data
 f(A_i') ≈ y_i
Function Approximation
Problem: utilizing only given data may
result in a poor approximation
Points may be noisy
Sampling may be costly
Solution: use prior knowledge to improve
the approximation
Adding Prior Knowledge
Standard approach: fit function at given
data points without knowledge
Constrained approach: satisfy inequalities
at given points
 2004 MSW Paper: satisfy linear
inequalities over polyhedral regions
Proposed new approach: satisfy nonlinear
inequalities over arbitrary regions
Kernel Approximation
Approximate f by a nonlinear kernel function K
 f(x) ≈ K(x', A')α + b
 K(x', A')α = Σ_{i=1}^m α_i exp(−μ‖x − A_i'‖²)
Error in kernel approximation of given data
 −s ≤ K(A, A')α + be − y ≤ s
 e is a vector of ones in R^m
Kernel Approximation
Trade off between solution complexity (‖α‖₁) and data fitting (‖s‖₁)
Convert to a linear program
At solution
 e'a = ‖α‖₁
 e's = ‖s‖₁
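The 1-norm fit above can be written out as an explicit linear program. The following is a minimal sketch of that construction, not the authors' code; the variable layout, the kernel parameter mu, and the trade-off weight nu are our assumptions.

```python
# Sketch of 1-norm kernel approximation as a linear program.
# Decision vector z = [alpha (m), b (1), s (m), a (m)]; at a solution
# e'a = ||alpha||_1 and e's = ||s||_1, and nu weights data fitting.
import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(X, A, mu=1.0):
    # K(X, A')_{ij} = exp(-mu * ||X_i - A_j||^2)
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

def fit_kernel_lp(A, y, mu=1.0, nu=10.0):
    m = A.shape[0]
    K = gaussian_kernel(A, A, mu)
    e = np.ones(m)
    c = np.concatenate([np.zeros(m), [0.0], nu * e, e])  # min nu*e's + e'a
    I, Z = np.eye(m), np.zeros((m, m))
    # -s <= K alpha + b e - y <= s, split into two "<=" blocks
    G1 = np.hstack([K, e[:, None], -I, Z])
    G2 = np.hstack([-K, -e[:, None], -I, Z])
    # -a <= alpha <= a, so e'a bounds ||alpha||_1
    G3 = np.hstack([I, np.zeros((m, 1)), Z, -I])
    G4 = np.hstack([-I, np.zeros((m, 1)), Z, -I])
    G = np.vstack([G1, G2, G3, G4])
    h = np.concatenate([y, -y, np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)
    res = linprog(c, A_ub=G, b_ub=h, bounds=bounds, method="highs")
    alpha, b = res.x[:m], res.x[m]
    return alpha, b, K

# toy usage: approximate y = x^2 from 9 samples on [-2, 2]
A = np.linspace(-2.0, 2.0, 9)[:, None]
y = A[:, 0] ** 2
alpha, b, K = fit_kernel_lp(A, y)
fit = K @ alpha + b
```

With a large nu the LP drives the fitting slacks s toward zero; shrinking nu trades fit quality for a smaller ‖α‖₁.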
Incorporating Nonlinear Prior Knowledge (MSW 2004)
Bx ≤ d ⇒ α'Ax + b ≥ h'x + β
Need to “kernelize” knowledge from input space to
feature space of kernel
Requires change of variable x = A't
 BA't ≤ d ⇒ α'AA't + b ≥ h'A't + β
 K(B, A')t ≤ d ⇒ α'K(A, A')t + b ≥ h'A't + β
Motzkin’s theorem of the alternative gives an
equivalent linear system of inequalities which is
added to a linear program
Achieves good numerical results, but kernelization
is not readily interpretable in the original space
Incorporating Nonlinear Prior Knowledge: New Approach
Start with an arbitrary nonlinear knowledge implication
 g(x) ≤ 0 ⇒ K(x', A')α + b ≥ h(x), ∀x ∈ Γ ⊂ R^n
g, h are arbitrary functions
 g: Γ → R^k, h: Γ → R
Problem: need to add this knowledge to the optimization problem
Logically equivalent system:
 g(x) ≤ 0, K(x', A')α + b − h(x) < 0 has no solution x ∈ Γ
Prior Knowledge as a System of Linear Inequalities
Use a theorem of the alternative for convex functions:
Assume that g(x), K(x', A')α + b, −h(x) are convex functions of x, that Γ is convex, and that ∃x ∈ Γ: g(x) < 0. Then either:
 I. g(x) ≤ 0, K(x', A')α + b − h(x) < 0 has a solution x ∈ Γ, or
 II. ∃v ∈ R^k, v ≥ 0: K(x', A')α + b − h(x) + v'g(x) ≥ 0, ∀x ∈ Γ
 But never both.
If we can find v ≥ 0: K(x', A')α + b − h(x) + v'g(x) ≥ 0, ∀x ∈ Γ, then by the above theorem
 g(x) ≤ 0, K(x', A')α + b − h(x) < 0 has no solution x ∈ Γ, or equivalently:
 g(x) ≤ 0 ⇒ K(x', A')α + b ≥ h(x), ∀x ∈ Γ
Proof
 ¬I ⇒ II
  Follows from OLM 1969, Corollary 4.2.2 and the existence of an x ∈ Γ such that g(x) < 0.
 II ⇒ ¬I
  Suppose not. That is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0, such that:
   g(x) ≤ 0, K(x', A')α + b − h(x) < 0 (I)
   v ≥ 0, v'g(x) + K(x', A')α + b − h(x) ≥ 0, ∀x ∈ Γ (II)
  Then we have the contradiction:
   0 > v'g(x) + K(x', A')α + b − h(x) ≥ 0
  This direction requires no assumptions on g, h, K, or Γ whatsoever
Example
g(x) = 1250 − x³, f(x) = x⁴, h(x) = x² + 5000
g(x) ≤ 0 ⇒ f(x) ≥ h(x)
¬I: g(x) ≤ 0 and f(x) < h(x), i.e. x³ ≥ 1250 and x⁴ < x² + 5000, have no common solution
II: ∃v ≥ 0: v'g(x) + f(x) − h(x) ≥ 0, ∀x
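The certificate in II for this example can be checked numerically. The multiplier v = 5 below is one value we found by inspection; the slides do not state which v is used.

```python
# Sanity check of the example: g(x) = 1250 - x^3, f(x) = x^4,
# h(x) = x^2 + 5000. A multiplier v >= 0 satisfying II certifies,
# via the theorem of the alternative, that system I has no solution.
import numpy as np

x = np.linspace(-20.0, 20.0, 40001)
g = 1250.0 - x ** 3
f = x ** 4
h = x ** 2 + 5000.0

v = 5.0                       # one workable multiplier (our choice)
cert = v * g + f - h          # v*g(x) + f(x) - h(x)
assert cert.min() >= 0.0      # II holds on the grid

# consistency: nowhere does g(x) <= 0 hold together with f(x) < h(x)
assert not ((g <= 0) & (f < h)).any()
```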
Incorporating Prior Knowledge
Linear semi-infinite program: infinite number of
constraints
Discretize: finite linear program
 Slacks allow knowledge to be satisfied inexactly
 Add term to objective function to drive slacks to zero
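The discretization step can be sketched concretely. This is our reconstruction, not the authors' implementation: at fixed discretization points x_j, the certificate K(x_j', A')α + b + v'g(x_j) − h(x_j) + z_j ≥ 0 is linear in (α, b, v, z), so it contributes ordinary LP rows; the helper name knowledge_rows and the toy g, h are ours.

```python
# Build LP rows for the discretized knowledge constraints, one row per
# discretization point x_j, with nonnegative slacks z_j allowing the
# knowledge to hold inexactly (a penalty term drives z toward zero).
import numpy as np

def knowledge_rows(A, Xd, g, h, mu=1.0):
    """Rows in "<=" form over variables [alpha (m), b, v (k), z (p)]."""
    m, p = A.shape[0], Xd.shape[0]
    d2 = ((Xd[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-mu * d2)                    # p x m block K(x_j', A')
    G = np.array([g(x) for x in Xd])        # p x k values g(x_j)
    hvals = np.array([h(x) for x in Xd])    # p values h(x_j)
    # K alpha + b + g(x_j)'v - h(x_j) + z_j >= 0  becomes
    # -K alpha - b - g(x_j)'v - z_j <= -h(x_j)
    rows = np.hstack([-K, -np.ones((p, 1)), -G, -np.eye(p)])
    rhs = -hvals
    return rows, rhs

# toy usage with hypothetical g and h over a 1-d discretization grid
A = np.linspace(-2.0, 2.0, 5)[:, None]      # given data points
Xd = np.linspace(-1.0, 1.0, 7)[:, None]     # discretization of the region
g = lambda x: np.array([x[0] ** 2 - 1.0])   # g(x) <= 0 describes [-1, 1]
h = lambda x: 0.5                           # hypothetical bound f(x) >= 0.5
rows, rhs = knowledge_rows(A, Xd, g, h)
```

These rows are appended to the inequality block of the same linear program, with v ≥ 0, z ≥ 0, and a term σ·e'z added to the objective to drive the slacks to zero.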
Numerical Experience
Evaluate on three datasets
 Two synthetic datasets
 Wisconsin Prognostic Breast Cancer Database
Compare approximation with prior knowledge to
one without prior knowledge
 Prior knowledge leads to an improved approximation
 Prior knowledge used cannot be handled exactly by
previous work
 No kernelization needed on knowledge set
Two-Dimensional Hyperboloid
f(x1, x2) = x1x2
Two-Dimensional Hyperboloid
Given exact values only at 11 points along the line x1 = x2
 At x1 ∈ {−5, …, 5}
[Figure: the given data points in the (x1, x2) plane]
Two-Dimensional Hyperboloid
Approximation without Prior Knowledge
Two-Dimensional Hyperboloid
Add prior knowledge:
 x1x2 ≤ 1 ⇒ f(x1, x2) ≤ x1x2
The nonlinear term x1x2 cannot be handled exactly by any previous approach
Discretization used only 11 points along the line x1 = −x2, x1 ∈ {−5, −4, …, 4, 5}
Two-Dimensional Hyperboloid
Approximation with Prior Knowledge
Two-Dimensional Tower Function
Two-Dimensional Tower Function Data
Given 400 points on the grid [−4, 4] × [−4, 4]
Values are min{g(x), 2}, where g(x) is the exact tower function
Two-Dimensional Tower Function
Approximation without Prior Knowledge
Two-Dimensional Tower Function
Prior Knowledge
Add prior knowledge:
 (x1, x2) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x)
Prior knowledge is the exact function value
Enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication
Principal objective of prior knowledge is to
overcome poor given data
Two-Dimensional Tower Function
Approximation with Prior Knowledge
Predicting Lymph Node Metastasis
Number of metastasized lymph nodes is an
important prognostic indicator for breast cancer
recurrence
Determined by surgery in addition to the removal
of the tumor
Wisconsin Prognostic Breast Cancer (WPBC) data
 Lymph node metastasis for 194 patients
 30 cytological features from a fine-needle aspirate
 Tumor size, obtained during surgery
Task: predict the number of metastasized lymph
nodes given tumor size alone
Predicting Lymph Node Metastasis
Split data into two portions
Past data: 20% used to find prior knowledge
Present data: 80% used to evaluate performance
Simulates acquiring prior knowledge from
an expert’s experience
Prior Knowledge for Lymph Node Metastasis
Use kernel approximation without knowledge on
the past data
 f1(x) = K(x', A1')α1 + b1
 A1 is the matrix of the past data points
Use density estimation to decide where to enforce
knowledge
 p(x) is the empirical density of the past data
Number of metastasized lymph nodes is greater
than the predicted value on the past data, with
tolerance of 0.01
 p(x) ≥ 0.1 ⇒ f(x) ≥ f1(x) − 0.01
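One way to realize the density gate p(x) ≥ 0.1 is a kernel density estimate. The slides only say "density estimation", so the choice of scipy's gaussian_kde and the synthetic stand-in data below are assumptions.

```python
# Sketch: estimate the empirical density p(x) of the past data and keep
# only the candidate points where p(x) >= 0.1 as knowledge points.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
past = rng.normal(3.0, 1.0, size=40)   # synthetic stand-in for past tumor sizes

kde = gaussian_kde(past)               # empirical density estimate p(x)
grid = np.linspace(0.0, 6.0, 121)      # candidate knowledge points
p = kde(grid)

# enforce f(x) >= f1(x) - 0.01 only where the past data is dense
enforce_at = grid[p >= 0.1]
```

Gating the knowledge this way keeps the implication from being imposed where the expert's past data says nothing.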
Predicting Lymph Node Metastasis: Results
Approximation           RMSE / LOO RMSE
f1(x) from past data    6.12
Without knowledge       5.92
With knowledge          5.04

Table shows the root-mean-squared error (RMSE) of the past-data (20%) approximation f1(x) on the present data (80%)
 Leave-one-out (LOO) RMSE is reported for the approximations with and without knowledge
Improvement due to knowledge: 14.8%
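The LOO figures in the table come from refitting with one point held out at a time. A generic sketch of that loop is below; the nearest-neighbor predictor is a hypothetical stand-in, not the kernel approximation itself.

```python
# Leave-one-out RMSE: refit on all-but-one point, predict the held-out one.
import numpy as np

def loo_rmse(fit_predict, X, y):
    """fit_predict(X_train, y_train, x_test) -> predicted scalar."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        errs.append(fit_predict(X[mask], y[mask], X[i]) - y[i])
    return float(np.sqrt(np.mean(np.square(errs))))

# toy usage with a hypothetical 1-nearest-neighbor predictor
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 2.0, 3.0])
nn = lambda Xt, yt, x: yt[np.argmin(((Xt - x) ** 2).sum(axis=1))]
print(loo_rmse(nn, X, y))  # each held-out prediction is off by 1.0 -> 1.0
```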
Conclusion
Added general nonlinear prior knowledge to
kernel approximation
 Implemented as linear inequalities in a linear
programming problem
 Knowledge incorporated transparently
Demonstrated effectiveness
 Two synthetic examples
 Real world problem from breast cancer prognosis
Future work
 More general prior knowledge with inequalities
replaced by more general functions
 Apply to classification problems
Questions
Websites linking to papers and talks:
http://www.cs.wisc.edu/~olvi/
http://www.cs.wisc.edu/~wildt/