#### Transcript Nonlinear Knowledge in Kernel Machines (PowerPoint)

```Nonlinear Knowledge in Kernel
Machines
Data Mining and Mathematical Programming Workshop
Centre de Recherches Mathématiques
Université de Montréal, Québec
October 10-13, 2006
Olvi Mangasarian
UW Madison & UCSD La Jolla
Edward Wild
Objectives
Primary objective: Incorporate prior knowledge
over completely arbitrary sets into:
 function approximation, and
 classification
 without transforming (kernelizing) the knowledge
Secondary objective: Achieve transparency of the
prior knowledge for practical applications
Graphical Example of
Knowledge Incorporation

h1(x) · 0




 

+
++ +
+
g(x) · 0
+
+
+
Similar approach for approximation
K(x0, B0)u = 
h2(x) · 0
Outline
 Kernels in classification and function approximation
 Incorporation of prior knowledge
 Previous approaches: require transformation of knowledge
 New approach does not require any transformation of knowledge
• Knowledge given over completely arbitrary sets
 Fundamental tool used
 Theorem of the alternative for convex functions
 Experimental results
 Four synthetic examples and two examples related to breast cancer prognosis
 Classifiers and approximations with prior knowledge more accurate than those
without prior knowledge
Theorem of the Alternative for Convex Functions:
Generalization of the Farkas Theorem
 Farkas: For A2 Rk£ n, b2 Rn, exactly one of the following must hold:
I. Ax  0, b0x > 0 has solution x 2 Rn, ,or
II. A0 v =b, v  0, has solution v 2 Rk .
 Let h:½ Rn! Rk, : ! R,  convex set, h and  convex on , and
h(x)<0 for some x2 . Then, exactly one of the following must hold:
I. h(x)  0, (x) < 0 has a solution x  , or
II. v  Rk, v  0: v0h(x)+(x)  0 x  .
 To get II of Farkas:
v  0, v0Ax- b0x  0 8 x 2 Rn , v  0, A0 v -b=0.
Classification and Function
Approximation
 Given a set of m points in n-dimensional real space Rn with
corresponding labels
 Labels in {+1, -1} for classification problems
 Labels in R for approximation problems
 Points are represented by rows of a matrix A 2 Rm£n
 Corresponding labels or function values are given by a
vector y
 Classification: y 2 {+1, -1}m
 Approximation: y  Rm
 Find a function f(Ai) = yi based on the given data points Ai
 f : Rn ! {+1, -1} for classification
 f : Rn! R for approximation
Classification and Function
Approximation
Problem: utilizing only given data may
result in a poor classifier or approximation
Points may be noisy
Sampling may be costly
Solution: use prior knowledge to improve
the classifier or approximation
Standard approximation and classification: fit
function at given data points without knowledge
Constrained approximation: satisfy inequalities at
given points
Previous approaches (2001 FMS, 2003 FMS, and
2004 MSW ): satisfy linear inequalities over
polyhedral regions
Proposed new approach: satisfy nonlinear
inequalities over arbitrary regions without
kernelizing (transforming) knowledge
Kernel Machines
Approximate f by a nonlinear kernel function K
using parameters u 2 Rk and  in R
A kernel function is a nonlinear generalization of
the scalar product
f(x)  K(x0, B0)u - , x 2 Rn, K:Rn £ Rn£k ! Rk
2
 Gaussian K(x0, B0)i=-||x-Bi|| , i=1,…..,k
B 2 Rk£n is a basis matrix
 Usually, B = A2Rm£n = Input data matrix
 In Reduced Support Vector Machines, B is a small
subset of the rows of A
 B may be any matrix with n columns
Kernel Machines
Introduce slack variable s to measure error in
classification or approximation
Error s in kernel approximation of given data:
 -s  K(A, B0)u - e - y  s, e is a vector of ones in Rm
 Function approximation: f(x)  K(x0, B0)u - 
Error s in kernel classification of given data
 K(A+, B0)u - e + s+ ¸ e, s+ ¸ 0
 K(A- , B0)u - e - s-  - e, s- ¸ 0
More succinctly, let: D = diag(y), the m£m matrix
with diagonal y of § 1’s, then:
 D(K(A, B0)u - e) + s ¸ e, s ¸ 0
 Classifier: f(x)  sign(K(x0, B0)u - )
Kernel Machines
Kernel Machines
in Approximation
in Approximation
OR Classification
OR complexity
(||u||1) and data fitting (||s||1)
At solution
 e0a = ||u||1
 e0s = ||s||1
Incorporating Nonlinear Prior Knowledge:
Previous Approaches
Cx  d  w0x-   h0x + 
Need to “kernelize” knowledge from input space to
transformed (feature) space of kernel
Requires change of variable x = A0t, w = A0u
CA0t  d  u0AA0t -   h0A0t + 
K(C, A0)t  d  u0K(A, A0)t -   h0A0t + 
Use a linear theorem of the alternative in the t space
Lost connection with original knowledge
Achieves good numerical results, but is not readily
interpretable in the original space
Nonlinear Prior Knowledge in Function Approximation:
New Approach
implication
g(x)  0  K(x0, B0)u -   h(x), 8x 2  ½ Rn
g, h are arbitrary functions on 
k0 , h:! R 0
g:!
R
9v ¸ 0: v g(x) + K(x , B0)u -  - h(x) ¸ 0 8x 2 
Linear in v, u, 
Theorem of the Alternative for
Convex Functions
Assume that g(x), K(x0, B0)u - , -h(x) are convex functions
of x, that  is convex and 9 x 2  : g(x)<0. Then either:
 I. g(x)  0, K(x0, B0)u -  - h(x) < 0 has a solution x  , or
 II. v  Rk, v  0: K(x0, B0)u -  - h(x) + v0g(x)  0 x  
 But never both.
If we can find v  0: K(x0, B0)u -  - h(x) + v0g(x)  0
x  , then by above theorem
 g(x)  0, K(x0, B0)u -  - h(x) < 0 has no solution x   or
equivalently:
 g(x)  0  K(x0, B0)u -   h(x), 8x 2 
Proof
 I  II (Not needed for present application)
 Follows from a fundamental theorem of Fan-GlicksburgHoffman for convex functions [1957] and the existence
of an x 2  such that g(x) <0.
 I  II (Requires no assumptions on g, h, K, or
u whatsoever)
 Suppose not. That is, there exists x 2 , v 2 Rk,, v¸ 0:
 g(x)  0, K(x0, B0)u -  - h(x) < 0, (I)
 v  0, v0g(x) +K(x0, B0)u -  - h(x)  0 , 8 x 2  (II)
 Then we have the contradiction:
 0 > v0g(x) +K(x0, B0)u -  - h(x)  0
·0
<0
Incorporating Prior Knowledge
Linear semi-infinite program: infinite number of
constraints
Discretize: finite linear program
 g(xi) · 0 ) K(xi0, B0)u -  ¸ h(xi), i = 1, …, k
 Slacks allow knowledge to be satisfied inexactly
 Add term to objective function to drive slacks to zero
Numerical Experience:
Approximation
Evaluate on three datasets
 Two synthetic datasets
 Wisconsin Prognostic Breast Cancer Database (WPBC)
 194 patients £ 2 histogical features
• tumor size & number of metastasized lymph nodes
Compare approximation with prior knowledge to
approximation without prior knowledge
 Prior knowledge leads to an improved accuracy
 General prior knowledge used cannot be handled
exactly by previous work (MSW 2004)
 No kernelization needed on knowledge set
Two-Dimensional Hyperboloid
f(x1, x2) = x1x2
Two-Dimensional Hyperboloid
Given exact values
only at 11 points along
line x1 = x2
At x1 2 {-5, …, 5}
x2
x1
Two-Dimensional Hyperboloid
Approximation without Prior Knowledge
Two-Dimensional Hyperboloid
x1x2  1  f(x1, x2)  x1x2
Nonlinear term x1x2 can not be handled
exactly by any previous approaches
Discretization used only 11 points along the
line x1 = -x2, x1  {-5, -4, …, 4, 5}
Two-Dimensional Hyperboloid
Approximation with Prior Knowledge
Two-Dimensional Tower
Function
Given 400 points on the grid [-4, 4]  [-4, 4]
Values are min{g(x), 2}, where g(x) is the
exact tower function
Two-Dimensional Tower Function
Approximation without Prior Knowledge
Two-Dimensional Tower Function
Prior Knowledge
(x1, x2)  [-4, 4]  [-4, 4]  f(x) = g(x)
Prior knowledge is the exact function value.
Enforced at 2500 points on the grid
[-4, 4]  [-4, 4] through above implication
Principal objective of prior knowledge here
is to overcome poor given data
Two-Dimensional Tower Function
Approximation with Prior Knowledge
Breast Cancer Application:
Predicting Lymph Node Metastasis as a Function of
Tumor Size
Number of metastasized lymph nodes is an important
prognostic indicator for breast cancer recurrence
 Determined by surgery in addition to the removal of the
tumor
Optional procedure especially if tumor size is small
Wisconsin Prognostic Breast Cancer (WPBC) data
Lymph node metastasis and tumor size for 194 patients
Task: predict the number of metastasized lymph nodes
given tumor size alone
Predicting Lymph Node
Metastasis
Split data into two portions
Past data: 20% used to find prior knowledge
Present data: 80% used to evaluate performance
Past data simulates prior knowledge
obtained from an expert
Prior Knowledge for Lymph Node
Metastasis as a Function of Tumor Size
Generate prior knowledge by fitting past data:
 h(x) := K(x0, B0)u - 
 B is the matrix of the past data points
Use density estimation to decide where to enforce
knowledge
 p(x) is the empirical density of the past data
Prior knowledge utilized on approximating function
f(x):
 Number of metastasized lymph nodes is greater than the
predicted value on past data, with tolerance of 0.01
 p(x)  0.1  f(x) ¸ h(x) - 0.01
Predicting Lymph Node
Metastasis: Results
Approximation
Prior knowledge h(x) based
on past data 20%
f(x) without knowledge based
on present data 80%
f(x) with knowledge
based on present data 80%
Error
6.12 RMSE
5.92 LOO
5.04 LOO
RMSE: root-mean-squared-error
LOO: leave-one-out error
Improvement due to knowledge: 14.9%
Incorporating Prior Knowledge in Classification
(Very Similar)
Implication for positive region
g(x)  0  K(x0, B0)u -   , 8x 2  ½ Rn
K(x0, B0)u -  -  + v0g(x) ¸ 0, v ¸ 0, 8x 2 
Similar implication for negative regions
Add discretized constraints to linear program
Incorporating Prior Knowledege:
Classification
Numerical Experience:
Classification
Evaluate on three datasets
 Two synthetic datasets
 Wisconsin Prognostic Breast Cancer Database
Compare classifier with prior knowledge to one
without prior knowledge
 Prior knowledge leads to an improved accuracy
 General prior knowledge used cannot be handled
exactly by previous work (FMS 2001, FMS 2003)
 No kernelization needed on knowledge set
Checkerboard Dataset
Black & White Points in R2
Checkerboard
Prior
Knowledge
Classifier
for
16-Point
Without
Checkerboard Classifier With
Knowledge
Checkerboard
Using
16
Classifier
Center
Points
Knowledge Using 16 Center Points
100
100
Spiral Dataset 194 Points in R2
Spiral Classifier Without Knowledge
Prior
Spiral
Knowledge
Classifier Function
With Knowledge
for Spiral
Based on 100 Labeled Points
Note the many
incorrectly
classified +’s
White ) +
Gray ) •
Labels
Priorgiven
knowledge
only
atimposed
100 correctly
at 291points
classified
points in each
circled
region
points
No
misclassified
Predicting Breast Cancer Recurrence
Within 24 Months
 Wisconsin Prognostic Breast Cancer (WPBC) dataset
 155 patients monitored for recurrence within 24 months
 30 cytological features
 2 histological features: number of metastasized lymph nodes and tumor size
 Predict whether or not a patient remains cancer free after 24 months
 82% of patients remain disease free
 86% accuracy (Bennett, 1992) best previously attained
 Prior knowledge allows us to incorporate additional information to
improve accuracy
Number of Metastasized Lymph Nodes
GrayKnowledge
regions indicate
Generating WPBC Prior
areas where g(x) · 0
Simulate oncological
+ Recur
24 months
surgeon’s
recurrence
Knowledge imposed at
dataset points inside given
regions
• Cancer free within 24 months
Tumor Size in Centimeters
+ Recur
• Cancer free
WPBC Results
Classifier
Misclassification
Rate
Without Knowledge
18.1%
.
With Knowledge
9.0%
.
49.7 % improvement due to knowledge
35.7 % improvement over best previous predictor
Conclusion
General nonlinear prior knowledge incorporated
into kernel classification and approximation
 Implemented as linear inequalities in a linear
programming problem
 Knowledge appears transparently
Demonstrated effectiveness
 Four synthetic examples
 Two real world problems from breast cancer prognosis
Future work
 Prior knowledge with more general implications
 User-friendly interface for knowledge specification
 Generate prior knowledge for real-world datasets
Website Links to Talk & Papers
http://www.cs.wisc.edu/~olvi
http://www.cs.wisc.edu/~wildt
```