Large Datasets Lead to Overly Complex Models


A Kernel Approach for Learning From Almost Orthogonal Patterns

CIS 525 Class Presentation. Professor: Slobodan Vucetic. Presenter: Yilian Qin

B. Scholkopf et al., Proc. 13th ECML, Aug 19-23, 2002, pp. 511-528.

     

Presentation Outline

Introduction
  Motivation
  A brief review of SVM for linearly separable patterns
  Kernel approach for SVM
  Empirical kernel map
Problem: almost orthogonal patterns in the feature space
  An example
  Situations leading to almost orthogonal patterns
Methods to reduce large diagonals of the Gram matrix
  Gram matrix transformation
  An approximate approach based on statistics
Experiments
  Artificial data (string classification, microarray data with noise, hidden variable problem)
  Real data (thrombin binding, lymphoma classification, protein family classification)
Conclusions
Comments

Introduction

Motivation

Support vector machine (SVM)

Powerful method for classification (or regression), with accuracy comparable to neural networks
Exploits kernel functions to separate patterns in a high-dimensional space
The information of the training data for the SVM is stored in the Gram matrix (kernel matrix)

The problem:

SVM does not perform well if the Gram matrix has large diagonal values

A Brief Review of SVM

For linearly separable patterns:

  y_i (w^T x_i + b) ≥ 1

  [Figure: two classes of points, y_i = -1 and y_i = +1, separated by a hyperplane; the margin 2/||w|| depends only on the closest points.]

To maximize the margin 2/||w||:

  Minimize:    ||w||^2
  Constraints: y_i (w^T x_i + b) ≥ 1

Kernel Approach for SVM (1/3)

For linearly non-separable patterns:

  Nonlinear mapping function φ(x): X → H, mapping the patterns to a new feature space H of higher dimension (for example, the XOR problem)

  SVM in the new feature space:
    Minimize:    ||w||^2
    Constraints: y_i [w^T φ(x_i) + b] ≥ 1

The kernel trick:

  Solving the above minimization problem requires 1) the explicit form of φ and 2) inner products in the high-dimensional space H
  Simplification by a wise selection of kernel functions with the property k(x_i, x_j) = φ(x_i) · φ(x_j)
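A small numerical sketch (my own illustration, not from the slides) of the kernel-trick identity for the degree-2 homogeneous polynomial kernel k(x, y) = (x · y)^2, whose explicit feature map in two dimensions is φ(x) = (x_1^2, √2 x_1 x_2, x_2^2); the XOR-style points are assumptions for the example.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the degree-2 homogeneous polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def k(x, y):
    """Kernel function: the same inner product computed without mapping to H."""
    return float(np.dot(x, y)) ** 2

# XOR-style points, not linearly separable in the input space
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])

for xi in X:
    for xj in X:
        # k(x_i, x_j) = phi(x_i) . phi(x_j) for every pair
        assert np.isclose(k(xi, xj), np.dot(phi(xi), phi(xj)))
print("kernel trick verified: k(x_i, x_j) = phi(x_i) . phi(x_j) for all pairs")
```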

Kernel Approach for SVM (2/3)

Transform the problem with the kernel method:

  Expand w in the new feature space: w = Σ_i a_i φ(x_i) = [Φ(x)] a, where [Φ(x)] = [φ(x_1), φ(x_2), ..., φ(x_m)] and a = [a_1, a_2, ..., a_m]^T

  Gram matrix: K = [K_ij], where K_ij = φ(x_i) · φ(x_j) = k(x_i, x_j)  (symmetric!)

  The (squared) objective function: ||w||^2 = a^T [Φ(x)]^T [Φ(x)] a = a^T K a
  (sufficient condition for existence of an optimal solution: K is positive definite)

  The constraints: y_i {w^T φ(x_i) + b} = y_i {a^T [Φ(x)]^T φ(x_i) + b} = y_i {a^T K_i + b} ≥ 1, where K_i is the i-th column of K

  Minimize:    a^T K a
  Constraints: y_i [a^T K_i + b] ≥ 1

Kernel Approach for SVM (3/3)

 To predict new data with a trained SVM

f(x_test) = sgn( w^T φ(x_test) + b )
          = sgn( a^T [φ(x_1), φ(x_2), ..., φ(x_m)]^T φ(x_test) + b )
          = sgn( a^T [k(x_1, x_test), k(x_2, x_test), ..., k(x_m, x_test)]^T + b )

where a and b are the optimal solution based on the training data, and m is the number of training data.

The explicit form of k(x_i, x_j) is required for the prediction of new data.
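A brief sketch of the decision function above (my illustration, not the paper's code): given trained coefficients a and b and the stored training patterns, a test point is classified from kernel values alone; the RBF kernel and its gamma are assumptions.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """Example kernel with an explicit form; gamma is an assumed parameter."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def predict(x_test, X_train, a, b, kernel=rbf_kernel):
    """f(x_test) = sgn( a^T [k(x_1, x_test), ..., k(x_m, x_test)]^T + b )."""
    k_vec = np.array([kernel(x_i, x_test) for x_i in X_train])
    return np.sign(a @ k_vec + b)

# Usage sketch: a and b would come from training; here they are placeholders.
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
a, b = np.array([-1.0, 1.0]), 0.0
print(predict(np.array([0.9, 1.1]), X_train, a, b))   # close to x_2 -> +1
```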

Empirical Kernel Mapping

Assumption: m (the number of instances) is a sufficiently high dimension for the new feature space, i.e. the patterns will be linearly separable in the m-dimensional space R^m

Empirical kernel map: φ_m(x_i) = [k(x_i, x_1), k(x_i, x_2), ..., k(x_i, x_m)]^T = K_i

The SVM in R^m:
  Minimize:    ||w||^2
  Constraints: y_i [w^T φ_m(x_i) + b] ≥ 1

The new Gram matrix K_m associated with φ_m(x): K_m = [K_m,ij], where K_m,ij = φ_m(x_i) · φ_m(x_j) = K_i · K_j = K_i^T K_j, i.e. K_m = K^T K = K K^T

Advantage of the empirical kernel map: K_m = K K^T = (U^T D U)(U^T D U)^T = U^T D^2 U is positive definite (K is symmetric, U is a unitary matrix, D is diagonal), which satisfies the sufficient condition of the above minimization problem

The Problem: Almost Orthogonal Patterns in the Feature Space Result in Poor Performance

An Example of Almost Orthogonal Patterns

The training dataset with almost orthogonal patterns:

  Six sparse patterns x_1, ..., x_6 in R^10 with labels Y = (+1, +1, +1, -1, -1, -1)^T; the three positive patterns share one small common entry (value 1), and every pattern has a single large entry (8 or 9) in a column of its own.

The Gram matrix with the linear kernel k(x_i, x_j) = x_i · x_j:

  K =
    | 82  1  1  0  0  0 |
    |  1 65  1  0  0  0 |
    |  1  1 82  0  0  0 |
    |  0  0  0 81  0  0 |
    |  0  0  0  0 64  0 |
    |  0  0  0  0  0 81 |

  → large diagonals

The solution with the standard SVM:

  w ≈ (0.04, 0, -0.11, 0.11, 0, 0.12, -0.12, 0.11, 0, -0.11)^T,  b ≈ -0.02

Observation: each large entry in w corresponds to a column of X that has only one large entry, so w becomes a lookup table and the SVM will not generalize well.

A better solution: w = (2, 0, 0, 0, 0, 0, 0, 0, 0, 0)^T, b = -1

Situations Leading to Almost Orthogonal Patterns

Sparsity of the patterns in the new feature space, e.g.

x = [0, 0, 0, 1, 0, 0, 1, 0]^T
y = [0, 1, 1, 0, 0, 0, 0, 0]^T

x · x ≈ y · y >> x · y  (large diagonals in the Gram matrix)

Some selections of kernel functions may result in sparsity in the new feature space:

  String kernel (Watkins 2000, among others)
  Polynomial kernel k(x_i, x_j) = (x_i · x_j)^d with large order d: if x_i · x_i > x_i · x_j for i ≠ j, then k(x_i, x_i) >> k(x_i, x_j) even for moderately large d, because the gap is amplified exponentially in d.

Methods to Reduce the Large Diagonals of Gram Matrices

Gram Matrix Transformation (1/2)

For a symmetric, positive definite Gram matrix K (or K_m): K = U^T D U, where U is a unitary matrix and D is a diagonal matrix.

Define f(K) = U^T f(D) U with f(D)_ii = f(D_ii), i.e. the function f operates on the eigenvalues λ_i of K:

  f(D) = diag( f(λ_1), f(λ_2), ..., f(λ_m) )

f(K) should preserve the positive definiteness of the Gram matrix.

A sample procedure for Gram matrix transformation:

  1. (Optional) Compute the positive definite matrix A = sqrt(K)
  2. Suppress the large diagonals of A and obtain a symmetric A', i.e. transform the eigenvalues of A: [λ_min, λ_max] → [f(λ_min), f(λ_max)]
  3. Compute the positive definite matrix K' = (A')^2

Gram Matrix Transformation (2/2)

Effect of the matrix transformation:

  The explicit form of the new kernel function k' is not available: the original kernel satisfies k(x_i, x_j) = φ(x_i) · φ(x_j), but the transformation K → f(K) only implicitly defines a new map φ'(x) with k'(x_i, x_j) = φ'(x_i) · φ'(x_j)
  k' is required when the trained SVM is used to predict the testing data

A solution: include all test data in K before the matrix transformation K → K' = f(K), i.e. the testing data has to be known at training time.

  f(x_i) = sgn( w^T φ'(x_i) + b ) = sgn( a^T [φ'(x_1), φ'(x_2), ..., φ'(x_{m+n})]^T φ'(x_i) + b ) = sgn( a'^T K'_i + b' )

  for i = 1, 2, ..., m+n, where m is the number of training data and n is the number of testing data, and a' and b' are obtained from the portion of K' corresponding to the training data.

  If x_i has been used in calculating K' = f(K), the prediction on x_i can simply use K'_i.
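A hedged end-to-end sketch of this transductive scheme (data, labels, and the particular suppression function are assumptions): the Gram matrix is built over training and test points together, transformed, and then an SVM with a precomputed kernel is trained on the training block and applied to the test rows.

```python
import numpy as np
from sklearn.svm import SVC

def suppress_large_diagonals(K, c):
    """Same illustrative transformation as before: K' = (sqrt(K) - c*I)^2."""
    lam, U = np.linalg.eigh(K)
    A = U @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ U.T
    A_prime = A - c * np.eye(len(K))
    return A_prime @ A_prime

rng = np.random.default_rng(2)
y = np.array([1, -1] * 20)                            # 40 labels (assumed)
X = (rng.random((40, 300)) < 0.02).astype(float)      # sparse noise features
X[:, 0] = (y == 1).astype(float)                      # one weakly informative feature
m, n = 30, 10                                         # training / testing split

K_all = X @ X.T                                       # Gram matrix over train AND test
K_prime = suppress_large_diagonals(K_all, c=0.5)

svm = SVC(kernel="precomputed", C=10.0).fit(K_prime[:m, :m], y[:m])
y_pred = svm.predict(K_prime[m:, :m])                 # test rows restricted to training columns
print("test accuracy on the toy data:", np.mean(y_pred == y[m:]))
```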

An Approximate Approach based on Statistics

The empirical kernel map φ_{m+n}(x), computed over the training and testing data together, should be used to calculate the Gram matrix.

Assuming the dataset size r is large:

  φ_r(x) · φ_r(x') = [k(x, x_1), k(x, x_2), ..., k(x, x_r)] · [k(x', x_1), k(x', x_2), ..., k(x', x_r)]
                   = Σ_{i=1}^{r} k(x, x_i) k(x', x_i)
                   ≈ r ∫_X k(x, x'') k(x', x'') dP(x'')

  so  (1/r) φ_r(x) · φ_r(x') ≈ ∫_X k(x, x'') k(x', x'') dP(x'')

  and therefore  (1/m) φ_m(x) · φ_m(x') ≈ (1/(m+n)) φ_{m+n}(x) · φ_{m+n}(x')

Therefore, the SVM can simply be trained with the empirical kernel map on the training set, φ_m(x), instead of φ_{m+n}(x).
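A quick numerical check of this approximation (my illustration; the Gaussian data, RBF kernel, and sample sizes are assumptions): the scaled inner products of the empirical kernel maps computed over m points and over m+n points are nearly equal.

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

pool = rng.normal(size=(2000, 5))                 # samples from one distribution P
x, x_prime = rng.normal(size=5), rng.normal(size=5)

def scaled_inner_product(sample):
    """(1/r) * phi_r(x) . phi_r(x'), the Monte Carlo estimate of the integral."""
    return np.mean(rbf(x, sample) * rbf(x_prime, sample))

m, n = 1000, 1000
print("(1/m)     phi_m(x)     . phi_m(x')     =", round(scaled_inner_product(pool[:m]), 4))
print("(1/(m+n)) phi_{m+n}(x) . phi_{m+n}(x') =", round(scaled_inner_product(pool[:m + n]), 4))
```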

Experiment Results

Artificial Data (1/3)

String classification

  String kernel function (Watkins 2000, among others)
  Sub-polynomial kernel: k(x, y) = [φ(x) · φ(y)]^p, with 0 < p < 1

Artificial Data (2/3)

Microarray data with noise (Alon et al., 1999)

  62 instances (22 positive, 40 negative), 2000 features in the original data
  10,000 noise features were added (each non-zero with probability 1%)
  Error rate for the SVM without noise addition: 0.18 ± 0.15

Artificial Data (3/3)

Hidden variable problem

  10 hidden variables (attributes), plus 10 additional attributes that are nonlinear functions of the 10 hidden variables
  The original kernel is a polynomial kernel of order 4

Real Data (1/3)

Thrombin binding problem

  1909 instances, 139,351 binary features
  0.68% of the entries are non-zero
  8-fold cross validation

Real Data (2/3)

Lymphoma classification (Alizadeh et al., 2000)

  96 samples, 4026 features
  10-fold cross validation
  Improved results compared with previous work (Weston, 2001)

Real Data (3/3)

Protein family classification (Murzin et al., 1995)

  Small positive set, large negative set
  Evaluation by the receiver operating characteristic based on the rate of false positives; 1 is the best score, 0 the worst

Conclusions

The problem of degraded SVM performance due to almost orthogonal patterns was identified and analyzed
The common situation in which sparse vectors lead to large diagonals was identified and discussed
A Gram matrix transformation method that suppresses the large diagonals was proposed to improve performance in such cases
Experimental results show improved accuracy on various artificial and real datasets when the large diagonals of the Gram matrices are suppressed

Comments

Strong points:

  The identification of the situations that lead to large diagonals in the Gram matrix, and the proposed Gram matrix transformation method for suppressing them
  The experiments are extensive

Weak points:

  The application of the Gram matrix transformation may be severely restricted in forecasting or other applications in which the testing data is not known at training time
  The proposed Gram matrix transformation method was not tested by experiments directly; instead, transformed kernel functions were used in the experiments
  Almost orthogonal patterns imply that multiple pattern vectors in the same direction rarely exist, so the necessary condition for the statistical treatment of the pattern distribution is not satisfied

End!