Transcript Document

Nonparametric Weighted Feature Extraction
(NWFE)
and Its Kernel-based Version (KNWFE)
Bor-Chen Kuo
Graduate School of Educational Measurement and Statistics,
National Taichung University, Taiwan, R.O.C.
[email protected]
Cheng-Hsuan Li
Institute of Electrical Control Engineering, National Chiao Tung University,
Hsinchu, Taiwan, R.O.C.
Graduate School of Educational Measurement and Statistics,
National Taichung University, Taiwan, R.O.C.
[email protected]
Outline
- Hyperspectral image data and some applications
- The influence of increasing dimensionality
- The Hughes phenomenon
- Feature selection and feature extraction
- Nonparametric Weighted Feature Extraction (NWFE)
- Kernel method
- Kernel Nonparametric Weighted Feature Extraction (KNWFE)
- The classification results of the Washington DC image
- Conclusions
Hyperspectral Image Data Representation
(Figure: a sample pixel represented in image space, spectral space, and feature space.)
Application I
Data source: GIS Research Center, Feng Chia University
Application II
Application III
Application IV


- Applications in urban areas include understanding large regions, detecting areas of change, and simple population estimation (trialed in the Paris area).
- The Food Agency's twice-yearly surveys of rice and other grain crops, and large-scale environmental disasters.
The Power of Increasing Dimensionality
(Figure: marginal distributions of features x1, x2, and x3 together with their pairwise scatter plots; the slide illustrates how class separability can increase as features are combined.)
The Hughes Phenomenon (1)
The Hughes Phenomenon (2)
(Figure: mean recognition accuracy versus measurement complexity n (total discrete values), with one curve per number of training samples m, from m = 2 up to m = 1000 and m = ∞; for finite m the accuracy first rises and then falls as n grows.)
A System for Hyperspectral Data Classification
(Flow chart of a hyperspectral data classification system. Blocks include: hyperspectral data collection; data adjustment (calibration, adjustment for the atmosphere, the solar curve, goniometric effects, etc.); labeling of training samples, either directly from observations from the ground or indirectly from pre-gathered spectra and clustering of observations of the ground; determination of quantitative class descriptions; class-conditional feature extraction and feature selection; and the classifier, which produces the probability map and the results map.)

- Feature selection: select l out of the p measurements (x1, ..., xp → f1, f2, ...).
- Feature extraction: map the p measurements to l new measurements (x1, ..., xp → f1, f2, ...).
Difference Between Feature Selection
and Feature Extraction
Feature Extraction vs. Feature Selection
- Selection. Advantage: cut in measurements, easy interpretation, cheap. Disadvantage: criterion sub-optimal.
- Extraction. Advantage: can be nonlinear. Disadvantage: expensive, often approximative, needs all measurements.

(Figure: histograms of the data after feature selection and after feature extraction.)
Feature Extraction and Classification Process
(Flow chart: from the training data, compute the scatter matrices Sb and Sw, regularize the within-class scatter matrix Sw, and perform an eigenvalue decomposition to obtain the feature extraction; the training and testing data are then transformed and passed to the classifier, which produces the classification result.)
Principal Component Analysis

Principal component analysis (PCA, 1901) finds directions in the data
- which retain as much variation as possible,
- which make the projected data uncorrelated,
- which minimise the squared reconstruction error.

(Figure: a two-dimensional point cloud with its principal directions; PCA maps data from R^k to R^l.)
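As a concrete illustration of the three properties above, here is a minimal NumPy sketch of PCA via the eigen-decomposition of the sample covariance matrix; the function and parameter names (pca, n_components) are ours, not from the slides.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X (n_samples x n_features) onto the top principal components."""
    X_centered = X - X.mean(axis=0)                 # remove the mean
    cov = np.cov(X_centered, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]               # sort directions by decreasing variance
    components = eigvecs[:, order[:n_components]]
    return X_centered @ components                  # projected, uncorrelated data

# toy usage: 100 samples in 5 dimensions reduced to 2
Y = pca(np.random.randn(100, 5), 2)
```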
Classification Using PCA
(Figure: two scatter plots of a two-class data set illustrating classification using PCA-extracted features.)
What is the measure of separability?




- The purpose of feature extraction is to mitigate the effect of the Hughes phenomenon.
- The method tries to find a transformation matrix A such that the class separability of the transformed data $Y = A^T X$ is maximized in a lower-dimensional space.
- What is the measure of separability? Usually the trace of $S_w^{-1} S_b$ is used as the separability measure.
Linear Discriminant Analysis Feature
Extraction (LDA or DAFE)
The feature transformation matrix of LDA is composed of the eigenvectors of $(S_w^{DA})^{-1} S_b^{DA}$, where

$$S_b^{DA} = \sum_{i=1}^{L} P_i (m_i - m_0)(m_i - m_0)^T = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j (m_i - m_j)(m_i - m_j)^T$$

$$S_w^{DA} = \sum_{i=1}^{L} P_i E\{(X - m_i)(X - m_i)^T \mid \omega_i\} = \sum_{i=1}^{L} P_i \Sigma_i$$

$P_i$ is the prior probability of class $i$, $m_i$ and $\Sigma_i$ are the mean and covariance of class $i$, and $L$ is the number of classes.
LDA (DAFE)
- $S_b^{DA}$ measures the between-class distance.
- $S_w^{DA}$ measures the within-class distance.
- The weights of the between- and within-class distances are the same.
- Disadvantages: 1. only useful for normally distributed data; 2. only L-1 features can be extracted.

(Figure: classes i and j with their class means M_i and M_j.)
Nonparametric Weighted Feature Extraction (NWFE)

(Figure, built up over several slides: training samples of class i and class j with their class means M_i and M_j, and a highlighted sample x_l^(i) of class i.)
Nonparametric Weighted Feature Extraction (NWFE)

$$w_{ls}^{(i,j)} = \frac{\mathrm{dist}(x_l^{(i)}, x_s^{(j)})^{-1}}{\sum_{q=1}^{n_j} \mathrm{dist}(x_l^{(i)}, x_q^{(j)})^{-1}}$$

(Figure: samples of class j that are close to x_l^(i) receive a large weight, while distant samples receive a small weight.)
Nonparametric Weighted Feature Extraction (NWFE)

$$M_j(x_l^{(i)}) = \sum_{s=1}^{n_j} w_{ls}^{(i,j)}\, x_s^{(j)}$$

(Figure: the weighted local mean M_j(x_l^(i)) of sample x_l^(i) with respect to class j, shown together with the class means M_i and M_j.)
Nonparametric Weighted Feature Extraction (NWFE)

(Figure: the local means M_i(x_l^(i)) and M_j(x_l^(i)) of the same sample x_l^(i), computed with respect to its own class i and to the other class j.)
Nonparametric Weighted Feature Extraction (NWFE)

(Figure: a second sample x_t^(i) of class i with its local means M_i(x_t^(i)) and M_j(x_t^(i)).)
Nonparametric Weighted Feature Extraction (NWFE)

$$\lambda_l^{(i,j)} = \frac{\mathrm{dist}(x_l^{(i)}, M_j(x_l^{(i)}))^{-1}}{\sum_{k=1}^{n_i} \mathrm{dist}(x_k^{(i)}, M_j(x_k^{(i)}))^{-1}}$$

(Figure: the difference vector x_l^(i) - M_j(x_l^(i)) of a sample close to its local mean in class j receives a large weight, while the vector x_t^(i) - M_j(x_t^(i)) of a sample far from it receives a small weight.)
Nonparametric Weighted Feature Extraction (NWFE)

(Figure: for the samples x_l^(i) and x_t^(i) of class i, the within-class difference vectors x^(i) - M_i(x^(i)) and the between-class difference vectors x^(i) - M_j(x^(i)) are drawn.)
Nonparametric Weighted Feature Extraction (NWFE)

NWFE focuses on these difference vectors, x^(i) - M_i(x^(i)) and x^(i) - M_j(x^(i)), and puts different weights on them.

(Figure: the same illustration as the previous slide, with the weighted difference vectors highlighted.)
Nonparametric Weighted Feature Extraction
(NWFE; Kuo & Landgrebe, 2002, 2004)
The feature transformation matrix of NWFE is composed of the eigenvectors of $[0.5\,S_w^{NW} + 0.5\,\mathrm{diag}(S_w^{NW})]^{-1} S_b^{NW}$, where

$$S_b^{NW} = \sum_{i=1}^{L} P_i \sum_{\substack{j=1 \\ j \ne i}}^{L} \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,j)}}{n_i}\,(x_k^{(i)} - M_j(x_k^{(i)}))(x_k^{(i)} - M_j(x_k^{(i)}))^T$$

$$S_w^{NW} = \sum_{i=1}^{L} P_i \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,i)}}{n_i}\,(x_k^{(i)} - M_i(x_k^{(i)}))(x_k^{(i)} - M_i(x_k^{(i)}))^T$$

$$M_j(x_k^{(i)}) = \sum_{l=1}^{n_j} w_{kl}^{(i,j)}\, x_l^{(j)}, \quad n_j \text{ is the number of training samples of class } j$$

$$\lambda_k^{(i,j)} = \frac{\mathrm{dist}(x_k^{(i)}, M_j(x_k^{(i)}))^{-1}}{\sum_{l=1}^{n_i} \mathrm{dist}(x_l^{(i)}, M_j(x_l^{(i)}))^{-1}}, \qquad w_{kl}^{(i,j)} = \frac{\mathrm{dist}(x_k^{(i)}, x_l^{(j)})^{-1}}{\sum_{t=1}^{n_j} \mathrm{dist}(x_k^{(i)}, x_t^{(j)})^{-1}}$$
The Performance of NWFE


- In Kuo and Landgrebe's paper, the performances of NWFE, LDA, aPAC-LDR, and NDA are compared.
- NWFE performs better than the others.

Reference: Bor-Chen Kuo and David A. Landgrebe, "Nonparametric weighted feature extraction for classification," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 5, pp. 1096-1105, May 2004.
The Kernel Trick




- Use a feature mapping φ to embed the samples from the original space into a feature space H, a Hilbert space of higher dimensionality.
- In H, the patterns can be discovered as linear relations.
- We can compute the inner product of samples in the feature space directly from the original data items using a kernel function κ (without evaluating the feature mapping φ).
- Assume that a sample in H can be represented in a dual form, i.e., as a combination of the training samples.
The Kernel Trick

$$\langle \phi(x_1, x_2), \phi(x_1', x_2') \rangle = x_1^2 x_1'^2 + 2 x_1 x_2 x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = \langle x, x' \rangle^2$$

$$\kappa:\ \kappa(x, x') = \langle x, x' \rangle^2$$
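A tiny numeric check of the identity above, using the explicit degree-2 feature map φ(x) = (x1², √2·x1·x2, x2²), whose inner product reproduces ⟨x, x'⟩²; the helper name phi is ours.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(z)          # inner product computed in the feature space H
rhs = (x @ z) ** 2             # kernel evaluated on the original data items
assert np.isclose(lhs, rhs)    # both equal <x, z>^2 = 1.0
```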
Characterization of Kernels
A function
$$\kappa : X \times X \to \mathbb{R},$$
which is either continuous or has a finite domain, can be decomposed as
$$\kappa(x, z) = \langle \phi(x), \phi(z) \rangle$$
into a feature map φ into a Hilbert space H applied to both its arguments, followed by the evaluation of the inner product in H, if and only if it satisfies the finitely positive semi-definite property.
Some Widely Used Kernel Functions

- Linear kernel: $\kappa(x, z) = \langle x, z \rangle$
- Polynomial kernel: $\kappa(x, z) = (\langle x, z \rangle + 1)^r,\ r \in \mathbb{Z}^+$
- RBF (Gaussian) kernel: $\kappa(x, z) = \exp\!\left(-\dfrac{\|x - z\|^2}{2\sigma^2}\right),\ \sigma \in \mathbb{R} \setminus \{0\}$
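The three kernels above written out as plain functions; a small sketch, with parameter names (r, sigma) matching the slide's symbols.

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                               # <x, z>

def polynomial_kernel(x, z, r=2):
    return (x @ z + 1.0) ** r                                  # (<x, z> + 1)^r, r a positive integer

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))  # exp(-||x - z||^2 / (2 sigma^2))
```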
PCA & KPCA
(Figure: side-by-side comparison of a data set projected by PCA and by kernel PCA (KPCA).)
Kernel-based Feature Extraction and Classification Process
(Flow chart: the training and testing data are mapped into the feature space H by the implicit feature map; the scatter matrices Sb and Sw are computed in H, the within-class scatter matrix Sw is regularized, and an eigenvalue decomposition yields the feature extraction; the transformed training and testing data are then passed to the classifier, which produces the classification result.)
Kernel Nonparametric Weighted
Feature Extraction (KNWFE)
$$S_b^{KNW} = \sum_{i=1}^{L} P_i \sum_{\substack{j=1 \\ j \ne i}}^{L} \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,j)}}{n_i}\,(\phi(x_k^{(i)}) - M_j(\phi(x_k^{(i)})))(\phi(x_k^{(i)}) - M_j(\phi(x_k^{(i)})))^T$$

$$S_w^{KNW} = \sum_{i=1}^{L} P_i \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,i)}}{n_i}\,(\phi(x_k^{(i)}) - M_i(\phi(x_k^{(i)})))(\phi(x_k^{(i)}) - M_i(\phi(x_k^{(i)})))^T$$

$$M_j(\phi(x_k^{(i)})) = \sum_{l=1}^{n_j} w_{kl}^{(i,j)}\, \phi(x_l^{(j)}), \quad n_j \text{ is the number of training samples of class } j$$

$$\lambda_k^{(i,j)} = \frac{\mathrm{dist}(\phi(x_k^{(i)}), M_j(\phi(x_k^{(i)})))^{-1}}{\sum_{l=1}^{n_i} \mathrm{dist}(\phi(x_l^{(i)}), M_j(\phi(x_l^{(i)})))^{-1}}, \qquad w_{kl}^{(i,j)} = \frac{\mathrm{dist}(\phi(x_k^{(i)}), \phi(x_l^{(j)}))^{-1}}{\sum_{t=1}^{n_j} \mathrm{dist}(\phi(x_k^{(i)}), \phi(x_t^{(j)}))^{-1}}$$
Problems



- Problem I: How can the scatter matrices of KNWFE be expressed in terms of the kernel matrix?
- Problem II: How can the singularity of the kernel matrix be handled?
- Problem III: How can samples be projected into the extracted feature space?
KNWFE Algorithm
1. Let $X^T = [X_1^T, \ldots, X_L^T]$, where $X_i^T = [\phi(x_1^{(i)}), \ldots, \phi(x_{N_i}^{(i)})]$, $i = 1, \ldots, L$.
2. Compute $S_w^{KNW} = X^T W X$.
3. Compute $S_b^{KNW} = X^T (B - W) X$. (Problem I is solved here.)
4. $A = \arg\max_A \mathrm{tr}\big((A^T S_w^{KNW} A)^{-1} A^T S_b^{KNW} A\big)$
5. $A = \arg\max_A \mathrm{tr}\big((A^T X^T W X A)^{-1} A^T X^T (B - W) X A\big)$; dual form: $A = X^T \tilde{A}$.
6. $\tilde{A} = \arg\max_{\tilde{A}} \mathrm{tr}\big((\tilde{A}^T K W K \tilde{A})^{-1} \tilde{A}^T K (B - W) K \tilde{A}\big)$, where $K = X X^T$ is the kernel matrix.
KNWFE Algorithm
7. $K = P P^T$, obtained from the eigen-decomposition of $K$.
8. $P^T W P$ is regularized by $R = 0.5\,(P^T W P) + 0.5\,\mathrm{diag}(P^T W P)$. (Problem II is solved here.)
9. $\mathrm{tr}\big((\tilde{A}^T K W K \tilde{A})^{-1} \tilde{A}^T K (B - W) K \tilde{A}\big) = \mathrm{tr}\big((\tilde{A}^T P P^T W P P^T \tilde{A})^{-1} \tilde{A}^T P P^T (B - W) P P^T \tilde{A}\big)$
10. $U = P^T \tilde{A}$
11. $U = \arg\max_U \mathrm{tr}\big((U^T R U)^{-1} U^T (P^T (B - W) P)\,U\big)$
KNWFE Algorithm
12. Compute the dual form: $\tilde{A} = P^{-1} U$.
13. For an arbitrary sample $z$,
$$y = A^T \phi(z) = \tilde{A}^T X \phi(z) = \tilde{A}^T \begin{bmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_N)^T \end{bmatrix} \phi(z) = \tilde{A}^T \begin{bmatrix} \kappa(x_1, z) \\ \vdots \\ \kappa(x_N, z) \end{bmatrix}$$
where $\kappa$ is the kernel function. (Problem III is solved here.)
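Steps 1-13 condensed into a short NumPy sketch. It assumes the N x N weight matrices W and B have already been assembled from the λ and w weights, so that S_w^KNW = X^T W X and S_b^KNW = X^T (B - W) X (their construction mirrors the NWFE sketch earlier); the function names and the positive-definiteness assumption on K are ours, and this is an outline of the algebra on these slides rather than the authors' implementation.

```python
import numpy as np

def knwfe_transform(K, W, B, n_features):
    """K: N x N kernel matrix; W, B: N x N weight matrices with Sw = X^T W X and
    Sb = X^T (B - W) X in the feature space. Returns the dual coefficient matrix A~."""
    # step 7: decomposition K = P P^T via the eigen-decomposition (K assumed positive definite)
    eigvals, V = np.linalg.eigh(K)
    P = V * np.sqrt(np.clip(eigvals, 1e-12, None))        # P = V diag(sqrt(eigvals))
    # step 8: regularize P^T W P (Problem II)
    PtWP = P.T @ W @ P
    R = 0.5 * PtWP + 0.5 * np.diag(np.diag(PtWP))
    # steps 9-11: U maximizes tr((U^T R U)^{-1} U^T P^T (B - W) P U)
    M = np.linalg.solve(R, P.T @ (B - W) @ P)
    vals, vecs = np.linalg.eig(M)
    U = vecs[:, np.argsort(vals.real)[::-1][:n_features]].real
    # step 12: dual form A~ = P^{-1} U
    return np.linalg.solve(P, U)

def knwfe_project(A_tilde, kernel, X_train, z):
    """Step 13: y = A~^T [kappa(x_1, z), ..., kappa(x_N, z)]^T for an arbitrary sample z."""
    k_z = np.array([kernel(x, z) for x in X_train])
    return A_tilde.T @ k_z
```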
Dataset




- Washington DC
- The dimensionality of this hyperspectral image is 191.
- The number of classes is 7.
- There are two kinds of training data sets: one with 40 training samples in every class, and the other with 100 training samples per class.
Experimental Design
Feature extraction:
- NWFE
- KNWFE with linear kernel (Linear K)
- KNWFE with polynomial kernel of degree 1 (Poly K-1), degree 2 (Poly K-2), and degree 3 (Poly K-3)
- KNWFE with RBF kernel (RBF K)

Classifier:
- Quadratic Bayes normal classifier (qdc)
- 1NN classifier
- Parzen classifier

Notes:
- Every 20th band, beginning from the first one, is selected for the 10-band case.
- The parameter of the RBF kernel is the mean of the variances of every band of the training samples.
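The last note above fixes the RBF parameter from the training data. A one-line sketch of that choice, reading the parameter as σ² (our interpretation) and using a hypothetical X_train array of shape (n_samples, n_bands):

```python
import numpy as np

X_train = np.random.randn(280, 191)        # hypothetical placeholder: 7 classes x 40 samples, 191 bands
sigma_sq = np.var(X_train, axis=0).mean()  # mean of the per-band variances, read here as sigma^2
```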
The Classification Results of Real Dataset
(Table: mean of accuracies using 1-9 features, for Ni = 40 and Ni = 100; DC Mall, quadratic Bayes normal classifier.)
The Classification Results of Real Dataset
(Table: mean of accuracies using 1-9 features, for Ni = 40 and Ni = 100; DC Mall, 1NN classifier.)
The Classification Results of Real Dataset
(Table: mean of accuracies using 1-9 features, for Ni = 40 and Ni = 100; DC Mall, Parzen classifier.)
The Classification Results of Washington DC
Mall Image
(Figures, shown over several slides: the thematic map of the Washington DC Mall image together with the classification maps produced by NWFE and by KNWFE.)
Experimental Results




- The performances of all three classifiers with KNWFE features are better than those with NWFE features.
- The polynomial kernel of degree 2 outperforms the other kernel functions in the 1NN and Parzen cases.
- Among the three classifiers, the quadratic Bayes normal classifier has the best performance. The best classification accuracy is 0.9, obtained by the qdc classifier with 9 features extracted by KNWFE with the linear kernel in the Ni = 100 case.
- Comparing the NWFE and KNWFE figures, the performance of KNWFE is better than that of NWFE in almost all classes.
Conclusion




- We proposed a new kernel-based nonparametric weighted feature extraction method.
- We have analyzed and compared NWFE and KNWFE both theoretically and experimentally.
- From a theoretical point of view, NWFE is a special case of KNWFE with a linear kernel, and the results on a real hyperspectral image show that the average classification accuracy of KNWFE is higher than that of NWFE.
- We can state that, in our case study, the use of KNWFE is more beneficial and yields better results than NWFE.
Thanks for Your Attention!
and
Questions?