Transcript Document
Nonparametric Weighted Feature Extraction (NWFE) and Its Kernel-based Version (KNWFE)

Bor-Chen Kuo
Graduate School of Educational Measurement and Statistics, National Taichung University, Taiwan, R.O.C.
[email protected]

Cheng-Hsuan Li
Institute of Electrical Control Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C.
Graduate School of Educational Measurement and Statistics, National Taichung University, Taiwan, R.O.C.
[email protected]

Outline
Hyperspectral image data and some applications
The influence of increasing dimensionality
The Hughes phenomenon
Feature selection and feature extraction
Nonparametric Weighted Feature Extraction (NWFE)
Kernel method
Kernel Nonparametric Weighted Feature Extraction (KNWFE)
The classification results of the Washington DC image
Conclusions

Hyperspectral Image Data Representation
[Figure: a sample shown in the image space, the spectral space, and the feature space.]

Application I
[Image. Data source: GIS Research Center, Feng Chia University.]

Applications II, III, and IV
Applications in urban areas include the understanding of large regions, the detection of changed areas, and simple population estimation (a trial was carried out in the Paris area). Further applications include the Food Agency's twice-yearly surveys of rice and other grain crops and large-scale environmental disasters.

The Power of Increasing Dimensionality
[Figure: distributions of the features x1, x2, and x3 shown individually and combined in two and three dimensions.]

The Hughes Phenomenon (1), (2)
[Figure: mean recognition accuracy versus measurement complexity n (total discrete values), with one curve per training sample size m = 2, 5, 10, 20, 50, 100, 200, 500, 1000, and m = ∞.]

A System for Hyperspectral Data Classification
[Flowchart: hyperspectral data collection; data adjustment (calibration, adjustment for the atmosphere, the solar curve, goniometric effects, etc.); labeling of training samples from observations of the ground, observations from the ground, or pre-gathered spectra (direct and indirect methods, clustering); class-conditional feature extraction and feature selection; determination of quantitative class descriptions; classifier; probability map and results map.]

Difference Between Feature Selection and Feature Extraction
Feature selection: select l out of the p measurements x1, ..., xp to form the features f1, f2, ...
Feature extraction: map the p measurements x1, ..., xp to l new measurements f1, f2, ...

Feature Extraction vs. Feature Selection
Selection - advantages: cut in measurements, easy interpretation; disadvantages: expensive, criterion sub-optimal.
Extraction - advantages: cheap, can be nonlinear; disadvantages: often approximative, needs all measurements.
[Figure: histograms of the same data after feature selection and after feature extraction.]

Feature Extraction and Classification Process
[Flowchart: training data → compute the scatter matrices Sb and Sw → regularize the within-class scatter matrix Sw → eigenvalue decomposition → feature extraction; the transformed training and testing data are passed to the classifier, which produces the classification result.]

Principal Component Analysis
Principal component analysis (PCA, 1901) finds directions in the data
- which retain as much variation as possible,
- which make the projected data uncorrelated,
- which minimise the squared reconstruction error.
[Figure: two-dimensional data with its principal directions; the projection maps R^k to R^l.]

Classification Using PCA
[Figure: two scatter plots of two-class data illustrating classification using PCA.]

What is the measure of separability?
The purpose of feature extraction is to mitigate the effect of the Hughes phenomenon. The method tries to find a transformation matrix A such that the class separability of the transformed data $Y = A^T X$ is maximized in a lower-dimensional space. What is the measure of separability? Usually the trace of $S_w^{-1} S_b$ is used as the separability.
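To make the last two ideas concrete (PCA directions and the separability measure $\mathrm{tr}(S_w^{-1} S_b)$ of transformed data $Y = A^T X$), here is a minimal numpy sketch. The scatter matrices use the standard class-mean definitions that are spelled out in the LDA slide below; the function names and the toy data are illustrative only and are not part of the original presentation.

```python
import numpy as np

def pca_transform(X, n_components):
    """Project the rows of X onto the leading principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :n_components]     # directions of largest variance
    return Xc @ A, A                           # projected data Y = A^T x, plus A itself

def trace_separability(X, labels):
    """tr(S_w^{-1} S_b) with class-mean scatter matrices (LDA-style definitions)."""
    m0 = X.mean(axis=0)                        # total mean
    d = X.shape[1]
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        Pi = len(Xc) / len(X)                  # empirical prior of class c
        mi = Xc.mean(axis=0)
        Sb += Pi * np.outer(mi - m0, mi - m0)
        Sw += Pi * np.cov(Xc, rowvar=False)
    return np.trace(np.linalg.solve(Sw, Sb))

# Toy usage: two Gaussian classes in five dimensions, reduced to two features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(1.5, 1.0, (50, 5))])
labels = np.array([0] * 50 + [1] * 50)
Y, A = pca_transform(X, 2)
print(trace_separability(X, labels), trace_separability(Y, labels))
```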
Linear Discriminant Analysis Feature Extraction (LDA or DAFE)
The feature transformation matrix of LDA is composed of the eigenvectors of $(S_w^{DA})^{-1} S_b^{DA}$, where

$S_b^{DA} = \sum_{i=1}^{L} P_i (m_i - m_0)(m_i - m_0)^T = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} P_i P_j (m_i - m_j)(m_i - m_j)^T$

$S_w^{DA} = \sum_{i=1}^{L} P_i E\{(X - m_i)(X - m_i)^T \mid \omega_i\} = \sum_{i=1}^{L} P_i \Sigma_i$

$P_i$ is the prior probability of class $i$, $m_i$ and $\Sigma_i$ are the mean and covariance of class $i$, $m_0$ is the total mean, and $L$ is the number of classes. $S_b^{DA}$ measures the between-class distance and $S_w^{DA}$ the within-class distance; the weights of the between- and within-class distances are the same for all samples.

Disadvantages:
1. Only useful for normally distributed data.
2. Only L - 1 features can be extracted.
[Figure: two classes, class i and class j, with class means Mi and Mj.]

Nonparametric Weighted Feature Extraction (NWFE)
[Figure sequence: two classes, class i and class j, with class means Mi and Mj, built up step by step as described below.]
For every training sample $x_l^{(i)}$ of class $i$, the inverse distances to the samples of class $j$ define the weights

$w_{ls}^{(i,j)} = \dfrac{\mathrm{dist}(x_l^{(i)}, x_s^{(j)})^{-1}}{\sum_{q=1}^{n_j} \mathrm{dist}(x_l^{(i)}, x_q^{(j)})^{-1}},$

so that nearby samples of class $j$ receive a large weight and distant samples a small weight. These weights define the weighted local mean of class $j$ for $x_l^{(i)}$,

$M_j(x_l^{(i)}) = \sum_{s=1}^{n_j} w_{ls}^{(i,j)} x_s^{(j)},$

and, analogously, the within-class local mean $M_i(x_l^{(i)})$. Every sample $x_l^{(i)}$ thus yields the difference vectors $x_l^{(i)} - M_i(x_l^{(i)})$ and $x_l^{(i)} - M_j(x_l^{(i)})$. Samples lying close to the other class's local mean, i.e., close to the class boundary, receive a large scatter-matrix weight

$\lambda_l^{(i,j)} = \dfrac{\mathrm{dist}(x_l^{(i)}, M_j(x_l^{(i)}))^{-1}}{\sum_{k=1}^{n_i} \mathrm{dist}(x_k^{(i)}, M_j(x_k^{(i)}))^{-1}},$

while samples far from the boundary receive a small one. NWFE focuses on these difference vectors and puts different weights on them.

Nonparametric Weighted Feature Extraction (NWFE; Kuo & Landgrebe, 2002, 2004)
The feature transformation matrix of NWFE is composed of the eigenvectors of $[0.5\, S_w^{NW} + 0.5\,\mathrm{diag}(S_w^{NW})]^{-1} S_b^{NW}$, where

$S_b^{NW} = \sum_{i=1}^{L} P_i \sum_{\substack{j=1 \\ j \ne i}}^{L} \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,j)}}{n_i}\, (x_k^{(i)} - M_j(x_k^{(i)}))(x_k^{(i)} - M_j(x_k^{(i)}))^T$

$S_w^{NW} = \sum_{i=1}^{L} P_i \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,i)}}{n_i}\, (x_k^{(i)} - M_i(x_k^{(i)}))(x_k^{(i)} - M_i(x_k^{(i)}))^T$

$M_j(x_k^{(i)}) = \sum_{l=1}^{n_j} w_{kl}^{(i,j)} x_l^{(j)}$, where $n_j$ is the number of training samples of class $j$,

$\lambda_k^{(i,j)} = \dfrac{\mathrm{dist}(x_k^{(i)}, M_j(x_k^{(i)}))^{-1}}{\sum_{l=1}^{n_i} \mathrm{dist}(x_l^{(i)}, M_j(x_l^{(i)}))^{-1}}, \qquad w_{kl}^{(i,j)} = \dfrac{\mathrm{dist}(x_k^{(i)}, x_l^{(j)})^{-1}}{\sum_{t=1}^{n_j} \mathrm{dist}(x_k^{(i)}, x_t^{(j)})^{-1}}.$

The Performance of NWFE
In the Kuo and Landgrebe paper, the performances of NWFE, LDA, aPAC-LDR, and NDA are compared; NWFE performs better than the others.
Reference: Bor-Chen Kuo and David A. Landgrebe, "Nonparametric weighted feature extraction for classification," IEEE Trans. Geosci. Remote Sens., vol. 42, no. 5, pp. 1096-1105, May 2004.
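The NWFE definitions above translate almost line by line into code. The following minimal numpy sketch (not the authors' implementation) computes the weights $w_{kl}^{(i,j)}$, the local means $M_j(x_k^{(i)})$, the weights $\lambda_k^{(i,j)}$, and the two scatter matrices, then takes the leading eigenvectors of $[0.5\, S_w^{NW} + 0.5\,\mathrm{diag}(S_w^{NW})]^{-1} S_b^{NW}$. It assumes empirical priors $P_i = n_i / N$ and, as an implementation detail not stated on the slides, excludes each sample from its own within-class local mean to avoid a zero distance.

```python
import numpy as np

def nwfe(X, labels, n_features, eps=1e-12):
    """Nonparametric Weighted Feature Extraction (sketch of the definitions above)."""
    classes = np.unique(labels)
    N, d = X.shape
    Sb, Sw = np.zeros((d, d)), np.zeros((d, d))
    for i in classes:
        Xi = X[labels == i]
        ni = len(Xi)
        Pi = ni / N                                           # empirical prior of class i
        for j in classes:
            Xj = X[labels == j]
            # w_{kl}^{(i,j)}: normalized inverse distances from x_k^(i) to the samples of class j
            dist = np.linalg.norm(Xi[:, None, :] - Xj[None, :, :], axis=2)
            inv = 1.0 / (dist + eps)
            if i == j:
                np.fill_diagonal(inv, 0.0)                    # assumption: drop the self term
            w = inv / inv.sum(axis=1, keepdims=True)
            M = w @ Xj                                        # local means M_j(x_k^(i))
            diff = Xi - M                                     # x_k^(i) - M_j(x_k^(i))
            # lambda_k^{(i,j)}: scatter-matrix weights, large near the class boundary
            inv_dM = 1.0 / (np.linalg.norm(diff, axis=1) + eps)
            lam = inv_dM / inv_dM.sum()
            S = (diff * (Pi * lam / ni)[:, None]).T @ diff    # weighted outer-product sum
            if i == j:
                Sw += S
            else:
                Sb += S
    Sw_reg = 0.5 * Sw + 0.5 * np.diag(np.diag(Sw))            # regularized within-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw_reg, Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_features]].real                  # columns are the NWFE directions

# Usage: A = nwfe(X_train, y_train, n_features=9); Y_train = X_train @ A; Y_test = X_test @ A
```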
The Kernel Trick
Use a feature mapping $\Phi$ to embed the samples from the original space into a feature space H, a Hilbert space of higher dimensionality. In H, the patterns can be discovered as linear relations. We can compute the inner product of samples in the feature space directly from the original data items using a kernel function $\kappa$ (not the feature mapping $\Phi$). Assume that a sample in H can be represented in dual form, i.e., as a combination of the training samples.

Example: for $x = (x_1, x_2)$ and $x' = (x_1', x_2')$,
$\langle \Phi(x), \Phi(x') \rangle = x_1^2 x_1'^2 + 2\, x_1 x_2\, x_1' x_2' + x_2^2 x_2'^2 = (x_1 x_1' + x_2 x_2')^2 = \langle x, x' \rangle^2,$
so one can define $\kappa(x, x') := \langle x, x' \rangle^2$.

Characterization of Kernels
A function $\kappa : X \times X \to \mathbb{R}$, which is either continuous or has a finite domain, can be decomposed as $\kappa(x, z) = \langle \Phi(x), \Phi(z) \rangle$, that is, as a feature map $\Phi$ into a Hilbert space H applied to both its arguments followed by the evaluation of the inner product in H, if and only if it satisfies the finitely positive semi-definite property.

Some Widely Used Kernel Functions
Linear kernel: $\kappa(x, z) = \langle x, z \rangle$
Polynomial kernel: $\kappa(x, z) = (\langle x, z \rangle + 1)^r$, $r \in \mathbb{Z}^{+}$
RBF (Gaussian) kernel: $\kappa(x, z) = \exp\!\left(-\dfrac{\|x - z\|^2}{2\sigma^2}\right)$, $\sigma \in \mathbb{R} \setminus \{0\}$

PCA & KPCA
[Figure: the same data projected by PCA and by KPCA.]

Kernel-based Feature Extraction and Classification Process
[Flowchart: the training and testing data are mapped into the feature space H through the implicit feature map; compute the scatter matrices Sb and Sw in H; regularize the within-class scatter matrix Sw; eigenvalue decomposition; feature extraction; the transformed training and testing data are passed to the classifier, which produces the classification result.]

Kernel Nonparametric Weighted Feature Extraction (KNWFE)

$S_b^{KNW} = \sum_{i=1}^{L} P_i \sum_{\substack{j=1 \\ j \ne i}}^{L} \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,j)}}{n_i}\, (\Phi(x_k^{(i)}) - M_j(\Phi(x_k^{(i)})))(\Phi(x_k^{(i)}) - M_j(\Phi(x_k^{(i)})))^T$

$S_w^{KNW} = \sum_{i=1}^{L} P_i \sum_{k=1}^{n_i} \frac{\lambda_k^{(i,i)}}{n_i}\, (\Phi(x_k^{(i)}) - M_i(\Phi(x_k^{(i)})))(\Phi(x_k^{(i)}) - M_i(\Phi(x_k^{(i)})))^T$

$M_j(\Phi(x_k^{(i)})) = \sum_{l=1}^{n_j} w_{kl}^{(i,j)} \Phi(x_l^{(j)})$, where $n_j$ is the number of training samples of class $j$,

$\lambda_k^{(i,j)} = \dfrac{\mathrm{dist}(\Phi(x_k^{(i)}), M_j(\Phi(x_k^{(i)})))^{-1}}{\sum_{l=1}^{n_i} \mathrm{dist}(\Phi(x_l^{(i)}), M_j(\Phi(x_l^{(i)})))^{-1}}, \qquad w_{kl}^{(i,j)} = \dfrac{\mathrm{dist}(\Phi(x_k^{(i)}), \Phi(x_l^{(j)}))^{-1}}{\sum_{t=1}^{n_j} \mathrm{dist}(\Phi(x_k^{(i)}), \Phi(x_t^{(j)}))^{-1}}.$

Problems
Problem I: How can the scatter matrices of KNWFE be expressed in terms of the kernel matrix?
Problem II: How can the singularity of the kernel matrix be handled?
Problem III: How can samples be projected into the extracted feature space?

KNWFE Algorithm
1. Let $X^T = [X_1^T, \ldots, X_L^T]$, where $X_i^T = [\Phi(x_1^{(i)}), \ldots, \Phi(x_{N_i}^{(i)})]$, $i = 1, \ldots, L$.
2. Compute $S_w^{KNW} = X^T W X$.
3. Compute $S_b^{KNW} = X^T (B - W) X$. (Problem I is solved here.)
4. $A = \arg\max_{A} \mathrm{tr}\big((A^T S_w^{KNW} A)^{-1} A^T S_b^{KNW} A\big)$
5. Equivalently, $A = \arg\max_{A} \mathrm{tr}\big((A^T X^T W X A)^{-1} A^T X^T (B - W) X A\big)$
6. Dual form $A = X^T \tilde{A}$: $\tilde{A} = \arg\max_{\tilde{A}} \mathrm{tr}\big((\tilde{A}^T K W K \tilde{A})^{-1} \tilde{A}^T K (B - W) K \tilde{A}\big)$, where $K = X X^T$ is the kernel matrix.
7. Compute $K = P P^T$, the eigendecomposition of $K$.
8. Regularize $P^T W P$ as $R = 0.5\,(P^T W P) + 0.5\,\mathrm{diag}(P^T W P)$. (Problem II is solved here.)
9. $\mathrm{tr}\big((\tilde{A}^T K W K \tilde{A})^{-1} \tilde{A}^T K (B - W) K \tilde{A}\big) = \mathrm{tr}\big((\tilde{A}^T P P^T W P P^T \tilde{A})^{-1} \tilde{A}^T P P^T (B - W) P P^T \tilde{A}\big)$
10. Let $U = P^T \tilde{A}$.
11. $U = \arg\max_{U} \mathrm{tr}\big((U^T R\, U)^{-1} U^T (P^T (B - W) P)\, U\big)$
12. Compute the dual form $\tilde{A} = P^{-1} U$.
13. For an arbitrary sample $z$, $y = A^T \Phi(z) = \tilde{A}^T X \Phi(z) = \tilde{A}^T [\Phi(x_1)^T \Phi(z), \ldots, \Phi(x_N)^T \Phi(z)]^T = \tilde{A}^T [\kappa(x_1, z), \ldots, \kappa(x_N, z)]^T$, where $\kappa$ is the kernel function. (Problem III is solved here.)

Dataset
Washington DC: the dimensionality of this hyperspectral image is 191 and the number of classes is 7. There are two training data sets: one with 40 training samples per class and the other with 100 training samples per class.
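Steps 7-13 of the KNWFE algorithm above can be read as the short numpy sketch below. It assumes the kernel matrix K and the weight matrices W and B (those for which $S_w^{KNW} = X^T W X$ and $S_b^{KNW} = X^T (B - W) X$; their construction follows the same pattern as the NWFE sketch, with distances evaluated through the kernel) are already available. Two details are assumptions of this sketch rather than of the slides: numerically null eigenvalues of K are dropped to cope with a singular kernel matrix, and a pseudo-inverse of $P^T$ replaces the inverse in step 12.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """kappa(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def knwfe_dual(K, W, B, n_features, tol=1e-10):
    """Steps 7-13: dual-form coefficients A~ from K, W, and B."""
    lam, V = np.linalg.eigh(K)                   # step 7: eigendecomposition of K
    keep = lam > tol                             # assumption: drop null directions of a singular K
    P = V[:, keep] * np.sqrt(lam[keep])          # K ~ P P^T
    R = P.T @ W @ P
    R = 0.5 * R + 0.5 * np.diag(np.diag(R))      # step 8: regularize P^T W P (Problem II)
    Sb = P.T @ (B - W) @ P
    evals, U = np.linalg.eig(np.linalg.solve(R, Sb))
    order = np.argsort(-evals.real)
    U = U[:, order[:n_features]].real            # step 11: leading eigenvectors
    return np.linalg.pinv(P.T) @ U               # step 12: A~, via a pseudo-inverse (assumption)

def knwfe_project(A_tilde, X_train, Z, sigma):
    """Step 13: y = A~^T [kappa(x_1, z), ..., kappa(x_N, z)]^T for every row z of Z (Problem III)."""
    return rbf_kernel(Z, X_train, sigma) @ A_tilde
```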
Experimental Design
Feature extraction: NWFE, and KNWFE with the linear kernel (Linear K), the polynomial kernel of degree 1 (Poly K-1), degree 2 (Poly K-2), and degree 3 (Poly K-3), and the RBF kernel (RBF K).
Classifiers: quadratic Bayes normal classifier (qdc), 1NN classifier, and Parzen classifier.
Every 20th band, starting from the first one, is selected for the 10-band case. The RBF kernel parameter is the mean of the variances of every band of the training samples.

The Classification Results of the Real Dataset
[Tables: mean of accuracies using 1-9 features on DC Mall for Ni = 40 and Ni = 100, with the quadratic Bayes normal classifier, the 1NN classifier, and the Parzen classifier.]

The Classification Results of the Washington DC Mall Image
[Figures: thematic maps of the Washington DC Mall image obtained with NWFE and KNWFE features.]

Experimental Results
The performances of all three classifiers with KNWFE features are better than those with NWFE features. The polynomial kernel with degree 2 outperforms the other kernel functions in the 1NN and Parzen cases. Among the three classifiers, the quadratic Bayes normal classifier has the best performance. The best classification accuracy, 0.9, is obtained by the qdc classifier with 9 features extracted by KNWFE with the linear kernel in the Ni = 100 case. Comparing the thematic maps from NWFE and KNWFE, one sees that the performance of KNWFE is better than that of NWFE in almost all classes.

Conclusion
We proposed a new kernel-based nonparametric weighted feature extraction method. We analyzed and compared NWFE and KNWFE both theoretically and experimentally. From a theoretical point of view, NWFE is a special case of KNWFE with the linear kernel, and the results on a real hyperspectral image show that the average classification accuracy of KNWFE is higher than that of NWFE. We can state that, in our case study, the use of KNWFE is more beneficial and yields better results than NWFE.

Thanks for your attention! Questions?