CS340 Data Mining: Feature Selection


AMCS/CS 340 : Data Mining

Feature Selection

Xiangliang Zhang, King Abdullah University of Science and Technology

Outline

• Introduction

• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization

• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature Subset Selection
    o Filters
    o Wrappers

• Summary

Problems due to poor variable selection

• Input dimension is too large; the curse of dimensionality problem may happen;
• Poor models may be built with additional unrelated inputs or not enough relevant inputs;
• Complex models which contain too many inputs are more difficult to understand

Applications

• OCR (optical character recognition)
• HWR (handwriting recognition)

Benefits of feature selection

• Facilitating data visualization
• Data understanding
• Reducing the measurement and storage requirements
• Reducing training and utilization times
• Defying the curse of dimensionality to improve prediction performance

Feature Selection/Extraction

[Figure: data matrix X with N samples and m original features {F_j}, reduced to d selected/extracted features {f_i} (d < m), with target Y]

Thousands to millions of low-level features: select/extract the most relevant ones to build better, faster, and easier to understand learning machines.

• Using label Y → supervised
• Without label Y → unsupervised

Feature Selection vs Extraction

Selection:

• choose a best subset of size d from the m features; {f_i} is a subset of {F_j}, i = 1,…,d and j = 1,…,m

Extraction:

• extract d new features from the m features by a linear or non-linear combination of all of them
• Linear/non-linear feature extraction: {f_i} = f({F_j})
• New features may not have a physical interpretation/meaning

Outline

• Introduction

• Unsupervised Feature Selection

  - Clustering
  - Matrix Factorization

• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature Subset Selection
    o Filters
    o Wrappers

• Summary

Feature Selection by Clustering

• Group features into clusters
• Replace (many) similar variables in one cluster by a (single) cluster centroid
• E.g., K-means, hierarchical clustering
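A minimal sketch of this recipe on synthetic data, assuming the features are the columns of a matrix X; all names and parameter values below are illustrative, not taken from the course demo.

```python
# Hypothetical sketch: replace each cluster of similar features by its centroid.
import numpy as np
from sklearn.cluster import KMeans

def cluster_features(X, n_clusters=5, random_state=0):
    """Cluster the columns of X; return one centroid 'feature' per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X.T)
    # cluster_centers_ has shape (n_clusters, n_samples): each centroid is a new feature
    return km.cluster_centers_.T              # shape (n_samples, n_clusters)

X = np.random.default_rng(0).normal(size=(100, 20))   # synthetic data
X_reduced = cluster_features(X, n_clusters=5)
print(X_reduced.shape)                                 # (100, 5)
```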

Example of student project

Abdullah Khamis, AMCS/CS340 2010 Fall,

“Statistical Learning Based System for Text Classification”


Other unsupervised FS methods

• Matrix Factorization
  o PCA (Principal Component Analysis): use the PCs with the largest eigenvalues as “features”
  o SVD (Singular Value Decomposition): use the singular vectors with the largest singular values as “features”
  o NMF (Non-negative Matrix Factorization)

• Nonlinear Dimensionality Reduction
  o Isomap
  o LLE (Locally Linear Embedding)
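As a hedged illustration of the factorization-based options above, the sketch below extracts d new features with scikit-learn's PCA, TruncatedSVD, and NMF on synthetic data; the value d = 5 and the data are assumptions.

```python
# Hypothetical sketch: feature extraction by matrix factorization on synthetic data.
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD, NMF

rng = np.random.default_rng(0)
X = rng.random((200, 30))          # non-negative values so that NMF also applies
d = 5

X_pca = PCA(n_components=d).fit_transform(X)           # top-d principal components
X_svd = TruncatedSVD(n_components=d).fit_transform(X)  # top-d singular directions
X_nmf = NMF(n_components=d, init="nndsvda", max_iter=500).fit_transform(X)
print(X_pca.shape, X_svd.shape, X_nmf.shape)           # each (200, 5)
```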

Outline

• Introduction

• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization

• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature Subset Selection
    o Filters
    o Wrappers

• Summary

Feature Ranking

• Build better, faster, and easier to understand learning machines
• Discover the most relevant features w.r.t. the target label, e.g., find genes that discriminate between healthy and diseased patients

- Rank useful features.
- Eliminate useless features (distracters).
- Eliminate redundant features.

Example of detecting attacks in real HTTP logs

Example requests from the logs: a common request, a JS XSS attack, a remote file inclusion attack, a DoS attack.

Represent each HTTP request by a vector
• in 95 dimensions, corresponding to the 95 ASCII codes between 33 and 127,
• of the character distribution, computed as the frequency of each ASCII code in the path of the HTTP request.

Classification of the HTTP vectors in the 95-dim space vs. in a reduced-dimension space: which dimensions should we choose? Which one is better?
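The 95-dimensional representation described above can be sketched as follows; the example request string is made up and only illustrates the encoding.

```python
# Hypothetical sketch: character-distribution vector over ASCII codes 33..127.
import numpy as np

def char_distribution(path: str) -> np.ndarray:
    counts = np.zeros(95)                 # one bin per ASCII code 33, 34, ..., 127
    for ch in path:
        code = ord(ch)
        if 33 <= code <= 127:
            counts[code - 33] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

vec = char_distribution("/index.php?page=../../etc/passwd")   # made-up request path
print(vec.shape)    # (95,)
```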

Individual Feature Ranking (1) by AUC

1. Rank the features by AUC (the area under the ROC curve obtained when the single feature x_i is used as the classifier score):
   → 1: most related
   → 0.5: most unrelated

[Figure: ROC curve of a single feature x_i; AUC is the area under the curve, plotted against the false positive rate]
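A minimal sketch of AUC-based ranking, assuming each feature column is scored as a one-variable classifier on synthetic data (all names and sizes are illustrative):

```python
# Hypothetical sketch: rank features by the AUC of each single feature used as a score.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 3] + 0.5 * rng.normal(size=200) > 0).astype(int)   # only feature 3 is informative

# Symmetrize around 0.5 so that anti-correlated features also rank high.
auc = np.array([max(roc_auc_score(y, X[:, j]), 1 - roc_auc_score(y, X[:, j]))
                for j in range(X.shape[1])])
ranking = np.argsort(-auc)            # best features first
print(ranking[:3], np.round(auc[ranking[:3]], 3))
```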

Individual Feature Ranking (2) by Mutual Information

2. Rank the features by mutual information I(i): the higher I(i), the more related attribute x_i is to class y.

Mutual information between each variable and the target:

   I(i) = Σ_{x_i} Σ_y P(X = x_i, Y = y) · log [ P(X = x_i, Y = y) / ( P(X = x_i) · P(Y = y) ) ]

• P(Y = y): frequency count of class y
• P(X = x_i): frequency count of attribute value x_i
• P(X = x_i, Y = y): frequency count of attribute value x_i jointly with class y
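A sketch of the frequency-count estimate of I(i) for discrete features, on synthetic data; the data and names are assumptions, and scikit-learn's mutual_info_classif offers a ready-made alternative.

```python
# Hypothetical sketch: mutual information between a discrete feature and the class.
import numpy as np

def mutual_information(x, y):
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
noisy_copy = np.where(rng.random(300) < 0.1, 1 - y, y)           # mostly equal to y
X = np.column_stack([noisy_copy, rng.integers(0, 3, size=300)])  # 2nd column unrelated
scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
print(np.round(scores, 3))   # the first feature should get the higher I(i)
```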

Individual Feature Ranking (3) with continuous target

3. Rank features by the Pearson correlation coefficient R(i):
   • detects linear dependencies between a variable and the target
   • rank features by R(i) or R²(i) (linear regression)
   → 1: related; → 0: unrelated
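A minimal sketch of correlation-based ranking for a continuous target, on synthetic data (names and sizes are assumptions):

```python
# Hypothetical sketch: rank features by R^2(i), the squared Pearson correlation
# between each feature and a continuous target.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 1] - 1.0 * X[:, 4] + 0.3 * rng.normal(size=200)

R = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
ranking = np.argsort(-R**2)              # rank by R^2, largest first
print(ranking[:3], np.round(R[ranking[:3]], 3))
```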

Individual Feature Ranking (4) by T-test

• Null hypothesis H0: μ+ = μ− (x_i and Y are independent)
• Relevance index: the test statistic
• T statistic: if H0 is true,

   t = (μ+ − μ−) / ( σ_within · sqrt(1/n+ + 1/n−) )  ~  Student's t with (n+ + n− − 2) d.f.,

  where n+ and n− are the numbers of samples with label + and −, and

   σ²_within = [ (n+ − 1)·σ²+ + (n− − 1)·σ²− ] / (n+ + n− − 2)

4. Rank by p-value (≈ false positive rate): the lower the p-value, the more related x_i is to class y.

[Figure: class-conditional distributions of feature x_i with means m−, m+ and standard deviations s−, s+]
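A sketch of t-test ranking using SciPy's pooled-variance two-sample test, on synthetic two-class data (names and sizes are assumptions):

```python
# Hypothetical sketch: rank features by the two-sample t-test p-value between classes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 6))
X[:, 2] += y                                   # only feature 2 has shifted class means

pvalues = np.array([stats.ttest_ind(X[y == 1, j], X[y == 0, j], equal_var=True).pvalue
                    for j in range(X.shape[1])])
ranking = np.argsort(pvalues)                  # lowest p-value = most related
print(ranking[:3], np.round(pvalues[ranking[:3]], 5))
```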

Individual Feature Ranking (5) by Fisher Score

• Fisher discrimination
• Two-class case: F = between-class variance / pooled within-class variance

   F = [ n+ · n− · (μ+ − μ−)² ] / [ (n+ + n−) · (σ²+ + σ²−) ]

5. Rank by F value: the higher F, the more related x_i is to class y.

[Figure: class-conditional distributions of feature x_i with means m−, m+ and standard deviations s−, s+]
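A sketch of the two-class Fisher score as written above, on synthetic data; the sample-size factor is the same for every feature, so it does not change the ranking, and all names below are illustrative.

```python
# Hypothetical sketch: two-class Fisher score for each feature.
import numpy as np

def fisher_score(x, y):
    xp, xm = x[y == 1], x[y == 0]
    n_pos, n_neg = len(xp), len(xm)
    num = n_pos * n_neg * (xp.mean() - xm.mean()) ** 2
    den = (n_pos + n_neg) * (xp.var(ddof=1) + xm.var(ddof=1))
    return num / den

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 0] += 1.5 * y                               # feature 0 separates the classes

F = np.array([fisher_score(X[:, j], y) for j in range(X.shape[1])])
print(np.argsort(-F))                            # feature 0 should rank first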

Rank features in HTTP logs

SVM results with each feature-selection method (D = 30 selected dimensions):

FS method                     AUC      Accuracy   Selected features
All 95 features               0.8797   97.92%     (all)
AUC-ranking (D=30)            0.9212   97.60%     #.128:;?BLOQ[\]_aefhiklmoptuw|
MI-ranking (D=30)             0.8849   97.96%     "#./1268;ALPQRS[\]_`aehkltwyz|
R-ranking (D=30)              0.9208   97.67%     "#,.2:;?LQS[\]_`aehiklmoptuwz|
T-test ranking (D=30)         0.9208   97.67%     "#,.2:;?LQS[\]_`aehiklmoptuwz|
Fisher score ranking (D=30)   0.9208   97.67%     "#,.2:;?LQS[\]_`aehiklmoptuwz|
PCA (D=30, unsupervised)      0.8623   97.74%     Constructed features

[Figure: ROC curves (true positive rate vs. false positive rate) for all features and for the rankings by AUC, MI, correlation coefficient, t-test, and Fisher score]

Demo: http://www.lri.fr/~xlzhang/KAUST/CS340_slides/FS_rank_demo.zip

Issues of individual features ranking

Relevance vs. usefulness:
  - Relevance does not imply usefulness.
  - Usefulness does not imply relevance.

• Leads to the selection of a redundant subset: the k best features != the best k features
• A variable that is useless by itself can be useful with others

Useless features become useful

• Separation is gained by using two variables instead of one, or by adding variables
• Ranking variables individually and independently of each other cannot determine which combination of variables would give the best performance.

Outline

• Introduction

• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization

• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature Subset Selection
    o Filters
    o Wrappers

• Summary

Multivariate Feature Selection is complex

Kohavi-John, 1997

With M features, there are 2^M possible feature subsets!

Objectives of feature selection


Questions before subset feature selection

1. How to search the space of all possible variable subsets?
2. Do we use the prediction performance to guide the search?
   • NO → Filter
   • YES → Wrapper
     1) how to assess the prediction performance of a learning machine to guide the search and halt it
     2) which predictor to use: popular predictors include decision trees, Naive Bayes, least-squares linear predictors, and SVM

Filter: Feature subset selection

All features → Filter → Feature subset → Predictor

The feature subset is chosen by an evaluation criterion, which measures the relation of each subset of input variables to the class, e.g.,

• Correlation-based feature selector (CFS): prefer subsets that contain features that are highly correlated with the class and uncorrelated with each other.

   M(subset {f_i, i = 1…k}) = k · R_cf / sqrt( k + k(k−1) · R_ff )

  - R_cf: mean feature-class correlation (how predictive of the class the set of features is)
  - R_ff: average feature-feature intercorrelation (how much redundancy there is among the feature subset)
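A minimal sketch of the CFS merit above, using absolute Pearson correlations for R_cf and R_ff on synthetic data; the data and the helper name are assumptions.

```python
# Hypothetical sketch of the CFS merit M(subset) = k*R_cf / sqrt(k + k(k-1)*R_ff).
import numpy as np
from itertools import combinations

def cfs_merit(X, y, subset):
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                        for a, b in combinations(subset, 2)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)       # redundant copy of feature 0
y = (X[:, 0] + X[:, 2] > 0).astype(float)

print(round(cfs_merit(X, y, [0, 2]), 3))   # complementary, non-redundant pair: higher merit
print(round(cfs_merit(X, y, [0, 1]), 3))   # redundant pair: lower merit
```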

Filter: Feature subset selection (2)

All features → Filter → Feature subset → Predictor

Search over all possible feature subsets, i.e., evaluate M(subset {f_i, i = 1…k}) for k = 1,…,M?
  – exhaustive enumeration
  – forward selection
  – backward elimination
  – best first, forward/backward with a stopping criterion

The filter method is a pre-processing step, which is independent of the learning algorithm.

Forward Selection

Sequential forward selection (SFS): features are sequentially added to an empty candidate set until the addition of further features does not decrease the criterion.

[Figure: the search starts from the empty set and considers n, then n−1, then n−2, … remaining candidate features at each step]
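A hedged sketch of sequential forward selection using scikit-learn's SequentialFeatureSelector as the search engine; the estimator, data, and parameter values are assumptions, and direction="backward" gives the backward elimination of the next slide instead.

```python
# Hypothetical sketch: greedy sequential forward selection with cross-validated scoring.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="forward",     # "backward" would give SBS
                                cv=5)
sfs.fit(X, y)
print(np.where(sfs.get_support())[0])   # indices of the selected features
```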

Backward Elimination

Sequential backward selection (SBS): features are sequentially removed from the full candidate set until the removal of further features increases the criterion.

[Figure: the search starts from the full set of n features and removes one feature at a time: n → n−1 → n−2 → … → 1]

Wrapper: Feature selection methods

All features → Wrapper (multiple feature subsets evaluated with the predictor) → Predictor

• The learning model is used as part of the evaluation function and also to induce the final learning model
• Subsets of features are scored according to their predictive power
• The parameters of the model are optimized by measuring some cost function
• Danger of over-fitting with intensive search!

RFE SVM

Recursive Feature Elimination (RFE) SVM. Guyon-Weston, 2000. US patent 7,117,188

Start from all features; stop if performance degrades, otherwise continue:

1: repeat
2:   Find w and b by training a linear SVM.
3:   Remove the feature with the smallest value |w_i|.
4: until the desired number of features remains.
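A hedged sketch of the same recursion using scikit-learn's RFE with a linear SVM; the data and all parameter values are assumptions.

```python
# Hypothetical sketch of RFE with a linear SVM: repeatedly train, then drop the
# feature with the smallest weight magnitude, as in the loop above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=30, n_informative=5, random_state=0)
rfe = RFE(estimator=LinearSVC(C=1.0, dual=False, max_iter=5000),
          n_features_to_select=10,    # stop when this many features remain
          step=1)                     # remove one feature per iteration
rfe.fit(X, y)
print(np.where(rfe.support_)[0])      # indices of the surviving features
```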

Selecting feature subsets in HTTP logs

SVM results:

FS method                     AUC      Accuracy   Selected features
All 95 features               0.8797   97.92%     (all)
AUC-ranking (D=30)            0.9212   97.60%     #.128:;?BLOQ[\]_aefhiklmoptuw|
R-ranking (D=30)              0.9208   97.67%     "#,.2:;?LQS[\]_`aehiklmoptuwz|
SFS Gram-Schmidt (D=30)       0.8914   97.85%     #&,./25:;=?DFLQ[\_`ghklmptwxz|
RFE SVM (D=30)                0.9174   97.85%     "#',-.0238:;
….                            ….       ….         ….

Comparison of Filter and Wrapper:

• Main goal: rank subsets of useful features
• Search strategies: explore the space of all possible feature combinations
• Two criteria: predictive power (maximize) and subset size (minimize)
• Predictive power assessment:
  – Filter methods: criteria not involving any learning machine, e.g., a relevance index based on correlation coefficients or test statistics
  – Wrapper methods: the performance of a learning machine trained using a given feature subset
• The wrapper is potentially very time consuming, since it typically needs to evaluate a cross-validation scheme at every iteration.
• The filter method is much faster, but it does not incorporate learning.

Forward Selection w. Trees

• Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993), start from all the data
• At each step, choose the feature that “reduces entropy” most; work towards “node purity” (e.g., first choose f_1, then f_2)
• Feature subset selection by Random Forest

[Figure: recursive partitioning of the data, first on feature f_1 and then on feature f_2]
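One hedged way to realize "feature subset selection by Random Forest" is to rank features by the forest's impurity-based importances; the data and parameters below are assumptions.

```python
# Hypothetical sketch: pick a feature subset via random-forest feature importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=4, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(-rf.feature_importances_)[:5]   # keep the 5 highest-importance features
print(top, np.round(rf.feature_importances_[top], 3))
```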

Outline

• Introduction

• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization

• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature Subset Selection
    o Filters
    o Wrappers

• Summary

Conclusion

Feature selection focuses on uncovering subsets of variables X1, X2, … that are predictive of the target Y.

• Univariate feature selection
  - How to rank the features?

• Multivariate (subset) feature selection: Filter, Wrapper, Embedded
  - How to search the subsets of features?
  - How to evaluate the subsets of features?

• Feature extraction
  - How to construct new features in linear/non-linear ways?

In practice

• No method is universally better:
  - wide variety of types of variables, data distributions, learning machines, and objectives.
• Match the method complexity to the ratio M/N:
  - univariate feature selection may work better than multivariate feature selection;
  - non-linear classifiers are not always better.
• Feature selection is not always necessary to achieve good performance.

NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges

Feature selection toolbox

• Matlab: sequentialfs (sequential feature selection, shown in demo)
  - Forward: good
  - Backward: be careful with the definition of the criteria
• Feature Selection Toolbox 3: software in C++
  - freely available and open-source
• Weka

Reference

• An Introduction to Variable and Feature Selection, Isabelle Guyon and André Elisseeff, JMLR 2003
• Feature Extraction, Foundations and Applications, Isabelle Guyon et al. (Eds.), Springer, 2006. http://clopinet.com/fextract-book
• Pabitra Mitra, C. A. Murthy, and Sankar K. Pal (2002). "Unsupervised Feature Selection Using Feature Similarity." IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3)
• Prof. Marc Van Hulle, Katholieke Universiteit Leuven, http://134.58.34.50/~marc/DM_course/slides_selection.pdf