Transcript CS340 Data Mining: Feature Selection
AMCS/CS 340 : Data Mining
Feature Selection
Xiangliang Zhang, King Abdullah University of Science and Technology
Outline
• Introduction
• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization
• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature subset selection
    o Filters
    o Wrappers
• Summary
Problems due to poor variable selection
• The input dimension is too large; the curse of dimensionality problem may occur;
• A poor model may be built with additional unrelated inputs, or without enough relevant inputs;
• Complex models that contain too many inputs are more difficult to understand.
Applications
• OCR (optical character recognition)
• HWR (handwriting recognition)
Benefits of feature selection
• Facilitating data visualization
• Data understanding
• Reducing the measurement and storage requirements
• Reducing training and utilization times
• Defying the curse of dimensionality to improve prediction performance
Feature Selection/Extraction
Thousands to millions of low-level features {F_j}, j = 1, …, m: select/extract the d (< m) most relevant ones {f_i} to build better, faster, and easier-to-understand learning machines.
[Diagram: data matrix X with N samples and m features, target Y, reduced to d features]
• Using the label Y: supervised
• Without the label Y: unsupervised
Feature Selection vs Extraction
Selection:
• choose the best subset of size d from the m features; {f_i} is a subset of {F_j}, i = 1, …, d, j = 1, …, m
Extraction:
• extract d new features by a linear or non-linear combination of all m features: {f_i} = f({F_j})
• new features may not have a physical interpretation/meaning
[Diagram: data matrix X with N samples and m features, target Y; features {F_j} mapped to {f_i}]
Outline
• Introduction
• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization
• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature subset selection
    o Filters
    o Wrappers
• Summary
Feature Selection by Clustering
• Group features into clusters
• Replace (many) similar variables in one cluster by a (single) cluster centroid
• E.g., k-means, hierarchical clustering (see the sketch below)
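A minimal sketch of this idea, not taken from the slides: cluster the columns of a hypothetical data matrix X with k-means and replace each cluster of similar features by its centroid. The helper name cluster_features and all parameter choices are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_features(X, n_clusters=10, random_state=0):
    # Cluster the features (columns of X), not the samples.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(X.T)
    # New design matrix: one column per feature cluster (the cluster centroid).
    X_reduced = np.column_stack([X[:, labels == c].mean(axis=1)
                                 for c in range(n_clusters)])
    return X_reduced, labels

# Example with random data: 100 samples, 50 features -> 10 centroid features.
X = np.random.rand(100, 50)
X_new, feat_labels = cluster_features(X, n_clusters=10)
print(X_new.shape)   # (100, 10)
```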
Example of student project
Abdullah Khamis, AMCS/CS340 2010 Fall,
“Statistical Learning Based System for Text Classification”
Other unsupervised FS methods
• Matrix Factorization
  o PCA (Principal Component Analysis): use the PCs with the largest eigenvalues as "features"
  o SVD (Singular Value Decomposition): use the singular vectors with the largest singular values as "features"
  o NMF (Non-negative Matrix Factorization)
• Nonlinear Dimensionality Reduction
  o Isomap
  o LLE (Locally Linear Embedding)
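An illustrative sketch of the PCA option, assumed rather than taken from the slides: keep the 30 components with the largest eigenvalues as new "features". The data here is random and only shows the shapes involved.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 95)        # hypothetical data: 200 samples, 95 features
pca = PCA(n_components=30)         # keep the 30 directions of largest variance
X_pca = pca.fit_transform(X)       # new features = projections onto the top PCs
print(X_pca.shape)                 # (200, 30)
print(pca.explained_variance_ratio_[:5])
```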
Outline
• Introduction
• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization
• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature subset selection
    o Filters
    o Wrappers
• Summary
Feature Ranking
• Build better, faster, and easier-to-understand learning machines
• Discover the most relevant features w.r.t. the target label, e.g., find genes that discriminate between healthy and diseased patients
[Diagram: data matrix X with N samples and m features, reduced to d ranked features]
- Rank useful features.
- Eliminate useless features (distracters).
- Eliminate redundant features.
Example of detecting attacks in real HTTP logs
[Examples shown: a common request, a JS XSS attack, a remote file inclusion attack, a DoS attack]
Represent each HTTP request by a vector
• in 95 dimensions, corresponding to the 95 ASCII codes (between 33 and 127)
• of the character distribution, computed as the frequency of each ASCII code in the path of the HTTP request (see the sketch below).
Classification of HTTP vectors in the 95-dim space vs. in a reduced-dimension space? Which dimensions to choose? Which one is better?
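A small sketch of the representation just described, with assumed details (the helper name char_distribution and the example request string are hypothetical): map a request path to a 95-dimensional vector of character frequencies for ASCII codes 33 to 127.

```python
import numpy as np

def char_distribution(path):
    # One bin per ASCII code in [33, 127]; 95 bins in total.
    counts = np.zeros(95)
    for ch in path:
        code = ord(ch)
        if 33 <= code <= 127:
            counts[code - 33] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts   # relative frequencies

vec = char_distribution("/index.php?id=1%27%20OR%20%271%27=%271")
print(vec.shape, vec.sum())   # (95,) 1.0
```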
Individual Feature Ranking (1) by AUC
1. Rank the features by AUC: 1 means most related, 0.5 means most unrelated.
[Figure: ROC curve of a single-variable classifier on x_i; AUC is the area under the curve, with the false positive rate on the horizontal axis]
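A minimal sketch of single-variable AUC ranking, assuming scikit-learn and random data (the helper rank_by_auc is hypothetical): each feature is used alone as a score for the class, and features whose AUC is farthest from 0.5 rank highest.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rank_by_auc(X, y):
    # An anti-correlated feature (AUC < 0.5) is as useful as a correlated one,
    # so keep the larger of AUC and 1 - AUC.
    aucs = np.array([max(roc_auc_score(y, X[:, i]), 1 - roc_auc_score(y, X[:, i]))
                     for i in range(X.shape[1])])
    return np.argsort(-aucs), aucs      # best features (AUC closest to 1) first

X = np.random.rand(300, 20)
y = np.random.randint(0, 2, 300)
order, aucs = rank_by_auc(X, y)
print(order[:5], aucs[order[:5]])
```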
Individual Feature Ranking (2) by Mutual Information
2. Rank the features by mutual information I(i): the higher I(i), the more related attribute x_i is to class y.

Mutual information between each variable and the target:

  I(i) = Σ_{x_i} Σ_y P(X = x_i, Y = y) log [ P(X = x_i, Y = y) / (P(X = x_i) P(Y = y)) ]

• P(Y = y): frequency count of class y
• P(X = x_i): frequency count of attribute value x_i
• P(X = x_i, Y = y): frequency count of attribute value x_i given class y
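A sketch of this ranking for discrete features, assuming scikit-learn (the helper rank_by_mi is hypothetical); for continuous features one would discretize first or use sklearn.feature_selection.mutual_info_classif instead.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def rank_by_mi(X_discrete, y):
    # I(i) between each discrete feature and the class labels.
    mi = np.array([mutual_info_score(y, X_discrete[:, i])
                   for i in range(X_discrete.shape[1])])
    return np.argsort(-mi), mi           # highest I(i) first

X = np.random.randint(0, 5, size=(300, 20))   # 20 discrete features
y = np.random.randint(0, 2, 300)
order, mi = rank_by_mi(X, y)
print(order[:5], mi[order[:5]])
```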
Individual Feature Ranking (3) with continuous target
3. Rank features by the Pearson correlation coefficient R(i):
• detects linear dependencies between a variable and the target
• rank features by R(i) or R²(i) (linear regression); |R(i)| near 1: related, near 0: unrelated
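A sketch of correlation-based ranking with a continuous target, using assumed random data (the helper rank_by_correlation is hypothetical); features are ordered by |R(i)|.

```python
import numpy as np

def rank_by_correlation(X, y):
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson R(i) for every feature at once.
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12)
    return np.argsort(-np.abs(r)), r     # rank by |R(i)| (or R^2)

X = np.random.rand(200, 15)
y = 2.0 * X[:, 3] + 0.1 * np.random.randn(200)   # target linearly tied to feature 3
order, r = rank_by_correlation(X, y)
print(order[:3])                                  # feature 3 should rank first
```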
Individual Feature Ranking (4) by T-test
• Null hypothesis H_0: μ_+ = μ_− (x_i and Y are independent)
• Relevance index = test statistic
• T statistic: if H_0 is true,

  t(i) = (μ_+ − μ_−) / ( σ_within √(1/n_+ + 1/n_−) ) ~ Student with (n_+ + n_− − 2) d.f.,

  where n_+ and n_− are the numbers of samples with label + and −, and

  σ²_within = [ (n_+ − 1) σ²_+ + (n_− − 1) σ²_− ] / (n_+ + n_− − 2)

4. Rank by the p-value (false positive rate): the lower the p-value, the more related x_i is to class y.
[Figure: class-conditional distributions of x_i with means m_−, m_+ and standard deviations s_−, s_+]
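A sketch of t-test ranking with scipy, under assumed random data (the helper rank_by_ttest is hypothetical): each feature gets a two-sample t-test p-value, and the smallest p-values rank first.

```python
import numpy as np
from scipy import stats

def rank_by_ttest(X, y):
    # Pooled-variance two-sample t-test per feature; lower p-value = more related.
    pvals = np.array([stats.ttest_ind(X[y == 1, i], X[y == 0, i],
                                      equal_var=True).pvalue
                      for i in range(X.shape[1])])
    return np.argsort(pvals), pvals

X = np.random.randn(300, 20)
y = np.random.randint(0, 2, 300)
X[y == 1, 0] += 1.5                    # shift feature 0 for the positive class
order, pvals = rank_by_ttest(X, y)
print(order[:3])                       # feature 0 should rank first
```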
Individual Feature Ranking (5) by Fisher Score
• Fisher discrimination: F = between-class variance / pooled within-class variance
• Two-class case:

  F(i) = n_+ n_− (μ_+ − μ_−)² / [ (n_+ + n_−) (n_+ σ²_+ + n_− σ²_−) ]

5. Rank by the F value: the higher F, the more related x_i is to class y.
[Figure: class-conditional distributions of x_i with means m_−, m_+ and standard deviations s_−, s_+]
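A sketch of the two-class Fisher score as written above, with assumed random data (the helper fisher_score is hypothetical): between-class separation over pooled within-class spread, computed per feature.

```python
import numpy as np

def fisher_score(X, y):
    Xp, Xm = X[y == 1], X[y == 0]
    n_p, n_m = len(Xp), len(Xm)
    # Between-class separation per feature.
    num = n_p * n_m * (Xp.mean(axis=0) - Xm.mean(axis=0)) ** 2
    # Pooled within-class variance per feature (with a small constant for safety).
    den = (n_p + n_m) * (n_p * Xp.var(axis=0) + n_m * Xm.var(axis=0)) + 1e-12
    return num / den

X = np.random.randn(300, 20)
y = np.random.randint(0, 2, 300)
X[y == 1, 2] += 2.0                   # make feature 2 discriminative
F = fisher_score(X, y)
print(np.argsort(-F)[:3])             # feature 2 should rank first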
Rank features in HTTP logs
FS method                     SVM AUC   Accuracy   Features
All 95 features               0.8797    97.92%     (all 95 characters)
AUC-ranking (D=30)            0.9212    97.60%     #.128:;?BLOQ[\]_aefhiklmoptuw|
MI-ranking (D=30)             0.8849    97.96%     "#./1268;ALPQRS[\]_`aehkltwyz|
R-ranking (D=30)              0.9208    97.67%     "#,.2:;?LQS[\]_`aehiklmoptuwz|
T-test ranking (D=30)         0.9208    97.67%     "#,.2:;?LQS[\]_`aehiklmoptuwz|
Fisher score ranking (D=30)   0.9208    97.67%     "#,.2:;?LQS[\]_`aehiklmoptuwz|
PCA (D=30, unsupervised)      0.8623    97.74%     Constructed features

[Figure: ROC curves (true positive rate vs. false positive rate) comparing all features with ranking by AUC, MI, correlation coefficient, t-test, and Fisher score]

Demo: http://www.lri.fr/~xlzhang/KAUST/CS340_slides/FS_rank_demo.zip
Issues of individual features ranking
• Relevance vs usefulness: relevance does not imply usefulness; usefulness does not imply relevance
• Leads to the selection of a redundant subset: the k best features != the best k features
• A variable that is useless by itself can be useful together with others
Useless features become useful
• Separation is gained by using two variables instead of one, or by adding variables
• Ranking variables individually and independently of each other cannot determine which combination of variables would give the best performance.
Outline
• Introduction
• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization
• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature subset selection
    o Filters
    o Wrappers
• Summary
Multivariate Feature Selection is complex
Kohavi-John, 1997
M features → 2^M possible feature subsets!
Objectives of feature selection
Questions before subset feature selection
1. How to search the space of all possible variable subsets?
2. Do we use the prediction performance to guide the search?
   • No → Filter
   • Yes → Wrapper
     1) how to assess the prediction performance of a learning machine to guide the search and halt it
     2) which predictor to use: popular predictors include decision trees, Naive Bayes, least-squares linear predictors, and SVM
Filter: Feature subset selection
All features → Filter → Feature subset → Predictor
The feature subset is chosen by an evaluation criterion, which measures the relation of each subset of input variables to the class, e.g., the correlation-based feature selector (CFS): prefer subsets that contain features highly correlated with the class and uncorrelated with each other.

  M(subset {f_i, i = 1 … k}) = k · R̄_cf / √( k + k(k−1) · R̄_ff )

• R̄_cf: mean feature-class correlation, i.e., how predictive of the class the set of features is
• R̄_ff: average feature-feature intercorrelation, i.e., how much redundancy there is among the feature subset

(A small computation sketch follows below.)
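A sketch of the CFS merit for a candidate subset, with assumptions: correlations are estimated with Pearson's coefficient on random data, and the helper cfs_merit is hypothetical.

```python
import numpy as np

def cfs_merit(X, y, subset):
    k = len(subset)
    # Mean feature-class correlation (absolute values).
    r_cf = np.mean([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in subset])
    # Average feature-feature intercorrelation over all pairs in the subset.
    if k > 1:
        r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                        for a, i in enumerate(subset) for j in subset[a + 1:]])
    else:
        r_ff = 0.0
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

X = np.random.rand(200, 10)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
print(cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [4, 5]))   # first should be higher
```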
Filter: Feature subset selection (2)
All features → Filter → Feature subset → Predictor

Search all possible feature subsets, evaluating M(subset {f_i, i = 1 … k}) for k = 1, …, M?
– exhaustive enumeration
– forward selection
– backward elimination
– best first, forward/backward with a stopping criterion

The filter method is a pre-processing step, which is independent of the learning algorithm.
Forward Selection
[Diagram: search starts from the empty set; the candidate pool shrinks n → n−1 → n−2 → … → 1]
Sequential forward selection (SFS): features are sequentially added to an empty candidate set until the addition of further features does not decrease the criterion.
Backward Elimination
[Diagram: search starts from the full set; the candidate pool shrinks n → n−1 → n−2 → … → 1]
Sequential backward selection (SBS): features are sequentially removed from a full candidate set until the removal of further features increases the criterion.
(A sketch of both directions follows below.)
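A minimal sketch of sequential forward/backward selection, assuming scikit-learn's SequentialFeatureSelector (available in sklearn >= 0.24), a logistic-regression predictor, and random data; all of these choices are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X = np.random.rand(300, 20)
y = np.random.randint(0, 2, 300)

clf = LogisticRegression(max_iter=1000)
# direction="forward" gives SFS; direction="backward" gives SBS.
sfs = SequentialFeatureSelector(clf, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(np.where(sfs.get_support())[0])   # indices of the selected features
```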
Wrapper: Feature selection methods
All features → Multiple feature subsets → Predictor (wrapper loop)

• The learning model is used as part of the evaluation function and also to induce the final model.
• Subsets of features are scored according to their predictive power.
• The parameters of the model are optimized by measuring some cost function.

Danger of over-fitting with intensive search!
RFE SVM
Recursive Feature Elimination (RFE) SVM.
Guyon-Weston, 2000. US patent 7,117,188
Start with all features, then loop: is there performance degradation? Yes → stop; No → continue.

1: repeat
2:   Find w and b by training a linear SVM.
3:   Remove the feature with the smallest value |w_i|.
4: until the desired number of features remain.

(A sketch using scikit-learn follows below.)
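A sketch of this recipe using scikit-learn's RFE wrapper around a linear SVM (an assumed stand-in for the patented RFE-SVM implementation), on random data with illustrative parameter choices.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X = np.random.rand(300, 95)
y = np.random.randint(0, 2, 300)

svm = LinearSVC(C=1.0, max_iter=5000)
rfe = RFE(estimator=svm, n_features_to_select=30, step=1)  # drop 1 feature per round
rfe.fit(X, y)                       # repeatedly trains the SVM, removing the smallest |w_i|
print(np.where(rfe.support_)[0])    # indices of the 30 surviving features
```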
Selecting feature subsets in HTTP logs
FS method                   SVM AUC   Accuracy   Features
All 95 features             0.8797    97.92%     (all 95 characters)
AUC-ranking (D=30)          0.9212    97.60%     #.128:;?BLOQ[\]_aefhiklmoptuw|
R-ranking (D=30)            0.9208    97.67%     "#,.2:;?LQS[\]_`aehiklmoptuwz|
SFS Gram-Schmidt (D=30)     0.8914    97.85%     #&,./25:;=?DFLQ[\_`ghklmptwxz|
RFE SVM (D=30)              0.9174    97.85%     "#',-.0238:;
…
Comparison of Filter and Wrapper:
• Main goal: rank subsets of useful features
• Search strategies: explore the space of all possible feature combinations
• Two criteria: predictive power (maximize) and subset size (minimize)
• Predictive power assessment:
  – Filter methods: criteria not involving any learning machine, e.g., a relevance index based on correlation coefficients or test statistics
  – Wrapper methods: the performance of a learning machine trained using a given feature subset
• Wrappers are potentially very time consuming, since they typically need to evaluate a cross-validation scheme at every iteration.
• Filter methods are much faster, but they do not incorporate learning.
Forward Selection w. Trees
• Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
[Diagram: a tree splitting all the data, first choosing f_1, then choosing f_2]
• At each step, choose the feature that "reduces entropy" most; work towards "node purity".
• Feature subset selection by Random Forest (see the sketch below)
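A sketch of feature selection with a Random Forest, under assumed random data and parameters: rank features by impurity-based importance and keep the top ones.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(300, 20)
y = np.random.randint(0, 2, 300)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = rf.feature_importances_       # impurity reduction credited to each feature
top = np.argsort(-importances)[:5]          # keep the 5 most important features
print(top, importances[top])
```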
Outline
• Introduction
• Unsupervised Feature Selection
  - Clustering
  - Matrix Factorization
• Supervised Feature Selection
  - Individual Feature Ranking (Single Variable Classifier)
  - Feature subset selection
    o Filters
    o Wrappers
• Summary
Conclusion
Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y.
Univariate feature selection: how to rank the features?
Multivariate (subset) feature selection (Filter, Wrapper, Embedded): how to search the subsets of features? How to evaluate the subsets of features?
Feature extraction: how to construct new features in linear/non-linear ways?
In practice
• No method is universally better:
  - wide variety of types of variables, data distributions, learning machines, and objectives.
• Match the method complexity to the ratio M/N:
  - univariate feature selection may work better than multivariate feature selection;
  - non-linear classifiers are not always better.
• Feature selection is not always necessary to achieve good performance.
NIPS 2003 and WCCI 2006 challenges :
http://clopinet.com/challenges
Feature selection toolbox
• Matlab: sequentialfs (sequential feature selection, shown in the demo)
  - Forward: good
  - Backward: be careful about the definition of the criterion
• Feature Selection Toolbox 3: software in C++
  - freely available and open-source
• Weka
Reference
• An Introduction to Variable and Feature Selection, Isabelle Guyon and André Elisseeff, JMLR, 2003.
• Feature Extraction: Foundations and Applications, Isabelle Guyon et al. (Eds.), Springer, 2006. http://clopinet.com/fextract-book
• Pabitra Mitra, C. A. Murthy, and Sankar K. Pal (2002). "Unsupervised Feature Selection Using Feature Similarity." IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3).
• Prof. Marc Van Hulle, Katholieke Universiteit Leuven, http://134.58.34.50/~marc/DM_course/slides_selection.pdf