Transcript CS340 Data Mining: feature selection
AMCS/CS 340 : Data Mining
Feature Selection
Xiangliang Zhang King Abdullah University of Science and Technology
Outline
• Introduction
• Unsupervised Feature Selection
  – Clustering
  – Matrix Factorization
• Supervised Feature Selection
  – Individual Feature Ranking (Single Variable Classifier)
  – Feature subset selection
    o Filters
    o Wrappers
• Summary
Problems due to poor variable selection
• The input dimension is too large; the curse of dimensionality may occur.
• A poor model may be built when unrelated inputs are added or relevant inputs are missing.
• Complex models with too many inputs are more difficult to understand.
Applications
• OCR (optical character recognition)
• HWR (handwriting recognition)
Benefits of feature selection
• Facilitating data visualization
• Data understanding
• Reducing the measurement and storage requirements
• Reducing training and utilization times
• Defying the curse of dimensionality to improve prediction performance
Feature Selection/Extraction
Thousands to millions of low-level features: select/extract the most relevant ones to build better, faster, and easier-to-understand learning machines.
[Figure: data matrix X with N samples and m features {F_j}, reduced to d features {f_i}, d < m; Y is the label vector]
• Using label Y: supervised
• Without label Y: unsupervised

Feature Selection vs Extraction
Selection:
• choose a best subset of size d from the m features
• {f_i} is a subset of {F_j}, i = 1, …, d and j = 1, …, m
Extraction:
• extract d new features by linear or non-linear combination of all the m features
  – Linear/non-linear feature extraction: {f_i} = f({F_j})
• New features may not have a physical interpretation/meaning

Outline
• Unsupervised Feature Selection

Feature Selection by Clustering
• Group features into clusters
• Replace (many) similar variables in one cluster by a (single) cluster centroid
• E.g., K-means, hierarchical clustering

Example of student project
Abdullah Khamis, AMCS/CS340 2010 Fall, “Statistical Learning Based System for Text Classification”

Other unsupervised FS methods
• Matrix Factorization
  o PCA (Principal Component Analysis): use PCs with largest eigenvalues as “features”
  o SVD (Singular Value Decomposition): use singular vectors with largest singular values as “features”
  o NMF (Non-negative Matrix Factorization)
• Nonlinear Dimensionality Reduction
  o Isomap
  o LLE (Locally Linear Embedding)

Outline
• Supervised Feature Selection

Feature Ranking
• Build better, faster, and easier-to-understand learning machines
• Discover the most relevant features w.r.t. the target label, e.g., find genes that discriminate between healthy and diseased patients
[Figure: data matrix X (N samples, m features) reduced to d features: rank of useful features]
– Eliminate useless features (distracters).
– Rank useful features.
– Eliminate redundant features.

Example of detecting attacks in real HTTP logs
Request types shown: a common request, a JS XSS attack, a remote file inclusion attack, a DoS attack.
Represent each HTTP request by a vector
• in 95 dimensions, corresponding to the 95 types of ASCII code (between 33 and 127)
• of character distribution, computed as the frequency of each ASCII code in the path source of an HTTP request.
Classification of HTTP vectors in 95 dimensions vs. in a reduced-dimension space? Which dimensions to choose? Which one is better?

Individual Feature Ranking (1) by AUC
1. Rank the features by AUC
• 1: most related
• 0.5: most unrelated
[Figure: ROC curve of a single feature x_i, with the false positive rate on the horizontal axis; AUC is the area under the curve]

Individual Feature Ranking (2) by Mutual Information
2. Rank the features by mutual information I(i)
The higher I(i), the more the attribute x_i is related to class y.
Mutual information between each variable and the target:
$$I(i) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log \frac{P(X = x_i, Y = y)}{P(X = x_i)\,P(Y = y)}$$
• P(Y = y): frequency count of class y
• P(X = x_i): frequency count of attribute value x_i
• P(X = x_i, Y = y): frequency count of attribute value x_i occurring together with class y

Individual Feature Ranking (3) with continuous target
3. Rank features by Pearson correlation coefficient
$$R(i) = \frac{\mathrm{cov}(X_i, Y)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(Y)}}$$
• detects linear dependencies between variable and target
• rank features by R(i) or R^2(i) (linear regression)
• 1: related; 0: unrelated

Individual Feature Ranking (4) by T-test
• Null hypothesis H_0: μ_+ = μ_- (x_i and Y are independent)
• Relevance index: test statistic
• T statistic: if H_0 is true,
$$t = \frac{\mu_+ - \mu_-}{\sigma_{\text{within}} \sqrt{\frac{1}{n_+} + \frac{1}{n_-}}} \sim \text{Student}(n_+ + n_- - 2 \text{ d.f.})$$
where n_+ and n_- are the numbers of samples with label + and label −, and
$$\sigma_{\text{within}}^2 = \frac{(n_+ - 1)\,\sigma_+^2 + (n_- - 1)\,\sigma_-^2}{n_+ + n_- - 2}$$
4. Rank by p-value (false positive rate): the lower the p-value, the more x_i is related to class y.
[Figure: class-conditional distributions of x_i with means μ_-, μ_+ and standard deviations σ_-, σ_+]

Individual Feature Ranking (5) by Fisher Score
• Fisher discrimination
• Two-class case: F = between-class variance / pooled within-class variance
5. Rank by F value: the higher F, the more x_i is related to class y.
[Figure: class-conditional distributions of x_i with means μ_-, μ_+ and standard deviations σ_-, σ_+]

Rank features in HTTP logs
SVM results:
FS                             AUC     Accuracy  Features
All 95 features                0.8797  97.92%
AUC-ranking (D=30)             0.9212  97.60%    #.128:;?BLOQ[\]_aefhiklmoptuw|
MI-ranking (D=30)              0.8849  97.96%    "#./1268;ALPQRS[\]_`aehkltwyz|
R-ranking (D=30)               0.9208  97.67%    "#,.2:;?LQS[\]_`aehiklmoptuwz|
T-test ranking (D=30)          0.9208  97.67%    "#,.2:;?LQS[\]_`aehiklmoptuwz|
Fisher score ranking (D=30)    0.9208  97.67%    "#,.2:;?LQS[\]_`aehiklmoptuwz|
PCA (D=30, unsupervised)       0.8623  97.74%    Constructed features
[Figure: ROC curves (false positive rate on the horizontal axis) for all features and for ranking by AUC, MI, correlation coefficient, t-test, and Fisher score]
Demo: http://www.lri.fr/~xlzhang/KAUST/CS340_slides/FS_rank_demo.zip

Issues of individual features ranking
• Relevance vs usefulness: relevance does not imply usefulness, and usefulness does not imply relevance.
• Leads to the selection of a redundant subset: k best features != best k features
• A variable that is useless by itself can be useful with others

Useless features become useful
• Separation is gained by using two variables instead of one, or by adding variables
• Ranking variables individually and independently of each other is at a loss to determine which combination of variables would give the best performance.

Outline
• Supervised Feature Selection

Multivariate Feature Selection is complex
Kohavi-John, 1997
M features, 2^M possible feature subsets!
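The individual ranking criteria from the preceding slides can be sketched in a few lines. This is a minimal illustration with NumPy, not the course demo: the synthetic data, the helper names (`t_statistic`, `pearson_r`, `rank_features`) and the column layout are assumptions made for the example. It ranks features by |t| and |R|, and includes an XOR-style pair (`a`, `b`) where each variable alone is unrelated to y but the two together determine it, which is the "useless by itself, useful with others" case.

```python
import numpy as np

def t_statistic(x, y):
    """Pooled-variance two-sample t statistic of one feature; y holds labels +1 / -1."""
    pos, neg = x[y == 1], x[y == -1]
    n_pos, n_neg = len(pos), len(neg)
    var_within = ((n_pos - 1) * pos.var(ddof=1)
                  + (n_neg - 1) * neg.var(ddof=1)) / (n_pos + n_neg - 2)
    return (pos.mean() - neg.mean()) / np.sqrt(var_within * (1 / n_pos + 1 / n_neg))

def pearson_r(x, y):
    """Pearson correlation coefficient between one feature and the target."""
    return np.corrcoef(x, y)[0, 1]

def rank_features(X, y, score):
    """Return feature indices sorted by decreasing |score|, plus the |score| values."""
    scores = np.array([abs(score(X[:, j], y)) for j in range(X.shape[1])])
    return np.argsort(-scores), scores

# Hypothetical synthetic data (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 400
y = rng.choice([-1, 1], size=n)
shifted = y + rng.normal(size=n)       # class mean shifts with the label
noise = rng.normal(size=n)             # unrelated to the label
a = rng.choice([-1.0, 1.0], size=n)    # alone: unrelated to y
b = a * y                              # alone: unrelated to y, but a * b == y
X = np.column_stack([shifted, noise, a, b])

t_order, t_scores = rank_features(X, y, t_statistic)
r_order, r_scores = rank_features(X, y, pearson_r)
print("ranking by |t|:", t_order)
print("ranking by |R|:", r_order)
print("sign(a * b) reproduces y:", np.array_equal(np.sign(a * b), y))
```

Both criteria score `a` and `b` about as low as pure noise, even though the pair separates the classes perfectly. That is why individual ranking cannot tell which combination of variables works best.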
Questions before subset feature selection
1. How to search the space of all possible variable subsets?
2. Do we use the prediction performance to guide the search?
• No: Filter
• Yes: Wrapper
  1) how to assess the prediction performance of a learning machine to guide the search and halt it
  2) which predictor to use: popular predictors include decision trees, Naive Bayes, least-square linear predictors, and SVM

Filter: Feature subset selection
All features → Filter → Feature subset → Predictor
The feature subset is chosen by an evaluation criterion, which measures the relation of each subset of input variables to the class, e.g., the correlation-based feature selector (CFS), which prefers subsets containing features that are highly correlated with the class and uncorrelated with each other:
$$M(\text{subset } \{f_i,\ i = 1 \dots k\}) = \frac{k\,\overline{R_{cf_i}}}{\sqrt{k + k(k-1)\,\overline{R_{f_i f_j}}}}$$
• Numerator: how predictive of the class a set of features is; \overline{R_{cf_i}} is the mean feature-class correlation
• Denominator: how much redundancy there is among the feature subset; \overline{R_{f_i f_j}} is the average feature-feature intercorrelation

Filter: Feature subset selection (2)
All features → Filter → Feature subset → Predictor
Search in all possible feature subsets, k = 1, …, M, scored by M(subset {f_i, i = 1 … k})?
– exhaustive enumeration
– forward selection
– backward elimination
– best first, forward/backward with a stopping criterion
A filter method is a pre-processing step, which is independent of the learning algorithm.
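As a rough sketch of how the CFS criterion and a forward search fit together, the following NumPy example computes the merit M from absolute Pearson correlations and greedily adds the feature that raises it most. The function names (`cfs_merit`, `forward_select`), the use of absolute Pearson correlation as the correlation measure, and the synthetic data are assumptions for illustration; the original CFS method may use a different correlation measure.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit k*mean(R_cf) / sqrt(k + k(k-1)*mean(R_ff)) using |Pearson| correlations."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf  # no feature-feature pairs; the formula reduces to R_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for pos, i in enumerate(subset) for j in subset[pos + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def forward_select(X, y):
    """Greedy forward search: add the best feature until the merit stops improving."""
    selected, best = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = merit
    return selected, best

# Hypothetical synthetic data (assumption): x1 nearly duplicates x0, x3 is pure noise.
rng = np.random.default_rng(1)
n = 1000
y = rng.choice([-1.0, 1.0], size=n)
x0 = y + rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)
x2 = y + rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2, x3])

selected, merit = forward_select(X, y)
print("selected features:", selected, "merit:", round(float(merit), 3))
print("merit {x0, x2}:", cfs_merit(X, y, [0, 2]), "merit {x0, x1}:", cfs_merit(X, y, [0, 1]))
```

The redundant pair {x0, x1} scores lower than the complementary pair {x0, x2} because the denominator penalizes feature-feature correlation. No learning machine is trained at any point, which is what makes this a filter.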
Forward Selection
Sequential forward selection (SFS): features are sequentially added to an empty candidate set until the addition of further features does not decrease the criterion.
[Figure: search path from Start, with n, n-1, n-2, …, 1 candidate features left at successive steps]

Backward Elimination
Sequential backward selection (SBS): features are sequentially removed from a full candidate set until the removal of further features increases the criterion.
[Figure: search path from Start at the full set, with 1, …, n-2, n-1, n features]

Wrapper: Feature selection methods
All features → Wrapper (multiple feature subsets ↔ Predictor)
• The learning model is used as a part of the evaluation function and also to induce the final learning model
• Subsets of features are scored according to their predictive power
• Optimizing the parameters of the model by measuring some cost functions
• Danger of over-fitting with intensive search!

RFE SVM
Recursive Feature Elimination (RFE) SVM. Guyon-Weston, 2000. US patent 7,117,188
[Flowchart: All features → Performance degradation? Yes: stop. No: continue…]
1: repeat
2: Find w and b by training a linear SVM.
3: Remove the feature with the smallest value |w_i|
4: until a desired number of features remain.

Selecting feature subsets in HTTP logs
SVM results:
FS                             AUC     Accuracy  Features
All 95 features                0.8797  97.92%
AUC-ranking (D=30)             0.9212  97.60%    #.128:;?BLOQ[\]_aefhiklmoptuw|
R-ranking (D=30)               0.9208  97.67%    "#,.2:;?LQS[\]_`aehiklmoptuwz|
SFS Gram-Schmidt (D=30)        0.8914  97.85%    #&,./25:;=?DFLQ[\_`ghklmptwxz|
RFE SVM (D=30)                 0.9174  97.85%    "#',-.0238:;
….
….
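The four-step RFE loop from the RFE SVM slide can be sketched as follows. To keep the example self-contained, the linear SVM is trained with a small batch subgradient descent on the regularized hinge loss; this stand-in trainer, its hyperparameters (`lam`, `epochs`, `lr`), the function names and the synthetic data are all assumptions for illustration, and a real SVM solver would normally be used instead.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Stand-in linear SVM: batch subgradient descent on the L2-regularized hinge loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1             # samples inside or violating the margin
        y_act = y[active][:, None]
        grad_w = lam * w - (y_act * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def rfe(X, y, n_keep):
    """Recursive feature elimination: drop the feature with the smallest |w_i| each round."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:               # 4: until a desired number of features remain
        w, _ = train_linear_svm(X[:, remaining], y)   # 2: find w and b
        remaining.pop(int(np.argmin(np.abs(w))))      # 3: remove the weakest feature
    return remaining

# Hypothetical synthetic data (assumption): only columns 0 and 3 carry class information.
rng = np.random.default_rng(0)
n = 300
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, 6))
X[:, [0, 3]] += 3.0 * y[:, None]

kept = rfe(X, y, n_keep=2)
print("features kept by RFE:", sorted(kept))
```

Because the model is retrained after every removal, the weights are re-estimated on the surviving features each round. This is what distinguishes RFE from a one-shot ranking by |w_i|.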
Comparison of Filter and Wrapper
• Main goal: rank subsets of useful features
• Search strategies: explore the space of all possible feature combinations
• Two criteria: predictive power (maximize) and subset size (minimize).
• Predictive power assessment:
  – Filter methods: criteria not involving any learning machine, e.g., a relevance index based on correlation coefficients or test statistics
  – Wrapper methods: the performance of a learning machine trained using a given feature subset
• Wrapper methods are potentially very time-consuming, since they typically need to run a cross-validation scheme at every iteration.
• Filter methods are much faster, but they do not incorporate learning.

Forward Selection w. Trees
• Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
• At each step, choose the feature that “reduces entropy” most. Work towards “node purity”.
• Feature subset selection by Random Forest
[Figure: all the data is split first on f_1 (choose f_1), then on f_2 (choose f_2)]

Outline
• Summary

Conclusion
Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y.
• Univariate feature selection
  – How to rank the features?
• Multivariate (subset) feature selection: Filter, Wrapper, Embedded
  – How to search the subset of features?
  – How to evaluate the subsets of features?
• Feature extraction
  – How to construct new features in linear/non-linear ways?

In practice
• No method is universally better:
  – wide variety of types of variables, data distributions, learning machines, and objectives.
• Match the method complexity to the ratio M/N:
  – univariate feature selection may work better than multivariate feature selection;
  – non-linear classifiers are not always better.
• Feature selection is not always necessary to achieve good performance.
NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges

Feature selection toolbox
• Matlab: sequentialfs (sequential feature selection, shown in demo)
  – Forward: good
  – Backward: be careful with the definition of the criterion
• Feature Selection Toolbox 3: software in C++, freely available and open-source
• Weka

Reference
• Isabelle Guyon and André Elisseeff. "An introduction to variable and feature selection." JMLR, 2003.
• Isabelle Guyon et al., Eds. "Feature Extraction, Foundations and Applications." Springer, 2006. http://clopinet.com/fextract-book
• Pabitra Mitra, C. A. Murthy, and Sankar K. Pal. "Unsupervised Feature Selection Using Feature Similarity." IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 2002.
• Prof. Marc Van Hulle, Katholieke Universiteit Leuven, http://134.58.34.50/~marc/DM_course/slides_selection.pdf

Feature Selection vs Extraction
Outline
• Unsupervised Feature Selection
Feature Selection by Clustering
Example of student project
Other unsupervised FS methods
Outline
• Supervised Feature Selection
Feature Ranking
Individual Feature Ranking (1) by AUC
Individual Feature Ranking (2) by Mutual Information
Individual Feature Ranking (3) with continuous target
Individual Feature Ranking (4) by T-test
Individual Feature Ranking (5) by Fisher Score
Rank features in HTTP logs
Issues of individual features ranking
Useless features become useful
Outline
• Supervised Feature Selection
Multivariate Feature Selection is complex
Objectives of feature selection
Questions before subset feature selection
Filter: Feature subset selection
Filter: Feature subset selection (2)
Forward Selection
Backward Elimination
Wrapper: Feature selection methods
RFE SVM
Selecting feature subsets in HTTP logs
Comparison of Filter and Wrapper
Forward Selection w. Trees
Outline
• Summary
Conclusion
In practice
Feature selection toolbox
Reference