Variable and Feature Selection in Machine Learning (Review)


Variable - / Feature Selection
in Machine Learning
(Review)
Martin Bachler
[email protected]
MLA - VO
15.11.2005
Based on: Isabelle Guyon, André Elisseeff. An Introduction to variable and feature selection. JMLR, 3 (2003) 1157-1182
Overview
• Introduction/Motivation
WHY ?
• Basic definitions, Terminology
WHAT ?
• Variable Ranking methods
• Feature subset selection
HOW ?
2/54
Problem: Where to focus attention ?
• A universal problem of intelligent (learning)
agents is where to focus their attention.
• What aspects of the problem at hand are
important/necessary to solve it?
• Discriminate between the relevant and
irrelevant parts of experience.
3/54
What is Feature selection ?
• Feature selection:
Problem of selecting some subset of a
learning algorithm’s input variables upon
which it should focus attention, while ignoring
the rest
(DIMENSIONALITY REDUCTION)
• Humans/animals do that constantly!
4/54
Motivational example from Biology
[1]
Monkeys performing classification task
[1] Natasha Sigala, Nikos Logothetis: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature, Vol. 415 (2002)
5/54
Motivational example from Biology
Monkeys performing classification task
Diagnostic features:
- Eye separation
- Eye height
Non-Diagnostic features:
- Mouth height
- Nose length
6/54
Motivational example from Biology
Monkeys performing classification task
Results:
– activity of a population of 150 neurons in the
anterior inferior temporal cortex was measured
– 44 neurons responded significantly differently to
at least one feature
– After Training: 72% (32/44) were selective to one
or both of the diagnostic features (and not for the
non-diagnostic features)
7/54
Motivational example from Biology
Monkeys performing classification task
Results:
(population
of neurons)
8/54
Motivational example from Biology
Monkeys performing classification task
Results:
(single neurons)
„The data from the present
study indicate that neuronal
selectivity was shaped by the
most relevant subset of
features during the
categorization training.“
9/54
Feature Selection in ML ?
Why even think about Feature Selection in ML?
- The information about the target class is inherent in the
variables!
- Naive theoretical view:
More features
=> More information
=> More discrimination power.
- In practice:
many reasons why this is not the case!
- Also:
Optimization is (usually) good, so why not try to optimize the
input-coding ?
10/54
Feature Selection in ML ? YES!
- Many explored domains have hundreds to tens of
thousands of variables/features with many irrelevant and
redundant ones!
- In domains with many features the underlying probability
distribution can be very complex and very hard to
estimate (e.g. dependencies between variables) !
- Irrelevant and redundant features can „confuse“ learners!
- Limited training data!
- Limited computational resources!
- Curse of dimensionality!
11/54
Curse of dimensionality
[Figure: scatter plots of positive and negative examples over the axes x1, x2, x3, showing the same sample set in increasing dimensions]
12/54
Curse of dimensionality
• The required number of samples (to achieve the
same accuracy) grows exponentially with the
number of variables!
• In practice: number of training examples is fixed!
=> the classifier’s performance usually will degrade
for a large number of features!
In many cases the information
that is lost by discarding variables
is made up for by a more
accurate mapping/sampling in the
lower-dimensional space !
13/54
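To make this exponential growth concrete, here is a small illustrative sketch (not from the original slides): if the input space is discretized into a fixed number of bins per axis, the number of grid cells, and hence the minimum number of samples needed to see every cell at least once, grows exponentially with the dimension.

```python
# Illustrative only: covering [0, 1]^d with a regular grid of 10 bins per axis
# requires at least 10**d samples to put one point in every cell.
bins_per_axis = 10
for d in (1, 2, 3, 5, 10):
    cells = bins_per_axis ** d
    print(f"d = {d:2d}: at least {cells:,} samples to cover all grid cells")
```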
Example for ML-Problem
Gene selection from microarray data
– Variables:
gene expression coefficients corresponding to the
amount of mRNA in a patient‘s sample (e.g. tissue
biopsy)
– Task: Separate healthy patients from cancer patients
– Usually there are only about 100 examples (patients)
available for training and testing (!!!)
– Number of variables in the raw data: 6,000 – 60,000
– Does this work ? ([8])
[8] C. Ambroise, G.J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, Vol. 99, 6562-6566 (2002)
14/54
Example for ML-Problem
Text-Categorization
- Documents are represented by a vector whose
dimension is the size of the vocabulary, containing
word frequency counts
- Vocabulary ~ 15,000 words (i.e. each document
is represented by a 15,000-dimensional vector)
- Typical tasks:
- Automatic sorting of documents into web-directories
- Detection of spam-email
15/54
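As a rough sketch of the document representation described above (the toy vocabulary and document below are made up for illustration):

```python
# Minimal bag-of-words sketch: each document becomes a vector of word
# frequency counts over a fixed vocabulary.
vocabulary = ["free", "offer", "meeting", "report", "viagra"]
word_index = {w: i for i, w in enumerate(vocabulary)}

def to_count_vector(document):
    counts = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in word_index:
            counts[word_index[word]] += 1
    return counts

print(to_count_vector("free offer claim your free report today"))  # [2, 1, 0, 1, 0]
```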
Motivation
• Especially when dealing with a large number
of variables there is a need for
dimensionality reduction!
• Feature Selection can significantly improve a
learning algorithm’s performance!
16/54
Overview
• Introduction/Motivation
• Basic definitions, Terminology
• Variable Ranking methods
• Feature subset selection
17/54
Problem setting
• Classification/Regression (Supervised Learning):
Given empirical data (training data)
L = {(x1, y1), ..., (xi, yi), ..., (xm, ym)} ⊆ X × Y
a learner has to find a hypothesis h ∈ H, h: X → Y,
that is used to assign a label y to unseen x.
- Classification: y is an integer (e.g. Y = {-1,1})
- Regression: y is real-valued (e.g. Y = [-1,1])
18/54
Features/Variables
-
Take a closer look at the data:
X = f1 × ... × fi × ... × fn
-
i.e. each instance x has n
attributes, variables, features,
dimensions
x = (x1, ..., xn)
19/54
Feature Selection - Definition
• Given a set of features F{ f1 ,..., fi ,..., f n}
the Feature Selection problem is
to find a subset F '  F that “maximizes the learners
ability to classify patterns”.
Formally F’ should maximize some scoring function
 :   (where  is the space of all possible feature
subsets of F), i.e.
F
'
a
r
g
m
a
x

G




G


20/54
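The definition above can be written directly as a brute-force search; the sketch below (names are mine, and `score` stands for the scoring function Φ) is only meant to make explicit that the search space contains 2^n subsets.

```python
from itertools import combinations

def exhaustive_feature_selection(features, score):
    """Return the subset of `features` that maximizes `score`, by brute force.

    Enumerates all 2**n subsets, so it is only feasible for very small n;
    practical methods replace this search with heuristics (later slides).
    """
    best_subset, best_score = (), float("-inf")
    for k in range(len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score
```

For n = 20 features this already means evaluating about one million candidate subsets.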
Feature Extraction-Definition
• Given a set of features F{ f1 ,..., fi ,..., f n}
the Feature Extraction(“Construction”) problem is
is to map F to some feature set F '' that maximizes the
learner’s ability to classify patterns.
(again F
''
a
r
g
m
a
x

G


) *
G


• This general definition subsumes feature selection
(i.e. a feature selection algorithm also performs a
mapping but can only map to subsets of the input
variables)
* Γ here is the set of all possible feature sets
21/54
Feature Selection / Feature Extraction
• Feature Selection:
F = {f1, ..., fi, ..., fn}  --(feature selection)-->  F' = {f_i1, ..., f_ij, ..., f_im}
with i_j ∈ {1, ..., n} for j = 1, ..., m, and i_a ≠ i_b for a ≠ b (a, b = 1, ..., m)
• Feature Extraction/Creation:
F = {f1, ..., fi, ..., fn}  --(feature extraction)-->  F' = {g1(f1, ..., fn), ..., g_j(f1, ..., fn), ..., g_m(f1, ..., fn)}
22/54
Feature Selection – Optimality ?
• In theory the goal is to find an optimal feature-subset
(one that maximizes the scoring function)
• In real world applications this is usually not possible
– For most problems it is computationally intractable to search
the whole space of possible feature subsets
– One usually has to settle for approximations of the optimal
subset
– Most of the research in this area is devoted to finding efficient
search-heuristics
23/54
Optimal feature subset
– Often: Definition of optimal feature subset in terms of classifier’s
performance
– The best one can hope for theoretically is the Bayes error rate
– Given a learner I and training data L with features
F = {f1, ..., fi, ..., fn} an optimal feature subset Fopt is a subset of F
such that the accuracy of the learner’s hypothesis h is maximal
(i.e. its performance is equal to an optimal Bayes classifier)*.
• Fopt (under this definition) depends on I
• Fopt need not be unique
• Finding Fopt is usually computationally intractable
* for this definition a possible scoring function is 1 – true_error(h)
24/54
Relevance of features
• Relevance of a variable/feature:
– There are several definitions of relevance in literature
– Relevance of 1 variable, Relevance of a variable given
other variables, Relevance given a certain learning
algorithm,..
– Most definitions are problematic, because there are
problems where all features would be declared to be
irrelevant
– The authors of [2] define two degrees of relevance: weak
and strong relevance.
– A feature is relevant iff it is weakly or strongly relevant and
”irrelevant”(redundant) otherwise.
[2] Ron Kohavi, George H. John: Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2):273-324, December 1997 (special issue on relevance)
25/54
Relevance of features
• Strong Relevance of a variable/feature:
Let Si = {f1, …, fi-1, fi+1, …fn} be the set of all features
except fi. Denote by si a value-assignment to all
features in Si.
A feature fi is strongly relevant, iff there exists some
xi, y and si for which p(fi = xi, Si = si) > 0 such that
p(Y = y | fi = xi; Si = si) ≠ p(Y = y | Si = si)
This means that removal of fi alone will always
result in a performance deterioration of an optimal
Bayes classifier.
26/54
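To make the definition concrete, here is a small check on an explicit toy distribution (the XOR example that reappears later in the correlation discussion); the code and names are illustrative, not from the slides. Both inputs are strongly relevant, because conditioning on f1 changes p(Y | S1).

```python
from itertools import product

# Toy joint distribution p(f1, f2, y) with y = XOR(f1, f2) and f1, f2 uniform.
p = {(f1, f2, f1 ^ f2): 0.25 for f1, f2 in product((0, 1), repeat=2)}

def cond_prob_y(y, **given):
    """p(Y = y | given), computed exactly from the explicit joint distribution."""
    def matches(f1, f2):
        return all({"f1": f1, "f2": f2}[k] == v for k, v in given.items())
    num = sum(pr for (f1, f2, yy), pr in p.items() if yy == y and matches(f1, f2))
    den = sum(pr for (f1, f2, yy), pr in p.items() if matches(f1, f2))
    return num / den

# Strong relevance of f1: removing it changes the conditional distribution of Y,
# e.g. p(Y=1 | f2=0) = 0.5 but p(Y=1 | f1=0, f2=0) = 0.0.
print(cond_prob_y(1, f2=0))        # 0.5
print(cond_prob_y(1, f1=0, f2=0))  # 0.0
```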
Relevance of features
• Weak Relevance of a variable/feature:
A feature fi is weakly relevant, iff it is not strongly
relevant, and there exists a subset of features Si'
of Si for which there exists some xi, y and si' with
p(fi = xi, Si' = si') > 0 such that
p(Y = y | fi = xi; Si' = si') ≠ p(Y = y | Si' = si')
This means that there exists a subset of features Si',
such that the performance of an optimal Bayes
classifier on Si' is worse than on Si' ∪ {fi}.
27/54
Relevance of features
• Relevance ≠ Optimality of the feature set
– Classifiers induced from training data are likely to be
suboptimal (no access to the real distribution of the
data)
– Relevance does not imply that the feature is in the
optimal feature subset
– Even “irrelevant” features can improve a classifier‘s
performance
– Defining relevance in terms of a given classifier (and
therefore a hypothesis space) would be better.
28/54
Overview
• Introduction/Motivation
• Basic definitions, Terminology
• Variable Ranking methods
• Feature subset selection
29/54
Variable Ranking
• Given a set of features F,
Variable Ranking is the process of ordering the features
by the value of some scoring function S: F → ℝ (which
usually measures feature-relevance).
• Resulting set:
a permutation of F: F' = {f_i1, ..., f_ij, ..., f_in} with
S(f_ij) ≥ S(f_ij+1) for j = 1, ..., n−1.
The score S(fi) is computed from the training data,
measuring some criterion of feature fi.
• By convention a high score is indicative for a valuable
(relevant) feature.
30/54
Variable Ranking – Feature Selection
• A simple method for feature selection using variable
ranking is to select the k highest ranked features
according to S.
• This is usually not optimal
• but often preferable to other, more complicated
methods
• computationally efficient(!): only calculation and sorting
of n scores
31/54
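A minimal sketch of ranking-based feature selection as just described (function names are mine; the per-feature scoring function S is passed in as a callback):

```python
def rank_features(X, y, score):
    """Rank feature indices by a per-feature score S(f_i), highest first.

    X: list of m samples, each a list of n feature values.
    score: callable (feature_column, y) -> float.
    """
    n = len(X[0])
    columns = [[row[i] for row in X] for i in range(n)]
    scores = [score(col, y) for col in columns]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

def select_top_k(X, y, score, k):
    """Keep only the k highest-ranked features (simple ranking-based filter)."""
    keep = rank_features(X, y, score)[:k]
    return [[row[i] for i in keep] for row in X], keep
```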
Ranking Criteria – Correlation
Correlation Criteria:
• Pearson correlation coefficient
R(fi, y) = cov(fi, y) / √( var(fi) · var(y) )
• Estimate for m samples:
R(fi, y) = Σₖ (f_k,i − mean(fi)) (y_k − mean(y)) / √( Σₖ (f_k,i − mean(fi))² · Σₖ (y_k − mean(y))² ),  sums over k = 1, ..., m
The higher the correlation between the feature and the target, the higher the score!
32/54
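A sketch of the empirical estimate above, written directly from the formula (using numpy; the function name is mine). It can be plugged into the ranking sketch shown earlier as the `score` argument.

```python
import numpy as np

def pearson_score(f, y):
    """|R(f_i, y)| estimated from m paired samples, as in the formula above."""
    f = np.asarray(f, dtype=float)
    y = np.asarray(y, dtype=float)
    fc, yc = f - f.mean(), y - y.mean()
    denom = np.sqrt((fc ** 2).sum() * (yc ** 2).sum())
    return abs((fc * yc).sum() / denom) if denom > 0 else 0.0
```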
Ranking Criteria – Correlation
33/54
Ranking Criteria – Correlation
Correlation Criteria:
• R(xi, y) ∈ [−1, 1]
• mostly R(xi,y)² or |R(xi,y)| is used
• measure for the goodness of linear fit of xi and y.
(can only detect linear dependencies between
variable and target.)
• what if y = XOR(x1,x2) ?
• often used for microarray data analysis
34/54
Ranking Criteria – Correlation
Questions:
• Can variables with small score be automatically
discarded ?
• Can a useless variable (i.e. one with a small
score) be useful together with others ?
• Can two variables that are useless by
themselves be useful together?
35/54
Ranking Criteria – Correlation
• Can variables with small
score be discarded without
further consideration?
NO!
• Even variables with small
score can improve class
separability!
• Here this depends on the
correlation between x1 and
x2 .
(Here the class conditional
distributions have a high
covariance in the direction
orthogonal to the line
between the two class
centers)
36/54
Ranking Criteria – Correlation
• Example with high
correlation between x1 and
x2 .
(Here the class conditional
distributions have a high
covariance in the direction of
the two class centers)
• No gain in separation ability
by using two variables
instead of just one!
37/54
Ranking Criteria – Correlation
• Can a useless variable
be useful together with
others ?
YES!
38/54
Ranking Criteria – Correlation
• correlation between individual variables and the target is not
enough to assess relevance!
• correlation / covariance between pairs of variables has
to be considered too!
(potentially difficult)
• diversity of features
39/54
Ranking Criteria – Inf. Theory
Information Theoretic Criteria
• Most approaches use (empirical estimates of)
mutual information between features and the
target:
I(xi, y) = ∫∫ p(xi, y) log [ p(xi, y) / (p(xi) p(y)) ] dxi dy
• Case of discrete variables:
I(xi, y) = Σ_xi Σ_y P(X = xi, Y = y) log [ P(X = xi, Y = y) / (P(X = xi) P(Y = y)) ]
(probabilities are estimated from frequency counts)
40/54
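A sketch of the discrete estimate above, with all probabilities taken from frequency counts (names are mine). The XOR example from the correlation slides shows why this criterion, applied to single variables, runs into the same problem:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X; Y) for discrete variables, from frequency counts (in nats)."""
    m = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x, p_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in p_xy.items():
        # (c/m) * log( (c/m) / ((p_x[x]/m) * (p_y[y]/m)) )
        mi += (c / m) * math.log(c * m / (p_x[x] * p_y[y]))
    return mi

# y = XOR(x1, x2): each input alone carries no information about the target,
# but the pair (x1, x2) does.
x1 = [0, 0, 1, 1]; x2 = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x1, x2)]
print(mutual_information(x1, y))                 # 0.0
print(mutual_information(list(zip(x1, x2)), y))  # log(2) ~ 0.69
```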
Ranking Criteria – Inf. Theory
• Mutual information can also detect non-linear
dependencies among variables!
• But harder to estimate than correlation!
• It is a measure for “how much information (in terms of
entropy) two random variables share”
41/54
Variable Ranking - example
• Dataset from [3]
• 100,000 features
• 800 training examples
• 1 selected feature: performance of a linear classifier: ber = 0.062 !!
• used classifier: simple perceptron
• ber = 0.5 · (error_rate_on_neg + error_rate_on_pos)
[Figure: mean squared error vs. number of selected features]
[3] Isabelle Guyon and Steve Gunn. NIPS feature selection challenge. http://www.nipsfsc.ecs.soton.ac.uk/, 2003.
42/54
Variable Ranking - SVC
Single Variable Classifiers
• Idea: Select variables according to their individual predictive
power
• criterion: Performance of a classifier built with 1 variable
• e.g. the value of the variable itself
(set a threshold on the value of the variable)
• predictive power is usually measured in terms of error rate (or
criteria using fpr, fnr)
• also: combination of SVCs using ensemble methods (boosting,…)
43/54
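A sketch of a single-variable classifier score: threshold one variable (trying both directions) and rate it by the balanced error rate ber defined on the previous slide. The threshold search and names are mine.

```python
def svc_score(values, labels):
    """Best balanced error rate achievable by thresholding a single variable.

    values: the m values of one feature; labels: m labels in {-1, +1}
    (both classes are assumed to be present).
    """
    best_ber = 0.5  # with both threshold directions, 0.5 is always achievable
    n_pos = labels.count(+1)
    n_neg = labels.count(-1)
    for t in sorted(set(values)):
        for sign in (+1, -1):
            preds = [sign if v >= t else -sign for v in values]
            err_pos = sum(p != l for p, l in zip(preds, labels) if l == +1) / n_pos
            err_neg = sum(p != l for p, l in zip(preds, labels) if l == -1) / n_neg
            best_ber = min(best_ber, 0.5 * (err_pos + err_neg))
    return best_ber  # lower is better; e.g. use 1 - ber as a ranking score
```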
Overview
• Introduction/Motivation
• Basic definitions, Terminology
• Variable Ranking methods
• Feature subset selection
44/54
Feature Subset Selection
• Goal:
- Find the optimal feature subset.
(or at least a “good one.”)
• Classification of methods:
– Filters
– Wrappers
– Embedded Methods
45/54
Feature Subset Selection
• You need:
– a measure for assessing the goodness of a
feature subset (scoring function)
– a strategy to search the space of possible feature
subsets
• Finding a minimal optimal feature set for an
arbitrary target concept is NP-hard
=> Good heuristics are needed!
[9] E. Amaldi, V. Kann: The approximability of minimizing nonzero variables and unsatisfied relations in linear systems. (1997)
46/54
Feature Subset Selection
Filter Methods
• Select subsets of variables as a pre-processing
step,
independently of the used classifier!!
• Note that Variable Ranking-FS is a filter method
47/54
Feature Subset Selection
Filter Methods
• usually fast
• provide generic selection of features, not tuned for a given learner (universal)
• this is also often criticised (feature set not optimized for the used classifier)
• sometimes used as a preprocessing step for other methods
48/54
Feature Subset Selection
Wrapper Methods
• Learner is considered a black-box
• Interface of the black-box is used to score
subsets of variables according to the predictive
power of the learner when using the subsets.
• Results vary for different learners
• One needs to define:
– how to search the space of all possible variable subsets?
– how to assess the prediction performance of a learner?
49/54
Feature Subset Selection
Wrapper Methods
50/54
Feature Subset Selection
Wrapper Methods
• The problem of finding the optimal subset is NP-hard!
• A wide range of heuristic search strategies can be used.
Two different classes:
– Forward selection
(start with empty feature set and add features at each step)
– Backward elimination
(start with full feature set and discard features at each step)
• predictive power is usually measured on a validation set or by cross-validation
• By using the learner as a black box, wrappers are universal and simple!
• Criticism: a large amount of computation is required.
51/54
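A sketch of greedy forward selection in the wrapper setting: the learner stays a black box behind an `evaluate(subset)` callback (e.g. cross-validated accuracy). The greedy stopping rule and all names are mine.

```python
def forward_selection(n_features, evaluate, max_features=None):
    """Greedy forward selection: repeatedly add the feature that most improves
    `evaluate`, and stop when no single addition helps.

    evaluate: callable taking a tuple of feature indices and returning a score
              (e.g. cross-validated accuracy of the black-box learner).
    """
    selected, best_score = (), float("-inf")
    limit = max_features if max_features is not None else n_features
    while len(selected) < limit:
        candidates = [selected + (i,) for i in range(n_features) if i not in selected]
        score, subset = max((evaluate(c), c) for c in candidates)
        if score <= best_score:   # no candidate improves on the current subset
            break
        selected, best_score = subset, score
    return selected, best_score
```

Backward elimination is the mirror image: start from the full feature set and greedily remove the feature whose removal hurts the score least.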
Feature Subset Selection
Embedded Methods
• Specific to a given learning machine!
• Performs variable selection (implicitly) in the process of training
• E.g. WINNOW-algorithm (linear unit with multiplicative updates)
52/54
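For reference, a minimal sketch of the classical Winnow rule for Boolean features (promotion/demotion by a constant factor, threshold n), following the standard textbook description rather than any code from the slides; see [7] for Littlestone's original formulation.

```python
def winnow_train(examples, n, passes=10, alpha=2.0):
    """Classical Winnow: a linear unit with multiplicative weight updates.

    examples: list of (x, y) pairs with x a list of n Boolean (0/1) features
              and y in {0, 1}. Features whose weights stay large are,
              implicitly, the selected ones.
    """
    w = [1.0] * n
    theta = float(n)  # fixed threshold
    for _ in range(passes):
        for x, y in examples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0
            if pred == y:
                continue
            if y == 1:    # false negative: promote the active features
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            else:         # false positive: demote the active features
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w
```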
Important points 1/2
• Feature selection can significantly increase the
performance of a learning algorithm (both
accuracy and computation time) – but it is not
easy!
• One can work on problems with very high-dimensional feature spaces
• Relevance ≠ Optimality
• Correlation and Mutual Information between
single variables and the target are often used
as ranking criteria for variables.
53/54
Important points 2/2
• One cannot automatically discard variables with
small scores – they may still be useful together
with other variables.
• Filters – Wrappers – Embedded Methods
• How to search the space of all feature subsets?
• How to assess the performance of a learner that uses
a particular feature subset?
54/54
THANK YOU!
Sources
1. Natasha Sigala, Nikos Logothetis: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature, Vol. 415 (2002)
2. Ron Kohavi, George H. John: Wrappers for Feature Subset Selection. Artificial Intelligence, 97(1-2):273-324 (1997), special issue on relevance
3. Isabelle Guyon and Steve Gunn: NIPS feature selection challenge. http://www.nipsfsc.ecs.soton.ac.uk/, 2003
4. Isabelle Guyon, André Elisseeff: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 (2003) 1157-1182
5. Natasha Sigala, Nikos Logothetis: Visual categorization shapes feature selectivity in the primate temporal cortex. Nature, Vol. 415 (2002)
6. Daphne Koller, Mehran Sahami: Toward Optimal Feature Selection. 13th ICML (1996), p. 284-292
7. Nick Littlestone: Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm. Machine Learning 2, p. 285-318 (1987)
8. C. Ambroise, G.J. McLachlan: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, Vol. 99, 6562-6566 (2002)
9. E. Amaldi, V. Kann: The approximability of minimizing nonzero variables and unsatisfied relations in linear systems. (1997)