Centre for Integrative Bioinformatics VU
Lecture 5
Feature Selection
(adapted from Elena Marchiori's slides)
Bioinformatics Data Analysis and Tools
[email protected]
What is feature selection?
• Reducing the feature space by removing
some of the (non-relevant) features.
• Also known as:
– variable selection
– feature reduction
– attribute selection
– variable subset selection
Why select features?
• It is cheaper to measure fewer variables.
• The resulting classifier is simpler and
potentially faster.
• Prediction accuracy may improve by
discarding irrelevant variables.
• Identifying relevant variables gives more
insight into the nature of the corresponding
classification problem (biomarker
detection).
• It alleviates the “curse of dimensionality”.
Why select features?
[Figure: correlation plots of the 3-class leukemia data with no feature selection, with selection based on variance, and with top-100 feature selection; colour scale from -1 to +1.]
The curse of dimensionality
• Term introduced by Richard Bellman¹.
• Problems caused by the exponential increase
in volume associated with adding extra
dimensions to a (mathematical) space.
• So: the ‘problem space’ increases with the
number of variables/features.
¹Bellman, R.E. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
The curse of dimensionality
• A high-dimensional feature space leads to
problems in, for example:
– Machine learning: danger of overfitting with
too many variables.
– Optimization: finding the global optimum is
(virtually) infeasible in a high-dimensional
space.
– Microarray analysis: the number of features
(genes) is much larger than the number of
objects (samples), so a huge number of
observations would be needed to obtain a good
estimate of the function of a gene.
Approaches
• Wrapper:
– Feature selection takes into account the
contribution to the performance of a given type of
classifier.
• Filter:
– Feature selection is based on an evaluation
criterion that quantifies how well features (or
feature subsets) discriminate the two classes.
• Embedded:
– Feature selection is part of the training procedure
of a classifier (e.g. decision trees).
Embedded methods
• Attempt to jointly or simultaneously train
both a classifier and a feature subset.
• Often optimize an objective function that
jointly rewards accuracy of classification
and penalizes use of more features.
• Intuitively appealing.
Example: tree-building algorithms
Adapted from J. Fridlyand
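As a small illustration (not from the original slides, and assuming scikit-learn and numpy are installed), the sketch below shows the embedded idea with a decision tree: only the features the tree actually splits on receive a non-zero importance, so the fitted classifier implicitly performs the selection.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 100 samples, 20 candidate features, of which only
# features 3 and 7 carry the class signal (an assumption for this demo).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = (X[:, 3] + X[:, 7] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Features with non-zero importance are the ones "selected" by the tree.
print(np.nonzero(tree.feature_importances_)[0])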
Approaches to Feature Selection

Filter approach:
input features -> feature selection by distance-metric score -> train model -> model.

Wrapper approach:
input features -> feature-selection search -> feature set -> train model -> model,
with the importance of features given by the model fed back to the search.

Adapted from Shin and Jasso
Filter methods

[Diagram: data with p features -> feature selection -> S features -> classifier design, with S << p.]

• Features are scored independently and the top S
are used by the classifier.
• Score: correlation, mutual information, t-statistic,
F-statistic, p-value, tree importance statistic, etc.

Easy to interpret. Can provide some insight into the disease
markers.

Adapted from J. Fridlyand
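A minimal sketch of the filter idea (my own illustration, assuming numpy and scipy are available): score every feature independently with a two-sample t-statistic and keep the top S.

import numpy as np
from scipy.stats import ttest_ind

def filter_select(X, y, S):
    # X: samples x features, y: binary labels (0/1).
    # Returns the indices of the S features with the largest |t-statistic|.
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(-np.abs(t))[:S]

# Toy example: 20 samples, 50 features, first 5 features made informative.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 50))
y = np.repeat([0, 1], 10)
X[y == 1, :5] += 2.0
print(filter_select(X, y, S=5))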
Problems with filter method
• Redundancy in selected features: features are
considered independently and not measured on
the basis of whether they contribute new
information.
• Interactions among features generally cannot
be explicitly incorporated (some filter methods
are smarter than others).
• Classifier has no say in what features should be
used: some scores may be more appropriate in
conjunction with some classifiers than others.
Adapted from J. Fridlyand
Wrapper methods

[Diagram: data with p features -> feature selection -> S features -> classifier design, with S << p.]

• Iterative approach: many feature subsets are scored
based on classification performance and the best one is used.

Adapted from J. Fridlyand
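A minimal wrapper sketch (my own illustration, assuming scikit-learn): greedy forward selection that adds, at each step, the feature whose inclusion gives the best cross-validated accuracy of the chosen classifier.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, n_features, clf=None, cv=3):
    # Greedy forward selection: grow the feature set one feature at a time,
    # always picking the candidate with the best cross-validated score.
    clf = clf or KNeighborsClassifier(n_neighbors=3)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = [(cross_val_score(clf, X[:, selected + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        best_score, best_f = max(scores)
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

Each candidate subset requires training and evaluating a classifier, which is exactly why wrapper methods are expensive (next slide).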
Problems with wrapper methods
• Computationally expensive: for each
feature subset to be considered, a
classifier must be built and evaluated.
• No exhaustive search is possible (2^p feature
subsets to consider): generally greedy
algorithms only.
• Easy to overfit.
Adapted from J. Fridlyand
Example: Microarray Analysis

[Diagram: "labeled" cases (38 bone marrow samples: 27 ALL, 11 AML; each containing 7129 gene expression values) are used to train a model (using neural networks, support vector machines, Bayesian nets, etc.). The trained model, built on key genes, then classifies 34 new unlabeled bone marrow samples as AML or ALL.]
Microarray Data Challenges to
Machine Learning Algorithms:
• Few samples for analysis (38 labeled).
• Extremely high-dimensional data (7129
gene expression values per sample).
• Noisy data.
• Complex underlying mechanisms, not fully
understood.
Some genes are more useful than others for building classification models

[Scatter plots of the AML and ALL samples: genes 36569_at and 36495_at separate the two classes well (useful); genes 37176_at and 36563_at do not (not useful).]
Importance of feature (gene) selection
• Majority of genes are not directly related to
leukemia.
• Having a large number of features enhances
the model’s flexibility, but makes it prone to
overfitting.
• Noise and the small number of training
samples make this even more likely.
• Some types of models, like kNN, do not
scale well with many features.
How do we choose the most relevant of the 7129 genes?
1. Distance metrics to capture class separation.
2. Rank genes according to distance metric score.
3. Choose the top n ranked genes.

[Illustration: a gene with a HIGH score separates the two classes well; a gene with a LOW score does not.]
Distance metrics
• Tamayo's Relative Class Separation:
  $\frac{\bar{x}_1 - \bar{x}_2}{s_1 + s_2}$
• t-test:
  $\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
• Bhattacharyya distance:
  $\frac{1}{4}\frac{(\bar{x}_2 - \bar{x}_1)^2}{s_1^2 + s_2^2} + \frac{1}{2}\log\frac{s_1^2 + s_2^2}{2 s_1 s_2}$

where $\bar{x}_i$ is the mean (vector) of class $i$, $s_i$ the standard deviation of class $i$, and $n_i$ the number of samples in class $i$.
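For concreteness, here is a sketch of the three scores in Python for a single gene (my own helper functions, not from the slides; using the sample standard deviation with ddof=1 is an assumption):

import numpy as np

def class_stats(x1, x2):
    # Per-class mean and (sample) standard deviation for one gene.
    return np.mean(x1), np.mean(x2), np.std(x1, ddof=1), np.std(x2, ddof=1)

def relative_class_separation(x1, x2):        # Tamayo (signal-to-noise)
    m1, m2, s1, s2 = class_stats(x1, x2)
    return (m1 - m2) / (s1 + s2)

def t_statistic(x1, x2):
    m1, m2, s1, s2 = class_stats(x1, x2)
    return (m1 - m2) / np.sqrt(s1**2 / len(x1) + s2**2 / len(x2))

def bhattacharyya(x1, x2):
    m1, m2, s1, s2 = class_stats(x1, x2)
    return (0.25 * (m2 - m1)**2 / (s1**2 + s2**2)
            + 0.5 * np.log((s1**2 + s2**2) / (2 * s1 * s2)))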
SVM-RFE: wrapper
• Recursive Feature Elimination:
– Train a linear SVM -> linear decision function.
– Use the absolute values of the variable weights to rank
the variables.
– Remove the half of the variables with the lowest ranks.
– Repeat the above steps (train, rank, remove) on the data
restricted to the variables not removed.
• Output: subset of variables.
SVM-RFE
• Linear binary classifier decision function:
  $f(x_1, \ldots, x_N) = \sum_{i=1}^{N} w_i x_i + b$
  $|w_i|$ = score of variable $x_i$
• Recursive Feature Elimination (SVM-RFE). At each iteration:
  1) eliminate the threshold% of variables with the lowest scores;
  2) recompute the scores of the remaining variables.
SVM-RFE
I. Guyon et al., Machine Learning, 46:389-422, 2002
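A minimal SVM-RFE sketch (my own, assuming scikit-learn): train a linear SVM, rank the variables by |w_i|, drop the lower-ranked half, and repeat on the remaining variables. scikit-learn also ships a ready-made version of this procedure as sklearn.feature_selection.RFE.

import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep):
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_keep:
        svm = LinearSVC(C=1.0, max_iter=10000).fit(X[:, remaining], y)
        ranks = np.argsort(np.abs(svm.coef_.ravel()))   # lowest |w| first
        # Remove (at most) half of the variables, but never go below n_keep.
        n_drop = min(max(1, len(remaining) // 2), len(remaining) - n_keep)
        remaining = remaining[ranks[n_drop:]]           # keep the higher-ranked part
    return remaining                                    # indices of selected variables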
RELIEF
Kira K, Rendell L, 10th Int. Conf. on AI, 129-134, 1992
• Idea: relevant features make (1) nearest examples of the
same class closer and (2) nearest examples of opposite
classes farther apart.
1. Set the weights of all features to zero.
2. For each example in the training set:
– find the nearest example from the same class (hit) and from the opposite class (miss);
– update the weight of each feature by adding abs(example - miss) - abs(example - hit).
RELIEF Algorithm
RELIEF assigns weights to variables based on how well they separate
samples from their nearest neighbors (nnb) from the same and from
the opposite class.
RELIEF
%input:  X (two classes)
%output: W (weights assigned to variables)
nr_var  = total number of variables;
weights = zero vector of size nr_var;
for all x in X do
    hit(x)  = nnb of x from the same class;
    miss(x) = nnb of x from the opposite class;
    weights += abs(x - miss(x)) - abs(x - hit(x));
end;
nr_ex = number of examples in X;
return W = weights / nr_ex;
Note: variables have to be normalized (e.g., divide each variable by its (max – min) value).
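A direct Python translation of the pseudocode above (a sketch, assuming numpy; here the Manhattan distance is used to find the nearest neighbours, whereas the worked example on the next slides uses 1 - Pearson correlation):

import numpy as np

def relief(X, y):
    # X: examples x variables (normalized), y: two class labels.
    n_ex, n_var = X.shape
    weights = np.zeros(n_var)
    for i in range(n_ex):
        dist = np.abs(X - X[i]).sum(axis=1)              # distance to every example
        dist[i] = np.inf                                  # exclude the example itself
        same = (y == y[i])
        hit  = np.argmin(np.where(same, dist, np.inf))    # nnb from the same class
        miss = np.argmin(np.where(~same, dist, np.inf))   # nnb from the opposite class
        weights += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return weights / n_ex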
RELIEF: example
Gene expressions for two types of leukemia:
- 3 patients with AML (Acute Myeloid Leukemia)
- 3 patients with ALL (Acute Lymphoblastic Leukemia)

GENE_ID   AML1  AML2  AML3  ALL1  ALL2  ALL3
gene1        3     2     6     4     4     1
gene2        5     3     4    15    13    14
gene3        8     2     3     5     9     7
gene4        2     3     5     4     2     5
gene5       20    24    23     7     8     7
(class 1: AML1-AML3; class 2: ALL1-ALL3)

What are the weights of genes 1-5, assigned by RELIEF?
RELIEF: normalization
First, apply (max - min) normalization:
- identify the max and min value of each feature (gene);
- divide all values of each feature by the corresponding (max - min).

GENE_ID   AML1  AML2  AML3  ALL1  ALL2  ALL3   min  max
gene1        3     2     6     4     4     1     1    6
gene2        5     3     4    15    13    14     3   15
gene3        8     2     3     5     9     7     2    9
gene4        2     3     5     4     2     5     2    5
gene5       20    24    23     7     8     7     7   24
(class 1: AML1-AML3; class 2: ALL1-ALL3)

Example: the normalized value of gene1 in AML1 is 3 / (6 - 1) = 0.6
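The same normalization in numpy (a sketch using the table values above):

import numpy as np

data = np.array([[ 3,  2,  6,  4,  4,  1],    # gene1
                 [ 5,  3,  4, 15, 13, 14],    # gene2
                 [ 8,  2,  3,  5,  9,  7],    # gene3
                 [ 2,  3,  5,  4,  2,  5],    # gene4
                 [20, 24, 23,  7,  8,  7]],   # gene5
                dtype=float)

value_range = data.max(axis=1, keepdims=True) - data.min(axis=1, keepdims=True)
normalized = data / value_range
print(normalized.round(3))    # first entry: 3 / (6 - 1) = 0.6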
RELIEF: distance matrix
Data after normalization:

GENE_ID    AML1   AML2   AML3   ALL1   ALL2   ALL3
gene1     0.600  0.400  1.200  0.800  0.800  0.200
gene2     0.417  0.250  0.333  1.250  1.083  1.167
gene3     1.143  0.286  0.429  0.714  1.286  1.000
gene4     0.667  1.000  1.667  1.333  0.667  1.667
gene5     1.176  1.412  1.353  0.412  0.471  0.412
(class 1: AML1-AML3; class 2: ALL1-ALL3)

Then, calculate the distance matrix (distance measure = 1 - Pearson correlation):

         AML1   AML2   AML3   ALL1   ALL2
AML2    0.545
AML3    0.920  0.232
ALL1    1.800  1.338  1.034
ALL2    1.109  1.888  1.877  0.778
ALL3    1.266  1.058  1.026  0.214  0.758
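The distance matrix can be reproduced with numpy (a sketch; np.corrcoef gives the Pearson correlations between the sample columns):

import numpy as np

# Normalized expression values: rows = genes, columns = AML1..ALL3.
X = np.array([[0.600, 0.400, 1.200, 0.800, 0.800, 0.200],
              [0.417, 0.250, 0.333, 1.250, 1.083, 1.167],
              [1.143, 0.286, 0.429, 0.714, 1.286, 1.000],
              [0.667, 1.000, 1.667, 1.333, 0.667, 1.667],
              [1.176, 1.412, 1.353, 0.412, 0.471, 0.412]])

dist = 1.0 - np.corrcoef(X.T)      # 6 x 6 matrix of 1 - Pearson correlation
print(dist[0, 1].round(3))         # d(AML1, AML2) -> 0.545, as in the table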
RELIEF: 1st iteration
RELIEF, Iteration 1: AML1
From the distance matrix, the nearest neighbour of AML1 in the same class (hit) is AML2 (d = 0.545), and the nearest neighbour in the opposite class (miss) is ALL2 (d = 1.109).

Update weights:
weight_gene1 += abs(0.600 - 0.800) - abs(0.600 - 0.400)
weight_gene2 += abs(0.417 - 1.083) - abs(0.417 - 0.250)
...

Weights after iteration 1:
gene1:  0.000
gene2:  0.500
gene3: -0.714
gene4: -0.333
gene5:  0.471
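The full iteration-1 update can be checked in a few lines (a sketch; the three-decimal rounding of the normalized values causes small differences in the last digit):

import numpy as np

x    = np.array([0.600, 0.417, 1.143, 0.667, 1.176])   # AML1
hit  = np.array([0.400, 0.250, 0.286, 1.000, 1.412])   # AML2, nearest hit
miss = np.array([0.800, 1.083, 1.286, 0.667, 0.471])   # ALL2, nearest miss

print((np.abs(x - miss) - np.abs(x - hit)).round(3))
# -> [ 0.     0.499 -0.714 -0.333  0.469]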
RELIEF: 2nd iteration
RELIEF, Iteration 2: AML2
From the distance matrix, the hit of AML2 is AML3 (d = 0.232) and the miss is ALL3 (d = 1.058).

Update weights:
weight_gene1 += abs(0.400 - 0.200) - abs(0.400 - 1.200)
weight_gene2 += abs(0.250 - 1.167) - abs(0.250 - 0.333)
...

Weights after iteration 2:
gene1: -0.600
gene2:  1.333
gene3: -0.143
gene4: -0.333
gene5:  1.412
RELIEF: results (after 6th iteration)
Weights after the last iteration:
gene1: -0.600
gene2:  4.250
gene3:  0.429
gene4: -2.333
gene5:  4.824

The last step is to sort the features by their weights and select the features with the highest ranks:

feature   weight
gene5      4.824
gene2      4.250
gene3      0.429
gene1     -0.600
gene4     -2.333
RELIEF
• Advantages:
– Fast.
– Easy to implement.
• Disadvantages:
– Does not filter out redundant features, so features
with very similar values could be selected.
– Not robust to outliers.
– Classic RELIEF can only handle data sets with two
classes.
Extension of RELIEF: RELIEF-F
• Extension for multi-class problems.
• Instead of finding one near miss, the algorithm finds one near miss for each
different class and averages their contributions when updating the weights.

RELIEF-F
%input:  X (two or more classes C)
%output: W (weights assigned to variables)
nr_var  = total number of variables;
weights = zero vector of size nr_var;
for all x in X do
    hit(x) = nnb of x from the same class;
    sum_miss = 0;
    for all c in C with c != class(x) do
        miss(x, c) = nnb of x from class c;
        sum_miss  += abs(x - miss(x, c)) / nr_examples(c);
    end;
    weights += sum_miss - abs(x - hit(x));
end;
nr_ex = number of examples in X;
return W = weights / nr_ex;
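A Python sketch of RELIEF-F following the pseudocode above (assuming numpy; again the Manhattan distance is used as the neighbour distance, which is just one possible choice):

import numpy as np

def relief_f(X, y):
    # X: examples x variables, y: class labels (two or more classes).
    n_ex, n_var = X.shape
    weights = np.zeros(n_var)
    for i in range(n_ex):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf
        hit = np.argmin(np.where(y == y[i], dist, np.inf))
        sum_miss = np.zeros(n_var)
        for c in np.unique(y):
            if c == y[i]:
                continue                                   # only the other classes
            m = np.argmin(np.where(y == c, dist, np.inf))  # near miss from class c
            sum_miss += np.abs(X[i] - X[m]) / np.sum(y == c)
        weights += sum_miss - np.abs(X[i] - X[hit])
    return weights / n_ex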