Sampling Approaches to Learning from Imbalanced Datasets

Naoki Abe
IBM T.J. Watson Research Center

Based on joint work with
Bianca Zadrozny (University of California, San Diego)
Hiroshi Mamitsuka (Kyoto University)
John Langford, Edwin Pednault, Chid Apte et al. (IBM T.J. Watson Research Center)
Outline

Introduction
  Industrial applications and learning from imbalanced datasets
  Review: Past approaches to learning from imbalanced datasets
Sampling Approaches to Learning from Imbalanced Datasets
  Selective sampling based on query learning
  Sampling for cost-sensitive learning
Discussion
Industrial Applications and the Issue of Imbalanced Datasets

Industrial Applications
  Hardware Fault Detection (e.g. Apte, Weiss, Grout '93): faults are rare but very costly
  Insurance Risk Modeling (e.g. Pednault, Rosen, Apte '00): claims are rare but very costly
  Churn Analysis (e.g. Mamitsuka and Abe '00): churn is typically rare but quite costly
  Targeted Marketing (e.g. Zadrozny, Elkan '01): response is typically rare but can be profitable
  Airline No-show Prediction (e.g. Lawrence, Hong, et al '03)
  Intrusion Detection (e.g. Chan et al): intrusion is rare but can be very costly
  Fraud Detection (e.g. Fawcett and Provost '97): fraud is rare but very costly
  Disease is typically rare but can be deadly

The bottom line: all involve imbalanced data sets
Review: Past Approaches

Algorithm-Specific Approach
  Modifying specific learning algorithms to handle imbalanced data sets
  Using a model class appropriate for modeling imbalanced data
Cost-sensitive Learning Approach
  An imbalanced dataset is a problem because the rare class tends to be more costly...
  Learning from imbalanced datasets is an instance of cost-sensitive learning?
Query/Active Learning Approach
  An imbalanced dataset is a problem because it provides little information about the decision boundary
  Learning from imbalanced datasets is an instance of query/active learning?
Algorithmic Specific Approach: A Case
Study from Underwriting Profitability
Analysis (UPA)
Partnership between IBM and Farmers Insurance Group
UPA predicts the expected claim amount paid per unit
time (i.e. pure premium) as a function of risk factors.
Project led to creation of IBM ProbE data mining engine:
General framework for tree-based modeling
Allows arbitrary model class at leaves
Allows modification of splitting/pruning rule
Y
Y
5/24/2016
5
X
X
Model Class for Imbalanced Dataset

Claim amounts modeled by a lognormal distribution
Claim frequency modeled by Poisson distributions
Node splitting allowed only if the relative standard error of the estimated mean claim is small:

  √( (σ²/n) · (1 + σ²/2) )

where σ² is the variance of the log claim amounts at the node and n is the number of claims.

[Figures: histograms of claim size on linear (0-10000) and logarithmic (10-100000) scales]

This approach makes sense if a complete model is desired.
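As a rough illustration of such a splitting criterion, here is a minimal sketch. The formula is the standard delta-method approximation for the relative standard error of a lognormal mean, reconstructed from the slide's fragments; it is not necessarily ProbE's exact test, and the function names and threshold are illustrative.

```python
import numpy as np

def lognormal_relative_se(claims):
    """Approximate relative standard error of the estimated mean claim amount
    under a lognormal model: sqrt( (sigma^2/n) * (1 + sigma^2/2) ),
    with sigma^2 the variance of the log claim amounts at the node."""
    logs = np.log(np.asarray(claims, dtype=float))
    n, sigma2 = len(logs), logs.var(ddof=1)
    return np.sqrt(sigma2 / n * (1.0 + sigma2 / 2.0))

def split_allowed(claims_left, claims_right, max_rel_se=0.1):
    """Permit a node split only if both children estimate their mean claim
    with a sufficiently small relative standard error (threshold illustrative)."""
    return (lognormal_relative_se(claims_left) < max_rel_se and
            lognormal_relative_se(claims_right) < max_rel_se)
```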
Outline

Introduction
  Industrial applications and learning from imbalanced datasets
  Review: Past approaches to learning from imbalanced datasets
Sampling Approach to Learning from Imbalanced Dataset
  Selective sampling based on query learning
  Sampling for cost-sensitive learning
Discussion
Selective Sampling based on Query Learning

Active/Query Learning (e.g. Angluin '88)
  The learner gets to choose the examples on which to request labels
Uncertainty Sampling (e.g. Lewis and Gale '94)
  The learner queries examples for which its prediction so far is uncertain, to maximize information gain
  A prime example is Query by Committee (Seung et al '92), which queries examples on which the models obtained so far disagree
  Successfully applied to obtaining labeled data for text classification
Selective sampling by query learning (e.g. Freund et al '93)
  Given a large number of (possibly unlabeled) examples, uses only a small subset of labeled data for learning
  Successfully applied to mining very large data sets (e.g. Mamitsuka and Abe '00), even when labeled data are abundant
Why query learning for imbalanced data sets?

Use query learning to get more data for rare classes
Use selective sampling by query learning to get data near the decision boundary

[Figure: three cases (Case 1, Case 2, Case 3) illustrating how positive (label = 1) and negative (label = 0) examples are distributed relative to the decision boundary]
Selective Sampling by Query Learning: Overview

Calculate the "uncertainty" of examples based on the results of the previous iterations.
In the next iteration, select the examples which are most uncertain.
A commonly used measure of uncertainty for classification is the margin.

[Figure: each iteration's sample (sample I) yields a model; the models' votes on the data define a margin, used as the uncertainty measure for selecting sample I+1]
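The margin-based uncertainty measure illustrated above can be computed directly from the votes of the models built so far. Below is a minimal sketch, assuming 0/1 labels and scikit-learn style models exposing a .predict method; the function names are illustrative.

```python
import numpy as np

def committee_margin(models, X):
    """Margin of the committee vote for each example in X:
    |#votes for class 1 - #votes for class 0|.
    A small margin means high disagreement, i.e. high uncertainty."""
    votes = np.array([m.predict(X) for m in models])   # shape: (n_models, n_examples)
    ones = votes.sum(axis=0)
    zeros = votes.shape[0] - ones
    return np.abs(ones - zeros)

def most_uncertain(models, X, s):
    """Indices of the s examples with the smallest committee margin."""
    return np.argsort(committee_margin(models, X))[:s]
```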
QbagS: Query by Bagging (Mamitsuka and Abe '00)

QbagS (Learner A, Sample Set T, sample size s, count t)
  (1) For i = 1 to t do
    (a) T' = minimum-margin sub-sample of size s from T
    (b) Let h_i = A(T')
  (2) Output h(x) = sign( Σ_{i=1}^{t} h_i(x) )

It belongs to the "sequential multi-subset learning with model-guided instance selection" methods (Provost and Kolluri '99).
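A minimal sketch of the QbagS loop, assuming 0/1 labels and a scikit-learn style base learner (a decision tree here). The first iteration starts from a random sub-sample since no committee exists yet; all names and defaults are illustrative, not taken from the original implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def qbags(X, y, s, t, Learner=DecisionTreeClassifier, seed=0):
    """QbagS sketch: each iteration trains on the minimum-margin sub-sample
    of size s, and the final classifier is a majority vote of all t models."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)                 # labels assumed to be 0/1
    idx = rng.choice(len(y), size=s, replace=False)     # random start for round 1
    hypotheses = []
    for _ in range(t):
        hypotheses.append(Learner().fit(X[idx], y[idx]))
        votes = np.array([h.predict(X) for h in hypotheses])
        ones = votes.sum(axis=0)
        margin = np.abs(ones - (len(hypotheses) - ones))
        idx = np.argsort(margin)[:s]                    # minimum-margin sub-sample

    def predict(X_new):
        votes = np.array([h.predict(np.asarray(X_new)) for h in hypotheses])
        return (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote
    return predict
```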
What about "boosting" and other related approaches?
Ivotes: Importance Sampling (Breiman '00)

Ivotes (Learner A, Sample Set T, sample size s, count t)
  (1) For i = 1 to t do
    (a) T' = importance sample of size s from T (an example is accepted with probability 1 if the current hypothesis predicts it wrongly, and with probability e/(1-e) otherwise*)
    (b) Let h_i = A(T')
  (2) Output h(x) = sign( Σ_{i=1}^{t} h_i(x) )

* e = error rate of the current hypothesis
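For comparison, a minimal sketch of the Ivotes acceptance rule. Here predict_current stands for the current hypothesis's prediction function and e for its measured error rate (assumed strictly between 0 and 1); all names are illustrative.

```python
import numpy as np

def ivotes_sample(X, y, predict_current, e, s, seed=0):
    """Importance sample of size s as in Ivotes: an example the current
    hypothesis misclassifies is always accepted; a correctly classified
    example is accepted with probability e/(1-e). Assumes 0 < e < 1."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    accepted = []
    while len(accepted) < s:
        i = rng.integers(len(y))                         # candidate drawn uniformly from T
        wrong = predict_current(X[i:i + 1])[0] != y[i]
        if wrong or rng.random() < e / (1.0 - e):
            accepted.append(i)
    return np.array(accepted)
```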
Empirical Comparison between QbagS and Ivotes

Medium sized data sets (Generator from Agrawal '93)
Large sized real-world data set (Churn from NEC)
Empirical Comparison between QbagS and Ivotes (II)

Medium sized data sets (Generator from Agrawal '93)
Large sized real-world data set (Churn from NEC)
  This is an imbalanced data set (class 1 = churn is roughly 10%)
  Cost of retention (C0) is smaller than cost of churn (C1)
  Measured performance using Precision and Recall
The Precision-Recall Measure

Precision-Recall is often used as an evaluation metric for learning algorithms
  Precision = P(correct | pred = 1)
  Recall = P(correct | true = 1)
It provides a measure for a whole range of relative costs of false positives and false negatives
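These two quantities are straightforward to compute from predictions; a small sketch, treating class 1 as the rare/positive class (names illustrative):

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision = P(correct | pred = 1); Recall = P(correct | true = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    precision = tp / max(np.sum(y_pred == 1), 1)   # guard against no positive predictions
    recall = tp / max(np.sum(y_true == 1), 1)      # guard against no positive examples
    return precision, recall
```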
Precision-Recall and Cost Minimization

Let
  F0, F1 = frequency of class 0 (1)
  C0, C1 = cost when the true class is 0 (1)
  P0, R = probability of a correct prediction when the true class is 0 (1); R is the recall of class 1
Then
  Expected cost = F1(1-R)C1 + F0(1-P0)C0
                = K - F1·C1·R - F0·C0·P0,  where K = F1·C1 + F0·C0
Assuming the slope of the PR-curve is -1, cost is decreased by
  increasing R if F1·C1 > F0·C0
  increasing P if F1·C1 < F0·C0

Query learning, or more generally ensemble learning, provides a solution for a whole range of cost landscapes, by virtue of the ranking it induces w.r.t. confidence of prediction.
Outline

Introduction
  Industrial applications and learning from imbalanced datasets
  Review: Past approaches to learning from imbalanced datasets
Sampling Approach to Learning from Imbalanced Dataset
  Selective sampling based on query learning
  Sampling for cost-sensitive learning
Cost-sensitive query learning?
Cost-sensitive Learning

Traditionally assumed a cost matrix of the form:

              Predict = 0   Predict = 1
  True = 0    C(0,0)        C(1,0)
  True = 1    C(0,1)        C(1,1)

Zadrozny and Elkan '01 introduced costs that depend on the particular example x:

              Predict = 0   Predict = 1
  True = 0    C(0,0,x)      C(1,0,x)
  True = 1    C(0,1,x)      C(1,1,x)
Cost-sensitive Learning by Cost-proportionate Weighted Sampling (Zadrozny, Langford, Abe '03)

Presents a reduction of cost-sensitive learning to classification
  With a theoretical performance guarantee
  Uses cost-proportionate rejection sampling
Proposes Costing (cost-sensitive ensemble learning)
Empirical evaluation using benchmark data sets from the targeted marketing domain
  Costing achieves excellent predictive performance (w.r.t. cost minimization)
  Costing is computationally efficient
Translation Theorem

Assume examples (x, y, c) are drawn i.i.d. from some distribution D over X × Y × R,
where c = C(1-y, y, x) - C(y, y, x), i.e. the opportunity cost of misclassifying x.

Let D̂(x, y, c) = (c / Z) · D(x, y, c), where Z = E_{(x,y,c)~D}[c].

Then E_{(x,y,c)~D̂}[ I(h(x) ≠ y) ] = (1/Z) · E_{(x,y,c)~D}[ c · I(h(x) ≠ y) ],

so an h minimizing the expected classification error rate for D̂ will minimize the expected cost with respect to D.
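A quick numerical illustration of the theorem, using toy data and a fixed toy classifier (all of it illustrative): the cost-weighted error under D should roughly equal Z times the plain error rate under D̂, where D̂ is approximated by resampling with probability proportional to c.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy draw of (x, y, c) standing in for a distribution D.
n = 200_000
x = rng.normal(size=n)
y = (x + rng.normal(scale=0.5, size=n) > 0).astype(int)
c = np.where(y == 1, 10.0, 1.0)            # misclassifying class 1 is 10x costlier

h = lambda x: (x > 0.3).astype(int)        # some fixed classifier h
mistakes = (h(x) != y).astype(float)

Z = c.mean()                               # Z = E_D[c]
lhs = (c * mistakes).mean()                # E_D[ c * I(h(x) != y) ]

# Approximate D-hat by resampling with probability proportional to c.
idx = rng.choice(n, size=n, p=c / c.sum())
rhs = Z * mistakes[idx].mean()             # Z * E_{D-hat}[ I(h(x) != y) ]

print(lhs, rhs)                            # the two quantities should roughly agree
```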
Cost-proportionate Sampling

Two methods for weighted sampling:

Sampling with replacement (for some chosen sample size) from T, with probabilities
  p(x, y, c) = c / Σ_{(x,y,c) ∈ S} c
Rejection sampling from T with the same (cost-proportionate) probabilities, i.e.
  with probability p(x, y, c), accept the example;
  otherwise reject the example;
  continue sampling from T.
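A minimal sketch of cost-proportionate rejection sampling over a finite training set. It uses Z = max cost as the normalizer, which is one natural choice of constant and an assumption here; names are illustrative.

```python
import numpy as np

def rejection_sample(X, y, c, seed=0):
    """One pass of cost-proportionate rejection sampling:
    keep example i independently with probability c[i] / Z, Z = max cost."""
    rng = np.random.default_rng(seed)
    X, y, c = np.asarray(X), np.asarray(y), np.asarray(c, dtype=float)
    Z = c.max()
    keep = rng.random(len(c)) < c / Z
    return X[keep], y[keep]
```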
Sample Complexity of Cost-proportionate Rejection Sampling

Define m(1/ε, 1/δ) to be the worst-case sample complexity for achieving approximately optimal cost with high probability.
Then define m_orig(1/ε, 1/δ) to be the sample complexity when using the original sample,
and m_rej(1/ε, 1/δ) to be the sample complexity when using cost-proportionate rejection sampling.
Then the following holds:
  m_orig(1/ε, 1/δ) = Ω( m_rej(1/ε, 1/δ) )
Cost-proportionate rejection sampling distills the cost-sensitive information in the original sample into a much smaller one.
Costing – Cost-based Bagging

Costing (Learner A, Sample Set T, count t)
  (1) For i = 1 to t do
    (a) T' = cost-proportionate rejection sample from T
    (b) Let h_i = A(T')
  (2) Output h(x) = sign( Σ_{i=1}^{t} h_i(x) )
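A minimal sketch of Costing built on the rejection sampler shown earlier, with a scikit-learn decision tree standing in for the base learner A and 0/1 labels assumed; the choice of base learner, Z = max cost, and all names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def costing(X, y, c, t, Learner=DecisionTreeClassifier, seed=0):
    """Costing sketch: t independent cost-proportionate rejection samples,
    one base classifier per sample, combined by an unweighted majority vote."""
    rng = np.random.default_rng(seed)
    X, y, c = np.asarray(X), np.asarray(y), np.asarray(c, dtype=float)
    Z = c.max()
    hypotheses = []
    for _ in range(t):
        keep = rng.random(len(c)) < c / Z                # rejection sample from T
        hypotheses.append(Learner().fit(X[keep], y[keep]))

    def predict(X_new):
        votes = np.array([h.predict(np.asarray(X_new)) for h in hypotheses])
        return (votes.mean(axis=0) >= 0.5).astype(int)   # majority vote (labels 0/1)
    return predict
```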
Costing Results: KDD-98

Each set has ~600 examples, of which ~55% are positive.
Costing with C4.5 achieves state-of-the-art profit.
Similar, though less impressive, behavior observed for SVM and Naïve Bayes.

[Figure: Test Set Net Profit for Costing with C4.5]
Experimental Results: Summary

Experiments using two targeted marketing datasets.
Generally, resampling performs poorly and costing performs well.
The extremely poor performance of resampling with C4.5 is thought to be caused by overfitting: duplicate examples hinder the complexity-control mechanism of C4.5.

KDD-98:
  Method        Costing (200)   Resampling (100k)
  NB            $13163          $12026
  Boosted NB    $14714          $13135
  C4.5          $15016          $2259
  SVMLight      $13152          $12808

DMEF-2:
  Method        Costing (200)   Resampling (100k)
  NB            $37629          $34506
  Boosted NB    $37891          $31889
  C4.5          $37500          $3149
  SVMLight      $35290          $33674
Outline

Introduction
  Industrial applications and learning from imbalanced datasets
  Review: Past approaches to learning from imbalanced datasets
Sampling Approach to Learning from Imbalanced Dataset
  Selective sampling based on query learning
  Sampling for cost-sensitive learning
Discussion
Learning from Imbalanced Dataset as Cost-sensitive Learning

Cost-proportionate weighted sampling solves cost-sensitive learning, and hence learning from imbalanced datasets.
Cost-proportionate rejection sampling and resampling with replacement correspond to under-sampling and over-sampling, respectively.
Under-sampling and over-sampling are the special case in which F1·C1 = F0·C0 and where P = R is optimal (assuming the slope of the PR-curve is -1).
"Rejection sampling > Resampling" is consistent with, and generalizes, "Under-sampling > Over-sampling".
Imbalanced Dataset, Cost-sensitivity, Query Learning

Generalizing learning from imbalanced datasets as cost-sensitive learning leads to better understanding and a more general solution.
Generalizing learning from imbalanced datasets as query learning offers an alternative solution, which is valid for a whole range of cost landscapes.
The sampling approach (derived from cost-sensitive and query learning) addresses the issue of imbalanced datasets, while also providing solutions with improved computational efficiency!

[Diagram: Learning from Imbalanced Dataset is connected to Cost-sensitive Learning via cost-proportionate sampling and to Query Learning via uncertainty sampling]
References

"Handling imbalanced datasets in insurance risk modeling", E. Pednault, B. Rosen, C. Apte, Learning from Imbalanced Datasets: Papers from the AAAI Workshop, The AAAI Press, 2000. (Also available as IBM Research Report RC-21731.)

"Efficient mining from large databases by query learning", H. Mamitsuka and N. Abe, Proc. of the Seventeenth Int'l Conf. on Machine Learning (ICML'00).

"Cost-sensitive learning by cost-proportionate example weighting", B. Zadrozny, J. Langford, N. Abe, Proc. of the Third IEEE Int'l Conf. on Data Mining (ICDM'03), to appear.