Sample Selection Bias – Covariate Shift: Problems, Solutions, and Applications
Wei Fan, IBM T.J. Watson Research
Masashi Sugiyama, Tokyo Institute of Technology
Updated PPT is available: http://www.weifan.info/tutorial.htm
Overview of the Sample Selection Bias Problem
A Toy Example
• Two classes: red and green
  – red: f2 > f1
  – green: f2 <= f1

Unbiased and Biased Samples
[Figures: a not-so-biased sampling vs. a biased sampling of the toy data]
Effect on Learning
• Accuracies of three learners trained on unbiased vs. biased samples:
  – Unbiased: 96.9%, 97.1%, 96.405%
  – Biased: 95.9%, 92.7%, 92.1%
• Some techniques are more sensitive to bias than others.
• One important question: how to reduce the effect of sample selection bias?
Ubiquitous
• Loan approval
• Drug screening
• Weather forecasting
• Ad campaigns
• Fraud detection
• User profiling
• Biomedical informatics
• Intrusion detection
• Insurance
• etc.

Example: loan approval
1. Normally, banks only have data on their own customers.
2. "Late payment, default" models are computed using their own data.
3. New customers may not completely follow the same distribution.
Face Recognition
• Sample selection bias:
  – Training samples are taken inside a research lab, where there are few women.
  – Test samples: in the real world, the men-women ratio is almost 50-50.
[Figure: The Yale Face Database B]
Brain-Computer Interface (BCI)
• Control computers by EEG signals:
– Input: EEG signals
– Output: Left or Right
Figure provided by Fraunhofer FIRST, Berlin, Germany
Training
• Imagine left/right-hand movement
following the letter on the screen
Movie provided by Fraunhofer FIRST, Berlin, Germany
Testing: Playing Games
• “Brain-Pong”
Movie provided by Fraunhofer FIRST, Berlin, Germany
Non-Stationarity in EEG Features
• Different mental conditions (attention, sleepiness, etc.) between the training and test phases may change the EEG signals.
[Figures: bandpower differences between training and test phases; features extracted from brain activity during training and test phases]
Figures provided by Fraunhofer FIRST, Berlin, Germany
Robot Control by Reinforcement Learning
• Let the robot learn how to move autonomously, without explicit supervision.
[Figure: Khepera Robot]

Rewards
• Moving autonomously = going forward without hitting the wall.
• Give the robot rewards:
  – Go forward: positive reward
  – Hit wall: negative reward
• Goal: learn the control policy that maximizes future rewards.
Example
• After learning:
Policy Iteration and Covariate Shift
• Policy iteration: evaluate the control policy, improve the control policy, and repeat.
• Updating the policy corresponds to changing the input distribution!
Different Types of Sample Selection Bias

Bias as Distribution
• Think of "sampling an example (x,y) into the training data" as an event denoted by a random variable s:
  – s=1: example (x,y) is sampled into the training data
  – s=0: example (x,y) is not sampled
• Think of bias as a conditional probability of "s=1" dependent on x and y.
• P(s=1|x,y): the probability for (x,y) to be sampled into the training data, conditional on the example's feature vector x and class label y.
Categorization
(Zadrozny'04, Fan et al.'05, Fan and Davidson'07)
• No sample selection bias: P(s=1|x,y) = P(s=1)
• Feature bias / covariate shift: P(s=1|x,y) = P(s=1|x)
• Class bias: P(s=1|x,y) = P(s=1|y)
• Complete bias: no further reduction possible (see the sketch of feature bias below)
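To make feature bias concrete, here is a minimal sketch (ours, not from the tutorial) that simulates P(s=1|x,y) = P(s=1|x) on the earlier toy problem: selection into the training set depends only on the feature vector, never on the label.

```python
import numpy as np

# Feature bias / covariate shift on the toy problem: the chance of an
# example entering the training set depends only on x, not on y.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(10_000, 2))
y = (X[:, 1] > X[:, 0]).astype(int)  # toy rule: "red" iff f2 > f1

# P(s=1|x): a logistic selection probability that prefers small f1.
p_select = 1.0 / (1.0 + np.exp(5.0 * X[:, 0]))
s = rng.random(len(X)) < p_select

X_train, y_train = X[s], y[s]  # biased training sample
# P(y|x) is unchanged, but the input density of X_train is shifted.
```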
Bias for a Training Set
• How is P(s=1|x,y) computed?
• Practically, for a given training set D:
  – P(s=1|x,y) = 1 if (x,y) is sampled into D
  – P(s=1|x,y) = 0 otherwise
• Alternatively, consider that a dataset of D's size could be sampled "exhaustively" from the universe of examples.
Are Realistic Datasets Biased?
• Most datasets are biased: it is unlikely that each and every feature vector is sampled.
• For most problems, there is at least feature bias:
  – P(s=1|x,y) = P(s=1|x)
Effect on Learning
• Learning algorithms estimate the "true conditional probability":
  – True probability: P(y|x), such as P(fraud|x)
  – Estimated probability: P(y|x,M), where M is the model built
• The conditional probability in the biased data is P(y|x,s=1).
• Key issue: does P(y|x,s=1) = P(y|x)?
Bias Resolutions
Heckman's Two-Step Approach
• Example: estimate one's donation amount, given that one does donate.
• An accurate estimate cannot be obtained by a regression using only data from donors.
• First step: a probit model estimates the probability to donate.
• Second step: a regression model estimates the donation amount, correcting the expected error under a Gaussian assumption.
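The slide equations are not reproduced here; a standard textbook formulation of Heckman's two-step correction, in our notation, is:

```latex
% Step 1 (probit): probability of being observed (donating)
P(s = 1 \mid x) = \Phi(x^{\top}\alpha)

% Step 2 (regression on donors, with a selection-correction term)
\mathbb{E}[\, y \mid x,\ s = 1 \,]
  = x^{\top}\beta
  + \rho\,\sigma_{\varepsilon}\,
    \frac{\phi(x^{\top}\alpha)}{\Phi(x^{\top}\alpha)}
```

Here Φ and φ are the standard normal CDF and PDF; the ratio φ/Φ (the inverse Mills ratio) is estimated in the first step and added as a regressor in the second, which is exactly the expected error that a donors-only regression would omit under the Gaussian assumption.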
Covariate Shift or Feature Bias
• There is no chance for generalization if training and test samples have nothing in common.
• Covariate shift:
  – The input distribution changes: p_train(x) ≠ p_test(x)
  – The functional relation remains unchanged: P(y|x) stays the same
Example of Covariate Shift
• (Weak) extrapolation: predict output values outside the training region.
[Figure: training samples concentrated in one region, test samples in another]
Covariate Shift Adaptation
• To illustrate the effect of covariate shift, let's focus on linear extrapolation.
[Figure: training samples, test samples, the true function, and the learned function]

Generalization Error = Bias + Variance
• The generalization error decomposes into a bias term and a variance term, where the expectation is taken over the noise.
Model Specification
• A model is said to be correctly specified if some parameter setting makes it coincide with the true function.
• In practice, our model may not be correct.
• Therefore, we need a theory for misspecified models!
Ordinary Least-Squares (OLS)
$$ \hat{\theta}_{\mathrm{OLS}} = \arg\min_{\theta} \sum_{i=1}^{n} \bigl( \hat{f}(x_i; \theta) - y_i \bigr)^2 $$
• If the model is correct, OLS minimizes the bias asymptotically.
• If the model is misspecified, OLS does not minimize the bias even asymptotically. We want to reduce the bias!
Law of Large Numbers
• The sample average converges to the population mean:
$$ \frac{1}{n} \sum_{i=1}^{n} g(x_i) \;\longrightarrow\; \mathbb{E}_{x}[\, g(x) \,] $$
• We want to estimate the expectation over test input points using only training input points.
Key Trick: Importance-Weighted Average
• Importance: the ratio of the test and training input densities,
$$ w(x) = \frac{p_{\mathrm{test}}(x)}{p_{\mathrm{train}}(x)} $$
• Importance-weighted average (cf. importance sampling):
$$ \frac{1}{n} \sum_{i=1}^{n} w(x_i^{\mathrm{tr}})\, g(x_i^{\mathrm{tr}}) \;\longrightarrow\; \mathbb{E}_{x \sim p_{\mathrm{test}}}[\, g(x) \,] $$
Importance-Weighted LS (IWLS)
(Shimodaira, JSPI2000)
$$ \hat{\theta}_{\mathrm{IWLS}} = \arg\min_{\theta} \sum_{i=1}^{n} w(x_i) \bigl( \hat{f}(x_i; \theta) - y_i \bigr)^2 $$
(the importance w(x) is assumed strictly positive)
• Even for misspecified models, IWLS minimizes the bias asymptotically.
• In practice, we need to estimate the importance (a sketch of IWLS itself follows below).
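A minimal sketch of IWLS for a linear model (ours; the weighted normal equations, assuming the importance weights are already given):

```python
import numpy as np

def iwls(X, y, w):
    """Importance-weighted least squares for a linear model.

    X: (n, d) training inputs, y: (n,) training outputs,
    w: (n,) importance weights p_test(x_i) / p_train(x_i).
    Minimizes sum_i w_i * (x_i @ theta - y_i)**2.
    """
    Xw = X * w[:, None]                       # row-scale X by the weights
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)

# Setting w = np.ones(len(X)) recovers ordinary least squares.
```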
Use of Unlabeled Samples: Importance Estimation
• Assumption: we have training inputs and (unlabeled) test inputs.
• Naïve approach: estimate the training and test input densities separately, and take the ratio of the density estimates.
• This does not work well, since density estimation is hard in high dimensions.
Vapnik's Principle
• When solving a problem, more difficult problems shouldn't be solved as intermediate steps (e.g., support vector machines).
• Knowing the two densities implies knowing their ratio, but not vice versa.
• Directly estimating the ratio is easier than estimating the densities!
Modeling the Importance Function
• Use a linear importance model:
$$ \hat{w}(x) = \sum_{\ell=1}^{b} \alpha_{\ell}\, \varphi_{\ell}(x) $$
• The test density is then approximated by
$$ \hat{p}_{\mathrm{test}}(x) = \hat{w}(x)\, p_{\mathrm{train}}(x) $$
• Idea: learn the coefficients so that ŵ(x) approximates w(x) well.
Kullback-Leibler Divergence
$$ \mathrm{KL}\bigl[ p_{\mathrm{test}} \,\|\, \hat{p}_{\mathrm{test}} \bigr] = \int p_{\mathrm{test}}(x) \log \frac{p_{\mathrm{test}}(x)}{\hat{w}(x)\, p_{\mathrm{train}}(x)}\, dx = C - \int p_{\mathrm{test}}(x) \log \hat{w}(x)\, dx $$
(the first term C is a constant; only the second term is relevant)

Learning the Importance Function
• Thus, maximize the objective function
$$ \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \log \hat{w}(x_j^{\mathrm{te}}) $$
• Since p̂_test is a density, we have the constraint
$$ \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \hat{w}(x_i^{\mathrm{tr}}) = 1, \qquad \alpha_{\ell} \ge 0 $$
KLIEP (Kullback-Leibler Importance Estimation Procedure)
(Sugiyama et al., NIPS2007)
• Convexity: a unique global solution is available.
• Sparse solution: prediction is fast!
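A compact sketch of KLIEP with Gaussian basis functions centered at test points (ours; projected gradient ascent is used here for brevity, whereas the paper uses a feasible-direction method, and the kernel width sigma would in practice be chosen by likelihood cross-validation as described below):

```python
import numpy as np

def kliep(x_tr, x_te, sigma=1.0, n_iter=2000, lr=1e-2, seed=0):
    """Estimate importance weights w(x) = p_te(x) / p_tr(x) via KLIEP.

    Model: w_hat(x) = sum_l alpha_l * exp(-||x - c_l||^2 / (2 sigma^2)),
    with centers c_l at a subset of the test points. Maximizes the mean
    log w_hat over test points subject to alpha >= 0 and
    mean_i w_hat(x_tr_i) = 1.
    """
    rng = np.random.default_rng(seed)
    centers = x_te[rng.choice(len(x_te), min(100, len(x_te)), replace=False)]

    def phi(x):  # Gaussian basis values, shape (n, n_centers)
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    Phi_te, Phi_tr = phi(x_te), phi(x_tr)
    alpha = np.full(len(centers), 1.0 / len(centers))
    for _ in range(n_iter):
        # Gradient of the mean log-importance at the test points.
        grad = (Phi_te / (Phi_te @ alpha)[:, None]).mean(axis=0)
        alpha = np.maximum(alpha + lr * grad, 0.0)  # step + alpha >= 0
        alpha /= Phi_tr.mean(axis=0) @ alpha        # normalization constraint
    return Phi_tr @ alpha  # importance weights at the training points
```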
Examples
Experiments: Setup
• Input distributions: standard Gaussians with
  – Training: mean (0,0,…,0)
  – Test: mean (1,0,…,0)
• Kernel density estimation (KDE):
  – Separately estimate the training and test input densities.
  – The Gaussian kernel width is chosen by likelihood cross-validation.
• KLIEP:
  – The Gaussian kernel width is chosen by likelihood cross-validation.

Experimental Results
[Figure: normalized MSE of the estimated importance vs. input dimensionality, for KDE and KLIEP]
• KDE: the error increases as the dimensionality grows.
• KLIEP: the error remains small even for large dimensionality.
Ensemble Methods (Fan and Davidson'07)
• Average the estimated class probabilities, weighted by the model posterior:
$$ P(y \mid x, D) = \sum_{\theta} P(y \mid x, \theta)\, P(\theta \mid D) $$
(integration over the model space, with posterior weighting)
• Removes model uncertainty by averaging.
How to Use Them
• Estimate the "joint probability" P(x,y) instead of just the conditional probability:
  – P(x,y) = P(y|x)P(x)
  – This makes no difference with a single model, but it does with multiple models.
Examples of How This Works
• Suppose P1(+|x) = 0.8 and P2(+|x) = 0.4, so P1(-|x) = 0.2 and P2(-|x) = 0.6.
• With plain model averaging:
  – P(+|x) = (0.8 + 0.4) / 2 = 0.6
  – P(-|x) = (0.2 + 0.6) / 2 = 0.4
  – The prediction is +.
• But suppose there are also two P(x) models, giving probabilities 0.05 and 0.4. Then:
  – P(+,x) = 0.05 * 0.8 + 0.4 * 0.4 = 0.2
  – P(-,x) = 0.05 * 0.2 + 0.4 * 0.6 = 0.25
  – The prediction is now – instead of +.
• Key idea: unlabeled examples can be used as "weights" to re-weight the models (see the snippet below).
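The arithmetic above, written out as a quick check (numbers from the slide; the two P(x) values play the role of density models estimated from unlabeled data):

```python
p_plus = [0.8, 0.4]   # P(+|x) under models 1 and 2
p_x    = [0.05, 0.4]  # P(x) under the two density models

# Plain averaging of conditionals: predict "+"
avg_plus = sum(p_plus) / len(p_plus)                           # 0.6

# Joint-probability averaging P(y, x): predict "-"
joint_plus  = sum(px * p for px, p in zip(p_x, p_plus))        # 0.20
joint_minus = sum(px * (1 - p) for px, p in zip(p_x, p_plus))  # 0.25
```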
Structure Discovery (Ren et al.'08)
[Diagram: original dataset → structural discovery → structural re-balancing → corrected dataset]
Active Learning
• The quality of the learned function depends on the locations of the training inputs.
[Figure: good vs. poor input locations for the target and learned functions]
• Goal: optimize the training input locations.
Challenges
• The generalization error is unknown and needs to be estimated.
• In experiment design, we do not yet have the training output values.
• Thus we cannot use, e.g., cross-validation, which requires the outputs.
• Only the training input positions can be used in generalization error estimation!
Agnostic Setup
• The model is not correct in practice; then OLS is not consistent.
• The standard "experiment design" method does not work!
  (Fedorov 1972; Cohn et al., JAIR1996)
Bias Reduction by Importance-Weighted LS (IWLS)
(Wiens JSPI2001; Kanamori & Shimodaira JSPI2003; Sugiyama JMLR2006)
• The use of IWLS mitigates the problem of inconsistency in the agnostic setup.
• The importance is known in the active learning setup, since the training input distribution is designed by us!
Model Selection and Testing
Model Selection
• The choice of the model is crucial:
[Figure: fits with polynomials of order 1, 2, and 3]
• We want to choose the model so that the generalization error is minimized.
Generalization Error Estimation
• The generalization error is not accessible, since the target function is unknown.
• Instead, we use a generalization error estimate.
[Figure: true generalization error and its estimate as functions of model complexity]
Cross-Validation
• Divide the training samples into k groups.
• Train a learning machine with k-1 groups.
• Validate the trained machine on the remaining group.
• Repeat this for all combinations and output the mean validation error.
[Diagram: groups 1, 2, …, k-1, k; one group is held out for validation, the rest are used for training]
• CV is almost unbiased without covariate shift.
• But it is heavily biased under covariate shift!
Importance-Weighted CV (IWCV)
(Zadrozny ICML2004; Sugiyama et al., JMLR2007)
• When testing the classifier in the CV process, we also importance-weight the validation error (a sketch follows below).
[Diagram: sets 1, 2, …, k-1, k; one set is held out for testing, the rest are used for training]
• IWCV gives almost unbiased estimates of the generalization error even under covariate shift.
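A minimal sketch of k-fold IWCV (ours; `make_model` stands for any scikit-learn-style estimator factory, and `w` holds precomputed importance weights, e.g., from KLIEP):

```python
import numpy as np

def iwcv_error(make_model, X, y, w, k=5, seed=0):
    """k-fold importance-weighted cross-validation (0/1 loss).

    Validation errors are weighted by w[i] ~ p_test(x_i) / p_train(x_i),
    keeping the estimate nearly unbiased under covariate shift.
    """
    idx = np.random.default_rng(seed).permutation(len(X))
    fold_errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        model = make_model().fit(X[train], y[train])
        miss = (model.predict(X[fold]) != y[fold]).astype(float)
        fold_errors.append(np.mean(w[fold] * miss))  # importance-weighted
    return float(np.mean(fold_errors))
```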
Example of IWCV
• IWCV gives better estimates of the generalization error.
• Model selection by IWCV outperforms plain CV!
Reverse Testing (Fan and Davidson'06)
[Diagram: algorithms A and B are trained on the labeled training data, yielding models MA and MB; each model labels the test data, producing pseudo-labeled sets DA and DB; both algorithms are then re-trained on DA (yielding MAA, MAB) and on DB (yielding MBA, MBB)]
Estimate the performance of MA and MB based on the order of MAA, MAB, MBA and MBB.
Rule
• If "A's labeled test data" leads to more accurate models for both algorithms A and B when evaluated on the labeled training data, then A is expected to be more accurate:
  – If MAA > MAB and MBA > MBB, then choose A.
• Similarly, if MAA < MAB and MBA < MBB, then choose B.
• Otherwise: undecided.
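A sketch of the selection rule (ours; `fit_a`/`fit_b` are assumed training functions returning models with `.predict`, and `score` is an assumed helper returning accuracy on labeled data):

```python
def reverse_test(fit_a, fit_b, X_tr, y_tr, X_te, score):
    """Compare algorithms A and B without test labels (Fan and Davidson'06).

    Each model labels the test inputs; both algorithms are retrained on
    each pseudo-labeled test set and scored on the labeled training data.
    """
    pseudo = {"A": fit_a(X_tr, y_tr).predict(X_te),   # DA: labeled by MA
              "B": fit_b(X_tr, y_tr).predict(X_te)}   # DB: labeled by MB
    # acc[(dataset, algorithm)], e.g., acc[("A", "B")] corresponds to MAB.
    acc = {(lab, name): score(fit(X_te, pseudo[lab]), X_tr, y_tr)
           for lab in ("A", "B")
           for name, fit in (("A", fit_a), ("B", fit_b))}
    if acc[("A", "A")] > acc[("A", "B")] and acc[("B", "A")] > acc[("B", "B")]:
        return "A"   # MAA > MAB and MBA > MBB
    if acc[("A", "A")] < acc[("A", "B")] and acc[("B", "A")] < acc[("B", "B")]:
        return "B"
    return "undecided"
```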
Why won't CV work here?
Examples
Ozone Day Prediction (Zhang et al.'06)
• Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ).
Challenges as a Data Mining Problem
1. Rather skewed and relatively sparse distribution:
   – 2500+ examples over 7 years (1998-2004)
   – 72 continuous features with missing values
   – Large instance space: if the features were binary and uncorrelated, 2^72 is an astronomical number
2. Few positives: only 2% and 5% true positive ozone days for the 1-hour and 8-hour peaks, respectively.
3. A large number of irrelevant features:
   – Only about 10 of the 72 features are verified to be relevant; there is no information on the relevancy of the other 62 features.
   – For a stochastic problem with irrelevant features Xir, where X = (Xr, Xir), P(Y|X) = P(Y|Xr) only if the data is exhaustive.
   – Irrelevant features may introduce overfitting and change the probability distribution represented in the data, pushing the estimates toward
     • P(Y = "ozone day" | Xr, Xir) → 1
     • P(Y = "normal day" | Xr, Xir) → 0
4. "Feature sample selection bias":
   – Given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future.
• Given these, two closely-related challenges:
  1. How to train an accurate model
  2. How to effectively use the model to predict the future, whose distribution differs from the training distribution and is yet unknown
[Figure: training distribution vs. testing distribution of the positive examples]
Reliable Probability Estimation under Irrelevant Features
• Recall that, due to irrelevant features, the estimates may be pushed toward
  – P(Y = "ozone day" | Xr, Xir) → 1
  – P(Y = "normal day" | Xr, Xir) → 0
• Remedy: construct multiple models and average their predictions.
  – P("ozone" | xr): the true probability
  – P("ozone" | Xr, Xir, θ): the probability estimated by model θ
  – MSE_SingleModel: difference between the "true" and "estimated" probabilities
  – MSE_Average: difference between the "true" probability and the "average of many models"
• One can formally show that MSE_Average ≤ MSE_SingleModel (see the derivation below).
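This inequality is an instance of Jensen's inequality for the convex squared loss; a one-line derivation (ours) for models θ1, …, θm, with true probability p and model estimates p_i = P("ozone" | Xr, Xir, θi):

```latex
\left( \frac{1}{m}\sum_{i=1}^{m} p_i - p \right)^{\!2}
= \left( \frac{1}{m}\sum_{i=1}^{m} (p_i - p) \right)^{\!2}
\;\le\; \frac{1}{m}\sum_{i=1}^{m} (p_i - p)^2
\quad\Longrightarrow\quad
\mathrm{MSE}_{\mathrm{Average}} \le \mathrm{MSE}_{\mathrm{SingleModel}}
```

where the right-hand side is read as the average MSE of the individual models.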
Decision Threshold Selection
• A CV-based procedure for selecting the decision threshold VE:
  – For each of the 10 folds, train on the remaining folds and record the estimated probabilities P(y = "ozone day" | x, θ) on the held-out fold (e.g., 7/1/98 → 0.1316, normal; 7/2/98 → 0.6245, ozone; 7/3/98 → 0.5944, ozone; …).
  – Pool the held-out estimates over all folds, plot precision and recall as functions of the threshold, and pick the decision threshold VE from the plot.
[Figure: precision-recall curves vs. decision threshold (0.0-1.0) for models Ma and Mb, with the chosen threshold VE marked]
Addressing Data Mining Challenges
• Prediction with feature sample selection bias:
  – Train θ on the whole training set; on future days, predict "ozone day" if P(Y = "ozone day" | X, θ) ≥ VE, using the decision threshold selected above.
Results
KDD/Netflix Cup'07 Task 1 (Liu and Kou'07)
• Task 1: Who rated what in 2006. Given a list of 100,000 user-movie pairs, predict for each pair the probability that the user rated the movie in 2006.
• Result: they were the close runner-up, No. 3 out of 39 teams.
• Challenges:
  – Huge amount of data: how to sample the data so that any learning algorithm can be applied is critical.
  – Complex affecting factors: decreasing interest in old movies, and the growing tendency of Netflix users to watch (and review) more movies.
Netflix Data Generation Process
[Diagram: a ratings matrix of 17K movies by users grows from 1998 to 2005 as new users and movies arrive (entries such as 4, 5, ?, 3, 2, ?); the training data end in 2005, Task 1 asks who rated what in 2006 (with no new user or movie arrivals), and the qualifier dataset contains 3M pairs]
Task 1: Effective Sampling Strategies
• Sample movie-user pairs for "existing" users and "existing" movies, using 2004-2005 data as the training set and 4Q 2005 as the development set.
  – The probability of picking a movie was proportional to the number of ratings that movie received; the same strategy was used for users (see the sketch below).
[Diagram: per-movie and per-user sampling probabilities (e.g., Movie5 .0011, Movie3 .001, Movie4 .0007; User7 .0007, User6 .00012, User8 .00003) are combined to draw sample pairs such as (Movie5, User7) from the rating history, whose records look like "1488844,3,2005-09-06"]
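A sketch of this proportional sampling strategy (ours; `movie_counts` and `user_counts` are assumed arrays holding the number of ratings per movie and per user):

```python
import numpy as np

def sample_pairs(movie_counts, user_counts, n, seed=0):
    """Draw (movie, user) index pairs with probability proportional to
    the number of ratings each movie / user received."""
    rng = np.random.default_rng(seed)
    movies = rng.choice(len(movie_counts), size=n,
                        p=movie_counts / movie_counts.sum())
    users = rng.choice(len(user_counts), size=n,
                       p=user_counts / user_counts.sum())
    return list(zip(movies, users))
```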
• Learning algorithms:
  – Single classifiers: logistic regression, ridge regression, decision trees, support vector machines
  – Naïve ensemble: combining sub-classifiers built on different types of features with preset weights
  – Ensemble classifiers: combining sub-classifiers with weights learned from the development set
Brain-Computer Interface (BCI)
• Control computers by brain signals:
– Input: EEG signals
– Output: Left or Right
BCI Results
Classification error rates without/with covariate shift adaptation, and the KL divergence from the training to the test input distribution:

Subject  Trial  No adaptation  With adaptation  KL
   1       1        9.3 %          10.0 %       0.76
   1       2        8.8 %           8.8 %       1.11
   1       3        4.3 %           4.3 %       0.69
   2       1       40.0 %          40.0 %       0.97
   2       2       39.3 %          38.7 %       1.05
   2       3       25.5 %          25.5 %       0.43
   3       1       36.9 %          34.4 %       2.63
   3       2       21.3 %          19.3 %       2.88
   3       3       22.5 %          17.5 %       1.25
   4       1       21.3 %          21.3 %       9.23
   4       2        2.4 %           2.4 %       5.58
   4       3        6.4 %           6.4 %       1.83
   5       1       21.3 %          21.3 %       0.79
   5       2       15.3 %          14.0 %       2.01

• When the KL divergence is large, covariate shift adaptation tends to improve accuracy.
• When the KL divergence is small, there is no difference.
Robot Control by Reinforcement Learning
• Swing-up inverted pendulum: swing the pole up by controlling the cart.
• Reward: [definition shown in the slide]

Results
[Movies: covariate shift adaptation compared with existing methods (a) and (b); demo of the proposed method]
Wafer Alignment in Semiconductor Exposure Apparatus
• Recent silicon wafers have a layer structure.
• Circuit patterns are exposed multiple times.
• Exact alignment of the wafers is very important.

Markers on Wafer
• Wafer alignment process:
  – Measure the locations of markers printed on the wafer.
  – Shift and rotate the wafer to minimize the gap.
• To speed this up, reducing the number of markers to measure is very important: an active learning problem!
Non-linear Alignment Model
• When the gap consists only of shift and rotation, a linear model is exact.
• However, non-linear factors exist, e.g.:
  – Warp
  – Biased characteristics of the measurement apparatus
  – Different temperature conditions
• Exactly modeling the non-linear factors is very difficult in practice: an agnostic setup!
Experimental Results
(Sugiyama & Nakajima, ECML-PKDD2008)
• Mean squared error of wafer position estimation: 20 markers (out of 38) are chosen by each experiment design method, the gaps of all markers are predicted, and this is repeated for 220 different wafers. Entries are the mean (standard deviation) of the gap prediction error:

  IWLS-based   OLS-based    Passive      "Outer" heuristic
  2.27 (1.08)  2.37 (1.15)  2.36 (1.15)  2.32 (1.11)

  (In the original slide, red marks results significantly better by the 5% Wilcoxon test, and blue marks results worse than the baseline passive method.)
• IWLS-based active learning works very well!
Conclusions

Book on Dataset Shift
• Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, Cambridge, 2008.