Dual Strategy Active Learning
DUAL STRATEGY ACTIVE LEARNING
Presenter: Pinar Donmez¹
Joint work with Jaime G. Carbonell¹ & Paul N. Bennett²
¹ Language Technologies Institute, Carnegie Mellon University
² Microsoft Research
Active Learning (Pool-based)
[Diagram: the pool-based active learning loop. The learning mechanism learns a new model from the labeled data, issues a label request for an instance selected from the unlabeled pool (the data source), the expert/user returns the label, and the newly labeled example is added to the labeled data before retraining.]
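In code, the loop is short. Below is a minimal, illustrative Python sketch; the toy dataset, the logistic-regression learner, the seed set, and the closest-to-boundary query rule are all assumptions made for the example, with the expert simulated by revealing held-back labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy pool: two Gaussian blobs, one per class. The pool plays the "Data Source";
# y_pool holds the hidden labels that the simulated "Expert" reveals on request.
X_pool = rng.normal(size=(200, 2)) + np.repeat([[0.0, 0.0], [3.0, 3.0]], 100, axis=0)
y_pool = np.repeat([0, 1], 100)

labeled = [0, 1, 100, 101]                    # small seed set containing both classes
unlabeled = [i for i in range(200) if i not in labeled]

for _ in range(20):
    # learn a new model from the current labeled data
    model = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
    # label request: pick the unlabeled point closest to the decision boundary
    probs = model.predict_proba(X_pool[unlabeled])[:, 1]
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    # the expert labels it; move it from the unlabeled pool to the labeled set
    unlabeled.remove(query)
    labeled.append(query)
```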
Two Different Trends in Active Learning
Uncertainty Sampling:
selects the example with the lowest certainty
e.g. closest to the decision boundary, maximum entropy, ...
Density-based Sampling:
considers the underlying data distribution
selects representatives of large clusters
aims to cover the input space quickly
e.g. representative sampling, active learning using pre-clustering, etc.
(one common instantiation of each selection rule is sketched in code below)
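The two families differ only in how they score candidates. A minimal sketch, assuming a binary classifier posterior `p_pos` for the uncertainty score and a Gaussian-kernel average for the density score; these are generic stand-ins, not the specific methods cited above.

```python
import numpy as np

def uncertainty_scores(p_pos):
    """Entropy of P(y|x): largest for points closest to the decision boundary."""
    p = np.clip(p_pos, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def density_scores(X, bandwidth=1.0):
    """Mean Gaussian similarity to the rest of the pool: large inside big clusters."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth**2)).mean(axis=1)

# Uncertainty sampling queries the most ambiguous point; density-based
# sampling queries a representative of a dense region instead.
p_pos = np.array([0.05, 0.48, 0.92, 0.70])
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.1]])
query_by_uncertainty = int(np.argmax(uncertainty_scores(p_pos)))  # -> index 1
query_by_density = int(np.argmax(density_scores(X)))              # -> a clustered point
```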
Goal of this Work
Find an active learning method that works well everywhere
Some methods work best when very few instances have been sampled (e.g. density-based sampling)
Some work best after substantial sampling (e.g. uncertainty sampling)
Combine the best of both worlds for superior performance
Main Features of DUAL
DUAL
is dynamic rather than static
is context-sensitive
builds upon the work “Active Learning with Pre-Clustering” (Nguyen & Smeulders, 2004)
proposes a mixture model of density and uncertainty
DUAL’s primary focus is to
outperform static strategies over a large operating range
improve learning in the later iterations rather than concentrating on the initial data labeling
Related Work
                         DUAL   AL with Pre-Clustering   Representative Sampling   COMB
Clustering               Yes    Yes                      Yes                       No
Uncertainty + Density    Yes    Yes                      Yes                       No
Dynamic                  Yes    No                       No                        Yes
Active Learning with Pre-Clustering
We call it Density-Weighted Uncertainty Sampling (DWUS for short). Why?
Selection criterion (an uncertainty score weighted by a density score):

x_s = \arg\max_{i \in I_U} E[(\hat{y}_i - y_i)^2 \mid x_i] \, p(x_i)    [1]

assumes a hidden clustering structure of the data
calculates the posterior P(y | x) as

P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k, x) \, P(k \mid x)    [2]

x and y are conditionally independent given k, since points in one cluster are assumed to share the same label:

P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k) \, P(k \mid x)    [3]
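In code, Eq. [1] is a one-liner once the posterior and density estimates exist. A minimal sketch, assuming binary 0/1 labels with ŷ = P(y=1|x), under which the expected squared error from the backup slides reduces to p(1 − p); `p_pos` and `density` are hypothetical arrays over the unlabeled pool.

```python
import numpy as np

def dwus_select(p_pos, density):
    """Eq. [1]: x_s = argmax E[(y_hat - y)^2 | x] * p(x).

    With 0/1 labels and y_hat = P(y=1|x), the expected squared error
    (y_hat - 1)^2 P(y=1|x) + y_hat^2 P(y=0|x) simplifies to p * (1 - p).
    """
    expected_error = p_pos * (1 - p_pos)             # uncertainty score
    return int(np.argmax(expected_error * density))  # weighted by density score
```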
Outline of DWUS
1. Cluster the data using the K-medoid algorithm to find the cluster centroids c_k
2. Estimate P(k|x) by a standard EM procedure
3. Model P(y|k) as a logistic regression classifier:

   P(y \mid k) = \frac{1}{1 + \exp(-y (c_k \cdot a + b))}

4. Estimate P(y|x) using P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k) \, P(k \mid x)
5. Select an unlabeled instance using Eq. 1
6. Update the parameters of the logistic regression model (hence update P(y|k))
7. Repeat steps 3-5 until the stopping criterion is met
(steps 3-5 are sketched in code below)
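A compressed sketch of steps 3-5, assuming the K-medoid centroids (step 1) and the cluster posteriors P(k|x) (step 2) are already computed; `a` and `b` are the logistic-regression parameters. This illustrates the data flow, not the authors' exact implementation.

```python
import numpy as np

def p_y_given_k(centroids, a, b):
    """Step 3: P(y=1 | k) = 1 / (1 + exp(-(c_k . a + b))), one value per cluster."""
    return 1.0 / (1.0 + np.exp(-(centroids @ a + b)))

def dwus_step(centroids, a, b, p_k_given_x, density, unlabeled):
    """Steps 4-5: propagate cluster labels to points, then select via Eq. 1."""
    # Step 4: P(y=1|x) = sum_k P(y=1|k) P(k|x)
    p_pos = p_k_given_x @ p_y_given_k(centroids, a, b)
    # Step 5: Eq. 1 with E[(y_hat - y)^2 | x] = p(1 - p) for 0/1 labels
    scores = p_pos * (1 - p_pos) * density
    return unlabeled[int(np.argmax(scores[unlabeled]))]
```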
Notes on DWUS
Posterior class distribution:

P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k) \, P(k \mid x)

P(y | k) is calculated via

P(y \mid k) = \frac{1}{1 + \exp(-y (c_k \cdot a + b))}

P(k|x) is estimated using an EM procedure after the clustering:

p(x) = \sum_{k=1}^{K} p(x \mid k) \, P(k)

p(x | k) is a multivariate Gaussian with the same σ for all clusters:

p(x \mid k) = (2\pi)^{-d/2} \sigma^{-d} \exp\left\{-\frac{\lVert x - c_k \rVert^2}{2\sigma^2}\right\}

The logistic regression log-likelihood used to estimate the parameters:

L = \sum_{i \in I_l} \ln P(y_i \mid x_i; a, b)
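Because every cluster shares the same σ, the Gaussian normalizer (2π)^{-d/2} σ^{-d} cancels when computing P(k|x), which keeps the E-step simple. A minimal sketch of the two EM quantities given in the backup slides; `centroids`, `priors`, and `sigma` are assumed given.

```python
import numpy as np

def cluster_posteriors(X, centroids, priors, sigma):
    """E-step: P(k|x_i) is proportional to P(k) exp(-||x_i - c_k||^2 / (2 sigma^2));
    the shared-sigma normalizer cancels in the ratio."""
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    joint = priors[None, :] * np.exp(-sq_dists / (2 * sigma**2))
    return joint / joint.sum(axis=1, keepdims=True)

def update_priors(p_k_given_x):
    """M-step: P(k) = (1/n) sum_i P(k | x_i)."""
    return p_k_given_x.mean(axis=0)
```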
Motivation for DUAL
Strength of DWUS:
favors higher-density samples close to the decision boundary
fast decrease in error
But!
DWUS exhibits diminishing returns! Why?
• Early iterations -> many points are highly uncertain
• Later iterations -> points with high uncertainty no longer lie in dense regions
• DWUS wastes time picking instances with no direct effect on the error
How does DUAL do better?
Runs DWUS until it estimates a cross-over: it monitors the change in expected error at each iteration to detect when DWUS is stuck in a local minimum,

\hat{\epsilon}(DWUS_t) = \frac{1}{n_t} \sum_{i=1}^{n_t} E[(\hat{y}_i - y_i)^2 \mid x_i], \qquad \delta\hat{\epsilon}(DWUS_t) \approx 0

DUAL uses a mixture model after the cross-over (saturation) point:

x_s = \arg\max_{i \in I_U} \; \pi \, E[(\hat{y}_i - y_i)^2 \mid x_i] + (1 - \pi) \, p(x_i)

Our goal should be to minimize the expected future error.
If we knew the future error of Uncertainty Sampling (US) to be zero, then we'd force π = 1.
But in practice, we do not know it.
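Detecting the cross-over then amounts to tracking ε̂(DWUS_t) over iterations and flagging when its change flattens. A sketch; the tolerance is a hypothetical knob, not a value from the paper.

```python
def crossed_over(error_history, tol=1e-4):
    """True once delta eps_hat(DWUS_t) ~ 0, i.e. the estimated expected
    error has stopped improving and DWUS appears stuck."""
    if len(error_history) < 2:
        return False
    return abs(error_history[-1] - error_history[-2]) < tol
```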
More on DUAL
After the cross-over, US does better => the uncertainty score should be given more weight
π should reflect how well US performs
π can be calculated from the expected error of US on the unlabeled data* => \hat{\epsilon}(US)
Finally, we have the following selection criterion for DUAL:

x_s = \arg\max_{i \in I_U} \; (1 - \hat{\epsilon}(US)) \, E[(\hat{y}_i - y_i)^2 \mid x_i] + \hat{\epsilon}(US) \, p(x_i)

* US is allowed to choose data only from among the already sampled instances, and \hat{\epsilon}(US) is calculated on the remaining unlabeled set
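Putting the pieces together, the DUAL criterion is a convex combination of the two scores with the weight set by ε̂(US). A sketch reusing the p(1 − p) form of the expected error; `eps_us` stands for the estimated error ε̂(US) described in the footnote.

```python
import numpy as np

def dual_select(p_pos, density, eps_us, unlabeled):
    """x_s = argmax (1 - eps_hat(US)) * E[(y_hat - y)^2 | x] + eps_hat(US) * p(x).

    When US performs well (low eps_us) the uncertainty term dominates;
    when it performs poorly, weight shifts back toward density.
    """
    scores = (1 - eps_us) * p_pos * (1 - p_pos) + eps_us * density
    return unlabeled[int(np.argmax(scores[unlabeled]))]
```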
A Simple Illustration I-IV
[Four figures: scatter plots of a two-class toy dataset (points labeled 1 and 2), showing which instances the strategy selects over successive iterations.]
Experiments
initial training set size: 0.4% of the entire data (n+ = n-)
results are averaged over 4 runs; each run takes 100 iterations
DUAL outperforms
DWUS with p<0.0001 significance* after the 40th iteration
Representative Sampling (p<0.0001) on all datasets
COMB (p<0.0001) on 4 datasets, and with p<0.05 on Image and M-vs-N
US (p<0.001) on 5 datasets
DS (p<0.0001) on 5 datasets
* All significance results are based on a 2-sided paired t-test on the classification error
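For reference, the footnoted test is the standard two-sided paired t-test; in Python it could be run as below. The error arrays are fabricated placeholders purely to show the call, not the paper's numbers.

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-run classification errors at a fixed iteration (illustrative only).
errors_dual = np.array([0.12, 0.10, 0.11, 0.13])
errors_dwus = np.array([0.16, 0.15, 0.17, 0.14])

t_stat, p_value = ttest_rel(errors_dual, errors_dwus)  # two-sided paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```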
Results: DUAL vs DWUS, DUAL vs US, DUAL vs DS, DUAL vs COMB, DUAL vs Representative Sampling
[Figures: learning curves of classification error vs. number of labeled instances for each pairwise comparison.]
Failure Analysis
The current estimate of the cross-over point is not accurate on the V-vs-Y dataset => simulate a better error estimator
Currently, DUAL only considers the performance of US. But on Splice, DS is better => modify the selection criterion:

x_s = \arg\max_{i \in I_U} \; \hat{\epsilon}(DS) \, E[(\hat{y}_i - y_i)^2 \mid x_i] + (1 - \hat{\epsilon}(DS)) \, p(x_i)
Conclusion
DUAL robustly combines density and uncertainty (and can be generalized to other active sampling methods that exhibit differential performance)
DUAL leads to more effective performance than either individual strategy
DUAL shows that the error of one method can be estimated using the data labeled by the other
DUAL can be applied to multi-class problems where the error is estimated either globally or at the class or instance level
Future Work
Generalize DUAL to estimate which method is currently dominant, or use a relative success weight
Apply DUAL to more than two strategies to maximize the diversity of an ensemble
Investigate better techniques to estimate the future classification error
THANK YOU!
The error expectation for a given point:

E[(\hat{y}_i - y_i)^2 \mid x_i] = (\hat{y}_i - 1)^2 \, P(y_i = 1 \mid x_i) + \hat{y}_i^2 \, P(y_i = 0 \mid x_i)

Data density is estimated as a mixture of K Gaussians:

p(x) = \sum_{k=1}^{K} p(x \mid k) \, P(k)

EM procedure to estimate P(k):

P(k \mid x_i) \propto P(k) \exp\left\{-\frac{\lVert x_i - c_k \rVert^2}{2\sigma^2}\right\}, \qquad P(k) = \frac{1}{n} \sum_{i=1}^{n} P(k \mid x_i)

Likelihood:

L(a, b) = -\frac{\lambda}{2} \lVert a \rVert^2 + \sum_{i \in I_l} \ln\left\{\sum_{k=1}^{K} P(k \mid x_i) \, P(y_i \mid k; a, b)\right\}
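Note that substituting ŷ_i = P(y_i = 1 | x_i) into the error expectation gives a simple closed form, which is what the code sketches above use:

\hat{y}_i = P(y_i = 1 \mid x_i) = p \;\implies\; E[(\hat{y}_i - y_i)^2 \mid x_i] = (p - 1)^2 p + p^2 (1 - p) = p(1 - p)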