Dual Strategy Active Learning


Presenter: Pinar Donmez¹
Joint work with Jaime G. Carbonell¹ & Paul N. Bennett²
¹ Language Technologies Institute, Carnegie Mellon University
² Microsoft Research
Active Learning (Pool-based)
[Diagram: pool-based active learning loop. The learning mechanism draws unlabeled data from the data source, sends a label request to the expert, receives labeled data back, learns a new model, and returns output to the user.]
Two Different Trends in Active Learning
 Uncertainty Sampling:
 – selects the example about which the model is least certain, e.g., the one closest to the decision boundary or with maximum entropy
 Density-based Sampling:
 – considers the underlying data distribution
 – selects representatives of large clusters
 – aims to cover the input space quickly
 – e.g., representative sampling, active learning using pre-clustering, etc.
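To make the two trends concrete, here is a minimal sketch of each scoring rule, assuming a scikit-learn-style classifier `clf` exposing `predict_proba` and a NumPy unlabeled pool `X_pool` (both names are illustrative, not from the talk):

```python
import numpy as np

def uncertainty_scores(clf, X_pool):
    """Higher = closer to the decision boundary (binary case)."""
    p = clf.predict_proba(X_pool)[:, 1]
    return 1.0 - 2.0 * np.abs(p - 0.5)   # 1 at p=0.5, 0 at p in {0, 1}

def density_scores(X_pool, bandwidth=1.0):
    """Higher = denser neighborhood (crude Gaussian-kernel density estimate)."""
    sq = ((X_pool[:, None, :] - X_pool[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2)).mean(axis=1)

# Uncertainty sampling queries argmax(uncertainty_scores(clf, X_pool));
# density-based sampling queries argmax(density_scores(X_pool)).
```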
Goal of this Work
 Find an active learning method that works well everywhere
 – Some methods work best when very few instances have been sampled (e.g., density-based sampling)
 – Some work best after substantial sampling (e.g., uncertainty sampling)
 Combine the best of both worlds for superior performance
Main Features of DUAL
 DUAL
 – is dynamic rather than static
 – is context-sensitive
 – builds upon the work “Active Learning with Pre-Clustering” (Nguyen & Smeulders, 2004)
 – proposes a mixture model of density and uncertainty
 DUAL’s primary focus is to
 – outperform static strategies over a large operating range
 – improve learning in the later iterations rather than concentrating on the initial data labeling
Related Work

                        DUAL   AL with Pre-Clustering   Representative Sampling   COMB
Clustering              Yes    Yes                      Yes                       No
Uncertainty + Density   Yes    Yes                      Yes                       No
Dynamic                 Yes    No                       No                        Yes
Active Learning with Pre-Clustering
 We call it Density Weighted Uncertainty Sampling (DWUS in short). Why?

   $x_s = \arg\max_{i \in I_U} E[(\hat{y}_i - y_i)^2 \mid x_i] \; p(x_i)$   [1]

   (selection criterion = uncertainty score × density score)
 assumes a hidden clustering structure of the data
 calculates the posterior P(y | x) as

   $P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k, x)\, P(k \mid x)$   [2]

 x and y are conditionally independent given k, since points in one cluster are assumed to share the same label:

   $P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k)\, P(k \mid x)$   [3]
Outline of DWUS
1. Cluster the data using the K-medoid algorithm to find the cluster centroids $c_k$
2. Estimate $P(k \mid x)$ by a standard EM procedure
3. Model $P(y \mid k)$ as a logistic regression classifier:
   $P(y \mid k) = \dfrac{1}{1 + \exp(-y\,(c_k \cdot a + b))}$
4. Estimate $P(y \mid x)$ using $P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k)\, P(k \mid x)$
5. Select an unlabeled instance using Eq. 1
6. Update the parameters of the logistic regression model (hence update $P(y \mid k)$)
7. Repeat steps 3–5 until the stopping criterion is met
Notes on DWUS
 Posterior class distribution:
   $P(y \mid x) = \sum_{k=1}^{K} P(y, k \mid x) = \sum_{k=1}^{K} P(y \mid k)\, P(k \mid x)$
 $P(y \mid k)$ is calculated via
   $P(y \mid k) = \dfrac{1}{1 + \exp(-y\,(c_k \cdot a + b))}$
 $P(k \mid x)$ is estimated using an EM procedure after the clustering:
   $p(x) = \sum_{k=1}^{K} p(x \mid k)\, P(k)$
 $p(x \mid k)$ is a multivariate Gaussian with the same $\sigma$ for all clusters:
   $p(x \mid k) = (2\pi)^{-d/2}\, \sigma^{-d} \exp\{-\lVert x - c_k \rVert^2 / (2\sigma^2)\}$
 The parameters $(a, b)$ of the logistic regression model are estimated by maximizing
   $\sum_{i \in I_l} \ln P(y_i \mid x_i; a, b)$
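A minimal sketch of that EM procedure with the centroids held fixed, assuming NumPy arrays `X` (n×d) and `centers` (K×d) and a shared `sigma` (all names illustrative):

```python
import numpy as np

def estimate_cluster_posteriors(X, centers, sigma=1.0, n_iter=20):
    """EM for the mixing weights P(k) with fixed, shared-sigma components."""
    K = centers.shape[0]
    p_k = np.full(K, 1.0 / K)                  # start from a uniform prior
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    lik = np.exp(-sq / (2 * sigma ** 2))       # p(x|k) up to a constant
    for _ in range(n_iter):
        p_k_x = lik * p_k                      # E-step: P(k|x_i) ∝ P(k) p(x_i|k)
        p_k_x /= p_k_x.sum(axis=1, keepdims=True)
        p_k = p_k_x.mean(axis=0)               # M-step: P(k) = (1/n) Σ_i P(k|x_i)
    return p_k, p_k_x
```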
Motivation for DUAL
 Strength of DWUS:
 – favors high-density samples close to the decision boundary
 – fast decrease in error
 But! DWUS exhibits diminishing returns. Why?
 • Early iterations → many points are highly uncertain
 • Later iterations → points with high uncertainty are no longer in dense regions
 • DWUS wastes time picking instances with no direct effect on the error
How does DUAL do better?
 Runs DWUS until it estimates a cross-over point
 Monitors the change in expected error at each iteration to detect when DWUS is stuck in a local minimum, i.e., when the change in
   $\hat{\epsilon}(DWUS) = \frac{1}{n_t} \sum_{i} \hat{E}[(\hat{y}_i - y_i)^2 \mid x_i]$
 approaches zero: $\delta\hat{\epsilon}(DWUS) \approx 0$
 DUAL uses a mixture model after the cross-over (saturation) point:
   $x_s^* = \arg\max_{i \in I_U} \; \pi \cdot \hat{E}[(\hat{y}_i - y_i)^2 \mid x_i] + (1 - \pi) \cdot p(x_i)$
 Our goal should be to minimize the expected future error
 – If we knew the future error of Uncertainty Sampling (US) to be zero, then we would force $\pi = 1$
 – But in practice, we do not know it
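A sketch of how the cross-over detection might be coded, assuming `mean_err_history` collects the pool-averaged expected error after each DWUS iteration; `tol` and `patience` are illustrative knobs, not values from the paper:

```python
import numpy as np

def crossed_over(mean_err_history, tol=1e-4, patience=3):
    """Flag saturation once |delta eps_hat(DWUS)| stays below tol for
    `patience` consecutive iterations.

    mean_err_history[t] is assumed to hold (1/n_t) * sum_i E[(y_hat-y)^2 | x_i]
    over the unlabeled pool after iteration t.
    """
    deltas = np.abs(np.diff(mean_err_history[-(patience + 1):]))
    return len(deltas) >= patience and bool((deltas < tol).all())
```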
More on DUAL
 After the cross-over, US does better ⇒ the uncertainty score
 – should be given more weight
 – the weight should reflect how well US performs
 – it can be calculated from the expected error of US on the unlabeled data* ⇒ $\hat{\pi} = 1 - \hat{\epsilon}(US)$
 Finally, we have the following selection criterion for DUAL:
   $x_s^* = \arg\max_{i \in I_U} \; (1 - \hat{\epsilon}(US)) \cdot \hat{E}[(\hat{y}_i - y_i)^2 \mid x_i] + \hat{\epsilon}(US) \cdot p(x_i)$

* US is allowed to choose data only from among the already sampled instances, and $\hat{\epsilon}(US)$ is calculated on the remaining unlabeled set.
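Once the two score arrays and $\hat{\epsilon}(US)$ are in hand, the criterion is a one-liner; a minimal sketch with illustrative names (`exp_err`, `density`, `eps_us` are NumPy arrays / a scalar, not identifiers from the paper):

```python
import numpy as np

def dual_query(exp_err, density, eps_us):
    """exp_err[i] = E[(y_hat_i - y_i)^2 | x_i], density[i] = p(x_i),
    eps_us = estimated error of uncertainty sampling."""
    score = (1.0 - eps_us) * exp_err + eps_us * density
    return int(np.argmax(score))
```

Before the cross-over is detected, DUAL simply keeps running DWUS.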
A Simple Illustration I–IV
[Four figures omitted: two-class (1 vs. 2) scatter plots illustrating, step by step, which points are selected; only the plotted point labels survive in the transcript.]
Experiments
 Initial training set size: 0.4% of the entire data ($n_+ = n_-$)
 Results are averaged over 4 runs; each run takes 100 iterations
 DUAL outperforms
 – DWUS with p < 0.0001 significance* after the 40th iteration
 – Representative Sampling (p < 0.0001) on all datasets
 – COMB (p < 0.0001) on 4 datasets, and p < 0.05 on Image and M-vs-N
 – US (p < 0.001) on 5 datasets
 – DS (p < 0.0001) on 5 datasets
* All significance results are based on a 2-sided paired t-test on the classification error.
Results: DUAL vs. DWUS, US, DS, COMB, and Representative Sampling
[Figures omitted: classification-error curves for each comparison.]
Failure Analysis
 The current estimate of the cross-over point is not accurate on the V-vs-Y dataset ⇒ simulate a better error estimator
 Currently, DUAL only considers the performance of US. But on Splice, DS is better ⇒ modify the selection criterion:
   $x_s^* = \arg\max_{i \in I_U} \; \hat{\epsilon}(DS) \cdot \hat{E}[(\hat{y}_i - y_i)^2 \mid x_i] + (1 - \hat{\epsilon}(DS)) \cdot p(x_i)$
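A minimal sketch of this modified rule, mirroring `dual_query` above (inputs are NumPy arrays; `eps_ds` stands for $\hat{\epsilon}(DS)$ and is an illustrative name):

```python
def dual_query_ds(exp_err, density, eps_ds):
    """Same shape as dual_query above, but weighted by the estimated
    error of density sampling instead of uncertainty sampling."""
    score = eps_ds * exp_err + (1.0 - eps_ds) * density
    return int(score.argmax())
```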
Conclusion
 DUAL robustly combines density and uncertainty (and can be generalized to other active sampling methods that exhibit differential performance)
 DUAL leads to more effective performance than the individual strategies
 DUAL shows that the error of one method can be estimated using the data labeled by the other
 DUAL can be applied to multi-class problems, where the error is estimated either globally or at the class or instance level
Future Work
 Generalize DUAL to estimate which method is currently dominant, or use a relative success weight
 Apply DUAL to more than two strategies to maximize the diversity of an ensemble
 Investigate better techniques to estimate the future classification error
THANK YOU!
Appendix
 The error expectation for a given point:
   $\hat{E}[(\hat{y}_i - y_i)^2 \mid x_i] = (\hat{y}_i - 1)^2\, P(y_i = 1 \mid x_i) + \hat{y}_i^2\, P(y_i = 0 \mid x_i)$
 Data density is estimated as a mixture of K Gaussians:
   $p(x) = \sum_{k=1}^{K} p(x \mid k)\, P(k)$
 EM procedure to estimate P(k):
   $P(k \mid x_i) \propto P(k) \exp\{-\lVert x_i - c_k \rVert^2 / (2\sigma^2)\}$
   $P(k) = \frac{1}{n} \sum_{i=1}^{n} P(k \mid x_i)$
 Likelihood:
   $L(a, b) = -\frac{\lambda}{2} \lVert a \rVert^2 + \sum_{i \in I_l} \ln\Big\{\sum_{k=1}^{K} P(k \mid x_i)\, P(y_i \mid k; a, b)\Big\}$