Transcript Slides

Active Learning
Lecture 26
Maria-Florina Balcan
Active Learning
[Diagram: a Data Source supplies unlabeled examples to the Learning Algorithm; the algorithm repeatedly sends a request for the label of an example to an Expert / Oracle and receives a label for that example; finally, the algorithm outputs a classifier.]
• The learner can choose specific examples to be labeled.
• It works harder in order to use fewer labeled examples.
What Makes a Good Algorithm?
• Guaranteed to output a relatively good classifier
for most learning problems.
• Doesn’t make too many label requests.
• Choose the label requests carefully, to get
informative labels.
Can It Really Do Better Than Passive?
• YES! (sometimes)
• We often need far fewer labels for active
learning than for passive.
• This is predicted by theory and has been
observed in practice.
Can adaptive querying help?
[CAL92, Dasgupta04]
• Threshold functions on the real line: h_w(x) = 1(x ≥ w), C = {h_w : w ∈ R}
[Figure: the real line with threshold w; points left of w are labeled −, points right of w are labeled +.]
Active Algorithm
• Sample 1/ε unlabeled examples; do binary search (see the sketch below).
• Binary search – need just O(log 1/ε) labels.
Passive supervised: Ω(1/ε) labels to find an ε-accurate threshold.
Active: only O(log 1/ε) labels. Exponential improvement.
Other interesting results as well.
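Below is a minimal sketch (not from the slides) of the binary-search active learner for 1-D thresholds in the realizable case; the function names and the synthetic data are illustrative.

```python
# Sketch of active learning for 1-D thresholds h_w(x) = 1(x >= w).
# Assumes noiseless (realizable) labels, so binary search over the
# sorted unlabeled pool finds a consistent threshold with O(log n) queries.
import numpy as np


def active_threshold(pool, query_label):
    """Return a consistent threshold and the number of label queries used."""
    xs = np.sort(pool)
    lo, hi = 0, len(xs)               # invariant: xs[:lo] are -, xs[hi:] are +
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if query_label(xs[mid]) == 1: # + label: threshold is at or left of xs[mid]
            hi = mid
        else:                         # - label: threshold is right of xs[mid]
            lo = mid + 1
    w_hat = xs[lo] if lo < len(xs) else np.inf   # any value in the final gap works
    return w_hat, queries


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_true = 0.37
    pool = rng.uniform(0, 1, size=1000)          # roughly 1/eps unlabeled examples
    w_hat, q = active_threshold(pool, lambda x: int(x >= w_true))
    print(f"estimated threshold {w_hat:.3f} using {q} label queries")  # ~log2(1000) = 10
```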
Active Learning might not help [Dasgupta04]
In general, number of queries needed depends on C and also on D.
• C = {linear separators in R^1}: active learning reduces sample complexity substantially.
• C = {linear separators in R^2}: there are some target hypotheses for which no improvement can be achieved, no matter how benign the input distribution! (Intuitively, if the target labels only a tiny region positive, any learner must query Ω(1/ε) points just to locate it.)
[Figure: hypotheses h0, h1, h2, h3 illustrating the hard case in R^2.]
In this case: learning to accuracy ε requires Ω(1/ε) labels.
Examples where Active Learning helps
In general, number of queries needed depends on C and also on D.
• C = {linear separators in R^1}: active learning reduces sample complexity substantially, no matter what the input distribution is.
• C – homogeneous linear separators in R^d, D – uniform distribution over the unit sphere:
  • need only O(d log 1/ε) labels to find a hypothesis with error rate < ε.
• Dasgupta, Kalai, Monteleoni, COLT 2005
• Freund et al., ’97.
• Balcan-Broder-Zhang, COLT 07
Region of uncertainty [CAL92]
• Current version space: part of C consistent with labels so far.
• “Region of uncertainty” = part of data space about which
there is still some uncertainty (i.e. disagreement within version
space)
• Example: data lies on a circle in R^2 and hypotheses are homogeneous linear separators.
[Figure: data on the circle with two + examples; the current version space and the corresponding region of uncertainty in data space.]
Region of uncertainty [CAL92]
[Figure: current version space and region of uncertainty.]
Algorithm:
Pick a few points at random from the current
region of uncertainty and query their labels.
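Below is a minimal sketch (not from the slides) of this CAL-style loop for the 1-D threshold class, where the region of uncertainty has a simple closed form; the names, label budget, and synthetic stream are illustrative assumptions.

```python
# Sketch of the CAL idea for 1-D thresholds, where the region of uncertainty
# has a closed form: the open interval between the largest point labeled -
# and the smallest point labeled + so far.  Assumes noiseless labels.
import numpy as np


def cal_thresholds(stream, query_label, budget):
    lo, hi = -np.inf, np.inf          # current region of uncertainty
    used = 0
    for x in stream:
        if not (lo < x < hi):         # all consistent thresholds agree on x: skip it
            continue
        if used >= budget:
            break
        used += 1
        if query_label(x) == 1:       # x is +: every consistent threshold is <= x
            hi = x
        else:                         # x is -: every consistent threshold is > x
            lo = x
    w_hat = (lo + hi) / 2 if np.isfinite(lo) and np.isfinite(hi) else 0.0
    return w_hat, used


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w_true = 0.37
    stream = rng.uniform(0, 1, size=5000)
    w_hat, used = cal_thresholds(stream, lambda x: int(x >= w_true), budget=50)
    print(f"threshold ~ {w_hat:.4f} after {used} label queries")
```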
Region of uncertainty [CAL92]
[Figure: after the queried labels are added, the version space shrinks, giving a new, smaller region of uncertainty in data space.]
Region of uncertainty [CAL92], Guarantees
Algorithm: Pick a few points at random from the current region
of uncertainty and query their labels.
[Balcan, Beygelzimer, Langford, ICML’06]
Analyzes a version of this algorithm that is robust to noise.
• C – linear separators on the line: low noise, exponential improvement.
• C – homogeneous linear separators in R^d, D – uniform distribution over the unit sphere:
  • low noise: need only O(d² log 1/ε) labels to find a hypothesis with error rate < ε.
  • realizable case: O(d^{3/2} log 1/ε) labels.
  • supervised: d/ε labels.
Margin Based Active-Learning Algorithm
[Balcan-Broder-Zhang, COLT 07]
Use O(d) examples to find w_1 of error ≤ 1/8.
iterate k = 2, …, log(1/ε)
  • rejection sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k;
  • label them;
  • find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples.
end iterate
[Figure: w*, w_k, w_{k+1} and the margin band of width γ_k around w_k.]
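Below is a rough sketch (not from the slides) of this iteration structure, assuming noiseless labels and data uniform on the unit sphere; the per-round constants and the least-squares fitting step are illustrative stand-ins for the constrained search over B(w_{k-1}, 1/2^k).

```python
# Rough sketch of the margin-based iteration, under the uniform-on-the-sphere,
# noiseless assumptions.  The per-round sample size (8d), the band width
# 2^(1-k)/sqrt(d), and the least-squares fit are illustrative stand-ins for
# the constrained search over B(w_{k-1}, 1/2^k) in the slides.
import numpy as np


def sample_sphere(n, d, rng):
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def fit_direction(X, y):
    # crude stand-in for "find w consistent with the labeled examples"
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w / np.linalg.norm(w)


def margin_based_al(d, eps, query_label, rng):
    X = sample_sphere(8 * d, d, rng)                  # O(d) initial examples
    w = fit_direction(X, np.array([query_label(x) for x in X]))
    labels_used = len(X)
    for k in range(2, int(np.ceil(np.log2(1.0 / eps))) + 1):
        gamma_k = 2.0 ** (1 - k) / np.sqrt(d)         # shrinking margin band
        band = []
        while len(band) < 8 * d:                      # rejection sample inside the band
            x = sample_sphere(1, d, rng)[0]
            if abs(np.dot(w, x)) <= gamma_k:
                band.append(x)
        band = np.array(band)
        y = np.array([query_label(x) for x in band])  # query labels only in the band
        labels_used += len(band)
        w = fit_direction(band, y)
    return w, labels_used


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    d, eps = 10, 0.01
    w_star = sample_sphere(1, d, rng)[0]
    w_hat, used = margin_based_al(d, eps, lambda x: np.sign(np.dot(w_star, x)), rng)
    angle = np.arccos(np.clip(np.dot(w_hat, w_star), -1.0, 1.0))
    print(f"angle to w*: {angle:.3f} rad, labels used: {used}")
```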
Margin Based Active-Learning, Realizable Case
Theorem
If P_X is uniform over S^d, and γ_k and m_k are set appropriately (conditions shown in the figure, not recovered), then after s = O(log(1/ε)) iterations w_s has error ≤ ε.
[Figures: Facts 1, 2, and 3, stated in terms of the angle θ(u, v) between vectors u and v; equations not recovered.]
BBZ’07, Proof Idea
iterate k = 2, …, log(1/ε)
  Rejection sample m_k samples x from D satisfying |w_{k-1} · x| ≤ γ_k;
  ask for labels and find w_k ∈ B(w_{k-1}, 1/2^k) consistent with all these examples.
end iterate
Assume w_k has error ≤ α. We are done if we can show that w_{k+1} has error ≤ α/2 while using only O(d log(1/ε)) labels in round k.
[Figure: w*, w_k, w_{k+1} and the band of width γ_k.]
BBZ’07, Proof Idea
(Algorithm and inductive assumption as above.)
Key Point
Under the uniform distribution assumption, for γ_k chosen as in the algorithm (condition shown in the figure), the probability that w_{k+1} and w* disagree outside the band {x : |w_k · x| ≤ γ_k} is ≤ α/4.
[Figure: w*, w_k, w_{k+1} and the band of width γ_k.]
BBZ’07, Proof Idea
Key Point
So, it's enough to ensure that w_{k+1} also makes few mistakes inside the band (precise condition shown in the figure).
We can do so by using only O(d log(1/ε)) labels in round k.
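One way to see how the two requirements combine (a reconstruction with illustrative constants; the exact conditions in the slides' figures were not recovered):

```latex
% Error decomposition for one round; constants are illustrative.
\begin{align*}
\mathrm{err}(w_{k+1})
 &\le \Pr\!\big[w_{k+1}(x)\neq w^*(x),\; |w_k\cdot x| > \gamma_k\big]
   + \Pr\!\big[w_{k+1}(x)\neq w^*(x),\; |w_k\cdot x| \le \gamma_k\big] \\
 &\le \frac{\alpha}{4}
   + \underbrace{\Pr\big[\,|w_k\cdot x| \le \gamma_k\big]}_{O(\alpha)}
     \cdot \Pr\!\big[w_{k+1}(x)\neq w^*(x) \,\big|\, |w_k\cdot x| \le \gamma_k\big]
 \;\le\; \frac{\alpha}{2},
\end{align*}
% The first term is the Key Point; the second is controlled using the
% O(d log(1/eps)) labels requested inside the band in round k.
```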