Smooth Boosting By Using An Information-Based Criterion
Kohei Hatano
Kyushu University, JAPAN
Organization of this talk
1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary
Boosting
• Methodology for combining prediction rules into a more accurate one.
• Example: learning a rule to classify web pages about “Drew Barrymore”,
  where the set of prediction rules = words and the labeled training data = web pages.
[Figure: a single rule such as “Barrymore?” reaches only 51% accuracy, because it also matches
her relatives John Barrymore (her grandfather), John Drew Barrymore (her father), Jaid (her mother),
Lionel Barrymore (her granduncle), and Diana Barrymore (her aunt) of “The Barrymore family” of
Hollywood. A combination of prediction rules, say a majority vote over “Barrymore?”, “Drew?”,
and “Charlie’s Angels?”, reaches 80% accuracy.]
Boosting by filtering
[Schapire 90], [Freund 95]
A boosting scheme that samples randomly from (huge) data; the boosting algorithm accepts or rejects each sampled example.
Advantage 1: the sample size can be determined adaptively.
Advantage 2: smaller space complexity (for the sample):
  batch learning: O(1/ε);   boosting by filtering: polylog(1/ε)   (ε: desired error)
Some known results
Boosting algorithms by filtering
– Schapire’s first boosting alg. [Schapire 90], Boost-by-Majority [Freund 95],
  MadaBoost [Domingo&Watanabe 00], AdaFlat [Gavinsky 03]
– Criterion for choosing prediction rules: accuracy
Are there any better criteria?
A candidate: information-based criterion
– Real AdaBoost [Schapire&Singer 99], InfoBoost [Aslam 00] (a simple version of Real AdaBoost)
– Criterion for choosing prediction rules: mutual information
– Sometimes converges faster than boosters using the accuracy-based criterion
  (experimental: [Schapire&Singer 99]; theoretical: [Hatano&Warmuth 03], [Hatano&Watanabe 04])
– However, no boosting-by-filtering algorithm with an information-based criterion was known
Our work
Boosting by filtering → lower space complexity
Information-based criterion → faster convergence
Our work = efficient boosting by filtering using an information-based criterion
1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary
Illustration of general boosting
[Figure: training data (x1,+1), (x2,+1), (x3,-1), (x4,-1), (x5,+1) under the initial
distribution D1 (weight 0.2 each), with the predictions of h1 marked as correct or wrong.]
1. Choose a prediction rule h1 maximizing some criterion w.r.t. D1.
2. Assign a coefficient to h1 based on its quality (0.25 in the figure).
3. Update the distribution: weights of correctly classified examples become lower,
   weights of misclassified examples become higher.
Illustration of general boosting (2)
[Figure: the same training data under the updated distribution D2 (weights ranging from 0.16 to 0.26:
lower on correctly classified examples, higher on misclassified ones), with the predictions of h2
marked as correct or wrong.]
1. Choose a prediction rule h2 maximizing some criterion w.r.t. D2.
2. Assign a coefficient to h2 based on its weighted error (0.28 in the figure).
3. Update the distribution.
Repeat this procedure for T rounds.
Illustration of general boosting (3)
Final prediction rule = weighted majority vote of the chosen prediction rules.
[Figure: for an instance x, h1(x) = +1, h2(x) = -1, h3(x) = +1 with coefficients
0.25, 0.28, 0.05, so H(x) = 0.25 - 0.28 + 0.05 = 0.02.]
Predict +1 if H(x) > 0; predict -1 otherwise.
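As a concrete illustration (mine, not the slides'), a minimal Python sketch of this weighted-majority-vote rule; the rules h_t and coefficients alpha_t are assumed to be given:

    def final_rule(rules, coeffs):
        """H(x) = sum_t alpha_t * h_t(x); the final prediction is sign(H(x))."""
        def H(x):
            return sum(a * h(x) for h, a in zip(rules, coeffs))
        def predict(x):
            return +1 if H(x) > 0 else -1
        return H, predict

    # Worked example from the slide: h1(x)=+1, h2(x)=-1, h3(x)=+1 with coefficients
    # 0.25, 0.28, 0.05, so H(x) = 0.25 - 0.28 + 0.05 = 0.02 and the prediction is +1.
    H, predict = final_rule([lambda x: +1, lambda x: -1, lambda x: +1], [0.25, 0.28, 0.05])
    print(round(H("some instance"), 2), predict("some instance"))   # 0.02 +1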
Example: AdaBoost
[Freund&Schapire 97]
Criterion for choosing prediction rules (edge):
  h_t = \arg\max_{h \in W} \sum_{i=1}^m y_i h(x_i) D_t(x_i) = \arg\max_{h \in W} \mathrm{edge}_{D_t}(h)
Coefficient:
  \alpha_t = \frac{1}{2}\ln\frac{1+\gamma_t}{1-\gamma_t},  where \gamma_t = \sum_i y_i h_t(x_i) D_t(x_i)
Update:
  H_{t+1} = H_t + \alpha_t h_t;
  D_{t+1}(x_i) = \frac{\exp(-y_i H_{t+1}(x_i))}{\sum_{i=1}^m \exp(-y_i H_{t+1}(x_i))}
[Figure: the weight \exp(-y_i H_t(x_i)) plotted against the margin -y_i H_t(x_i); it is small for
correctly classified examples and grows exponentially for wrongly classified ones.]
Difficult examples (possibly noisy) may be given too much weight.
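A minimal NumPy sketch of these formulas (one round of rule selection plus the reweighting); the rule class W is assumed here to be a list of functions mapping the example matrix X to vectors of ±1 predictions:

    import numpy as np

    def adaboost_round(W, X, y, D):
        """Pick h maximizing the edge under D and compute its coefficient
        alpha = 0.5 * ln((1 + gamma) / (1 - gamma))."""
        edges = [float(np.sum(y * h(X) * D)) for h in W]     # edge_{D_t}(h)
        best = int(np.argmax(edges))
        h, gamma = W[best], edges[best]
        alpha = 0.5 * np.log((1.0 + gamma) / (1.0 - gamma))
        return h, alpha

    def adaboost_distribution(Hx, y):
        """D_{t+1}(x_i) proportional to exp(-y_i H_{t+1}(x_i)); Hx is the vector of H_{t+1}(x_i)."""
        w = np.exp(-y * Hx)            # wrongly classified (possibly noisy) examples blow up
        return w / w.sum()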
Smooth boosting
• Keep the distribution “smooth”:
    \sup_x D_t(x)/D_1(x) is poly-bounded,
  where D_t is the distribution constructed by the booster and D_1 is the original
  distribution (e.g., uniform).
• This makes boosting algorithms
  – noise-tolerant:
    • (statistical query model) MadaBoost [Domingo&Watanabe 00]
    • (malicious noise model) SmoothBoost [Servedio 01]
    • (agnostic boosting model) AdaFlat [Gavinsky 03]
  – and lets sampling from D_t be simulated efficiently via sampling from D_1
    (e.g., by rejection sampling)
  ⇒ applicable in the boosting-by-filtering framework
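A sketch of the rejection-sampling step mentioned above, assuming we can draw from D_1 and evaluate the density ratio D_t(x)/D_1(x), which smoothness bounds by some constant B (the function names here are placeholders, not the paper's API):

    import random

    def sample_from_Dt(draw_from_D1, density_ratio, B):
        """Simulate one draw from D_t using draws from D_1.
        density_ratio(x) must return D_t(x)/D_1(x), which is at most B (smoothness)."""
        while True:
            x, y = draw_from_D1()                       # sample from the original distribution
            if random.random() <= density_ratio(x) / B:
                return x, y                             # accept with probability D_t(x)/(B*D_1(x))
            # otherwise reject and draw again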
Example: MadaBoost
[Domingo & Watanabe 00]
Criterion for choosing prediction rules (edge):
  h_t = \arg\max_{h \in W} \sum_{i=1}^m y_i h(x_i) D_t(x_i) = \arg\max_{h \in W} \mathrm{edge}_{D_t}(h)
Coefficient:
  \alpha_t = \frac{1}{2}\ln\frac{1+\gamma_t}{1-\gamma_t},  where \gamma_t = \sum_i y_i h_t(x_i) D_t(x_i)
Update:
  H_{t+1} = H_t + \alpha_t h_t;
  D_{t+1}(x_i) = \frac{\ell(-y_i H_{t+1}(x_i))}{\sum_{i=1}^m \ell(-y_i H_{t+1}(x_i))}
[Figure: the weight \ell(-y_i H_t(x_i)): exponentially small for correctly classified examples,
but capped at a constant for wrongly classified ones.]
D_t is 1/ε-bounded (ε: error of H_t).
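The only change from AdaBoost's reweighting sketched earlier is the weight function ℓ. A hedged sketch, assuming the capped exponential weight ℓ(z) = min(1, e^z) that is usually associated with MadaBoost:

    import numpy as np

    def madaboost_distribution(Hx, y):
        """D_{t+1}(x_i) proportional to l(-y_i H_{t+1}(x_i)) with l(z) = min(1, exp(z)):
        correctly classified examples decay exponentially, misclassified ones are capped,
        which keeps the distribution smooth."""
        w = np.minimum(1.0, np.exp(-y * Hx))
        return w / w.sum()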
Examples of other smooth boosters
• LogitBoost [Friedman et al. 00]: logistic weight function
• AdaFlat [Gavinsky 03]: stepwise-linear weight function
1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary
Our new booster
Criterion for choosing prediction rules (pseudo gain):
  h_t = \arg\max_{h \in W} \Gamma_t(h)
Coefficient:
  \alpha_t(z) = \alpha_t[+1]/2 if z > 0,  \alpha_t[-1]/2 if z < 0,
  where \alpha_t[\pm 1] = \frac{\sum_{i: h_t(x_i)=\pm 1} y_i h_t(x_i) D_t(x_i)}{\sum_{i: h_t(x_i)=\pm 1} D_t(x_i)}
Update:
  H_{t+1}(x) = H_t(x) + \alpha_t(h_t(x)) h_t(x);
  D_{t+1}(x_i) = \frac{\ell(-y_i H_{t+1}(x_i))}{\sum_{i=1}^m \ell(-y_i H_{t+1}(x_i))}
  (the same weight function \ell as in MadaBoost)
Still, D_t is 1/ε-bounded (ε: error of H_t).
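A minimal sketch of one GiniBoost round assembled from these formulas (the pseudo gain Γ_t is spelled out on the next slide); the structure and names are mine, not the paper's pseudocode:

    import numpy as np

    def pseudo_gain(hX, y, D):
        """Gamma_t(h) = p * gamma[+1]^2 + (1 - p) * gamma[-1]^2 (defined on the next slide)."""
        gain = 0.0
        for b in (+1, -1):
            mask = (hX == b)
            p_b = float(D[mask].sum())                 # Pr_{D_t}{h(x) = b}
            if p_b > 0:
                gamma_b = float(np.sum(y[mask] * hX[mask] * D[mask])) / p_b
                gain += p_b * gamma_b ** 2
        return gain

    def giniboost_round(W, X, y, D):
        """Pick h maximizing the pseudo gain; return its branch-wise coefficients
        alpha[b] = gamma_t[b] / 2, used in H_{t+1}(x) = H_t(x) + alpha[h(x)] * h(x)."""
        preds = [h(X) for h in W]
        best = int(np.argmax([pseudo_gain(p, y, D) for p in preds]))
        h, hX = W[best], preds[best]
        alpha = {}
        for b in (+1, -1):
            mask = (hX == b)
            p_b = float(D[mask].sum())
            gamma_b = float(np.sum(y[mask] * hX[mask] * D[mask])) / p_b if p_b > 0 else 0.0
            alpha[b] = gamma_b / 2.0
        return h, alpha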
Pseudo gain
  \Gamma_t(h) = p_t \gamma_t[+1]^2 + (1 - p_t) \gamma_t[-1]^2,
  where p_t = \sum_{i: h(x_i)=+1} D_t(x_i) = \Pr_{D_t}\{h(x)=+1\}  and
  \gamma_t[\pm 1] = \frac{\sum_{i: h(x_i)=\pm 1} y_i h(x_i) D_t(x_i)}{\sum_{i: h(x_i)=\pm 1} D_t(x_i)}.
Relation to the edge:
  edge:  \gamma = \sum_{i=1}^m y_i h(x_i) D_t(x_i) = p_t \gamma_t[+1] + (1 - p_t) \gamma_t[-1]
  pseudo gain:  \Gamma_t(h) = p_t \gamma_t[+1]^2 + (1 - p_t) \gamma_t[-1]^2
Property: \gamma^2 \le \Gamma_t(h)   (by convexity of the square function)
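Written out, the property is just Jensen's inequality for the convex function z \mapsto z^2 applied to the two-point distribution (p_t, 1 - p_t):

  \gamma^2 = \bigl(p_t\,\gamma_t[+1] + (1-p_t)\,\gamma_t[-1]\bigr)^2
           \;\le\; p_t\,\gamma_t[+1]^2 + (1-p_t)\,\gamma_t[-1]^2 = \Gamma_t(h).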
Interpretation of pseudo gain
  \max_{h_t} \Gamma_t(h_t)  ⇔  \min_{h_t} \bigl(1 - \Gamma_t(h_t)\bigr)
   = \min_h (conditional entropy of the labels given h)
   ⇔ \max_h (mutual information between h and the labels)
But the entropy function here is NOT Shannon's entropy; it is defined with the Gini index.
Information-based criteria
  E_{Gini}(p) = 4 p (1 - p)
  E_{Shannon}(p) = -p \log p - (1 - p)\log(1 - p)
  E_{KM}(p) = 2\sqrt{p (1 - p)}   [Kearns & Mansour 98]
Our booster chooses a prediction rule maximizing the mutual information defined by the
Gini index (GiniBoost).
Cf. Real AdaBoost and InfoBoost choose a prediction rule that maximizes the mutual
information defined with the KM entropy.
Good news: the Gini index can be estimated efficiently via sampling!
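For a quick numerical comparison of the three entropy functions (a sketch of mine; base-2 logs are assumed for E_Shannon so that all three peak at 1 at p = 1/2):

    import numpy as np

    def e_gini(p):                 # E_Gini(p) = 4 p (1 - p)
        return 4.0 * p * (1.0 - p)

    def e_shannon(p):              # E_Shannon(p) = -p log2 p - (1 - p) log2(1 - p)
        p = np.clip(p, 1e-12, 1.0 - 1e-12)
        return -p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p)

    def e_km(p):                   # E_KM(p) = 2 sqrt(p (1 - p))   [Kearns & Mansour 98]
        return 2.0 * np.sqrt(p * (1.0 - p))

    for p in (0.1, 0.3, 0.5):
        print(p, e_gini(p), e_shannon(p), e_km(p))     # Gini <= Shannon <= KM on (0, 1)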
Convergence of the training error (GiniBoost)
Thm. Suppose that the training error of H_t is greater than ε for t = 1, …, T. Then
  \mathrm{train.err}(H_T) \;\le\; 1 - \frac{\epsilon}{4} \sum_{t=1}^{T} \Gamma_t(h_t).
Coro. Further, if \Gamma_t(h_t) \ge \tilde{\gamma} for all t, then \mathrm{train.err}(H_T) \le \epsilon within T = O(1/(\epsilon\tilde{\gamma})) steps.
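The corollary follows directly from the theorem: while the training error stays above ε and every round achieves pseudo gain at least \tilde{\gamma},

  \mathrm{train.err}(H_T) \;\le\; 1 - \frac{\epsilon}{4}\sum_{t=1}^{T}\Gamma_t(h_t)
                          \;\le\; 1 - \frac{\epsilon\,\tilde{\gamma}\,T}{4},

which drops to ε or below once T \ge 4(1-\epsilon)/(\epsilon\tilde{\gamma}), i.e., after T = O(1/(\epsilon\tilde{\gamma})) rounds.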
Comparison on convergence speed
booster | # of iterations to get a final rule with error ≤ ε | comments
MadaBoost [Domingo&Watanabe 00] | O(1/(εγ²)) | ○ boost by filtering; ○ adaptive (no need to know γ); × needs technical assumptions
SmoothBoost [Servedio 01] | O(1/(εγ²)) | ○ boost by filtering; × not adaptive
AdaFlat [Gavinsky 03] | O(1/(ε²γ²)) | ○ boost by filtering; ○ adaptive
GiniBoost (our result) | O(1/(εγ̃)) | ○ boost by filtering; ○ adaptive
AdaBoost [Freund&Schapire 97] | O(log(1/ε)/γ²) | ○ adaptive; × boost by filtering
(γ̃: minimum pseudo gain, γ: minimum edge)
Boosting-by-filtering version of GiniBoost (outline)
• Multiplicative bounds for the pseudo gain (and more practical bounds using the central limit approximation).
• Adaptive pred. rule selector.
• Boosting alg. in the PAC learning sense.
1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary
Experiments
• Topic classification of Reuters news (Reuters-21578).
• Binary classification for each of 5 topics (results are averaged).
• 10,000 examples.
• 30,000 words used as base prediction rules.
• Each algorithm is run until it samples 1,000,000 examples in total.
• 10-fold CV.
Test error over Reuters
Note: GiniBoost2 doubles the coefficients \alpha_t[+1], \alpha_t[-1] used in GiniBoost.
Execution time
booster | test error (%) | time (sec.)
AdaBoost (w/o sampling, run for 100 steps) | 5.6 | 1349
MadaBoost | 6.7 | 493
GiniBoost | 5.8 | 408
GiniBoost2 | 5.5 | 359
About 4 times faster (than AdaBoost without sampling)!
(Cf. a similar result without sampling for Real AdaBoost [Schapire & Singer 99])
1. Introduction
2. Preliminaries
3. Our booster
4. Experiments
5. Summary
Summary/Open problem
Summary
GiniBoost:
• uses the pseudo gain (based on the Gini index) to choose base prediction rules;
• converges faster in the filtering scheme.
Open problem
• Theoretical analysis of noise tolerance.
Comparison on sample size
booster | # of sampled examples | # of accepted examples | time (sec.)
AdaBoost (w/o sampling, run for 100 steps) | N/A | N/A | 1349
MadaBoost | 1,032,219 | 157,320 | 493
GiniBoost1 | 1,039,943 | 156,856 | 408
GiniBoost2 | 1,027,874 | 140,916 | 359
Observation: fewer accepted examples → faster selection of prediction rules.