Structured Prediction Cascades
Ben Taskar
David Weiss, Ben Sapp, Alex Toshev
University of Pennsylvania
Supervised Learning
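Learn a predictor of y from labeled examples (x, y)
• Regression: y is a real number
• Binary Classification: y is one of two labels
• Multiclass Classification: y is one of several labels
• Structured Prediction: y is a structured object (a sequence, a sentence, a pose)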
Handwriting Recognition
x: [image of a handwritten word]
y: ‘structured’
Machine Translation
x: ‘Ce n'est pas un autre problème de classification.’
y: ‘This is not another classification problem.’
Pose Estimation
x: [photograph, © Arthur Gretton]
y: [body pose]
Structured Models
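Prediction: the highest-scoring y in the space of feasible outputs, under a scoring function
The scoring function decomposes over “parts” (parts = cliques, productions)
Complexity of inference depends on the “part” structure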
Supervised Structured Prediction
Model:
Data → Learning → Prediction
Learning: discriminative estimation of θ; intractable/impractical for complex models
Prediction: intractable/impractical for complex models
3rd order OCR model = 1 mil states * length
Berkeley English grammar = 4 mil productions * length^3
Tree-based pose model = 100 mil states * # joints
Approximation vs. Computation
• Usual trade-off: approximation vs. estimation
– Complex models need more data
– Error from over-fitting
• New trade-off: approximation vs. computation
– Complex models need more time/memory
– Error from inexact inference
• This talk: enable complex models via cascades
An Inspiration: Viola Jones Face Detector
Scanning window at every
location, scale and orientation
Classifier Cascade
C1 → C2 → C3 → … → Cn → Face
(each stage Ci can reject a patch as Non-face)
Most patches are non-face
Filter out easy cases fast!!
Simple features first
Low precision, high recall
Learned layer-by-layer
Next layer more complex
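A minimal sketch of the early-rejection idea behind a classifier cascade. The stage functions and thresholds below are hypothetical toy stand-ins, not the actual Viola-Jones features or learned stages:

```python
from typing import Callable, List, Sequence, Tuple

# Each stage is (score_fn, reject_below). Both are hypothetical stand-ins
# for the increasingly complex classifiers C1..Cn on the slide.
Stage = Tuple[Callable[[Sequence[float]], float], float]


def run_cascade(patch: Sequence[float], stages: List[Stage]) -> Tuple[bool, int]:
    """Return (accepted, number_of_stages_evaluated) for one patch."""
    for depth, (score_fn, reject_below) in enumerate(stages, start=1):
        if score_fn(patch) < reject_below:
            return False, depth  # rejected early, later stages never run
    return True, len(stages)  # survived every stage


# Toy stages, cheapest test first (illustrative only).
stages: List[Stage] = [
    (lambda p: max(p), 0.5),            # C1: is there any bright pixel?
    (lambda p: sum(p) / len(p), 0.4),   # C2: average brightness
    (lambda p: min(p), 0.2),            # C3: no very dark pixel
]

easy_negative = [0.1, 0.2, 0.1, 0.0]
hard_negative = [0.9, 0.8, 0.1, 0.6]
positive = [0.9, 0.8, 0.7, 0.6]

for name, patch in [("easy", easy_negative), ("hard", hard_negative), ("pos", positive)]:
    print(name, run_cascade(patch, stages))
```

The easy negative is rejected by the first, cheapest stage, which is where the speed-up comes from when most patches are non-face.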
Related Work
• Global Thresholding and Multiple-Pass Parsing. J. Goodman, 97
• A maximum-entropy-inspired parser. E. Charniak, 00
• Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. E. Charniak & M. Johnson, 05
• TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. X. Carreras, M. Collins & T. Koo, 08
• Coarse-to-Fine Natural Language Processing. S. Petrov, 09
• Coarse-to-fine face detection. F. Fleuret & D. Geman, 01
• Robust real-time object detection. P. Viola & M. Jones, 02
• Progressive search space reduction for human pose estimation. V. Ferrari, M. Marin-Jimenez & A. Zisserman, 08
What’s a Structured Cascade?
• What to filter?
– Clique assignments
• What are the layers?
– Higher order models
– Higher resolution models
• How to learn filters?
– Novel convex loss
– Simple online algorithm
– Generalization bounds
[Diagram: cascade of filters F1 → F2 → F3 → … → Fn, with the candidate output shown as ‘??????’ at each stage and ‘structured’ at the end]
Trade-off in Learning Cascades
• Accuracy: Minimize the number of errors incurred by each level
• Efficiency: Maximize the number of filtered assignments at each level
[Diagram: Filter 1 → Filter 2 → … → Filter D → Predict]
Max-marginals (Sequences)
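Score of an output: sum of the scores of its cliques (here, the edges along a path)
[Figure: lattice over four positions, each with states a, b, c, d]
Max marginal: score of the best output that contains a given clique assignment
Example: the max marginal of edge bc is the max score over paths of the form *bc*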
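A minimal sketch of how edge max-marginals can be computed on a chain with a max-sum forward and backward pass. It assumes a first-order model with per-position node scores `unary` and position-independent edge scores `pairwise`; these names and the random toy scores are illustrative assumptions, not taken from the talk. A brute-force enumeration is included as the reference definition:

```python
import itertools
import numpy as np


def edge_max_marginals(unary: np.ndarray, pairwise: np.ndarray) -> np.ndarray:
    """Max-marginals of every edge assignment in a chain.

    unary:    (T, K) score of state k at position t
    pairwise: (K, K) score of moving from state i to state j
    returns:  (T-1, K, K) array mm, where mm[t, i, j] is the best total
              score of any sequence with y[t] = i and y[t+1] = j
    """
    T, K = unary.shape
    fwd = np.zeros((T, K))  # best score of a prefix ending in state k at t
    bwd = np.zeros((T, K))  # best score of the suffix after state k at t
    fwd[0] = unary[0]
    for t in range(1, T):
        fwd[t] = unary[t] + np.max(fwd[t - 1][:, None] + pairwise, axis=0)
    for t in range(T - 2, -1, -1):
        bwd[t] = np.max(pairwise + (unary[t + 1] + bwd[t + 1])[None, :], axis=1)
    mm = np.empty((T - 1, K, K))
    for t in range(T - 1):
        mm[t] = fwd[t][:, None] + pairwise + (unary[t + 1] + bwd[t + 1])[None, :]
    return mm


def brute_force_edge_max_marginal(unary, pairwise, t, i, j):
    """Reference definition: enumerate every sequence that uses edge i -> j at t."""
    T, K = unary.shape
    best = -np.inf
    for y in itertools.product(range(K), repeat=T):
        if y[t] == i and y[t + 1] == j:
            score = sum(unary[p, y[p]] for p in range(T))
            score += sum(pairwise[y[p], y[p + 1]] for p in range(T - 1))
            best = max(best, score)
    return best


# Toy lattice: 4 positions, states a, b, c, d mapped to indices 0..3.
rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 4))
pairwise = rng.normal(size=(4, 4))
mm = edge_max_marginals(unary, pairwise)

B, C = 1, 2
bc_score = mm[1, B, C]  # best score over paths of the form *bc*
```

The dynamic program costs O(T·K²) for all edges at once, whereas the brute-force reference enumerates all K^T sequences for a single edge.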
Filtering with Max-marginals
• Set threshold
• Filter clique assignment if its max marginal does not exceed the threshold
[Figure: lattice over four positions, each with states a, b, c, d; edge bc is removed]
Why Max-marginals?
• Valid path guarantee
– Filtering leaves at least one valid global assignment
• Faster inference
– No exponentiation, O(k) vs. O(k log k) in some
cases
• Convex estimation
– Simple stochastic subgradient algorithm
• Generalization bounds for error/efficiency
– Guarantees on expected trade-off
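The valid path guarantee can be checked on a toy chain. The sketch below assumes a first-order model with random illustrative scores and uses hypothetical helper names; it computes edge max-marginals by brute force, filters every edge whose max-marginal does not exceed a threshold set below the maximum, and then looks at which complete paths remain:

```python
import itertools
import numpy as np


def path_score(y, unary, pairwise):
    """Score of a full sequence: node terms plus edge terms."""
    nodes = sum(unary[t, s] for t, s in enumerate(y))
    edges = sum(pairwise[a, b] for a, b in zip(y, y[1:]))
    return nodes + edges


def edge_max_marginals_brute_force(unary, pairwise):
    """mm[t, i, j] = best score of any sequence using edge i -> j at t."""
    T, K = unary.shape
    mm = np.full((T - 1, K, K), -np.inf)
    for y in itertools.product(range(K), repeat=T):
        s = path_score(y, unary, pairwise)
        for t in range(T - 1):
            mm[t, y[t], y[t + 1]] = max(mm[t, y[t], y[t + 1]], s)
    return mm


def surviving_paths(keep):
    """All sequences whose every edge is still marked True in `keep`."""
    T, K = keep.shape[0] + 1, keep.shape[1]
    return [
        y for y in itertools.product(range(K), repeat=T)
        if all(keep[t, y[t], y[t + 1]] for t in range(T - 1))
    ]


rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 4))      # 4 positions, 4 states (a, b, c, d)
pairwise = rng.normal(size=(4, 4))
mm = edge_max_marginals_brute_force(unary, pairwise)

threshold = 0.5 * mm.max() + 0.5 * mm.mean()  # any value below mm.max() works
keep = mm > threshold                          # filter edges at or below it
paths = surviving_paths(keep)
best = max(itertools.product(range(4), repeat=4),
           key=lambda y: path_score(y, unary, pairwise))
```

Every edge on the best path has a max-marginal equal to the best score, so the best path always survives. More generally, each kept edge comes with a "witness" path whose edges all score at least as high, so filtering leaves no dead ends.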
Choosing a Threshold
• Threshold must be specific to input x
– Max-marginal scores are not normalized
• Keep top-K max-marginal assignments?
– Hard(er) to handle and analyze
• Convex alternative: max-mean-max function:
threshold = α · (max score) + (1 − α) · (mean max marginal)
mean taken over the m max marginals, m = #edges
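A minimal sketch of the max-mean-max threshold, assuming it interpolates between the mean max-marginal (α = 0) and the max score (α = 1), which is how I read the slide's labels. The function names and the toy scores are hypothetical:

```python
import numpy as np


def max_mean_max_threshold(max_marginals, alpha: float) -> float:
    """alpha * (max score) + (1 - alpha) * (mean of the max-marginals).

    The max score of any output equals the largest max-marginal, so both
    terms can be read off the same array of max-marginals.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mm = np.asarray(max_marginals, dtype=float)
    return float(alpha * mm.max() + (1.0 - alpha) * mm.mean())


def keep_mask(max_marginals, alpha: float) -> np.ndarray:
    """True for the clique assignments that survive the filter."""
    mm = np.asarray(max_marginals, dtype=float)
    return mm > max_mean_max_threshold(mm, alpha)


toy_scores = [1.0, 3.0, 5.0, 7.0]  # illustrative max-marginals
print(max_mean_max_threshold(toy_scores, 0.5), keep_mask(toy_scores, 0.5))
```

Plugging in the figures quoted in the OCR example of this deck (max = 368, mean ≈ 0, α = 0.55) gives 0.55 · 368 ≈ 202, in line with the quoted threshold of ≈ 204 given that the mean is only approximately zero.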
Choosing a Threshold (cont)
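[Figure: max marginal scores along a score axis, with the mean max marginal, the score of the truth, and the max score marked]
Range of possible thresholds: from the mean max marginal up to the max score
α = efficiency level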
Example OCR Cascade
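Max-marginal scores of the single-letter states (first position):
a 144, b 269, c -137, d 80, e 52, f -42, g 49, h 360, i -27, j -774, k 368, l -24, …
Mean ≈ 0, Max = 368, α = 0.55, Threshold ≈ 204
States above the threshold: b, h, k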
Example OCR Cascade
[Figure: single-letter states surviving the first filter, by position: b h k; a e n u; a b d g; a c e g n o; …]
Example OCR Cascade
[Figure: the surviving states expanded into two-letter (bigram) states, by position: ▪b ▪h ▪k; ba ha ka be he …; aa ab ad ag ea …; aa ac ae ag an …; aa ae am an au …]
Example OCR Cascade
Max-marginal scores of the bigram states, by position:
▪b 1440, ▪h 1558, ▪k 1575
ba 1440, ha 1558, ka 1575, be 1413, he 1553, …
aa 1400, ab 1480, ad 1575, ag 1397, ea 1393, …
aa 1257, ac 1285, ae 1302, ag 1294, an 1356, …
aa 1336, ae 1306, am 1390, an 1346, au 1306, …
Mean ≈ 1412, Max = 1575, α = 0.44, Threshold ≈ 1502
Example OCR Cascade
Mean ≈ 1412, Max = 1575, α = 0.44, Threshold ≈ 1502
Bigram states above the threshold, by position: ▪h ▪k; ha ka he ke kn …; ad ed nd …; do …; ow om …
Example OCR Cascade
[Figure: the surviving bigram states expanded into three-letter (trigram) states, by position: ▪▪h ▪▪k; ▪ha ▪ka ▪he ▪ke …; had kad hed ked …; ado edo ndo; dow dom]
Example: Pictorial Structure Cascade
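Parts: H (head), T (torso), ULA and URA (upper left/right arm), LLA and LRA (lower left/right arm)
k ≈ 320,000
k2 ≈ 100mil
State-space resolutions across the cascade (shown for upper arm and lower arm): 10×10×12 → 10×10×24 → 40×40×24 → 80×80×24 → 110×122×24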
Quantifying Loss
• Filter loss
– If score(truth) > threshold, all true states are safe
• Efficiency loss
– Proportion of unfiltered clique assignments
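The two losses can be sketched for a single example as follows. The function names and toy numbers are hypothetical, and the filter is assumed to keep an assignment only when its max-marginal strictly exceeds the threshold:

```python
import numpy as np


def filter_loss(truth_score: float, threshold: float) -> int:
    """1 if the true output may be pruned, 0 if all true states are safe.

    Each true clique assignment has a max-marginal at least as large as
    score(truth), so score(truth) > threshold keeps every one of them.
    """
    return int(truth_score <= threshold)


def efficiency_loss(max_marginals, threshold: float) -> float:
    """Proportion of clique assignments that survive (are not filtered)."""
    mm = np.asarray(max_marginals, dtype=float)
    return float(np.mean(mm > threshold))


toy_scores = [1.0, 3.0, 5.0, 7.0]  # illustrative max-marginals
toy_threshold = 5.5
print(filter_loss(6.0, toy_threshold), efficiency_loss(toy_scores, toy_threshold))
```

A lower threshold reduces the filter loss but raises the efficiency loss, which is the accuracy/efficiency trade-off each cascade level has to balance.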
Learning One Cascade Level
• Fix α, solve convex problem for θ
No filter mistakes
Margin w/ slack
Minimize filter mistakes at efficiency level α
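One way to write the problem that these labels annotate, in the form I recall from the AISTATS 2010 paper cited at the end of the deck. The notation is an assumption on my part: λ is a regularization weight, ξ_i the slack variables, f the feature map, ℓ_i a margin term, and t_x(θ, α) the max-mean-max threshold:

```latex
\begin{aligned}
\min_{\theta,\ \xi \ge 0}\quad
  & \frac{\lambda}{2}\,\lVert\theta\rVert^{2} + \frac{1}{n}\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad
  & \theta^{\top} f(x_i, y_i) \;\ge\; t_{x_i}(\theta,\alpha) + \ell_i - \xi_i,
  \qquad i = 1,\dots,n
\end{aligned}
```

The constraint asks the score of the truth to clear the threshold ("no filter mistakes") by a margin, with slack ξ_i absorbing violations. Since t_x is a maximum and an average of maxima of functions linear in θ, it is convex in θ, which keeps the whole problem convex for fixed α.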
An Online Algorithm
• Stochastic sub-gradient update:
Features of truth
Convex combination: Features of best guess
+ Average features of max marginal “witnesses”
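A minimal sketch of one such update, assuming a hinge loss on (margin + threshold − score of truth) and that the feature vectors of the best guess and the averaged witnesses have already been obtained by inference under the current θ. The function name, the unit margin, and the toy vectors are hypothetical:

```python
import numpy as np


def scp_subgradient_step(theta, f_truth, f_best, f_witness_mean,
                         alpha, lr, lam=0.0, margin=1.0):
    """One stochastic sub-gradient step on a single example.

    f_truth:        features of the true output
    f_best:         features of the current best guess (argmax output)
    f_witness_mean: average features of the max-marginal witnesses
    """
    # Threshold under the current weights: convex combination of the max
    # score and the mean max-marginal, both linear in theta given witnesses.
    threshold = alpha * (theta @ f_best) + (1.0 - alpha) * (theta @ f_witness_mean)
    grad = lam * theta  # gradient of the L2 regularizer
    if theta @ f_truth < threshold + margin:  # hinge is active
        grad = grad - f_truth + alpha * f_best + (1.0 - alpha) * f_witness_mean
    return theta - lr * grad


theta0 = np.zeros(3)
f_truth = np.array([1.0, 0.0, 0.0])
f_best = np.array([0.0, 1.0, 0.0])
f_witness_mean = np.array([0.0, 0.0, 1.0])

theta1 = scp_subgradient_step(theta0, f_truth, f_best, f_witness_mean, alpha=0.5, lr=1.0)
theta2 = scp_subgradient_step(theta1, f_truth, f_best, f_witness_mean, alpha=0.5, lr=1.0)
```

In a real training loop the best guess and the witnesses would be recomputed by max-sum inference after every update; they are held fixed here only to keep the example self-contained.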
Generalization Bounds
• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data
– Bound: expected loss on true distribution ≤ empirical γ-ramp upper bound + a term depending on n (number of examples), m (number of clique assignments), the number of cliques, and B
– Similar bound holds for the efficiency loss L_e and all α ∈ [0,1]
[Figure: plot of the γ-ramp function]
OCR Experiments
% Error           1st order  2nd order  3rd order  4th order
Character Error   22.5       14.31      12.05      7.75
Word Error        73.35      50.56      26.17      15.54
Dataset: http://www.cis.upenn.edu/~taskar/ocr
Efficiency Experiments
• POS Tagging (WSJ + CoNLL datasets)
– English, Bulgarian, Portuguese
• Compare 2nd order model filters
– Structured perceptron (max-sum)
– SCP w/ α ∈ {0, 0.2, 0.4, 0.6, 0.8} (max-sum)
– CRF log-likelihood (sum-product marginals)
• Tightly controlled
– Use random 40% of training set
– Use development set to fit regularization parameter λ and α for all methods
POS Tagging Results
[Plots for English, Portuguese and Bulgarian, one per slide. x-axis: Error Cap (%) on Development Set. Curves: Max-sum (SCP), Max-sum (0-1), Sum-product (CRF)]
English POS Cascade (2nd order)
Example: The/DT cascade/NN is/VRB efficient./ADJ
                  Full     SCP     CRF      Taglist
Accuracy (%)      96.83    96.82   96.84    ---
Filter Loss (%)   0        0.12    0.024    0.118
Test Time (ms)    173.28   1.56    4.16     10.6
Avg. Num States   1935.7   3.93    11.845   95.39
Pose Estimation (Buffy Dataset)
method               torso    head     upper arms  lower arms  total
Ferrari et al. 08    --       --       --          --          61.4%
Ferrari et al. 09    --       --       --          --          74.5%
Andriluka et al. 09  98.3%    95.7%    86.8%       51.7%       78.8%
Eichner et al. 09    98.72%   97.87%   92.8%       59.79%      80.1%
CPS (ours)           100.00%  99.57%   96.81%      62.34%      86.31%
“Ferrari score:” percent of parts correct within radius
Pose Estimation
Cascade Efficiency vs. Accuracy
Features: Shape/Segmentation Match
Features: Contour Continuation
Conclusions
• Novel loss significantly improves efficiency
• Principled learning of accurate cascades
• Deep cascades focus structured inference and allow rich models
• Open questions:
– How to learn cascade structure
– Dealing with intractable models
– Other applications: factored dynamic models, grammars
Thanks!
Structured Prediction Cascades, Weiss & Taskar, AISTATS10
Cascaded Models for Articulated Pose Estimation, Sapp, Toshev & Taskar, ECCV10