Self-Paced Learning for
Semantic Segmentation
M. Pawan Kumar
Self-Paced Learning for
Latent Structural SVM
M. Pawan Kumar
Benjamin Packer
Daphne Koller
Aim
To learn accurate parameters for latent structural SVM
Input x
Output y ∈ Y
Hidden variable h ∈ H
“Deer”
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Aim
To learn accurate parameters for latent structural SVM
Feature (x,y,h)
(HOG, BoW)
Parameters w
(y*, h*) = argmax_{y∈Y, h∈H} wᵀΦ(x,y,h)
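To make the prediction rule concrete, here is a minimal Python sketch of latent structural SVM inference by exhaustive enumeration. The names phi, labels, and hidden_vals are hypothetical placeholders for the joint feature map Φ, the label set Y, and the candidate hidden values H; real systems replace the double loop with problem-specific inference (e.g. a sliding-window search over boxes).

```python
import numpy as np

def predict(w, x, labels, hidden_vals, phi):
    """Latent structural SVM prediction: jointly maximize the score
    w . phi(x, y, h) over the label y and the hidden variable h.
    phi is a hypothetical joint feature map returning a NumPy vector."""
    best_pair, best_score = None, -np.inf
    for y in labels:              # e.g. {"Bison", "Deer", ...}
        for h in hidden_vals:     # e.g. candidate bounding boxes
            score = float(w @ phi(x, y, h))
            if score > best_score:
                best_pair, best_score = (y, h), score
    return best_pair              # (y*, h*)
```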
Motivation
Real
Numbers
Math is for
losers !!
Imaginary
Numbers
e^{iπ} + 1 = 0
FAILURE … BAD LOCAL MINIMUM
Motivation
Real
Numbers
Euler was
a Genius!!
Imaginary
Numbers
e^{iπ} + 1 = 0
SUCCESS … GOOD LOCAL MINIMUM
Motivation
Start with “easy” examples, then consider “hard” ones
Simultaneously estimate easiness and parameters
Easiness is a property of data sets, not single instances
Easy vs. Hard
Expensive
Easy for human ≠ Easy for machine
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Self-Paced Learning
• Experiments
Latent Structural SVM
Felzenszwalb et al., 2008; Yu and Joachims, 2009
Training samples xi
Ground-truth label yi
Loss function Δ(yi, yi(w), hi(w))
Latent Structural SVM
(yi(w), hi(w)) = argmax_{y∈Y, h∈H} wᵀΦ(xi, y, h)
min_w ||w||² + C ∑i Δ(yi, yi(w), hi(w))
Non-convex Objective
Minimize an upper bound
Latent Structural SVM
(yi(w), hi(w)) = argmax_{y∈Y, h∈H} wᵀΦ(xi, y, h)
min_w ||w||² + C ∑i ξi
s.t.  max_h wᵀΦ(xi, yi, h) - wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) - ξi,   for all y ∈ Y, h ∈ H
Still non-convex
Difference of convex
CCCP Algorithm - converges to a local minimum
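Eliminating the slack variables (and using the fact that Δ(yi, yi, h) = 0) gives the equivalent unconstrained objective, which makes the "difference of convex" structure explicit, since each bracketed term is a pointwise maximum of linear functions of w:

min_w ||w||² + C ∑i [ max_{y,h} ( wᵀΦ(xi, y, h) + Δ(yi, y, h) ) ] - C ∑i [ max_h wᵀΦ(xi, yi, h) ]

CCCP linearizes the concave (second, negated) part at the current w, which amounts to fixing each hi, and then minimizes the remaining convex upper bound.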
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Self-Paced Learning
• Experiments
Concave-Convex Procedure
Start with an initial estimate w0
Update hi = argmax_{h∈H} wtᵀΦ(xi, yi, h)
Update wt+1 by solving a convex problem
min_w ||w||² + C ∑i ξi
s.t.  wᵀΦ(xi, yi, hi) - wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) - ξi
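A minimal sketch of this alternation, assuming two hypothetical helpers: impute_h(w, x, y) returns argmax_{h∈H} wᵀΦ(x, y, h), and solve_convex_ssvm(samples, hidden) solves the convex structural SVM obtained once every hi is fixed, returning the new w together with its objective value.

```python
def cccp(samples, w0, impute_h, solve_convex_ssvm, max_iters=50, tol=1e-4):
    """CCCP for the latent structural SVM, following the two slide steps:
    impute the hidden variables, then re-fit w on the resulting convex problem."""
    w, prev_obj = w0, float("inf")
    for _ in range(max_iters):
        # Step 1: h_i = argmax_h w . phi(x_i, y_i, h) under the current w.
        hidden = [impute_h(w, x, y) for (x, y) in samples]
        # Step 2: solve the convex structural SVM with every h_i held fixed.
        w, obj = solve_convex_ssvm(samples, hidden)
        if prev_obj - obj < tol:   # objective stopped decreasing:
            break                  # a local minimum has been reached
        prev_obj = obj
    return w
```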
Concave-Convex Procedure
Looks at all samples simultaneously
“Hard” samples will cause confusion
Start with “easy” samples, then consider “hard” ones
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Self-Paced Learning
• Experiments
Self-Paced Learning
REMINDER
Simultaneously estimate easiness and parameters
Easiness is a property of data sets, not single instances
Self-Paced Learning
Start with an initial estimate w0
Update hi = argmax_{h∈H} wtᵀΦ(xi, yi, h)
Update wt+1 by solving a convex problem
min_w ||w||² + C ∑i ξi
s.t.  wᵀΦ(xi, yi, hi) - wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) - ξi
Self-Paced Learning
min_w ||w||² + C ∑i ξi
s.t.  wᵀΦ(xi, yi, hi) - wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) - ξi
Self-Paced Learning
vi ∈ {0, 1}
min_{w,v} ||w||² + C ∑i vi ξi
s.t.  wᵀΦ(xi, yi, hi) - wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) - ξi
Trivial solution: vi = 0 for all i
Self-Paced Learning
vi ∈ {0, 1}
min_{w,v} ||w||² + C ∑i vi ξi - ∑i vi / K
s.t.  wᵀΦ(xi, yi, hi) - wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) - ξi
Large K
Medium K
Small K
Self-Paced Learning
Alternating Convex Search
vi ∈ [0, 1]
Biconvex Problem
min_{w,v} ||w||² + C ∑i vi ξi - ∑i vi / K
s.t.  wᵀΦ(xi, yi, hi) - wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) - ξi
Large K
Medium K
Small K
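Because the objective above is linear in each vi once w (and therefore each slack ξi) is fixed, the v-step of alternating convex search has a simple closed form, which also explains the annealing schedule: a large K admits only the easiest samples, and a small K admits nearly all of them.

vi = 1 if C ξi < 1/K, and vi = 0 otherwise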
Self-Paced Learning
Start with an initial estimate w0
T(x ,y ,h)
h
=
max
w
Update
i
hH t
i i
Update wt+1 by solving a convex problem
min_{w,v} ||w||² + C ∑i vi ξi - ∑i vi / K
s.t.  wᵀΦ(xi, yi, hi) - wᵀΦ(xi, y, h) ≥ Δ(yi, y, h) - ξi
Decrease K ← K/μ
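Putting the pieces together, here is a minimal Python sketch of the self-paced learning loop. The helpers impute_h, slack, and solve_weighted_ssvm are hypothetical: they impute hi, return the slack ξi of sample i under the current w, and re-fit w on the selected samples with the hi fixed. For brevity the sketch runs one v-step and one w-step per value of K, whereas the full algorithm alternates them to convergence before annealing.

```python
def self_paced_learning(samples, w0, impute_h, slack, solve_weighted_ssvm,
                        C=1.0, K=1.0, mu=1.3, outer_iters=20):
    """Self-paced learning for the latent structural SVM (sketch).
    Select the currently 'easy' samples (small slack), fit w on them,
    then decrease K so harder samples are admitted in later rounds."""
    w = w0
    for _ in range(outer_iters):
        # Impute hidden variables with the current parameters.
        hidden = [impute_h(w, x, y) for (x, y) in samples]
        # v-step: keep sample i iff C * xi_i < 1 / K.
        v = [1 if C * slack(w, x, y, h) < 1.0 / K else 0
             for (x, y), h in zip(samples, hidden)]
        # w-step: convex structural SVM restricted to the selected samples.
        w = solve_weighted_ssvm(samples, hidden, v, C)
        # Anneal: decreasing K raises the threshold 1/K (mu > 1 assumed).
        K = K / mu
    return w
```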
Outline
• Latent Structural SVM
• Concave-Convex Procedure
• Self-Paced Learning
• Experiments
Object Detection
Input x - Image
Output y ∈ Y
Latent h - Box
Δ - 0/1 loss
Y = {“Bison”, “Deer”, “Elephant”, “Giraffe”, “Llama”, “Rhino”}
Feature Φ(x,y,h) - HOG
Object Detection
Mammals Dataset
271 images, 6 classes
90/10 train/test split
4 folds
Object Detection
[Four slides of qualitative detection results, comparing CCCP and Self-Paced]
Object Detection
[Bar charts per fold (Fold 1-4): objective value and test error, CCCP vs. SPL]
Handwritten Digit Recognition
Input x - Image
Output y ∈ Y
Latent h - Rotation
Δ - 0/1 loss
Y = {0, 1, …, 9}
MNIST Dataset
Feature Φ(x,y,h) - PCA + projection
Handwritten Digit Recognition
[Four slides of results plots comparing SPL and CCCP; significant differences marked]
Motif Finding
Input x - DNA Sequence
Output y ∈ Y
Y = {0, 1}
Latent h - Motif Location
Δ - 0/1 loss
Feature Φ(x,y,h) - Ng and Cardie, ACL 2002
Motif Finding
UniProbe Dataset
40,000 sequences
50/50 train/test split
5 folds
Motif Finding
Average Hamming Distance of Inferred Motifs
[Results table; SPL entries highlighted]
Motif Finding
[Bar chart: objective value per fold (Fold 1-5) for CCCP, Curr, and SPL]
Motif Finding
[Bar chart: test error per fold (Fold 1-5) for CCCP, SPL, and Curr]
Noun Phrase Coreference
Input x - Nouns
Output y - Clustering
Latent h - Spanning Forest over Nouns
Feature (x,y,h) - Yu and Joachims, ICML 2009
Noun Phrase Coreference
MUC6 Dataset
50/50 train/test split
60 documents
1 predefined fold
Noun Phrase Coreference
[Results for MITRE loss and pairwise loss, SPL vs. CCCP; significant improvements and decrements marked]
Summary
• Automatic Self-Paced Learning
• Concave-Biconvex Procedure
• Generalization to other latent variable models
– Expectation-Maximization
– E-step remains the same
– M-step includes indicator variables vi
Kumar, Packer and Koller, NIPS 2010