Combining Randomization and Discrimination for Fine-grained Image Categorization (CVPR 2011) Bangpeng Yao*, Aditya Khosla*, and Li Fei-Fei Computer Science Department, Stanford University (* - indicates.


Combining Randomization and
Discrimination for Fine-grained Image
Categorization (CVPR 2011)
Bangpeng Yao*, Aditya Khosla*, and Li Fei-Fei
Computer Science Department, Stanford University
(* - indicates equal contribution)
{bangpeng,aditya86,feifeili}@cs.stanford.edu
1
"Hmmm, I wonder if these mushrooms are edible!"
"Hmmm, I wonder if this snake is poisonous…"
Fine-Grained Oracle:
Species: Amanita Muscaria
Species: Red-capped Scaber
2
From Base-level to Fine-grained Categorization
Mammals, base-level categorization: Cats, Dogs, Cows
Dogs, fine-grained categorization: Pugs, Boxer, Mastiff
[Rosch, 1979]
3
Subordinate Categorization
• Children learn to identify base-level categories before subordinate categories
Base-level: Shoes!
Subordinate: Stilettos! Flats! Pumps! Mary Janes! Peep toes! Sling backs! Platforms!
[Johnson & Eilers, 1998]
4
Computer vision has followed a similar trajectory!
• Caltech-4, 1999
• Viola-Jones, 2001
• Caltech-101, 2004
• Spatial pyramid matching, 2006
• Caltech-256, 2007
• Deformable Parts-based Model, 2008
• ImageNet, 2009-2011
5
Recently Emerging Fine-Grained Datasets
• Caltech-UCSD Birds 200 (CUB-200), 2010
• Action Classification (PASCAL VOC, PPMI, …), 2010
• ImageNet Winter Release, 2011
Earlier datasets on the timeline: Caltech-4, 1999; Caltech-101, 2004; Caltech-256, 2007; ImageNet, 2009-2011
6
Fine-Grained Categorization: Potential Applications
• Fine-grained tracking/surveillance ("We are looking for a red Camry")
• Robots handling tools/cooking
• Assisting humans in fine discrimination
7
Outline
• Fine-grained Image Categorization
• Our Method
• Dataset & Experiments
• ImageNet Datasets
• Conclusion
8
Fine-grained Categorization is Challenging
Reference image: California Gull
• California Gull: different pose
• California Gull: different viewpoint
• California Gull: different background
• Herring Gull (three examples): different class, very similar appearance
Can existing approaches adequately address these challenges?
10
Existing methods on this Problem
California Gull
Herring Gull
California Gull
High Weights
Low weights
11
[Grauman & Darrell, 2005, Lazebnik et al, 2006]
Our Intuition: What do humans do?
California Gull
Herring Gull
California Gull
Our goal: Find image regions that contain discriminative information!
• Dense feature space: regions in SPM are insufficient to explore this space
• How to explore this feature space? Randomization & Discrimination
16
Outline
• Fine-grained Image Categorization
• Our Method
• Dataset & Experiments
• ImageNet Datasets
• Conclusion
17
Potential Image Regions
[Figure: a normalized image with candidate regions; each region is described by its center and its size (region width and region height)]
18
Potential Image Regions
[Figure: a normalized image with candidate regions; each region is described by its center and its size (region width and region height)]
Image size: N × N
Image regions: O(N^6)
How can we identify the most discriminative regions in an efficient manner?
21
Potential Image Regions
[Figure: a normalized image with candidate regions; each region is described by its center and its size (region width and region height)]
We can apply randomization to sample a subset of the image patches
Random Forests
22
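The randomization step on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: a region is taken to be a rectangle given by its center and size on a unit-square normalized image, and the names `sample_region`, `sample_regions`, `to_pixels`, and the `min_size` lower bound are hypothetical.

```python
import math
import random


def sample_region(rng, min_size=0.1):
    """Draw one rectangle (cx, cy, w, h) inside the unit-square image.

    min_size is an assumed lower bound on the region width and height.
    """
    w = rng.uniform(min_size, 1.0)
    h = rng.uniform(min_size, 1.0)
    # Choose the center so that the whole rectangle stays inside the image.
    cx = rng.uniform(w / 2, 1.0 - w / 2)
    cy = rng.uniform(h / 2, 1.0 - h / 2)
    return cx, cy, w, h


def sample_regions(count, seed=0):
    """Sample `count` random regions reproducibly."""
    rng = random.Random(seed)
    return [sample_region(rng) for _ in range(count)]


def to_pixels(region, image_size):
    """Convert a normalized region to pixel bounds (x0, y0, x1, y1)."""
    cx, cy, w, h = region
    x0 = max(0, math.floor((cx - w / 2) * image_size))
    y0 = max(0, math.floor((cy - h / 2) * image_size))
    x1 = min(image_size, math.ceil((cx + w / 2) * image_size))
    y1 = min(image_size, math.ceil((cy + h / 2) * image_size))
    return x0, y0, x1, y1
```

Sampling a few hundred such regions per tree node, instead of enumerating every possible rectangle, is what keeps the search over the dense region space tractable.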
Potential Image Regions
[Figure: a normalized image with candidate regions, each described by its center and size; plot labels: This class, Other classes]
Random Forests
23
Potential Image Regions
[Figure: a normalized image with candidate regions, each described by its center and size; plot labels: This class, Other classes]
Random Forests With Strong Classifiers!
24
Random Forests with Strong Classifiers
• Generalization error of a random forest is bounded by ρ(1 − s²) / s² (Breiman's upper bound), where
  ρ: correlation between decision trees
  s: strength of the decision trees
• Dense feature space: ρ decreases
• Strong classifiers: s increases
Both effects lead to better generalization
26
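A small worked example makes the argument concrete. It assumes the bound referred to on this slide is Breiman's upper bound on random forest generalization error, ρ(1 − s²)/s², with ρ the mean correlation between trees and s the strength of the trees. The function name `forest_error_bound` and the numeric values are hypothetical, chosen only to show the direction of each effect.

```python
def forest_error_bound(correlation, strength):
    """Breiman's upper bound on random forest generalization error.

    correlation: mean correlation between trees (rho)
    strength:    strength of the individual trees (s), with 0 < s <= 1
    """
    if not 0 < strength <= 1:
        raise ValueError("strength must be in (0, 1]")
    return correlation * (1 - strength ** 2) / strength ** 2


baseline = forest_error_bound(correlation=0.5, strength=0.5)
# A denser feature space gives more varied trees: lower correlation.
less_correlated = forest_error_bound(correlation=0.3, strength=0.5)
# Stronger node classifiers give stronger trees: higher strength.
stronger_trees = forest_error_bound(correlation=0.5, strength=0.7)
```

Both `less_correlated` and `stronger_trees` come out smaller than `baseline`, which is the sense in which the two ingredients each tighten the bound.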
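Random Forests with Strong Classifiers
[Figure: a decision tree over class-labeled training samples; at a node, the samples are split into two branches labeled 0 and 1]
Train a binary SVM
27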
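Random Forests with Strong Classifiers
[Figure: a decision tree over class-labeled training samples; at a node, the samples are split into two branches labeled 0 and 1]
Train a binary SVM
• Select best sample using information gain criterion
28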
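The information gain criterion can be sketched as below. This is a minimal illustration under assumptions: each candidate (for example, one sampled region with its trained binary SVM) is represented only by the boolean left/right assignment it produces for the training samples at the node, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(labels, goes_right):
    """Entropy reduction obtained by splitting `labels` with a boolean mask."""
    left = [y for y, r in zip(labels, goes_right) if not r]
    right = [y for y, r in zip(labels, goes_right) if r]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def select_best_split(labels, candidate_splits):
    """Return the index of the candidate split with the highest gain."""
    gains = [information_gain(labels, split) for split in candidate_splits]
    return max(range(len(gains)), key=gains.__getitem__)
```

For node labels `[1, 1, 2, 2]`, a candidate that sends the two classes to different children has a gain of one bit, while a candidate that mixes them evenly has a gain of zero, so the first is selected.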
Random Forests with Strong Classifiers
• We stop growing the trees if:
  o The maximum depth is reached
  o There is only one class at the node
  o The entropy of the training data at the node is low
31
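The three stopping rules can be written as a single predicate. The sketch below is illustrative only: the thresholds `max_depth` and `min_entropy` are hypothetical values, since the slide does not state the ones used.

```python
import math
from collections import Counter


def node_entropy(labels):
    """Shannon entropy (in bits) of the class labels at a node."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def should_stop(labels, depth, max_depth=10, min_entropy=0.1):
    """Return True if the node should become a leaf.

    max_depth and min_entropy are assumed thresholds, not published values.
    """
    if depth >= max_depth:              # maximum depth reached
        return True
    if len(set(labels)) <= 1:           # only one class at the node
        return True
    return node_entropy(labels) < min_entropy  # entropy is already low
```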
Classification With Random Forests
[Figure: a test image is passed through every tree of the forest; annotated terms: class label, number of trees]
32
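One common way to combine the trees is to average their class distributions and take the most probable class. The sketch below assumes that scheme; the slide's own formula did not survive extraction, so this should be read as an assumption rather than the authors' exact rule. The function name `classify` and the example probabilities are hypothetical.

```python
def classify(tree_posteriors):
    """Average per-tree class distributions and return (label, averages).

    tree_posteriors: one dict per tree, mapping class label -> probability
    at the leaf reached by the test image.
    """
    if not tree_posteriors:
        raise ValueError("need at least one tree")
    n_trees = len(tree_posteriors)
    classes = sorted(set().union(*tree_posteriors))
    averages = {
        c: sum(p.get(c, 0.0) for p in tree_posteriors) / n_trees
        for c in classes
    }
    label = max(classes, key=averages.get)
    return label, averages


label, averages = classify([
    {"California Gull": 0.6, "Herring Gull": 0.4},
    {"California Gull": 0.3, "Herring Gull": 0.7},
    {"California Gull": 0.8, "Herring Gull": 0.2},
])
```

Here one tree prefers the wrong class, but the averaged distribution still favors "California Gull".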
Outline
• Fine-grained Image Categorization
• Our Method
• Dataset & Experiments
• ImageNet Datasets
• Conclusion
33
Experiments: Datasets
Caltech-UCSD Birds 200
Evaluation metric: Average Per-class Accuracy
Action Classification
• PASCAL Action Dataset
• People-playing musical-instruments (PPMI)
Evaluation metric: Mean Average Precision (mAP)
34
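The two evaluation metrics can be sketched as follows. This is a minimal illustration: the average precision here is the plain non-interpolated version (mean of the precision at each positive in the ranked list), which is an assumption and may differ in detail from the official PASCAL VOC evaluation code. All function names are hypothetical.

```python
from collections import defaultdict


def average_per_class_accuracy(y_true, y_pred):
    """Mean over classes of the per-class accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def average_precision(scores, is_positive):
    """Non-interpolated AP for one class from classifier scores."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_class_ap):
    """mAP: the mean of the per-class average precisions."""
    return sum(per_class_ap) / len(per_class_ap)
```

Average per-class accuracy weights every class equally: predicting class 0 for the labels `[0, 0, 0, 1]` gets three of four images right, yet scores only 0.5 because class 1 is missed entirely.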
Caltech-UCSD Birds 200
• 200 bird species from North America
Train Set:
(15 images/class)
Test Set:
(~15 images/class)
Bounding box
annotations are provided
for each image
35
Caltech-UCSD Birds 200
• 200 bird species from North America
Train Set: 15 images/class
Test Set: ~15 images/class
Bounding box annotations are provided for each image
Feature extraction:
• ColorSIFT descriptors
• Dense SIFT sampling at multiple scales (8, 12, 16, 24, 30)
• Locality-constrained Linear Coding (LLC) features
• Dictionary size: 256
• Use only the image region in provided bounding box
[LLC: Wang et al, 2010]
[ColorSIFT: van de Sande et al, 2010]
36
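Dense sampling at multiple scales can be pictured as laying a regular grid of square patches over the bounding-box region, once per scale. The sketch below only enumerates patch locations; it does not compute SIFT or LLC codes. It assumes the listed scales are patch side lengths in pixels, and the grid spacing `step` is a hypothetical value not given on the slide.

```python
def dense_patches(width, height, patch_sizes=(8, 12, 16, 24, 30), step=4):
    """List (x, y, size) for every grid patch that fits inside the region.

    patch_sizes are read as side lengths in pixels (an assumption);
    step is an assumed grid spacing.
    """
    patches = []
    for size in patch_sizes:
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                patches.append((x, y, size))
    return patches


# Patch locations for a 64 x 48 bounding-box crop.
patches = dense_patches(64, 48)
```

A descriptor would then be computed on each patch and encoded against the dictionary before any region-level pooling.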
Caltech-UCSD Birds 200
• 200 bird species from North America
Train Set: 15 images/class
Test Set: ~15 images/class
Accuracy:
  Method   Accuracy
  SPM      16.1%
  LLC      18.0%
  MKL      19% (uses many features including gray/color SIFT, geometric blur, color histograms, etc.)
  Ours     19.2%
Feature extraction:
• Densely sampled ColorSIFT descriptors
• Locality-constrained Linear Coding (LLC)
• Dictionary size: 256
• Only bounding box regions
[LLC: Wang et al, 2010]
[ColorSIFT: van de Sande et al, 2010]
37
PASCAL Action Dataset
  Class                Train set #images   Test set #images
  Phoning              51                  29
  Playing Instrument   76                  51
  Reading              53                  33
  Riding Bike          66                  39
  Riding Horse         71                  47
  Running              94                  48
  Taking Photo         55                  20
  Using Computer       59                  30
  Walking              83                  36
Bounding boxes and corresponding action provided for each image
[Figure: a "Running" example separated into a background image and foreground images]
40
PASCAL Action Dataset
  Class                Train set #images   Test set #images
  Phoning              51                  29
  Playing Instrument   76                  51
  Reading              53                  33
  Riding Bike          66                  39
Bounding boxes and corresponding action provided for each image
[Figure: a "Running" example separated into a background image and foreground images]
Feature extraction:
• Grayscale SIFT descriptors
• Dense SIFT sampling at multiple scales (8, 12, 16, 24, 30)
• Locality-constrained Linear Coding (LLC) features
• Dictionary size: 1024
• Concatenation of 2-level SPM of background features + foreground features (similar to Delaitre et al, 2010)
[LLC: Wang et al, 2010]
[SIFT: Lowe, 2004]
41
PASCAL Action Dataset
[Figure: per-class mAP bar chart comparing CVC-BASE, UCLEAR-DOSP, and Our Method over Phoning, Playing Instrument, Reading, Riding Bike, Riding Horse, Running, Taking Photo, Using Computer, and Walking]
Best performance in 7 of 9 classes!
42
PASCAL Action Dataset
[Figure: per-class mAP bar chart comparing CVC-BASE, UCLEAR-DOSP, and Our Method over the nine action classes]
[Figure: overall mAP bar chart for CVC-BASE, CVC-SEL, SURREY-KDA, UCLEAR-DOSP, UMCO-KSVM, and Our Method; bar labels in extraction order: 64.6%, 60.8%, 60.4%, 62.2%, 61.1%, 54.3%]
43
People-Playing-Musical-Instruments (PPMI) Dataset
[Figure: example PPMI+ and PPMI- images for each instrument, with the number of images under each example]
# Image labels, in extraction order:
172, 191, 177, 179, 200, 185, 164, 148, 133, 149, 188, 167
198, 191, 177, 179, 200, 198, 169, 148, 133, 149, 188, 169
48
People-Playing-Musical-Instruments (PPMI) Dataset
[Figure: an original image and the corresponding normalized image]
(200 images for each interaction)
49
People-Playing-Musical-Instruments (PPMI) Dataset
Feature extraction:
• Grayscale SIFT descriptors
• Dense SIFT sampling at multiple scales (8, 12, 16, 24, 30)
• Locality-constrained Linear Coding (LLC) features
• Dictionary size: 256
[LLC: Wang et al, 2010]
[SIFT: Lowe, 2004]
50
Recognition tasks on People-Playing-Musical-Instruments (PPMI) Dataset
Binary Classification: Playing vs. Not playing
• Playing Violin vs. Not playing Violin
• Playing Cello vs. Not playing Cello
• Playing Clarinet vs. Not playing Clarinet
Different Instruments + Interactions Classification
• 12 instruments with 2 interactions (PPMI+ & PPMI-) per instrument
• 24-way classification
• Example classes: Playing Guitar, Playing Cello, Not playing Cello, Not playing Harp, Playing Clarinet
51
PPMI Binary Classification
Twelve 2-class classification problems: PPMI+ vs. PPMI- for each instrument
[Figure: per-instrument mAP bar chart comparing BoW, Grouplet, SPM, LLC, and Our Method over Bassoon, Erhu, Flute, French Horn, Guitar, Saxophone, Violin, Trumpet, Cello, Clarinet, Harp, and Recorder]
Best performance in 9 of 12 classes!
[SPM: Lazebnik et al, 2006]
[Grouplet: Yao & Fei-Fei, 2010]
[LLC: Wang et al, 2010]
52
PPMI Binary Classification
Twelve 2-class classification problems: PPMI+ vs. PPMI- for each instrument
[Figure: per-instrument mAP bar chart comparing BoW, Grouplet, SPM, LLC, and Our Method over the twelve instruments]
Overall Result:
  Method     mAP
  BoW        78.0%
  Grouplet   85.1%
  SPM        88.2%
  LLC        89.2%
  Ours       92.1%
[SPM: Lazebnik et al, 2006]
[Grouplet: Yao & Fei-Fei, 2010]
[LLC: Wang et al, 2010]
53
PPMI Binary Classification
Twelve 2-class classification problems: PPMI+ vs. PPMI- for each instrument
[Figure: example images for Flute, Guitar, and Violin; rows: Playing Instrument, Not Playing Instrument]
54
PPMI 24-way Classification
24-class classification problem: PPMI+ and PPMI- for all instruments
  Method     mAP
  BoW        22.7%
  Grouplet   36.7%
  SPM        39.1%
  LLC        41.8%
  Ours       47.0%
[SPM: Lazebnik et al, 2006]
[Grouplet: Yao & Fei-Fei, 2010]
[LLC: Wang et al, 2010]
56
Coarse-to-fine Learning
[Figure: image regions selected at tree depths 1 to 5, shown for Tree 1 … Tree N]
The human visual system is believed to analyze raw input in order from low to high spatial frequencies or from large global shapes to smaller local ones!
57
Outline
• Fine-grained Image Categorization
• Our Method
• Dataset & Experiments
• ImageNet Datasets
• Conclusion
58
ImageNet Datasets
• ImageNet is a rich resource for fine-grained image categorization datasets.
  Dataset         No. of Classes   No. of Images   Images per class   Training Images/class   Visibility varies?   Bounding boxes?
  CUB-200         200              6033            30                 15                      Yes                  Yes
  PPMI            24               4800            200                100                     No                   Yes
  PASCAL Action   9                1221            135                ~60                     Yes                  Yes
  Dogs            120              ~22000          ~180               100                     Yes                  Yes
  Agarics         200              ~20000          ~100               50                      Yes                  Yes
  Vehicles        50               ~20000          ~400               200                     Yes                  Yes
59
ImageNet Dogs Dataset
Reference image: Chihuahua
• Chihuahua: different pose
• Chihuahua: partial occlusion
• Chihuahua: different appearance
• Chihuahua: different age
• Whippet, Toy Terrier, West Highland Terrier, Toy Terrier: different classes, very similar appearance
60
Much better performance with more training images!
[Figure: accuracy as the number of training images increases; plotted values: 13.8%, 17.7%, 21.2%, 24.1%, 26.1%]
61
Performance improves significantly for
almost all classes!
Per-class Accuracy
62
Ongoing Work
• Exploit the inherently parallel nature of random forests (GPUs)
• Incorporate multiple features
• Use strong classifiers with analytical solutions (e.g. LDA)
• Allow variable image region location in each image
63
Outline
• Fine-grained Image Categorization
• Our Method
• Dataset & Experiments
• ImageNet Datasets
• Conclusion
64
Conclusion
• Exploring dense image features can benefit
fine-grained image categorization
• Combining randomization and discrimination
is an effective way to explore the dense image
representation
• A large number of examples per class can be
extremely beneficial to capture the subtle
variations in fine-grained categories
65
Thanks to
Hao Su, Olga Russakovsky, anonymous
reviewers,
And You!
66
Control Experiments
67
