Multi-layer Orthogonal Codebook for Image Classification


Multi-layer Orthogonal Codebook
for Image Classification
Presented by Xia Li
Outline
• Introduction
– Motivation
– Related work
• Multi-layer orthogonal codebook
• Experiments
• Conclusion
Image Classification
Pipeline: local feature extraction → visual codebook construction → vector quantization → spatial pooling → linear/nonlinear classifier
• Sampling
– Sparse, at interest points, vs. dense, uniformly over a grid
– For object categorization, dense sampling offers better coverage. [Nowak, Jurie & Triggs, ECCV 2006]
• Descriptor
– Orientation histograms within sub-patches build the 4×4×8 = 128-dim SIFT descriptor vector. [David Lowe, 1999, 2004]
Image credits: F-F. Li, E. Nowak, J. Sivic
Image Classification
• Visual codebook construction
– Supervised vs. unsupervised clustering
– k-means (typical choice), agglomerative clustering, mean-shift, …
• Vector quantization via clustering
– Let cluster centers be the prototype “visual words” in descriptor space
– Assign the closest cluster center to each new image patch descriptor
Image credits: K. Grauman, B. Leibe
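The closest-center assignment described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the presenter's code; the function names and the toy 2-D codebook are made up.

```python
# Hard vector quantization: map a descriptor to the index of its
# nearest visual word (cluster center). Toy 2-D data for illustration.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(descriptor, codebook):
    """Index of the nearest visual word (hard assignment)."""
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
print(quantize([0.9, 1.2], codebook))  # 1 (closest to [1.0, 1.0])
```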
Image Classification
Bags of visual words
Image credit: Fei-Fei Li
• Represent entire image based on its
distribution (histogram) of word occurrences.
• Analogous to the bag-of-words representation used for document classification/retrieval.
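The histogram construction above can be sketched directly: count how often each word index occurs and normalize. A minimal sketch; the function name is illustrative.

```python
def bow_histogram(word_ids, K):
    """Normalized histogram of visual-word occurrences for one image.
    word_ids: hard-quantized word index of each descriptor; K: codebook size."""
    h = [0.0] * K
    for w in word_ids:
        h[w] += 1.0
    n = float(len(word_ids))
    return [c / n for c in h]

print(bow_histogram([0, 1, 1, 2], 4))  # [0.25, 0.5, 0.25, 0.0]
```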
Image Classification
Image credit: S. Lazebnik
[S. Lazebnik, C. Schmid, and J. Ponce, CVPR 2006]
Image Classification
• Histogram intersection kernel: K(h_i, h_j) = Σ_k min(h_i(k), h_j(k))
• Linear kernel: K(h_i, h_j) = h_i · h_j
Image credit: S. Lazebnik
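Both kernels can be computed directly from two word histograms; a minimal sketch with made-up toy histograms:

```python
def intersection_kernel(h1, h2):
    """Histogram intersection kernel: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def linear_kernel(h1, h2):
    """Linear kernel: dot product of the two histograms."""
    return sum(a * b for a, b in zip(h1, h2))

h1, h2 = [0.5, 0.5, 0.0], [0.25, 0.25, 0.5]
print(intersection_kernel(h1, h2))  # 0.5
print(linear_kernel(h1, h2))        # 0.25
```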
Image Classification
Image credit: S. Lazebnik
[S. Lazebnik, C. Schmid, and J. Ponce, CVPR 2006]
Motivation
• Codebook quality
– Feature type
– Codebook creation
• Algorithm e.g. K-Means
• Distance metric e.g. L2
• Number of words
– Quantization process
• Hard quantization: exactly one word is assigned to each descriptor
• Soft quantization: multiple words may be assigned to each descriptor
Motivation
• Quantization error
– The squared Euclidean distance between a descriptor vector and its mapped visual word
– Hard quantization leads to large error
[Figure: scatter plot (log scale on both axes) of descriptor discriminative power before and after quantization, showing a severe drop caused by hard quantization]
O. Boiman, E. Shechtman, M. Irani, CVPR 2008
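The per-descriptor quantization error defined above (squared distance to the mapped word) can be computed directly; a toy sketch with made-up vectors. Note the comment distinguishes the easy case (adding words to an existing codebook) from the slides' harder case (re-clustering a larger codebook, which is not a superset).

```python
def quantization_error(d, codebook):
    """Hard-quantization error: squared Euclidean distance from a
    descriptor to its nearest visual word."""
    return min(sum((x - y) ** 2 for x, y in zip(d, b)) for b in codebook)

# Appending words to an existing codebook can only keep each descriptor's
# error the same or lower (the minimum is taken over a superset). A
# re-clustered larger codebook gives no such per-descriptor guarantee.
small = [[0.0, 0.0], [1.0, 1.0]]
large = small + [[0.9, 1.2]]
print(quantization_error([0.9, 1.2], small))  # ~0.05
print(quantization_error([0.9, 1.2], large))  # 0.0
```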
Motivation
• Codebook size is an important factor for applications that need efficiency
– Simply enlarging the codebook size can reduce overall quantization error
– but cannot guarantee that every descriptor's error is reduced

codebook size                  | percent of descriptors
codebook 128 vs. codebook 256  | 72.06%
codebook 128 vs. codebook 512  | 84.18%

The right column is the percentage of descriptors whose quantization error is reduced when the codebook size grows
Motivation
• Good codebook for classification
local feature
extraction
visual codebook
construction
vector
quantization
spatial pooling
linear/nolinear
classifier
• Small individual quantization error → discriminative
• Compact in size
– These goals contradict to some extent
• Overemphasizing discriminative ability may increase the dictionary size and weaken its generalization ability
• Over-compressing the dictionary loses information and discriminative power
– Find a balance!
[X. Lian, Z. Li, C. Wang, B. Lu, and L. Zhang, CVPR 2010]
Related Work
• No quantization
– NBNN [6]
• Supervised codebook
– Probabilistic models [5]
• Unsupervised codebook
– Kernel codebook [2]
– Sparse coding [3]
– Locality-constrained linear coding [4]
Multi-layer Orthogonal Codebook
(MOC)
• Use standard K-Means for efficiency; any other clustering algorithm can also be adopted
• Build codebooks from residues to reduce quantization error explicitly
MOC Creation
• First layer codebook
– K-Means on N descriptors randomly sampled to build the codebook (d_i denotes one of the descriptors)
• Residue
– The difference between each descriptor and its nearest first-layer word, r_i = d_i − b(d_i)
MOC Creation
• Orthogonal residue
• Second layer codebook
– K-Means on the residues from the first layer
• Third layer: repeat the process on the new residues, …
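A minimal sketch of the residue computation that feeds the next layer's K-Means. The orthogonalization step and the clustering call itself are omitted; the helper names and toy 2-D vectors are my own, not from the slides.

```python
def nearest(d, codebook):
    """The visual word closest to descriptor d (squared Euclidean)."""
    return min(codebook, key=lambda b: sum((x - y) ** 2 for x, y in zip(d, b)))

def residues(descriptors, codebook):
    """Per-descriptor residue: descriptor minus its nearest word.
    Running K-Means on these residues yields the next-layer codebook."""
    return [[x - y for x, y in zip(d, nearest(d, codebook))] for d in descriptors]

layer1 = [[0.0, 0.0], [2.0, 2.0]]
print(residues([[0.5, 0.0], [2.0, 1.5]], layer1))  # [[0.5, 0.0], [0.0, -0.5]]
```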
Vector Quantization
• How to use MOC?
– Kernel fusion: use the layers separately
• Compute a kernel from each layer's codebook separately
• Let the final kernel be a combination of the multiple kernels
– Soft weighting: adjust the weight of words from different layers individually for each descriptor
• Select the nearest word in each layer's codebook for a descriptor
• Use the selected words from all layers to reconstruct that descriptor, minimizing the reconstruction error
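The reconstruction step can be sketched as a tiny least-squares problem: given the nearest word from each of two layers, find the weights that best reconstruct the descriptor. The 2×2 Cramer's-rule solver and the toy vectors are illustrative only, not the presenter's formulation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def soft_weights(d, b1, b2):
    """Weights (w1, w2) minimizing ||d - w1*b1 - w2*b2||^2, via the
    2x2 normal equations solved with Cramer's rule."""
    a11, a12, a22 = dot(b1, b1), dot(b1, b2), dot(b2, b2)
    y1, y2 = dot(b1, d), dot(b2, d)
    det = a11 * a22 - a12 * a12
    return ((y1 * a22 - y2 * a12) / det, (a11 * y2 - a12 * y1) / det)

w1, w2 = soft_weights([1.0, 1.0], [1.0, 0.0], [0.0, 2.0])
print(w1, w2)  # 1.0 0.5 -> exact reconstruction: 1.0*[1,0] + 0.5*[0,2] = [1,1]
```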
Hard Quantization and Kernel Fusion
(HQKF)
• Hard quantization on each layer
– Average pooling: with M sub-regions on an image, the histogram for the m-th sub-region averages the word assignments of the descriptors falling in that sub-region
• Histogram intersection kernel computed per layer
• Linearly combine the kernel values from each layer's codebook
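Kernel fusion then reduces to a weighted sum of per-layer kernel values; a minimal sketch using histogram intersection per layer (the weights and toy histograms are made up):

```python
def intersection(h1, h2):
    """Histogram intersection kernel for one layer."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def fused_kernel(layers_i, layers_j, weights):
    """Linear combination of per-layer kernel values (kernel fusion)."""
    return sum(w * intersection(hi, hj)
               for hi, hj, w in zip(layers_i, layers_j, weights))

ki = fused_kernel([[1.0, 0.0], [0.5, 0.5]],   # image i, layers 1 and 2
                  [[0.5, 0.5], [0.5, 0.5]],   # image j, layers 1 and 2
                  [0.5, 0.5])                 # per-layer weights
print(ki)  # 0.75
```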
Soft Weighting (SW)
• Weight the selected words for each descriptor
• Max pooling over the descriptors in a region (K is the codebook size)
• Linear kernel on the pooled vectors
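Max pooling keeps, for each word, the largest weight any descriptor in the region assigns to it; a minimal sketch with made-up weight vectors:

```python
def max_pool(weight_vectors):
    """Per-word maximum weight over all descriptors in a region.
    weight_vectors: one length-K weight vector per descriptor."""
    K = len(weight_vectors[0])
    return [max(wv[k] for wv in weight_vectors) for k in range(K)]

print(max_pool([[0.1, 0.9, 0.0], [0.6, 0.2, 0.0]]))  # [0.6, 0.9, 0.0]
```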
Soft Weighting (SW-NN)
• To further consider the relationships between words from multiple layers
• Select 2 or more nearest words in each layer's codebook, then weight them to reconstruct the descriptor
• Each descriptor is represented more accurately by multiple words on each layer
• The correlation between similar descriptors that share words is captured
[J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, CVPR 2010]
Experiment
• Single feature type: SIFT
– 16×16-pixel patches densely sampled over a grid with a spacing of 6 pixels
• Spatial pyramid: 21 = 16 + 4 + 1 sub-regions at three resolution levels
• Clustering method on each layer: K-Means
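The 21 sub-regions follow from the three pyramid levels, since level l splits the image into 4**l cells:

```python
# Spatial pyramid with three levels: 1 + 4 + 16 = 21 sub-regions.
levels = 3
print(sum(4 ** l for l in range(levels)))  # 21
```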
Datasets
• Caltech-101
– 101 categories, 31-800 images per category
• 15 Scenes
– 15 scenes, 4485 images
Quantization Error
• Quantization error is reduced more effectively by MOC than by simply enlarging the codebook size
• Experiment done on Caltech101

codebook size                      | percent of descriptors
codebook 128 vs. codebook 256      | 72.06%
codebook 128 vs. codebook 512      | 84.18%
codebook 128 vs. codebook 128+128  | 91.22%
codebook 256 vs. codebook 128+128  | 87.04%
codebook 512 vs. codebook 128+128  | 63.80%

The right column is the percentage of descriptors whose quantization error is reduced when the codebook changes
Codebook Size
• Classification accuracy comparisons with a single-layer codebook
[Figure: accuracy (0.58–0.74) vs. codebook size (64, 128, 256, 512, 1024) for a single-layer codebook, 2-layer HQKF, and 2-layer SW]
Comparison with a single codebook (Caltech101). The 2-layer codebook has the same size on each layer, which is also the size of the single-layer codebook.
Comparisons with existing methods
• Classification accuracy comparisons with existing methods (accuracy %; codebook size in parentheses)

Caltech101:
# of training | SPM [1]     | KC [2]     | ScSPM [3]        | LLC [4]       | HQKF                    | SW                      | SW+2NN
15            | 56.40 (200) | -          | 67.0±0.45 (1024) | 65.43 (2048)  | 60.66±0.7 (3-layer 512) | 64.48±0.5 (3-layer 512) | 65.90±0.5 (2-layer 1024)
30            | 64.60 (200) | 64.14±1.18 | 73.2±0.54 (1024) | *73.44 (2048) | 69.28±0.8 (3-layer 512) | 71.60±1.1 (3-layer 512) | 72.97±0.8 (2-layer 1024)

15 Scenes:
# of training | SPM [1]          | KC [2]     | ScSPM [3]         | LLC [4] | HQKF                     | SW                       | SW+2NN
100           | 81.40±0.5 (1024) | 76.67±0.39 | 80.28±0.93 (1024) | -       | 83.21±0.6 (3-layer 1024) | 82.27±0.6 (3-layer 1024) | -

All listed methods use a single descriptor type.
*Only LLC used HoG instead of SIFT; repeating their method with our type of descriptors yields 71.63±1.2.
Conclusion
• Compared with existing methods, the proposed approach has the following merits:
– 1) No complex algorithm; easy to implement
– 2) No time-consuming learning or clustering stage, so it can be applied in large-scale computer vision systems
– 3) Even more efficient than traditional K-Means clustering
– 4) Explicit residue minimization exploits the discriminative power of descriptors
– 5) The basic idea can be combined with many state-of-the-art methods
References
• [1] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features:
Spatial pyramid matching for recognizing natural scene categories,”
CVPR, pp. 2169 – 2178, 2006.
• [2] J. Gemert, J. Geusebroek, C. Veenman, and A. Smeulders,
“Kernel codebooks for scene categorization,” ECCV, pp. 696-709,
2008.
• [3] J. Yang, K. Yu, Y. Gong, and T. Huang, “Linear spatial pyramid
matching using sparse coding for image classification,” CVPR, pp.
1794-1801, 2009.
• [4] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” CVPR, pp. 3360-3367, 2010.
• [5] X. Lian, Z. Li, C. Wang, B. Lu, and L. Zhang, “Probabilistic models
for supervised dictionary learning,” CVPR, pp. 2305-2312, 2010.
• [6] O. Boiman, E. Shechtman, and M. Irani, “In defense of nearest-neighbor based image classification,” CVPR, pp. 1-8, 2008.
• Thank you!
Codebook Size
• Different size combinations on a 2-layer MOC
[Figure: classification accuracy (0.61–0.71) on Caltech101 vs. 1st-layer codebook size (64, 128, 256); colors denote the 2nd-layer codebook size (64, 128, 256)]