Stanford CS223B Computer Vision, Winter 2006
Lecture 14: Object Detection and Classification Using Machine Learning
Gary Bradski, Intel, Stanford
CAs: Dan Maynes-Aminzade, Mitul Saha, Greg Corrado
“Who will be strong and stand with me?
Beyond the barricade,
Is there a world you long to see?”
-- Enjolras, “Do You Hear the People Sing?”, Les Misérables
Fast, accurate and general object recognition …
This guy is wearing a haircut
called a “Mullet”
Find the Mullets…
Rapid Learning
and Generalization
Sebastian Thrun & Gary Bradski
Stanford University
CS223B Computer Vision
Approaches to Recognition
Approaches vary along two axes: the features used (local vs. global) and whether geometric relations between features are modeled (geometric vs. non-geometric). Examples:
– Eigen Objects / Turk
– Shape models
– Constellation / Perona
– Patches / Ullman
– Histograms / Schiele
– HMAX / Poggio
– MRF / Freeman, Murphy
We’ll see a few of these …
Global
Eigenfaces

Find a new coordinate system that best captures the scatter of the data:
the eigenvectors point in the directions of scatter, ordered by the magnitude
of their eigenvalues. We can typically prune the basis to a few dozen eigenvectors.
Global
Eigenfaces, the algorithm
Assumptions: Square images with W=H=N
M is the number of images in the database
P is the number of persons in the database

The database: M = 8 face images, each unrolled into a column vector of length $N^2$:

$$\mathbf{a} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_{N^2} \end{pmatrix},\;
\mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_{N^2} \end{pmatrix},\;
\mathbf{c} = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_{N^2} \end{pmatrix},\;
\ldots,\;
\mathbf{h} = \begin{pmatrix} h_1 \\ h_2 \\ \vdots \\ h_{N^2} \end{pmatrix}$$
[slide credit: Alexander Roth]
Global
Eigenfaces, the algorithm

We compute the average face

$$\mathbf{m} = \frac{1}{M}\left(\mathbf{a} + \mathbf{b} + \cdots + \mathbf{h}\right)
= \frac{1}{M}\begin{pmatrix} a_1 + b_1 + \cdots + h_1 \\ \vdots \\ a_{N^2} + b_{N^2} + \cdots + h_{N^2} \end{pmatrix}, \quad \text{with } M = 8.$$

Then subtract it from the training faces:

$$\mathbf{a}_m = \begin{pmatrix} a_1 - m_1 \\ a_2 - m_2 \\ \vdots \\ a_{N^2} - m_{N^2} \end{pmatrix},\;
\mathbf{b}_m = \mathbf{b} - \mathbf{m},\;
\mathbf{c}_m = \mathbf{c} - \mathbf{m},\; \ldots,\;
\mathbf{h}_m = \mathbf{h} - \mathbf{m}.$$
[slide credit: Alexander Roth]
Global
Eigenfaces, the algorithm

Now we build the $N^2 \times M$ matrix

$$A = \left[\,\mathbf{a}_m \;\; \mathbf{b}_m \;\; \mathbf{c}_m \;\; \mathbf{d}_m \;\; \mathbf{e}_m \;\; \mathbf{f}_m \;\; \mathbf{g}_m \;\; \mathbf{h}_m\,\right].$$

The covariance matrix $C = AA^T$ is $N^2 \times N^2$. Finding the eigenvalues of $C$ directly is impractical:
– The matrix is very large
– The computational effort is very big
We are interested in at most M eigenvalues, so we can reduce the dimension of the matrix.
[slide credit: Alexander Roth]
Global
Eigenvalue Theorem

Define
$C = AA^T$, dimension $N^2 \times N^2$
$L = A^T A$, dimension $M \times M$ (e.g., 8 by 8)

Let $\mathbf{v}$ be an eigenvector of $L$: $L\mathbf{v} = \lambda\mathbf{v}$.
Then $A\mathbf{v}$ is an eigenvector of $C$: $C(A\mathbf{v}) = \lambda(A\mathbf{v})$.

Proof:
$$C(A\mathbf{v}) = AA^T(A\mathbf{v}) = A(A^T A\mathbf{v}) = A(L\mathbf{v}) = A(\lambda\mathbf{v}) = \lambda(A\mathbf{v}).$$

This vast dimensionality reduction is what makes the whole thing work.
[slide credit: Alexander Roth]
Global
Eigenfaces, the algorithm

Compute another matrix, which is M by M:
$$L = A^T A$$
Find its M eigenvalues and eigenvectors (by the theorem above, eigenvectors of C and L are equivalent).
Build the matrix V whose columns are the eigenvectors of L.
The eigenvectors of C are linear combinations of the (centered) training images, obtained by mapping the eigenvectors of L back into image space:
$$U = AV$$
These eigenvectors ("eigenfaces") represent the variation in the faces.
[slide credit: Alexander Roth]
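To make the pipeline concrete, here is a minimal NumPy sketch of the training steps above (mean face, the small M x M matrix L = AᵀA, and the map back to image space U = AV). Names like train_eigenfaces are illustrative, not from the original slides:

```python
import numpy as np

def train_eigenfaces(faces, k=None):
    """faces: array of shape (M, N, N). Returns mean face, basis U, projections."""
    M = faces.shape[0]
    X = faces.reshape(M, -1).astype(float)   # each row is an N^2-vector
    m = X.mean(axis=0)                       # the average face
    A = (X - m).T                            # N^2 x M matrix of centered faces
    L = A.T @ A                              # small M x M matrix (the trick)
    evals, V = np.linalg.eigh(L)             # eigenvectors of L
    order = np.argsort(evals)[::-1]          # sort by decreasing eigenvalue
    V = V[:, order]
    U = A @ V                                # eigenvectors of C = A A^T, via U = A V
    U /= np.linalg.norm(U, axis=0)           # normalize the eigenfaces
    if k is not None:
        U = U[:, :k]                         # prune to the top k eigenfaces
    omegas = U.T @ A                         # projection of each training face
    return m, U, omegas
```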
Global
Eigenfaces, the algorithm

Compute for each face its projection onto the face space:
$$\Omega_1 = U^T\mathbf{a}_m,\; \Omega_2 = U^T\mathbf{b}_m,\; \Omega_3 = U^T\mathbf{c}_m,\; \Omega_4 = U^T\mathbf{d}_m,\;
\Omega_5 = U^T\mathbf{e}_m,\; \Omega_6 = U^T\mathbf{f}_m,\; \Omega_7 = U^T\mathbf{g}_m,\; \Omega_8 = U^T\mathbf{h}_m.$$
Compute the between-class threshold:
$$\theta = \frac{1}{2}\max_{i,j}\|\Omega_i - \Omega_j\| \quad \text{for } i, j = 1, \ldots, M.$$
[slide credit: Alexander Roth]
Global
Example
[Figure: an example training set, the resulting eigenfaces, and normalized eigenfaces (note: sharper). Photobook, MIT.]
Global
Eigenfaces, the algorithm in use

To recognize a face, subtract the average face from it:
$$\mathbf{r} = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_{N^2} \end{pmatrix}, \qquad
\mathbf{r}_m = \begin{pmatrix} r_1 - m_1 \\ r_2 - m_2 \\ \vdots \\ r_{N^2} - m_{N^2} \end{pmatrix}.$$

Compute its projection onto the face space:
$$\Omega = U^T\mathbf{r}_m$$

Compute the distance in the face space between this face and all known faces:
$$\varepsilon_i^2 = \|\Omega - \Omega_i\|^2 \quad \text{for } i = 1, \ldots, M.$$

Distinguish between (with $\varepsilon$ the distance of the image from the face space):
– If $\varepsilon > \theta$, it’s not a face
– If $\varepsilon \le \theta$ and $\varepsilon_i > \theta$ for all $i = 1, \ldots, M$, it’s a new face
– If $\varepsilon \le \theta$ and $\min_i\{\varepsilon_i\} \le \theta$, it’s a known face

Aside: beyond uses in recognition, eigen “backgrounds” can be very effective for background subtraction.
[slide credit: Alexander Roth]
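A matching sketch of the recognition procedure, continuing from the hypothetical train_eigenfaces above. The slide’s ε is read here as the distance of the image from the face space (its reconstruction error), following Turk & Pentland; theta is the between-class threshold from training:

```python
import numpy as np

def recognize(face, m, U, omegas, theta):
    """Classify one N x N face image against the trained eigenface model."""
    r = face.reshape(-1).astype(float) - m       # subtract the average face
    omega = U.T @ r                              # project onto the face space
    eps = np.linalg.norm(r - U @ omega)          # distance from the face space
    dists = np.linalg.norm(omegas - omega[:, None], axis=0)  # ||omega - omega_i||
    if eps > theta:
        return "not a face"
    if dists.min() > theta:
        return "new face"
    return "known face #%d" % dists.argmin()
```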
Global
Eigenfaces, the algorithm

Problems with eigenfaces – spurious “scatter” from:
– Different illumination
– Different head pose
– Different alignment
– Different facial expression

Fisherfaces may beat eigenfaces:
– Developed in 1997 by P. Belhumeur et al.
– Based on Fisher’s LDA
– Faster than eigenfaces, in some cases
– Has lower error rates
– Works well even under different illumination
– Works well even under different facial expressions
[slide credit: Alexander Roth]
Global/local feature mix (Global-noGeo)

Global works OK and is still used, but local now seems to outperform it.
A recent mix of local and global: use global features to bias local features that have no internal geometric dependencies: Murphy, Torralba & Freeman (03).
[image credit: Kevin Murphy]
Global-noGeo
Use local features to find objects
[Figure: an image patch is convolved with a filter bank; normalized correlation against a Gaussian within the object bounding box; training uses positive (x) and negative (O) examples.]
[image credit: Kevin Murphy]
Global-noGeo
Global feature: back to neural nets – Mixture Density Networks*

Feature used: a steerable pyramid transformation with 4 orientations and 2 scales; the image is divided into a 4x4 grid and the average energy is computed in each channel, yielding 128 features, reduced to 80 by PCA.
Uses “boosted random fields” to learn the graph structure.
[Figure: propagation over iterations, and the final output.]
* C. M. Bishop. Mixture density networks. Technical Report NCRG 4288, Neural Computing Research Group, Department of Computer Science, Aston University, 1994
[slide credit: Kevin Murphy]
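As a rough illustration of this kind of global “gist” feature – a simplified Gabor filter bank standing in for the authors’ steerable pyramid; all names, filter shapes, and the wavelength choice are illustrative assumptions:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor(sigma, theta, size=15):
    """Simple odd Gabor-like oriented filter (illustrative, not a steerable pyramid)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    return (np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
            * np.sin(2 * np.pi * xr / (4 * sigma)))

def gist_features(img, scales=(2, 4), n_orient=4, grid=4):
    """Average filter energy in each cell of a grid x grid partition:
    2 scales x 4 orientations x 16 cells = 128 features, as on the slide."""
    feats = []
    H, W = img.shape
    for s in scales:
        for k in range(n_orient):
            resp = np.abs(convolve2d(img, gabor(s, np.pi * k / n_orient), mode='same'))
            for i in range(grid):
                for j in range(grid):
                    cell = resp[i*H//grid:(i+1)*H//grid, j*W//grid:(j+1)*W//grid]
                    feats.append(cell.mean())
    return np.array(feats)   # PCA down to 80 dimensions would follow, e.g. via SVD
```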
Example of context focus

Global-noGeo
The algorithm knows where to focus for objects
[image credit: Kevin Murphy]
Global-noGeo
Results

Performance is boosted by knowing context
[image credit: Kevin Murphy]
Local-noGeo
Completely Local: Color Histograms

Swain and Ballard ’91 took the normalized r,g,b color histogram of
objects:

and noted its tolerance to 3D rotation, partial occlusion, etc.:
[image credit: Swain & Ballard]
Local-noGeo
Color Histogram Matching

Objects were recognized based on their histogram intersection:
Yielding excellent results over 30 objects:

The problem is, color varies markedly with lighting …
[image credit: Swain & Ballard]
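Histogram intersection is simple enough to sketch directly; a minimal NumPy version, assuming 8-bit RGB images and illustrative function names:

```python
import numpy as np

def color_histogram(img, bins=8):
    """Normalized r,g,b histogram of an image of shape (H, W, 3), values 0..255."""
    h, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins,)*3, range=((0, 256),)*3)
    return h / h.sum()

def intersection(h_image, h_model):
    """Swain & Ballard match score: sum of elementwise minima (1.0 = identical)."""
    return np.minimum(h_image, h_model).sum()
```

A score of 1.0 means identical histograms; partially occluded or rotated views of the same object still score high because the color distribution is largely preserved.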
Local-noGeo
Local Feature Histogram Matching

Schiele and Crowley used derivative-type features instead:

and a probabilistic matching rule for multiple objects:
[image credit: Schiele & Crowley]
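The rule itself did not survive extraction; as a reconstruction (a standard Bayes formulation over independent local measurements, of the kind Schiele & Crowley use), the posterior for object $o_n$ given measurements $m_1, \ldots, m_K$ is

$$p(o_n \mid m_1, \ldots, m_K) = \frac{p(o_n)\prod_{k=1}^{K} p(m_k \mid o_n)}{\sum_j p(o_j)\prod_{k=1}^{K} p(m_k \mid o_j)},$$

where each likelihood $p(m_k \mid o_n)$ is read off object $o_n$’s multidimensional receptive-field histogram.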
Local-noGeo
Local Feature Histogram Results

Again with impressive performance results, and much more tolerance to lighting (30 of 100 objects shown):

The problem is: histograms suffer exponential blow-up with the number of features.
[image credit: Schiele & Crowley]
Local Features

Local features, for example:
– Lowe’s SIFT
– Malik’s Shape Context
– Poggio’s HMAX
– von der Malsburg’s Gabor Jets
– Yokono’s Gaussian Derivative Jets

Adding patches thereof seems to work great, but they are of high dimensionality.

Idea: encode in a hierarchy. We overview some techniques below...
Convolutional Neural Networks
Yann LeCun
Local-Hierarchy

Broke all the HIP codes (Human Interaction Proofs) from Yahoo, MSN, E-Bay …
[image credit: LeCun]
Fragment Based Hierarchy
Shimon Ullman
Local-Hierarchy

Top-down and bottom-up hierarchy: http://www.wisdom.weizmann.ac.il/~vision/research.html
See also Perona’s group’s work on hierarchical feature models of objects: http://www.vision.caltech.edu/html-files/publications.html
[image credit: Ullman et al]
Constellation Model
Perona’s, Bayesian-decision based. From: Rob Fergus http://www.robots.ox.ac.uk/%7Efergus/

[Figure: the shape model. The mean location of each part is indicated by a cross, with the ellipse showing the uncertainty in location. The number by each part is the probability of that part being present. Also shown: the appearance model closest to the mean of the appearance density of each part, feature detector results, and a recognition result.]

See also Perona’s group’s work on hierarchical feature models of objects: http://www.vision.caltech.edu/html-files/publications.html
[image credit: Perona et al]
Local-Hierarchy
Jojic and Frey

Scene description as a hierarchy of sprites
[image credit: Jojic et al]
Local-Hierarchy
Jeff Hawkins, Dileep George

Modular hierarchical spatio-temporal memory
[Figure: the hierarchy and one module; results showing templates, good classifications, and bad classifications; input (D) and output (E).]
[image credit: George, Hawkins]
Peter Bock’s ALISA
An explicit Cognitive Model
Local-Hierarchy
Histogram based
[image credit: Bock et al]
ALISA Labeling 2 Scenes
Local-Hierarchy
[image credit: Bock et al]
Local-Hierarchy
HMAX from the “Standard Model”
Maximilian Riesenhuber and Tomaso Poggio

[Figure: where HMAX sits in the object recognition hierarchy; its basic building blocks; modulated by attention.]
We pick this up momentarily; first, a little on trees and boosting …
[image credit: Riesenhuber et al]
Machine Learning – Many Techniques
Libraries from Intel
Key: optimized / implemented / not implemented

Supervised (focus):
• Physical Models • Boosted decision trees • MART • Influence diagrams • SVM • HMM • Multi-Layer Perceptron • BayesNets: Classification • CART • Logistic Regression • Decision trees • K-NN • Radial Basis • Naïve Bayes • Kalman Filter • ARTMAP • Assoc. Net. • Random Forests • Diagnostic Bayesnet • Bayesnet structure learning • Adaptive Filters

Unsupervised:
• Histogram density est. • Kernel density est. • K-means • Tree distributions • Gaussian Fitting • Dependency Nets • ART • Spectral clustering • Agglomerative clustering • PCA • Kohonen Map • BayesNets: Parameter fitting • Inference

[Diagram axes: modeless vs. model based]
Statistical Learning Library: MLL
Bayesian Networks Library: PNL
Machine Learning

Learn a model/function f: INPUT → OUTPUT that maps input to output.
[Figure: fits of f to data y(x): underfit, just right, overfit.]
Find a function that describes the given data and predicts unknown data.

Example uses of prediction:
- Insurance risk prediction
- Parameters that impact yields
- Gene classification by function
- Topics of a document
...
Specific example: prediction using a decision tree =>
Binary Recursive Decision Trees
Leo Breiman’s “CART”*

At each level:
Find the variable (predictor) and its threshold
– that splits the data into 2 groups
– with maximal purity within each group.
All variables/predictors are considered at every level.

[Figure: a data set of different types, each item containing a vector of “predictors”; maximal-purity splits; underfit vs. overfit fits of f to y(x). Perfect purity is attainable, but…]

* Classification And Regression Tree
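A minimal sketch of the split search described above, using Gini impurity as the purity measure (the slides do not name a specific criterion; Gini is CART’s usual choice; all names are illustrative):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of 0/1 labels; 0 means perfectly pure."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 2 * p * (1 - p)

def best_split(X, y):
    """Search every predictor and threshold; return the split of maximal purity
    (minimal weighted impurity). X: (n_samples, n_predictors), y: 0/1 labels."""
    best = (None, None, np.inf)
    n = len(y)
    for j in range(X.shape[1]):                  # all predictors considered
        for t in np.unique(X[:, j]):             # candidate thresholds
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best   # (predictor index, threshold, impurity)
```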
Binary Recursive Decision Trees
Leo Breiman’s “CART”* (continued)

The same splitting procedure applies at every level, but to avoid overfitting the tree is pruned back using a complexity cost measure.
[Figure: “just right” vs. overfit fits of f to y(x).]
Consider a Face Detector via Decision Stumps

Consider a tree “stump” – just one split. It selects the single most discriminative feature.
For each rectangle combination region:
Find the threshold
– that splits the data into 2 groups (face, non-face)
– with maximal purity within each group.

[Figure: face and non-face data that the features can be tried on; the maximal-purity split, Thresh = N. A bar detector works well for a “nose” face-detecting stump. It doesn’t detect cars.]
See the Appendix for Viola & Jones’s feature generator: integral images.
We use “Boosting” to Select a “Forest of Stumps”
Each stump is a selected feature plus a split threshold.

Gentle Boost:
Given example images $(x_1, y_1), \ldots, (x_n, y_n)$ where $y_i = 0, 1$ for negative and positive examples respectively.
Initialize weights $w_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for training example $i$, where $m$ and $l$ are the number of negatives and positives respectively.
For $t = 1 \ldots T$:
1) Normalize the weights so that $w_t$ is a distribution.
2) For each feature $j$, train a classifier $h_j$ and evaluate its error $\epsilon_j$ with respect to $w_t$.
3) Choose the classifier $h_t$ with the lowest error $\epsilon_t$.
4) Update the weights:
$$w_{t+1,i} = w_{t,i}\,\beta_t^{1 - e_i}, \qquad \beta_t = \frac{\epsilon_t}{1 - \epsilon_t},$$
where $e_i = 0$ if $x_i$ is classified correctly and $e_i = 1$ otherwise.
The final strong classifier is
$$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2}\sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}, \qquad \alpha_t = \log\frac{1}{\beta_t}.$$
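A compact sketch of this loop with decision stumps as the weak classifiers, following the weight-update equations above (illustrative names; a real Viola-Jones trainer would evaluate rectangle features via integral images, see the Appendix):

```python
import numpy as np

def train_stump(X, y, w):
    """Pick the (feature, threshold, polarity) with the lowest weighted error."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = (pol * (X[:, j] - t) > 0).astype(int)   # 1 = positive class
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def boost(X, y, T=10):
    """Boosting as on the slide: y in {0,1}; returns stumps and their alphas."""
    m, l = (y == 0).sum(), (y == 1).sum()
    w = np.where(y == 0, 1.0 / (2 * m), 1.0 / (2 * l))   # initial weights
    stumps, alphas = [], []
    for _ in range(T):
        w = w / w.sum()                                  # normalize to a distribution
        j, t, pol, err = train_stump(X, y, w)
        beta = max(err, 1e-10) / (1 - err)               # beta_t = eps/(1-eps)
        pred = (pol * (X[:, j] - t) > 0).astype(int)
        w = w * beta ** (pred == y)                      # beta^(1-e_i): shrink correct ones
        stumps.append((j, t, pol))
        alphas.append(np.log(1 / beta))
    return stumps, np.array(alphas)

def strong_classify(x, stumps, alphas):
    """h(x) = 1 iff the weighted stump vote exceeds half the total alpha mass."""
    votes = sum(a * int(pol * (x[j] - t) > 0)
                for (j, t, pol), a in zip(stumps, alphas))
    return int(votes >= alphas.sum() / 2)
```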
For efficient calculation, form a Detection Cascade
A boosted cascade is assembled such that at each node, non-object regions stop further processing.

If the detection rate of each node is high (~99.9%), at the cost of a high false-positive rate (say 50% of everything detected as “object”), and if the nodes are independent, then the overall detection and false-positive rates are
$$d = \prod_{i=1}^{n} \text{detect}_i \quad \text{and} \quad f = \prod_{i=1}^{n} \text{falsePos}_i.$$
If so, then for a 20-node cascade we get $d \approx 0.98$ and $f \approx 9.6 \times 10^{-7}$.

Rapid Object Detection using a Boosted Cascade of Simple Features – Viola, Jones (2001)
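A two-line check of the cascade arithmetic, under the slide’s assumed per-node rates:

```python
# 20 independent nodes, per-node detection 0.999 and false-positive rate 0.5:
d = 0.999 ** 20   # overall detection rate      -> ~0.980
f = 0.5 ** 20     # overall false-positive rate -> ~9.5e-7 (the slide rounds to 9.6e-7)
print(d, f)
```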
Improvements to Cascade

J. Wu, J. M. Rehg, and M. D. Mullin do just one boosting round, then select from the feature pool as needed:
[Chart: Viola & Jones vs. Wu, Rehg & Mullin]

Kobi Levi and Yair Weiss just used better features (gradient
histograms) to cut training needs by an order of magnitude.

Let’s focus on better features and descriptors …
[image credit: Wu et al]
The Standard Model of Visual Cortex
Biologically Motivated Features

Thomas Serre, Lior Wolf and Tomaso Poggio used the model of the human visual cortex developed in Riesenhuber’s lab, from bottom to top:
S1 layer: Gabor filters at 4 orientations
C1 layer: local spatial max
Inter layer: dictionary of patches of C1
S2 layer: radial basis fit of each patch template over the whole image
C2 layer: max S2 response (e.g., .8 .4 .9 .2 .6)
Classifier: (SVM, Boosting, …)
[Figure: the first 5 features chosen by boosting.]
[image credit: Serre et al]
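A heavily simplified sketch of the S1 → C1 → S2 → C2 data flow (not the authors’ code; the filter bank, pooling size and RBF width are illustrative assumptions; a filter bank like the gabor() sketch earlier would serve for S1):

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.ndimage import maximum_filter

def s1_c1(img, filters, pool=8):
    """S1: oriented filtering (e.g., Gabors at 4 orientations);
    C1: local spatial max pooling over each S1 response map."""
    return [maximum_filter(np.abs(convolve2d(img, f, mode='same')), size=pool)
            for f in filters]

def c2_feature(c1_maps, patch, sigma=1.0):
    """S2: Gaussian radial-basis match of one stored C1 patch template at
    every position of every C1 map; C2: keep only the best (max) response."""
    k = patch.shape[0]
    best = -np.inf
    for m in c1_maps:
        for i in range(m.shape[0] - k + 1):
            for j in range(m.shape[1] - k + 1):
                d2 = ((m[i:i+k, j:j+k] - patch) ** 2).sum()
                best = max(best, np.exp(-d2 / (2 * sigma ** 2)))
    return best

# The vector fed to the classifier (SVM, boosting, ...) is one C2 value per
# patch in the learned dictionary:
#   features = [c2_feature(maps, p) for p in patch_dictionary]
```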
The Standard Model of Visual Cortex
Biologically Motivated Features

Results in state-of-the-art performance; it seems to handily beat SIFT features:
[image credit: Serre et al]
Yokono’s Generalization of the Standard Model of Visual Cortex

Used Gaussian derivatives: 3 orders x 3 scales x 4 orientations = 36 base features:

Similar to the Standard Model’s Gabor base filters.
[image credit: Yokono et al]
Yokono’s Generalization of the Standard Model of Visual Cortex

Created a local spatial jet, oriented to the gradient at the largest scale at the center pixel:

Since a Gabor filter has ringing spatial extent, this is still approximately similar to the standard model.
[image credit: Yokono et al]
Yokono’s Generalization of the Standard Model of Visual Cortex

Full system:
~S1, C1: features memorized from positive samples at Harris-corner interest points.
~S2: the dictionary of learned features is measured (normalized cross-correlation) against all interest points in the image.
~C2: the maximum normalized cross-correlation scores are arranged in a feature vector.
Classifier: again, SVM, Boosting, …
[image credit: Yokono et al]
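Normalized cross-correlation and the max-score pooling of ~C2 are easy to sketch (illustrative names; patches are assumed to be equal-size NumPy arrays):

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-size patches, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def c2_vector(interest_patches, dictionary):
    """~C2: for each learned feature, the max NCC score over all interest points."""
    return np.array([max(ncc(p, d) for p in interest_patches) for d in dictionary])
```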
Yokono’s Generalization of the Standard Model of Visual Cortex

Excellent results on the CBCL database.
[Figure: ROC curve for 1200 stumps; an SVM with 1 to 5 training images beats other techniques.]
[image credit: Yokono et al]
Yokono’s Generalization of the Standard Model of Visual Cortex

Excellent results:
[Figure: some features chosen; the AIBO dog in articulated poses; ROC curve.]
[image credit: Yokono et al]
Brash Claim

Performance is in the high 90s (%) under lighting, articulation, scale and 3D rotation.
– The classifier inside humans is unlikely to be much more accurate.

We are not that far from raw human-level performance.
– By 2015, I predict.

The base classifier is embedded in a larger system that makes it more reliable:
– Attention
– Color constancy features
– Context
– Temporal filtering
– Sensor fusion
Back to Kevin Murphy: Missing Context
[slide credit: Kevin Murphy]
Missing Context

We know there is a keyboard present in this scene even if we cannot see it clearly.
We know there is no keyboard present in this scene … even if there is one indeed.
[slide credit: Kevin Murphy]
Missing Attention

Change blindness
[Figure: “Farm” and “Truck” change-blindness image pairs.]
Call for a Program:
Generalize Standard Model Even Further
Research Framework

Detect:
– DOG
– Harris Corner
Descriptors (local and global):
– SIFT
– Steerable
– Gabor
Image Level Scoring:
– Histogram
– Max Correlation
– Max Probability
Dictionary:
– All descriptors
– Subset
– Clustered
Classifier:
– SVM
– Boosting
– K-NN …
Embedding:
– Attention, active vision
– Context: scene, 3D inference
– Sensor fusion/association
– Motion
Call for a Program:
Generalize Standard Model Even Further

Ashutosh Saxena, Chung and Ng learned depth from single monocular images using local features in an MRF (similar to Kevin Murphy’s approach).

Ashutosh also has a robot picking up novel objects using local features. Together with active vision, active manipulation, and context – now is a good time for vision systems!
Apply to “Stanley II” and to STAIR. [image credit: Saxena et al]
Summary: Mix local with global
Generalize Standard Model Even Further – the research framework of the previous slides.
Bibliography for this lecture
Papers for this lecture:
1. R. Fergus, P. Perona and A. Zisserman, “Object Class Recognition by Unsupervised Scale-Invariant Learning”, CVPR 03.
2. M. Turk, A. Pentland, “Eigenfaces for Recognition”, Journal of Cognitive Neuroscience, Vol. 3, No. 1, 1991.
3. T. Serre, L. Wolf and T. Poggio, “Object Recognition with Features Inspired by Visual Cortex”, Proc. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Press, San Diego, June 2005.
4. Jerry Jun Yokono and Tomaso Poggio, “Boosting a Biologically Inspired Local Descriptor for Geometry-free Face and Full Multi-view 3D Object Recognition”, AI Memo 2005-023, CBCL Memo 254, July 2005.
5. J. Wu, J. M. Rehg, and M. D. Mullin, “Learning a Rare Event Detection Cascade by Direct Feature Selection”, Proc. Advances in Neural Information Processing Systems 16 (NIPS 2003), MIT Press, 2004.
6. J. Wu, M. D. Mullin, and J. M. Rehg, “Linear Asymmetric Classifier for Face Detection”, International Conference on Machine Learning (ICML 05), pages 993-1000, Bonn, Germany, August 2005.
7. Kobi Levi and Yair Weiss, “Learning Object Detection from a Small Number of Examples: The Importance of Good Features”, International Conference on Computer Vision and Pattern Recognition (CVPR), 2004.
8. P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features”, Proc. CVPR, pages 511-518, 2001.
9. B. Schiele and J. L. Crowley, “Probabilistic object recognition using multidimensional receptive field histograms”, submitted to ICPR'96.
10. M. J. Swain and D. H. Ballard, “Color Indexing”, International Journal of Computer Vision, vol. 7, pp. 11-32, 1991.
11. Antonio Torralba, Kevin Murphy and William Freeman, “Contextual Models for Object Detection using Boosted Random Fields”, NIPS 2004.
12. Kevin Murphy, Antonio Torralba, Daniel Eaton, William Freeman, “Object detection and localization using local and global features”, Sicily workshop on object recognition, 2005.
13. M. Riesenhuber and T. Poggio, “How visual cortex recognizes objects: The tale of the standard model”, The Visual Neurosciences, 2:1640-1653, 2003.
14. A. Saxena, S. H. Chung, A. Y. Ng, “Learning Depth from Single Monocular Images”, NIPS 2005.
Feature set generators
Integral Images – a Feature Set Generator

3 rectangular feature types:
• two-rectangle feature type (horizontal/vertical)
• three-rectangle feature type
• four-rectangle feature type

Using a 24x24 pixel base detection window, with all possible combinations of horizontal and vertical location and scale of these feature types, the full set has 49,396 features.
The motivation for using rectangular features, as opposed to more expressive steerable filters, is their extreme computational efficiency.
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt
ICCV 2001 Workshop on Statistical and Computation Theories of Vision
Define an “Integral Image”

Def: The integral image at location (x,y) is the sum of the pixel values above and to the left of (x,y), inclusive.
Using the following two recurrences, where i(x,y) is the pixel value of the original image at the given location and s(x,y) is the cumulative column sum, we can calculate the integral image representation of the image in a single pass:
s(x,y) = s(x,y-1) + i(x,y)
ii(x,y) = ii(x-1,y) + s(x,y)
[Figure: image coordinates with origin (0,0) at top left, x to the right, y downward; the sum covers everything above and to the left of (x,y).]
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt
ICCV 2001 Workshop on Statistical and Computation Theories of Vision
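A direct transcription of the two recurrences (in NumPy index order img[y, x]; the whole computation is equivalent to img.cumsum(0).cumsum(1)):

```python
import numpy as np

def integral_image(img):
    """Single-pass integral image via the two recurrences on the slide."""
    ii = np.zeros_like(img, dtype=np.int64)
    s = np.zeros_like(img, dtype=np.int64)   # cumulative column sums
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            s[y, x] = (s[y-1, x] if y > 0 else 0) + img[y, x]   # s(x,y) = s(x,y-1) + i(x,y)
            ii[y, x] = (ii[y, x-1] if x > 0 else 0) + s[y, x]   # ii(x,y) = ii(x-1,y) + s(x,y)
    return ii
```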
Allows rapid evaluation of rectangular features

Using the integral image representation, one can compute the value of any rectangular sum in constant time. For example, the integral sum inside rectangle D can be computed as:
ii(4) + ii(1) - ii(2) - ii(3)
As a result, two-, three-, and four-rectangle features can be computed with 6, 8 and 9 array references respectively.
Paul Viola and Michael Jones www.cs.ucsd.edu/classes/fa01/cse291/ViolaJones.ppt
ICCV 2001 Workshop on Statistical and Computation Theories of Vision
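A sketch of the constant-time rectangle sum; corners 1-4 follow the slide’s figure (1 above-left, 2 above-right, 3 below-left, 4 at the bottom-right of rectangle D). On the 4x4 worked example that follows, rect_sum(ii, 1, 1, 3, 2) returns 43:

```python
def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1], read from the integral image as
    ii(4) + ii(1) - ii(2) - ii(3), with corners 1-3 taken just outside the rectangle."""
    a = ii[top-1, left-1] if top > 0 and left > 0 else 0   # corner 1 (above-left)
    b = ii[top-1, right] if top > 0 else 0                 # corner 2 (above-right)
    c = ii[bottom, left-1] if left > 0 else 0              # corner 3 (below-left)
    d = ii[bottom, right]                                  # corner 4 (bottom-right)
    return d + a - b - c
```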
Integral Image Example

Image:
0 8 6 1
1 5 9 0
0 7 5 0
3 8 9 2

Integral Image (filled in left to right, top to bottom; can be calculated in one pass):
0  8 14 15
1 14 29 30
1 21 41 42
4 32 61 64
Integral Image Example

Image:          Integral Image:
0 8 6 1         0  8 14 15
1 5 9 0         1 14 29 30
0 7 5 0         1 21 41 42
3 8 9 2         4 32 61 64

Find the sum of the shaded rectangle (rows 2-4, columns 2-3):
directly: 5+9+7+5+8+9 = 43
via the integral image: 61 + 0 - (14 + 4) = 43