Object Recognition using Local Invariant Features


July 5th 2006
Object Recognition using Local Invariant Features
Claudio Scordino
[email protected]
Object Recognition

Widely used in industry for:
- Inspection
- Registration
- Manipulation
- Robot localization and mapping

Current commercial systems:
- Correlation-based template matching
- Computationally infeasible when object rotation, scale, illumination and 3D pose vary
- Even less feasible with partial occlusion

Alternative: Local Image Features
Local Image Features

Unaffected by:
- Nearby clutter
- Partial occlusion

Invariant to:
- Illumination
- 3D projective transforms
- Common object variations

...but, at the same time, sufficiently distinctive to identify specific objects among many alternatives!
Related work

Line segment, edge and region grouping:
- Detection not good enough for reliable recognition

Peak detection in local image variations:
- Example: Harris corner detector
- Drawback: the image is examined at only a single scale
- Key locations change as the image scale changes

Eigenspace matching, color and receptive field histograms:
- Successful on isolated objects
- Do not extend to cluttered and partially occluded images
SIFT Method

Scale Invariant Feature Transform (SIFT):
- Staged filtering approach
- Identifies stable points (image "keys")
- Computation time less than 2 secs

SIFT Method (2)

Local features:
- Invariant to image translation, scaling and rotation
- Partially invariant to illumination changes and 3D projection (up to 20° of rotation)
- Minimally affected by noise
- Similar properties to neurons in the Inferior Temporal cortex used for object recognition in primate vision
First stage

- Input: original image (512 x 512 pixels)
- Goal: key localization and image description
- Output: SIFT keys
  - Feature vectors describing the local image region, sampled relative to its scale-space coordinate frame

First stage (2)

Description:
- Represents blurred image gradient locations in multiple orientation planes and at multiple scales
- Approach based on a model of cells in the cerebral cortex of mammalian vision
- Less than 1 sec of computation time

Build a pyramid of images:
- Images are difference-of-Gaussian (DOG) functions
- Resampling between each level
Key localization
Algorithm:
1. Expand the original image by a factor of 2 using bilinear interpolation
2. For each pyramid level:
   - Smooth the input image through a convolution with the 1D Gaussian function (horizontal direction):

     g(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-x^2 / (2\sigma^2)}

     with \sigma = \sqrt{2}, obtaining Image A

Key localization (2)

   - Smooth Image A through a further convolution with the 1D Gaussian function (vertical direction), obtaining Image B
3. The DOG image of this level is B - A
4. Resample Image B using bilinear interpolation with pixel spacing 1.5 in each direction and use the result as the input image of the next pyramid level
   - Each new sample is a constant linear combination of 4 adjacent pixels (a sketch of one pyramid level follows below)
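As an illustration of these steps (a minimal sketch, not the original implementation), the C code below builds one DoG pyramid level: it smooths the input horizontally with a 1D Gaussian of σ = √2 (Image A), smooths A vertically (Image B), and stores B - A as the DoG image. The Image struct, the kernel truncation radius, the border handling and the function names are assumptions made for this example; the 1.5-pixel resampling is only noted in a comment.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

typedef struct { int w, h; float *px; } Image;   /* simple row-major grayscale image */

#define KRADIUS 4                                /* kernel truncation radius (assumption) */

/* Fill a 1D Gaussian kernel g(x) = exp(-x^2 / (2*sigma^2)) / (sqrt(2*pi)*sigma),
 * then normalize it so the weights sum to 1.                                    */
static void make_kernel(float *k, double sigma) {
    double sum = 0.0;
    for (int x = -KRADIUS; x <= KRADIUS; x++) {
        k[x + KRADIUS] = (float)(exp(-(double)(x * x) / (2.0 * sigma * sigma))
                                 / (sqrt(2.0 * M_PI) * sigma));
        sum += k[x + KRADIUS];
    }
    for (int i = 0; i < 2 * KRADIUS + 1; i++)
        k[i] /= (float)sum;
}

/* Convolve along one axis; (dx,dy) selects horizontal (1,0) or vertical (0,1). */
static void convolve_1d(const Image *src, Image *dst, const float *k, int dx, int dy) {
    for (int y = 0; y < src->h; y++)
        for (int x = 0; x < src->w; x++) {
            float acc = 0.0f;
            for (int t = -KRADIUS; t <= KRADIUS; t++) {
                int xx = x + t * dx, yy = y + t * dy;
                if (xx < 0) xx = 0;
                if (xx >= src->w) xx = src->w - 1;   /* clamp at the image border */
                if (yy < 0) yy = 0;
                if (yy >= src->h) yy = src->h - 1;
                acc += k[t + KRADIUS] * src->px[yy * src->w + xx];
            }
            dst->px[y * dst->w + x] = acc;
        }
}

/* One pyramid level: A = horizontal blur of the input, B = vertical blur of A,
 * DoG = B - A.  Image B would then be resampled with 1.5-pixel spacing (bilinear)
 * to become the input image of the next level.                                   */
void build_dog_level(const Image *in, Image *A, Image *B, Image *dog) {
    float k[2 * KRADIUS + 1];
    make_kernel(k, sqrt(2.0));
    convolve_1d(in, A, k, 1, 0);   /* horizontal pass -> Image A */
    convolve_1d(A,  B, k, 0, 1);   /* vertical pass   -> Image B */
    for (int i = 0; i < in->w * in->h; i++)
        dog->px[i] = B->px[i] - A->px[i];
}
```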
Key localization (3)

Find the maxima and minima of the DOG images (illustrated on the slide for the 1st and 2nd pyramid levels).
Key orientation
Extract image gradients and orientations at each pyramid level. For each pixel A_ij:

1. Compute the image gradient magnitude and orientation:

   M_{ij} = \sqrt{(A_{ij} - A_{i+1,j})^2 + (A_{ij} - A_{i,j+1})^2}
   R_{ij} = \operatorname{arctan2}(A_{ij} - A_{i+1,j},\, A_{i,j+1} - A_{ij})

2. Threshold M_ij at a value of 0.1 times the maximum possible gradient value
   - Provides robustness to illumination changes
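A minimal C sketch of these per-pixel formulas (the row-major image layout and the function name are assumptions made for the example):

```c
#include <math.h>

/* Gradient magnitude and orientation from simple pixel differences, following the
 * formulas above.  A is a row-major grayscale image of width w; pixel (i, j) must
 * stay one pixel inside the border so its lower and right neighbours exist.       */
void gradient_at(const float *A, int w, int i, int j, float *magnitude, float *orientation) {
    float di = A[i * w + j] - A[(i + 1) * w + j];   /* A_ij - A_{i+1,j} */
    float dj = A[i * w + j] - A[i * w + (j + 1)];   /* A_ij - A_{i,j+1} */
    *magnitude   = sqrtf(di * di + dj * dj);
    *orientation = atan2f(di, -dj);                 /* arctan2(A_ij - A_{i+1,j}, A_{i,j+1} - A_ij) */
    /* M_ij is subsequently thresholded at 0.1 times the maximum possible gradient value. */
}
```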
Key orientation (2)
3. Create an orientation histogram using a circular Gaussian-weighted window with σ = 3 times the current smoothing scale
   - The weights are multiplied by M_ij
   - The histogram is smoothed prior to peak selection
   - The orientation is determined by the peak in the histogram (see the sketch below)
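A rough sketch of this histogram step, assuming precomputed magnitude and orientation maps, a 36-bin histogram and a square window of radius 3σ. The bin count, window shape and function name are assumptions; the slide only specifies the Gaussian weighting with σ = 3 times the current smoothing scale.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define NBINS 36   /* histogram resolution: an assumption for this sketch */

/* Pick a key orientation from a Gaussian-weighted histogram of local gradient
 * orientations.  M and R are precomputed magnitude / orientation maps of size
 * w x h (see the previous sketch); sigma is 3 times the current smoothing scale. */
float key_orientation(const float *M, const float *R, int w, int h,
                      int ki, int kj, float sigma) {
    float hist[NBINS] = {0.0f};
    int r = (int)(3.0f * sigma);                       /* window radius (assumption) */
    for (int i = ki - r; i <= ki + r; i++)
        for (int j = kj - r; j <= kj + r; j++) {
            if (i < 0 || i >= h || j < 0 || j >= w) continue;
            float di = (float)(i - ki), dj = (float)(j - kj);
            float weight = expf(-(di * di + dj * dj) / (2.0f * sigma * sigma));
            int bin = (int)((R[i * w + j] + (float)M_PI) / (2.0f * (float)M_PI) * NBINS) % NBINS;
            hist[bin] += weight * M[i * w + j];        /* Gaussian weight times M_ij */
        }
    /* A full implementation would smooth the histogram before picking the peak. */
    int best = 0;
    for (int b = 1; b < NBINS; b++)
        if (hist[b] > hist[best]) best = b;
    return (best + 0.5f) * 2.0f * (float)M_PI / NBINS - (float)M_PI;   /* bin centre */
}
```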
Experimental results
Figure on the slide: the original image and the keys found after rotation (15°), scaling (90%), horizontal stretching (110%), change of brightness (-10%) and contrast (90%), and addition of pixel noise; 78% of the keys match.
Experimental results (2)
Image transformation           Location and scale match   Orientation match
Decrease contrast by 1.2       89.0 %                     86.6 %
Decrease intensity by 0.2      88.5 %                     85.9 %
Rotate by 20°                  85.4 %                     81.0 %
Scale by 0.7                   85.1 %                     80.3 %
Stretch by 1.2                 83.5 %                     76.1 %
Stretch by 1.5                 77.7 %                     65.0 %
Add 10% pixel noise            90.3 %                     88.4 %
All previous                   78.6 %                     71.8 %

(20 different images, around 15,000 keys)
Image description
Approach suggested by the response properties of complex neurons in the visual cortex:
- A feature position is allowed to vary over a small region, while orientation and spatial frequency are maintained

The image is described through 8 orientation planes:
- Keys are inserted according to their orientations (illustrated below)
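As a small illustration of how a gradient sample can be routed to one of the 8 orientation planes (the quantization and accumulation scheme shown here is an assumption, not taken from the talk; only the count of 8 planes comes from the slide):

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Accumulate one gradient sample into the 8 orientation planes: planes is an
 * array of 8 row-major images of width w, magnitude and angle (in (-pi, pi])
 * come from the gradient computation sketched earlier.                        */
void insert_sample(float *planes[8], int w, int i, int j, float magnitude, float angle) {
    int p = (int)((angle + (float)M_PI) / (2.0f * (float)M_PI) * 8.0f) % 8;   /* plane 0..7 */
    planes[p][i * w + j] += magnitude;
}
```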
Second stage
Goal: identify candidate object matches
- The best candidate match is the nearest neighbour (i.e., minimum Euclidean distance between descriptor vectors)
- The exact solution for high-dimensional vectors is known to have high complexity
Second stage (2)
Algorithm: approximate Best-Bin-First (BBF) search method (Beis and Lowe)
- Modification of the k-d tree algorithm
- Identifies the nearest neighbours with high probability and little computation
- The keys generated at the larger scale are given twice the weight of those at the smaller scale
  - Improves recognition by giving more weight to the least noisy scale (a brute-force reference sketch follows below)
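For illustration (not the BBF code itself), the sketch below shows the exact nearest-neighbour search that BBF approximates: a brute-force scan for the minimum Euclidean distance between descriptor vectors. The descriptor length of 128 and the function names are assumptions for this example; BBF would replace the scan with an approximate k-d tree traversal.

```c
#include <float.h>

#define DESC_DIM 128   /* descriptor length: an assumption for this sketch */

/* Squared Euclidean distance between two descriptor vectors. */
static float dist2(const float *a, const float *b) {
    float d = 0.0f;
    for (int k = 0; k < DESC_DIM; k++) {
        float diff = a[k] - b[k];
        d += diff * diff;
    }
    return d;
}

/* Exact nearest neighbour by exhaustive scan over n model keys; the BBF k-d tree
 * search returns the same index with high probability at a fraction of the cost. */
int nearest_key(const float *query, const float (*model)[DESC_DIM], int n) {
    int best = -1;
    float best_d = FLT_MAX;
    for (int i = 0; i < n; i++) {
        float d = dist2(query, model[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```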
Third stage

Description: final verification

Algorithm: low-residual least-squares fit
- Solution of a linear system: x = [A^T A]^{-1} A^T b (a sketch is given below)
- When at least 3 keys agree with low residual, there is strong evidence for the presence of the object
- Since there are dozens of keys in the image, this also works with partial occlusion

Example results shown on the slide: recognition under perspective projection and under partial occlusion.
Computation time: 1.5 secs on a Sun Sparc 10 (0.9 secs for the first stage)
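A minimal sketch of the least-squares step, assuming the unknowns x are the parameters of a 6-parameter (e.g. affine) model and that A and b stack one row per matched key; the parameter count, matrix layout and function name are assumptions. It forms the normal equations A^T A x = A^T b and solves them by Gaussian elimination, which is algebraically equivalent to x = [A^T A]^{-1} A^T b.

```c
#include <math.h>

#define P 6   /* number of model parameters, e.g. an affine transform (assumption) */

/* Solve the normal equations (A^T A) x = A^T b for x, where A is m x P (row-major)
 * and b is m x 1; uses Gaussian elimination with partial pivoting.  Returns 0 on
 * success, -1 if the system is singular.                                           */
int least_squares_fit(const double *A, const double *b, int m, double *x) {
    double N[P][P + 1] = {{0.0}};            /* augmented matrix [A^T A | A^T b] */
    for (int r = 0; r < m; r++)
        for (int i = 0; i < P; i++) {
            for (int j = 0; j < P; j++) N[i][j] += A[r * P + i] * A[r * P + j];
            N[i][P] += A[r * P + i] * b[r];
        }
    for (int col = 0; col < P; col++) {      /* forward elimination with pivoting */
        int piv = col;
        for (int r = col + 1; r < P; r++)
            if (fabs(N[r][col]) > fabs(N[piv][col])) piv = r;
        if (fabs(N[piv][col]) < 1e-12) return -1;
        for (int c = 0; c <= P; c++) { double t = N[col][c]; N[col][c] = N[piv][c]; N[piv][c] = t; }
        for (int r = col + 1; r < P; r++) {
            double f = N[r][col] / N[col][col];
            for (int c = col; c <= P; c++) N[r][c] -= f * N[col][c];
        }
    }
    for (int i = P - 1; i >= 0; i--) {       /* back substitution */
        double s = N[i][P];
        for (int j = i + 1; j < P; j++) s -= N[i][j] * x[j];
        x[i] = s / N[i][i];
    }
    return 0;
}
```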
Connections to human vision

- The performance of human vision is obviously far superior to that of current computer vision...
- The brain uses a highly computation-intensive parallel process instead of a staged filtering approach

Connections to human vision (2)

However... the results are much the same.
Recent research in neuroscience has shown that the neurons of the Inferior Temporal cortex:
- Recognize shape features
- The complexity of the features is roughly the same as for SIFT
- Also recognize color and texture properties in addition to shape

Further research:
- 3D structure of objects
- Additional feature types for color and texture
Augmented Reality (AR)
Registration of virtual objects into a live video sequence

Current AR systems:
- Rely on markers strategically placed in the environment
- Need manual camera calibration
Related work
Harris corner detector and Kanade-Lucas-Tomasi (KLT) tracker:
- Not enough feature invariance

Tracking of parallelogram-shaped and elliptical image regions:
- Requires planar structures in the viewed scene

Pre-built, user-supplied CAD object models:
- Not always available
- Limited to objects that can be easily modelled

Off-line batch processing of the entire video
AR using SIFT
Flexible automated AR

Not needed:
- Camera pre-calibration
- Prior knowledge of scene geometry
- Manual initialization of the tracker
- Placement of special markers
- Special tools or equipment (just a camera)

Short time and small effort to set up
Robust 6-degrees-of-freedom tracking
AR using SIFT (2)
Only a set of reference images taken by a handheld, uncalibrated camera from arbitrary viewpoints is needed:
- Acquired from unknown, spatially separated viewpoints
- At least two images
  - 5 to 20 images separated by at most 45°
- Used to build a 3D model of the viewed scene
AR using SIFT (3)
First (off-line) stage:
1. Extract SIFT features from the reference images
2. Establish multi-view correspondences
3. Build a metric model of the real world
4. Compute calibration parameters and camera poses
5. The user places the virtual object:
   - The placement is achieved by anchoring the object projection in the first image
   - Then, a second projection is adjusted in the second image
   - Finally, the user fine-tunes position, orientation and size
AR using SIFT (4)
Second (on-line) stage:
1. Features are detected in the current frame
2. Features are matched to those of the model using the BBF algorithm
3. The matches are used to compute the current pose of the camera
4. The solution is stabilized by using the values computed for the previous frame
AR using SIFT: prototype
Software:
- C programming language
- OpenGL and GLUT libraries

Hardware:
- IBM ThinkPad with a Pentium 4-M processor (1.8 GHz)
- Logitech QuickCam Pro 4000 camera
- 4 FPS

Operation                     Computation time
Feature extraction            150 msec
Feature matching              40 msec
Camera pose computation       25 msec
AR using SIFT: drawbacks
The tracker is very slow:
- 4 FPS (frames per second)
- Too slow for real-time operation (25 FPS)
- The main bottleneck is feature extraction

Unable to handle occlusion of the inserted virtual content by real objects:
- A full model of the observed scene would be required
AR using SIFT: examples
Videos (shown during the talk):
- mug
- tabletop
Conclusions
Object recognition using SIFT:
- Reliable recognition
- Several characteristics in common with human vision

Augmented reality using SIFT:
- Very flexible
- Not yet possible in real time due to the high computation times
- Possible in the future with faster processors
References

- David G. Lowe, "Object recognition from local scale-invariant features", International Conference on Computer Vision, Corfu, Greece (September 1999), pp. 1150-1157
- Stephen Se, David G. Lowe and Jim Little, "Vision-based mobile robot localization and mapping using scale-invariant features", Proceedings of the IEEE International Conference on Robotics and Automation, Seoul, Korea (May 2001), pp. 2051-2058
- Iryna Gordon and David G. Lowe, "Scene modelling, recognition and tracking with invariant image features", International Symposium on Mixed and Augmented Reality (ISMAR), Arlington, VA (Nov. 2004), pp. 110-119

For any questions...
David Lowe
Computer Science Department
2366 Main Mall
University of British Columbia
Vancouver, B.C., V6T 1Z4, Canada
E-mail: [email protected]