
Bags Of Features
A presentation by Shahar Libay
Image Processing Seminar
IDC Herzliya 2008
Introduction



The field of computer vision, a natural extension of image processing, deals with understanding the contents of an image.
An image contains a large amount of highly variable and usually noisy data, which makes it difficult to understand.
The solution: reduce the amount of data, keeping only the information that gives a good indication of the contents.
Feature-Based Identification




Features are intelligently selected components of an image.
The two most important characteristics of a feature are stability and distinction.
Stability – a feature that identifies an object should appear in all (or most) images of that object.
Distinction – a feature that identifies an object should not appear in images that do not contain that object.
Image copyright © :
L. Fei-Fei (Princeton),
R. Fergus (MIT),
A. Torralba (MIT)
Distinctive Image Features from
Scale-Invariant Keypoints
David G. Lowe
Computer Science Department
University of British Columbia
January 5, 2004
Introduction to the Algorithm



The algorithm extracts features nicknamed SIFTs (Scale-Invariant Feature Transforms).
SIFTs were first introduced in an article by Lowe from 1999.
That article has been highly cited since then, and a patent was issued for the algorithm.
The produced features are:






Scale invariant.
Rotation invariant.
Stable.
Distinctive.
Resistant to affine distortion, 3D camera rotation, noise and
illumination changes.
Plentiful (we want as many stable features as possible for object detection/recognition).
Stage 1 – Building the Scale-Space


We begin by searching for dominant keypoints at different scales.
The scale space of an image is a function L, defined as the convolution of a Gaussian G with the original image I:
L  x, y,   G  x, y,   I  x, y 
G  x, y,  


1
2
2
e
x2  y 2
2 2
By using the Difference-of-Gaussians function D (the difference between Gaussians of different variance) and scanning it for local extrema, we get a (large) list of keypoints to explore:
D  x, y,    G  x, y, k   G  x, y,     I  x, y   L  x, y, k   L  x, y, 

Note that the rest of the algorithm will only scan the found
keypoints and not every pixel at every scale!
Stage 1 – Building the Scale-Space – cont’d




For each octave we build a stack of s images convolved with Gaussians of increasing variance (the L functions from the previous slide, with sigmas growing by a factor of k = 2^(1/s)).
Every two adjacent Gaussian-blurred images are subtracted to produce the D functions.
After building each octave, we take the image with twice the octave's starting sigma (2 images from the top of the stack), subsample it by taking every second pixel, and use it as the basis of the next octave.
The initial image of the first octave is not the original image but a version upscaled by a factor of 2, to increase the number of found keypoints.
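
To make the construction concrete, here is a minimal Python sketch of building one octave of the DoG stack (a hedged illustration, not Lowe's exact implementation: s + 3 blurred images per octave so that s scales can be searched for extrema, each blur applied directly to the octave's base image rather than incrementally; the function name is ours):

import numpy as np
from scipy.ndimage import gaussian_filter

def build_dog_octave(base, s=3, sigma=1.6):
    # Blur the base image with sigmas growing by k = 2**(1/s), then subtract
    # adjacent blurred images to get the Difference-of-Gaussians stack.
    k = 2.0 ** (1.0 / s)
    gaussians = [gaussian_filter(base.astype(float), sigma * k ** i)
                 for i in range(s + 3)]
    dogs = np.stack([g2 - g1 for g1, g2 in zip(gaussians, gaussians[1:])])
    # The image with twice the octave's starting sigma sits 2 from the top of
    # the stack; taking every second pixel of it seeds the next octave.
    next_base = gaussians[-3][::2, ::2]
    return dogs, next_base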
Stage 2 – Finding the Local Extrema


Each pixel in each Difference-of-Gaussians image is compared with its 8 neighbors in the current scale and the 9 neighbors in each of the scales above and below it (26 in total). If it is larger or smaller than all of them, it is a suspected keypoint.
It was found empirically that using 3 scales per octave and setting the smoothing sigma to 1.6 produces optimal results in terms of accuracy vs. computation time.
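
A short sketch of that 26-neighbor test on a DoG stack indexed as dogs[scale, y, x] (illustrative names; border handling omitted):

import numpy as np

def is_suspected_keypoint(dogs, s, y, x):
    # Compare the sample with its 8 neighbors in the same scale and the
    # 9 neighbors in the scales above and below (26 neighbors in total).
    value = dogs[s, y, x]
    cube = dogs[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]
    is_max = value >= cube.max() and (cube == value).sum() == 1
    is_min = value <= cube.min() and (cube == value).sum() == 1
    return is_max or is_min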
Stage 3 – Removing Unstable Keypoints



Keypoints with low contrast tend to be sensitive to noise and need to be discarded (since they are too similar to their neighborhood).
Using a method suggested by Brown and Lowe in 2002 (which we will not get into), we can use the Taylor expansion of D(x, y, sigma) around the sample point to find the extremum offset t and evaluate the function there: D(t) = D + 0.5 * (dD/dx)^T * t, where x and t are 3D vectors over (x, y, sigma).
Any keypoint with |D(t)| < 0.03 is rejected (D's derivatives are approximated from pixel values at the sample point, assuming intensities normalized to [0, 1]).
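
A minimal sketch of this contrast test, with the derivatives of D approximated by central differences on the DoG stack (a hedged illustration; names are ours):

import numpy as np

def contrast_at_extremum(dogs, s, y, x):
    # D(t) = D + 0.5 * (dD/dx)^T * t, with t = -inv(H) * dD/dx.
    D = dogs[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2].astype(float)
    g = 0.5 * np.array([D[2, 1, 1] - D[0, 1, 1],     # d/d(sigma)
                        D[1, 2, 1] - D[1, 0, 1],     # d/dy
                        D[1, 1, 2] - D[1, 1, 0]])    # d/dx
    H = np.empty((3, 3))
    H[0, 0] = D[2, 1, 1] - 2 * D[1, 1, 1] + D[0, 1, 1]
    H[1, 1] = D[1, 2, 1] - 2 * D[1, 1, 1] + D[1, 0, 1]
    H[2, 2] = D[1, 1, 2] - 2 * D[1, 1, 1] + D[1, 1, 0]
    H[0, 1] = H[1, 0] = 0.25 * (D[2, 2, 1] - D[2, 0, 1] - D[0, 2, 1] + D[0, 0, 1])
    H[0, 2] = H[2, 0] = 0.25 * (D[2, 1, 2] - D[2, 1, 0] - D[0, 1, 2] + D[0, 1, 0])
    H[1, 2] = H[2, 1] = 0.25 * (D[1, 2, 2] - D[1, 2, 0] - D[1, 0, 2] + D[1, 0, 0])
    t = -np.linalg.solve(H, g)            # offset of the extremum
    return D[1, 1, 1] + 0.5 * g.dot(t)    # D evaluated at the extremum

# Reject the keypoint if abs(contrast_at_extremum(dogs, s, y, x)) < 0.03.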
Stage 4 – Removing Edge Responses



The DoG function has strong responses along edges (since it filters out low frequencies), even where the location along the edge is poorly determined.
Such a response has a large principal curvature across the edge but a small one along it.
We look at the 2x2 Hessian (second-derivative) matrix H = [Dxx Dxy; Dxy Dyy] (computed from pixel values at the keypoint's location and scale). Its eigenvalues are proportional to the principal curvatures of D. We want the ratio between the largest (a) and smallest (b) eigenvalues of H to be smaller than a given threshold.
Stage 4 – Removing Edge Responses cont’d

There is no need to compute the eigenvalues! We can use the trace and determinant of H, since we are only interested in the ratio r (a = r*b):
Tr(H) = Dxx + Dyy = a + b
Det(H) = Dxx*Dyy - (Dxy)^2 = a*b
Tr(H)^2 / Det(H) = (a + b)^2 / (a*b) = (r + 1)^2 / r
Choosing (empirically) r = 10, we require Tr(H)^2 / Det(H) < 11^2 / 10; otherwise the keypoint is discarded.
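
The whole test fits in a few lines; a sketch using central differences on the DoG image at the keypoint's scale (r = 10 as above, names are illustrative):

def passes_edge_test(dog, y, x, r=10.0):
    # Keep the keypoint only if Tr(H)^2 / Det(H) < (r + 1)^2 / r.
    dxx = dog[y, x + 1] - 2 * dog[y, x] + dog[y, x - 1]
    dyy = dog[y + 1, x] - 2 * dog[y, x] + dog[y - 1, x]
    dxy = 0.25 * (dog[y + 1, x + 1] - dog[y + 1, x - 1]
                  - dog[y - 1, x + 1] + dog[y - 1, x - 1])
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:          # curvatures of opposite signs: reject outright
        return False
    return tr * tr / det < (r + 1) ** 2 / r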
Stage 5 – Orientation Assignment


So far, our filtered keypoints have a location and a scale (derived from the scale at which they were found). To achieve rotation invariance, we must also assign a direction to each keypoint.
Each blurred image L has its gradient orientation (theta) and magnitude (m) pre-computed:




m(x,y)=sqrt((L(x+1,y)-L(x-1,y))^2+(L(x,y+1)-L(x,y-1))^2)
theta(x,y)=arctan((L(x,y+1)-L(x,y-1))/(L(x+1,y)-L(x-1,y)))
For each keypoint, a surrounding window is selected and a histogram of gradient directions is built over it. The histogram has 36 bins of 10 degrees each, and each sample is added to its bin weighted by its gradient magnitude and by its distance from the keypoint (a Gaussian with sigma 1.5 times the keypoint's scale).
The keypoint's orientation is the highest peak of the histogram.
For every additional peak within 80% of the maximum, the keypoint is duplicated with that direction.
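
A hedged sketch of the orientation histogram described above (36 bins of 10 degrees, Gaussian weighting with sigma = 1.5 * scale; the window radius and names are our choices):

import numpy as np

def keypoint_orientations(L, y, x, scale):
    # Build a 36-bin gradient-orientation histogram around (y, x), weighting
    # each sample by its gradient magnitude and a Gaussian of its distance.
    sigma = 1.5 * scale
    radius = int(round(3 * sigma))
    hist = np.zeros(36)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            if not (0 < yy < L.shape[0] - 1 and 0 < xx < L.shape[1] - 1):
                continue
            gx = L[yy, xx + 1] - L[yy, xx - 1]
            gy = L[yy + 1, xx] - L[yy - 1, xx]
            m = np.hypot(gx, gy)
            theta = np.degrees(np.arctan2(gy, gx)) % 360.0
            w = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            hist[int(theta // 10) % 36] += w * m
    # Dominant direction plus every peak within 80% of the maximum.
    return [b * 10 + 5 for b in range(36) if hist[b] >= 0.8 * hist.max()]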
Found Keypoints – An Example
Stage 6 – Computing the Descriptor





Each remaining keypoint is assigned a descriptor that characterizes the behavior of its surrounding area (whose size depends on the keypoint's scale).
The pre-computed gradients (and their locations) are rotated relative to the keypoint's direction.
Each keypoint is assigned a 4x4 array of histograms with 8 bins each (the illustration shows 2x2), built from the gradient magnitudes weighted by a Gaussian whose sigma is half the width of the sampled area.
The 128 elements of the feature vector are normalized to unit length to resist (linear) illumination changes.
To better resist non-linear illumination changes and 3D viewpoint effects, every value above 0.2 is clamped to 0.2 and the feature vector is re-normalized.
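
A small sketch of that final normalization step on the 128-element vector, following the description above:

import numpy as np

def normalize_descriptor(vec, clamp=0.2):
    # Unit-normalize (linear illumination invariance), clamp large entries
    # so single large gradients do not dominate, then re-normalize.
    v = np.asarray(vec, dtype=float)
    v = v / (np.linalg.norm(v) + 1e-12)
    v = np.minimum(v, clamp)
    return v / (np.linalg.norm(v) + 1e-12)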
Object Detection





First, a database of features is built from a training set.
The input image is processed for features as described above.
The nearest neighbor (by Euclidean distance) of each feature vector is located in the database. Its distance is compared with the distance to the second-closest neighbor (the closest neighbor belonging to a different object), and matches with a ratio greater than 0.8 are rejected (the feature is roughly the same distance from two different objects). Several algorithms exist for fast approximate nearest-neighbor search (k-d trees, etc.).
We need to find a group (cluster) of at least 3 features belonging to the same object to indicate its presence in the image (and its pose/orientation etc.). This can be done with a Hough transform over the keypoint parameters (location, scale and orientation).
Lastly, the found features undergo geometric verification: the affine transformation between the object in the database and in the image is estimated, and the features' locations are checked against it. A more general solution (projective transformations) may require 7 features and solving for the fundamental matrix.
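
A brute-force sketch of the nearest-neighbor ratio test (for brevity the plain second-closest neighbor stands in for the closest neighbor of a different object; in practice an approximate-NN structure such as a k-d tree replaces the linear scan):

import numpy as np

def match_features(query_descriptors, db_descriptors, ratio=0.8):
    # Keep a match only if the closest database descriptor is clearly
    # closer (by the ratio test) than the second-closest one.
    matches = []
    for qi, q in enumerate(query_descriptors):
        dists = np.linalg.norm(db_descriptors - q, axis=1)
        first, second = np.argsort(dists)[:2]
        if dists[first] < ratio * dists[second]:
            matches.append((qi, int(first)))
    return matches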
Examples
Object Class Recognition by Unsupervised Scale-Invariant Learning
R. Fergus (1), P. Perona (2) and A. Zisserman (1)
(1) Dept. of Eng. Science, University of Oxford
(2) Dept. of Electrical Eng., California Institute of Technology
2003
Motivation



Class recognition is an important field in computer vision which enables many applications such as face detection.
We want to model a class by a set of features (invariant to scale) and the relations between them.
We also want the learning phase to be unsupervised, that is, we want the computer to find out on its own which features represent which class.
The model


As with the Bayesian classifiers seen in earlier lectures, we want to record the following properties of a feature:
A – Appearance: the actual contents of the feature. Obtained by "cutting" the window in which the feature appears out of the image, rescaling it to a standard size (11x11) and running PCA to reduce the number of dimensions (for computation's sake).
X – Shape: the coordinates of the center of the ellipse representing the feature.
S – Scale: the size of the above ellipse.
The P parts that represent a single object are "glued" together using the hypothesis h – a P-dimensional vector that assigns each part a number from 1 to N (the "index" of the feature in the image, or 0 if it was not found, e.g. it is occluded). We denote by d the occlusion vector (1 = part was found, 0 = otherwise), f = sum(d) the number of foreground features, and n = N - f the number of background features (under the current hypothesis).
The model – cont’d

Again, as with the Bayesian classifier, we need to evaluate the Bayesian decision ratio R, where theta denotes the (unknown) object parameters and theta_bg the background parameters:

The likelihood can also be broken into manageable pieces (appearance, shape, scale and occlusion terms, summed over the hypotheses):

Note that |H| = O(N^P) (but typically N <= 30, P <= 7)
Appearance

We assume that every appearance is a vector in some appearance space, and that each part p has a Gaussian density in that space with parameters {c, V} (mean and covariance matrix), one pair for the foreground and one for the background. Since this is a high-dimensional space, we also assume V is diagonal (the appearance components are independent). Therefore:

If the part is not present in the hypothesis (d_p = 0) we get a ratio (probability) of 1. Otherwise we measure how likely the feature is to be foreground compared to background.
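
A hedged sketch of this per-part appearance term as described above: a ratio of diagonal-covariance Gaussian densities for the foreground and background models, with occluded parts contributing a factor of 1 (parameter names c_fg, V_fg, c_bg, V_bg are illustrative):

import numpy as np

def diag_gaussian_pdf(a, mean, var):
    # Density of a Gaussian with diagonal covariance at the point a.
    a, mean, var = (np.asarray(v, dtype=float) for v in (a, mean, var))
    z = (a - mean) ** 2 / var
    return np.exp(-0.5 * z.sum()) / np.sqrt((2 * np.pi * var).prod())

def appearance_ratio(a, d_p, c_fg, V_fg, c_bg, V_bg):
    # Occluded parts (d_p == 0) contribute 1; otherwise compare how likely
    # the appearance is under the foreground model versus the background.
    if d_p == 0:
        return 1.0
    return diag_gaussian_pdf(a, c_fg, V_fg) / diag_gaussian_pdf(a, c_bg, V_bg)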
Shape

We treat the shape (i.e., the locations) of all visible parts as a joint Gaussian density in a scale-invariant space (normalized using the scale parameter). Again this Gaussian has mean mu and covariance sigma, but this time sigma is not diagonal (the locations of the parts are not independent of one another!). All non-visible parts are treated as background, and background detections are uniformly distributed over the area alpha of the image:
Scale

Again, a Gaussian (Normal) distribution of independent
variables with parameters {t,U}. The background
(invisible) features are assumed to have a uniform
distribution over range r:
Hypothesis

The hypothesis term is based on a Poisson distribution (modeling the number of background detections as events in a Poisson process over the image area). It also incorporates a counting term (for bookkeeping's sake) and a probability table P(d|theta) of size 2^P, which is itself a parameter to be estimated (the probability of each occlusion pattern):
Components Summary


The majority of
hypotheses will score low.
However, we hope that
some of them will score
high and help us learn!
Feature Extraction





Done using an algorithm by Kadir and Brady.
It finds circular regions that are salient in both location and scale.
For each image point a histogram P(I) of the intensities within radius s (the scale) is built.
The entropy H(s) of the histogram is examined (its local maxima are candidate scales).
The candidates are then weighted by the rate of change of the histogram with scale (dP/ds), and the N most salient regions are returned.
Learning & Recognition

We need to estimate the parameters of the model:

The idea is to find the set of parameters T_ml that maximizes the likelihood: T_ml = argmax{P(X, S, A | theta)}.
We use the EM estimator shown in earlier lectures. The estimator starts with a random set of parameters and iteratively computes statistics for them (E-step) and updates them (M-step) until they converge to a (possibly local) maximum.
There are O(N^P) hypotheses, which need to be scanned by the EM at each iteration. This leads to slow learning times of 24-36 hours (for N <= 30 and P <= 7) even when using efficient search algorithms over the hypotheses.


Learning & Recognition – cont’d


The reliability and consistency of the model are high enough that no validation set was needed.
Recognition is done simply by extracting features from the image, feeding them to the trained model, and checking whether R is above a given threshold.
Video Google: A Text Retrieval Approach to
Object Matching in Videos
Josef Sivic and Andrew Zisserman
Robotics Research Group, Department of Engineering Science
University of Oxford, United Kingdom
2003
Introduction & Motivation



Apply methods of text cataloging & retrieval to videos.
The user selects an object in a video and the algorithm retrieves all frames containing that object.
The requested object was not previously singled out by a human anywhere in the video (automatic cataloging)!
The Original Model: Text Retrieval System






First, the document is parsed into words.
The words are stemmed to remove grammar-based variations ("walking", "walked" → "walk").
Overly common words ("a", "the") are thrown away for not being discriminative enough.
Each word is assigned a unique identifier, and the document is assigned a vector of word frequencies – each word gets a weight based on its number of occurrences, plus additional algorithm-specific weights such as the link-based rank Google applies.
An index (an "inverted file") is built over all document vectors, specifying for each word the documents in which it appears and its locations within them.
A text is retrieved by building its frequency vector and finding the stored vector closest to it in terms of angle.
Finding Features





Two types of elliptical features are located in each frame: Shape Adapted (SA) and Maximally Stable (MS).
We will not go into implementation details.
Roughly, SA features are computed by scanning a Laplacian pyramid, which makes them appear near edges (high frequencies), while MS features are built using watershed segmentation, which makes them appear on blobs of high contrast relative to the background.
Each feature is saved as a 128-component SIFT descriptor.
The features are tracked across frames (using simple velocity estimation), and features that appear in fewer than 3 frames are discarded as noise.
Building a Vocabulary






It is not enough to find features – we must also quantize them into "words" for text retrieval.
Since all features are tracked, we compute the mean descriptor of each track.
The 10% of tracks with the largest diagonal covariance are discarded as unstable. Roughly 1000 features per frame remain.
To reduce computation, 48 shots are selected, and the k-means clustering algorithm is run on the mean SIFT descriptors within them (several times, keeping the best result). Every cluster represents a word.
Instead of the Euclidean distance we use the Mahalanobis distance: d(x1, x2) = sqrt((x1 - x2)' * inv(S) * (x1 - x2)), where S is the covariance matrix computed from all the mean descriptors in the selected shots (under the assumption that all tracks share the same S) – this reduces the weight of noisy SIFT components.
It is also worth noting that SA and MS features are clustered separately, since they appear in different areas of the image (like having two dictionaries for different words of the same language).
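
A small sketch of the Mahalanobis distance used during clustering (S estimated once from the mean descriptors of the selected shots; names are illustrative). Equivalently, one could whiten the descriptors with inv(S) once and run ordinary Euclidean k-means:

import numpy as np

def mahalanobis(x1, x2, S_inv):
    # d(x1, x2) = sqrt((x1 - x2)' * inv(S) * (x1 - x2))
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(np.sqrt(d @ S_inv @ d))

# S_inv would be computed once, e.g.:
# S_inv = np.linalg.inv(np.cov(mean_descriptors, rowvar=False))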
Building the Frequency Vectors


As mentioned before, most text search engines weight the words using some custom function.
The method used here is known as tf-idf, "term frequency – inverse document frequency", and is computed as follows: each document (frame) is represented by a k-dimensional vector (k is the number of words) of weights Wi s.t. Wi = (Nid/Nd) * log(N/Ni), where:
Nid is the number of occurrences of word i in document d.
Nd is the total word count of document d.
N is the number of documents in the database.
Ni is the number of occurrences of word i in the whole database.
The "tf" term increases the weight of words frequent within a document, while the "idf" term decreases the weight of words common across the database.
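
A direct sketch of the weight formula above for a single document (frame); input names are illustrative:

import numpy as np

def tfidf_weights(word_counts_d, word_counts_db, n_documents):
    # W_i = (N_id / N_d) * log(N / N_i), per the definitions above.
    Nid = np.asarray(word_counts_d, dtype=float)    # occurrences in this frame
    Ni = np.asarray(word_counts_db, dtype=float)    # occurrences in the database
    Nd = Nid.sum()                                  # word count of this frame
    return (Nid / Nd) * np.log(n_documents / np.maximum(Ni, 1.0))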
Matching Words (Features)



When matching, each frame is ranked by the normalized scalar product (cosine of the angle) between its frequency vector and the query's.
To improve speed and robustness, the 5% most common and 10% least common words are eliminated.
Spatial consistency checks improve the results: when a match is found in a frame, each additional match within its 15 nearest neighbors in that frame raises the frame's rank.
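
A sketch of ranking frames by the normalized scalar product between tf-idf vectors, as described above:

import numpy as np

def rank_frames(query_vec, frame_vecs):
    # Score each frame vector by its cosine similarity to the query vector
    # and return the frame indices from best to worst.
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    F = frame_vecs / (np.linalg.norm(frame_vecs, axis=1, keepdims=True) + 1e-12)
    scores = F @ q
    return np.argsort(-scores), scores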
Examples
Learning Object Categories from
Google’s Image Search
R. Fergus (1), L. Fei Fei (2), P. Perona (2) and A. Zisserman
(1) Dept. of Eng. Science
(2)Dept. of Electrical Eng.
University of Oxford
California Institute of Technology
2005
Problem Description




When training classifiers, someone (a human) must collect a viable training database.
A hand-collected database is easy to work with but is also limited in size.
There are sources such as Google Image Search that contain huge amounts of categorized photos, ready for instant learning.
Drawback: the images returned by Google might contain classes other than the one we are looking for, or contain something different from what we intended; and even when a picture is good, we still don't know whether it is in the right orientation, occlusion-free, etc.
pLSA (Probabilistic Latent Semantic
Analysis)




As seen earlier, our model has D "documents" that were quantized into vectors of W "words", which can be arranged in a WxD matrix where n(w,d) is the number of occurrences of word w in document d.
We also add Z topics, s.t. every topic covers several words.
The formal model: P(w,d) = sum(Z, P(w|z)P(z|d)P(d)). This breaks the WxD matrix down into a WxZ matrix and a ZxD matrix.
EM (seen in an earlier lecture) is used to learn P(w|z) and P(z|d), maximizing the likelihood of the model over the data:
L = Prod(D, Prod(W, P(w,d)^n(w,d)))
When recognizing, we fix P(w|z) and use EM to estimate only P(z|d) for the query images.
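
A compact sketch of the EM iterations for plain pLSA on the WxD count matrix n(w,d) (topic count and iteration count are illustrative; for recognition one would keep p_w_z fixed and update only p_z_d):

import numpy as np

def plsa_em(n_wd, n_topics, n_iters=100, seed=0):
    # Learn P(w|z) (W x Z) and P(z|d) (Z x D) by EM on the count matrix.
    rng = np.random.default_rng(seed)
    W, D = n_wd.shape
    p_w_z = rng.random((W, n_topics)); p_w_z /= p_w_z.sum(axis=0)
    p_z_d = rng.random((n_topics, D)); p_z_d /= p_z_d.sum(axis=0)
    for _ in range(n_iters):
        # E-step: responsibilities P(z|w,d) proportional to P(w|z) * P(z|d).
        joint = p_w_z[:, :, None] * p_z_d[None, :, :]          # W x Z x D
        p_z_wd = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both factors from the weighted counts.
        weighted = n_wd[:, None, :] * p_z_wd                   # W x Z x D
        p_w_z = weighted.sum(axis=2)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=0)
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d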
ABS-pLSA (Absolute Position pLSA)




By changing the model slightly we can add the feature’s
location.
We need to quantize the location into X bins and join
them with W: P(w,d)=sum(Z,P(w,x|z)P(z|d)P(d)).
The solution remains as before using EM (the density
matrix is now (WxX)xZ).
The problem with this approach: x is an absolute position, so it is neither translation nor scale invariant.
TSI-pLSA (Translation & Scale Invariant
pLSA)



To overcome ABS-pLSA's shortcomings, we add a 4D vector c that specifies the centroid location of an object and its x/y scales (a bounding box).
We also redefine the location x relative to c: we divide an area (window) around c into Xfg bins, and add one extra bin for the background.
To avoid modeling P(w,x|c,z) directly (which would require between-frame correlation, since c is given in absolute image coordinates), we model P(w,x|z) by marginalizing over c:
P(w,x|z) = sum(C, P(w,x,c|z)) = sum(C, P(w,x|c,z)P(c))
We can now replace P(w,x|z) in the ABS-pLSA model with the above equation.
TSI-pLSA – cont’d



The new model poses a problem: we need to iterate over all possible values of c to learn P(w,x|z), but c has too many dimensions for this to be tractable!
The solution: for each topic z, suggest a small set of candidate c's calculated for that topic.
Intuition: each type of object has a typical range of scales and locations in the image – an elephant is unlikely to appear 3 pixels tall in the top-left corner of the image.
Visual Comparison of the Different pLSAs
Selection of Candidate c’s




We fit a mixture of k Gaussians to the features belonging to each topic (weighted by P(w|z)).
We try different values of k to separate different objects of the same topic.
The mean of each Gaussian gives the centroid, and its variance gives the scale.
This produces a relatively small number of candidate c's!
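
A hedged sketch of generating candidate c's by fitting a Gaussian mixture to feature positions, approximating the P(w|z) weighting with weighted resampling (not necessarily the paper's exact fitting procedure; scikit-learn is assumed and names are illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

def candidate_boxes(positions, weights, k, seed=0):
    # positions: (n, 2) array of feature (x, y) locations for one topic;
    # weights: P(w|z) for each feature. Resample by weight, fit k Gaussians,
    # and read a centroid and x/y scale off each component.
    rng = np.random.default_rng(seed)
    positions = np.asarray(positions, dtype=float)
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    idx = rng.choice(len(positions), size=max(len(positions), 200), p=probs)
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=seed).fit(positions[idx])
    return [(m[0], m[1], float(np.sqrt(v[0])), float(np.sqrt(v[1])))
            for m, v in zip(gmm.means_, gmm.covariances_)]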
Implementation Details


We need a large number of features for the algorithm to work, so we use several feature extractors, including one that simply detects edges and uniformly selects features along them.
The features are saved as SIFT descriptors with only 72 dimensions (rather than 128) and larger histogram bins, since we need categorization rather than exact recognition.