
Fisher kernels for image representation &
generative classification models
Jakob Verbeek
December 11, 2009
Plan for this course
1) Introduction to machine learning
2) Clustering techniques
   – k-means, Gaussian mixture density
3) Gaussian mixture density continued
   – Parameter estimation with EM
4) Classification techniques 1
   – Introduction, generative methods, semi-supervised
   – Fisher kernels
5) Classification techniques 2
   – Discriminative methods, kernels
6) Decomposition of images
   – Topic models, …
Classification
•
Training data consists of “inputs”, denoted x, and corresponding output
“class labels”, denoted as y.
•
Goal is to correctly predict for a test data input the corresponding class
label.
•
Learn a “classifier” f(x) from the input data that outputs the class label or a
probability over the class labels.
•
Example:
– Input: image
– Output: category label, e.g. “cat” vs. “no cat”
•
Classification can be binary (two classes), or over a larger number of
classes (multi-class).
– In binary classification we often refer to one class as “positive”, and the other as
“negative”
•
A binary classifier creates a boundary in the input space between the areas
assigned to each class
Example of classification
Given: training images and their categories
What are the categories
of these test images?
Discriminative vs generative methods
•
Generative probabilistic methods
– Model the density of inputs x from each class p(x|y)
– Estimate class prior probability p(y)
– Use Bayes’ rule to infer distribution over class given input
$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$, where $p(x) = \sum_y p(y)\,p(x|y)$
•
Discriminative (probabilistic) methods
– Directly estimate class probability given input: p(y|x)
– Some methods do not have a probabilistic interpretation,
• e.g. they fit a function f(x), and assign to class 1 if f(x) > 0,
and to class 2 if f(x) < 0
Generative classification methods
•
Generative probabilistic methods
– Model the density of inputs x from each class p(x|y)
– Estimate class prior probability p(y)
– Use Bayes’ rule to infer distribution over class given input
$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$, where $p(x) = \sum_c p(y{=}c)\,p(x|y{=}c)$
•
Modeling class-conditional densities over the inputs x
– Selection of model class:
• Parametric models: such as Gaussian (for continuous), Bernoulli (for binary), …
• Semi-parametric models: mixtures of Gaussian, Bernoulli, …
• Non-parametric models: Histograms over one-dimensional, or multi-dimensional data,
nearest-neighbor method, kernel density estimator
•
Given the class-conditional models, classification is trivial: just apply Bayes’ rule (see the sketch below)
•
Adding new classes can be done by adding a new class conditional model
– Existing class conditional models stay as they are
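A minimal sketch of this recipe (not from the slides): given per-class log-density functions and class priors, classification is a direct application of Bayes’ rule. The function names and the use of NumPy are my own assumptions.

```python
import numpy as np

def bayes_classify(x, class_log_densities, class_priors):
    """Generative classification: pick argmax_c p(x|y=c) p(y=c).

    class_log_densities: list of functions, each returning log p(x|y=c)
    class_priors: array of p(y=c), summing to one
    """
    # Unnormalised log-posterior per class: log p(x|y=c) + log p(y=c)
    log_scores = np.array([f(x) for f in class_log_densities]) + np.log(class_priors)
    # Normalise with log-sum-exp to obtain p(y=c|x)
    log_posterior = log_scores - np.logaddexp.reduce(log_scores)
    return np.argmax(log_posterior), np.exp(log_posterior)
```

Adding a new class only requires fitting one more class-conditional model and appending it to the list; the existing models are untouched.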
Histogram methods
•
Suppose we
– have N data points
– use a histogram with C cells
•
How to set the density level in each cell?
– Maximum (log)-likelihood estimator:
– Proportional to the nr. of points $n_c$ in the cell
– Inversely proportional to the volume $V_c$ of the cell
$p_c = n_c / (N V_c)$ (sketched in code below)
•
Problems with histogram method:
– # cells scales exponentially with the dimension of the data
– Discontinuous density estimate
– How to choose cell size?
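A minimal sketch of the estimator $p_c = n_c / (N V_c)$ in one dimension, assuming equally sized cells on a fixed range; the bin layout and the NumPy usage are my own choices, not from the slides.

```python
import numpy as np

def histogram_density(data, n_cells, low, high):
    """Histogram density estimate with C equal cells on [low, high].

    Returns the cell edges and the density level p_c = n_c / (N * V_c).
    """
    counts, edges = np.histogram(data, bins=n_cells, range=(low, high))
    volumes = np.diff(edges)                  # cell volumes V_c (all equal here)
    density = counts / (len(data) * volumes)  # p_c = n_c / (N V_c)
    return edges, density

# Example: estimate the density of 1D Gaussian samples
samples = np.random.randn(1000)
edges, density = histogram_density(samples, n_cells=20, low=-4.0, high=4.0)
print(density.sum() * np.diff(edges)[0])      # integrates to approximately 1
```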
The ‘curse of dimensionality’
•
Number of bins increases exponentially with the dimensionality of the data.
– Fine division of each dimension: many empty bins
– Rough division of each dimension: poor density model
•
A probability distribution of D discrete variables takes at least $2^D$ values
– At least 2 values for each variable
•
The number of cells may be reduced by assuming independence between the
components of x: the naïve Bayes model
– Model is “naïve” since it assumes that all variables are independent…
– Unrealistic for high dimensional data, where variables tend to be dependent
• Poor density estimator
• Classification performance can still be good using derived p(y|x)
Example of generative classification
•
Hand-written digit classification
– Input: binary 28x28 scanned digit images, collected in a 784-dimensional vector
– Desired output: class label of image
•
Generative model
– Independent Bernoulli model for each class
– One probability per pixel per class
– Maximum likelihood estimator is the average value per pixel per class
– Classify using Bayes’ rule:
$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$, with $p(x) = \sum_c p(y{=}c)\,p(x|y{=}c)$
$p(x|y{=}c) = \prod_{d=1}^{D} p(x^d \mid y{=}c)$
$p(x^d{=}1 \mid y{=}c) = \theta_{cd}$,  $p(x^d{=}0 \mid y{=}c) = 1 - \theta_{cd}$
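A sketch of the independent Bernoulli model above, assuming the images arrive as binary vectors of length 784 with integer class labels; the smoothing constant and function names are my own choices.

```python
import numpy as np

def fit_bernoulli_model(X, y, n_classes=10, eps=1e-3):
    """Maximum-likelihood Bernoulli parameters: theta[c, d] is the average
    value of pixel d over the training images of class c."""
    theta = np.zeros((n_classes, X.shape[1]))
    priors = np.zeros(n_classes)
    for c in range(n_classes):
        Xc = X[y == c]
        theta[c] = Xc.mean(axis=0)        # average pixel value per class
        priors[c] = len(Xc) / len(X)      # class prior p(y=c)
    return np.clip(theta, eps, 1 - eps), priors   # clip to avoid log(0)

def classify(x, theta, priors):
    """Bayes' rule: log p(x|y=c) = sum_d x_d log(theta_cd) + (1-x_d) log(1-theta_cd)."""
    log_lik = x @ np.log(theta.T) + (1 - x) @ np.log(1 - theta.T)
    return np.argmax(log_lik + np.log(priors))
```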
k-nearest-neighbor estimation method
•
Idea: fix the number of samples in the cell, and find the right cell size.
•
The probability to find a point in a sphere A centered on x with volume v is $P = \int_A p(x')\,dx'$
•
A smooth density is approximately constant in a small region, and thus $P \approx p(x)\,v$
•
Alternatively: estimate P from the fraction of training data in the sphere around x: $P \approx k/N$
•
Combine the above to obtain the estimate $p(x) \approx \frac{k}{N v}$
k-nearest-neighbor estimation method
•
Method in practice:
– Choose k
– For given x, compute the volume v which contains k samples.
– Estimate the density with $p(x) \approx \frac{k}{N v}$ (see the sketch after this slide)
•
Volume of a sphere with radius r in d dimensions is $v(r,d) = \frac{2\, r^d\, \pi^{d/2}}{d\, \Gamma(d/2)}$
•
What effect does k have?
– Data sampled from mixture
of Gaussians plotted in green
– Larger k, larger region,
smoother estimate
•
Selection of k
– Leave-one-out cross validation
– Select k that maximizes data
log-likelihood
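A minimal sketch of the estimate $p(x) \approx k/(N v)$, where v is the volume of the smallest sphere around x that contains k training points; the brute-force distance computation and the function names are my own implementation choices.

```python
import math
import numpy as np

def sphere_volume(r, d):
    """Volume of a d-dimensional sphere with radius r."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

def knn_density(x, data, k):
    """k-NN density estimate p(x) = k / (N * v), with v the volume of the
    smallest sphere centered on x containing k training points."""
    dists = np.linalg.norm(data - x, axis=1)   # brute-force distances to all samples
    r = np.sort(dists)[k - 1]                  # radius reaching the k-th nearest neighbour
    return k / (len(data) * sphere_volume(r, data.shape[1]))
```

Selecting k by leave-one-out cross validation amounts to evaluating this estimate at each held-out training point and keeping the k with the highest total log-likelihood.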
k-nearest-neighbor classification rule
•
Use k-nearest-neighbor density estimation to find p(x|category)
•
Apply Bayes’ rule for classification: k-nearest-neighbor classification
– Find the sphere volume v that captures k data points for the estimate $p(x) = \frac{k}{N v}$, with $v(r,d) = \frac{2\, r^d\, \pi^{d/2}}{d\, \Gamma(d/2)}$
– Use the same sphere for each class for the estimates $p(x \mid y{=}c) = \frac{k_c}{N_c v}$
– Estimate the global class priors $p(y{=}c) = \frac{N_c}{N}$
– Calculate the class posterior distribution
$p(y{=}c \mid x) = \frac{p(x \mid y{=}c)\,p(y{=}c)}{p(x)} = \frac{1}{p(x)}\,\frac{k_c}{N v} = \frac{k_c}{k}$
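A sketch of the rule just derived: take the k nearest training points, and the posterior is simply the fraction $k_c / k$ of them that belongs to class c. Brute-force neighbour search; the function name is mine.

```python
import numpy as np

def knn_classify(x, data, labels, k, n_classes):
    """Posterior p(y=c|x) = k_c / k from the k nearest training points."""
    dists = np.linalg.norm(data - x, axis=1)
    nearest = np.argsort(dists)[:k]                             # indices of the k nearest neighbours
    counts = np.bincount(labels[nearest], minlength=n_classes)  # k_c per class
    return counts / k                                           # posterior over classes
```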
k-nearest-neighbor classification rule
•
Effect of k on classification boundary
– Larger number of neighbors
– Larger regions
– Smoother class boundaries
Kernel density estimation methods
•
Consider a simple estimator of the cumulative distribution function: $F(x) = \frac{1}{N}\sum_{n=1}^{N} \mathbf{1}[x_n \le x]$
•
Its derivative gives an estimator of the density function, but this is just a set of delta peaks.
•
The derivative is defined as $p(x) = \lim_{h \to 0} \frac{F(x+h) - F(x-h)}{2h}$
•
Consider a non-limiting value of h: $p(x) \approx \frac{F(x+h) - F(x-h)}{2h} = \frac{1}{2hN}\sum_{n=1}^{N} \mathbf{1}[\,|x - x_n| \le h\,]$
•
Each data point adds 1/(2hN) in a region extending h on either side of it; the sum of these “blocks” gives the estimate.
Kernel density estimation methods
•
A kernel other than the “block” function can be used to obtain a smooth estimator.
•
A widely used kernel function is the (multivariate) Gaussian (see the sketch after this slide)
– Contribution decreases smoothly as a function of the distance to the data point.
•
Choice of smoothing parameter
– Larger size of the “kernel” function gives a smoother density estimator
– Use the average distance between samples.
– Use cross-validation.
•
Method can be used for multivariate data
– Or in a naïve Bayes model
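A minimal sketch of a Gaussian kernel density estimator with an isotropic bandwidth h; the NumPy implementation and the parameter names are my own.

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate p(x) = (1/N) sum_n N(x; x_n, h^2 I)."""
    d = data.shape[1]
    sq_dists = np.sum((data - x) ** 2, axis=1)       # squared distances ||x - x_n||^2
    norm = (2 * np.pi * h ** 2) ** (-d / 2)          # Gaussian normalisation constant
    return np.mean(norm * np.exp(-0.5 * sq_dists / h ** 2))
```

The bandwidth h plays the same role as the cell size and k above: it can be set from the average distance between samples or by cross-validation.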
Summary generative classification methods
•
(Semi-)parametric models (e.g. p(data|category) = Gaussian or mixture)
– No need to store data, but possibly too strong assumptions on data density
– Can lead to poor fit on data, and poor classification result
•
Non-parametric models
– Histograms:
• Only practical in low dimensional space (<5 or so)
• High dimensional space will lead to many cells, many of which will be empty
• Naïve Bayes modeling in higher dimensional cases
– K-nearest neighbor & kernel density estimation:
• Need to store all training data
• Need to find nearest neighbors or points with non-zero kernel evaluation (costly)
[Figure: density estimates using a histogram, k-nn, and k.d.e.]
Discriminative vs generative methods
•
Generative probabilistic methods
– Model the density of inputs x from each class p(x|y)
– Estimate class prior probability p(y)
– Use Bayes’ rule to infer distribution over class given input
$p(y|x) = \frac{p(x|y)\,p(y)}{p(x)}$, where $p(x) = \sum_y p(y)\,p(x|y)$
•
Discriminative (probabilistic) methods (next week)
– Directly estimate class probability given input: p(y|x)
– Some methods do not have a probabilistic interpretation,
• e.g. they fit a function f(x), and assign to class 1 if f(x) > 0,
and to class 2 if f(x) < 0
•
Hybrid generative-discriminative models
– Fit density model to data
– Use properties of this model as input for classifier
– Example: Fisher vectors for image representation
Clustering for visual vocabulary construction
• Clustering of local image descriptors
– using k-means or mixture of Gaussians
• Recap of the image representation pipeline
– Extract image regions at various locations and scales
– Compute a descriptor for each region (e.g. SIFT)
– (Soft) assignment of each descriptor to the clusters
– Make a histogram for the complete image
• Summing the vector representations of each descriptor (a code sketch follows after the figure)
• Input to image classification method
[Figure: image regions are (soft-)assigned to cluster indexes, and the per-region assignment vectors are summed into the image-level histogram]
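A sketch of the last two steps of this pipeline under the assumptions above: each local descriptor is soft-assigned to the visual words of a diagonal-covariance Gaussian mixture, and the assignment vectors are summed into one histogram per image. The parameter names are placeholders; descriptor extraction (e.g. SIFT) is not shown.

```python
import numpy as np

def soft_assign(descriptors, weights, means, variances):
    """Posterior q_nk of each descriptor under a diagonal-covariance GMM.

    descriptors: (N, D), weights: (K,), means: (K, D), variances: (K, D).
    """
    # log N(x_n; m_k, diag(v_k)) for all descriptors n and components k
    log_gauss = -0.5 * (
        np.sum(np.log(2 * np.pi * variances), axis=1)
        + np.sum((descriptors[:, None, :] - means) ** 2 / variances, axis=2)
    )
    log_post = np.log(weights) + log_gauss
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)  # normalise per descriptor
    return np.exp(log_post)            # shape (N descriptors, K visual words)

def bow_histogram(descriptors, weights, means, variances):
    """Image representation: sum of soft assignments over all regions."""
    return soft_assign(descriptors, weights, means, variances).sum(axis=0)
```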
Fisher Vector motivation
•
Feature vector quantization is computationally expensive in practice
•
Run-time linear in
– N: nr. of feature vectors ~ 10^3 per image
– D: nr. of dimensions ~ 10^2 (SIFT)
– K: nr. of clusters ~ 10^3 for recognition
•
So in total on the order of N x D x K ≈ 10^8 multiplications per image to assign SIFT descriptors to visual words
•
We use the histogram of visual word counts
•
Can we do this more efficiently?!
•
Reading material: “Fisher Kernels on Visual Vocabularies for Image Categorization”, F. Perronnin and C. Dance, CVPR'07, Xerox Research Centre Europe, Meylan
Fisher vector image representation
•
MoG / k-means stores nr of points per cell
– Need many clusters to represent distribution of descriptors in image
– But increases computational cost
•
Fisher vector adds 1st & 2nd order moments
– More precise description of the regions assigned to each cluster
– Fewer clusters needed for same accuracy
– Representation (2D+1) times larger, at same computational cost
– Terms already calculated when computing soft-assignment
$q_{nk}$,  $q_{nk}(x_n - m_k)$,  $q_{nk}(x_n - m_k)^2$
$q_{nk}$: soft-assignment of image region n to cluster k (Gaussian mixture component)
[Figure: example per-cluster statistics for an image]
Image representation using Fisher kernels
•
General idea of the Fisher vector representation
– Fit a probabilistic model to the data
– Use the derivative of the data log-likelihood as data representation, e.g. for classification
[Jaakkola & Haussler. “Exploiting generative models in discriminative classifiers”, in
Advances in Neural Information Processing Systems 11, 1999.]
•
Here, we use Mixture of Gaussians to cluster the region descriptors
$L(\theta) = \sum_{n=1}^{N} \log p(x_n) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n; m_k, C_k)$
•
Concatenate derivatives to obtain data representation
$\frac{\partial}{\partial \pi_k} L(\theta) = \sum_{n=1}^{N} q_{nk}$
$\frac{\partial}{\partial m_k} L(\theta) = C_k^{-1} \sum_{n=1}^{N} q_{nk}\,(x_n - m_k)$
$\frac{\partial}{\partial C_k^{-1}} L(\theta) = \sum_{n=1}^{N} q_{nk}\left[\tfrac{1}{2} C_k - \tfrac{1}{2}(x_n - m_k)(x_n - m_k)^T\right]$
Image representation using Fisher kernels
•
Extended representation of image descriptors using MoG
– Displacement of descriptor from center
– Squares of displacement from center
$q_{nk}$  →  $\frac{\partial}{\partial \pi_k}$,  $\frac{\partial}{\partial m_k}$,  $\frac{\partial}{\partial C_k^{-1}}$
– From 1 number per descriptor per cluster, to $1 + D + D^2$ (D = data dimension)
•
Simplified version obtained when
– Using this representation for a linear classifier
– Diagonal covariance matrices, the variance in each dimension given by a vector $v_k$
– For a single image region descriptor
$q_{nk}$,  $q_{nk}(x_n - m_k)$,  $q_{nk}(x_n - m_k)^2$
– Summed over all descriptors this gives us
• 1: Soft count of regions assigned to cluster
• D: Weighted average of assigned descriptors
• D: Weighted variance of descriptors in all dimensions
Fisher vector image representation
• MoG / k-means stores nr of points per cell
– Need many clusters to represent distribution of descriptors in image
• Fisher vector adds 1st & 2nd order moments
– More precise description of the regions assigned to each cluster
– Fewer clusters needed for same accuracy
– Representation (2D+1) times larger, at same computational cost
– Terms already calculated when computing soft-assignment
– Comp. cost is O(NKD): we need the differences between all clusters and data points
$q_{nk}$,  $q_{nk}(x_n - m_k)$,  $q_{nk}(x_n - m_k)^2$
[Figure: example per-cluster statistics for an image]
Images from categorization task PASCAL VOC
•
Yearly “competition” for image classification (also object localization,
segmentation, and body-part localization)
Fisher Vector: results
•
BOV-supervised learns a separate mixture model for each image class, so that some of the visual words are class-specific
•
MAP: assign the image to the class whose corresponding MoG assigns maximum likelihood to the region descriptors
•
Other results: based on a linear classifier of the image descriptions
•
Similar performance, using 16x fewer Gaussians
•
The unsupervised/universal representation works well
Plan for this course
1) Introduction to machine learning
2) Clustering techniques
   – k-means, Gaussian mixture density
3) Gaussian mixture density continued
   – Parameter estimation with EM
4) Classification techniques 1
   – Introduction, generative methods, semi-supervised
5) Classification techniques 2
   – Discriminative methods, kernels
6) Decomposition of images
   – Topic models, …
Reading for next week:
– Previous papers (!), nothing new
– Available on the course website http://lear.inrialpes.fr/~verbeek/teaching