Texture-Layout Filters

Jamie Shotton
Machine Intelligence Laboratory, University of Cambridge
John Winn, Carsten Rother, Antonio Criminisi
Microsoft Research Cambridge, UK
Presenter: Kuang-Jui Hsu
Date: 2011/07/04 (Mon.)
• Introduction
• A Conditional Random Field Model of Object Classes
• Boosted Learning of Texture, Layout, and Context
• Results and Comparisons
• Goal: automatic detection, recognition, and segmentation of object classes in photographs
• The aim is not only accurate segmentation and recognition, but also an efficient algorithm
• At a local level, the appearance of an image patch leads to ambiguities in its class label
• To overcome this, it is necessary to incorporate longer-range information
• To that end, the authors construct a discriminative model for labeling images that exploits all three types of information: textural appearance, layout, and context
• Overcomes problems associated with object recognition techniques that rely on sparse features
• The authors' technique, based on dense features, can cope with both textured and untextured objects, and with multiple objects that inter- or self-occlude
• Three contributions:
• A novel type of feature called the texture-layout filter
• A new discriminative model that combines texture-layout filters with lower-level image features
• A demonstration of how to train this model efficiently on a very large dataset by exploiting both boosting and piecewise training methods
• Use a conditional random field (CRF) model to learn the conditional distribution over class labelings given an image
• Incorporates texture, layout, color, location, and edge cues
• Definition:

$$\log P(\mathbf{c} \mid \mathbf{x}, \boldsymbol{\theta}) = \sum_i \Big[ \underbrace{\psi_i(c_i, \mathbf{x}; \boldsymbol{\theta}_\psi)}_{\text{texture-layout}} + \underbrace{\pi(c_i, x_i; \boldsymbol{\theta}_\pi)}_{\text{color}} + \underbrace{\lambda(c_i, i; \boldsymbol{\theta}_\lambda)}_{\text{location}} \Big] + \sum_{(i,j) \in \varepsilon} \underbrace{\phi(c_i, c_j, g_{ij}(\mathbf{x}); \boldsymbol{\theta}_\phi)}_{\text{edge}} - \underbrace{\log Z(\boldsymbol{\theta}, \mathbf{x})}_{\text{partition function}}$$

c: the class labels
x: an image
i: node i in the graph
θ = {θ_ψ, θ_π, θ_λ, θ_φ}: the parameters to learn
ε: the set of edges in a 4-connected grid structure
• Definition: $\psi_i(c_i, \mathbf{x}; \boldsymbol{\theta}_\psi) = \log P(c_i \mid \mathbf{x}, i)$
$P(c_i \mid \mathbf{x}, i)$: the normalized distribution given by a boosted classifier
• This classifier models the texture, layout, and textural context of the object classes by combining novel discriminative features called texture-layout filters
• Represented as Gaussian mixture models (GMMs) in CIELab color space, where the mixture coefficients depend on the class label
• Conditional probability of the color x of a pixel:

$$P(x \mid c) = \sum_k P(x \mid k)\, P(k \mid c)$$

with color clusters (mixture components)

$$P(x \mid k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

k: color cluster
μ_k, Σ_k: the mean and covariance, respectively, of color cluster k
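The class-conditional color likelihood above can be sketched in a few lines of numpy. This is an illustrative stand-in for the paper's CIELab GMM: isotropic Gaussians and all cluster parameters here are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Isotropic Gaussian density N(x | mu, var * I) for a color vector x."""
    d = len(x)
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return np.exp(-0.5 * diff @ diff / var) / (2 * np.pi * var) ** (d / 2)

def color_likelihood(x, mus, variances, p_k_given_c):
    """P(x | c) = sum_k P(x | k) P(k | c): a GMM whose mixture
    coefficients P(k | c) depend on the class label c."""
    p_x_given_k = np.array([gaussian_pdf(x, mu, v)
                            for mu, v in zip(mus, variances)])
    return float(p_x_given_k @ p_k_given_c)
```

With two clusters and a class whose coefficients put all mass on the first cluster, the likelihood reduces to the first Gaussian's density at x.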
$$P(x \mid c) = \sum_k P(x \mid k)\, P(k \mid c)$$

The color models are Gaussian mixture models in which the mixture coefficients P(k|c) are conditioned on the class label c
• However, we want to predict the class label c given the color x of a pixel and the color cluster k
• Use simple inference via Bayes' rule; for pixel $x_i$:

$$P(k \mid x_i) = \frac{P(k, x_i)}{P(x_i)} = \frac{P(x_i \mid k)\, P(k)}{P(x_i)}$$

so $P(k \mid x_i) \propto P(x_i \mid k)$ (treating the cluster prior P(k) as uniform)
• Definition:

$$\pi(c_i, x_i; \boldsymbol{\theta}_\pi) = \log \sum_k \boldsymbol{\theta}_\pi(c_i, k)\, P(k \mid x_i)$$
• Definition:

$$\pi(c_i, x_i; \boldsymbol{\theta}_\pi) = \log \sum_k \boldsymbol{\theta}_\pi(c_i, k)\, P(k \mid x_i)$$

• The learned parameters $\boldsymbol{\theta}_\pi(c_i, k)$ represent the distribution $P(c_i \mid k)$
• For discriminative inference, the arrows in the graphical model are reversed using Bayes' rule
• Definition:

$$\lambda(c_i, i; \boldsymbol{\theta}_\lambda) = \log \boldsymbol{\theta}_\lambda(c_i, \hat{i})$$

$\hat{i}$: the normalized version of the pixel index i, where the normalization allows for images of different sizes
• The parameters $\boldsymbol{\theta}_\lambda$ are also learned
• Definition:

$$\phi(c_i, c_j, g_{ij}(\mathbf{x}); \boldsymbol{\theta}_\phi) = -\boldsymbol{\theta}_\phi^T\, g_{ij}(\mathbf{x})\, [c_i \neq c_j]$$

• $g_{ij}$: the edge feature, measuring the difference in color between neighboring pixels:

$$g_{ij} = \begin{bmatrix} \exp(-\beta \lVert x_i - x_j \rVert^2) \\ 1 \end{bmatrix}$$

$x_i, x_j$: three-dimensional vectors representing the colors of pixels i, j

$$\beta = \left( 2 \left\langle \lVert x_i - x_j \rVert^2 \right\rangle \right)^{-1}$$

where ⟨·⟩ averages over the image
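A minimal numpy sketch of the contrast-sensitive part of $g_{ij}$, computed here for horizontally adjacent pixel pairs only (the function name and array layout are illustrative, not from the paper):

```python
import numpy as np

def edge_features(img):
    """exp(-beta * ||x_i - x_j||^2) for each horizontal neighbor pair.

    img: (H, W, 3) array of pixel colors.  beta is set image-adaptively
    to (2 * <||x_i - x_j||^2>)^{-1}, the average taken over the image.
    """
    diff = img[:, 1:, :] - img[:, :-1, :]      # color difference of neighbors
    sq = np.sum(diff ** 2, axis=-1)            # ||x_i - x_j||^2
    beta = 1.0 / (2.0 * sq.mean())             # image-adaptive contrast term
    return np.exp(-beta * sq)
```

Similar colors give a feature near 1 (strong penalty for a label change), strong color edges give a feature near 0.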
• Given the CRF model and its learned parameters, find the most probable labeling $\mathbf{c}^\ast$
• The optimal labeling is found by applying the alpha-expansion graph-cut algorithm
• Given a current configuration (set of labels) c and a fixed label $\alpha \in \{1, \dots, C\}$, where C is the number of classes
• Each pixel i makes a binary decision: it can either keep its old label or switch to label α
• A binary vector $\mathbf{s} \in \{0,1\}^P$ defines the auxiliary configuration $\mathbf{c}[\mathbf{s}]$ as

$$c_i[\mathbf{s}] = \begin{cases} c_i, & \text{if } s_i = 0 \\ \alpha, & \text{if } s_i = 1 \end{cases}$$

• Start with an initial configuration $\mathbf{c}^0$, given by the mode of the texture-layout potentials
• Compute the optimal alpha-expansion move for each label α in some order, accepting a move only if it improves the objective function
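The move-making loop can be illustrated on a toy 1-D chain. Note the hedge: the paper solves each binary expansion subproblem with a graph cut; this sketch solves it by brute force, and the energy here is a negative log-posterior, so moves are accepted when they lower it.

```python
import itertools

def chain_energy(labels, unary, lam):
    """Unary costs plus a Potts smoothness penalty on a 1-D chain."""
    e = sum(unary[i][l] for i, l in enumerate(labels))
    e += sum(lam for a, b in zip(labels, labels[1:]) if a != b)
    return e

def alpha_expansion(labels, unary, lam, n_classes, sweeps=2):
    """Sweep over labels alpha; each pixel may keep its label or switch to alpha.
    The optimal binary move is found exhaustively (a graph cut in practice)."""
    labels = list(labels)
    for _ in range(sweeps):
        for alpha in range(n_classes):
            best, best_e = labels, chain_energy(labels, unary, lam)
            for s in itertools.product((0, 1), repeat=len(labels)):
                cand = [alpha if si else li for si, li in zip(s, labels)]
                e = chain_energy(cand, unary, lam)
                if e < best_e:          # accept only improving moves
                    best, best_e = cand, e
            labels = best
    return labels
```

On a three-pixel chain where the middle pixel weakly prefers a different label than its neighbors, the smoothness term pulls it back into agreement.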
• There are two methods to learn the parameters:
• Maximum a posteriori (MAP) – poor results
• Piecewise training
• Only $\boldsymbol{\theta}_\pi, \boldsymbol{\theta}_\lambda, \boldsymbol{\theta}_\phi$ are learned by these methods
• $\boldsymbol{\theta}_\psi$ is learned during boosted learning
• Maximize the conditional likelihood of the labels given the training data:

$$L(\boldsymbol{\theta}) = \sum_n \log P(\mathbf{c}^n \mid \mathbf{x}^n, \boldsymbol{\theta}) + \log P(\boldsymbol{\theta})$$

$\mathbf{c}^n, \mathbf{x}^n$: the nth training output and input
$\log P(\boldsymbol{\theta})$: a prior term that prevents overfitting
• The maximization of $L(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ can be achieved using a gradient-ascent algorithm
• Conjugate gradient ascent did eventually converge to a solution, but evaluating the learned parameters against validation data gave poor results with almost no improvement
• The lack of alignment between object edges and label boundaries in the roughly labeled training set forced the learned parameters toward zero
• Based on the piecewise training method of "Piecewise Training of Undirected Models" [C. Sutton et al., 2005]
• The terms are trained independently and then recombined
• The training method minimizes an upper bound on the log partition function.
Let $z(\boldsymbol{\theta}, \mathbf{x}) = \log Z(\boldsymbol{\theta}, \mathbf{x})$, and index the terms in the model by r:

$$z(\boldsymbol{\theta}, \mathbf{x}) \le \sum_r z_r(\boldsymbol{\theta}_r, \mathbf{x})$$

$\boldsymbol{\theta}_r$: the parameters of the rth term
$z_r(\boldsymbol{\theta}_r, \mathbf{x})$: the log partition function for a model containing only the rth term
$$z(\boldsymbol{\theta}, \mathbf{x}) \le \sum_r z_r(\boldsymbol{\theta}_r, \mathbf{x})$$

Proof sketch:
Use Jensen's inequality with positive weights $a_i$:

$$\varphi\!\left(\frac{\sum_i a_i x_i}{\sum_j a_j}\right) \le \frac{\sum_i a_i\, \varphi(x_i)}{\sum_j a_j}, \quad \text{if } \varphi \text{ is convex}$$

$$\varphi\!\left(\frac{\sum_i a_i x_i}{\sum_j a_j}\right) \ge \frac{\sum_i a_i\, \varphi(x_i)}{\sum_j a_j}, \quad \text{if } \varphi \text{ is concave}$$

$z(\boldsymbol{\theta}, \mathbf{x}) = \log Z(\boldsymbol{\theta}, \mathbf{x})$ is concave
• Replacing $z(\boldsymbol{\theta}, \mathbf{x})$ with $\sum_r z_r(\boldsymbol{\theta}_r, \mathbf{x})$ gives a lower bound on the conditional likelihood
• The bound can be loose, especially if the terms in the model are correlated
• Performing piecewise parameter training leads to over-counting during inference in the combined model
• Because of over-counting, $\theta_\psi^{\text{new}} = 2\,\theta_\psi^{\text{old}}$
• To avoid this, weight the logarithm of each duplicated term by a factor of 0.5, i.e., raise the term to the power 0.5
• Four types of parameters have to be learned:
• Texture-layout potential parameters
• Color potential parameters
• Location potential parameters
• Edge potential parameters
• The first set is learned during boosted learning; the others are learned by piecewise training
• The color potentials are learned at test time for each image independently
• First, the color clusters $P(x \mid k) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$ are learned in an unsupervised manner using K-means
• Then an iterative algorithm, reminiscent of EM, alternates between inferring the class labeling $\mathbf{c}^\ast$ and updating the color potential parameters as

$$\boldsymbol{\theta}_\pi(c, k) = \left( \frac{\sum_i [c_i^\ast = c]\, P(k \mid x_i) + \alpha_\pi}{\sum_i P(k \mid x_i) + \alpha_\pi} \right)^{w_\pi}$$
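The update can be sketched with numpy; the array shapes and the hyperparameter values below are illustrative, not the paper's.

```python
import numpy as np

def update_color_potentials(c_star, resp, n_classes, alpha_pi=0.1, w_pi=3.0):
    """theta_pi(c, k) from inferred labels c* and soft assignments P(k | x_i).

    c_star: (N,) inferred class per pixel; resp: (N, K) with resp[i, k] = P(k | x_i).
    alpha_pi is a smoothing prior; w_pi compensates for over-counting.
    """
    # soft count of cluster k among pixels currently labeled c
    num = np.stack([resp[c_star == c].sum(axis=0) for c in range(n_classes)])
    theta = (num + alpha_pi) / (resp.sum(axis=0) + alpha_pi)
    return theta ** w_pi
```

Clusters dominated by pixels of class c end up with large theta_pi(c, k), so the color potential votes for c wherever that cluster explains the pixel color.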
• These parameters are trained by maximizing the likelihood of the normalized model containing just the location potential, raising the result to a fixed power $w_\lambda$ to compensate for over-counting:

$$\boldsymbol{\theta}_\lambda(c, \hat{i}) = \left( \frac{N_{c,\hat{i}} + \alpha_\lambda}{N_{\hat{i}} + \alpha_\lambda} \right)^{w_\lambda}$$

$N_{c,\hat{i}}$: the number of pixels of class c at normalized location $\hat{i}$ in the training set
$N_{\hat{i}}$: the total number of pixels at normalized location $\hat{i}$
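A sketch of this count-based estimate; the grid resolution and hyperparameter values are illustrative, and each training map is simply subsampled onto a common normalized grid.

```python
import numpy as np

def learn_location_potentials(label_maps, n_classes, grid=8, alpha=1.0, w=0.1):
    """theta_lambda(c, i^): smoothed class frequencies at normalized locations.

    label_maps: list of (H, W) integer class maps, possibly of different sizes;
    each is subsampled to a grid x grid normalized index space.
    """
    counts = np.zeros((n_classes, grid, grid))
    for m in label_maps:
        h, w_px = m.shape
        ys = (np.arange(grid) * h) // grid
        xs = (np.arange(grid) * w_px) // grid
        sub = m[np.ix_(ys, xs)]                # nearest-pixel subsampling
        for c in range(n_classes):
            counts[c] += (sub == c)
    total = counts.sum(axis=0)
    return ((counts + alpha) / (total + alpha)) ** w
```

Normalizing the pixel index means images of different sizes contribute to the same location bins, as the slide notes.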
• The values of the two contrast-related parameters were selected manually to minimize error on the validation set
• Based on a novel set of features called texture-layout filters
• Capable of jointly capturing texture, spatial layout, and textural context
1. The training images are convolved with a 17-dimensional filter bank at scale κ
2. The 17-D responses for all training pixels are whitened
3. An unsupervised clustering (K-means) is performed
4. Each pixel in each image is assigned to the nearest cluster center, producing the texton map
» Denote the texton map by T, where pixel i has value $T_i \in \{1, \dots, K\}$
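The four steps can be sketched end-to-end with numpy. A tiny two-filter bank and a plain K-means loop stand in for the paper's 17-D filter bank and clustering; all names and sizes here are illustrative.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_same(img, ker):
    """2-D 'same' correlation with edge padding."""
    ph, pw = ker.shape[0] // 2, ker.shape[1] // 2
    pad = np.pad(img, ((ph, ph), (pw, pw)), mode='edge')
    win = sliding_window_view(pad, ker.shape)
    return np.einsum('ijkl,kl->ij', win, ker)

def kmeans(X, k, iters=10, seed=0):
    """Plain Euclidean K-means; returns the cluster index of each row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign

def texton_map(img, filters, k):
    """Convolve, whiten, cluster, assign: pixel i gets texton T_i in {0..k-1}."""
    resp = np.stack([conv_same(img, f) for f in filters], axis=-1)
    X = resp.reshape(-1, len(filters))
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)   # per-dimension whitening
    return kmeans(X, k).reshape(img.shape)
```

Every pixel is thus summarized by a single discrete texton index, which is what the texture-layout filters consume.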
• Each texture-layout filter is a pair (r, t) of an image region r and a texton t
• r: defined in coordinates relative to the pixel i being classified
• For simplicity, a set ℛ of candidate rectangles is chosen at random, such that their top-left and bottom-right corners lie within a fixed bounding box covering about half the image area
Feature response:

$$v_{[r,t]}(i) = \frac{1}{\text{area}(r)} \sum_{j \in (r+i)} [T_j = t]$$

i: the location of the pixel being classified; (r + i): region r translated to location i
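A direct (non-integral-image) sketch of this response. Names are illustrative, and pixels of the offset rectangle that fall outside the image are counted as mismatches, as they would be in a zero-padded texton map.

```python
import numpy as np

def feature_response(T, t, r, i):
    """v_[r,t](i): fraction of rectangle r, offset to pixel i, whose texton is t.

    T: (H, W) texton map; r = (y0, x0, y1, x1) in coordinates relative to i;
    i = (y, x).  Pixels falling outside the image contribute zero.
    """
    (y0, x0, y1, x1), (y, x) = r, i
    area = (y1 - y0) * (x1 - x0)
    ys = np.arange(y + y0, y + y1)
    xs = np.arange(x + x0, x + x1)
    ys = ys[(ys >= 0) & (ys < T.shape[0])]   # clip to the image
    xs = xs[(xs >= 0) & (xs < T.shape[1])]
    return np.count_nonzero(T[np.ix_(ys, xs)] == t) / area
```

Because r is relative to i, the same (r, t) pair probes the same spatial layout of texture around every pixel it is applied to.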
The responses can be efficiently computed over a whole image with integral images [P. Viola et al., 2001]

Process:
1. The texton map is separated into K channels (one for each texton)
2. For each channel, a separate integral image is calculated
3. Feature response via four corner lookups:

$$v_{[r,t]}(i) = \frac{1}{\text{area}(r)} \left( \hat{T}^{(t)}(r_{br}) - \hat{T}^{(t)}(r_{bl}) - \hat{T}^{(t)}(r_{tr}) + \hat{T}^{(t)}(r_{tl}) \right)$$

$\hat{T}^{(t)}$: the integral image of T for texton channel t
$r_{br}, r_{bl}, r_{tr}, r_{tl}$: the bottom-right, bottom-left, top-right, and top-left corners of rectangle r
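A sketch of the integral-image computation, with one channel per texton and an extra zero row/column so the corner lookups need no special cases (layout illustrative):

```python
import numpy as np

def texton_integral_images(T, K):
    """ii[t, y, x] = number of pixels (y', x') with y' < y, x' < x and T = t."""
    H, W = T.shape
    ii = np.zeros((K, H + 1, W + 1))
    for t in range(K):
        ii[t, 1:, 1:] = np.cumsum(np.cumsum(T == t, axis=0), axis=1)
    return ii

def response_from_integral(ii, t, y0, x0, y1, x1):
    """Normalized texton count over rows y0..y1-1, cols x0..x1-1:
    four corner lookups instead of a scan over the rectangle."""
    s = ii[t, y1, x1] - ii[t, y0, x1] - ii[t, y1, x0] + ii[t, y0, x0]
    return s / ((y1 - y0) * (x1 - x0))
```

The integral images are built once per image; after that every rectangle query is O(1), which is what makes evaluating thousands of texture-layout filters tractable.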
• Some classes may have large within-class textural differences, but a repeatable layout of texture within a particular object instance
• The texton-dependent layout filter therefore uses the texton at the pixel i being classified, $T_i$, rather than a particular learned texton

$$P(\mathbf{c} \mid \mathbf{x}, \boldsymbol{\theta}) \approx P(\mathbf{c}_x \mid \mathbf{x}, \boldsymbol{\theta}_x) \times P(\mathbf{c}_y \mid \mathbf{x}, \boldsymbol{\theta}_y)$$
• Employ an adapted version of the Joint Boost algorithm [A. Torralba et al., 2007]
• Iteratively selects discriminative texture-layout filters as "weak learners"
• Combines them into a powerful classifier $P(c \mid \mathbf{x}, i)$, used by the texture-layout potentials
• Joint Boost shares each weak learner between a set of classes C
• Strong classifier:

$$H(c, i) = \sum_{m=1}^{M} h_i^m(c)$$

• Use the multiclass logistic transformation $P(c \mid \mathbf{x}, i) \propto \exp H(c, i)$ [J. Friedman et al., 2000]
• Each weak learner is a decision stump based on the feature response $v_{[r,t]}(i)$:

$$h_i(c) = \begin{cases} a\,[v_{[r,t]}(i) > \theta] + b, & \text{if } c \in C \\ k^c, & \text{otherwise} \end{cases}$$

with parameters $(a, b, \{k^c\}_{c \notin C}, \theta, C, r, t)$
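Evaluating the strong classifier and the logistic transform for one pixel can be sketched as follows; the learner tuple layout is an illustrative choice, not the paper's data structure.

```python
import numpy as np

def strong_classifier(v, learners, n_classes):
    """H(c, i) = sum_m h_i^m(c) for one pixel with feature responses v.

    Each learner is (f_idx, theta, a, b, k, sharing): a decision stump on
    v[f_idx] shared by the classes in `sharing`, constant k[c] elsewhere.
    """
    H = np.zeros(n_classes)
    for f_idx, theta, a, b, k, sharing in learners:
        stump = a * (v[f_idx] > theta) + b
        for c in range(n_classes):
            H[c] += stump if c in sharing else k[c]
    return H

def class_posterior(H):
    """Multiclass logistic transform: P(c | x, i) proportional to exp H(c, i)."""
    e = np.exp(H - H.max())        # subtract max for numerical stability
    return e / e.sum()
```

Classes outside a learner's sharing set still receive the constant k^c, which keeps the per-class confidences comparable across rounds.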
• Each training example i (a pixel in a training image) is paired with a target value $z_i^c \in \{-1, +1\}$ and assigned a weight $w_i^c$ specifying its classification accuracy for class c after m − 1 rounds
• Round m chooses a new weak learner by minimizing the weighted squared error $J_{wse}$:

$$J_{wse} = \sum_c \sum_i w_i^c \left( z_i^c - h_i^m(c) \right)^2$$

• The weights are then updated:

$$w_i^c := w_i^c\, e^{-z_i^c h_i^m(c)}$$
• Minimizing the error function $J_{wse}$ requires an expensive brute-force search over the possible weak learners $h_i^m(c)$
• However, given the sharing set N, the feature (r, t), and the threshold θ, closed forms exist for $a$, $b$, and $\{k^c\}_{c \notin N}$ by minimizing $J_{wse}$:

$$b = \frac{\sum_{c \in N} \sum_i w_i^c z_i^c\, [v_{[r,t]}(i) \le \theta]}{\sum_{c \in N} \sum_i w_i^c\, [v_{[r,t]}(i) \le \theta]}$$

$$a + b = \frac{\sum_{c \in N} \sum_i w_i^c z_i^c\, [v_{[r,t]}(i) > \theta]}{\sum_{c \in N} \sum_i w_i^c\, [v_{[r,t]}(i) > \theta]}$$

$$k^c = \frac{\sum_i w_i^c z_i^c}{\sum_i w_i^c}$$
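These closed-form solutions reduce to a few weighted sums in numpy; the array layout below is an illustrative choice.

```python
import numpy as np

def fit_stump(v, z, w, theta, sharing):
    """Closed-form a, b, {k^c} for a fixed sharing set N, feature, and threshold.

    v: (N_pix,) feature responses; z, w: (C, N_pix) targets in {-1, +1} and
    weights; sharing: (C,) boolean mask marking the classes in the sharing set.
    """
    above = v > theta
    ws, zs = w[sharing], z[sharing]
    b = (ws * zs)[:, ~above].sum() / ws[:, ~above].sum()
    a = (ws * zs)[:, above].sum() / ws[:, above].sum() - b
    k = (w * z).sum(axis=1) / w.sum(axis=1)   # used only for classes outside N
    return a, b, k
```

Both regression constants are weighted means of the targets on either side of the threshold, so a perfectly separating feature drives the stump outputs to the targets themselves.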
• Employ the quadratic-cost greedy algorithm to speed up the search over sharing sets [A. Torralba et al., 2007]
• Optimization over θ ∈ Θ can be made efficient by careful use of histograms of weighted feature responses:
• Treating Θ as an ordered set, histograms of the values $v_{[r,t]}(i)$, weighted appropriately by $w_i^c z_i^c$ and by $w_i^c$, are built over bins corresponding to the thresholds in Θ
• These histograms are accumulated to give the thresholded sums needed in the calculation of a and b, for all values of θ at once
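The cumulative-histogram trick above can be sketched as follows: one pass bins every response, and cumulative sums then yield the "≤ θ" numerator and denominator sums for every candidate threshold simultaneously (function and variable names are illustrative).

```python
import numpy as np

def thresholded_sums(v, wz, w, thresholds):
    """For every theta in the sorted array `thresholds`, return
    (sum of wz over {v <= theta}, sum of w over {v <= theta})."""
    bins = np.searchsorted(thresholds, v, side='left')  # first threshold >= v
    n = len(thresholds) + 1                             # extra bin: v above all
    h_wz = np.bincount(bins, weights=wz, minlength=n)
    h_w = np.bincount(bins, weights=w, minlength=n)
    # cumulative sums turn per-bin totals into <=-theta totals, one per theta
    return np.cumsum(h_wz)[:-1], np.cumsum(h_w)[:-1]
```

The "> θ" sums follow by subtracting each entry from the overall total, so a single pass over the data serves every threshold in Θ.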
• Employ a random feature-selection procedure to speed up the minimization over features [S. Baluja et al.]
• The algorithm examines only a randomly chosen fraction ξ ≪ 1 of the possible features
Adding more texture-layout filters improves classification
• The effect of the different model potentials:
(a): the original input image
(b): using only the texture-layout potentials
(c): without color modeling
(d): the full CRF model
• Texton-dependent layout filter results
• MSRC 21-class database results
• Accuracy of segmentation for the MSRC 21-class database
• Comparison with He et al.
• TV sequences