Learning and Inference in Vision:
from Features to Scene Understanding
Jonathan Huang, Tomasz Malisiewicz
MLD Student Research Symposium, 2009
[Image labels: Sky, Bridge, Sign, Trees, Car, Road]
Huge datasets
PASCAL Visual Objects Challenge (VOC) dataset
~15,000 annotated images,
~35,000 annotated object instances,
20 object classes with segmentations, bounding boxes
Huge datasets
LabelMe dataset
~11,845 static images,
>100,000 labeled polygons
Outline
I. Recognizing single object classes (Jon)
II. Scene understanding with multiple classes (Tomasz)
Recognition task #1: Find all markers
Recognition task #2: Find all cats
Object recognition is often hard due to:
Geometric Variability
Variation within an object class
Viewpoint/Scales/Illumination Variability
Images from Flickr
From Pixels to Visual features
[Diagram: Scene -> (imaging) -> Pixels -> (inference) -> Low level features -> (higher level inference) -> "car"]
Local Visual Features
Images are high dimensional!
(640 width) * (480 height) = 307,200 pixels
Compute image statistics in a region (e.g., estimate the distribution of image gradient orientations)
Key ideas in feature design
Be invariant to stuff you don’t care about…
while not being too invariant
Object classification
Let’s look at a simpler example first…
Cow or Horse??
Inference: What object class is this?
Learning: What does each object class look like?
Document classification analogy
???
John Terry scored on a header to
lift Chelsea to a 1-0 victory over
Manchester United and extend
the Blues’ Premier League lead
to 5 points. Chelsea had been
frustrated by Manchester United
for 76 minutes, but took
advantage of a free kick awarded
when Darren Fletcher fouled
Ashley Cole.
Brian Ching scored six minutes
into overtime and the Houston
Dynamo advanced to Major
League Soccer’s Western ...
???
In the Senate, where proposals
differ substantially from the
House-passed measure on
issues like a government-run plan
and how to pay for coverage, the
bill is stalled while budget
analysts assess its overall costs.
The slim margin in the House —
the bill passed with just two votes
to spare, and 39 Democrats
opposed it — suggests even
greater challenges in the Senate,
where the majority leader, ...
Classify each document as sports or politics
Bag-of-words models for text classification
[Figure: a bag of words (Sue Ann)]
“Much of the meaning behind written language is preserved even when
the ordering of the individual words is lost.” [El-Arini et al.,’09]
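The bag-of-words idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenizer, the two hand-picked keyword sets (SPORTS, POLITICS) and the vote-counting rule in classify are hypothetical stand-ins for a learned classifier, not the method from the talk.

```python
from collections import Counter
import random
import re

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    return Counter(re.findall(r"[a-z0-9'-]+", text.lower()))

doc = ("John Terry scored on a header to lift Chelsea "
       "to a 1-0 victory over Manchester United")

# Shuffle the word order with a fixed seed; the bag should not change.
shuffled = doc.split()
random.Random(0).shuffle(shuffled)
same_bag = bag_of_words(doc) == bag_of_words(" ".join(shuffled))

# Hypothetical keyword lists, chosen by hand for illustration only.
SPORTS = {"scored", "header", "victory", "league", "kick"}
POLITICS = {"senate", "house", "bill", "votes", "democrats"}

def classify(text):
    """Label a document by which keyword set its bag overlaps more."""
    bag = bag_of_words(text)
    sports = sum(bag[w] for w in SPORTS)
    politics = sum(bag[w] for w in POLITICS)
    return "sports" if sports >= politics else "politics"
```

Because only the counts are used, classify gives the same answer for a document and for any shuffled copy of it.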
Document classification analogy
???
but to on Darren awarded
Fletcher advanced Ashley lift over
to 1-0 scored advantage Major for
lead 76 Chelsea Premier to Terry
League John Houston the kick
Chelsea took United points. free
minutes fouled United been
frustrated overtime Manchester
six a when League a extend
victory Ching 5 and to and
Western Manchester Brian Cole.
Dynamo Soccer’s by a minutes,
Blues’ the had header into of
scored ...
???
the margin how In on majority 39
costs. with measure slim overall
— to like opposed suggests
challenges pay even substantially
stalled government-run where
the issues votes it the where bill
for spare, from bill and a Senate,
analysts coverage, in — the
Democrats greater differ two
proposals budget its House
assess while Senate, to in just
the leader and the plan passed
the is House-passed The ...
Visual words (discretization)
Words are discrete, visual features are typically continuous…
Discretization via clustering/vector quantization
Visual words
[Sivic et al., ‘05]
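A minimal sketch of the discretization step, assuming a codebook of cluster centres is already available (in practice it would typically come from a clustering algorithm such as k-means; here it is fixed by hand). The function names and the toy 2-D descriptors are hypothetical.

```python
import math
from collections import Counter

def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest (Euclidean) to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def visual_word_histogram(descriptors, codebook):
    """Quantize each descriptor to a visual word and count word occurrences."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts[k] for k in range(len(codebook))]

# Toy 2-D "descriptors"; real local features have many more dimensions.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.1, 0.8)]
histogram = visual_word_histogram(descriptors, codebook)
```

The resulting histogram plays the same role for an image that the word-count vector plays for a document.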
Object classification with bag of words
[Sivic et al., ‘05]
Object classification with bag of words
Performance on Caltech 101 dataset with linear SVM on bag-of-words vectors:
Faces
Airplanes
Cars
[Csurka et al., ‘04]
Object Detection problem
Detection: Locate all the faces in this image.
Classification: Is this a face, or not a face?
Face detection via a series of classifications
(a.k.a. sliding window brain damage)
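Detection as a series of classifications might be sketched as follows. The classifier here (mean_brightness) is a hypothetical stand-in for a trained face/non-face classifier, and the sketch uses a single window size, whereas a practical detector would also scan over scales.

```python
def sliding_windows(width, height, win, stride):
    """Yield the top-left corner (x, y) of every win x win window."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield x, y

def detect(image, win, stride, classify, threshold):
    """Run the classifier on every window; keep windows scoring above threshold."""
    height, width = len(image), len(image[0])
    detections = []
    for x, y in sliding_windows(width, height, win, stride):
        patch = [row[x:x + win] for row in image[y:y + win]]
        score = classify(patch)
        if score > threshold:
            detections.append((x, y, score))
    return detections

def mean_brightness(patch):
    """Hypothetical stand-in for a trained face/non-face classifier."""
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))

# 6x6 toy image with one bright 2x2 blob at (2, 2).
image = [[0] * 6 for _ in range(6)]
for y in range(2, 4):
    for x in range(2, 4):
        image[y][x] = 1

hits = detect(image, win=2, stride=2, classify=mean_brightness, threshold=0.9)
```

Every window is classified independently, which is why the number of classifier calls grows quickly with image size, stride and the number of scales.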
False Detection
Missed Faces
Sliding window detection results
The need for… capturing spatial relationships
One Approach
Create a more descriptive (complicated) feature
[Figure: Original Image -> Estimated Image Gradients (gradient magnitudes, gradient orientations) -> Subdivided Image cells -> Histogrammed gradients in each cell]
Histograms of Oriented Gradients (HOG) features
[Dalal & Triggs, ‘06]
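The pipeline (gradients, cells, per-cell orientation histograms) can be sketched in plain Python. This is a simplified illustration, not the full descriptor: it assumes central-difference gradients, unsigned orientations, hard bin assignment, and it omits the block normalization and bin interpolation that the published method uses. The function name and defaults are assumptions.

```python
import math

def hog_cells(image, cell=4, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hists = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # central differences
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / (180.0 / bins)), bins - 1)
            r, c = y // cell, x // cell
            if r < rows and c < cols:
                hists[r][c][b] += mag
    return hists

# 8x8 toy image with a vertical edge between columns 3 and 4.
image = [[0 if x < 4 else 1 for x in range(8)] for _ in range(8)]
hists = hog_cells(image)
```

For this toy image all gradient energy falls in the first orientation bin (horizontal gradient), in every cell that touches the edge.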
People Tracking with HOG features
Modeling Spatial Relationships with
Deformable Part Based Models
Spring-based models:
Parts prefer low-energy configurations
[Fischler & Elschlager, ’73], [Ramanan et al., ’07],
[Felzenszwalb et al., ’05, ’09], [Kumar et al., ’09]
Parts Based Model
Goal: Assign model parts to image regions preserving both local appearance and spatial relationships
Vertices - Local Appearance
Edges - Spatial Relationship
Parts based models - Inference Problem
Inference problem: What is the best scoring assignment f?
Score terms: Local Appearance term, Pairwise Spatial Relationship term
For trees can use belief propagation for exact solution in polytime
Inference is NP-hard for general graphs
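For tree-structured models, exact inference can be sketched as max-sum dynamic programming from the leaves to the root. The sketch below assumes the score of an assignment is the sum of local appearance terms and pairwise spatial terms; the part names, the 1-D locations, the appearance table and the quadratic "spring" penalty are all hypothetical illustration values.

```python
def best_assignment(children, root, locations, appearance, pairwise):
    """Max-sum dynamic programming on a tree of parts.

    children[p]            -> list of child parts of p
    appearance(p, l)       -> local appearance score of part p at location l
    pairwise(p, lp, c, lc) -> spatial score for parent p at lp, child c at lc
    """
    best_child_loc = {}  # (child, parent location) -> best child location
    cache = {}           # (part, location) -> best score of that subtree

    def message(c, p, lp):
        """Best score of the subtree rooted at c, given parent p sits at lp."""
        score, lc = max((subtree(c, l) + pairwise(p, lp, c, l), l)
                        for l in locations)
        best_child_loc[(c, lp)] = lc
        return score

    def subtree(p, lp):
        if (p, lp) not in cache:
            cache[(p, lp)] = appearance(p, lp) + sum(
                message(c, p, lp) for c in children.get(p, []))
        return cache[(p, lp)]

    score, root_loc = max((subtree(root, l), l) for l in locations)

    # Backtrack from the root to recover each part's location.
    assignment, stack = {root: root_loc}, [root]
    while stack:
        p = stack.pop()
        for c in children.get(p, []):
            assignment[c] = best_child_loc[(c, assignment[p])]
            stack.append(c)
    return score, assignment

# Hypothetical toy model: a torso with a head and a leg on a 1-D strip.
children = {"torso": ["head", "leg"]}
locations = [0, 1, 2, 3, 4]
APP = {"torso": [0, 1, 3, 1, 0], "head": [2, 0, 0, 0, 1], "leg": [0, 0, 0, 2, 2]}
OFFSET = {"head": -1, "leg": 1}  # preferred child position relative to parent

def appearance(p, l):
    return APP[p][l]

def pairwise(p, lp, c, lc):
    return -(lc - lp - OFFSET[c]) ** 2  # spring: penalize stretch from rest offset

score, assignment = best_assignment(children, "torso", locations, appearance, pairwise)
```

With n parts and L candidate locations this costs on the order of n * L^2 score evaluations, instead of L^n for exhaustive search.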
Parts based models - Learning Problem
Linear models: Local Appearance term, Pairwise Spatial Relationship term
Learning linear models: Find weight vectors that best separate positive and negative examples. E.g.,
Convex max-margin objective, s.t. positive examples on one side, negative examples on the other
[Kumar et al,’09]
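The objective itself did not survive in this transcript. As a reference point only, a standard soft-margin formulation of "separate positives from negatives with a linear model" is written below. The symbols are assumptions, not the slide's notation: w is the stacked weight vector, \Phi(x_i) the stacked appearance and spatial features of example i, y_i \in \{+1, -1\} its label, \xi_i a slack variable and C a trade-off constant. The cited work may use a different variant.

```latex
\min_{w,\ \xi}\quad \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i} \xi_{i}
\qquad \text{s.t.} \qquad
y_{i}\, w^{\top} \Phi(x_{i}) \ \ge\ 1 - \xi_{i}, \qquad \xi_{i} \ \ge\ 0 \quad \forall i
```

The objective is convex in (w, \xi), and the constraints place positive examples (y_i = +1) on one side of the margin and negative examples (y_i = -1) on the other.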
Root filter (8x8 resolution)
Part filter (4x4 resolution)
Quadratic spatial configuration model
Person deformable part model
[Felzenszwalb et al., ’09]
[Felzenszwalb et al., ’09]
[Ramanan et al,’09]
Part II: Scene Understanding with
Multiple Classes
Goal: Predict Many Different Objects in a Single Image
[Image labels: Tree, Building, Car, Fence, Fire Hydrant, Sidewalk]
Wait...
• What’s wrong with just learning a different sliding window classifier for each object type in the world?
The image as seen from an object detector’s point of view
Relationships between objects
make recognition possible
Antonio Torralba. The Context Challenge.
http://web.mit.edu/torralba/www/carsAndFacesInContext.html
Objects as the “Parts” of a Scene
Deformable Part Model
Scene Model
Key Challenge in Scene Understanding: Modeling relationships between objects from different categories
Fixed Extent “Things” vs Free-form “Stuff”
[Image labels: Tree, Building, Car, Fence, Fire Hydrant, Sidewalk]
Things have a well-defined shape. A part of a car is not a car.
Stuff is free-form and mostly defined by color/texture. A part of a building is still a building.
3 Types of Scene Models
Pixel-based
Window-based
Segment-based
Pixel-based Scene Understanding
Produces Segmentation
Unable to reason about instances
Only limited notion of context
Works well on “stuff”
TextonBoost: Joint Appearance, Shape and Context Modeling for Multiclass Object Recognition and Segmentation. Shotton et al. ECCV 2006
Pixel-wise Conditional Random Fields (TextonBoost)
• Inference: y^* = argmax_y p(y|x)
• Training: Use boosting to learn unary potential
• Future Direction: Higher-Order Cliques
TextonBoost: Joint Appearance, Shape and Context Modeling for Multiclass Object Recognition and Segmentation. Shotton et al. ECCV 2006
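The inference rule y^* = argmax_y p(y|x) can be made concrete on a tiny example. The sketch below assumes p(y|x) is proportional to exp(-E(y)), where the energy E sums per-pixel unary costs and a Potts smoothness penalty between neighbours on a 1-D chain, and it finds the argmax by exhaustive search. The numbers are made up, and this is not the TextonBoost model or its optimizer; brute force is only feasible for a handful of pixels.

```python
from itertools import product

def energy(labels, unary, smooth):
    """Negative log-probability up to a constant: unary costs plus Potts penalty."""
    e = sum(unary[i][y] for i, y in enumerate(labels))
    e += sum(smooth for a, b in zip(labels, labels[1:]) if a != b)
    return e

def map_labeling(unary, num_labels, smooth):
    """Exhaustive argmax_y p(y|x), i.e. argmin of the energy over all labelings."""
    return min(product(range(num_labels), repeat=len(unary)),
               key=lambda y: energy(y, unary, smooth))

# unary[i][k]: cost of giving pixel i label k (0 = "sky", 1 = "road"); made-up values.
unary = [[0.1, 2.0], [0.2, 1.5], [1.0, 0.8], [0.3, 1.8]]

independent = map_labeling(unary, 2, smooth=0.0)  # each pixel decided on its own
smoothed = map_labeling(unary, 2, smooth=0.5)     # neighbours prefer to agree
```

Without the pairwise term the third pixel is labeled "road" on weak evidence; with it, the isolated label is smoothed away, which is the limited notion of context a pixel-wise CRF provides.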
Window-based Scene Understanding
Object Recognition by Scene Alignment.
Russell et al. NIPS 2007
Discriminative models for multi-class object layout.
Desai et al. ICCV 2009
Often not possible to model “stuff” using windows.
Window assumption also questionable for some “things.”
Possible to model interactions between object instances.
Discriminative models for multi-class object layout
• Inference via Greedy Forward Search
• Training
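Greedy forward search over candidate windows can be sketched as follows, assuming the layout score is a sum of per-window detector scores and pairwise interaction scores between selected windows. The candidate names, scores and the symmetric INTERACTION table are hypothetical; the cited paper's exact scoring function and stopping rule may differ.

```python
def greedy_forward_search(candidates, unary, pairwise):
    """Greedily add detections while the total layout score keeps increasing.

    candidates     -> list of hashable detection ids
    unary[c]       -> local detector score of candidate c
    pairwise(a, b) -> interaction score between two selected candidates
    """
    selected, remaining = [], set(candidates)
    while remaining:
        def gain(c):
            # Change in total score if c is added to the current selection.
            return unary[c] + sum(pairwise(c, s) for s in selected)
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical candidates: two overlapping car windows and a weak person window.
unary = {"car_a": 1.0, "car_b": 0.8, "person": -0.2}
INTERACTION = {
    frozenset(("car_a", "car_b")): -2.0,   # overlapping duplicates suppress each other
    frozenset(("car_a", "person")): 0.5,   # a person next to a car is plausible
    frozenset(("car_b", "person")): 0.5,
}

def pairwise(a, b):
    return INTERACTION.get(frozenset((a, b)), 0.0)

selected = greedy_forward_search(list(unary), unary, pairwise)
```

Here the weak person window is accepted only because of its positive interaction with the already selected car, while the duplicate car window is rejected.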
Window-based results
Region-Based Scene Understanding
Use Segmentation algorithm to extract stable regions
Use CRF to label those segments
Problem: Hard to get object-segments.
Problem: Inference difficult for fully connected models.
Region-Based CRF
Spatial Relations
• Training: Bag of Words with Nearest Neighbor classifier
• Maximum Likelihood training of pairwise potentials
Object Categorization using Co-Occurrence, Location and
Appearance. Galleguillos et al. CVPR 2008.
Segmentation-Based Results
[Figure columns: Input image, No context, w/ context]
Object Categorization using Co-Occurrence, Location and
Appearance. Galleguillos et al. CVPR 2008.
Model Granularity vs. Object Type
Object Type \ Granularity  | Pixels | Windows | Regions
Things (car, cow, person)  | :-(    | :-)     | :-/
Stuff (road, sky, tree)    | :-)    | :-(     | :-)
Scene Understanding Recap
• Rich object-object interactions are important for scene understanding.
• Different underlying assumptions (pixel vs. window vs. region) are better suited for different types of objects (“stuff” vs. “things”).
• Many of the techniques for single-class object recognition (e.g., part-based models) are relevant for scene understanding.
Thanks!
Image Classification
Sliding Window based Object Detection
Modeling Spatial Relationships between objects
Modeling Spatial Relationships between parts