Learning with Ambiguity in Computer Vision


Training Discriminative Computer
Vision Models
with Weak Supervision
Boris Babenko
PhD Defense
University of California, San Diego
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
Computer Vision Problems
• Want to detect, recognize/classify, track
objects in images and videos
• Examples:
– Face detection for point-and-shoot cameras
– Pedestrian detection for cars
– Animal tracking for behavioral science
– Landmark/place recognition for search-by-image
Old School
• Hand tuned models per application
• Example: face detection
[Yang et al. ‘94]
New School
• Adopt methods from machine learning
• Train a generic* system by providing labeled
examples (supervised learning)
– Labeling examples is intuitive
– Adapt to new domains/applications
– Learn subtle cues that would be impossible to
model by hand
* Hand tuning/design still often required :-/
Supervised Learning
• Training data: pairs of inputs and labels
• Train classifier to predict label for novel input
TRAINING
(image, face), (image, face), (image, non-face), (image, non-face)
RUN TIME
(image, ?)
Supervised Learning
• Training data:
  Inputs/instances: $x_1, \ldots, x_n \in \mathcal{X}$
  Labels: $y_1, \ldots, y_n \in \mathcal{Y}$
• Most common case: binary labels, $\mathcal{Y} = \{-1, +1\}$
• Want to train a classifier: $h : \mathcal{X} \rightarrow \mathcal{Y}$
• Typically a classifier also outputs a confidence score, in addition to a label
Discriminative vs Generative
• Generative: model the distribution of the data
• Discriminative: directly minimize classification
error, model the boundary
– E.g. SVM, AdaBoost, Perceptron
– Tends to outperform generative models
Training Discriminative Model
• Objective (minimize training error):
  $\min_h \; \sum_i \ell\big(h(x_i), y_i\big) + \lambda R(h)$
• Loss function, $\ell$, is typically a convex upper bound on the 0/1 loss
• Regularization term $R(h)$ can help avoid over-fitting (a minimal training sketch follows below)
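As a concrete, purely illustrative sketch of this objective (not any specific method from the talk): the code below minimizes a logistic surrogate of the 0/1 loss plus an L2 regularizer by plain gradient descent. The data, step size, and regularization weight are made up for the example.

    import numpy as np

    def train_linear_classifier(X, y, lam=0.1, lr=0.1, iters=500):
        """Minimize sum_i log(1 + exp(-y_i w.x_i)) + lam * ||w||^2 by gradient descent.
        X: (n, d) inputs; y: (n,) labels in {-1, +1}."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            margins = y * (X @ w)  # y_i * w.x_i
            # gradient of the logistic loss plus the L2 regularizer
            grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).sum(axis=0) + 2 * lam * w
            w -= lr * grad / n
        return w

    # Toy usage with synthetic 2-D data (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+1, 1, (50, 2)), rng.normal(-1, 1, (50, 2))])
    y = np.hstack([np.ones(50), -np.ones(50)])
    w = train_linear_classifier(X, y)
    print("training accuracy:", np.mean(np.sign(X @ w) == y))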
Weak Supervision
• Slightly overloaded term…
• Any form of learning where the training data is
missing some labels (i.e. latent variables)
Object Detection
w/ Weak Supervision
• Goal: train object detector
• Strong: (cropped face patch, face), (cropped face patch, face), (cropped patch, non-face)
• Weak: only presence of the object is known, not its location; the location is a latent variable
[Figure: whole training images labeled only +, +, -]
Weak Supervision: Advantages
• Reduce labor cost
• Deal with inherent ambiguity & human error
• Automatically discover latent information
Training w/ Latent Variables
• Classifier now takes in the input AND a latent input: $h(x, z)$
• To predict a label: $\hat{y} = \mathrm{sign}\big(\max_z h(x, z)\big)$
• Objective: $\min_h \; \sum_i \ell\big(\max_z h(x_i, z), \; y_i\big)$
• Not convex!
Training w/ Latent Variables
• Two ways of solving
• Method 1: Alternate between finding latent
variables and training classifier
– Finding latent variables given a fixed classifier may
require domain knowledge
– E.g. EM (Dempster et al.), Latent Structural SVM
(Yu & Joachims) – based on CCCP (Yuille &
Rangarajan), Latent SVM (Felzenszwalb et al.), MISVM (Andrews et al.)
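A minimal sketch of the alternation idea (Method 1), not any of the cited algorithms: a linear classifier over feature vectors, where the latent-inference step is a simple argmax over a candidate set per positive example. All names and the least-squares "trainer" are stand-ins chosen for brevity.

    import numpy as np

    def train_with_latent(pos_candidates, X_neg, rounds=5):
        """Alternate between (a) picking the best latent candidate per positive example
        and (b) retraining a linear classifier on the resulting 'completed' data.
        pos_candidates: list of (m_i, d) arrays, one row per latent value, per positive.
        X_neg: (n_neg, d) negative feature vectors."""
        d = X_neg.shape[1]
        w = np.zeros(d)  # with w = 0, the first pass just picks the first candidate
        for _ in range(rounds):
            # (a) infer latent variables: pick the highest-scoring candidate per positive
            X_pos = np.vstack([c[np.argmax(c @ w)] for c in pos_candidates])
            # (b) retrain on completed data (least squares as a stand-in for a real solver)
            X = np.vstack([X_pos, X_neg])
            y = np.hstack([np.ones(len(X_pos)), -np.ones(len(X_neg))])
            w = np.linalg.lstsq(X, y, rcond=None)[0]
        return w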
Training w/ Latent Variables
• Method 2: Replace the hard max with “soft”
approximation, and then do gradient descent
– E.g. MILBoost (Viola et al.), MIL-Logistic
Regression (Ray et al.)
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
Object Detection
w/ Weak Supervision
• Goal: train object detector
• Only presence of object is known, not location
+
+
-
• Can’t “just throw these into a learning alg.” –
very difficult to design invariant features
Multiple Instance Learning (MIL)
• (set of inputs, label) pairs provided
• MIL lingo: set of inputs = bag of instances
• Learner does not see instance labels
• Bag labeled positive if at least one instance in the bag is positive
[Keeler et al. ‘90, Dietterich et al. ‘97]
Object Detection w/ MIL
[Figure: three whole images shown as bags of patches, labeled +, +, -]
Instance: image patch
Instance Label: is face?
Bag: whole image
Bag Label: contains face?
[Andrews et al. ’02, Viola et al. ’05, Dollar et al. ’08, Galleguillos et al. ’08]
MIL Notation
• Training input:
  Bags: $X_1, \ldots, X_n$, where $X_i = \{x_{i1}, \ldots, x_{im}\}$
  Bag Labels: $y_i \in \{-1, +1\}$
  Instance Labels: $y_{ij}$ (unknown during training)
MIL
• A positive bag contains at least one positive instance: $y_i = \max_j y_{ij}$
• Goal: learn an instance classifier $h(x)$
• Corresponding bag classifier: $H(X_i) = \max_j h(x_{ij})$ (sketch below)
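In code, the bag classifier is simply the max of the instance classifier over the bag; a tiny sketch with illustrative names (any real scorer could be plugged in for `instance_scorer`):

    def bag_score(instance_scorer, bag):
        """Bag confidence = max instance confidence over the bag."""
        return max(instance_scorer(x) for x in bag)

    def bag_label(instance_scorer, bag, threshold=0.0):
        """Bag label = sign of the max instance confidence."""
        return 1 if bag_score(instance_scorer, bag) > threshold else -1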
MIL Algorithms
• Many “standard” learning algorithms have
been adapted to the MIL scenario:
– SVM (Andrews et al. ‘02), Boosting (Viola et al.
‘05), Logistic Regression (Ray et al. ‘05)
• Some specialized algorithms also exist
– DD (Maron et al. ’98), EM-DD (Zhang et al. ‘02)
MIL Algorithms
• Objective: minimize bag error on training data
  $\min_h \; \sum_i \ell\big(\hat{y}_i, \, y_i\big)$, where $\hat{y}_i$ is the bag label according to $h$, i.e. $\hat{y}_i = \max_j h(x_{ij})$
• MILBoost (Viola et al. ‘05)
  – Replace max with a differentiable approximation (e.g. noisy-OR); a sketch follows below
  – Use functional gradient descent (Mason et al. ’00, Friedman ’01)
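To make the "soft max" idea concrete, here is a small sketch in the spirit of the noisy-OR bag model: the bag probability is a differentiable stand-in for the max of instance scores, and its gradient with respect to each instance score gives per-instance weights that a boosting-style learner could use. This is an illustrative derivation, not a reproduction of the exact MILBoost update.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def bag_prob_noisy_or(scores):
        """Soft, differentiable stand-in for the max:
        p(bag positive) = 1 - prod_j (1 - p_ij), with p_ij = sigmoid(score_ij)."""
        return 1.0 - np.prod(1.0 - sigmoid(np.asarray(scores)))

    def instance_weights(scores, bag_is_positive):
        """d(log-likelihood)/d(instance score), usable as per-instance weights
        for a (functional) gradient step on the bag loss."""
        p_ij = sigmoid(np.asarray(scores))
        p_i = 1.0 - np.prod(1.0 - p_ij)
        if bag_is_positive:
            return p_ij * (1.0 - p_i) / max(p_i, 1e-12)
        return -p_ij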
Object Detection
• Have a learning framework (MIL), and an algorithm to train
classifier (MILBoost)
• Question: how exactly do we form a bag?
Sliding Window: bag = { overlapping windows cropped from the image … } (sketch below)
Segmentation: bag = { segments produced by a segmentation algorithm … }
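The sliding-window option is easy to sketch: every overlapping window cropped from the image becomes one instance, and the whole image becomes one bag. Window size and stride here are arbitrary illustrative values.

    import numpy as np

    def sliding_window_bag(image, win=24, stride=8):
        """Form a bag of instances by cropping overlapping windows from a 2-D image array."""
        H, W = image.shape
        return [image[r:r + win, c:c + win]
                for r in range(0, H - win + 1, stride)
                for c in range(0, W - win + 1, stride)]

    # Example: a whole (synthetic) image becomes one bag of patches.
    bag = sliding_window_bag(np.zeros((100, 120)), win=24, stride=12)
    print(len(bag), "instances in the bag")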
Forming a bag via segmentation
• Pro: get more precise localization
• Con: segmentation algorithms often fail;
require prior knowledge (e.g. number of
segments)
• If segmentation fails, we might not see “the”
positive instance in a positive bag
• Only way to prevent this is to use ALL possible
segments… not practical
Multiple Stable Segmentations (MSS)
• Solution: Multiple Stable Segmentations
(Rabinovich et al. ‘06)
– A heuristic for picking out a few “good” segments
from the huge set of all possible segments
– End up with more segments, but higher chance of
getting the “right” segment
Multiple Instance Learning with Stable Segmentation (MILSS)
• Localization and Recognition
• Features: bag-of-features (BoF) over SIFT descriptors
• Classifier: MILBoost, one-vs-all (for multiclass)
[Figure: multiple stable segmentations of an image form a bag of segments]
[Work with Carolina Galleguillos, Andrew Rabinovich & Serge Belongie – ECCV ‘08]
Results: Landmarks
[Figure: example localization and recognition results on landmark images]
Results: Landmarks
• More segments = better results
[Plot: Our System vs. NCuts w/ k=6 and NCuts w/ k=4]
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
Object Detection with Parts
• Pedestrians are non-rigid
• Difficult to design features that are invariant
– Decision boundary very complex
• Object parts are rigid
Object Detection with Parts
• Naïve sol’n: label parts and train detectors
– Labor intensive
– Sub-optimal (e.g. “space between the legs”)
• Better sol’n:
– Use rough location of objects
– Treat part locations as latent variables
[Mohan et al. ’01, Mikolajczyk et al. ‘04]
Multiple Component Learning (MCL)
1. How to train a part detector from weakly
labeled data?
2. How to train many, diverse part detectors?
3. How to combine part detectors and incorporate spatial information?
[Work with Piotr Dollar, Pietro Perona, Zhuowen Tu & Serge Belongie ECCV ‘08]
MCL: One Part Detector
• Fits perfectly into MIL
[Figure: pedestrian images as positive bags of patches, labeled +]
• Which part does it learn?
MCL: Diverse Parts
• Pedestrian images are “roughly aligned”
• Choose random sections of the images to feed
into MIL
MCL: Top 5 Learned Detectors
MCL: Combining Part Detectors
• Run part detectors, get response map
• Compute Haar features on top, plug into Boosting (sketch below)
[Figure: confidence maps from each part detector]
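A small sketch of what a Haar-like feature over a part-detector response map could look like: box sums computed with an integral image, and a two-rectangle difference as the feature. The helper names and the specific two-rectangle layout are illustrative, not the exact feature set used in MCL.

    import numpy as np

    def integral_image(m):
        """Summed-area table with a zero top row/column for easy box sums."""
        return np.pad(m, ((1, 0), (1, 0))).cumsum(0).cumsum(1)

    def box_sum(ii, r0, c0, r1, c1):
        """Sum of m[r0:r1, c0:c1] using the integral image."""
        return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

    def haar_two_rect(response_map, r, c, h, w):
        """One Haar-like feature on a part-detector response map:
        left box minus right box of an h x (2w) window anchored at (r, c)."""
        ii = integral_image(response_map)
        return box_sum(ii, r, c, r + h, c + w) - box_sum(ii, r, c + w, r + h, c + 2 * w)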
MCL: Results
• INRIA Pedestrian dataset
MCL: Results
MCL: Related Work
• P. Felzenszwalb, R. Girshick, D. McAllester, D.
Ramanan. "Object Detection with
Discriminatively Trained Part-Based
Models" IEEE PAMI. Sept 2009.
– Very similar model, uses SVM instead of Boosting,
and an explicit shape model
• L. Bourdev, S. Maji, T. Brox, J. Malik.
“Detecting people using mutually consistent
poselet activations” ECCV 2010.
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
Object Tracking
• Problem: given location of object in first
frame, track object through video
• Tracking by Detection: alternate training
detector and running it on each frame
Tracking by Detection
• First frame is labeled
• Train an online classifier (e.g. Online AdaBoost)
Tracking by Detection
• Grab one positive patch and some negative patches, and train/update the model
Tracking by Detection
• Get next frame
Tracking by Detection
• Evaluate classifier in some search window around the old location
Tracking by Detection
• Find the max response -> new location
Tracking by Detection
• Repeat…
Problems
• What if classifier is a bit off?
– Tracker starts to drift
• How to choose training examples?
[Work with Ming-Hsuan Yang, & Serge Belongie – CVPR ’09, PAMI ‘11]
How to Get Training Examples
[Figure: patches used to update a standard classifier (a single cropped positive) vs. an MIL classifier (a bag of patches around the estimated location); a sketch of forming such bags follows below]
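A hedged sketch of how a tracker could form its training bags: every patch whose center lies within a small radius of the current estimate goes into one positive bag, and patches in an annulus farther out serve as negatives. The patch size, radii, and step are illustrative values, not the parameters used in MILTrack.

    import numpy as np

    def crop(frame, cx, cy, size):
        h = size // 2
        return frame[cy - h:cy + h, cx - h:cx + h]

    def tracking_bags(frame, cx, cy, size=32, pos_radius=4, neg_inner=8, neg_outer=16, step=4):
        """Positive bag: patches centered within pos_radius of the current estimate.
        Negatives: patches in an annulus farther away."""
        pos_bag, neg_patches = [], []
        H, W = frame.shape  # grayscale frame assumed
        for dy in range(-neg_outer, neg_outer + 1, step):
            for dx in range(-neg_outer, neg_outer + 1, step):
                x, y = cx + dx, cy + dy
                if not (size // 2 <= x < W - size // 2 and size // 2 <= y < H - size // 2):
                    continue  # skip patches that would fall off the frame
                d = np.hypot(dx, dy)
                if d <= pos_radius:
                    pos_bag.append(crop(frame, x, y, size))
                elif neg_inner <= d <= neg_outer:
                    neg_patches.append(crop(frame, x, y, size))
        return pos_bag, neg_patches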
Experiments
• Compare MILTrack to:
– OAB1 = Online AdaBoost w/ 1 pos. per frame
– OAB5 = Online AdaBoost w/ 45 pos. per frame
– SemiBoost = Online Semi-supervised Boosting
– FragTrack = Static appearance model
• All params were FIXED
• 9 videos, labeled every 5 frames by hand
(available on the web)
[Grabner ‘06, Adam ‘06, Grabner ’08]
Experiments
[Plots and example frames: tracking results on the benchmark videos]
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
Weakly Labeled Categories
• Discovering sub-categories
– Object Detection
• Discovering super-categories
– Image Categorization
[Figure: category hierarchy example, e.g. Animals -> Birds -> {Bluejay, Warbler}; Cats]
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
Start with binary problem
• Examples from the “positive” category
• Difficult to design invariant features
[Images from the Sheffield Face Database]
Learning sub-categories
• Discover sub-categories
• Train a detector for each sub-category
[Figure: the full category split into sub-categories]
Multiple Pose Learning (MPL)
• Naïve sol’n: run k-means on data, and then
train classifier on each cluster
• Our sol’n: Simultaneously group data and
train classifiers; treat group membership as a
latent variable
• Similar to MIL…
[Work with Piotr Dollar, Serge Belongie & Zhuowen Tu – Faces in Real-Life Images, ECCV ‘08]
MPLBoost
• Objective for MIL: $\min_h \; \sum_i \ell\big(\max_j h(x_{ij}), \, y_i\big)$
• Instead of having many instances per bag, we now want to train many classifiers; the max moves over classifiers (a sketch follows below):
  $\min_{h_1, \ldots, h_K} \; \sum_i \ell\big(\max_k h_k(x_i), \, y_i\big)$
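To highlight the structural difference from MIL, here is a toy sketch: prediction is the max over K classifiers on a single instance, and training alternates between assigning each positive to its current best classifier and refitting each classifier. The least-squares refit and random initialization are stand-ins for the boosting version described above.

    import numpy as np

    def mpl_predict(classifiers, x):
        """MPL prediction: an example is positive if ANY of the K classifiers fires."""
        return max(w @ x for w in classifiers)

    def train_mpl(X_pos, X_neg, K=3, rounds=10, seed=0):
        """Toy alternation: assign each positive to its best classifier,
        then refit each classifier on its positives vs. all negatives."""
        rng = np.random.default_rng(seed)
        d = X_pos.shape[1]
        classifiers = [rng.normal(size=d) for _ in range(K)]
        for _ in range(rounds):
            assign = np.array([np.argmax([w @ x for w in classifiers]) for x in X_pos])
            for k in range(K):
                Xk = X_pos[assign == k]
                if len(Xk) == 0:
                    continue  # no positives currently assigned to this classifier
                X = np.vstack([Xk, X_neg])
                y = np.hstack([np.ones(len(Xk)), -np.ones(len(X_neg))])
                classifiers[k] = np.linalg.lstsq(X, y, rcond=None)[0]
        return classifiers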
MPL: Results
• Some preliminary “toy” results on LFW and MNIST
[Figures: results on LFW and MNIST]
Other MPL Results
• Same algorithm was published in
T-K. Kim and R. Cipolla. MCBoost: Multiple Classifier Boosting for Perceptual Co-clustering of
Images and Visual Features, NIPS, 2008.
(developed independently)
• Recently, MPL evaluated in pedestrian detection:
“In general however, MPLBoost seemed to be the most
robust classifier with respect to challenging lighting
conditions while being computationally less expensive than
SVMs.”
C. Wojek, S. Walk, B. Schiele. Multi-Cue Onboard Pedestrian Detection, CVPR, 2009
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
Learning Super-categories
• Application: classifying images into multiple
categories
• Idea: use training data to learn metric, plug
into kNN
Multiple Similarity Learning (MuSL)
• Learn a single global similarity metric
[Figure: Monolithic: one similarity metric compares a query image to the labeled dataset across Categories 1-4]
[Jones et al. ‘03, Chopra et al. ‘05, Goldberger et al. ‘05, Shakhnarovich et al. ’05, Weinberger et al. ‘08, Torralba et al. ’08, McFee et al. ‘10]
Multiple Similarity Learning (MuSL)
• Learn a similarity metric for each category (1-vs-all)
[Figure: Category-specific: one similarity metric per category compares the query image to the labeled dataset]
[Varma et al. ‘07, Frome et al. ‘07, Weinberger et al. ‘08, Nilsback et al. ’08]
How many should we train?
• Per category:
– More powerful
– Do we really need thousands of metrics?
– Have to train for new categories
• Global/Monolithic:
– Less powerful
– Can generalize to new categories
[Ramanan & Baker, 2010]
Multiple Similarity Learning (MuSL)
• Would like to explore space between two
extremes
• Idea:
– Group categories into super-categories
– Learn a few similarity metrics, one for each supercategory
[Work with Steve Branson & Serge Belongie, ICCV ‘09]
Multiple Similarity Learning (MuSL)
• Learn a few good similarity metrics
[Figure: MuSL sits between the category-specific and monolithic extremes: a few metrics shared across Categories 1-4]
Learning a Similarity Metric
• Training data: images $x_i$ with category labels $c_i$
• Can treat the problem as binary classification over pairs:
  ((image A, image B), 1) if same category; ((image A, image B), 0) otherwise
• Confidence/score from the classifier -> similarity (sketch below)
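A hedged sketch of this reduction: sample pairs labeled 1 (same category) / 0 (different), train a linear classifier on a joint feature of the pair, and reuse its score as a similarity for nearest-neighbor classification. The elementwise absolute difference as the pair feature and the 1-NN rule are assumptions made for the example, not the features or classifier used in MuSL.

    import numpy as np

    def pair_feature(xa, xb):
        # One simple joint feature for an image pair (illustrative choice).
        return np.abs(xa - xb)

    def make_pairs(X, labels, n_pairs, seed=0):
        """Sample training pairs: target 1 if same category, 0 otherwise."""
        rng = np.random.default_rng(seed)
        feats, targets = [], []
        for _ in range(n_pairs):
            i, j = rng.integers(len(X), size=2)
            feats.append(pair_feature(X[i], X[j]))
            targets.append(1.0 if labels[i] == labels[j] else 0.0)
        return np.array(feats), np.array(targets)

    def similarity(w, xa, xb):
        """Classifier confidence on the pair, used directly as a similarity score."""
        return w @ pair_feature(xa, xb)

    def knn_predict(w, X_train, labels_train, query):
        # 1-NN under the learned similarity; w can come from any binary classifier,
        # e.g. w = np.linalg.lstsq(feats, 2 * targets - 1, rcond=None)[0]
        sims = np.array([similarity(w, query, x) for x in X_train])
        return labels_train[int(np.argmax(sims))]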
Multiple Similarity Learning (MuSL)
• Goal: train a small number of similarity metrics $s_1, \ldots, s_K$
• and recover a mapping $g$ from each category to one of the metrics
• At runtime, to compute the similarity of a query image to category $c$, use metric $s_{g(c)}$
Naïve Solution
• Run pre-processing to group categories (e.g. k-means), then train as usual (sketch below)
• Drawbacks:
  – Hacky / not elegant
  – Not optimal: the pre-processing is not informed by class confusions, etc.
• How can we train & group simultaneously?
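The naive baseline from this slide is easy to sketch. The assumption here is that each category is represented by the mean of its feature vectors, which is then clustered with k-means; one metric would be trained per cluster (not shown). This is only the pre-processing baseline, not MuSL's joint training.

    import numpy as np

    def kmeans(points, k, iters=20, seed=0):
        """Plain k-means returning a cluster index per point."""
        points = np.asarray(points, dtype=float)
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        assign = np.zeros(len(points), dtype=int)
        for _ in range(iters):
            assign = np.argmin(((points[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = points[assign == j].mean(axis=0)
        return assign

    def naive_grouping(category_means, k):
        """Pre-processing step: group categories by k-means on their mean feature vectors."""
        return kmeans(category_means, k)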
MuSL Boosting
• Objective (schematically): choose the metrics and the category-to-metric assignment to minimize
  $\sum_c \ell_c\big(s_{g(c)}\big)$
  where $\ell_c(s_k)$ measures how well metric $s_k$ works with category $c$
MuSL Results
• Created a dataset with a hierarchical structure of categories
[Plot: accuracy vs. K (number of classifiers) for MuSL+retrain, MuSL, k-means, Rand, Monolithic, and Per Cat]
Merged categories from:
• Caltech 101 [Griffin et al.]
• Oxford Flowers [Nilsback et al.]
• UIUC Textures [Lazebnik et al.]
Recovered Super-categories
[Figure: super-categories recovered by k-means vs. MuSL]
Generalizing to New Categories
[Plots: performance on new categories only, and on both new and old categories]
• Training more metrics overfits!
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
PAC Analysis
• Probably Approximately Correct (PAC)
– Unknown but fixed distribution of data
– Given i.i.d. training examples to learner
– Learner returns classifier
– What can we say about generalization/test error
of a classifier?
[Valiant ‘84]
PAC Bound
• Generalization (test) error ≤ empirical (train) error + complexity term
• The complexity term depends on the number of training examples (see the generic form below)
• Sample complexity: how many examples you need to guarantee a certain error rate
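For concreteness, a typical shape of such a bound (constants and exact form vary across statements; this is the generic template, not the specific bound used later in the talk):

    \[
    \Pr\!\left[\; \mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}(h)
      \;+\; c\,\sqrt{\frac{d\,\ln n + \ln(1/\delta)}{n}} \;\right] \;\ge\; 1 - \delta,
    \]

where $d$ measures the complexity of the hypothesis class (e.g. its VC dimension), $n$ is the number of training examples, $\delta$ is the failure probability, and $c$ is a constant.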
PAC Analysis of MIL
• Bound bag generalization error in terms of
empirical error
• Data model (bottom up)
– Draw instances and their labels from fixed
distribution
– Create bag from instances, determine its label
(max of instance labels)
– Return bag & bag label to learner
Data Model
[Figure: instance space with positive and negative instances; Bag 1: positive, Bag 2: positive, Bag 3: negative]
PAC Analysis of MIL
• Blum & Kalai (1998)
  – If: access to a noise-tolerant instance learner, instances drawn independently
  – Then: bag sample complexity linear in the bag size
• Sabato & Tishby (2009)
  – If: can minimize empirical error on bags
  – Then: bag sample complexity logarithmic in the bag size
• Disconnect between theory and applications in computer vision etc.
MIL Example: Face Detection (Images)
[Figure: whole images shown as bags of patches, labeled +, +, -]
Bag: whole image
Instance: image patch
[Andrews et al. ’02, Viola et al. ’05, Dollar et al. ’08, Galleguillos et al. ’08]
MIL Example: Phoneme Detection (Audio)
Detecting the ‘sh’ phoneme
[Figure: waveforms of the words “machine” (+) and “learning”]
Bag: audio of a word
Instance: audio clip
[Mandel et al. ‘08]
MIL Example: Event Detection (Video)
[Figure: a video labeled + shown as a bag of short clips]
Bag: video
Instance: few frames
[Ali et al. ‘08, Buehler et al. ’09, Stikic et al. ‘09]
Observations for these applications
• Top down process: draw entire bag from a bag
distribution, then get instances
• Instances of a bag lie on a manifold
[Work with Nakul Verma, Piotr Dollar and Serge Belongie, ICML 2011]
Manifold Bags
[Figure: instance space with a positive region and a negative region; each bag is a low-dimensional manifold]
Manifold Bags
• For such problems:
– Existing analysis not appropriate because number
of instances is infinite
– Expect sample complexity to scale with manifold
parameters (curvature, dimension, volume, etc)
Manifold Bags: Formulation
• Manifold bag $b$ drawn from a bag distribution
• Instance hypotheses: $h \in \mathcal{H}$, defined on instances
• Corresponding bag hypotheses: $\overline{h}(b) = \max_{x \in b} h(x)$, forming the class $\overline{\mathcal{H}}$
Typical Route: VC Dimension
• VC Dimension: characterizes the power/complexity of the hypothesis class [Vapnik & Chervonenkis, ‘71]
• Error Bound: with high probability,
  generalization error ≤ empirical error + $\tilde{O}\Big(\sqrt{VC(\overline{\mathcal{H}})\,/\,n}\Big)$
  where $n$ is the number of training bags and $VC(\overline{\mathcal{H}})$ is the VC dimension of the bag hypothesis class
Relating $VC(\overline{\mathcal{H}})$ to $VC(\mathcal{H})$
• We do have a handle on $VC(\mathcal{H})$
• For finite-sized bags, Sabato & Tishby bound $VC(\overline{\mathcal{H}})$ in terms of $VC(\mathcal{H})$ and the bag size
• Question: can we assume manifold bags are smooth and use a covering argument?
• Answer: No! Can show $VC(\overline{\mathcal{H}})$ is unbounded even if $\mathcal{H}$ is very simple and bags are smooth
Issue
• Bag hypothesis class too powerful
  – For a positive bag, only one instance needs to be classified as positive
  – A negligible part of the bag can be responsible for the positive label
  – Infinitely many instances -> too much flexibility for the bag hypothesis
• Would like to ensure a non-negligible portion of positive bags is labeled positive
Solution
• Switch to real-valued hypothesis class
• Incorporate a notion of margin
• Intuition: if bag is classified with large margin,
and hypotheses are smooth, then large
portion of the bag is classified positive
Fat-shattering Dimension
• $\mathrm{fat}_\gamma(\mathcal{H})$ = “fat-shattering” dimension of a real-valued hypothesis class
  – Analogous to VC dimension
• Relates generalization error to empirical error at margin $\gamma$ (schematic form below)
  – i.e. not only does the binary label have to be correct, the margin has to be at least $\gamma$
[Anthony & Bartlett ‘99]
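Schematically, omitting constants and logarithmic factors (see Anthony & Bartlett for the precise statement), such a margin bound has the shape:

    \[
    \mathrm{err}(h) \;\le\; \widehat{\mathrm{err}}_{\gamma}(h)
      \;+\; \tilde{O}\!\left(\sqrt{\frac{\mathrm{fat}_{\gamma}(\mathcal{H})}{n}}\right),
    \]

where $\widehat{\mathrm{err}}_{\gamma}$ counts a training example as an error unless it is classified correctly with margin at least $\gamma$.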
Fat-shattering of Manifold Bags
• Error Bound: with high probability,
  generalization error ≤ empirical error at margin $\gamma$ + $\tilde{O}\Big(\sqrt{\mathrm{fat}_\gamma(\overline{\mathcal{H}})\,/\,n}\Big)$
  where $n$ is the number of training bags and $\mathrm{fat}_\gamma(\overline{\mathcal{H}})$ is the fat-shattering dimension of the bag hypothesis class
• Bound $\mathrm{fat}_\gamma(\overline{\mathcal{H}})$ in terms of $\mathrm{fat}_\gamma(\mathcal{H})$
  – Use covering arguments – approximate the manifold with a finite number of points
  – Analogous to Sabato & Tishby’s analysis of finite-size bags
Error Bound
• With high probability:
  generalization error ≤ empirical error at margin + complexity term
• The complexity term depends on:
  – the fat-shattering dimension of the instance hypothesis class
  – the number of training bags
  – the manifold dimension
  – the manifold volume
  – smoothness parameters (e.g. curvy manifold bags -> high complexity)
• Obvious strategy for the learner:
  – Minimize empirical error & maximize margin
  – This is what most MIL algorithms already do
Learning from Queried Instances
• Previous result assumes the learner has access to the entire manifold bag
• In practice the learner will only access a small number of instances per bag
[Figure: a few instances queried from a manifold bag]
• Not enough instances -> might not draw a positive instance from a positive bag
Learning from Queried Instances
• The bound above still holds, with the failure probability increased by a term that shrinks as the number of queried instances per bag grows
Take-home Message
• Increasing the number of training bags reduces the complexity term
• Increasing the number of queried instances per bag reduces the failure probability
  – Seems to contradict previous results (smaller bag size is better)
  – Important difference between the bag size and the number of queried instances!
  – If the number of queried instances is too small we may only get negative instances from a positive bag
• Increasing the number of training bags requires extra labels; increasing the number of queried instances per bag does not
Iterative Querying Heuristic (IQH)
• Problem: want many instances/bag, but have
computational limits
• Heuristic solution:
– Grab small number of instances/bag, run standard
MIL algorithm
– Query more instances from each bag, only keep
the ones that get high score from current classifier
• At each iteration, train with small # of
instances
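A hedged sketch of the iterative querying idea described above. `train_mil` (any MIL trainer returning a scoring function) and `query_fn` (a way to pull fresh instances from a bag) are hypothetical placeholders, and the counts are illustrative; in practice one might restrict the keep-top-scoring step to positive bags.

    def iterative_querying(bags, labels, query_fn, train_mil, n_query=10, n_keep=5, iters=3):
        """Iterative Querying Heuristic (sketch).
        bags:      working set, a list of lists of instances (small per bag)
        labels:    bag labels
        query_fn:  query_fn(bag_index, n) returns n fresh instances from that bag
        train_mil: any MIL trainer returning a scoring function h(x) -> float"""
        h = train_mil(bags, labels)
        for _ in range(iters):
            for i in range(len(bags)):
                fresh = query_fn(i, n_query)                # query more instances
                scored = sorted(bags[i] + fresh, key=h, reverse=True)
                bags[i] = scored[:n_keep]                   # keep only the top-scoring ones
            h = train_mil(bags, labels)                     # retrain with small bags
        return h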
Experiments
• Synthetic Data (will skip in interest of time)
• Real Data
– INRIA Heads (images)
– TIMIT Phonemes (audio)
INRIA Heads
[Figure: head annotations with padding of 16 and 32 around the head]
[Dalal et al. ‘05]
TIMIT Phonemes
[Figure: waveforms of the words “machine” (+) and “learning”]
[Garofolo et al., ‘93]
Padding (volume)
[Plots: results vs. amount of padding, for INRIA Heads and TIMIT Phonemes]
Number of Instances
[Plots: results vs. number of queried instances per bag, for INRIA Heads and TIMIT Phonemes]
Number of Iterations (heuristic)
[Plots: results vs. number of IQH iterations, for INRIA Heads and TIMIT Phonemes]
Outline
• Overview
– Supervised Learning
– Weakly Supervised Learning
• Weakly Labeled Location
– Object Localization and Recognition
– Object Detection with Parts
– Object Tracking
• Weakly Labeled Categories
– Object Detection with Sub-categories
– Object Recognition with Super-categories
• Theoretical Analysis of Multiple Instance Learning
• Conclusions & Future Work
Future Work: Vision
• Strong supervision is becoming cheaper
(thanks to Amazon Mechanical Turk, etc)
– Ambiguity issues still exist (e.g. which bird parts
should you ask to be labeled?)
• Would be nice to combine strong and weak
supervision, add active element
• Would be nice to (automatically) learn better
low/mid level features
Future Work: Learning
• Manifold properties
– Would be nice to develop algorithms that take
more advantage of the manifold structure of the
data
Thanks!
• All of my collaborators:
– Piotr Dollar, Nakul Verma, Carolina Galleguillos,
Andrew Rabinovich, Steve Branson, Zhuowen Tu,
Pietro Perona, Ming-Hsuan Yang, Kai Wang,
Catherine Wah, Peter Welinder, Serge Belongie
• My committee:
– Serge Belongie, David Kriegman, Lawrence Saul,
Virginia de Sa & Gert Lanckriet
• Funding:
– NSF IGERT, Google