Neural Networks for Machine Learning

RECOGNIZING HUMAN-OBJECT INTERACTIONS IN STILL IMAGES
BY MODELING THE MUTUAL CONTEXT
OF OBJECTS AND HUMAN POSES
Date: 2013/05/27
Instructor: Prof. Wang, Sheng-Jyh
Student: Hung, Fei-Fan
Paper: Yao, B., and Fei-Fei, L., IEEE Transactions on PAMI (2012)
2
Outline
• Introduction
• Intuition and goal
• Model Representation
• Model Learning
  • Obtaining Atomic Poses
  • Training Detectors and Classifiers
  • Estimating Model Parameters
• Model Inference
• Experimental Results
• Conclusion
4
Why use context in computer vision?
• Simple images vs. human activities
[Figure: comparison panels labeled "without context", "with context", and "with mutual context"; one annotation reads "~3-4%".]
5
Challenges in Human Pose Estimation
• Human pose estimation is challenging:
  • Difficult body-part appearance
  • Self-occlusion
  • Image regions that look like a body part
• Object detection facilitates human pose estimation
6
Challenges in Object Detection
• Object detection is challenging:
  • Objects that are small, low-resolution, or partially occluded
  • Image regions similar to the detection target
• Human pose estimation facilitates object detection
7
The Goal
• To build a mutual context model for Human-Object Interaction (HOI) activities
9
Model representation
• Modeling the mutual context of objects and human poses
• A: activity, e.g., tennis forehand, croquet shot, volleyball smash
• O: objects, e.g., tennis racket, tennis ball, croquet mallet, volleyball
  • $O^1, O^2, \dots, O^M$, where $M$ is the number of bounding boxes
• H: human pose; an activity A can contain more than one atomic pose H
• P: body parts, $P^1, P^2, \dots, P^L$
10
Model representation
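[Figure: graphical model with nodes for the activity A, the human pose H, the objects O1 and O2, and the body parts P1, P2, ..., PL.]
• $\phi_1$: co-occurrence compatibility between A, O, H
• $\phi_2$: spatial relationship between O, H
• $\phi_3$ to $\phi_5$: modeling the image evidence with detectors or classifiers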
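As a rough sketch of how these pieces fit together, the overall score that the inference slides call Ψ(A, O, H, I) can be read as a sum of the five potentials. The additive form is an assumption (it is the usual choice for a conditional random field), and the argument lists of φ3 to φ5 are taken from the "Training Detectors and Classifiers" slide.

```latex
\Psi(A, O, H, I) = \phi_1(A, O, H) + \phi_2(O, H) + \phi_3(O, I) + \phi_4(H, I) + \phi_5(A, I)
```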
11
$\phi_1$: Co-occurrence context
• Co-occurrence between all A, O, H
• $\varsigma_{i,j,k}$: strength of the co-occurrence interaction between $h_i$, $o_j$, $a_k$
• $\mathbf{1}(\cdot)$: indicator function
• $N_h$: total number of atomic poses
• $N_o$: total number of objects
• $N_a$: total number of activity classes
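The formula itself did not survive on this slide. One form that is consistent with the symbols defined above, offered as a reconstruction rather than a quotation, sums the interaction strength (written here as \varsigma) over every combination of atomic pose, bounding box, object class, and activity class, with the indicator functions selecting the combination that is actually present. The bounding-box index m and the count M come from the "Model representation" slide.

```latex
\phi_1(A, O, H) = \sum_{i=1}^{N_h} \sum_{m=1}^{M} \sum_{j=1}^{N_o} \sum_{k=1}^{N_a}
  \mathbf{1}(H = h_i)\, \mathbf{1}(O^m = o_j)\, \mathbf{1}(A = a_k)\, \varsigma_{i,j,k}
```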
12
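$\phi_2$: Spatial context
• Spatial relationship between all O and different H
• $\boldsymbol{x}_I^l$: location of body part $l$ in image $I$
• $\lambda_{i,j,k}$: weight of $b(\boldsymbol{x}_I^l, O^m)$ for $O^m = o_j$
• $b(\boldsymbol{x}_I^l, O^m)$: a sparse binary vector that shows the relative location of $O^m$ w.r.t. $\boldsymbol{x}_I^l$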
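To make the sparse binary vector b concrete, here is a minimal Python sketch. The slide does not say how relative location is discretized, so the square grid, the `n_bins` and `extent` parameters, and the function name `relative_location_vector` are all assumptions made for illustration.

```python
def relative_location_vector(part_xy, obj_xy, n_bins=3, extent=1.0):
    """One-hot encoding of where an object lies relative to a body part.

    Assumption: the displacement is quantized on an n_bins x n_bins grid
    covering [-extent, extent] in x and y; out-of-range values are clamped.
    """
    def bin_index(delta):
        # Map delta in [-extent, extent] to an integer bin 0..n_bins-1.
        scaled = (delta + extent) / (2.0 * extent) * n_bins
        return min(max(int(scaled), 0), n_bins - 1)

    col = bin_index(obj_xy[0] - part_xy[0])
    row = bin_index(obj_xy[1] - part_xy[1])
    vec = [0] * (n_bins * n_bins)
    vec[row * n_bins + col] = 1  # exactly one entry is set, so the vector is sparse
    return vec


# Object to the upper right of the body part (image y grows downward).
b = relative_location_vector((0.0, 0.0), (0.8, -0.8))
```

Because only one entry is non-zero, the product of a weight vector with b simply picks out the weight of the occupied cell.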
13
$\phi_3$: Modeling objects
• Model O in the image I using object detection scores
• For each object $O^m$:
  • $g(O^m)$: vector of scores of detecting $O^m$
  • $\gamma_j$: weight of $g(O^m)$ for $O^m = o_j$
• Between $O^m$ and $O^{m'}$:
  • $b(O^m, O^{m'})$: binary feature vector
  • $\gamma_{j,j'}$: weight for $o_j$ and $o_{j'}$
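The equation for this potential is missing from the transcript. A form consistent with the bullets above, given as a reconstruction, has a unary term per bounding box and a pairwise term per pair of boxes. Restricting the pairwise sum to m' ≠ m is an assumption.

```latex
\phi_3(O, I) = \sum_{m=1}^{M} \sum_{j=1}^{N_o} \mathbf{1}(O^m = o_j)\, \gamma_j^{T} g(O^m)
  + \sum_{m=1}^{M} \sum_{m' \neq m} \sum_{j=1}^{N_o} \sum_{j'=1}^{N_o}
    \mathbf{1}(O^m = o_j)\, \mathbf{1}(O^{m'} = o_{j'})\, \gamma_{j,j'}^{T}\, b(O^m, O^{m'})
```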
14
$\phi_4$: Modeling human pose
• Model the atomic pose that H belongs to and the likelihood $P(I \mid h_i)$
• $P(\boldsymbol{x}_I^l \mid \boldsymbol{x}_{h_i}^l)$: Gaussian likelihood function
• $f^l(I)$: vector of scores of detecting body part $l$ at $\boldsymbol{x}_I^l$
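The equation is missing here as well. One plausible reconstruction combines the two quantities above for every body part of the selected atomic pose. Pairing the weights α with the Gaussian likelihood and β with the part detector scores is an assumption; the slides only name α and β later, as parameters to be estimated.

```latex
\phi_4(H, I) = \sum_{i=1}^{N_h} \sum_{l=1}^{L} \mathbf{1}(H = h_i)
  \left( \alpha_{i,l}^{T}\, P(\boldsymbol{x}_I^l \mid \boldsymbol{x}_{h_i}^l) + \beta_{i,l}^{T}\, f^l(I) \right)
```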
15
$\phi_5$: Modeling activity
• Model the HOI activity by training an activity classifier
• $s(I)$: $N_a$-dimensional output of a one-versus-all (OVA) discriminative classifier that takes the image as features
• $\eta_k$: feature weight of $a_k$
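A form consistent with these two definitions, offered as a reconstruction since the slide's own equation was lost, weights the classifier output by the vector that belongs to the selected activity class:

```latex
\phi_5(A, I) = \sum_{k=1}^{N_a} \mathbf{1}(A = a_k)\, \eta_k^{T} s(I)
```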
17
Model Properties
• Spatial context between O and H
• Object detection and human pose estimation facilitate each other
• Ignores the objects and body parts that are unreliable
• Flexible to extend to large-scale datasets and other activities
• The joint model can share all objects and atomic poses across activities
19
Model Learning
• Assign each human pose to an atomic pose
• Train detectors and classifiers
• Estimate parameters by maximum likelihood
20
Obtaining Atomic Poses
• Using clustering to obtain atomic poses
• Normalize the annotations
  • $\boldsymbol{x}^1, \boldsymbol{x}^2, \dots, \boldsymbol{x}^L$
• Finding missing parts
  • Using the nearest visible neighbor
• Obtain a set of atomic poses
  • Hierarchical clustering with maximum linkage
  • Distance measure: $\sum_{l=1}^{L} \boldsymbol{w}^T |\boldsymbol{x}_i^l - \boldsymbol{x}_j^l|$
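The clustering step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: each pose is a list of 2-D part positions, `w` is a per-coordinate weight vector, and clustering stops at a fixed `n_clusters` (the slide does not state the stopping rule). The names `pose_distance` and `complete_linkage` are hypothetical.

```python
def pose_distance(pose_a, pose_b, w):
    """Distance measure from the slide: sum over parts l of w^T |x_a^l - x_b^l|."""
    return sum(
        sum(wk * abs(a - b) for wk, a, b in zip(w, part_a, part_b))
        for part_a, part_b in zip(pose_a, pose_b)
    )


def complete_linkage(poses, w, n_clusters):
    """Agglomerative clustering with maximum (complete) linkage."""
    clusters = [[i] for i in range(len(poses))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Maximum linkage: the cluster distance is the largest
                # distance between any pair of members.
                d = max(pose_distance(poses[i], poses[j], w)
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Four toy poses with two body parts each (made-up coordinates).
poses = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.1, 0.0), (1.0, 1.1)],
    [(5.0, 5.0), (6.0, 6.0)],
    [(5.0, 5.2), (6.1, 6.0)],
]
clusters = complete_linkage(poses, w=(1.0, 1.0), n_clusters=2)
```

Each resulting cluster would then stand for one atomic pose. The double loop is quadratic per merge, which is fine for a sketch but slow for large annotation sets.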
21
Training Detectors and Classifiers
• $g(O^m)$: object detector in $\phi_3(O, I)$
• $f^l(I)$: human body part detector in $\phi_4(H, I)$
  → Deformable part model
• $s(I)$: overall activity classifier in $\phi_5(A, I)$
  → Spatial pyramid matching (SPM)
  → SIFT + 3-level image pyramid
24
Estimating Model Parameters
• Estimate $\varsigma, \lambda, \gamma, \alpha, \beta$ using a maximum likelihood (ML) approach with a zero-mean Gaussian prior
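One common way to write such an objective is shown below as a sketch. It assumes a conditional random field likelihood built from the overall score Ψ, with θ collecting all the weights, N training images, and a prior variance σ² that the slide does not specify.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(A_n, O_n, H_n \mid I_n, \theta)
  - \frac{\lVert \theta \rVert^{2}}{2\sigma^{2}},
\qquad
p(A, O, H \mid I, \theta) = \frac{\exp \Psi(A, O, H, I)}{\sum_{A', O', H'} \exp \Psi(A', O', H', I)}
```

The second term is the log of the zero-mean Gaussian prior up to a constant, so it acts as an L2 penalty on the weights.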
25
Learning result
27
Model Inference
• Input: new image
• Initialize with learned results
• Update human body parts
• Update object detection results
• Update A and H labels
28
Initialization
• New image: initialize with learned results
• A: activity classification (SPM classification)
• O: object detection
• H: human pose estimation (pictorial structure model)
29
Update model inference
• Update human body parts
  • Marginal distribution of human pose: $p(H = h_i)$
  • Using a mixture of Gaussians to refine the prior of each body part: $\sum_{i=1}^{N_h} p(H = h_i)\, \mathcal{N}(\boldsymbol{x}_{h_i}^l)$
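The mixture prior can be evaluated with a few lines of Python. This sketch assumes 2-D part locations and one isotropic Gaussian per atomic pose with a shared standard deviation `sigma`; the slide does not give the covariance, and the function names are hypothetical.

```python
import math


def gaussian_2d(x, mean, sigma):
    """Isotropic 2-D Gaussian density (sigma is an assumed shared std. dev.)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    norm = 2.0 * math.pi * sigma ** 2
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)) / norm


def refined_part_prior(x, pose_marginals, part_means, sigma=1.0):
    """Mixture prior for one body part: sum_i p(H = h_i) * N(x; mean_i, sigma)."""
    return sum(p * gaussian_2d(x, mean, sigma)
               for p, mean in zip(pose_marginals, part_means))


# Two atomic poses: the part is expected near (0, 0) or near (4, 0).
marginals = [0.7, 0.3]
means = [(0.0, 0.0), (4.0, 0.0)]
near_likely_pose = refined_part_prior((0.0, 0.0), marginals, means)
near_unlikely_pose = refined_part_prior((4.0, 0.0), marginals, means)
```

Locations favored by the more probable atomic pose receive a higher prior, which is how the pose marginals can steer the body part update.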
30
Update model inference
• Update object detection results
  • Potentials involved: those over (O, H), (O, A, H), and (O, I)
  • Greedy forward search method:
    • Initialize the score of every pair $(m, j)$, with no object assigned to any bounding box
    • Select $(m^*, j^*) = \arg\max$ of the score over $(m, j)$
    • Label box $m^*$ as $o_{j^*}$
    • Update the scores of the remaining $(m, j)$
    • Stop when the score of $(m^*, j^*)$ is below 0
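The greedy forward search can be sketched as follows. The callback `gain(m, j, labels)` is a hypothetical stand-in for the change in model score when box m is labeled as object class j; the slides do not define it, and the toy `example_gain` below is invented purely to exercise the loop.

```python
def greedy_forward_search(n_boxes, n_objects, gain):
    """Label bounding boxes one at a time while the best gain is non-negative.

    `gain(m, j, labels)` is a hypothetical callback returning the score change
    for labeling box m as class j, given the labels chosen so far.
    """
    labels = {}  # box index -> object class; starts with no object assigned
    while len(labels) < n_boxes:
        candidates = [(gain(m, j, labels), m, j)
                      for m in range(n_boxes) if m not in labels
                      for j in range(n_objects)]
        best_gain, m_star, j_star = max(candidates)
        if best_gain < 0:
            break  # stopping rule from the slide: best score below 0
        labels[m_star] = j_star
    return labels


# Toy detection scores for 3 boxes and 2 object classes (made-up numbers).
unary = [[2.0, -1.0], [1.5, 1.0], [-0.5, -2.0]]


def example_gain(m, j, labels):
    # Detection score minus a penalty if class j has already been used.
    penalty = 2.0 if j in labels.values() else 0.0
    return unary[m][j] - penalty


labels = greedy_forward_search(3, 2, example_gain)
```

Recomputing every candidate after each assignment corresponds to the "update" step: earlier choices change the gain of later ones.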
31
Update model inference
• Update A and H labels
  • Enumerate possible A and H labels
  • Optimize $\Psi(A, O, H, I)$
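Since the label sets are small, this step can be done by brute force. The sketch below assumes a hypothetical callback `psi(a, h)` that evaluates the model score with the objects and image evidence held fixed; the toy score table is invented for illustration.

```python
from itertools import product


def best_activity_and_pose(activities, poses, psi):
    """Return the (activity, atomic pose) pair with the highest score.

    `psi(a, h)` is a hypothetical callback for the model score with the
    object labels and image evidence held fixed.
    """
    return max(product(activities, poses), key=lambda pair: psi(*pair))


# Toy score table indexed by (activity, atomic pose); numbers are made up.
toy_scores = {
    ("tennis forehand", 0): 1.2, ("tennis forehand", 1): 0.4,
    ("croquet shot", 0): -0.3, ("croquet shot", 1): 2.5,
}
best = best_activity_and_pose(
    ["tennis forehand", "croquet shot"], [0, 1],
    lambda a, h: toy_scores[(a, h)],
)
```

The cost is the number of activity classes times the number of atomic poses, one score evaluation each.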
33
Experimental Results (Sports Dataset)
34
Experimental Results (Sports Dataset)
35
Experimental Results (Sports Dataset)
• Activity classification
36
37
Experimental results (PPMI Dataset)
38
Experimental results (PPMI Dataset)
39
41
Conclusion
• Mutual context can significantly improve performance in difficult visual recognition problems
• The joint model can share all the information
• Requires annotating all the human body parts and objects in the training images
42
References
• B. Yao and L. Fei-Fei, “Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
• B. Yao and L. Fei-Fei, “Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
• B. Sapp, A. Toshev, and B. Taskar, “Cascade Models for Articulated Pose Estimation,” Proc. European Conf. Computer Vision, 2010.
• S. Lazebnik, C. Schmid, and J. Ponce, “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories,” Proc. IEEE CS Conf. Computer Vision and Pattern Recognition, 2006.
• http://en.wikipedia.org/wiki/Hierarchical_clustering