Human pose recognition


1. Introduction
2. Article [1]
Real Time Motion Capture Using a Single
TOF Camera (2010)
3. Article [2]
Real-Time Human Pose Recognition in
Parts from Single Depth Images (2011)
Fig From [2]
Why do we need this?
Robotics
Smart surveillance
Virtual reality
Motion analysis
Gaming - Kinect
Microsoft Xbox 360 console
“You are the controller”
Launched: 4 November 2010
Sold over 8M units in its first 60 days on the market!
(Guinness world record)
http://www.youtube.com/watch?v=p2qlHoxPioM
Mocap using markers – expensive.
Multi-view camera systems – limited applicability.
Monocular – simplified problems.
Time-of-flight (TOF) camera:
Dense depth
High frame rate (100 Hz)
Robust to:
Lighting
shadows
other problems.
2.1 previous work
2.2 What’s new?
2.3 Overview
2.4 results
2.5 limitations & future work
2.6 Evaluation
Many, many articles…
(Moeslund et al. 2006 covered 350 articles…)
TOF technology
 Propagating information up the kinematic chain.
 Probabilistic model using the unscented transform.
Multiple GPUs.
1. Probabilistic Model
2. Algorithm Overview:
 Model Based Hill Climbing Search
 Evidence Propagation
 Full Algorithm
15 body parts
DAG – Directed Acyclic Graph
Pose $X_t = \{X_t^i\}_{i=1}^{N}$
DBN – Dynamic Bayesian Network
Speed $V_t$
Range scan $z_t$
Assumptions
$V_t \mid V_{t-1} \sim \mathcal{N}(V_{t-1}, \Sigma)$
$X_t$ follows deterministically from $V_t$ and $X_{t-1}$ (with probability 1)
Use ray casting to evaluate the model's distance from each measurement $z_k$.
Goal: find the most likely states, given the previous frame's MAP estimate, i.e.:
$\hat{X}_t, \hat{V}_t = \arg\max_{X_t, V_t} \log P(z_t \mid X_t, V_t) + \log P(X_t, V_t \mid \hat{X}_{t-1}, \hat{V}_{t-1})$
Fig From [1]
1. Hill climbing search (HC)
2. Evidence Propagation –EP
Grid around $P(V_t^i \mid V_{t-1}^i)$ (0.05 m)
Sample $V_t^i$
Evaluate likelihood
Choose best point!
Calculate $X_t^i$ from $V_t^i$, $\hat{X}_{t-1}$
Coarse-to-fine grids.
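The grid search described above can be sketched in a few lines. This is only an illustration, not the implementation from [1]: the 2-D toy state, the quadratic `log_likelihood`, the grid `radius` and the halving schedule are all assumptions. The 0.05 starting step echoes the value on the slide, whose exact role there is not stated.

```python
import itertools

def log_likelihood(v, target=(0.12, -0.07)):
    # Hypothetical stand-in for log P(z_t | X_t, V_t): it peaks at `target`.
    return -sum((a - b) ** 2 for a, b in zip(v, target))

def hill_climb(start, step=0.05, levels=4, radius=2, score=log_likelihood):
    """Coarse-to-fine grid hill climbing around `start`."""
    best = tuple(start)
    best_score = score(best)
    for _ in range(levels):
        offsets = [k * step for k in range(-radius, radius + 1)]
        improved = True
        while improved:
            improved = False
            center = best
            # Sample every grid point around the current best estimate.
            for delta in itertools.product(offsets, repeat=len(center)):
                cand = tuple(c + d for c, d in zip(center, delta))
                s = score(cand)
                if s > best_score:  # keep the best point seen so far
                    best, best_score = cand, s
                    improved = True
        step /= 2.0  # move to a finer grid
    return best, best_score

best_v, best_s = hill_climb((0.0, 0.0))
```

Because each grid level only accepts strict improvements, the search stops at the nearest local optimum, which is the weakness listed under "The bad".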
Fig From [1]
The good:
Simple
Fast
Runs in parallel on GPUs
The bad:
Local optima
Ridges, plateaus, alleys
Can lose track when motion is fast or occlusions occur.
Also has 3 stages:
1. Body part detection (C. Plagemann et al 2010)
2. Probabilistic Inverse Kinematics
3. Data association and inference
Bottom up approach:
1. Locate interest points with AGEX –
Accumulative Geodesic Extrema.
2. Find orientation.
3. Classify the head, feet and hands using local shape
descriptors.
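Step 1 can be illustrated on a toy graph. The sketch below assumes the body surface is already available as a weighted graph; the construction of that graph from depth pixels in [3] is not reproduced. It repeatedly picks the node with the largest geodesic distance from everything found so far. Treating each found extremum as an additional Dijkstra source plays the role of a zero-cost edge back to the start. The star-shaped `body` graph and its node names are hypothetical.

```python
import heapq

def geodesic_distances(graph, sources):
    """Dijkstra started from several sources at once (all at distance 0)."""
    dist = {node: float("inf") for node in graph}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heapq.heappush(heap, (0.0, s))
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale queue entry
        for nbr, w in graph[node]:
            nd = d + w
            if nd < dist[nbr]:
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

def agex(graph, centroid, k):
    """Return k accumulative geodesic extrema, farthest first."""
    sources, extrema = [centroid], []
    for _ in range(k):
        dist = geodesic_distances(graph, sources)
        far = max(dist, key=dist.get)
        extrema.append(far)
        sources.append(far)  # later searches also start from this extremum
    return extrema

def make_star(limb_lengths):
    """Toy body: a centre node 'c' with chains of unit-length edges."""
    graph = {"c": []}
    for name, length in limb_lengths.items():
        prev = "c"
        for i in range(1, length + 1):
            node = f"{name}{i}"
            graph[node] = []
            graph[prev].append((node, 1.0))
            graph[node].append((prev, 1.0))
            prev = node
    return graph

body = make_star({"head": 5, "lhand": 6, "rhand": 7, "lfoot": 8, "rfoot": 9})
tips = agex(body, "c", 5)
```

On this toy body the five extrema are the limb tips. On a real mesh a limb that is short relative to the others can lose out to a mid-limb point, which is one reason the later classification step is needed.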
Fig From [3]
Results:
Fig From [3]
Body parts $\{p_i\}_{i=1}^{5}$ = {Head, Hands, Legs} of $X$
$\hat{p}_j$ ($j = 1, \dots, N$)
Assume correspondence $p_i \leftrightarrow \hat{p}_j$
Need a new MAP estimate conditioned on $\hat{X}_{t-1}$, $\hat{p}_j$.
Problem – $p_i(V_t, \hat{X}_{t-1}, \hat{V}_{t-1})$ isn't linear!
Solution: linearize with the unscented Kalman filter.
Easy to determine $P(V_t \mid \hat{V}_{t-1}, \hat{X}_{t-1}, \hat{p}_j)$.
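To show how a Gaussian is pushed through a nonlinear function without computing derivatives, here is a minimal 1-D sketch of the unscented transform, the building block of the unscented Kalman filter. It is not the filter from [1]: the scalar state, the `kappa = 2` setting and the arm-length example are assumptions chosen for illustration.

```python
import math

def unscented_transform(mean, var, f, kappa=2.0):
    """Propagate a 1-D Gaussian N(mean, var) through a nonlinear f."""
    n = 1  # state dimension of this toy example
    spread = math.sqrt((n + kappa) * var)
    # Three sigma points: the mean and one point on each side of it.
    points = [mean, mean + spread, mean - spread]
    weights = [kappa / (n + kappa)] + [1.0 / (2.0 * (n + kappa))] * 2
    ys = [f(p) for p in points]
    y_mean = sum(w * y for w, y in zip(weights, ys))
    y_var = sum(w * (y - y_mean) ** 2 for w, y in zip(weights, ys))
    return y_mean, y_var

# Hypothetical forward kinematics: hand height of a 0.6 m arm
# as a function of the shoulder angle (radians).
hand_mean, hand_var = unscented_transform(
    0.5, 0.04, lambda angle: 0.6 * math.sin(angle))
```

For a linear function the transform reproduces the exact mean and variance, which is a convenient sanity check.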
Flowchart: for the correspondences $\{(p_i, \hat{p}_j)\}$, test $X' > X_{best}$?
Experiments:
28 real depth image sequences.
Ground Truth - tracking markers.
$\epsilon_{avg} = \frac{1}{M} \sum_{i=1}^{M} \| m_i - \hat{m}_i \|$, where $m_i$ is the real marker position and $\hat{m}_i$ the estimated position
$\epsilon_{avg} < 0.1$ m: perfect tracks.
$\epsilon_{avg} > 0.3$ m: tracking failure.
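The average marker error is simple to compute. The sketch below assumes markers are given as (x, y, z) tuples in metres and uses the 0.1 m and 0.3 m thresholds from the slide. The direction of the two inequalities and the "intermediate" label for values in between are assumptions.

```python
import math

def average_marker_error(true_markers, estimated_markers):
    """Mean Euclidean distance between real and estimated marker positions."""
    if not true_markers or len(true_markers) != len(estimated_markers):
        raise ValueError("need two non-empty marker lists of equal length")
    total = sum(math.dist(m, m_hat)
                for m, m_hat in zip(true_markers, estimated_markers))
    return total / len(true_markers)

def track_quality(eps_avg, good=0.1, bad=0.3):
    """Label a frame by its average error (thresholds in metres)."""
    if eps_avg < good:
        return "perfect"
    if eps_avg > bad:
        return "failure"
    return "intermediate"  # label assumed; the slide names only the extremes

eps = average_marker_error([(0, 0, 0), (1, 0, 0)],
                           [(0, 0, 0.1), (1, 0.3, 0)])
```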
Compared 3 algorithms: EP, HC, HC+EP.
Best: HC+EP; worst: EP.
Runs close to real time.
HC: 6 frames per second.
HC+EP: 4-6 frames per second.
Fig From [1]
Extreme case – 27:
Lose track
HC
HC+EP
Fig From [1]
Limitations:
Manual initialization.
Cannot track more than one person at a time.
Using temporal data consumes more time and causes a
reinitialization problem.
Future work:
Improving the speed.
Combining with color cameras.
Fully automatic model initialization.
Tracking more than one person.
Well Written
Self Contained
Novel combination of existing parts
New technology
Achieving goals (real time)
Missing examples for the probabilistic model.
Not clear how $X_0$ is defined.
Extensively validated:
Data set and code available
Not enough visual examples in the article
No comparison to different algorithms
3.1 previous work
3.2 What’s new?
3.3 Overview
3.4 results
3.5 limitations & future work
3.6 Evaluation
 Same as Article [1].
 Using no temporal information – robust and
fast (200 frames per second).
 Object recognition approach.
 per pixel classification.
Large and highly varied training dataset.
Fig From [2]
1. Database construction
2. Body part inference and joint proposals:
Goals:
computational efficiency and robustness
Pose estimation often has to overcome a lack of training data…
Why?
Huge color and texture variability.
Computer simulations don't produce the range of volitional
motions of a human subject.
Fig From [2]
Fig From [2]
1. Body part labeling
2. Depth image features
3. Randomized decision forests
4. Joint position proposals
31 body parts labeled.
The problem can now be solved by an efficient
classification algorithm.
Fig From [2]
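Simple depth comparison features:
$f_\theta(I, x) = d_I\left(x + \frac{u}{d_I(x)}\right) - d_I\left(x + \frac{v}{d_I(x)}\right)$ (1)
$d_I(x)$ – depth at pixel $x$ in image $I$; offsets $\theta = (u, v)$
Normalization – depth invariant.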
computational efficiency:
no preprocessing.
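A minimal sketch of such a feature: it reads the depth at two probe positions, offset from pixel x by u and v, with each offset divided by the depth at x so that the probes shrink as the person moves away from the camera. The image is assumed to be a list of rows of depths in metres. `LARGE_DEPTH` stands in for the large constant that [2] assigns to background or off-image probes, and its value here is an assumption.

```python
LARGE_DEPTH = 1e6  # assumed value for background / off-image probes

def depth_at(image, pos):
    """Depth at a (row, col) position, or LARGE_DEPTH outside the image."""
    r, c = int(round(pos[0])), int(round(pos[1]))
    if 0 <= r < len(image) and 0 <= c < len(image[0]):
        return image[r][c]
    return LARGE_DEPTH

def depth_feature(image, x, u, v):
    """Difference of two depth probes whose offsets are scaled by 1/depth."""
    d = depth_at(image, x)
    probe_u = (x[0] + u[0] / d, x[1] + u[1] / d)
    probe_v = (x[0] + v[0] / d, x[1] + v[1] / d)
    return depth_at(image, probe_u) - depth_at(image, probe_v)

img = [[2.0] * 5 for _ in range(5)]
img[4][2] = 3.0
img[2][0] = 2.5
response = depth_feature(img, (2, 2), (4, 0), (0, -4))
```

Each feature needs only three image reads and a few arithmetic operations, which is where the computational efficiency comes from.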
Fig From [2]
How does it work?
Pixel x
Node = feature $f_\theta$ and a threshold $\tau$
Classify pixel $x$:
$P(c \mid I, x) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, x)$
Fig From [2]
Training Algorithm:
1M images – 2000 pixels per image
$\phi = (\theta, \tau)$
$\phi^* = \arg\max_{\phi} G(\phi)$
$G(\phi) = H(Q) - \sum_{s \in \{l, r\}} \frac{|Q_s(\phi)|}{|Q|} H(Q_s(\phi))$, where $H$ is the entropy
Training 3 trees, depth 20, 1M images ~ 1 day (1000-core cluster)
1M images × 2000 pixels × 2000 $f_\theta$ × 50 $\tau$ = $2 \times 10^{14}$ computations...
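The split criterion is the usual information gain: the entropy of the labels at a node minus the size-weighted entropy of the two children. The sketch below assumes Shannon entropy in bits, precomputed feature values and a "left if below threshold" rule. The tiny `samples` list is made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, threshold):
    """Gain of splitting (feature_value, label) pairs at `threshold`."""
    left = [lab for val, lab in samples if val < threshold]
    right = [lab for val, lab in samples if val >= threshold]
    gain = entropy([lab for _, lab in samples])
    for part in (left, right):
        if part:  # an empty child contributes nothing
            gain -= len(part) / len(samples) * entropy(part)
    return gain

def best_threshold(samples, candidates):
    """Pick the candidate threshold with the largest gain."""
    return max(candidates, key=lambda t: information_gain(samples, t))

samples = [(0.1, "hand"), (0.2, "hand"), (0.8, "head"), (0.9, "head")]
tau_star = best_threshold(samples, [0.15, 0.5, 0.85])

# Cost quoted on the slide: images x pixels x features x thresholds.
candidate_evaluations = 1_000_000 * 2000 * 2000 * 50
```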
Trained tree:
Fig From [2]
Local mode finding approach based on mean shift with a
weighted Gaussian kernel.
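Density estimator:
$f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\left(-\left\| \frac{\hat{x} - \hat{x}_i}{b_c} \right\|^2\right)$
$w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2$
($\hat{x}_i$ is pixel $x_i$ reprojected into world space; $b_c$ is the bandwidth for part $c$.)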
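Mode finding with a weighted Gaussian kernel can be sketched as below. Each iteration moves the estimate to the kernel-weighted average of the points, which is the mean shift update for this kernel. The sample points, probabilities, bandwidth and starting position are invented for illustration, and the weights follow the probability times squared depth form given on the slide.

```python
import math

def mean_shift_mode(points, weights, start, bandwidth, iters=50, tol=1e-6):
    """Climb to a local mode of a weighted Gaussian kernel density."""
    x = tuple(start)
    for _ in range(iters):
        k = [w * math.exp(-math.dist(x, p) ** 2 / bandwidth ** 2)
             for p, w in zip(points, weights)]
        total = sum(k)
        if total == 0.0:
            break  # no point is close enough to pull the estimate
        new_x = tuple(sum(ki * p[d] for ki, p in zip(k, points)) / total
                      for d in range(len(x)))
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x

# Three pixels on one body part plus a distant, low-probability outlier.
points = [(0.0, 0.0, 2.0), (0.02, 0.0, 2.0), (-0.02, 0.0, 2.0), (1.0, 1.0, 2.0)]
probs = [0.9, 0.9, 0.9, 0.2]       # P(c | I, x_i), made-up values
depths = [p[2] for p in points]    # d_I(x_i)
weights = [pr * d ** 2 for pr, d in zip(probs, depths)]
mode = mean_shift_mode(points, weights, start=(0.05, 0.0, 2.0), bandwidth=0.1)
```

The outlier is too far away, relative to the bandwidth, to influence the mode, which is the point of using local mode finding rather than a global average.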
Fig From [4]
Experiments:
8800 frames of real depth images.
5000 synthetic depth images.
Also evaluated on the Article [1] dataset.
Measures:
1. Classification accuracy – confusion matrix.
2. Joint accuracy – mean Average Precision (mAP);
results within D = 0.1 m count as true positives (TP).
Fig From [2]
High correlation between real and synthetic results.
Tree depth – the most effective parameter
Fig From [2]
Comparing the algorithm on:
real set (red) – mAP 0.731
ground truth set (blue) – mAP 0.914
mAP 0.984 – upper body
Fig From [2]
Comparing algorithm to ideal Nearest Neighbor
matching, and realistic NN - Chamfer NN.
Fig From [2]
Comparison to Article[1]:
Run on the same dataset
Better results (even without temporal data)
Runs 10x faster.
Fig From [2]
Full rotations and multiple people
Right-left ambiguity
mAP of 0.655 (good for our uses)
Result Video
Fig From [2]
Future work:
Better synthesis pipeline.
Is there an efficient approach that directly
regresses joint positions? (Already addressed in follow-up
work: Efficient offset regression of body joint
positions.)
Well Written
Self Contained
Novel combination of existing parts
New technology
Achieving goals (real time)
Extensively validated:
Used in real console
Many results graphs and examples
(Another pdf of supplementary material)
Broad comparison to other algorithms
Data set and code not available
[1] Real Time Motion Capture Using a Single TOF Camera (V. Ganapathi et al., 2010)
[2] Real-Time Human Pose Recognition in Parts from Single Depth Images (Shotton et al. & Xbox Incubation, 2011)
[3] Real time identification and localization of body parts from depth images (C. Plagemann et al., 2010)
[4] Computer Graphics course (046746), Technion.