Human pose recognition
1. Introduction
2. Article [1]
Real Time Motion Capture Using a Single
TOF Camera (2010)
3. Article [2]
Real-Time Human Pose Recognition in
Parts from Single Depth Images (2011)
Fig From [2]
Why do we need this?
Robotics
Smart surveillance
Virtual reality
Motion analysis
Gaming - Kinect
Microsoft Xbox 360 console
“You are the controller”
Launched 4 November 2010
Sold over 8M units in its first 60 days on the market
(Guinness world record)
http://www.youtube.com/watch?v=p2qlHoxPioM
Mocap using markers: expensive.
Multi-view camera systems: limited applicability.
Monocular: simplified problems.
Time-of-flight (TOF) camera:
Dense depth
High frame rate (100 Hz)
Robust to:
Lighting
Shadows
Other problems.
2.1 Previous work
2.2 What’s new?
2.3 Overview
2.4 Results
2.5 Limitations & future work
2.6 Evaluation
Many, many articles…
(Moeslund et al. 2006 covered 350 articles…)
TOF technology
Propagating information up the kinematic chain.
Probabilistic model using the unscented transform.
Multiple GPUs.
1. Probabilistic Model
2. Algorithm Overview:
Model Based Hill Climbing Search
Evidence Propagation
Full Algorithm
15 body parts
DAG: directed acyclic graph
DBN: dynamic Bayesian network
Pose: $X_t = \{X_t^i\}_{i=1}^{N}$
Speed: $V_t$
Range scan: $z_t$
Assumptions:
$V_t \mid V_{t-1} \sim \mathcal{N}(V_{t-1}, \Sigma)$
$P(X_t^i \mid V_t^i, X_{t-1}^i) = 1$
Use ray casting to evaluate $z_k$: distance from measurement.
Goal: find the most likely states, given the previous frame's MAP estimate, i.e.:
$\hat{X}_t, \hat{V}_t = \arg\max_{X_t, V_t} \log P(z_t \mid X_t, V_t) + \log P(X_t, V_t \mid \hat{X}_{t-1}, \hat{V}_{t-1})$
Fig From [1]
1. Hill climbing search (HC)
2. Evidence propagation (EP)
Grid around $P(V_t^i \mid V_{t-1}^i)$, 0.05 m × 0.05 m.
Sample $V_t^i$, evaluate likelihood, choose best point!
Calculate $X_t^i$ from $V_t^i$, $\hat{X}_{t-1}$.
Coarse-to-fine grids.
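A minimal sketch of the coarse-to-fine grid search, under strong simplifying assumptions: the state is a single 2D velocity, and the score is a toy stand-in for log P(z | X, V) + log P(X, V | previous estimate). The names `toy_log_posterior` and `coarse_to_fine_search`, the noise scales, and the number of levels are all hypothetical; the method in [1] scores candidates by ray casting a full body model.

```python
import numpy as np

def toy_log_posterior(v, v_prev, target, sigma_v=0.1, sigma_z=0.05):
    """Toy score: Gaussian 'measurement' term plus Gaussian velocity prior.

    Stand-in only: the real likelihood comes from ray casting the model
    against the range scan.
    """
    log_lik = -np.sum((v - target) ** 2) / (2 * sigma_z ** 2)
    log_prior = -np.sum((v - v_prev) ** 2) / (2 * sigma_v ** 2)
    return log_lik + log_prior

def coarse_to_fine_search(score, center, spacing=0.05, levels=3, half_width=2):
    """Hill climb on a grid around the current best point, then halve the spacing."""
    best = np.asarray(center, dtype=float)
    best_score = score(best)
    steps = np.arange(-half_width, half_width + 1)
    for _ in range(levels):
        while True:
            # Evaluate every grid point around the current best estimate.
            grid = [best + spacing * np.array([dx, dy]) for dx in steps for dy in steps]
            scores = [score(g) for g in grid]
            k = int(np.argmax(scores))
            if scores[k] <= best_score:
                break  # no grid point improves the score: local optimum at this level
            best, best_score = grid[k], scores[k]
        spacing /= 2.0  # refine the grid
    return best, best_score

v_prev = np.array([0.0, 0.0])
target = np.array([0.5, -0.25])
best_v, best_score = coarse_to_fine_search(
    lambda v: toy_log_posterior(v, v_prev, target), center=v_prev)
```

Because the search only ever moves to a better neighbouring grid point, it inherits the local-optimum weakness listed for HC.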
Fig From [1]
The good:
Simple
Fast
Runs in parallel on GPUs
The Bad:
Local optimum
Ridges, Plateau, Alleys
Can lose track when motion is fast or occlusions occur.
Evidence propagation also has 3 stages:
1. Body part detection (C. Plagemann et al. 2010 [3])
2. Probabilistic inverse kinematics
3. Data association and inference
Bottom-up approach:
1. Locate interest points with AGEX
(Accumulative Geodesic Extrema).
2. Find orientation.
3. Classify the head, feet and hands using local shape
descriptors.
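The interest-point step can be sketched as repeated shortest-path searches on a surface graph: each round picks the node geodesically farthest from the body centroid and from all extrema found so far. This is a sketch under assumptions: the graph here is a hand-made stick figure with invented node names and edge lengths, whereas [3] builds the graph from the depth image; `dijkstra` and `agex` are hypothetical helper names.

```python
import heapq

def dijkstra(graph, sources):
    """Geodesic (shortest-path) distance from the nearest source to every node."""
    dist = {node: float("inf") for node in graph}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

def agex(graph, centroid, k):
    """Return k accumulative geodesic extrema, in the order they are found."""
    sources, extrema = [centroid], []
    for _ in range(k):
        dist = dijkstra(graph, sources)
        reachable = [n for n in graph if dist[n] < float("inf")]
        far = max(reachable, key=lambda n: dist[n])
        extrema.append(far)
        # Treating the extremum as an extra source is equivalent to adding a
        # zero-cost edge from the centroid to it.
        sources.append(far)
    return extrema

def add_chain(graph, start, names, length):
    """Attach a chain of nodes to `start` with undirected edges of equal length."""
    prev = start
    for name in names:
        graph.setdefault(prev, []).append((name, length))
        graph.setdefault(name, []).append((prev, length))
        prev = name

# Toy stick figure (edge lengths are made up).
body = {}
add_chain(body, "torso", ["neck", "head"], 1.25)
add_chain(body, "torso", ["l_elbow", "l_hand"], 1.5)
add_chain(body, "torso", ["r_elbow", "r_hand"], 1.6)
add_chain(body, "torso", ["l_knee", "l_foot"], 2.0)
add_chain(body, "torso", ["r_knee", "r_foot"], 2.1)
points = agex(body, "torso", 5)
```

On this toy graph the five extrema are the limb ends and the head, which is why such points are good candidates for the later classification step.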
Fig From [3]
Results:
Fig From [3]
Body parts of $X$: $\{p_i\}_{i=1}^{5}$ = {head, hands, legs}
Detections: $\hat{p}_j$ $(j = 1, \ldots, N)$
Assume correspondence $p_i \leftrightarrow \hat{p}_j$.
Need a new MAP estimate conditioned on $\hat{X}_{t-1}, \hat{p}_j$.
Problem: $p_i(V_t, \hat{X}_{t-1}, \hat{V}_{t-1})$ isn’t linear!
Solution: linearize with the unscented Kalman filter.
Easy to determine $P(V_t \mid \hat{V}_{t-1}, \hat{X}_{t-1}, \hat{p}_j)$.
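To illustrate how the unscented transform handles a nonlinear mapping, here is a minimal sketch that pushes a Gaussian through a function using sigma points. The function names, the `kappa` parameter value, and the two-link "arm" used as the nonlinear function are assumptions for illustration, not the model of [1].

```python
import numpy as np

def unscented_transform(mean, cov, f, kappa=1.0):
    """Approximate the mean and covariance of f(x) for x ~ N(mean, cov)."""
    mean = np.asarray(mean, dtype=float)
    n = mean.size
    # Columns of L are scaled offsets: L @ L.T == (n + kappa) * cov.
    L = np.linalg.cholesky((n + kappa) * np.asarray(cov, dtype=float))
    # 2n + 1 sigma points: the mean, and the mean plus/minus each column of L.
    sigma = np.vstack([mean, mean + L.T, mean - L.T])
    weights = np.full(2 * n + 1, 1.0 / (2.0 * (n + kappa)))
    weights[0] = kappa / (n + kappa)
    ys = np.array([f(s) for s in sigma])
    y_mean = weights @ ys
    diff = ys - y_mean
    y_cov = (weights[:, None] * diff).T @ diff
    return y_mean, y_cov

def hand_position(angles, l1=0.3, l2=0.25):
    """Hypothetical two-link arm: joint angles -> 2D hand position (nonlinear)."""
    a, b = angles
    return np.array([l1 * np.cos(a) + l2 * np.cos(a + b),
                     l1 * np.sin(a) + l2 * np.sin(a + b)])

hand_mean, hand_cov = unscented_transform(
    mean=[0.4, 0.8], cov=np.diag([0.05, 0.02]), f=hand_position)
```

For a linear function the transform reproduces the exact mean and covariance, which makes a convenient sanity check.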
Full algorithm (flowchart): over correspondences $\{(p_i, \hat{p}_j)\}$, is $X' > X_{best}$?
Experiments:
28 real depth image sequences.
Ground Truth - tracking markers.
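$\epsilon_{avg} = \frac{1}{M} \sum_{i=1}^{M} \|m_i - \hat{m}_i\|$, where $m_i$ is the real marker position and $\hat{m}_i$ the estimated position.
$\epsilon_{avg} \le 0.1$ m: perfect tracks.
$\epsilon_{avg} \ge 0.3$ m: faulty tracking.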
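The error measure is the mean Euclidean distance between real and estimated marker positions. A small sketch, with a hypothetical function name and made-up marker coordinates:

```python
import numpy as np

def average_marker_error(true_markers, estimated_markers):
    """Mean Euclidean distance (in metres) over M markers, each an (x, y, z) row."""
    m = np.asarray(true_markers, dtype=float)
    m_hat = np.asarray(estimated_markers, dtype=float)
    return float(np.mean(np.linalg.norm(m - m_hat, axis=1)))

# Two made-up markers, off by 0.1 m and 0.3 m respectively.
err = average_marker_error([[0, 0, 0], [1, 0, 0]],
                           [[0, 0, 0.1], [1, 0.3, 0]])
```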
Compared 3 algorithms: EP, HC, HC+EP.
Best: HC+EP; worst: EP.
Runs close to real time.
HC: 6 frames per second.
HC+EP: 4-6 frames per second.
Fig From [1]
Extreme case, sequence 27: lose track (HC vs. HC+EP).
Fig From [1]
Limitations:
Manual initialization.
Cannot track more than one person at a time.
Using temporal data consumes more time and creates a
reinitialization problem.
Future work:
Improving the speed.
Combining with color cameras.
Fully automatic model initialization.
Tracking more than one person.
Well Written
Self Contained
Novel combination of existing parts
New technology
Achieving goals (real time)
Missing examples of the probabilistic model.
Not clear how $X_0$ is defined.
Extensively validated:
Data set and code available
Not enough visual examples in the article
No comparison to different algorithms
3.1 Previous work
3.2 What’s new?
3.3 Overview
3.4 Results
3.5 Limitations & future work
3.6 Evaluation
Same as Article [1].
Using no temporal information: robust and
fast (200 frames per second).
Object recognition approach.
Per-pixel classification.
Large and highly varied training dataset.
Fig From [2]
1. Database construction
2. Body part inference and joint proposals:
Goals:
computational efficiency and robustness
Pose estimation has often had to overcome a lack of training data…
Why?
Huge color and texture variability.
Computer simulations don't produce the range of volitional
motions of a human subject.
Fig From [2]
Fig From [2]
1. Body part labeling
2. Depth image features
3. Randomized decision forests
4. Joint position proposals
31 body parts labeled.
The problem can now be solved by efficient
classification algorithms.
Fig From [2]
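Simple depth comparison features, eq. (1) in [2]:
$f_\theta(I, x) = d_I\left(x + \frac{u}{d_I(x)}\right) - d_I\left(x + \frac{v}{d_I(x)}\right)$
$d_I(x)$: depth at pixel $x$ in image $I$; $\theta = (u, v)$: offsets.
Normalization of the offsets by $1/d_I(x)$: depth invariant.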
computational efficiency:
no preprocessing.
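A minimal sketch of one depth comparison feature: the difference between two depth probes whose pixel offsets u and v are divided by the depth at x, so the probes cover the same physical extent at any distance. The function names, the (row, column) convention, and the value of the large background constant are assumptions; the image is a plain NumPy array of depths in metres.

```python
import numpy as np

BACKGROUND_DEPTH = 1e6  # assumed large constant for off-image probes

def probe(depth, pos):
    """Depth at a (row, col) position, or a large constant outside the image."""
    r, c = int(round(pos[0])), int(round(pos[1]))
    h, w = depth.shape
    if 0 <= r < h and 0 <= c < w:
        return float(depth[r, c])
    return BACKGROUND_DEPTH

def depth_feature(depth, x, u, v):
    """f(I, x) = d(x + u / d(x)) - d(x + v / d(x)), with x, u, v as (row, col)."""
    d = float(depth[x[0], x[1]])
    p1 = (x[0] + u[0] / d, x[1] + u[1] / d)
    p2 = (x[0] + v[0] / d, x[1] + v[1] / d)
    return probe(depth, p1) - probe(depth, p2)

# Toy 5x5 depth image: a flat surface at 2 m with one farther pixel.
image = np.full((5, 5), 2.0)
image[3, 2] = 2.5
value = depth_feature(image, x=(2, 2), u=(2.0, 0.0), v=(0.0, -4.0))
```

Each feature needs only three image reads and a few arithmetic operations, which is where the computational efficiency comes from.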
Fig From [2]
How does it work?
Pixel $x$
Node = feature $f_\theta$ and a threshold $\tau$
Classify pixel $x$:
$P(c \mid I, x) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, x)$
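The classification step can be sketched as follows: a pixel descends each tree by comparing a feature value to the node threshold, and the leaf distributions reached in the T trees are averaged. The tree encoding (nested dicts with NumPy arrays as leaves), the "less than goes left" convention, and the toy feature are assumptions for illustration.

```python
import numpy as np

def tree_posterior(tree, image, x):
    """Walk one tree from the root to a leaf; return the leaf's class distribution."""
    node = tree
    while isinstance(node, dict):
        go_left = node["feature"](image, x) < node["tau"]
        node = node["left"] if go_left else node["right"]
    return node

def forest_posterior(forest, image, x):
    """Average the per-tree distributions: P(c|I,x) = (1/T) * sum_t P_t(c|I,x)."""
    return np.mean([tree_posterior(t, image, x) for t in forest], axis=0)

# Toy forest of two depth-1 trees over two classes; the feature is the raw depth.
raw_depth = lambda image, x: image[x]
forest = [
    {"feature": raw_depth, "tau": 1.0,
     "left": np.array([0.9, 0.1]), "right": np.array([0.2, 0.8])},
    {"feature": raw_depth, "tau": 3.0,
     "left": np.array([0.7, 0.3]), "right": np.array([0.1, 0.9])},
]
image = np.full((4, 4), 2.0)
posterior = forest_posterior(forest, image, (1, 1))
```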
Fig From [2]
Training Algorithm:
1M images, 2000 pixels per image
$\phi = (\theta, \tau)$
$\phi^* = \arg\max_{\phi} G(\phi)$
$G(\phi) = H(Q) - \sum_{s \in \{l, r\}} \frac{|Q_s(\phi)|}{|Q|} H(Q_s(\phi))$, where $H$ is the entropy
Training 3 trees, depth 20, 1M images: ~1 day (1000-core cluster)
1M images × 2000 pixels × 2000 features × 50 thresholds = $2 \cdot 10^{14}$ computations…
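A small sketch of how a split can be chosen by information gain: for each candidate threshold, the labelled pixels are partitioned into left and right sets and the size-weighted entropy of the two sets is subtracted from the entropy of the whole set. The function names and the toy data are hypothetical, and only one feature is searched here; the last line re-checks the computation count quoted on the slide.

```python
import numpy as np

def entropy(labels, n_classes):
    """Shannon entropy (bits) of the class histogram of `labels`."""
    if labels.size == 0:
        return 0.0
    counts = np.bincount(labels, minlength=n_classes)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(features, labels, tau, n_classes):
    """G = H(Q) - sum over {left, right} of |Q_s|/|Q| * H(Q_s)."""
    gain = entropy(labels, n_classes)
    for part in (labels[features < tau], labels[features >= tau]):
        gain -= part.size / labels.size * entropy(part, n_classes)
    return gain

def best_threshold(features, labels, taus, n_classes):
    """Pick the candidate threshold with the largest gain."""
    gains = [information_gain(features, labels, t, n_classes) for t in taus]
    k = int(np.argmax(gains))
    return taus[k], gains[k]

feats = np.array([0.1, 0.2, 0.8, 0.9])   # toy feature responses
labs = np.array([0, 0, 1, 1])            # toy body-part labels
tau, gain = best_threshold(feats, labs, [0.15, 0.5, 0.85], n_classes=2)

# Work per tree level quoted on the slide.
total = 1_000_000 * 2000 * 2000 * 50
```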
Trained tree:
Fig From [2]
Local mode finding approach based on mean shift with a
weighted Gaussian kernel.
Density estimator:
$f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\left(-\left\|\frac{\hat{x} - \hat{x}_i}{b_c}\right\|^2\right)$
$w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2$
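A minimal sketch of weighted mean shift with a Gaussian kernel: starting from an initial guess, the estimate repeatedly moves to the kernel-weighted average of the 3D points until it stops changing, which lands on a local mode of the density. The function name, the stopping rule, and the toy points are assumptions; the weights follow the slide's definition (class probability times squared depth).

```python
import numpy as np

def mean_shift_mode(points, weights, bandwidth, start, n_iter=50, tol=1e-6):
    """Climb to a local mode of sum_i w_i * exp(-||(x - x_i) / b||^2)."""
    x = np.asarray(start, dtype=float)
    for _ in range(n_iter):
        k = weights * np.exp(-np.sum(((points - x) / bandwidth) ** 2, axis=1))
        new_x = (k[:, None] * points).sum(axis=0) / k.sum()
        if np.linalg.norm(new_x - x) < tol:
            return new_x
        x = new_x
    return x

# Toy 3D points for one body part: a tight cluster plus one far outlier.
pts = np.array([[0.9, 1.0, 2.0], [1.1, 1.0, 2.0],
                [1.0, 0.9, 2.0], [1.0, 1.1, 2.0],
                [5.0, 5.0, 2.0]])
prob = np.array([0.8, 0.8, 0.8, 0.8, 0.8])   # P(c | I, x_i), made up
depth = pts[:, 2]                            # d_I(x_i)
w = prob * depth ** 2
mode = mean_shift_mode(pts, w, bandwidth=0.5, start=pts[0])
```

Because the kernel decays quickly with distance, the outlier has almost no pull on the result, which is the robustness argument for mode finding over a plain average.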
Fig From [4]
Experiments:
8800 frames of real depth images.
5000 synthetic depth images.
Also evaluated on the Article [1] dataset.
Measures:
1. Classification accuracy: confusion matrix.
2. Joint accuracy: mean average precision (mAP);
results within D = 0.1 m count as true positives (TP).
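A sketch of the joint accuracy measure for a single joint, under stated assumptions: proposals from all frames are ranked by confidence, a proposal within D = 0.1 m of the ground truth is a true positive, only the first hit per frame counts, and average precision is taken as the mean of the precision at each true positive. The slides do not say which average-precision variant [2] uses, so this definition, the function name, and the data are illustrative only; mAP would be the mean of this value over joints.

```python
import numpy as np

def average_precision(frames, max_dist=0.1):
    """frames: list of (truth_xyz, [(confidence, proposal_xyz), ...]) for one joint."""
    scored = []
    for idx, (truth, proposals) in enumerate(frames):
        for conf, xyz in proposals:
            dist = np.linalg.norm(np.asarray(xyz, float) - np.asarray(truth, float))
            scored.append((conf, idx, bool(dist <= max_dist)))
    scored.sort(key=lambda s: -s[0])          # most confident first
    matched, hits, precisions = set(), 0, []
    for rank, (_, idx, close) in enumerate(scored, start=1):
        if close and idx not in matched:      # later hits in a frame are false positives
            matched.add(idx)
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(frames) if frames else 0.0

frames = [
    ((0.0, 0.0, 2.0), [(0.9, (0.05, 0.0, 2.0)), (0.4, (0.5, 0.0, 2.0))]),
    ((1.0, 1.0, 2.0), [(0.6, (1.0, 1.3, 2.0)), (0.3, (1.0, 1.05, 2.0))]),
]
ap = average_precision(frames)
```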
Fig From [2]
High correlation between real and synthetic results.
Depth of tree: the most effective parameter.
Fig From [2]
Comparing the algorithm on:
real set (red) – mAP 0.731
ground truth set (blue) – mAP 0.914
mAP 0.984 – upper body
Fig From [2]
Comparing algorithm to ideal Nearest Neighbor
matching, and realistic NN - Chamfer NN.
Fig From [2]
Comparison to Article[1]:
Run on the same dataset
Better results (even without temporal data)
Runs 10x faster.
Fig From [2]
Full rotations and multiple people
Right-left ambiguity
mAP of 0.655 (good for our uses)
Result Video
Fig From [2]
Future work:
Better synthesis pipeline.
Is there an efficient approach that directly
regresses joint positions? (Already addressed in follow-up
work: Efficient offset regression of body joint
positions.)
Well Written
Self Contained
Novel combination of existing parts
New technology
Achieving goals (real time)
Extensively validated:
Used in real console
Many results graphs and examples
(Another pdf of supplementary material)
Broad comparison to other algorithms
data set and code not available
[1] V. Ganapathi et al., Real Time Motion Capture Using a Single TOF Camera, 2010.
[2] Shotton et al. & Xbox Incubation, Real-Time Human Pose Recognition in Parts from Single Depth Images, 2011.
[3] C. Plagemann et al., Real-time identification and localization of body parts from depth images, 2010.
[4] Computer Graphics course (046746), Technion.