Transcript Slides

Real-time Articulated Hand Pose Estimation using
Semi-supervised Transductive Regression Forests
Tsz-Ho Yu, Danhang Tang, T-K Kim
Sponsored by [sponsor logo]
Motivation
Multiple cameras with inverse kinematics
[Bissacco et al. CVPR2007]
[Yao et al. IJCV2012]
[Sigal IJCV2011]
Specialized hardware
(e.g. structured light sensor,
TOF camera)
[Shotton et al. CVPR’11]
[Baak et al. ICCV2011]
[Ye et al. CVPR2011]
[Sun et al. CVPR2012]
Learning-based (regression)
[Navaratnam et al. BMVC2006]
[Andriluka et al. CVPR2010]
Motivation
• Discriminative approaches (RF) have achieved great success in human body pose estimation.
» Efficient: runs in real time
» Accurate: works on a per-frame basis and does not rely on tracking
» Require a large dataset to cover many poses
» Train on synthetic data, test on real data
» Do not exploit kinematic constraints
Examples:
Shotton et al. CVPR’11, Girshick et al. ICCV’11, Sun et al. CVPR’12
Challenges for Hand?
• Viewpoint changes and self occlusions
• Discrepancy between synthetic and real data is larger than for the human body
• Labeling is difficult and tedious!
Our method
• Viewpoint changes and self occlusions
» Hierarchical Hybrid Forest
• Discrepancy between synthetic and real data is larger than for the human body
» Transductive Learning
• Labeling is difficult and tedious!
» Semi-supervised Learning
Existing Approaches
Generative approaches
• Model-fitting
• No training is required
• Slow
• Needs initialisation and tracking
Examples:
Oikonomidis et al. ICCV2011, Ballan et al. ECCV 2012 (motion capture), De La Gorce et al. PAMI2010, Hamer et al. ICCV2009
Discriminative approaches
• Similar solutions to human body pose estimation
• Performance on real data remains challenging
Examples:
Keskin et al. ECCV2012, Xu and Cheng ICCV 2013, Wang et al. SIGGRAPH2009, Stenger et al. IVC 2007
Hierarchical Hybrid Forest
Viewpoint Classification: Qa
Finger Joint Classification: Qp
Pose Regression: Qv
STR forest:
Qapv = αQa + (1-α)βQp + (1-α)(1-β)Qv
• Qa – Viewpoint classification quality (Information gain)
• Qp – Joint label classification quality (Information gain)
• Qv – Compactness of voting vectors (Determinant of covariance trace)
• (α,β) – Margin measures of viewpoint labels and joint labels
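The combined quality Qapv above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the compactness measure is assumed to be 1 / (1 + size-weighted covariance trace of the child nodes), and α and β are simply passed in rather than derived from label margins.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def covariance_trace(votes):
    """Trace of the covariance of vote vectors (sum of per-axis variances)."""
    n = len(votes)
    if n == 0:
        return 0.0
    dims = len(votes[0])
    means = [sum(v[d] for v in votes) / n for d in range(dims)]
    return sum(sum((v[d] - means[d]) ** 2 for v in votes) / n
               for d in range(dims))

def compactness(left_votes, right_votes):
    """Assumed form of Qv: 1 / (1 + size-weighted covariance trace)."""
    n = len(left_votes) + len(right_votes)
    if n == 0:
        return 1.0
    spread = (len(left_votes) * covariance_trace(left_votes)
              + len(right_votes) * covariance_trace(right_votes)) / n
    return 1.0 / (1.0 + spread)

def q_apv(q_a, q_p, q_v, alpha, beta):
    """Qapv = alpha*Qa + (1-alpha)*beta*Qp + (1-alpha)*(1-beta)*Qv."""
    return (alpha * q_a
            + (1 - alpha) * beta * q_p
            + (1 - alpha) * (1 - beta) * q_v)
```

With α = 1 only the viewpoint term Qa contributes; with α = 0 and β = 1 only the joint term Qp contributes; with α = β = 0 only the regression term Qv contributes. This is how a single quality function can cover classification near the root and regression near the leaves.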
Transductive learning
Source space
(Synthetic data S)
Target space
(Realistic data R)
Training data D = {Rl, Ru, S} (Rl: labeled, Ru: unlabeled):
• Synthetic data S:
» Generated from an articulated hand model. All labeled.
• Realistic data R:
» Captured from a Primesense depth sensor
» A small subset of R, Rl, is labeled manually; the rest forms the unlabeled set Ru
Transductive learning
Source space
(Synthetic data S)
Target space
(Realistic data R)
Training data D = {Rl, Ru, S}:
• Realistic data R:
» Captured from Kinect
» A small subset of R, Rl, is labeled manually; the rest forms the unlabeled set Ru
• Synthetic data S:
» Generated from an articulated hand model, where |S| >> |R|
Transductive learning
Source space
(Synthetic data S)
Target space
(Realistic data R)
Training data D = {Rl, Ru, S}:
• Similar data points in Rl and S are paired; a split function that separates a pair incurs a penalty
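One way to picture the pairing and the penalty is the sketch below. It is an illustration under stated assumptions, not the formulation used in the talk: similarity is taken to be squared Euclidean distance between pose label vectors, the penalty is expressed as the fraction of pairs a split keeps together, and both function names are hypothetical.

```python
def pair_real_with_synthetic(real_poses, synth_poses):
    """Pair each labeled real sample with its closest synthetic sample.

    Assumption: similarity is the squared Euclidean distance between
    pose label vectors. Returns (real_index, synthetic_index) tuples.
    """
    pairs = []
    for r_idx, r in enumerate(real_poses):
        s_idx = min(
            range(len(synth_poses)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(r, synth_poses[j])),
        )
        pairs.append((r_idx, s_idx))
    return pairs

def transductive_term(pairs, real_goes_left, synth_goes_left):
    """Fraction of pairs whose two members land on the same side of a split.

    real_goes_left[i] and synth_goes_left[j] hold the boolean outcome of
    the candidate split function for each real and synthetic sample.
    """
    if not pairs:
        return 1.0
    kept = sum(1 for r, s in pairs if real_goes_left[r] == synth_goes_left[s])
    return kept / len(pairs)
```

A split that keeps every pair together scores 1.0, and each separated pair lowers the score, so candidate splits that pull matched real and synthetic samples apart are ranked lower.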
Semi-supervised learning
Source space
(Synthetic data S)
Target space
(Realistic data R)
Training data D = {Rl, Ru, S}:
• Similar data points in Rl and S are paired; a split function that separates a pair incurs a penalty
• Introduce a semi-supervised term to make use of unlabeled real data when evaluating a split function
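A semi-supervised term of this kind can be sketched as follows. The slide does not give the formula, so this is only an assumed form: the term rewards splits whose child nodes are compact in feature space, which needs no labels and can therefore use all of the real data. The function names and the 1 / (1 + spread) form are hypothetical.

```python
def variance_trace(features):
    """Sum of per-dimension variances of a list of feature vectors."""
    n = len(features)
    if n == 0:
        return 0.0
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    return sum(sum((f[d] - means[d]) ** 2 for f in features) / n
               for d in range(dims))

def unlabeled_term(left, right):
    """Assumed label-free quality: 1 / (1 + size-weighted child spread)."""
    n = len(left) + len(right)
    if n == 0:
        return 1.0
    spread = (len(left) * variance_trace(left)
              + len(right) * variance_trace(right)) / n
    return 1.0 / (1.0 + spread)

def best_threshold(features, dim, thresholds):
    """Pick the threshold on dimension dim that maximises unlabeled_term."""
    def score(t):
        left = [f for f in features if f[dim] < t]
        right = [f for f in features if f[dim] >= t]
        return unlabeled_term(left, right)
    return max(thresholds, key=score)
```

For four one-dimensional samples clustered around 0 and 1, a threshold between the clusters gives the most compact children and is selected over thresholds that cut through a cluster.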
Kinematic refinement
Experiment settings
Training data:
» Synthetic data (337.5K images)
» Real data (81K images, <1.2K labeled)
Evaluation data:
• Three different testing sequences
1. Sequence A: single viewpoint (450 frames)
2. Sequence B: multiple viewpoints, with slow hand movements (1000 frames)
3. Sequence C: multiple viewpoints, with fast hand movements (240 frames)
Self comparison experiment
Self comparison (Sequence A):
» This graph shows the joint classification accuracy on Sequence A.
» The realistic and synthetic baselines produced similar accuracies.
» Using the transductive term is better than simply augmenting real data with synthetic data.
» All terms together achieve the best results.
Multi-view experiments
Multi-view experiment (Sequence C):
Conclusion
A 3D hand pose estimation algorithm
• STR forest: Semi-supervised and transductive regression forest
• A data-driven refinement scheme to rectify the shortcomings of STR forest
» Real-time (25Hz on an Intel i7 PC without CPU/GPU optimisation)
» Works better than state-of-the-art methods
» Makes use of unlabelled data and requires less manual annotation
» More accurate in real scenarios
Video demo
Thank you!
http://www.iis.ee.ic.ac.uk/icvl