Learning Robust Model for Keypoint Tracking


Metric Learning-Driven Multi-Task Structured Output Optimization for Robust Keypoint Tracking
Xi Li (李玺)
Zhejiang University
Liming Zhao, Xi Li*, Jun Xiao, Fei Wu, and Yueting Zhuang. “Metric Learning Driven Multi-Task Structured Output
Optimization for Robust Keypoint Tracking.” AAAI 2015. (Oral Presentation) http://www.zhaoliming.net/research
Review: Object Tracking
• Goal: to estimate the motion state of a target object in an input video.
• Category: different tracking forms
a) contour based tracking
• non-rigid object tracking.
b) region based tracking
• global statistical information.
c) keypoint based tracking
• local texture information.
• robust to partial occlusion, shape deformation, etc.
• flexible, with many other applications (e.g., graphics).
Li, X.; Hu, W.; Shen, C.; Zhang, Z.; Dick, A.; and Hengel, A. V. D. A survey of appearance models in visual object tracking. TIST 2013.
Review: Keypoint Based Tracking
General keypoint based tracking approach:
• Detector: Harris, SIFT, SURF, MSER, FAST
• Descriptor: SIFT, SURF, BRIEF, ORB, BRISK, FREAK
• Matching: Template Matching, Graph Matching, Optical Flow
• Modeling: Random Tree, Boosting, SVM, Structured Learning
What we focus on:
• Descriptor Feature Learning
• Statistical Modeling Method
• A Spatio-temporal Aware Unified Matching Framework
Preliminary
traditional approach
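• Extract descriptors from the keypoints of the template, $\{\mathbf{d}_i\}_{i=1}^{N_1}$, and of the input frame, $\{\mathbf{d}_j\}_{j=1}^{N_2}$.
• Keypoint matching: compare descriptor pairs by $\mathrm{distance}(\mathbf{d}_i, \mathbf{d}_j)$.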
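The traditional pipeline above can be sketched in a few lines. This is a minimal illustration, assuming binary descriptors (such as BRIEF) compared by Hamming distance and a plain nearest-neighbor rule; the function names and the toy 4-bit descriptors are hypothetical.

```python
import numpy as np

def hamming_distance_matrix(template_desc, frame_desc):
    """Pairwise Hamming distances between two sets of binary descriptors.

    template_desc: (N1, B) array of 0/1 bits, one row per template keypoint.
    frame_desc:    (N2, B) array of 0/1 bits, one row per input-frame keypoint.
    Returns an (N1, N2) integer matrix.
    """
    t = np.asarray(template_desc, dtype=bool)
    f = np.asarray(frame_desc, dtype=bool)
    # XOR every template row with every frame row, then count differing bits.
    return np.count_nonzero(t[:, None, :] ^ f[None, :, :], axis=2)

def match_nearest(template_desc, frame_desc):
    """For each template keypoint i, return the index j of the closest frame descriptor."""
    return hamming_distance_matrix(template_desc, frame_desc).argmin(axis=1)

# Toy 4-bit descriptors (real BRIEF descriptors are much longer).
template = np.array([[0, 0, 1, 1], [1, 1, 0, 0]])
frame = np.array([[1, 1, 0, 1], [0, 0, 1, 1], [1, 0, 1, 0]])
matches = match_nearest(template, frame)  # template 0 -> frame 1, template 1 -> frame 0
```

Because the rule depends only on raw descriptor distances, any appearance change that flips descriptor bits directly changes the matches.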
Preliminary
traditional approach
• Descriptors change when the appearance changes
(scale and rotation changes, illumination variation, etc.)
• So traditional keypoint matching performs poorly
in complicated scenarios.
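Preliminary
model learning approach
• Learn a model for the template keypoints: $\{\mathbf{w}_i\}_{i=1}^{N_1}$.
• Extract descriptors from the keypoints of the input frame: $\{\mathbf{d}_j\}_{j=1}^{N_2}$.
• Keypoint matching: $\mathrm{score} = \langle \mathbf{w}_i, \mathbf{d}_j \rangle$.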
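Preliminary
model learning approach
Template: $O = \{(u_i, \mathbf{q}_i)\}_{i=1}^{N_O}$
Input frame: $I = \{(v_j, \mathbf{d}_j)\}_{j=1}^{N_I}$
All potential correspondences:
$C = \{(u_i, v_j, s_{ij}) \mid (u_i, \mathbf{q}_i) \in O,\ (v_j, \mathbf{d}_j) \in I,\ s_{ij} = \langle \mathbf{w}_i, \mathbf{d}_j \rangle\}$
Keypoint matching means finding the correct correspondences in $C$.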
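The scores $s_{ij} = \langle \mathbf{w}_i, \mathbf{d}_j \rangle$ for all potential correspondences can be computed as one matrix product. The sketch below is illustrative only: the function names, the real-valued toy vectors, and the choice of keeping the single best-scoring frame keypoint per template keypoint are assumptions, not the authors' implementation.

```python
import numpy as np

def correspondence_scores(W, D):
    """Score matrix S with S[i, j] = <w_i, d_j>.

    W: (N_O, B) array, one learned model vector per template keypoint.
    D: (N_I, B) array, one descriptor per input-frame keypoint.
    """
    return np.asarray(W, dtype=float) @ np.asarray(D, dtype=float).T

def best_correspondences(W, D):
    """Keep, for each template keypoint i, the frame keypoint j with the highest score."""
    S = correspondence_scores(W, D)
    j_best = S.argmax(axis=1)
    return [(i, int(j), float(S[i, j])) for i, j in enumerate(j_best)]

# Toy model vectors and descriptors (hypothetical values).
W = np.array([[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]])
D = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
candidates = best_correspondences(W, D)  # list of (i, j, s_ij) triples
```

Unlike a fixed distance, the vectors $\mathbf{w}_i$ can be updated over time, which is what the learning schemes in the following slides exploit.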
Challenges of Keypoint Tracking
• Appearance changes
• Similar keypoint features
• Wrong result with wrong structure
Main Idea
Learn a robust model
• Appearance changes → temporal model coherence across frames
• Similar keypoint features → discriminative feature construction
• Wrong result with wrong structure → spatial model consistency within frames
Main Idea
Learn a robust model with a joint learning scheme
• spatial model consistency within frames → Structured Output Learning
• temporal model coherence across frames → Multi-task Model Learning
• discriminative feature construction → Metric Learning
Structured Output Learning
• Keypoint Matching:
• Consider the spatial consistency of the correspondences.
• Planar Object Tracking:
• Most correspondences satisfy the same homography transformation.
• RANSAC method (maximize the number of inliers)
Structured Output Learning
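Given a transformation $\mathbf{y}$, the set of inliers is:
$H(C, \mathbf{y}) = \{(u_i, v_j) \mid (u_i, \mathbf{q}_i) \in O,\ (v_j, \mathbf{d}_j) \in I,\ \|\mathbf{y}(u_i) - v_j\| < \tau\}$
Get the expected $\mathbf{y}$ by maximizing the total score of inliers:
$\mathbf{y}^* = \arg\max_{\mathbf{y}} F(C, \mathbf{y}), \quad F(C, \mathbf{y}) = \sum_{(u_i, v_j) \in H(C, \mathbf{y})} s_{ij}$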
Structured Output Learning
$F(C, \mathbf{y}_1) = 0.2 + 0.3 + 0.1 + 0.9 = 1.5$
$F(C, \mathbf{y}_2) = 0.9 + 0.8 + 0.9 + 0.9 = 3.5$
$F(C, \mathbf{y}_3) = 0.6 + 0.7 + 0.9 + 0.9 = 3.1$
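The comparison of candidate transformations above can be sketched as follows: each candidate homography is scored by summing the scores of the correspondences it keeps as inliers, and the highest total wins. This is a minimal sketch; the function names, the toy points, and the threshold value are assumptions, and in practice the candidates would typically come from a RANSAC-style sampler rather than a fixed list.

```python
import numpy as np

def apply_homography(H, points):
    """Map 2-D points through a 3x3 homography (with perspective division)."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ np.asarray(H, dtype=float).T
    return homog[:, :2] / homog[:, 2:3]

def inlier_score(H, u, v, scores, tau):
    """Sum of scores over correspondences with ||y(u_i) - v_j|| < tau, plus the inlier mask."""
    residual = np.linalg.norm(apply_homography(H, u) - np.asarray(v, dtype=float), axis=1)
    inliers = residual < tau
    return float(np.sum(np.asarray(scores, dtype=float)[inliers])), inliers

def best_transformation(candidates, u, v, scores, tau):
    """Index of the candidate with the largest total inlier score, and all totals."""
    totals = [inlier_score(H, u, v, scores, tau)[0] for H in candidates]
    return int(np.argmax(totals)), totals

# Toy data: three correspondences follow a shift by (2, 0); the fourth is an outlier.
u = [[0, 0], [1, 0], [0, 1], [1, 1]]
v = [[2, 0], [3, 0], [2, 1], [5, 5]]
scores = [0.9, 0.8, 0.9, 0.9]
identity = np.eye(3)
shift = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
best, totals = best_transformation([identity, shift], u, v, scores, tau=0.5)
```

Here the shift keeps three inliers and the identity keeps none, so the shift is selected even though the outlier correspondence has a high score.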
Structured Output Learning
Objective Function:
where
Structured SVM:
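For orientation, a standard margin-rescaled structured SVM can be written in the notation of these slides as below. This is a sketch of the generic formulation under stated assumptions, not necessarily the exact objective shown on the slide: the joint feature map $\Phi$, the stacked model vector $\mathbf{w}$, the loss $\Delta$, the trade-off constant $\nu$, and the frame index $t$ are all introduced here for illustration.

```latex
\begin{aligned}
F(C, \mathbf{y}) &= \sum_{(u_i, v_j) \in H(C, \mathbf{y})} \langle \mathbf{w}_i, \mathbf{d}_j \rangle
  = \langle \mathbf{w}, \Phi(C, \mathbf{y}) \rangle, \\
\min_{\mathbf{w}, \boldsymbol{\xi}} \quad & \frac{1}{2} \|\mathbf{w}\|^2 + \nu \sum_{t} \xi_t \\
\text{s.t.} \quad & F(C_t, \mathbf{y}_t) - F(C_t, \mathbf{y}) \ge \Delta(\mathbf{y}_t, \mathbf{y}) - \xi_t,
  \quad \xi_t \ge 0, \quad \forall t,\ \forall \mathbf{y} \ne \mathbf{y}_t .
\end{aligned}
```

Here $\mathbf{w}$ stacks the per-keypoint vectors $\mathbf{w}_i$, $\mathbf{y}_t$ is the transformation accepted for frame $t$, and $\Delta$ penalizes alternative transformations in proportion to how much they differ from $\mathbf{y}_t$.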
Main Idea
We consider not only spatial structural information
but also temporal sequential information.
The models learned from the tracklets should be
mutually correlated.
Multi-task Model Learning
frames 1, …, t-K+1, …, t-1, t
• temporal model coherence
• a common model $\mathbf{w}_0$ shared by the models $\mathbf{w}_1, \mathbf{w}_2, \mathbf{w}_3, \mathbf{w}_4, \mathbf{w}_5$
Multi-task Model Learning
• A multi-task structured model learning scheme that
• encodes the cross-frame interaction information
• simultaneously optimizes a set of mutually correlated learning
subtasks
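One common way to couple a set of per-frame models through a common model $\mathbf{w}_0$ is the regularized multi-task form sketched below. It is an assumption-laden illustration rather than the slide's exact objective: the decomposition $\mathbf{w}_k = \mathbf{w}_0 + \mathbf{v}_k$, the weights $\lambda_1$ and $\lambda_2$, the loss $\Delta$, and the per-frame score $F_k$ (the total inlier score computed with $\mathbf{w}_k$) are all introduced here for illustration.

```latex
\begin{aligned}
\mathbf{w}_k &= \mathbf{w}_0 + \mathbf{v}_k, \qquad k = 1, \dots, K, \\
\min_{\mathbf{w}_0, \{\mathbf{v}_k\}, \{\xi_k\}} \quad &
  \lambda_1 \|\mathbf{w}_0\|^2 + \frac{\lambda_2}{K} \sum_{k=1}^{K} \|\mathbf{v}_k\|^2 + \sum_{k=1}^{K} \xi_k \\
\text{s.t.} \quad & F_k(C_k, \mathbf{y}_k) - F_k(C_k, \mathbf{y}) \ge \Delta(\mathbf{y}_k, \mathbf{y}) - \xi_k,
  \quad \xi_k \ge 0, \quad \forall k,\ \forall \mathbf{y} \ne \mathbf{y}_k .
\end{aligned}
```

A large $\lambda_2$ relative to $\lambda_1$ pulls every $\mathbf{w}_k$ toward the common model, which enforces temporal coherence; a small one lets each frame's model adapt more freely.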
Main Idea
The original descriptor feature (BRIEF) is not enough!
• It cannot adapt to time-varying tracking situations.
A good feature is important for tracking!
• It can enhance the discriminative power of the tracker.
Metric Learning
Discriminative feature space
• learning a mapping function
$f(\mathbf{d}) = \mathbf{M}^{\top} \mathbf{d}$
• the distance metric
• the loss between a doublet
(if the doublet is similar, $p_{jj'} = 1$; otherwise, $p_{jj'} = 0$.)
Weinberger, Kilian Q., John Blitzer, and Lawrence K. Saul. "Distance metric learning for large margin nearest neighbor classification." NIPS 2005.
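With the mapping $f(\mathbf{d}) = \mathbf{M}^{\top}\mathbf{d}$, a natural distance between two keypoint features is the squared Euclidean distance in the mapped space. The sketch below assumes that form; the function name and the toy matrix are hypothetical, and the slide's exact metric and doublet loss are not reproduced here.

```python
import numpy as np

def mapped_distance(M, d_a, d_b):
    """Squared Euclidean distance between f(d_a) and f(d_b), where f(d) = M^T d."""
    delta = np.asarray(d_a, dtype=float) - np.asarray(d_b, dtype=float)
    diff = np.asarray(M, dtype=float).T @ delta
    return float(diff @ diff)

# Toy mapping that keeps the first two feature dimensions and discards the third.
M = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
d_a = np.array([1.0, 0.0, 1.0])
d_b = np.array([0.0, 0.0, 0.0])
dist = mapped_distance(M, d_a, d_b)
```

In the original space the two toy features differ in two dimensions; after mapping, only the dimension that $\mathbf{M}$ retains contributes to the distance.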
Metric Learning
We use the $\ell_{2,1}$-norm to learn the discriminative information and feature correlation
consistently (feature selection).
Given all the keypoint features from the video frames, we get the training set:
To minimize the following cost function:
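The link between the $\ell_{2,1}$-norm and feature selection can be seen from its definition: it sums the Euclidean norms of the rows of $\mathbf{M}$, so penalizing it pushes whole rows to zero, and a zero row means the corresponding input feature is ignored by $f(\mathbf{d}) = \mathbf{M}^{\top}\mathbf{d}$. A minimal sketch, with a hypothetical function name and toy matrix:

```python
import numpy as np

def l21_norm(M):
    """l2,1-norm of M: the sum of the Euclidean norms of its rows."""
    return float(np.sum(np.linalg.norm(np.asarray(M, dtype=float), axis=1)))

# Rows correspond to input feature dimensions; the second row is all zeros,
# so the second feature has no effect on the mapped descriptor.
M = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, 5.0]])
value = l21_norm(M)  # 5 + 0 + 5
```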
Metric Learning
Visualization of keypoint features using PCA
A joint learning scheme
An alternating optimization algorithm solves the joint optimization problem online.
Experimental Settings
Dataset:
• nine image sequences with ground truth.
• four sequences were recorded by ourselves; five are from Sam Hare.
• available: http://www.zhaoliming.net/research
All these sequences cover several complicated scenarios:
• background clutter
• scale and rotation
• illumination variation
• motion blurring
• partial occlusion
Experimental Settings
Implementation:
• FAST keypoint detector.
• BRIEF binary descriptor.
Criteria:
Given the predicted homography $\mathbf{y}$ and the ground-truth homography $\mathbf{y}^*$, a score $S(\mathbf{y}, \mathbf{y}^*)$ is computed.
A frame with $S(\mathbf{y}, \mathbf{y}^*) < 10$ is regarded as successfully detected
(same as Hare et al., CVPR 2012).
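A criterion of this kind can be sketched as follows, assuming $S$ is the root-mean-square distance between four reference corners projected by the predicted and by the ground-truth homography. That definition, the choice of reference corners, and the function names are assumptions for illustration; only the threshold of 10 comes from the slide.

```python
import numpy as np

def project(H, points):
    """Map 2-D points through a 3x3 homography (with perspective division)."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ np.asarray(H, dtype=float).T
    return homog[:, :2] / homog[:, 2:3]

def homography_score(H_pred, H_true, corners):
    """RMS distance between the corners projected by the two homographies."""
    diff = project(H_pred, corners) - project(H_true, corners)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

def is_detected(H_pred, H_true, corners, threshold=10.0):
    """A frame counts as successfully detected when the score is below the threshold."""
    return homography_score(H_pred, H_true, corners) < threshold

# Toy example: the prediction is off by a pure translation.
corners = [[0, 0], [100, 0], [100, 100], [0, 100]]
H_true = np.eye(3)
H_near = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]])
H_far = np.array([[1.0, 0.0, 30.0], [0.0, 1.0, 40.0], [0.0, 0.0, 1.0]])
near_score = homography_score(H_near, H_true, corners)
far_score = homography_score(H_far, H_true, corners)
```

With these toy values, a shift of (3, 4) moves every corner by 5 and the frame counts as detected, while a shift of (30, 40) moves every corner by 50 and it does not.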
Experimental Results (1)
• Boosting based approach (Grabner, Grabner, and Bischof 2007)
• Structured SVM (SSVM) approach (Hare, Saffari, and Torr 2012)
• A baseline static tracking approach (without model updating)
• Our approach in C++ takes 0.0746 seconds to process one frame.
• Executable binary code is available: http://www.zhaoliming.net/research
Experimental Results (1)
Comparison of three approaches in the accumulated number of falsely detected frames (lower is better).
Experimental Results (2)
• SSVM (Structured SVM, exactly the approach in (Hare, Saffari, and Torr 2012))
• SML (SSVM + metric learning)
• SMT (SSVM + multi-task learning)
• SMM (SSVM + ML + MT, which is exactly our approach)
Experimental Results (2)
Thanks!
Q&A
Project Website: http://www.zhaoliming.net/research
Prof. Xi Li: http://mypage.zju.edu.cn/xilics