Transcript slides
Summarization of Egocentric Video: Object-Driven vs. Story-Driven
Presented by: Elad Osherov, Jan 2013
Today’s talk
Motivation
Related work
Object-driven summarization
Story-driven summarization
Results
Future development
What is Egocentric Video Anyway?
http://xkcd.com/1235/
Motivation
Goal: generate a visual summary of an unedited egocentric video
Input: egocentric video of the camera wearer's day
Output: storyboard (or video skim) summary
Potential Applications of Egocentric Video Summarization
Memory aid
Law enforcement
Mobile robot discovery
Egocentric Video Properties
Long, unedited video
Constant head motion – blur
Moving camera – unstable background
Frequent changes in people and objects
Hand occlusion
Today’s talk
Motivation
Related work
Object-driven summarization
Story-driven summarization
Results
Future development
Related Work
Object recognition in egocentric video [Egocentric Recognition of Handled Objects: Benchmark and Analysis. X. Ren, M. Philipose. CVPR 2009]
Detection and recognition of first-person actions [Detecting Activities of Daily Living in First-Person Camera Views. H. Pirsiavash, D. Ramanan. CVPR 2012]
Data summarization – today!
[A. Rav-Acha, Y. Pritch, and S. Peleg. Making a Long Video Short: Dynamic Video Synopsis. CVPR 2006]
Related Work
[A. Rav-Acha, Y. Pritch, and S. Peleg. Making a Long Video Short: Dynamic Video Synopsis. CVPR 2006]
http://www.vision.huji.ac.il/video-synopsis/
A Few Words About the Authors
Prof. Kristen Grauman – University of Texas at Austin (Department of CS)
Prof. Zheng Lu – City University of Hong Kong (Department of CS)
Dr. Yong Jae Lee – UC Berkeley (Departments of EE & CS)
Prof. Joydeep Ghosh – University of Texas at Austin; Director of IDEAL (Intelligent Data Exploration and Analysis Lab)
Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012.
Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013.
Today’s talk
Motivation
Related work
Object-driven summarization
Story-driven summarization
Results
Future development
Object Driven Video Summarization
Goal – create a storyboard summary of a person's day that is driven by the important people and objects
Important things = things the camera wearer significantly interacts with. Several problems arise:
"Important" is a subjective notion!
What does "significant interaction" really mean?
No priors on people and objects
Algorithm Overview
Train a category-independent important person/object detector: train a regression model to predict region importance (Train)
Segment the video into temporal events (Test)
Group regions of the same object (Test)
Generate a storyboard (Test)
[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]
Annotating Important Regions in Training Video
Data collection: 10 videos, each 3–5 hours long – 37 hours total, from 4 subjects (www.looxcie.com)
Crowd-sourced annotations using MTurk
An object's degree of importance depends heavily on what the camera wearer is doing before, while, and after the object/person appears
The object must be seen in the context of the camera wearer's activity to properly gauge its importance
www.mturk.com/mturk/
Annotating Important Regions in Training Video
Examples: man wearing a blue shirt in a café; yellow notepad on a table; coffee mug the camera wearer drinks from; smartphone the camera wearer holds
For about 3–5 hours of video they obtain ~700 object segmentations
Training a Regression Model
A general-purpose, category-independent model predicts important regions in any egocentric video:
1. Segment each frame into regions
2. For each region, compute a set of candidate features that could describe its importance (egocentric, object & region features)
3. Train a regressor to predict region importance
Egocentric Features
Interaction feature – Euclidean distance of the region's centroid to the closest detected hand
Regions are classified as hands using color likelihoods and a naïve Bayes classifier trained on ground-truth hand annotations
Distance to hand
Egocentric Features
Gaze feature – a coarse estimate of how likely the region is to be the focus of attention: Euclidean distance of the region's centroid to the frame center
Distance to frame center
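The two centroid-distance features above (interaction and gaze) can be sketched as follows; this is a minimal illustration with hypothetical function and variable names, not the authors' implementation:

```python
import numpy as np

def interaction_feature(region_centroid, hand_centroids):
    """Distance from the region centroid to the closest detected hand."""
    if len(hand_centroids) == 0:
        return np.inf  # no hand detected in this frame
    return min(np.linalg.norm(np.asarray(region_centroid) - np.asarray(h))
               for h in hand_centroids)

def gaze_feature(region_centroid, frame_shape):
    """Distance from the region centroid to the frame center,
    a coarse proxy for the camera wearer's focus of attention."""
    h, w = frame_shape[:2]
    center = np.array([w / 2.0, h / 2.0])
    return np.linalg.norm(np.asarray(region_centroid) - center)
```

Smaller values of either feature suggest a more important region (close to a hand, near the center of view).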
Egocentric Features
Frequency feature –
Region matching: color dissimilarity between the region and each region in surrounding frames
Point matching: SIFT features matched between the region and each surrounding frame
Frequency
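A sketch of the region-matching half of the frequency feature, using a chi-squared color-histogram distance; the threshold and all names are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def color_hist(region_pixels, bins=8):
    """Per-channel color histogram of a region, normalized to sum to 1."""
    hist = np.concatenate([
        np.histogram(region_pixels[:, c], bins=bins, range=(0, 256))[0]
        for c in range(3)]).astype(float)
    return hist / max(hist.sum(), 1.0)

def frequency_feature(region_pixels, surrounding_regions, thresh=0.5):
    """Count how many regions in surrounding frames look similar
    (low chi-squared histogram distance) to this region."""
    h = color_hist(region_pixels)
    count = 0
    for other in surrounding_regions:
        g = color_hist(other)
        chi2 = 0.5 * np.sum((h - g) ** 2 / (h + g + 1e-10))
        if chi2 < thresh:
            count += 1
    return count
```

A high count means the region recurs across the surrounding frames, which the model can weigh when predicting importance.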
Object Features
Object-like appearance
A region-ranking function scores each region according to Gestalt cues
[J. Carreira and C. Sminchisescu. Constrained Parametric Min-Cuts for Automatic Object Segmentation. In CVPR, 2010.]
Object Features
Object-like motion
Rank each region by how much its motion pattern differs from that of nearby regions
High scores go to regions that "stand out" from their surroundings during motion
Object-like motion
[Key-Segments for Video Object Segmentation Yong Jae Lee, Jaechul Kim, and Kristen Grauman ICCV 2011]
Object Features
Likelihood of a person's face – compute the maximum overlap score between the region r and any detected face q in the frame
Overlap with face detection
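The overlap score could be computed IoU-style over bounding boxes, as in this sketch (the paper's exact overlap definition may differ; names are hypothetical):

```python
def face_overlap(region_box, face_boxes):
    """Maximum intersection-over-union between a region's bounding box
    and any detected face box; boxes are (x1, y1, x2, y2)."""
    best = 0.0
    rx1, ry1, rx2, ry2 = region_box
    for fx1, fy1, fx2, fy2 in face_boxes:
        ix = max(0, min(rx2, fx2) - max(rx1, fx1))  # intersection width
        iy = max(0, min(ry2, fy2) - max(ry1, fy1))  # intersection height
        inter = ix * iy
        union = ((rx2 - rx1) * (ry2 - ry1)
                 + (fx2 - fx1) * (fy2 - fy1) - inter)
        best = max(best, inter / union if union > 0 else 0.0)
    return best
```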
Train a Regressor to Predict Region Importance
Region features: size, centroid, bounding box centroid, bounding box width, bounding box height
Solve using least squares
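The least-squares step can be sketched as below, on synthetic data; the feature count and all names are hypothetical stand-ins for the paper's candidate features:

```python
import numpy as np

# X: one row of candidate features per training region,
# y: crowd-sourced importance score for that region (here synthetic).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))                 # e.g. 14 candidate features
true_w = rng.normal(size=14)
y = X @ true_w + 0.01 * rng.normal(size=200)   # noisy importance scores

# Append a bias column and solve min_w ||Xw - y||^2 in closed form.
Xb = np.hstack([X, np.ones((200, 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

def predict_importance(features, w=w):
    """Predicted importance of a new region from its feature vector."""
    return np.append(features, 1.0) @ w
```

At test time, every candidate region in a frame gets scored this way and high-scoring regions are treated as important.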
Algorithm Overview
Train a category-independent important person/object detector: train a regression model to predict region importance (Train)
Segment the video into temporal events (Test)
Group regions of the same object (Test)
Generate a storyboard (Test)
[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]
Segmenting the Video into Temporal Events
Compute a pairwise frame-distance matrix
Events allow the summary to include multiple instances of a person or object that is central in multiple contexts in the video
Group frames until the smallest maximal inter-frame distance is larger than two standard deviations beyond the mean
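A simplified sketch of that grouping rule: greedily merge adjacent groups of frames under a complete-link criterion until the best merge exceeds a mean-plus-two-standard-deviations cutoff. The names and the exact merge schedule are assumptions, not the paper's implementation:

```python
import numpy as np

def segment_events(D):
    """Greedy temporal grouping on a pairwise frame-distance matrix D.
    Repeatedly merge the two adjacent groups whose complete-link
    distance (largest inter-frame distance) is smallest, stopping once
    that distance exceeds mean(D) + 2 * std(D)."""
    n = D.shape[0]
    cutoff = D.mean() + 2 * D.std()
    groups = [[i] for i in range(n)]            # start: one frame per group
    while len(groups) > 1:
        # complete-link distance between each pair of adjacent groups
        links = [max(D[a, b] for a in groups[g] for b in groups[g + 1])
                 for g in range(len(groups) - 1)]
        g = int(np.argmin(links))
        if links[g] > cutoff:
            break                               # no merge is cheap enough
        groups[g] = groups[g] + groups[g + 1]
        del groups[g + 1]
    return groups
```

Each returned group is a contiguous run of frames, i.e. one temporal event.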
Algorithm Overview
Train a category-independent important person/object detector: train a regression model to predict region importance (Train)
Segment the video into temporal events (Test)
Group regions of the same object (Test)
Generate a storyboard (Test)
[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]
Discovering an Event's Key People and Objects
Score each frame region using the regressor
Group instances of the same object/person together
Form a pool of high-scoring clusters
Remove clusters with high affinity to a cluster with higher importance I(r)
For each remaining cluster, select the region with the highest importance as its representative
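The final selection step above can be sketched in a couple of lines (hypothetical data structures: clusters as lists of region ids, importance as a mapping from region id to predicted score):

```python
def event_representatives(clusters, importance):
    """Pick, for each cluster of region instances, the single region
    with the highest predicted importance as its representative."""
    return [max(c, key=lambda r: importance[r]) for c in clusters]
```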
Generating a Storyboard Summary
Each event can display a different number of frames, depending on how many unique important things the method discovers
Results
Important region prediction accuracy
Results
Which cues matter most for predicting importance?
Top 28 features with the highest learned weights
Low scores: interaction and frequency pair; object-like region that is frequent
Results
Egocentric video summarization accuracy
Results
User studies to evaluate summaries
The camera wearer answered 2 quality questions: were the important objects/people captured? What is the overall summary quality?
Better results than the baselines in ~69% of the summaries
Today’s talk
Motivation
Related work
Object-driven summarization
Story-driven summarization
Results
Future development
Story Driven Video Summarization
Good summary captures the progress of the story!
Segment the video temporally into subshots
Select a chain of K subshots that maximizes both the weakest link's influence and object importance
Each subshot "leads to" the next through some subset of influential objects
[Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013]
Document–Document Influence [Shahaf & Guestrin, KDD 2010]
Connecting the Dots Between News Articles. D. Shahaf and C. Guestrin. In KDD, 2010.
Egocentric Subshot Detection
Define 3 generic ego-activities: static, in transit, head moving
Train classifiers to predict these activity types
Features based on blur and optical flow; classify with an SVM
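As a self-contained illustration of the SVM step, here is a Pegasos-style sub-gradient trainer for a linear SVM (binary case; the three ego-activities would need e.g. one-vs-rest on top, and the blur/flow features and all names here are hypothetical):

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Minimal Pegasos-style training of a linear SVM through the origin.
    X: (n, d) feature matrix (e.g. blur score + optical-flow statistics),
    y: labels in {-1, +1} (e.g. 'static' vs. 'in transit')."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            if y[i] * (X[i] @ w) < 1:          # hinge-loss violation
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1 - eta * lam) * w        # shrink (regularization)
    return w
```

A bias term can be handled by appending a constant feature to X; prediction is simply the sign of `X @ w`.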
Temporal Subshot Segmentation
Tailored to egocentric video – detects ego-activities
Provides an over-segmentation: a typical subshot lasts ~15 seconds
Subshot Selection Objective
Given the series of subshots segmented from the input video, the goal is to select the optimal K-node chain of subshots
Story Progress Between Subshots
A good story is a coherent chain of subshots, where each subshot strongly influences the next
Predicting Influence Between Subshots
[Figure: graph of subshots linked by example pairwise influence weights (0.2, 0.1, 0.01, 0.003, ...)]
Predicting Influence Between Subshots
A sink node is added to the graph; the score captures how reachable subshot j is from subshot i via object o
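A rough sketch in the spirit of Shahaf & Guestrin's influence measure: model subshots and objects as nodes of a random-walk graph with a sink, and measure how much cutting out an object's node lowers the chance of reaching subshot j from subshot i. This is an illustrative simplification with hypothetical names, not the paper's exact formulation:

```python
import numpy as np

def hit_prob(P, start, target, sink):
    """Probability that a random walk with row-stochastic transition
    matrix P, started at `start`, reaches `target` before the sink."""
    n = P.shape[0]
    # Solve h = P h with boundary conditions h[target] = 1, h[sink] = 0.
    A = np.eye(n) - P
    A[target] = 0.0; A[target, target] = 1.0
    A[sink] = 0.0;   A[sink, sink] = 1.0
    b = np.zeros(n); b[target] = 1.0
    return np.linalg.solve(A, b)[start]

def object_influence(P, i, j, o, sink):
    """Drop in reachability of subshot j from subshot i when object
    node o is cut out (its outgoing mass re-routed to the sink)."""
    P_wo = P.copy()
    P_wo[o] = 0.0
    P_wo[o, sink] = 1.0   # object o now leads nowhere useful
    return hit_prob(P, i, j, sink) - hit_prob(P_wo, i, j, sink)
```

If subshot i reaches subshot j mainly through object o, removing o's node causes a large drop, so o is an influential object for that transition.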
Subshot Selection Objective
Given the series of subshots segmented from the input video, the goal is to select the optimal K-node chain of subshots
Predicting Diversity Among Transitions
Compute GIST and color histograms for each frame in each subshot; quantize them into 55 scene types
Compute, for each pair of adjacent subshots (s_j, s_{j+1}) in the chain, a scene-dissimilarity term, summed over j = 1 … K−1
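A toy sketch of this quantize-then-compare idea; a tiny k-means stands in for whatever quantizer the authors use, and all names are hypothetical:

```python
import numpy as np

def scene_types(frame_descriptors, k=55, iters=20, seed=0):
    """Quantize per-frame descriptors (e.g. GIST + color histogram)
    into at most k scene types with a tiny k-means."""
    rng = np.random.default_rng(seed)
    X = np.asarray(frame_descriptors, dtype=float)
    centers = X[rng.choice(len(X), size=min(k, len(X)), replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for c in range(len(centers)):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def diversity(chain_labels):
    """Fraction of adjacent subshot pairs whose dominant scene types
    differ -- rewards chains that move between distinct scenes."""
    pairs = list(zip(chain_labels[:-1], chain_labels[1:]))
    return sum(a != b for a, b in pairs) / max(len(pairs), 1)
```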
Coherent Object Activation Patterns
Story-driven vs. uniform sampling (object activation patterns)
Prefer activating few objects at once, with coherent (smooth) entrance/exit patterns
Solve with linear programming and a priority queue
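The paper enforces this with a linear program; purely as a rough illustration of what the preference rewards, one could score an activation pattern by its peak number of simultaneously active objects plus its total number of entrances/exits (all names hypothetical):

```python
import numpy as np

def activation_cost(activation):
    """activation[o] is the 0/1 pattern of object o across the K chosen
    subshots. Penalize many simultaneously active objects and ragged
    (non-contiguous) activation patterns."""
    A = np.asarray(activation)
    simultaneous = A.sum(axis=0).max()           # peak # of active objects
    switches = np.abs(np.diff(A, axis=1)).sum()  # total entrances/exits
    return simultaneous + switches
```

Lower cost means fewer objects active at once and smoother entrance/exit patterns, matching the preference stated above.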
Today’s talk
Motivation
Related work
Object-driven summarization
Story-driven summarization
Results
Future development
Results
UTE: 4 videos, each 3–5 hours long, uncontrolled setting
ADL: 20 videos, each 20–60 minutes, daily activities in a house
Results
Baselines:
1. Uniform sampling of K subshots
2. Shortest path – K subshots with minimal bag-of-objects distance between each other
3. Object-driven – only for the UTE set
Parameters: K = 4…8; simultaneously active objects: 80 (UTE), 15 (ADL)
Results
Test methodology
34 human subjects, ages 18–60
12 hours of original video
Each comparison done by 5 subjects
In total: 535 tasks, 45 hours of subject time
Probably the most comprehensive egocentric summarization evaluation established so far!
Results
Blind taste test:
Show a sped-up version of the original video
Show the story-driven summary and one of the baselines
Ask: which summary better shows the progress of the story? Pay attention to the relationships among sub-events, redundancy, and the representativeness of each sub-event
In 51% of the comparisons all 5 subjects voted story-driven; in only 9% did story-driven win by a single vote
Results
Discovering influential objects
Ground truth: N = 42 objects, annotated by 3 workers on MTurk
Baseline: frequency of objects in the video
Results show the method’s advantage. The most influential objects need not be the most frequent!
Results
Where does the method fail?
When the story is uneventful
When there are multiple interwoven threads
Further Development
Use richer machine-learning techniques instead of simple pair-wise regression
Extend subshot descriptions to detect actions
Augment the summary with a location service such as GPS
Improve the success ratio
Automatic storyboard maps
Pros and Cons
Pros:
Well written
Well referenced
Novel solution
Large and detailed human experiment
Detailed website
Cons:
Very complicated material
No source code publicly available
No real competition
Computationally demanding