Slide transcript

Summarization of Egocentric Video: Object-Driven vs. Story-Driven

Presented by: Elad Osherov, Jan 2013

Today’s talk

- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development

What is Egocentric Video, Anyway?

http://xkcd.com/1235/


Motivation

Goal: generate a visual summary of an unedited egocentric video.
Input: egocentric video of the camera wearer's day.
Output: a storyboard (or video skim) summary.

Potential Applications of Egocentric Video Summarization

- Memory aid
- Law enforcement
- Mobile robot discovery

Egocentric Video Properties

- Long, unedited video
- Constant head motion, causing blur
- Moving camera, so the background is unstable
- Frequent changes in people and objects
- Hand occlusions

Today’s talk

- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development

Related Work

- Object recognition in egocentric video [X. Ren and M. Philipose, "Egocentric Recognition of Handled Objects: Benchmark and Analysis," CVPR 2009]
- Detection and recognition of first-person actions [H. Pirsiavash and D. Ramanan, "Detecting Activities of Daily Living in First-Person Camera Views," CVPR 2012]
- Data summarization – today!

[A. Rav-Acha, Y. Pritch, and S. Peleg, "Making a Long Video Short: Dynamic Video Synopsis," CVPR 2006]

Related Work

[A. Rav-Acha, Y. Pritch, and S. Peleg, "Making a Long Video Short: Dynamic Video Synopsis," CVPR 2006] http://www.vision.huji.ac.il/video-synopsis/

A Few Words About the Authors

- Prof. Kristen Grauman, University of Texas at Austin (Department of CS)
- Prof. Zheng Lu, City University of Hong Kong (Department of CS)
- Dr. Yong Jae Lee, UC Berkeley (Departments of EE & CS)
- Prof. Joydeep Ghosh, University of Texas at Austin; Director of IDEAL (Intelligent Data Exploration and Analysis Lab)

The two papers covered today:
- Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012.
- Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013.

Today’s talk

- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development

Object-Driven Video Summarization

Goal: create a storyboard summary of a person's day, driven by the important people and objects.

Important things are those the camera wearer significantly interacts with. Several problems arise:
- "Important" is a subjective notion!
- What does "significant interaction" really mean?
- There are no priors on the people and objects.

Algorithm Overview

1. Train a category-independent important person/object detector (train)
2. Train a regression model to predict region importance (train)
3. Segment the video into temporal events (test)
4. Group regions of the same object (test)
5. Generate a storyboard (test)

[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]

Annotating Important Regions in Training Video

- Data collection: 10 videos, each 3-5 hours long (37 hours total), from 4 subjects wearing a Looxcie camera (www.looxcie.com)
- Annotations are crowdsourced on MTurk (www.mturk.com/mturk/)
- An object's degree of importance depends heavily on what the camera wearer is doing before, while, and after the object/person appears
- The object must therefore be seen in the context of the camera wearer's activity to properly gauge its importance

Example annotations: a man wearing a blue shirt in a café; a yellow notepad on a table; the coffee mug the camera wearer drinks from; the smartphone the camera wearer holds. For 3-5 hours of video, this yields about 700 object segmentations.

Training a Regression Model

A general-purpose, category-independent model predicts important regions in any egocentric video:
1. Segment each frame into regions
2. For each region, compute a set of candidate features that could describe its importance (egocentric, object, and region features)
3. Train a regressor to predict region importance

Egocentric Features

- Interaction feature ("distance to hand"): the Euclidean distance from the region's centroid to the closest detected hand
- Regions are classified as hand or not-hand using color likelihoods and a naive Bayes classifier trained on ground-truth hand annotations

Egocentric Features

- Gaze feature ("distance to frame center"): a coarse estimate of how likely the region is to be the focus of attention, measured as the Euclidean distance from the region's centroid to the frame center
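Both egocentric distance features reduce to simple centroid geometry. A minimal Python sketch, assuming region centroids and per-frame hand detections are already available (the function names and inputs are placeholders, not the authors' code):

import numpy as np

def interaction_feature(region_centroid, hand_centroids):
    """Euclidean distance from the region's centroid to the closest detected hand."""
    if len(hand_centroids) == 0:
        return np.inf  # no hand detected in this frame
    dists = np.linalg.norm(np.asarray(hand_centroids) - np.asarray(region_centroid), axis=1)
    return dists.min()

def gaze_feature(region_centroid, frame_shape):
    """Euclidean distance from the region's centroid to the frame center,
    a coarse proxy for how likely the region is the focus of attention."""
    h, w = frame_shape[:2]
    center = np.array([w / 2.0, h / 2.0])
    return np.linalg.norm(np.asarray(region_centroid) - center)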

Egocentric Features

- Frequency feature: how often the region reappears in surrounding frames
  - Region matching: color dissimilarity between the region and each region in surrounding frames
  - Point matching: SIFT feature matches between the region and each surrounding frame
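The region-matching half of the frequency feature can be sketched as counting how often the region finds a close color match in surrounding frames; the histogram representation and match threshold below are illustrative assumptions, and the SIFT point-matching half is omitted:

import numpy as np

def color_hist(region_pixels, bins=8):
    """Normalized joint RGB histogram of a region's pixels (N x 3 array in [0, 256))."""
    hist, _ = np.histogramdd(region_pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / max(hist.sum(), 1)

def frequency_feature(region_pixels, surrounding_regions, thresh=0.3):
    """Fraction of surrounding frames that contain a region with small
    chi-squared color dissimilarity to the query region."""
    h1 = color_hist(region_pixels)
    matched = 0
    for frame_regions in surrounding_regions:   # one list of regions per frame
        dissims = []
        for pixels in frame_regions:
            h2 = color_hist(pixels)
            dissims.append(0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-10)))
        if dissims and min(dissims) < thresh:
            matched += 1
    return matched / max(len(surrounding_regions), 1)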

Object Features

- Object-like appearance: a region-ranking function scores each region according to Gestalt cues [J. Carreira and C. Sminchisescu, "Constrained Parametric Min-Cuts for Automatic Object Segmentation," CVPR 2010]

Object Features

- Object-like motion: rank each region by how much its motion pattern differs from that of nearby regions; high scores go to regions that "stand out" from their surroundings during motion [Key-Segments for Video Object Segmentation. Yong Jae Lee, Jaechul Kim, and Kristen Grauman. ICCV 2011]

Object Features

- Likelihood of a person's face: the maximum overlap score between the region r and any detected face q in the frame
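Reading "overlap score" as intersection-over-union (an assumption; the paper may normalize differently), the face feature is a short loop over binary masks:

import numpy as np

def face_overlap_feature(region_mask, face_masks):
    """Maximum overlap between the region and any detected face in the frame."""
    best = 0.0
    for q in face_masks:                      # one binary mask per face detection
        inter = np.logical_and(region_mask, q).sum()
        union = np.logical_or(region_mask, q).sum()
        if union > 0:
            best = max(best, inter / union)
    return best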

Training a Regressor to Predict Region Importance

- Region features: size, centroid, bounding box centroid, bounding box width, and bounding box height
- Solve using least squares
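Because the model is fit with least squares, training is a single linear solve. A sketch, assuming the egocentric, object, and region features above have been stacked into one row per annotated training region:

import numpy as np

def train_importance_regressor(X, y):
    """X: (n_regions, n_features) feature matrix; y: annotated importance scores.
    Returns the learned weights (with a bias term appended)."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # add bias column
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def predict_importance(w, X):
    """Predicted importance I(r) for each region."""
    return np.hstack([X, np.ones((X.shape[0], 1))]) @ w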

Algorithm Overview

1. Train a category-independent important person/object detector (train)
2. Train a regression model to predict region importance (train)
3. Segment the video into temporal events (test)
4. Group regions of the same object (test)
5. Generate a storyboard (test)

[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]

Segmenting the Video into Temporal Events

- Compute a pairwise distance matrix over frames
- Events allow the summary to include multiple instances of a person or object that is central in several contexts in the video
- Group frames until the smallest maximum inter-frame distance is larger than two standard deviations beyond the mean
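The grouping rule maps naturally onto complete-link agglomerative clustering with a data-driven cutoff. The sketch below ignores the temporal contiguity of events for brevity, so it approximates rather than reproduces the paper's grouping:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def segment_events(frame_features):
    """frame_features: (n_frames, d) descriptors (e.g., color/GIST histograms)."""
    d = pdist(frame_features)              # condensed pairwise distance matrix
    cutoff = d.mean() + 2 * d.std()        # "two STDs beyond the mean"
    Z = linkage(d, method='complete')      # merges by smallest *maximum* distance
    return fcluster(Z, t=cutoff, criterion='distance')   # event label per frame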

Algorithm Overview

1. Train a category-independent important person/object detector (train)
2. Train a regression model to predict region importance (train)
3. Segment the video into temporal events (test)
4. Group regions of the same object (test)
5. Generate a storyboard (test)

[Discovering Important People and Objects for Egocentric Video Summarization. Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman. CVPR 2012]

Discovering an Event's Key People and Objects

1. Score each frame region using the regressor
2. Group instances of the same object/person together
3. Form a pool of high-scoring clusters
4. Remove clusters with high affinity to a cluster of higher importance I(r)
5. For each remaining cluster, select the region with the highest importance as its representative
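A sketch of these steps with the clustering method, thresholds, and cross-cluster suppression simplified (illustration choices, not the paper's exact procedure):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def key_regions(region_feats, importance, affinity_thresh=0.5, score_thresh=0.7):
    """region_feats: (n, d) appearance features; importance: (n,) regressor scores.
    Returns indices of one representative region per surviving cluster."""
    labels = fcluster(linkage(pdist(region_feats), 'average'),
                      t=affinity_thresh, criterion='distance')
    reps = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        if importance[idx].max() < score_thresh:
            continue                                  # drop low-scoring clusters
        reps.append(idx[np.argmax(importance[idx])])  # highest-importance region
    return reps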

Generating a Storyboard Summary

Each event can display a different number of frames, depending on how many unique important things the method discovers.


Results

(Plots: important region prediction accuracy.)

Results

Which cues matter most for predicting importance?

- The top 28 features with the highest learned weights
- Low weights for the interaction-and-frequency pair, i.e., an object-like region that appears frequently


Results

(Plots: egocentric video summarization accuracy.)

Results

User studies to evaluate the summaries:
- The camera wearer answers two quality questions: were the important objects/people captured, and what is the overall summary quality?
- The method gives better results in ~69% of the summaries

Today’s talk

- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development

Story-Driven Video Summarization

- A good summary captures the progress of the story!
- Segment the video temporally into subshots
- Select the chain of K subshots that maximizes both the weakest link's influence and object importance
- Each subshot "leads to" the next through some subset of influential objects

[Story-Driven Summarization for Egocentric Video. Zheng Lu and Kristen Grauman. CVPR 2013]

Document-to-Document Influence [D. Shahaf and C. Guestrin, "Connecting the Dots Between News Articles," KDD 2010]


Egocentric Subshot Detection

- Define three generic ego-activities: static, in transit, and head moving
- Train classifiers to predict these activity types
- Features are based on blur and optical flow
- Classify with an SVM
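With the blur and optical-flow statistics stacked into per-frame feature vectors, the classification step is a standard SVM. A minimal scikit-learn sketch (the kernel and hyperparameters are assumptions):

import numpy as np
from sklearn.svm import SVC

# X: (n_frames, n_features) blur + optical-flow features
# y: labels in {0: static, 1: in transit, 2: head moving}
def train_ego_activity_classifier(X, y):
    clf = SVC(kernel='rbf', C=1.0, gamma='scale')
    clf.fit(X, y)
    return clf

# Example: labels = train_ego_activity_classifier(X_train, y_train).predict(X_test)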

Temporal Subshot Segmentation

Tailored to egocentric video: it detects ego-activities and provides an over-segmentation; a typical subshot lasts ~15 seconds.

Subshot Selection Objective

Given the series of subshots segmented from the input video, the goal is to select the optimal K-node chain of subshots.

Story Progress Between Subshots

A good story is a coherent chain of subshots, where each subshot strongly influences the next.
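Taken on its own, the weakest-link criterion can be optimized with a maximin dynamic program over temporally ordered subshots. This sketch illustrates only that piece; the full objective also weighs object importance and diversity:

import numpy as np

def best_chain(infl, K):
    """infl: (n, n) pairwise influence between subshots (i < j in time).
    Returns the K-subshot chain maximizing the minimum link influence."""
    n = infl.shape[0]
    best = np.full((K + 1, n), -np.inf)
    parent = np.zeros((K + 1, n), dtype=int)
    best[1] = np.inf                       # a single-node chain has no link yet
    for k in range(2, K + 1):
        for j in range(n):
            for i in range(j):             # chains must respect temporal order
                v = min(best[k - 1, i], infl[i, j])
                if v > best[k, j]:
                    best[k, j], parent[k, j] = v, i
    j = int(np.argmax(best[K]))
    chain = [j]
    for k in range(K, 1, -1):
        j = parent[k, j]
        chain.append(j)
    return chain[::-1]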

Predicting Influence Between Subshots

(Figure: a graph of subshots with example pairwise influence scores.)

- The influence score captures how reachable subshot j is from subshot i via object o; object o acts as a sink node in the random-walk computation
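A rough sketch of the sink-node idea in the spirit of Shahaf & Guestrin's random-walk formulation: compare how much probability mass from subshot i reaches subshot j with and without object node o absorbing the walk. The graph construction and restart parameter are assumptions:

import numpy as np

def rwr(P, start, alpha=0.15, iters=100):
    """Random walk with restart on a row-stochastic transition matrix P."""
    n = P.shape[0]
    e = np.zeros(n)
    e[start] = 1.0
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * pi @ P
    return pi

def influence_via_object(P, i, j, o):
    """How much of subshot i's reachability of subshot j flows through
    object node o: full walk minus the walk with o turned into a sink."""
    P_sink = P.copy()
    P_sink[o] = 0.0
    P_sink[o, o] = 1.0                     # a walker entering o never leaves
    return rwr(P, i)[j] - rwr(P_sink, i)[j]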

Subshot Selection Objective

Recall: given the series of subshots segmented from the input video, the goal is to select the optimal K-node chain of subshots.

Predicting Diversity Among Transitions

- Compute GIST and color histograms for each frame in each subshot, and quantize them into 55 scene types
- For each two adjacent subshots in the chain, compute the diversity term:

D = 1/(K-1) * sum_{j=1..K-1} ( 1 - exp(-chi^2(s_j, s_{j+1})) )
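A direct transcription of the diversity term, assuming chi-squared distance between normalized 55-bin scene-type histograms:

import numpy as np

def chi2(h1, h2):
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-10))

def diversity(scene_hists):
    """scene_hists: K normalized 55-bin scene-type histograms, in chain order (K >= 2)."""
    K = len(scene_hists)
    terms = [1.0 - np.exp(-chi2(scene_hists[j], scene_hists[j + 1]))
             for j in range(K - 1)]
    return sum(terms) / (K - 1)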

Coherent Object Activation Patterns

(Figure: object activation patterns, story-driven vs. uniform sampling.)

- Prefer activating few objects at once, with coherent (smooth) entrance/exit patterns
- Solved with linear programming and a priority queue

Today’s talk

- Motivation
- Related Work
- Object-driven summarization
- Story-driven summarization
- Results
- Future Development

Results

- UTE dataset: 4 videos, each 3-5 hours long, uncontrolled setting
- ADL dataset: 20 videos, each 20-60 minutes, daily activities in a house

Results

Baselines:
1. Uniform sampling of K subshots
2. Shortest path: the K subshots with the minimal bag-of-objects distance between each other
3. Object-driven summarization (UTE set only)

Parameters:
- K = 4...8
- Objective weights: λs = 1, λd = 0.5
- Simultaneously active objects: 80 (UTE), 15 (ADL)

Results

Test methodology:
- 34 human subjects, ages 18-60
- 12 hours of original video
- Each comparison done by 5 subjects
- 535 tasks in total, 45 hours of subject time
- Probably the most comprehensive egocentric summarization study ever conducted!


Results

Blind taste test:
- Show a sped-up version of the original video
- Show the story-driven summary and one of the baselines
- Ask: which summary better shows the progress of the story? Pay attention to the relationships among sub-events, redundancy, and the representativeness of each sub-event
- In 51% of the comparisons, all 5 subjects voted for the story-driven summary; in only 9% did it win by a single vote

Results

Discovering influential objects:
- Ground truth: N = 42 objects annotated by 3 workers on MTurk
- Baseline: the frequency with which objects appear in the video
- The results show the method's advantage: the most influential objects need not be the most frequent!


Results

Where does the method fail?
- When the story is uneventful
- When there are multiple interwoven threads

Further Development

- Better use of machine learning techniques instead of simple pairwise regression
- Extend the subshot descriptions to detect actions
- Augment the summary with a location service such as GPS
- Improve the success ratio


Automatic storyboard maps

Pros and Cons

Pros:
- Well written
- Well referenced
- Novel solution
- Large and detailed human experiment
- Detailed website

Cons:
- Very complicated material
- No source code publicly available
- No real competition
- Computationally demanding
