Action Recognition


A general survey of previous works on action recognition
Sobhan Naderi Parizi
September 2009

Statistical Analysis of Dynamic Actions

On Space-Time Interest Points

Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

What, where and who? Classifying events by scene and object recognition

Recognizing Actions at a Distance

Recognizing Human Actions: A Local SVM Approach

Retrieving Actions in Movies

Learning Realistic Human Actions from Movies

Actions in Context

Selection and Context for Action Recognition

Paper info:
 Title:
▪ Statistical Analysis of Dynamic Actions
 Authors:
▪ Lihi Zelnik-Manor
▪ Michal Irani
 TPAMI 2006
 A preliminary version appeared in CVPR 2001
▪ “Event-Based Analysis of Video”

Overview:
 Introduce a non-parametric distance measure
 Video matching (no action model): given a reference
video, similar sequences are found
 Dense features from multiple temporal scales (only
corresponding scales are compared)
 The temporal extent of videos in each category should be the same (fast and slow dancing are treated as different actions)
 A new database is introduced
▪ Periodic activities (walk)
▪ Non-periodic activities (Punch, Kick, Duck, Tennis)
▪ Temporal Textures (water)
▪ www.wisdom.weizmann.ac.il/~vision/EventDetection.html

Feature description:
 Space-time gradient of each pixel
 Threshold the gradient magnitudes
 Normalization (ignoring appearance)
 Absolute value (invariant to dark/light transitions)
▪ Direction invariant
▪ $(N_x^l, N_y^l, N_t^l) = \frac{(|S_x^l|,\ |S_y^l|,\ |S_t^l|)}{\sqrt{(S_x^l)^2 + (S_y^l)^2 + (S_t^l)^2}}$
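
To make the normalization above concrete, here is a minimal NumPy sketch, assuming a grayscale video volume of shape (T, H, W) at a single temporal scale; the magnitude threshold and blur width are illustrative values, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def normalized_spacetime_gradients(video, magnitude_thresh=10.0, blur_sigma=1.5):
    """Per-pixel normalized absolute space-time gradients (N_x, N_y, N_t).

    video: float array of shape (T, H, W) at one temporal scale.
    Pixels whose gradient magnitude falls below the threshold are zeroed.
    """
    # Blur each frame first (robustness to appearance, e.g. textured clothing)
    video = gaussian_filter(video.astype(np.float64), sigma=(0, blur_sigma, blur_sigma))

    # Space-time gradients S_t, S_y, S_x (axes: time, row, column)
    St, Sy, Sx = np.gradient(video)

    magnitude = np.sqrt(Sx**2 + Sy**2 + St**2)
    mask = magnitude > magnitude_thresh
    safe = np.where(mask, magnitude, 1.0)  # avoid division by zero

    # Absolute values: invariance to dark/light transitions and to direction
    Nx = np.where(mask, np.abs(Sx) / safe, 0.0)
    Ny = np.where(mask, np.abs(Sy) / safe, 0.0)
    Nt = np.where(mask, np.abs(St) / safe, 0.0)
    return Nx, Ny, Nt
```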

Comments:
 Actions are represented by 3L independent 1D
distributions (L being number of temporal scales)
 The frames are blurred first
▪ Robust to changes of appearance, e.g. highly textured clothing
 Action recognition/localization
▪ For a test video sequence S and a reference sequence of T
frames:
▪ Each consecutive sub-sequence of length T is compared to the reference
▪ In case of multiple reference videos:
▪ Mahalanobis distance
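
A rough sketch of the sliding-window matching idea, assuming each frame has already been summarized by a histogram of the normalized gradient components; the χ² histogram distance used here is only a stand-in for the paper's measure, and the Mahalanobis variant for multiple references is omitted.

```python
import numpy as np

def chi_square(h1, h2, eps=1e-9):
    """Symmetric chi-square distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def match_reference(test_histograms, ref_histogram, T):
    """Score every T-frame window of the test sequence against the reference.

    test_histograms: (num_frames, num_bins) per-frame feature histograms.
    ref_histogram:   (num_bins,) normalized histogram of the T-frame reference.
    Returns one distance per window start; low values mean similar actions.
    """
    num_frames = test_histograms.shape[0]
    scores = []
    for start in range(num_frames - T + 1):
        window = test_histograms[start:start + T].sum(axis=0)
        window = window / (window.sum() + 1e-9)
        scores.append(chi_square(window, ref_histogram))
    return np.array(scores)
```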

Paper info:
 Title:
▪ On Space-Time Interest Points
 Authors:
▪ Ivan Laptev: INRIA / IRISA
 IJCV 2005


 Extends the Harris detector to 3D (space-time)
 Local space-time points with non-constant motion:
▪ Points with accelerated motion: physical forces
 Independent space and time scales
 Automatic scale selection

Automatic scale selection procedure:
 Detect interest points
 Move in the direction of optimal scale
 Repeat until locally optimal scale is reached
(iterative)
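
For reference, a compact sketch of the space-time cornerness that the detector thresholds, assuming a grayscale video volume; the smoothing scales, the integration factor s and the constant k are illustrative defaults, and scale selection and non-maximum suppression are left out.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_harris(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    """Space-time Harris cornerness H = det(mu) - k * trace(mu)^3.

    video: float array (T, H, W); sigma/tau are spatial/temporal scales,
    s is the integration-to-local scale ratio.
    """
    V = gaussian_filter(video.astype(np.float64), sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(V)

    # Second-moment matrix entries, smoothed at the integration scale
    smooth = lambda A: gaussian_filter(A, sigma=(s * tau, s * sigma, s * sigma))
    Mxx, Myy, Mtt = smooth(Lx * Lx), smooth(Ly * Ly), smooth(Lt * Lt)
    Mxy, Mxt, Myt = smooth(Lx * Ly), smooth(Lx * Lt), smooth(Ly * Lt)

    det = (Mxx * (Myy * Mtt - Myt**2)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    trace = Mxx + Myy + Mtt
    return det - k * trace**3  # local maxima are interest-point candidates
```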

The procedure cannot be used in real time:
 Future frames are needed
 There exist estimation approaches to solve this
problem

Paper info:
 Title:
▪ Unsupervised Learning of Human Action Categories
Using Spatial-Temporal Words
 Authors:
▪ Juan Carlos Niebles: University of Illinois
▪ Hongcheng Wang: University of Illinois
▪ Li Fei-Fei: University of Illinois
 BMVC 2006


 Generative graphical model (pLSA)
 The STIP detector of Piotr Dollár et al. is used
▪ Laptev’s STIP detector is too sparse
 A dictionary of video words is created
 The method is unsupervised
 Simultaneous action recognition/localization
 Evaluations on:
▪ KTH action database
▪ Skating actions database (4 action classes)

Overview of the method:



 w: video word
 d: video sequence
 z: latent topic (action category)
 $P(w_i, d_j) = \sum_{k=1}^{K} P(w_i, d_j, z_k) = P(d_j) \sum_{k=1}^{K} P(z_k \mid d_j)\, P(w_i \mid z_k)$

Feature descriptor:
 Brightness gradient + PCA
 Brightness gradients were found equivalent to optical flow for capturing motion

Multiple actions can be localized in the video:
 $P(z_k \mid w_i, d_j) = \frac{P(w_i \mid z_k)\, P(z_k \mid d_j)}{\sum_{l=1}^{K} P(w_i \mid z_l)\, P(z_l \mid d_j)}$

Average classification accuracy:
 KTH action database: 81.5%
 Skating dataset: 80.67%

Paper info:
 Title:
▪ What, where and who? Classifying events by scene and
object recognition
 Authors:
▪ Li-Jia Li: University of Illinois
▪ Li Fei-Fei: Princeton University
 ICCV 2007

Goal of the paper:
 Event classification in still images
 Scene labeling
 Object labeling

Approach:
 Generative graphical model
 Assumes that objects and scenes are independent
given the event category
 Ignores spatial relationships between objects

Information channels:
 Scene context (holistic representation)
 Object appearance
 Geometrical layout (sky at infinity/vertical
structure/ground plane)

Feature extraction:
 12x12 patches obtained by grid sampling (10x10)
 For each patch:
▪ SIFT feature (used both for scene and object models)
▪ Layout label (used only for object model)

The graphical model
 E: event
 S: scene
 O: object
 X: scene feature
 A: appearance feature
 G: geometry layout

A new database is compiled:
 8 sport event categories (images downloaded from the web)
 Bocce, croquet, polo, rowing, snowboarding,
badminton, sailing, rock climbing

Average classification
accuracy over all 8
event classes = 74.3%

Sample results:

Paper info:
 Title:
▪ Recognizing Actions at a Distance
 Authors:
▪ Alexei A. Efros: UC Berkeley
▪ Alexander C. Berg: UC Berkeley
▪ Greg Mori: UC Berkeley
▪ Jitendra Malik: UC Berkeley
 ICCV 2003

Overall review:




 Actions at medium resolution (person roughly 30 pixels tall)
 Proposes a new motion descriptor
 KNN for classification
 A consistent tracking bounding box of the actor is required
▪ Action recognition is done only on the tracking bounding box
▪ Motion is described as the relative movement of body parts
▪ No information about movements is given by the tracker

Motion Feature:
 For each frame, a local temporal neighborhood is
considered
 Optical flow is extracted (other alternatives:
image pixel values, temporal gradients)
 OF is noisy:
▪ half-wave rectification + blurring
 To preserve motion information:
▪ the OF vector is decomposed into its vertical/horizontal components
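
A small sketch of the channel construction described above, assuming the optical flow fields Fx and Fy have already been computed by some flow estimator; the blur width is an illustrative choice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_channels(flow_x, flow_y, blur_sigma=3.0):
    """Half-wave rectify the flow into four non-negative channels
    (x+, x-, y+, y-) and blur each one to suppress noise."""
    channels = [
        np.maximum(flow_x, 0),   # rightward motion
        np.maximum(-flow_x, 0),  # leftward motion
        np.maximum(flow_y, 0),   # downward motion
        np.maximum(-flow_y, 0),  # upward motion
    ]
    return [gaussian_filter(c, blur_sigma) for c in channels]
```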

Similarity measure:
 i, j: frame indices
 T: temporal extent
 I: spatial extent
 A: 1st video sequence, with motion channels $\{a_1^i, a_2^i, a_3^i, a_4^i\}$ per frame i
 B: 2nd video sequence, with motion channels $\{b_1^j, b_2^j, b_3^j, b_4^j\}$ per frame j
 $S(i, j) = \sum_{t \in T} \sum_{c=1}^{4} \sum_{(x, y) \in I} a_c^{i+t}(x, y)\, b_c^{j+t}(x, y)$
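
A direct transcription of S(i, j), assuming each sequence is stored as an array of shape (num_frames, 4, H, W) holding the four blurred motion channels per frame.

```python
import numpy as np

def frame_similarity(A, B, i, j, temporal_extent):
    """S(i, j): correlate the motion channels of frame i of A with frame j
    of B over a symmetric temporal window.

    A, B: arrays of shape (num_frames, 4, H, W).
    """
    total = 0.0
    for t in range(-temporal_extent, temporal_extent + 1):
        if 0 <= i + t < len(A) and 0 <= j + t < len(B):
            total += np.sum(A[i + t] * B[j + t])  # sum over channels and pixels
    return total
```

Nearest-neighbour classification then labels frame i of the test sequence with the label of the training frame j that maximizes S(i, j).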

New Dataset:
 Ballet (stationary camera):
▪ 16 action classes
▪ 2 men + 2 women
▪ Easy dataset (controlled environment)
 Tennis (real action, stationary camera):
▪ 6 action classes (stand, swing, move-left, …)
▪ different days/location/camera position
▪ 2 players (man + woman)
 Football (real action, moving camera):
▪ 8 action classes (run-left 45˚, run-left, walk-left, …)
▪ Zoom in/out

Average classification accuracy:
 Ballet:
87.44% (5NN)
 Tennis: 64.33% (5NN)
 Football: 65.38% (1NN)

What can be done?

Applications:
 Do as I Do:
▪ Replace actors in videos
 Do as I Say:
▪ Develop real-world motions in computer games
 2D/3D skeleton transfer
 Figure Correction:
▪ Remove occlusion/clutter in movies

Paper info:
 Title:
▪ Recognizing Human Actions: A Local SVM Approach
 Authors:
▪ Christian Schuldt: KTH university
▪ Ivan Laptev: KTH university
 ICPR 2004

New dataset (KTH action database):
 2391 video sequences
 6 action classes (Walking, Jogging, Running,
Handclapping, Boxing, Hand-waving)
 25 persons
 Static camera
 4 scenarios:
▪ Outdoors (s1)
▪ Outdoors + scale variation (s2): the hardest scenario
▪ Outdoors + clothing variation (s3)
▪ Indoors (s4)

Features:
 Sparse (STIP detector)
 Spatio-temporal jets of order 4

Different feature representations:
 Raw jet feature descriptors
 Exponential $\chi^2$ kernel on the histogram of jets
 Spatial HoG with temporal pyramid

Different classifiers:
 SVM
 NN

Experimental results:
 Local Feature (jets) + SVM performs the best
 SVM outperforms NN
 HistLF (histogram of jets) is slightly better than
HistSTG (histogram of spatio-temporal gradients)

Average classification accuracy on all
scenarios = 71.72%

Paper info:
 Title:
▪ Retrieving Actions in Movies
 Authors:
▪ Ivan Laptev: INRIA / IRISA
▪ Patrick Pérez: INRIA / IRISA
 ICCV 2007
 A new action database from real movies
 Experiments only on Drinking action vs. random/Smoking
 Main contributions:
▪ Recognizing unrestricted real actions
▪ Key-frame priming

Configuration of experiments:
 Action recognition (on pre-segmented seq.)
 Comparing different features
 Action detection (using key-frame priming)

Real movie action database:
 105 drinking actions
 141 smoking actions
 Different scenes/people/views


www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html
Action representation:
 R = (P, ΔP)
 P = (X, Y, T): space-time coordinates
 ΔP = (ΔX, ΔY, ΔT):
▪ ΔX: 1.6 × width of the head bounding box
▪ ΔY: 1.3 × height of the head bounding box

Learning scheme:
 Discrete AdaBoost + FLD (Fisher Linear Discriminant)
 All action cuboids are normalized
to 14x14x8 cells of 5x5x5 pixels
(needed for boosting)
 Slightly temporally-randomized sequences are added to the training set
 HoG (4 bins) / OF (5 bins) features are used
 Local features:
▪ θ = (x, y, t, δx, δy, δt, β, ψ)
▪ β ∈ {plain, temp-2, spat-4}
▪ ψ ∈ {OF5, Grad4}
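
A generic sketch of discrete AdaBoost with Fisher Linear Discriminant weak learners, in the spirit of the learning scheme above; it is not the authors' implementation, and the per-block feature extraction from the normalized cuboids is assumed to have been done already.

```python
import numpy as np

def fit_fld(X, y, w):
    """Weighted Fisher Linear Discriminant: returns (direction, threshold)."""
    w0, w1 = w[y == 0], w[y == 1]
    m0 = np.average(X[y == 0], axis=0, weights=w0)
    m1 = np.average(X[y == 1], axis=0, weights=w1)
    d0, d1 = X[y == 0] - m0, X[y == 1] - m1
    Sw = (d0 * w0[:, None]).T @ d0 + (d1 * w1[:, None]).T @ d1
    direction = np.linalg.solve(Sw + 1e-6 * np.eye(X.shape[1]), m1 - m0)
    threshold = 0.5 * (m0 + m1) @ direction
    return direction, threshold

def adaboost_fld(feature_blocks, y, rounds=50):
    """Discrete AdaBoost whose weak learners are FLDs, each trained on one
    candidate feature block (e.g. the histogram of one cuboid cell).

    feature_blocks: list of (N, d_b) arrays, one per block; y: 0/1 labels.
    Returns a list of (block_index, direction, threshold, alpha).
    """
    y = np.asarray(y)
    sample_w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        best = None
        for b, X in enumerate(feature_blocks):
            direction, thr = fit_fld(X, y, sample_w)
            pred = (X @ direction > thr).astype(int)
            err = np.sum(sample_w * (pred != y))
            if best is None or err < best[0]:
                best = (err, b, direction, thr, pred)
        err, b, direction, thr, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Re-weight: misclassified samples get heavier, correct ones lighter
        sample_w *= np.exp(np.where(pred != y, alpha, -alpha))
        sample_w /= sample_w.sum()
        ensemble.append((b, direction, thr, alpha))
    return ensemble
```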



 HoG captures shape, OF captures motion
 Informative motions: start & end of the action
 Key-frame:
▪ When the hand reaches the head
▪ Boosted histogram classifier on HoG
▪ No motion information around the key-frame
▪ Integration of motion & key-frame should help

Experiments:
 OF/OF+HoG/STIP+NN/only key-frame
 OF/OF+HoG works best on hard test (drinking vs.
smoking)
 Extension of OF5 to OFGrad9 does not help!

Key-frame priming:
 The number of false positives decreases significantly (different information channels)
 Significant gain in overall accuracy:
▪ It’s better to model motion and appearance separately
 Speed of key-primed version: 3 seconds per frame

Possible extensions:
 Extend the experiments to more action classes
 Make it real-time

Paper info:
 Title:
▪ Learning Realistic Human Actions from Movies
 Authors:
▪ Ivan Laptev: INRIA / IRISA
▪ Marcin Marszalek: INRIA / LEAR
▪ Cordelia Schmid: INRIA / LEAR
▪ Benjamin Rozenfeld: Bar-Ilan university
 CVPR 2008

Overview:
 Automatic movie annotation:
▪ Alignment of movie scripts
▪ Text classification
 Classification of real actions
 Providing a new dataset
 Beats state-of-the-art results on the KTH dataset
 Extends the spatial pyramid to a space-time pyramid

Movie script:
 Publicly available textual description about:
▪ Scene descriptions
▪ Characters
▪ Transcribed dialogs
▪ Actions (descriptive)
 Limitations:
▪ No exact timing alignment
▪ No guarantee of correspondence with real actions
▪ Actions are expressed literally (diverse descriptions)
▪ Actions may be missed due to lack of conversation

Automatic annotation:
 Subtitles include exact time alignment
 Timing of scripts is matched by subtitles
 Actions are identified from the textual descriptions by a text classifier

New dataset:
 8 action classes (AnswerPhone, GetOutCar, SitUp, …)
 Two training sets (automatically/manually annotated)
 60% of the automatic training set is correctly
annotated

http://www.irisa.fr/vista/actions

Action classification approach:
 BoF framework (k=4000)
 Space-time pyramids
▪ 6 spatial grids: {1x1, 2x2, 3x3, 1x3, 3x1, o2x2}
▪ 4 temporal grids: {t1, t2, t3, ot2}
 STIP with multiple scales
 HoG and HoF

Feature extraction:
 A volume of $(2k\sigma \times 2k\sigma \times 2k\tau)$ is taken around each STIP, where $\sigma$/$\tau$ is its spatial/temporal extent (k = 9)
 The volume is divided into an $n_x \times n_y \times n_t = 3 \times 3 \times 2$ grid
 HoG and HoF are calculated for each grid cell and concatenated
 These cell descriptors are concatenated once more according to the pattern of the spatio-temporal pyramid
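
A schematic of the cell-wise concatenation described above; `hog_3d` and `hof_3d` are placeholder callables that map a sub-volume to a fixed-length histogram, not a library API.

```python
import numpy as np

def stip_descriptor(volume, hog_3d, hof_3d, nx=3, ny=3, nt=2):
    """Split the space-time volume around an interest point into an
    nx x ny x nt grid, compute HoG and HoF per cell, and concatenate.

    volume: array (T, H, W); hog_3d/hof_3d: callables mapping a sub-volume
    to a 1-D histogram.
    """
    T, H, W = volume.shape
    cells = []
    for ti in range(nt):
        for yi in range(ny):
            for xi in range(nx):
                cell = volume[ti * T // nt:(ti + 1) * T // nt,
                              yi * H // ny:(yi + 1) * H // ny,
                              xi * W // nx:(xi + 1) * W // nx]
                cells.append(np.concatenate([hog_3d(cell), hof_3d(cell)]))
    return np.concatenate(cells)
```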

Different channels:





 Each spatio-temporal template: one channel
 Greedy search to find the best channel combination
 Kernel function: $K(H_i, H_j) = \exp\!\Big(-\sum_{\text{channel}=1}^{C} \tfrac{1}{A_{\text{channel}}}\, D_{\text{channel}}(H_i, H_j)\Big)$, with the $\chi^2$ distance as $D_{\text{channel}}$ and $A_{\text{channel}}$ a per-channel normalization
Observations:
 HoG performs better than HoF
 No temporal subdivision is preferred (temporal grid = t1)
 Combination of channels improves classification in the realistic scenario
 Mean AP on the KTH action database = 91.8%
 Mean AP on the real movies database:
▪ Trained on the manually annotated dataset: 39.5%
▪ Trained on the automatically annotated dataset: 22.9%
▪ Random classifier (chance): 12.5%

Future works:






 Increase robustness to annotation noise
 Improve script-to-video alignment
 Learn on a larger database with automatic annotation
 Experiment with more low-level features
 Move from BoF to detector-based methods
 The table shows:
▪ The effect of temporal division when combining channels (HMM-based methods should work)
▪ The pattern of the spatio-temporal pyramid changes so that context is best captured when the action is scene-dependent

Paper info:
 Title:
▪ Actions in Context
 Authors:
▪ Marcin Marszalek: INRIA / LEAR
▪ Ivan Laptev: INRIA / IRISA
▪ Cordelia Schmid: INRIA / LEAR
 CVPR 2009

Contributions:
 Automatic learning of scene classes from video
 Improve action recognition using image context and
vice versa



 Movie scripts are used for automatic training
 For both actions and scenes: BoF + SVM
 New large database:
▪ 12 action classes
▪ 69 movies involved
▪ 10 scene classes
▪ www.irisa.fr/vista/actions/hollywood2
 For automatic annotation, scenes are identified only from text
Features:
 SIFT (modeling scene) on 2D-Harris
 HoG and HoF (motion) on 3D-Harris (STIP)

Features:
 SIFT: extracted from 2D-Harris detector
▪ Captures static appearance
▪ Used for modeling scene context
▪ Calculated for single frame (every 2 seconds)
 HoG/HoF: extracted from 3D-Harris detector
▪ HoG captures dynamic appearance
▪ HoF captures motion pattern
 One video dictionary per channel is created
 Histogram of video words is created for each channel

Classifier:
 SVM using chi2 distance
 Exponential kernel (RBF)
 Sum over multiple channels
 $K(x_i, x_j) = \exp\!\Big(-\sum_{\text{channel}} \tfrac{1}{A_{\text{channel}}}\, D_{\text{channel}}(x_i, x_j)\Big)$
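
A small sketch of this multi-channel kernel for use with a precomputed-kernel SVM; taking A_channel as the mean pairwise distance of that channel is a common normalization choice, though the slide does not spell it out.

```python
import numpy as np

def chi2_distance_matrix(H):
    """Pairwise chi-square distances between rows of a histogram matrix H."""
    n = H.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        num = (H[i] - H) ** 2
        den = H[i] + H + 1e-12
        D[i] = 0.5 * np.sum(num / den, axis=1)
    return D

def multichannel_kernel(channel_histograms):
    """K(x_i, x_j) = exp(-sum_c D_c(x_i, x_j) / A_c), with A_c the mean distance.

    channel_histograms: list of (n_samples, n_bins_c) arrays, one per channel.
    Returns an (n_samples, n_samples) kernel matrix.
    """
    total = 0.0
    for H in channel_histograms:
        D = chi2_distance_matrix(H)
        A = D.mean() + 1e-12          # per-channel normalization
        total = total + D / A
    return np.exp(-total)
```

The resulting matrix can be passed, for example, to scikit-learn's SVC(kernel='precomputed').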

Evaluations:
 SIFT: better for context
 HoG/HoF: better for action
 Context alone can also classify actions fairly well!
 Combination of the 3 channels
works best

Observations:
 Context is not always good
▪ Idea: The model should control
contribution of context for each
action class individually
 Overall, the gain of accuracy
is not significant using context:
▪ Idea: other types of context should
work better

Paper info:
 Title:
▪ Selection and Context for Action Recognition
 Authors:
▪ Dong Han: University of Bonn
▪ Liefeng Bo: TTI-Chicago
▪ Cristian Sminchisescu: University of Bonn
 ICCV 2009

Main contributions:
 Contextual scene descriptors based on:
▪ Presence/absence of objects (bag-of-detectors)
▪ Structural relation between objects and their parts
 Automatic learning of multiple features
▪ Multiple Kernel Gaussian Process Classifier (MKGPC)

Experimental results on:
 KTH action dataset
 Hollywood-1 and Hollywood-2 human action databases (INRIA)

Main message:
 Detecting a Car and a Person in its proximity increases the probability of the Get-Out-Car action

Provides a framework to train a classifier based on
combination of multiple features (not necessarily relevant)
e.g. HOG+HOF+histogram intersection, …

Similar to MKL, but here:
 Parameters are learnt automatically, i.e. both the weights and the hyper-parameters
 $k_m(x_i, x_j; \lambda, \theta) = \sum_{t=1}^{T} e^{\lambda_t}\, k(x_i^t, x_j^t; \theta_t)$
 A Gaussian Process scheme is used for learning
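
A sketch of the weighted combination above, assuming the per-feature-type base kernel matrices are precomputed; the exponentiated weights keep the combination positive, and the learning of λ and θ (done with Gaussian Processes in the paper) is not shown.

```python
import numpy as np

def combined_kernel(base_kernels, log_weights):
    """k_m = sum_t exp(lambda_t) * k_t, given a list of precomputed base
    kernel matrices (one per feature type) and their log-weights lambda_t."""
    assert len(base_kernels) == len(log_weights)
    K = np.zeros_like(base_kernels[0], dtype=float)
    for K_t, lam in zip(base_kernels, log_weights):
        K += np.exp(lam) * K_t
    return K
```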

Descriptors:
 Bag of Detectors
▪ Deformable part models are used (Pedro Felzenszwalb)
▪ Once object bounding boxes are detected, 3 descriptors are built (see the sketch after this list):
▪ ObjPres (4D)
▪ ObjCount (4D)
▪ ObjDist (21D): pair-wise distances between the parts of the Person detector (7 parts)
 HOG (4D) + HOF (5D) from the STIP detector (Ivan Laptev)
▪ Spatial grids: 1x1, 2x1, 3x1, 4x1, 2x2, 3x3
▪ Temporal grids: t1, t2, t3
 3D gradient features
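
A toy illustration of how the presence and count descriptors might be assembled from detector outputs; the object classes and the score threshold are placeholders, not taken from the paper, and ObjDist is omitted.

```python
import numpy as np

# Hypothetical object classes; the paper's exact detector set may differ.
OBJECT_CLASSES = ["person", "car", "class_3", "class_4"]

def bag_of_detectors(detections, score_thresh=0.0):
    """Build ObjPres (binary presence) and ObjCount (detection counts),
    one dimension per object class.

    detections: list of (class_name, score) tuples for one video clip.
    """
    counts = np.zeros(len(OBJECT_CLASSES))
    for cls, score in detections:
        if score > score_thresh and cls in OBJECT_CLASSES:
            counts[OBJECT_CLASSES.index(cls)] += 1
    presence = (counts > 0).astype(float)  # ObjPres (4D)
    return presence, counts                # ObjCount (4D)
```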

Experimental results:
 KTH dataset
▪ 94.1% mean AP vs. 91.8% reported by Laptev
▪ Superior to state-of-the-art in all but Running class
 HOHA1 dataset
▪ Trained on clean set only
▪ The optimal subset of features is found greedily
(addition/removal) based on test error
▪ 47.5% mean AP vs. 38.4% reported by Laptev
 HOHA2 dataset
▪ 43.12% mean AP vs. 35.1% reported by Marszalek

Best feature combination