A general survey of previous works on Action Recognition
Sobhan Naderi Parizi
September 2009
Statistical Analysis of Dynamic Actions
On Space-Time Interest Points
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words
What, where and who? Classifying events by scene and object recognition
Recognizing Actions at a Distance
Recognizing Human Actions: A Local SVM Approach
Retrieving Actions in Movies
Learning Realistic Human Actions from Movies
Actions in Context
Selection and Context for Action Recognition
Paper info:
Title:
▪ Statistical Analysis of Dynamic Actions
Authors:
▪ Lihi Zelnik-Manor
▪ Michal Irani
TPAMI 2006
A preliminary version appeared in CVPR 2001
▪ “Event-Based Analysis of Video”
Overview:
Introduce a non-parametric distance measure
Video matching (no action model): given a reference
video, similar sequences are found
Dense features from multiple temporal scales (only
corresponding scales are compared)
Temporal extent of videos in each category should be the
same! (fast and slow dancing count as different actions)
New database is introduced
▪ Periodic activities (walk)
▪ Non-periodic activities (Punch, Kick, Duck, Tennis)
▪ Temporal Textures (water)
▪ www.wisdom.weizmann.ac.il/~vision/EventDetection.html
Feature description:
Space-time gradient of each pixel
Threshold the gradient magnitudes
Normalization (ignoring appearance)
Absolute value (invariant to dark/light transitions)
▪ Direction invariant:
▪ (N_x^l, N_y^l, N_t^l) = (|S_x^l|, |S_y^l|, |S_t^l|) / sqrt((S_x^l)^2 + (S_y^l)^2 + (S_t^l)^2)
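The normalization above can be sketched in NumPy (a minimal illustration; the threshold value is an assumption, not the paper's setting):

```python
import numpy as np

def gradient_features(clip):
    """Per-pixel normalized absolute space-time gradients.

    clip: (T, H, W) array of blurred grayscale frames at one temporal scale.
    Returns (T, H, W, 3) features (N_x, N_y, N_t), one 3-vector per pixel.
    """
    # Space-time gradients S_x, S_y, S_t (central differences).
    St, Sy, Sx = np.gradient(clip.astype(np.float64))
    mag = np.sqrt(Sx**2 + Sy**2 + St**2)
    # Threshold: keep only pixels with significant space-time change
    # (the 0.1 * max threshold is illustrative).
    mask = mag > 0.1 * mag.max()
    # Absolute value (dark/light invariance) + magnitude normalization
    # (appearance invariance); zero where the gradient is below threshold.
    with np.errstate(invalid="ignore", divide="ignore"):
        feats = np.stack([np.abs(Sx), np.abs(Sy), np.abs(St)], axis=-1) / mag[..., None]
    feats[~mask] = 0.0
    return feats
```

The 3L 1D distributions of the paper are then just histograms of these components over the video, one per component and temporal scale.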
Comments:
Actions are represented by 3L independent 1D
distributions (L being number of temporal scales)
The frames are blurred first
▪ Robust to change of appearance e.g. high textured clothing
Action recognition/localization
▪ For a test video sequence S and a reference sequence of T
frames:
▪ Each consecutive sub-sequence of length T is compared to the
reference
▪ In case of multiple reference videos:
▪ Mahalanobis distance
Paper info:
Title:
▪ On Space-Time Interest Points
Authors:
▪ Ivan Laptev: INRIA / IRISA
IJCV 2009
Extends Harris detector to 3D (space-time)
Local space-time points with non-constant
motion:
Points with accelerated motion: physical forces
Independent space and time scales
Automatic scale selection
Automatic scale selection procedure:
Detect interest points
Move in the direction of optimal scale
Repeat until locally optimal scale is reached
(iterative)
The procedure cannot be used in real-time:
Frames in future time are needed
There exist estimation approaches to solve this
problem
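A rough sketch of the space-time Harris response (assumed form det(μ) − k·trace(μ)³, with the second-moment matrix μ smoothed at independent spatial/temporal integration scales; parameter values are illustrative):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(clip, sigma=1.5, tau=1.5, s=2.0, k=0.005):
    """Space-time Harris response (a sketch of the 3D extension).

    clip: (T, H, W) grayscale video volume.
    sigma/tau: spatial/temporal differentiation scales (independent).
    s: integration-to-differentiation scale ratio. k: Harris constant.
    """
    L = gaussian_filter(clip.astype(np.float64), (tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)
    # Second-moment matrix entries, smoothed at the integration scale.
    smooth = lambda a: gaussian_filter(a, (s * tau, s * sigma, s * sigma))
    xx, yy, tt = smooth(Lx * Lx), smooth(Ly * Ly), smooth(Lt * Lt)
    xy, xt, yt = smooth(Lx * Ly), smooth(Lx * Lt), smooth(Ly * Lt)
    det = xx * (yy * tt - yt**2) - xy * (xy * tt - yt * xt) + xt * (xy * yt - yy * xt)
    trace = xx + yy + tt
    # Large positive response marks points with non-constant motion.
    return det - k * trace**3
```

Interest points are local maxima of this response; the scale-selection loop would re-detect after moving sigma/tau toward their locally optimal values.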
Paper info:
Title:
▪ Unsupervised Learning of Human Action Categories
Using Spatial-Temporal Words
Authors:
▪ Juan Carlos Niebles: University of Illinois
▪ Hongcheng Wang: University of Illinois
▪ Li Fei-Fei: University of Illinois
BMVC 2006
Generative graphical model (pLSA)
STIP detector of Piotr Dollár et al. is used
Laptev’s STIP detector is too sparse
Dictionary of video words is created
The method is unsupervised
Simultaneous action recognition/localization
Evaluations on:
KTH action database
Skating actions database (4 action classes)
Overview of the method:
w: video word
d: video sequence
z: latent topic (action category)
P(w_i, d_j) = Σ_{k=1}^{K} P(w_i, d_j, z_k) = P(d_j) Σ_{k=1}^{K} P(z_k | d_j) P(w_i | z_k)
Feature descriptor:
Brightness gradient + PCA
Brightness gradients were found equivalent to optical flow
for capturing motion
Multiple actions can be localized in the video via the topic posterior:
P(z_k | w_i, d_j) = P(w_i | z_k) P(z_k | d_j) / Σ_{l=1}^{K} P(w_i | z_l) P(z_l | d_j)
Average classification accuracy:
KTH action database: 81.5%
Skating dataset: 80.67%
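The pLSA decomposition can be sketched with a minimal EM loop (an illustration, not the authors' implementation):

```python
import numpy as np

def plsa(counts, K, n_iter=50, seed=0):
    """Minimal pLSA EM sketch.

    counts: (W, D) word-document count matrix n(w, d).
    Returns P(w|z) of shape (W, K) and P(z|d) of shape (K, D).
    """
    rng = np.random.default_rng(seed)
    W, D = counts.shape
    p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(0)
    for _ in range(n_iter):
        # E-step: P(z|w,d) proportional to P(w|z) P(z|d)
        post = p_w_z[:, :, None] * p_z_d[None, :, :]      # (W, K, D)
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate from expected counts n(w,d) P(z|w,d)
        nz = counts[:, None, :] * post                    # (W, K, D)
        p_w_z = nz.sum(2); p_w_z /= p_w_z.sum(0) + 1e-12
        p_z_d = nz.sum(0); p_z_d /= p_z_d.sum(0) + 1e-12
    return p_w_z, p_z_d
```

Classification assigns a test video the topic z maximizing P(z|d); the same posterior P(z|w,d) localizes words belonging to each action.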
Paper info:
Title:
▪ What, where and who? Classifying events by scene and
object recognition
Authors:
▪ Li-Jia Li: University of Illinois
▪ Li Fei-Fei: Princeton University
ICCV 2007
Goal of the paper:
Event classification in still images
Scene labeling
Object labeling
Approach:
Generative graphical model
Assumes that objects and scenes are independent
given the event category
Ignores spatial relationships between objects
Information channels:
Scene context (holistic representation)
Object appearance
Geometrical layout (sky at infinity/vertical
structure/ground plane)
Feature extraction:
12x12 patches obtained by grid sampling (10x10)
For each patch:
▪ SIFT feature (used both for scene and object models)
▪ Layout label (used only for object model)
The graphical model
E: event
S: scene
O: object
X: scene feature
A: appearance feature
G: geometry layout
A new database is compiled:
8 sport event categories (downloaded from the web)
Bocce, croquet, polo, rowing, snowboarding,
badminton, sailing, rock climbing
Average classification
accuracy over all 8
event classes = 74.3%
Sample results:
Paper info:
Title:
▪ Recognizing Actions at a Distance
Authors:
▪ Alexei A. Efros: UC Berkeley
▪ Alexander C. Berg: UC Berkeley
▪ Greg Mori: UC Berkeley
▪ Jitendra Malik: UC Berkeley
ICCV 2003
Overall review:
Actions at medium resolution (actors ~30 pixels tall)
Proposing a new motion descriptor
KNN for classification
Consistent tracking bounding
box of the actor is required
Action recognition is done only
on the tracking bounding box
Motion is described in terms of relative
movement of body parts
No information about movement is given by the tracker
Motion Feature:
For each frame, a local temporal neighborhood is
considered
Optical flow is extracted (other alternatives:
image pixel values, temporal gradients)
OF is noisy:
▪ half-wave rectifying + blurring
To preserve motion info:
▪ OF vector is decomposed to its
vertical/horizontal components
Similarity measure:
i, j: frame indices
T: temporal extent
I: spatial extent
A: 1st video sequence = {a_1^i, a_2^i, a_3^i, a_4^i}
B: 2nd video sequence = {b_1^i, b_2^i, b_3^i, b_4^i}
S(i, j) = Σ_{t∈T} Σ_{c=1}^{4} Σ_{(x,y)∈I} a_c^{i+t}(x, y) b_c^{j+t}(x, y)
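The motion descriptor and similarity can be sketched as follows (flow fields are assumed given; half-wave rectification and blurring follow the description above, with an illustrative blur value):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def motion_channels(flow_u, flow_v, blur=1.0):
    """Half-wave rectified, blurred optical-flow channels for one frame.

    flow_u, flow_v: (H, W) horizontal/vertical flow inside the
    stabilized actor bounding box. Returns 4 channels: u+, u-, v+, v-.
    """
    chans = [np.maximum(flow_u, 0), np.maximum(-flow_u, 0),
             np.maximum(flow_v, 0), np.maximum(-flow_v, 0)]
    return [gaussian_filter(c, blur) for c in chans]

def frame_similarity(A, B, i, j, T=2):
    """S(i, j): correlate the 4 channels over a temporal window.

    A, B: lists of per-frame 4-channel descriptors from motion_channels.
    """
    s = 0.0
    for t in range(-T, T + 1):
        if 0 <= i + t < len(A) and 0 <= j + t < len(B):
            for a, b in zip(A[i + t], B[j + t]):
                s += float(np.sum(a * b))
    return s
```

Classification then labels frame i of a test video by k-nearest-neighbor search over S(i, j) against labeled reference frames.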
New Dataset:
Ballet (stationary camera):
▪ 16 action classes
▪ 2 men + 2 women
▪ Easy dataset (controlled environment)
Tennis (real action, stationary camera):
▪ 6 action classes (stand, swing, move-left, …)
▪ different days/location/camera position
▪ 2 players (man + woman)
Football (real action, moving camera):
▪ 8 action classes (run-left 45˚, run-left, walk-left, …)
▪ Zoom in/out
Average classification accuracy:
Ballet:
87.44% (5NN)
Tennis: 64.33% (5NN)
Football: 65.38% (1NN)
What can be done?
Applications:
Do as I Do:
▪ Replace actors in videos
Do as I Say:
▪ Develop real-world motions in computer games
2D/3D skeleton transfer
Figure Correction:
▪ Remove occlusion/clutter in movies
Paper info:
Title:
▪ Recognizing Human Actions: A Local SVM Approach
Authors:
▪ Christian Schuldt: KTH university
▪ Ivan Laptev: KTH university
ICPR 2004
New dataset (KTH action database):
2391 video sequences
6 action classes (Walking, Jogging, Running,
Handclapping, Boxing, Hand-waving)
25 persons
Static camera
4 scenarios:
▪ Outdoors (s1)
▪ Outdoors + scale variation (s2): the hardest scenario
▪ Outdoors + cloth variation (s3)
▪ Indoors (s4)
Features:
Sparse (STIP detector)
Spatio-temporal jets of order 4
Different feature representations:
Raw jet feature descriptors
Exponential χ² kernel on the histogram of jets
Spatial HoG with temporal pyramid
Different classifiers:
SVM
NN
Experimental results:
Local Feature (jets) + SVM performs the best
SVM outperforms NN
HistLF (histogram of jets) is slightly better than
HistSTG (histogram of spatio-temporal gradients)
Average classification accuracy on all
scenarios = 71.72%
Paper info:
Title:
▪ Retrieving Actions in Movies
Authors:
▪ Ivan Laptev: INRIA / IRISA
▪ Patrick Pérez: INRIA / IRISA
ICCV 2007
A new action database from real movies
Experiments only on Drinking action vs.
random/Smoking
Main contributions:
Recognizing unrestricted real actions
Key-frame priming
Configuration of experiments:
Action recognition (on pre-segmented seq.)
Comparing different features
Action detection (using key-frame priming)
Real movie action database:
105 drinking actions
141 smoking actions
Different scenes/people/views
www.irisa.fr/vista/Equipe/People/Laptev/actiondetection.html
Action representation:
R = (P, ΔP)
P = (X, Y, T): space-time coordinates
ΔP = (ΔX, ΔY, ΔT):
▪ ΔX: 1.6 width of head bounding box
▪ ΔY: 1.3 height of head bounding box
Learning scheme:
Discrete AdaBoost + FLD (Fisher Linear Discriminant)
All action cuboids are normalized
to 14x14x8 cells of 5x5x5 pixels
(needed for boosting)
Slightly temporally-randomized
sequences are added to training
HoG(4bins)/OF(5bins) is used
Local features:
▪ Θ = (x, y, t, δx, δy, δt, β, ψ)
▪ β ∈ {plain, temp-2, spat-4}
▪ ψ ∈ {OF5, Grad4}
HoG captures shape, OF captures motion
Informative motions: start & end of action
Key-frame:
When hand reaches head
Boosted histogram on HoG
No motion info around the key-frame
Integration of motion & key-frame should help
Experiments:
OF/OF+HoG/STIP+NN/only key-frame
OF/OF+HoG works best on hard test (drinking vs.
smoking)
Extension of OF5 to OFGrad9 does not help!
Key-frame priming:
#FPs decreases significantly (different info. channels)
Significant overall accuracy improvement:
▪ It’s better to model motion and appearance separately
Speed of key-primed version: 3 seconds per frame
Possible extensions:
Extend the experiments to more action classes
Make it real-time
Paper info:
Title:
▪ Learning Realistic Human Actions from Movies
Authors:
▪ Ivan Laptev: INRIA / IRISA
▪ Marcin Marszalek: INRIA / LEAR
▪ Cordelia Schmid: INRIA / LEAR
▪ Benjamin Rozenfeld: Bar-Ilan university
CVPR 2008
Overview:
Automatic movie annotation:
▪ Alignment of movie scripts
▪ Text classification
Classification of real action
Providing a new dataset
Beat state-of-the-art results on KTH dataset
Extending spatial pyramid to space-time pyramid
Movie script:
Publicly available textual description of:
▪ Scene description
▪ Characters
▪ Transcribed dialogs
▪ Actions (descriptive)
Limitations:
▪ No exact timing alignment
▪ No guarantee of correspondence with real actions
▪ Actions are expressed literally (diverse descriptions)
▪ Actions may be missed due to lack of conversation
Automatic annotation:
Subtitles include exact time alignment
Timing of scripts is matched by subtitles
Action descriptions in the text are identified by a text
classifier
New dataset:
8 action classes (AnswerPhone, GetOutCar, SitUp, …)
Two training sets (automatically/manually annotated)
60% of the automatic training set is correctly
annotated
http://www.irisa.fr/vista/actions
Action classification approach:
BoF framework (k=4000)
Space-time pyramids
▪ 6 spatial grids: {1x1, 2x2, 3x3, 1x3, 3x1, o2x2}
▪ 4 temporal grids: {t1, t2, t3, ot2}
STIP with multiple scales
HoG and HoF
Feature extraction:
A volume of (2kσ x 2kσ x 2kτ) is taken around each
STIP, where σ/τ is the spatial/temporal extent (k = 9)
The volume is divided into an n_x × n_y × n_t = 3 × 3 × 2 grid
HoG and HoF are calculated for each grid cell and
concatenated together
These concatenated features are concatenated once more
according to the pattern of the spatio-temporal pyramid
Different channels:
Each spatio-temporal template: one channel
Greedy search to find the best channel combination
Kernel function: K(x_i, x_j) = exp(−Σ_{c=1}^{C} (1/A_c) D_c(x_i, x_j))
D_c: χ² distance for channel c; A_c: channel normalization factor
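The multi-channel kernel can be sketched as follows (A_c is taken here as the mean training χ² distance of channel c, an assumed but common choice of normalization):

```python
import numpy as np

def chi2_dist(h1, h2):
    """Chi-square distance between two L1-normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-12))

def multichannel_kernel(x, y, mean_dists):
    """K(x, y) = exp(-sum_c D_c(x_c, y_c) / A_c).

    x, y: dicts mapping channel name -> BoF histogram.
    mean_dists: dict of per-channel normalization factors A_c.
    """
    return np.exp(-sum(chi2_dist(x[c], y[c]) / mean_dists[c] for c in x))
```

Greedy channel selection would evaluate this kernel in an SVM for each candidate subset of channels and keep the best-performing combination.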
Observations:
HoG performs better than HoF
No temporal subdivision is preferred (temporal grid = t1)
Combination of channels improves classification in real scenario
Mean AP on KTH action database = 91.8%
Mean AP on real movies database:
▪ Trained on manually annotated dataset : 39.5%
▪ Trained on automatically annotated dataset : 22.9%
▪ Random classifier (chance) : 12.5%
Future works:
Increase robustness to annotation noise
Improve script to video alignment
Learn on larger database of automatic annotation
Experiment more low-level features
Move from BoF to detector based methods
The table shows:
▪ the effect of temporal division when combining channels (HMM-based
methods should work)
▪ the pattern of the spatio-temporal pyramid changes so that context is
best captured when the action is scene-dependent
Paper info:
Title:
▪ Actions in Context
Authors:
▪ Marcin Marszalek: INRIA / LEAR
▪ Ivan Laptev: INRIA / IRISA
▪ Cordelia Schmid: INRIA / LEAR
CVPR 2009
Contributions:
Automatic learning of scene classes from video
Improve action recognition using image context and
vice versa
Movie scripts are used for automatic training
For both action and scene: BoF + SVM
New large database:
12 action classes
69 movies involved
10 scene classes
www.irisa.fr/vista/actions/hollywood2
For automatic annotation, scenes are
identified only from text
Features:
SIFT: extracted from 2D-Harris detector
▪ Captures static appearance
▪ Used for modeling scene context
▪ Calculated for single frame (every 2 seconds)
HoG/HoF: extracted from 3D-Harris detector
▪ HoG captures dynamic appearance
▪ HoF captures motion pattern
One video dictionary per channel is created
Histogram of video words is created for each channel
Classifier:
SVM with χ² distance
Exponential (RBF-like) kernel, summed over multiple channels:
K(x_i, x_j) = exp(−Σ_c (1/A_c) D_c(x_i, x_j))
Evaluations:
SIFT: better for context
HoG/HoF: better for action
Context alone can also classify
actions fairly well!
Combination of the 3 channels
works best
Observations:
Context is not always good
▪ Idea: The model should control
contribution of context for each
action class individually
Overall, the accuracy gain
from context is not significant:
▪ Idea: other types of context should
work better
Paper info:
Title:
▪ Selection and Context for Action Recognition
Authors:
▪ Dong Han: University of Bonn
▪ Liefeng Bo: TTI-Chicago
▪ Cristian Sminchisescu: University of Bonn
ICCV 2009
Main contributions:
Contextual scene descriptors based on:
▪ Presence/absence of objects (bag-of-detectors)
▪ Structural relation between objects and their parts
Automatic learning of multiple features
▪ Multiple Kernel Gaussian Process Classifier (MKGPC)
Experimental results on:
KTH action dataset
Hollywood1,2 Human Action database (INRIA)
Main message:
Detection of a car, with a person in its proximity, increases the
probability of a Get-Out-Car action
Provides a framework to train a classifier on a combination of
multiple features (not necessarily all relevant),
e.g. HOG + HOF + histogram intersection, …
Similar to MKL, but here the parameters (weights +
hyper-parameters) are learnt automatically:
k_m(x_i, x_j; λ, θ) = Σ_{t=1}^{T} e^{λ_t} k_t(x_i^t, x_j^t; θ_t)
A Gaussian Process scheme is used for learning
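The weighted kernel combination can be sketched as below (the exponentiated weights keep every channel's contribution positive, so the sum stays a valid kernel; the GP marginal-likelihood optimization of λ and the per-channel hyper-parameters θ is omitted):

```python
import numpy as np

def combined_kernel(Ks, lam):
    """k_m = sum_t exp(lambda_t) * K_t over the T feature channels.

    Ks: list of T precomputed Gram matrices, each (n, n).
    lam: array of T log-weights lambda_t (learnt by the GP in the paper).
    """
    return sum(np.exp(l) * K for l, K in zip(lam, Ks))
```

At lam = 0 this reduces to a plain sum of the channel kernels; learning pushes lam_t toward −∞ for irrelevant channels, effectively pruning them.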
Descriptors:
Bag of Detectors
▪ Deformable part models are used (Felzenszwalb et al.)
▪ Once object bounding boxes are detected, 3 descriptors are built:
▪ ObjPres (4D)
▪ ObjCount (4D)
▪ ObjDist (21D): pair-wise distances between the 7 parts of the
Person detector
HOG (4-bin) + HOF (5-bin) from the STIP detector (Laptev)
▪ Spatial grids: 1x1, 2x1, 3x1, 4x1, 2x2, 3x3
▪ Temporal grids: t1, t2, t3
3D gradient features
Experimental results:
KTH dataset
▪ 94.1% mean AP vs. 91.8% reported by Laptev
▪ Superior to state-of-the-art in all but Running class
HOHA1 dataset
▪ Trained on clean set only
▪ The optimal subset of features is found greedily
(addition/removal) based on test error
▪ 47.5% mean AP vs. 38.4% reported by Laptev
HOHA2 dataset
▪ 43.12% mean AP vs. 35.1% reported by Marszalek
Best feature combination