Transcript: THUMOS'13 Action
Action recognition with improved
trajectories
Heng Wang and Cordelia Schmid
LEAR Team, INRIA
1
Action recognition in realistic videos
Challenges
Severe camera motion
Variation in human appearance and pose
Cluttered background and occlusion
Viewpoint and illumination changes
Current state of the art
Local space-time features + bag-of-features model
Dense trajectories perform the best on a large
variety of datasets (Wang et al., IJCV'13)
2
Dense trajectories revisited
Three major steps:
- Dense sampling
- Feature tracking
- Trajectory-aligned descriptors
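As a rough sketch of the first step, candidate points are laid out on a regular grid (a minimal NumPy illustration, assuming the 5-pixel stride from the dense-trajectories paper; the structure-tensor check that discards points in homogeneous regions is omitted, and `dense_sample` is our illustrative name):

```python
import numpy as np

def dense_sample(width, height, stride=5):
    """Densely sample candidate points on a regular grid (stride W = 5 px).
    In the full method, points in homogeneous areas are additionally
    discarded by thresholding the structure tensor's smaller eigenvalue."""
    xs = np.arange(stride // 2, width, stride)
    ys = np.arange(stride // 2, height, stride)
    return np.array([(x, y) for y in ys for x in xs])
```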
3
Dense trajectories revisited
Advantages:
- Capture the intrinsic dynamic structures in video
- MBH is robust to camera motion
Disadvantages:
- Generate irrelevant trajectories in background due to camera motion
- Motion descriptors are corrupted due to camera motion, e.g., HOF, MBH
4
Improved dense trajectories
Contributions:
- Improve dense trajectories by explicit camera motion estimation
- Detect humans to remove outlier matches for homography estimation
- Stabilize optical flow to eliminate camera motion
- Remove trajectories caused by camera motion
5
Camera motion estimation
Find the correspondences between two consecutive frames:
- Extract and match SURF features (robust to motion blur)
- Sample good-features-to-track interest points and match them via optical flow
Combining SURF (green) and optical flow (red) matches gives a
more balanced spatial distribution
Use RANSAC to estimate a homography from all feature matches
Inlier matches of the homography
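The homography step above can be sketched end-to-end in NumPy (a minimal sketch, assuming point matches are already given; the actual system uses OpenCV's RANSAC, and all function names here are illustrative):

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: least-squares homography from >= 4 matches."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)     # null-space vector = homography entries
    return H / H[2, 2]

def project(H, pts):
    """Apply homography H to an (N, 2) array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, n_iter=500, thresh=3.0):
    """Estimate a homography with RANSAC; returns (H, inlier_mask)."""
    rng = np.random.default_rng(0)
    best_inliers = None
    for _ in range(n_iter):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # refit on all inliers of the best model
    return fit_homography(src[best_inliers], dst[best_inliers]), best_inliers
```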
6
Remove inconsistent matches due to humans
Human motion is not constrained by camera motion,
and thus generates outlier matches
Apply a human detector in each frame, and track the
human bounding box forward and backward to join
them together
Remove feature matches inside the human bounding
box during homography estimation
Inlier matches and warped flow, without or with HD
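Excluding matches inside detected boxes reduces to a point-in-rectangle test (a hypothetical helper for illustration; the human detector and the forward/backward box tracking themselves are not shown):

```python
import numpy as np

def keep_non_human_matches(src_pts, boxes):
    """Boolean mask over (N, 2) match points: False inside any human box,
    so those matches are excluded from homography estimation.
    boxes: iterable of (x1, y1, x2, y2) in pixel coordinates."""
    keep = np.ones(len(src_pts), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        inside = ((src_pts[:, 0] >= x1) & (src_pts[:, 0] <= x2) &
                  (src_pts[:, 1] >= y1) & (src_pts[:, 1] <= y2))
        keep &= ~inside
    return keep
```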
7
Warp optical flow
Warp the second frame of two consecutive frames
with the homography and re-compute the optical flow
For HOF, the warped flow removes irrelevant camera
motion and thus encodes only foreground motion
For MBH, results also improve, as the motion
boundaries are enhanced
[Figure: two frames overlaid; original optical flow; warped optical flow]
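Conceptually, the second frame is warped with the estimated homography and the flow is re-computed between the first frame and the warped one. A grayscale nearest-neighbour warp in plain NumPy (the real implementation relies on OpenCV with interpolation; `warp_frame` is our illustrative name):

```python
import numpy as np

def warp_frame(frame, H):
    """Warp a grayscale frame with homography H by inverse mapping:
    warped(p) = frame(H^-1 p), nearest-neighbour sampling, zero border."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)
    src = dst @ np.linalg.inv(H).T          # map output pixels back
    src = src[:, :2] / src[:, 2:3]          # de-homogenize
    sx = np.rint(src[:, 0]).astype(int)
    sy = np.rint(src[:, 1]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(frame)
    out.reshape(-1)[valid] = frame[sy[valid], sx[valid]]
    return out
```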
8
Remove background trajectories
Remove trajectories by thresholding the maximal magnitude
of stabilized motion vectors in the warped optical flow
Our method works well under various camera motions,
such as pan, zoom, tilt
[Figure: successful examples and failure cases; removed trajectories in white, foreground trajectories in green]
Failures occur under severe motion blur: unreliable feature
matches prevent correct estimation of the homography
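The pruning rule reduces to thresholding the maximal displacement of each trajectory in the warped flow (a sketch; the 1-pixel default threshold is our assumption, the talk does not state a value):

```python
import numpy as np

def remove_background(trajectories, thresh=1.0):
    """Keep trajectories whose maximal warped-flow displacement exceeds
    `thresh` pixels; near-zero stabilized motion means the trajectory
    was produced by camera motion alone.
    trajectories: list of (L, 2) arrays of per-frame displacements."""
    return [t for t in trajectories
            if np.max(np.linalg.norm(t, axis=1)) >= thresh]
```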
9
Demo of warped flow and trajectory removal
Warped optical flow eliminates background camera motion
Removing background trajectories makes the feature
representation focus more on human motion
10
Experimental setting
"RootSIFT" normalization for each descriptor, then PCA
to reduce its dimension by a factor of two
Use Fisher vector to encode each descriptor separately,
set the number of Gaussians to K=256
Use Power+L2 normalization for FV, and linear SVM
with one-against-rest for multi-class classification
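The two normalizations can be sketched as follows (RootSIFT-style L1-then-square-root per descriptor, and power plus L2 for the Fisher vector; the epsilon guards and function names are ours):

```python
import numpy as np

def rootsift(desc):
    """'RootSIFT'-style normalization: L1-normalize each descriptor row,
    then take the signed element-wise square root."""
    desc = desc / (np.abs(desc).sum(axis=1, keepdims=True) + 1e-12)
    return np.sign(desc) * np.sqrt(np.abs(desc))

def power_l2(fv, alpha=0.5):
    """Power (signed square-root) normalization followed by L2
    normalization, applied to a Fisher vector."""
    fv = np.sign(fv) * np.abs(fv) ** alpha
    return fv / (np.linalg.norm(fv) + 1e-12)
```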
Datasets
Hollywood2: 12 classes from 69 movies, report mAP
HMDB51: 51 classes, report accuracy on three splits
Olympic sports: 16 sport actions, report mAP
UCF50: 50 classes, report accuracy over 25 groups
11
Evaluation of the intermediate steps
           Trajectory   HOG     HOF     MBH     HOF+MBH   Combined
DTF        25.4%        38.4%   39.5%   49.1%   49.8%     52.2%
WarpFlow   31.0%        38.7%   48.5%   50.9%   53.5%     55.6%
RmTrack    26.9%        39.6%   41.6%   50.8%   51.0%     53.9%
ITF        32.4%        40.2%   48.9%   52.1%   54.7%     57.2%

Results on HMDB51 using Fisher vector
DTF = "dense trajectory feature" (baseline)
WarpFlow = "warp the optical flow"
RmTrack = "remove background trajectories"
ITF = "improved trajectory feature", combining WarpFlow and RmTrack
12
Evaluation of the intermediate steps
Both Trajectory and HOF are significantly improved;
MBH also benefits, as motion boundaries become clearer;
HOG does not change much
HOF and MBH are complementary, as they represent
zero- and first-order motion information
Both RmTrack and WarpFlow help; WarpFlow
contributes more; combining them (ITF) works best
13
Impact of feature encoding on improved trajectories
Dataset          BoF DTF   BoF ITF   FV DTF   FV ITF
Hollywood2       58.5%     62.2%     60.1%    64.3%
HMDB51           47.2%     52.1%     52.2%    57.2%
Olympic Sports   75.4%     83.3%     84.7%    91.1%
UCF50            84.8%     87.2%     88.6%    91.2%
Comparison of DTF and ITF using different feature encodings
Standard bag of features: train a codebook of 4000
visual words with k-means for each descriptor type;
RBF-kernel SVM for classification
We observe a similar improvement of ITF over DTF
when using BOF or FV for feature encoding
The improvement of FV over BOF varies on different
datasets, from 2% to 7%
14
Impact of human detection and state of the art
Dataset          State of the art         With HD   Without HD
Hollywood2       62.5% (Jain CVPR'13)     64.3%     63.0%
Olympic Sports   83.2% (Jain CVPR'13)     91.1%     90.2%
HMDB51           52.1% (Jain CVPR'13)     57.2%     55.9%
UCF50            83.3% (Shi CVPR'13)      91.2%     90.5%
HD stands for human detection
Human detection always helps. For Hollywood2 and HMDB51, the
difference is more significant, as there are more humans present
Significantly outperforms the state of the art on all four datasets
Source code: http://lear.inrialpes.fr/~wang/improved_trajectories
15
THUMOS'13 Action Recognition Challenge
Dataset: three train-test splits from UCF101
We follow exactly the same framework:
improved trajectory feature + Fisher vector
We do not apply human detection, as it is
computationally expensive to run on large datasets
We use spatio-temporal pyramids to embed
structural information in the final representation
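A cell-assignment sketch for such pyramids (assuming "H3" denotes three horizontal spatial stripes and "T2" two temporal halves, which is our reading of the abbreviations used in the results; the function name is illustrative):

```python
def stp_cell(x, y, t, width, height, length, scheme):
    """Map a feature at pixel (x, y) and frame t to a pyramid cell index.
    'h3': three horizontal stripes over the frame; 't2': two temporal
    halves of the video; 'none': the whole volume is one cell."""
    if scheme == "none":
        return 0
    if scheme == "h3":
        return min(int(3 * y / height), 2)
    if scheme == "t2":
        return min(int(2 * t / length), 1)
    raise ValueError("unknown scheme: " + scheme)
```

A separate Fisher vector is computed per cell and the cell vectors are concatenated into the final representation.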
16
THUMOS'13 Action Recognition Challenge
Descriptors     None    T2      H3      Combined
HOG             72.4%   72.8%   73.2%   74.6%
HOF             76.0%   76.1%   77.3%   78.3%
MBH             80.8%   81.1%   80.5%   82.1%
HOG+HOF         82.9%   82.7%   82.7%   83.9%
HOG+MBH         83.3%   83.3%   83.4%   84.4%
HOF+MBH         82.2%   82.2%   82.0%   83.3%
HOG+HOF+MBH     84.8%   84.8%   84.6%   85.9%
We do not include Trajectory descriptor, as combining it
does not improve the final performance
For a single descriptor: MBH > HOF > HOG. For combinations
of two descriptors, HOG+MBH works best, as they are
the most complementary
17
THUMOS'13 Action Recognition Challenge
Descriptors
HOG
HOF
MBH
HOG+HOF
HOG+MBH
HOF+MBH
HOG+HOF+MBH
None
72.4%
76.0%
80.8%
82.9%
83.3%
82.2%
84.8%
T2
72.8%
76.1%
81.1%
82.7%
83.3%
82.2%
84.8%
H3
73.2%
77.3%
80.5%
82.7%
83.4%
82.0%
84.6%
Combined
74.6%
78.3%
82.1%
83.9%
84.4%
83.3%
85.9%
Spatio-temporal pyramids always help; the improvement is
more significant for single descriptors
Combining everything gives the best performance, 85.9%,
which is the result we submitted to THUMOS
18
TRECVID’13 Multimedia Event Detection
Large-scale video classification: 4,500 hours, over 100,000 videos.
ITF is the best video descriptor and is fast to compute: our
whole pipeline (ITF+FV) is only 10 times slower than real time.
For the visual channel, we combine ITF and SIFT
Team        Full system   ASR     Audio   OCR     Visual
AXES        36.6%         1.0%    12.4%   1.1%    29.4%
CMU         36.3%         5.7%    16.1%   3.7%    28.4%
BBNVISER    32.2%         8.0%    15.1%   5.3%    23.4%
Sesame      25.7%         3.9%    5.6%    0.2%    23.2%
MediaMill   25.3%         ----    5.6%    ----    23.8%
NII         24.9%         ----    8.8%    ----    19.9%
SRIAURORA   24.2%         3.9%    9.6%    4.3%    20.4%
Genie       20.2%         4.3%    10.1%   ----    16.9%

Top performance on MED ad-hoc
19