Transcript CNN - SGLab

CS688/WST665
Spatial Pyramid Pooling in Deep Convolutional
Networks for Visual Recognition
Presenter ByungIn Yoo
Contents
● Introduction
● Motivation
● Previous work
● Main Idea
● Details
● Experiments
● Conclusion
Introduction
● Web-scale image retrieval
- Classify images or videos
- Detect and localize objects
- Estimate semantic and geometrical attributes
● Why is this challenging?
- Viewpoint
- Illumination
- Occlusion
- Scale
- Deformation
- Background clutter
Motivation
● Current CNNs require a fixed input image size (e.g., 224 x 224).
- Crop: content loss
- Warp: distortion
● Recognition accuracy is degraded!
[Figure: an image is cropped or warped to 224x224 before entering the Convolutional Neural Network (CNN); the diagram is also labeled with Spatial Pyramid Pooling]
Previous work (1/2)
● Spatial Pyramid Matching
- very successful in traditional computer vision
Grauman et al, The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features, ICCV 2005.
Lazebnik et al, Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, CVPR 2006.
Previous work (2/2)
● Zeiler-Fergus architecture (2013, 1st)
- 8 layers
- Still low accuracy, and a fixed image size
● GoogLeNet (2014, 1st)
- 22 layers
- Overly complex model, and a fixed image size
[Figure: GoogLeNet architecture diagram; legend: Convolution, Pooling, Softmax, Other]
M.D. Zeiler et al., "Visualizing and understanding convolutional neural networks", arXiv:1311.2901, 2013.
Christian Szegedy et al., "Going Deeper with Convolutions", arXiv:1409.4842, 2014.
Main Idea (1/2)
● Add a Spatial Pyramid Pooling layer!
[Figure: previous nets compared with SPP-net]
Main Idea (2/2)
● Generates a fixed-length representation regardless of image size/scale.
● Simple (still 8 layers) and powerful model!
● Variable input size/scale
● Multi-size training, multi-scale testing, full-image view
● Multi-level pooling
● Robust to deformation
● Operates on feature maps
● Pooling in regions
Details – Convolutional Layers and Feature Maps
● Convolutional layers can inherently accept images of arbitrary size.
● Feature maps encode not only the strength of the responses, but also their spatial positions.
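The two points above can be illustrated with a minimal sketch in plain Python. The helper names (`conv2d_valid`, `blank_image_with_dot`, `argmax2d`) and the toy kernel are assumptions made for illustration and are not taken from the slides. The same kernel is applied to two images of different sizes, and the location of the strongest response follows the location of the bright pixel.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding.

    As in most CNN frameworks this is technically cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def blank_image_with_dot(height, width, row, col):
    """Hypothetical test image: all zeros except one bright pixel."""
    img = [[0] * width for _ in range(height)]
    img[row][col] = 1
    return img


def argmax2d(fmap):
    """Return (row, col) of the largest value in a 2-D feature map."""
    best = max((v, r, c) for r, row in enumerate(fmap) for c, v in enumerate(row))
    return best[1], best[2]


# Toy 3x3 kernel that responds to a bright pixel under its center.
kernel = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

# The same kernel works on a 5x5 image and on an 8x6 image.
small = conv2d_valid(blank_image_with_dot(5, 5, 2, 3), kernel)
large = conv2d_valid(blank_image_with_dot(8, 6, 5, 1), kernel)

print(len(small), len(small[0]), argmax2d(small))
print(len(large), len(large[0]), argmax2d(large))
```

Only the size of the output feature map changes with the input size; the kernel weights do not. The fixed-size requirement comes from the fully connected layers that follow.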
Details – The Spatial Pyramid Pooling Layer
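● SPP-net adds a new Spatial Pyramid Pooling (SPP) layer between Conv5 and FC6.
● Conv5 has 256 filters, so the pooled output is a 256 x (4x4 + 2x2 + 1x1) = 5376-dimensional vector.
● Layer stack (input to output): Conv1, Conv2, Conv3, Conv4, Conv5, SPP, FC6, FC7, SoftMax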
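A minimal sketch of such a pooling layer in plain Python follows. It assumes max pooling and floor/ceil bin boundaries, which is one common choice and not necessarily the exact rule used by the authors. The function names and the `ramp_maps` stand-in for the Conv5 output are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def spp_channel(fmap, levels=(4, 2, 1)):
    """Max-pool one 2-D feature map into n x n bins for each pyramid level."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin i covers rows [floor(i*h/n), ceil((i+1)*h/n)).
            r0, r1 = (i * h) // n, ceil_div((i + 1) * h, n)
            for j in range(n):
                c0, c1 = (j * w) // n, ceil_div((j + 1) * w, n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


def spp(feature_maps, levels=(4, 2, 1)):
    """Concatenate the pooled bins of every channel into one flat vector."""
    vector = []
    for fmap in feature_maps:
        vector.extend(spp_channel(fmap, levels))
    return vector


def ramp_maps(channels, h, w):
    """Hypothetical stand-in for Conv5 output: values grow with position."""
    return [[[k + r * w + c for c in range(w)] for r in range(h)]
            for k in range(channels)]


vec_13 = spp(ramp_maps(256, 13, 13))     # 13x13 feature maps
vec_10x17 = spp(ramp_maps(256, 10, 17))  # a different, non-square size

print(len(vec_13), len(vec_10x17))
```

Both calls return a vector of length 256 x 21 = 5376, so the fully connected layers that follow always see the same input dimension.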
Details – Training with the Spatial Pyramid Pooling
● Single-size training
● Simply modify the configuration file of CNN frameworks
● Feature map after Conv5: 13x13
[Figure: layer stack Conv1 to Conv5, SPP, FC6, FC7, SoftMax]
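With a fixed 13x13 feature map, each pyramid level can be expressed as an ordinary sliding-window pooling layer, which is why a configuration change is enough. The sketch below assumes the rule window = ceil(a/n) and stride = floor(a/n) for a feature map of side a and an n x n level. The slide does not state this rule, so treat it and the function names as assumptions.

```python
import math


def pool_params(a, n):
    """Window size and stride that split a side of length `a` into `n` bins."""
    window = math.ceil(a / n)
    stride = a // n
    return window, stride


def num_bins(a, window, stride):
    """Number of window positions along one side, with no padding."""
    return (a - window) // stride + 1


# Pyramid levels from the slides (4x4, 2x2, 1x1) on a 13x13 feature map.
table = {n: pool_params(13, n) for n in (4, 2, 1)}
for n, (window, stride) in table.items():
    print(n, window, stride, num_bins(13, window, stride))
```

For these three levels the computed window and stride yield exactly n bins per side, so each level maps onto a standard pooling layer entry in the configuration.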
Details – Training with the Spatial Pyramid Pooling
● Multiple-size training
● Multiple networks share all weights
● Each network handles a single size (e.g., 224x224, 180x180)
● Improves scale invariance
[Figure: image resize]
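One way to sketch this setup is shown below. It assumes that training switches input size after each full epoch and that weight sharing is modeled by every per-size network pointing at the same parameter object. The schedule, the `shared_weights` dictionary, and the function name are illustrative assumptions and not details given on the slide.

```python
from itertools import cycle


def size_schedule(sizes, epochs):
    """Yield (epoch, input_size), switching size after each full epoch."""
    picker = cycle(sizes)
    for epoch in range(epochs):
        yield epoch, next(picker)


# Hypothetical parameter store; every per-size network refers to the same one.
shared_weights = {"conv1": [0.0], "fc6": [0.0]}
networks = {size: shared_weights for size in (224, 180)}

schedule = list(size_schedule((224, 180), epochs=4))
for epoch, size in schedule:
    # An update made through one network is visible to the other.
    networks[size]["conv1"][0] += 1.0

print(schedule)
print(networks[224]["conv1"], networks[180]["conv1"])
```

Because the pooled vector has the same length for both sizes, the fully connected layers can also be shared without modification.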
Details – Fast CNN-based Object Detection
● Features are computed from the entire image only once.
● Similar accuracy to R-CNN, but much faster (24x to 64x).
● R-CNN: 2000 convolution passes per image. SPP-net: 1 convolution pass.
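The idea can be sketched as follows: the feature map is computed once, and each candidate window is then pooled directly from it. The projection rule (floor for the top-left corner, ceil for the bottom-right corner, with an assumed total stride of 16) is a simplification chosen for illustration, and all names are hypothetical. Boxes are assumed to lie inside the image.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def project_box(box, stride=16):
    """Map an image-space box (x0, y0, x1, y1) to feature-map cells.

    Simplified assumption: floor the top-left corner and ceil the
    bottom-right corner, which is treated as exclusive.
    """
    x0, y0, x1, y1 = box
    return x0 // stride, y0 // stride, ceil_div(x1, stride), ceil_div(y1, stride)


def pool_region(fmap, cells, n):
    """Max-pool the feature-map region `cells` into n x n bins."""
    c0, r0, c1, r1 = cells
    h, w = r1 - r0, c1 - c0
    out = []
    for i in range(n):
        ra, rb = r0 + (i * h) // n, r0 + ceil_div((i + 1) * h, n)
        for j in range(n):
            ca, cb = c0 + (j * w) // n, c0 + ceil_div((j + 1) * w, n)
            out.append(max(fmap[r][c]
                           for r in range(ra, rb)
                           for c in range(ca, cb)))
    return out


# Stand-in for one channel of a 13x13 feature map, computed once per image.
fmap = [[r * 13 + c for c in range(13)] for r in range(13)]

# Candidate windows in image coordinates reuse the same feature map.
boxes = [(32, 48, 160, 176), (0, 0, 208, 208)]
features = [pool_region(fmap, project_box(b), n=2) for b in boxes]
print(features)
```

The expensive convolution step runs once regardless of how many windows are evaluated; only the cheap pooling step is repeated per window.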
Experiments (1/4)
● ILSVRC image classification task
● 1000 object classes (1,431,167 images)
Experiments (2/4)
● ILSVRC image classification task (rank #3)
● SPP improves all CNN architectures
[Tables: top-5 test accuracy; top-5 validation accuracy]
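For reference, the top-5 metric named in the tables counts a prediction as correct when the true class is among the five highest-scoring classes. A minimal sketch follows; the scores and labels are made-up toy values and not results from the paper.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda cls: row[cls], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example: 3 samples, 6 classes (values invented for illustration).
scores = [
    [0.10, 0.30, 0.05, 0.20, 0.25, 0.12],  # true class 0 is ranked 5th
    [0.50, 0.10, 0.12, 0.08, 0.11, 0.09],  # true class 0 is ranked 1st
    [0.30, 0.20, 0.21, 0.15, 0.10, 0.04],  # true class 5 is ranked last
]
labels = [0, 0, 5]

top5 = top_k_accuracy(scores, labels, k=5)
top1 = top_k_accuracy(scores, labels, k=1)
print(top5, top1)
```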
Experiments (3/4)
● ILSVRC image detection task
● Fully annotated 200 object classes across 121,931 images
● Allows evaluation of generic object detection in cluttered scenes at scale
[Figure: detected regions compared with ground truth, each marked true or false]
Experiments (4/4)
● ILSVRC image detection task (rank #2)
● More practical than R-CNN
Conclusion
● SPP is a flexible solution for handling different scales, sizes, and aspect ratios.
● Spatial Pyramid Pooling improves accuracy.
● Multi-size training improves accuracy.
● Full-image representation improves accuracy.
● Classification: SPP improves all the CNN architectures tested.
● Detection: more practical and much faster than R-CNN, with similar accuracy.