Transcript 20120127

Real-time Signal Processing on
Embedded Systems
Advanced Cutting-edge Research
Seminar I&III
Practical Applications

Pedestrian Detection: FPGA-based system
Pedestrian Tracking: GPU-based system

Hardware Architecture for
High-Accuracy Real-Time Pedestrian
Detection with CoHOG Features
Outline

Introduction
Pedestrian detection using CoHOG features
Proposed hardware architecture
  Parallel execution
  Merging histogram calculation and SVM prediction
FPGA implementation
Conclusion
Pedestrian detection on
automotive systems

Challenges:

Various appearances of pedestrians
…shape and color of clothes, pose, etc.
→ Template-based or simple gradient-based methods
do not achieve high recognition accuracy

Viewpoint movement
…all objects in an image are moving
→ Background subtraction or
frame subtraction cannot be used

A robust recognition method
suitable for pedestrians is required
Pedestrian detection algorithms

Recent trend:

Combination of gradients and histograms

Gradient: robust to illumination and color change
Histogram: robust to deformation

Examples

Histograms of oriented gradients (HOG)
Co-occurrence histograms of oriented gradients (CoHOG)*
  HOG-based method
  Using pairs of oriented gradients
  One of today's best algorithms for pedestrian detection

However, real-time execution is difficult to achieve with a
software implementation
(e.g. a few seconds are required to process a 320x240 image)
→ Specialized hardware for real-time processing

* T. Watanabe, S. Ito, and K. Yokoi,
"Co-occurrence histograms of oriented gradients for pedestrian detection," PSIVT 2009
Pedestrian detection using CoHOG

Calculate gradient orientations
Divide into small regions (BLOCKS)
Pick up pairwise pixels
Calculate co-occurrence histograms of oriented gradients
Repeat for various positions of pixel pairs (called OFFSETS; 31 offset variations)
CoHOG feature vector (histograms for Offset 1, Offset 2, ...) is classified by SVM
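
The co-occurrence histogram step can be sketched as below for a single block and a single offset. This is a minimal illustration, not the authors' implementation: the function name `cooccurrence_histogram`, the (dx, dy) offset encoding, and the choice to skip pairs whose second pixel falls outside the block are assumptions made for the example. The default of 8 orientation bins matches the 8x8=64 matrix size quoted later in the slides.

```python
def cooccurrence_histogram(orient, offset, bins=8):
    """Count orientation pairs (i, j) for one block and one offset.

    orient: 2-D list of quantized gradient orientations, each in range(bins).
    offset: (dx, dy) displacement from a pixel to its paired pixel (assumed encoding).
    Returns the bins x bins co-occurrence matrix flattened to a list.
    """
    dx, dy = offset
    height, width = len(orient), len(orient[0])
    hist = [0] * (bins * bins)
    for y in range(height):
        for x in range(width):
            px, py = x + dx, y + dy
            # Simplification for this sketch: ignore pairs that leave the block.
            if 0 <= px < width and 0 <= py < height:
                i = orient[y][x]    # orientation at the pixel
                j = orient[py][px]  # orientation at the paired pixel
                hist[i * bins + j] += 1
    return hist


# Toy 3x3 block with 4 orientation bins and the offset "one pixel to the right".
block = [
    [0, 1, 1],
    [2, 1, 0],
    [2, 2, 3],
]
hist = cooccurrence_histogram(block, offset=(1, 0), bins=4)
```

Repeating this for every block and every offset and concatenating the results gives the CoHOG feature vector described above.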
Detection procedure

Sliding window approach


Feature vectors are extracted in scan-line
order.
The image size or window size is scaled to
detect pedestrians at other scales.
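
A sliding window scan in scan-line order can be sketched as follows. The window size and step used here are placeholder values chosen for the example; the slides do not state them.

```python
def sliding_windows(image_w, image_h, win_w, win_h, step):
    """Yield the top-left corner of every window, row by row (scan-line order)."""
    for top in range(0, image_h - win_h + 1, step):
        for left in range(0, image_w - win_w + 1, step):
            yield left, top


# Hypothetical sizes, for illustration only.
positions = list(sliding_windows(image_w=8, image_h=6, win_w=4, win_h=4, step=2))
```

To detect pedestrians at another scale, the same scan would be repeated on a resized image or with a resized window.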
Parallel execution of
CoHOG feature calculation

A large number of co-occurrence histograms must be calculated
→ All histograms can be calculated in parallel

Large parallelism
Offsets (31 variations): 31 parallel threads
Blocks (6x12=72): horizontal 6 parallel threads, vertical 12 parallel threads

We execute
31 parallel offsets and
6 horizontal block-threads
= 186 parallel threads
Processing performance is
drastically improved!
Merging histogram calculation and
SVM prediction
Matrix size: 8x8=64
Offset variations: 31
Block number: 6x12=72

The dimension of the CoHOG feature vector is very high
64 × 31 offsets × 72 blocks = 142,848 (about 140k) dimensions
Large memory is required to store the feature vector
Many multiplications must be executed during
SVM prediction f(x) = sign(w·x + b)
Our proposal:
Execute histogram calculation
and SVM prediction simultaneously
Merging histogram calculation and
SVM prediction

Straightforward approach

Histogram calculation: scan the image and add +1 to the corresponding bin
x = (x_{i,j}), \quad x_{i,j} = \sum_{\text{image}} \begin{cases} 1, & \text{if orientations are } (i,j) \\ 0, & \text{otherwise} \end{cases}
→ Histogram is generated

SVM prediction: multiply each bin by its weighting vector value w_{i,j} and sum
w \cdot x = \sum_{i,j} (w_{i,j} \, x_{i,j})
→ Inner product is calculated for SVM prediction
Merging histogram calculation and
SVM prediction

Proposed method

Scan the image and directly accumulate weighting vector values: add +w_{i,j} instead of +1

x = (x_{i,j}), \quad x_{i,j} = \sum_{\text{image}} \begin{cases} 1, & \text{if orientations are } (i,j) \\ 0, & \text{otherwise} \end{cases}

w \cdot x = \sum_{i,j} (w_{i,j} \, x_{i,j})
          = \sum_{i,j} \left( w_{i,j} \sum_{\text{image}} \begin{cases} 1, & \text{if orientations are } (i,j) \\ 0, & \text{otherwise} \end{cases} \right)
          = \sum_{i,j} \sum_{\text{image}} \begin{cases} w_{i,j}, & \text{if orientations are } (i,j) \\ 0, & \text{otherwise} \end{cases}

Large memory to store histograms and
many multipliers for SVM prediction
are unnecessary
Circuit size can be drastically reduced!
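
The equivalence between the straightforward and the merged computation can be checked with a small sketch. The function names, the flat weight layout `weights[i * bins + j]`, and the convention that a score of exactly 0 maps to +1 are assumptions made for this example; integer weights are used so that both paths give bit-identical scores.

```python
def predict_straightforward(pairs, weights, bias, bins):
    """Build the co-occurrence histogram first, then compute w.x + b."""
    hist = [0] * (bins * bins)
    for i, j in pairs:
        hist[i * bins + j] += 1  # +1 to the corresponding bin
    score = sum(w * x for w, x in zip(weights, hist)) + bias
    return (1 if score >= 0 else -1), score


def predict_merged(pairs, weights, bias, bins):
    """Accumulate w[i, j] directly per observed pair: no histogram, no multiply."""
    score = bias
    for i, j in pairs:
        score += weights[i * bins + j]  # +w_{i,j} instead of +1
    return (1 if score >= 0 else -1), score


# Toy data: 2 orientation bins, so the weight vector has 2 x 2 = 4 entries.
weights = [3, -1, 2, -4]
pairs = [(0, 0), (0, 1), (1, 0), (0, 0), (1, 1)]
result_a = predict_straightforward(pairs, weights, bias=-1, bins=2)
result_b = predict_merged(pairs, weights, bias=-1, bins=2)
```

The merged version needs only a lookup of the weight (a ROM in hardware) and one accumulator, which is the saving the slide refers to.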
Proposed architecture
Input image
Gradient orientation image generator: line buffers, Sobel filters (horizontal and vertical), orientation classifier
Frame buffer (WxH)
Combined module for histogram calculation and SVM prediction: shift registers (subwindow data), weighting vector ROMs (31 offsets x 6 blocks), accumulator
Controller
Results
Proposed architecture

Parallel execution
31 offsets × 6 blocks = 186 parallel threads

Merging histogram calculation and SVM prediction
No histogram memory and multipliers
Only weighting vector ROMs and an accumulator

Efficient hardware architecture is successfully
designed by using the proposed methods
FPGA implementation

Implementation result

Target FPGA: Xilinx Virtex-5 XC5VLX330T-2

Resource                     Used     Available  Utilization
Number of Slice Registers    5,980    207,360    2%
Number of Slice LUTs         28,495   207,360    13%
Number of occupied Slices    8,580    51,840     16%
Number of BlockRAM           61       324        18%
Total Memory used (KB)       2,196    11,664     18%
Number of DSP48Es            2        192        1%

Max delay: 5.997ns (Max frequency: 167MHz)

Our system can process
139,166 sub-windows / second
Intel Core i7 3.2GHz:
about 1,100 sub-windows / second
More than 100 times faster!

Capable of real-time processing of a
320x240 video sequence at 38 fps
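
The headline numbers on this slide can be cross-checked with a few lines of arithmetic. The input values are taken from the slide; the sub-windows-per-frame figure is only what the two quoted rates imply if the whole throughput is spent on one 38 fps stream, and is not stated in the source.

```python
fpga_rate = 139_166   # sub-windows per second on the FPGA (from the slide)
cpu_rate = 1_100      # sub-windows per second on the Core i7 ("about", from the slide)
fps = 38              # frame rate quoted for 320x240 video
max_delay_ns = 5.997  # critical path delay (from the slide)

speedup = fpga_rate / cpu_rate             # roughly 126x, i.e. "more than 100 times"
windows_per_frame = fpga_rate / fps        # implied sub-window budget per frame
max_frequency_mhz = 1000.0 / max_delay_ns  # clock frequency implied by the delay
```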
Pedestrian detection system

FPGA board

Receives input images from
host PC, and returns results
of pedestrian detection
  Xilinx Virtex-5 FPGA LX330T
  PCI Express
  PCI Express endpoint
  DDR2 memory

Host PC

Transfers images captured
by a camera, and displays
detection results
  CPU: Intel Core i7 3.2GHz
  Camera: USB webcam (640x480 resolution)

Detection result
Conclusion

A high-performance and efficient hardware
architecture for CoHOG-based pedestrian
detection is proposed

Effectively exploits parallelism in the CoHOG algorithm
→ 186-way parallel processing is realized
Drastically reduces circuit area (memory and
multipliers) by executing histogram calculation
and SVM prediction simultaneously
Achieves more than 100 times faster processing
on the FPGA than on a CPU
→ Capable of real-time processing of a
320x240 video sequence at 38 fps
Parallel Implementation of Pedestrian
Tracking Using Multiple Cues on GPGPU
Outline




Introduction
Pedestrian Tracking using Multiple Cues
Parallel Implementation on NVIDIA GPU
Conclusion
Introduction

Pedestrian recognition

Combination of 2 steps
Detection: scan the entire input image
Tracking: track the pedestrians over the frames
Introduction

Pedestrian Tracking

Particle Filter

HSV color histogram within the rectangle (K. Okuma et al., ECCV 2004)
Simple background: tracking succeeds
Complex background: tracking fails
Introduction
Color information
Red shirt, gray ground → HSV histogram
Red car, gray ground → HSV histogram
Shape information
Combining both color and shape information
Introduction

The contributions of this paper


New pedestrian tracking algorithm using
both color and shape information based on
particle filters
Parallel implementation on GPGPU for real-time processing
Particle Filter (pedestrian
tracking)
Current frame (time t-1)
Prediction: scatter particles
Measurement: measure the pedestrian likelihood
Re-sampling (time t): eliminate low-likelihood particles
and replicate high-likelihood particles
Particle Filter (pedestrian
tracking)
Current frame → Prediction → Measurement → Re-sampling
To define pedestrian likelihood,
we use
Shape information…HOG feature
Color information…HSV histogram
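
One cycle of the particle filter loop can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' tracker: particles are plain (x, y) positions, prediction is a Gaussian random walk, and `toy_likelihood` is a hypothetical stand-in for the HOG and HSV based pedestrian likelihood.

```python
import math
import random


def particle_filter_step(particles, likelihood, noise=2.0, rng=random):
    """One Prediction -> Measurement -> Re-sampling cycle for (x, y) particles."""
    # Prediction: scatter particles with a random-walk motion model (assumed).
    predicted = [(x + rng.gauss(0.0, noise), y + rng.gauss(0.0, noise))
                 for x, y in particles]
    # Measurement: evaluate the likelihood of every particle.
    weights = [likelihood(p) for p in predicted]
    if sum(weights) == 0.0:
        return predicted  # no usable measurement; keep the scattered particles
    # Re-sampling: low-likelihood particles tend to be dropped,
    # high-likelihood particles tend to be replicated.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def toy_likelihood(p, target=(50.0, 30.0), sigma=10.0):
    """Hypothetical likelihood that peaks at a fixed target position."""
    dx, dy = p[0] - target[0], p[1] - target[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


rng = random.Random(0)
particles = [(rng.uniform(0, 100), rng.uniform(0, 60)) for _ in range(200)]
for _ in range(20):
    particles = particle_filter_step(particles, toy_likelihood, rng=rng)
mean_x = sum(x for x, _ in particles) / len(particles)
mean_y = sum(y for _, y in particles) / len(particles)
```

After a few cycles the particle cloud concentrates around the position where the likelihood is highest, and its mean serves as the tracking estimate.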
Histograms of Oriented Gradients

Represent object shape information
Calculate gradient orientation
Aggregate gradient orientation of each block
Map the vector on the feature space
Learn beforehand by SVM

HOG feature space: the discriminant border separates pedestrian from non-pedestrian
d_{HOG}: distance to the discriminant border
HSV Histogram

Represent object color information

Convert an input image into an HSV image (hue, saturation, value)
Calculate an HSV histogram
Calculate the Bhattacharyya distance d_{HSV} to the reference HSV histogram

HSV feature space: HSV histogram of the input image vs. reference HSV histogram
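
A Bhattacharyya distance between two histograms can be sketched as below. The slides do not say which variant is used; this example assumes the form sqrt(1 - BC), with BC the Bhattacharyya coefficient, which is common in color-based particle filter tracking. The function names are chosen for the example.

```python
import math


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]


def bhattacharyya_distance(p, q):
    """sqrt(1 - BC), where BC = sum_k sqrt(p_k * q_k), for normalized histograms."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))  # clamp guards against rounding below 0


reference = normalize([2, 1, 1])
same = bhattacharyya_distance(reference, normalize([4, 2, 2]))
disjoint = bhattacharyya_distance([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give a distance of 0 and non-overlapping ones give 1, so a small distance indicates a color distribution close to the reference.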
Pedestrian tracking using multiple
cues
Prediction → Measurement

Pedestrian likelihood:
c f(d_{HOG}) + (1 - c) g(d_{HSV})

c: weighting coefficient in [0,1]
d_{HOG}: distance to the discriminant border in the HOG feature space (pedestrian / non-pedestrian)
d_{HSV}: Bhattacharyya distance to the reference HSV hist. in the HSV feature space (existing algorithm)
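
The weighted combination of the two cues can be sketched as follows. The slides do not define f and g, so the logistic function used for the HOG cue, the exponential used for the HSV cue, and the parameters `steepness` and `lam` are placeholders chosen purely for illustration.

```python
import math


def f_hog(d_hog, steepness=1.0):
    """Hypothetical f: maps the signed distance to the SVM border into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-steepness * d_hog))


def g_hsv(d_hsv, lam=20.0):
    """Hypothetical g: a smaller Bhattacharyya distance gives a higher likelihood."""
    return math.exp(-lam * d_hsv * d_hsv)


def pedestrian_likelihood(d_hog, d_hsv, c=0.5):
    """c * f(d_HOG) + (1 - c) * g(d_HSV) with weighting coefficient c in [0, 1]."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("c must lie in [0, 1]")
    return c * f_hog(d_hog) + (1.0 - c) * g_hsv(d_hsv)


balanced = pedestrian_likelihood(0.0, 0.0, c=0.5)
shape_only = pedestrian_likelihood(2.0, 0.9, c=1.0)
color_only = pedestrian_likelihood(2.0, 0.9, c=0.0)
```

Setting c to 1 or 0 reduces the tracker to the HOG-only or HSV-only case, which are the two baselines compared in the tracking results.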
Tracking results



HOG+HSV (our proposed algorithm)
HSV only (K. Okuma et al., ECCV 2004)
HOG only
NVIDIA GPU architecture





Streaming multiprocessors (SM)
32-bit scalar processors (SP)
Shared memory
Read only cache
Device memory

Each SM contains SPs, shared memory, and a cache; all SMs share the device memory

In case of Tesla C1060,
•4GB Device memory
•30 streaming multiprocessors (total 240 SPs)
•1.3 GHz processor clock
Implementation strategy
Current frame → Prediction → Measurement → Re-sampling

Run the measurement process on the GPU.
It accounts for almost 99% of the computation time
Implementation strategy

Allocate each particle to an SM
The measurement of each particle is an independent process
Implementation strategy

Exploit pixel-level parallelism on SPs
Synchronization among SPs is fast.
HSV likelihood calculation
Allocate each particle calculation to the SM
Calculate HSV histogram on SPs per line
Sum all the histograms
Calculate the Bhattacharyya dist. to the reference HSV hist.
Transfer the results to the CPU memory
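
The "histogram per line, then sum" pattern can be illustrated sequentially. This is a plain Python stand-in rather than GPU code: each call to `line_histogram` represents the work that one SP would perform on one image line, and the final column-wise sum represents the reduction within the SM. The function names and the pre-quantized bin indices are assumptions for the example.

```python
def line_histogram(line, bins):
    """Partial histogram of one image line (the unit of work assumed per SP)."""
    hist = [0] * bins
    for value in line:
        hist[value] += 1
    return hist


def region_histogram(region, bins):
    """Sum the per-line partial histograms into the histogram of the region."""
    partials = [line_histogram(line, bins) for line in region]
    return [sum(column) for column in zip(*partials)]


# Toy particle region: 2 lines of 3 pixels, already quantized to 3 color bins.
region = [
    [0, 1, 1],
    [2, 2, 0],
]
total = region_histogram(region, bins=3)
```

Because the partial histograms are independent, they can be computed concurrently and combined in a single summation step.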
HOG likelihood calculation
Allocate each particle calculation to the SM
Calculate grad. and angle on SPs
Calculate HOG histogram on SPs per group of pixels
Sum histograms
Calculate the distance to the discriminant border (pedestrian / non-pedestrian) in the HOG feature space
Transfer the results to the CPU memory
Processing time

GPU: NVIDIA Tesla C1060
Number of multiprocessors: 30
Total number of scalar processors: 240
Compared with Intel Core i7 965 @ 3.2 GHz

[Bar chart: processing time per frame [ms], Core i7 vs. Tesla C1060]
Tesla C1060: 113.6 fps, 13.9 times faster than Core i7
Conclusion


A pedestrian tracking algorithm using HSV
and HOG features is proposed
Real-time processing can be achieved
by the parallel implementation on an
NVIDIA GPU
Report subject (not mandatory)

What do you think about the future advances
of signal processing on embedded
systems? Write freely (applications, things you
would like to do, anything is OK).

Please submit the report by email to
[email protected].
Please write your student ID and name in the body of the email.
Deadline: Feb 3rd 17:00