Transcript [slides]

An Efficient GPGPU Implementation of Viola-Jones Classifier based Face Detection Algorithm
Sharmila Shridhar, Vinay Gangadhar,
Ram Sai Manoj
ECE 759 Project Presentation
Fall 2015
University of Wisconsin - Madison
Executive Summary
• Viola-Jones classifier based face detection implemented on GPU
• Detection done in 2 phases:
– Nearest Neighbor and Integral Image
– Scanning Window and HAAR Feature Detection
• Optimizations applied to both phases
– Shared memory, bank-conflict elimination, TB divergence, etc.
• Up to 5.3x speedup compared to single-threaded CPU performance
• GPU performs better for larger images
Introduction
• Face detection is a widely used algorithm
– Auto tagging pictures in Facebook
– Easy search in Google Photos
– Biometric based security access
– Threat activity detection
– Face mapping
• Human faces → similar properties (HAAR)
• Different types of algorithms
– Motion based
– Color based
• Viola Jones Classifier based algorithm
Motivation
• Face detection algorithms have a large amount of Data Level
Parallelism (DLP)
• Involve processing each window separately to detect face
• GPU resources can be utilized efficiently
• Performance- and energy-efficient face detection implementation
• This application can be used to
– showcase the benefits of GPGPU implementation
– showcase incremental benefits with fine tuning of the code
(optimizations)
Outline
• Introduction
• Motivation
• Viola-Jones Background
• Nearest Neighbor and Integral Image
• HAAR based Cascade Classifier Implementation
• Evaluation and Results
• Conclusion
Background - Viola-Jones Algorithm (1)
• Haar Feature Selection
– Each feature consists of white and black
rectangles
– Subtract white region’s pixel sum from
black region’s pixel sum
– If (sum > threshold)
Region has Haar Feature
• We use pre-selected Haar Features as
classifiers for Face Detection
Background - Viola-Jones Algorithm (2)
• Integral Image Calculation
– Sum calculation for each Haar rectangle is very important
– Sped up by using Integral Image
Integral sum at any pixel: IS(x, y) = Σ_{x' ≤ x, y' ≤ y} v(x', y'),
where v(x', y') is the value of the pixel (x', y')
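As an illustration of this definition, a minimal CPU sketch in plain Python (the project itself implements this in CUDA; `integral_image` here is only a reference version):

```python
def integral_image(img):
    """IS(x, y) = sum of v(x', y') for all x' <= x, y' <= y.

    Computed incrementally: each entry adds the pixel plus the sums
    to the left and above, minus the doubly counted diagonal term.
    """
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ii[y][x] = (img[y][x]
                        + (ii[y][x - 1] if x > 0 else 0)
                        + (ii[y - 1][x] if y > 0 else 0)
                        - (ii[y - 1][x - 1] if x > 0 and y > 0 else 0))
    return ii

print(integral_image([[1, 2], [3, 4]]))  # [[1, 3], [4, 10]]
```

With IS available, the pixel sum over any Haar rectangle needs only four lookups: IS at the bottom-right corner, minus the entries just above and just left of the rectangle, plus the diagonal entry.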
Background - Viola-Jones Algorithm (3)
Cascade Classifier
• Adaboost Algorithm
– Output of several weak classifiers (Haar Features) combined into weighted
sum to form Strong Classifier
– Further concatenates strong classifiers to a Cascade Classifier
[Figure: weighted weak classifiers combine into a strong classifier; strong classifiers concatenate into the cascade classifier]
Background - Image Pyramid
• We use 25x25 window size for
each feature
• But face in the image can be of
any size
• Use Image Pyramid to make
face detection scale invariant
Implementation Flow
1. Read the source image and the cascade classifier parameters.
2. Nearest Neighbor: downscale the image by a factor of 1.2 per iteration (detection window 25 x 25), repeating while image size / scaling factor >= 25.
3. Integral Image: compute the sum of pixels from [0,0] → [x,y], and the sum of squares of pixels from [0,0] → [x,y].
4. Set the image for HAAR detection: compute the image co-ordinates for each HAAR feature.
5. Run the cascade classifier on each detection window:
   – Integral sum > threshold for all stages → face detected, store the co-ordinates.
   – Integral sum < threshold for a stage → skip this window for further stages.
6. Shift the detection window and repeat.
7. Group rectangles and draw rectangles around the faces → image with detected faces.
Nearest Neighbor (NN)
• Computes the image pixels for the downscaled (DS) image
• Used scaling factor of 1.2 in implementation
• Detection window of 25 X 25 pixels
• Image downscaled until it’s equal to detection window
• Why downscale? → Example: downscaled image with scaling factor of 2
Parallelization scope
• DS image pixel positions are calculated from the scale factor and the width & height of the source image
• Map each (or more) pixel position to a single thread
Source image (8x8, pixel values 0–63):
 0  8 16 24 32 40 48 56
 1  9 17 25 33 41 49 57
 2 10 18 26 34 42 50 58
 3 11 19 27 35 43 51 59
 4 12 20 28 36 44 52 60
 5 13 21 29 37 45 53 61
 6 14 22 30 38 46 54 62
 7 15 23 31 39 47 55 63
→ Nearest Neighbor →
 0 16 32 48
 2 18 34 50
 4 20 36 52
 6 22 38 54
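A CPU sketch of this mapping in plain Python (in the GPU version each downscaled pixel position maps to a thread; this reference version just loops):

```python
def nearest_neighbor(src, scale):
    """Nearest-neighbor downscale: destination pixel (x, y) takes the
    value of source pixel (int(y * scale), int(x * scale))."""
    h, w = len(src), len(src[0])
    dh, dw = int(h / scale), int(w / scale)
    return [[src[int(y * scale)][int(x * scale)] for x in range(dw)]
            for y in range(dh)]

# 8x8 example image from the slide: pixel (row, col) has value row + 8 * col
src = [[r + 8 * c for c in range(8)] for r in range(8)]
dst = nearest_neighbor(src, 2)
print(dst[0])  # [0, 16, 32, 48]
```

With the implementation's scale factor of 1.2, each iteration shrinks both dimensions by roughly 17% until the image reaches the 25x25 detection window.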
Integral Image (II)
• Sum of all the pixels above & to the left of (x, y), for each (x, y) in the image
• Integral Image → prefix scan along each row, then along each column
Parallelization scope
• Prefix scan of each row is independent of the other rows → RowScan (RS)
• Prefix scan along columns → ColumnScan (CS)
• Similarly compute the square integral sum (sum of squares)
• Why? We need the variance of the pixels within the Haar rectangle co-ordinates
Var(X) = E(X^2) - (E(X))^2
 0 16 32 48
 2 18 34 50
 4 20 36 52
 6 22 38 54
→ RS →
 0 16 48  96
 2 20 54 104
 4 24 60 112
 6 28 66 120
→ CS →
 0 16  48  96
 2 36 102 200
 6 60 162 312
12 88 228 432
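The two scans and the variance computation can be sketched in plain Python (CPU reference only; the matrix is the slide's 4x4 example):

```python
def rowscan(m):
    """Inclusive prefix sum along each row."""
    out = []
    for row in m:
        acc, s = 0, []
        for v in row:
            acc += v
            s.append(acc)
        out.append(s)
    return out

def colscan(m):
    """Inclusive prefix sum along each column."""
    t = [list(c) for c in zip(*m)]              # transpose
    return [list(c) for c in zip(*rowscan(t))]  # scan rows, transpose back

img = [[0, 16, 32, 48],
       [2, 18, 34, 50],
       [4, 20, 36, 52],
       [6, 22, 38, 54]]
ii = colscan(rowscan(img))                                 # integral image
sq = colscan(rowscan([[v * v for v in r] for r in img]))   # square integral image

# Variance of the whole window via Var(X) = E(X^2) - (E(X))^2
n = 16
mean = ii[3][3] / n
var = sq[3][3] / n - mean * mean
print(ii[3][3], var)  # 432 325.0
```

For an arbitrary 25x25 scan window the same two values come from four corner lookups in each of the two integral images instead of the full-image corner used here.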
NN + II Implementation
• Implementing the NN and II sum in 4 separate kernels
• Kernel 1 → Nearest Neighbor (NN) & RowScan (RS) – downscaled image ready
• Kernel 2 → Matrix Transpose
• Kernel 3 → RowScan (performs the ColumnScan of the original orientation)
• Kernel 4 → Matrix Transpose
• Integral sum & square integral sum ready at the end of Kernel 4
Kernel 1 → Nearest Neighbor (NN) & RowScan (RS)
• Combining NN & RS eliminates 1 global memory access (no store between kernels)
Kernel configuration: (w, h – width & height of downscaled Image)
• Threads per block = smallestpower2(w) – Constraint from RowScan algorithm
• Blocks = h
• RowScan – Inclusive prefix scan of each row in image
• Harris-Sengupta-Owens algorithm (upsweep & downsweep)
Source image (8x8, pixel values 0–63)
→ NN →
 0 16 32 48
 2 18 34 50
 4 20 36 52
 6 22 38 54
→ RS →
 0 16 48  96
 2 20 54 104
 4 24 60 112
 6 28 66 120
Row scans performed in Shared memory [2 * BLOCKSIZE + 1]
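The upsweep/downsweep scan can be modeled serially in plain Python (a sketch of the Blelloch-style algorithm the slide names; on the GPU each inner loop's iterations run as parallel threads over shared memory, and the array length must be a power of two, hence smallestpower2(w)):

```python
def inclusive_scan(a):
    """Work-efficient prefix scan: upsweep builds partial sums in a
    balanced-tree pattern, downsweep distributes them (exclusive scan),
    then each element adds its own input to make the scan inclusive.
    len(a) must be a power of two."""
    x = list(a)
    n = len(x)
    d = 1
    while d < n:                      # upsweep (reduce) phase
        for i in range(0, n, 2 * d):
            x[i + 2 * d - 1] += x[i + d - 1]
        d *= 2
    x[n - 1] = 0                      # clear the root
    d = n // 2
    while d >= 1:                     # downsweep phase
        for i in range(0, n, 2 * d):
            t = x[i + d - 1]
            x[i + d - 1] = x[i + 2 * d - 1]
            x[i + 2 * d - 1] += t
        d //= 2
    return [e + v for e, v in zip(x, a)]   # exclusive -> inclusive

print(inclusive_scan([0, 16, 32, 48]))  # [0, 16, 48, 96]
```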
Kernel 2, 3 & 4 → ColumnScan (CS)
• The ColumnScan is replaced with kernels 2, 3 & 4
• Why? It could have been done directly as RowScan & ColumnScan – the straightforward implementation
 0 16 32 48
 2 18 34 50
 4 20 36 52
 6 22 38 54
→ RS →
 0 16 48  96
 2 20 54 104
 4 24 60 112
 6 28 66 120
→ CS →
 0 16  48  96
 2 36 102 200
 6 60 162 312
12 88 228 432
Alternate Method for ColumnScan
Motivation for Transpose
• ColumnScan – perform prefix scan along the column
Threads 0–3 each perform the prefix scan down one column of the row-scanned matrix:
 0 16 48  96   (Row 0)
 2 20 54 104   (Row 1)
 4 24 60 112   (Row 2)
 6 28 66 120   (Row 3)
Global memory (GM) layout of the above 2D matrix is row-major: Row 0, Row 1, Row 2, Row 3 are stored contiguously, so T0–T3 stepping down their columns touch addresses strided by the row width
• Global memory reads aren't coalesced
• Time consuming (400 – 500 cycles per access!) & each thread needs a separate access
ColumnScan Breakdown
Downscaled image after RowScan:
 0 16 48  96
 2 20 54 104
 4 24 60 112
 6 28 66 120
→ Transpose →
 0   2   4   6
16  20  24  28
48  54  60  66
96 104 112 120
→ RS →
 0   2   6  12
16  36  60  88
48 102 162 228
96 200 312 432
→ Transpose →
Image after Integral sum:
 0 16  48  96
 2 36 102 200
 6 60 162 312
12 88 228 432
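This breakdown works because a column scan equals transpose → row scan → transpose; a quick check of that identity in plain Python on the slide's matrix:

```python
def rowscan(m):
    """Inclusive prefix sum along each row."""
    return [[sum(row[:i + 1]) for i in range(len(row))] for row in m]

def transpose(m):
    return [list(c) for c in zip(*m)]

img = [[0, 16, 32, 48],
       [2, 18, 34, 50],
       [4, 20, 36, 52],
       [6, 22, 38, 54]]

# Kernels 1–4: RowScan -> Transpose -> RowScan -> Transpose
ii = transpose(rowscan(transpose(rowscan(img))))
print(ii[3])  # [12, 88, 228, 432] – last row of the integral image
```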
Matrix Transpose
• Implemented in a tiled fashion – coalesced reads & writes to GM
• Further optimizations follow…
 0 16 48  96
 2 20 54 104
 4 24 60 112
 6 28 66 120
→ Transpose →
 0   2   4   6
16  20  24  28
48  54  60  66
96 104 112 120
Optimizations in Transpose & RS Kernels
Matrix Transpose
• GM coalescing → write row-wise only
[Figure: naive transpose reads rows and writes columns of the output directly to global memory (uncoalesced writes); the optimized version stages the tile in Shared memory[BLOCKSIZE][BLOCKSIZE], reads the tile along columns from shared memory, and writes row-wise to global memory (coalesced)]
• Reading along a column → any SM bank conflicts?
– SM has 32 banks; threads in a warp accessing the same bank serialize
• BLOCKSIZE = 16 → SM[Ty][Tx] changed to SM[Tx][Ty]
• (Tx, Ty) maps to bank (Tx * 16 + Ty) % 32 → (Tx, Ty) of (0, 0) & (2, 0) both map to bank 0
• Eliminated by Shared memory [BLOCKSIZE] [BLOCKSIZE + 1]
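The bank arithmetic on this slide can be checked numerically (assuming 4-byte shared-memory words and 32 banks, as on Fermi-class GPUs):

```python
BANKS = 32

def banks_hit(tile_width, ty):
    """Banks touched when threads tx = 0..15 of a warp read SM[tx][ty]
    from a tile stored row-major with the given (padded) row width."""
    return {(tx * tile_width + ty) % BANKS for tx in range(16)}

print(len(banks_hit(16, 0)))  # 2  -> heavy bank conflicts
print(len(banks_hit(17, 0)))  # 16 -> conflict-free column reads
```

Padding the row length from BLOCKSIZE to BLOCKSIZE + 1 (16 to 17) makes consecutive rows start in different banks, which is exactly the fix the slide applies.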
Optimization in RowScan Only
RowScan – row-wise prefix scan
• Use extern shared memory (SM) → don't make it an occupancy constraint by hardcoding
• Kernel execution configuration
– Shared memory size required → 2 * (width of image)
[one each for the integral sum & square integral sum]
– At a downscale to 256 x 256 (from a 1K x 1K source): 1D 256 threads per block (TPB), 256 blocks
6 blocks alive – 100% occupancy (1536 threads)
– Hardcoding to the max case of 1K TPB (8 kB SM) decreases this to 5 blocks (only 84% occupancy)
Face detection using the Cascade Classifier (CC) follows…
Implementation
Scan Window Processing
• Classifier size: 25x25 (fixed); image size: 1024x1024 (can vary)
Some specs of the classifiers
• Each HAAR classifier has up to 3 rectangles
• Each stage consists of up to 211 HAAR classifiers
• Our algorithm has 25 stages with 2913 HAAR classifiers
Implementation
Scan Window Processing
• Classifier size: 25x25; image size: 1024x1024
• For each image, we consider a 25x25 moving scan window
• The next scan window is obtained by shifting one pixel (pixel++)
• Each scan window is processed independently
• For a 1024 x 1024 image, there are 1000 x 1000 = 10^6 scan windows
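A serial sketch of how one scan window moves through the cascade (one GPU thread per window in the implementation; the feature representation and the numbers below are hypothetical stand-ins, not the trained classifier parameters):

```python
def eval_window(stages, feature_value):
    """Run one scan window through the cascade.

    `stages` is a list of (stage_threshold, classifiers); each weak
    classifier is (feature, weak_threshold, pass_val, fail_val).
    `feature_value(feature)` returns that Haar feature's rectangle sum
    for this window (computed via the integral image). A window is a
    face only if it passes every stage; it is rejected at the first
    stage whose weighted sum falls below the stage threshold.
    """
    for stage_threshold, classifiers in stages:
        stage_sum = 0.0
        for feature, weak_threshold, pass_val, fail_val in classifiers:
            v = feature_value(feature)
            stage_sum += pass_val if v > weak_threshold else fail_val
        if stage_sum < stage_threshold:
            return False          # early rejection: skip later stages
    return True

# Toy cascade: two single-classifier stages with made-up parameters
stages = [(1.0, [("f0", 0.5, 1.0, -1.0)]),
          (1.0, [("f1", 0.5, 1.0, -1.0)])]
print(eval_window(stages, lambda f: 0.9))  # True
print(eval_window(stages, lambda f: 0.1))  # False (rejected at stage 1)
```

The early return is the source of the thread- and block-divergence behavior discussed in the optimization slides: most windows die in the first few stages.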
Baseline Implementation
Scan Window Processing
• Each thread operates on one
scan window
• Each scan window is processed
through all 25 stages to detect
face
• A Bit Vector keeps track of
rejected scan windows
[Figure: scan window (20,30) has BV[20][30] = True after stage 1 and stage 2, BV[20][30] = False after stage 25]
• Bit Vector copied back to Host Memory
Optimizations (1)
Scan Window Processing
Use Shared Memory
• Information of classifiers common for all threads
– Indices of the Haar classifiers
– Weights of each rectangle in a Haar classifier
– Threshold for each Haar classifier
– Threshold for each stage
• Bring these data to shared memory
– Share data across Thread Block
• Due to the shared memory limitation, the entire Scan Window Processing is
split into 12 kernels [2913 classifiers total across the 12 kernels]
• Each kernel uses approximately 19 kB of shared memory
Optimizations (2)
Scan Window Processing
Use Pinned Host Memory
• Replace malloc with cudaMallocHost
Use Fast Math
• Variance = sqrtf(square sum) – the GPU's special function unit is used
Do not use maxrregcount
• Our kernel needs around 26 registers
• If we restrict it to 20, register spilling occurs
• Occupancy is not always a measure of performance
Optimizations (3)
Scan Window Processing
Use block divergence
• If a scan window fails at the end of
any kernel, reject
• Results in thread divergence
• But more importantly, it leads to block
divergence
• Thread blocks with all threads
rejected won’t be launched at all
• This block divergence is the common
case in most of the image windows
• Hence according to Amdahl’s law, this
optimization gave huge performance
benefit
[Figure: after Kernel 1, thread blocks whose windows are all rejected are not launched for Kernel 2]
Evaluation
• Used GTX 480 GPU for evaluation
– 15 SMs, 1.5 GB global memory
• Shared memory usage
– 8.2 kB → NN and II
– 19 kB → HAAR kernels
• Occupancy
– 100% for NN and II
– 66.67% for HAAR kernels
• 1024 x 1024 image size for evaluation → 21 downsampling iterations
• Compared with single-threaded CPU performance
Performance of NN and II Kernels (1)
NN + RowScan Kernel [Log Scale]
[Chart: execution time (ms, 0.01–1, log scale) vs. downsampling iterations for the Shared Mem, No Bank Conflicts, and Extern Shared Mem versions]
Performance of NN and II Kernels (2)
Transpose 1 Kernel [Log Scale]
[Chart: execution time (ms, 0.01–1, log scale) vs. downsampling iterations for the Shared Mem, No Bank Conflicts, and Extern Shared Mem versions]
Performance of NN and II Kernels (3)
RowScan Only Kernel
[Chart: execution time (ms, 0–1.4, linear scale) vs. downsampling iterations for the Shared Mem, No Bank Conflicts, and Extern Shared Mem versions]
Performance of NN and II Kernels (4)
Transpose 2 Kernel [Log Scale]
[Chart: execution time (ms, 0.01–1, log scale) vs. downsampling iterations for the Shared Mem, No Bank Conflicts, and Extern Shared Mem versions]
NN + II Overall Performance
[Charts: execution time (ms) of the 4 kernels (RowScan + NN, Transpose 1, RowScan, Transpose 2) for the Shared Mem, No Bank Conflicts, and Extern Shared Mem versions; and overall NN and II performance, GPU exclusive time vs. CPU]
Overall NN and II SpeedUp = 1.46x
Performance of HAAR Kernels
[Chart: speedup over baseline GPU (log scale, 1–100) for HAAR kernels 0–11, with the Shared Mem, Pinned Host Mem, Fast Math, No Maxregcount, and TB divergence optimizations applied cumulatively]
Performance of HAAR Kernels (2)
[Chart: speedup over baseline GPU for HAAR kernels 0–11 with cumulative optimizations (Baseline, Shared Mem, Pinned Host Mem, Fast Math, No Maxregcount, TB divergence); per-kernel speedups annotated: 29.7x, 82.7x, 128.8x, 155.9x, 161.1x, 135.1x, 137.8x, 139.2x, 144x, 221.3x, 212.7x]
Speed Up Over Iterations
[Chart: speedup over CPU (0–9) vs. downsampling iterations for the Baseline, Shared Mem, Pinned HostMem, Fast Math, No Maxregcount, and TB divergence versions]
Scanning Window Speed Up
Comparison
[Chart: scanning-window speedup over CPU (inclusive and exclusive time) across GPU optimizations: Baseline, Shared Mem, Pinned Host Mem, FastMath, No Maxrregcount, TB divergence]
Overall Scanning Window SpeedUp = 5.47x
Overall Face Detection Speed Up
[Chart: face detection speedup over CPU (inclusive and exclusive time) across GPU optimizations: Baseline, Shared Mem, Pinned Host Mem, Fast Math, No Maxrregcount, TB divergence]
Overall Face Detection SpeedUp = 5.35x
GPU Speed Up Over Varying
Image Sizes
[Chart: GPU speedup over CPU for image sizes 25x25, 32x32, 64x64, 128x128, 256x256, 512x512, and 1024x1024; speedup grows with image size, reaching 5.35x at 1024x1024]
GPU Face Detection Accuracy
Faces:             1     2     4     8     16    32
Detection Rate %:  100   100   100   87.5  100   93.75
Average Detection Rate % = 96.875
Lessons Learned & Future Work
• GPU provides performance and energy benefits over CPU for
parallelizable workloads
• But this comes at a cost → need to understand the bottlenecks
• Can reap benefits with finer level optimizations
Future work
• Compare GPU performance with equivalent OpenMP, MPI code
• OpenCV library provides CUDA APIs for Object Detection
• Compare the performance of our implementation with this
• Detection accuracy can be improved with a more robust version
Conclusion
• Face Detection is a good candidate for parallelization
• Optimizations help in increasing GPU Performance
• Up to 5.3x performance improvement on GPU
• Further improvements possible with careful analysis and
hardcoding