GPU-Efficient Recursive Filtering and Summed-Area Tables
D. Nehab (1), A. Maximo (1), R. S. Lima (2), H. Hoppe (3)
(1) IMPA, (2) Digitok, (3) Microsoft Research
Recursive filters
• Linear, shift-invariant filters
• But use feedback from earlier outputs
prologue
input
output
• Sequential dependency chain
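A minimal sketch of the feedback and of the dependency chain it creates, for a first-order causal filter. The sign convention `y[i] = x[i] + a * y[i-1]`, the function name `causal_first_order`, and the sample values are assumptions made for illustration; they are not taken from the slides.

```python
def causal_first_order(x, a, prologue=0.0):
    """Hypothetical first-order causal recursive filter.

    Each output feeds back into the next one:
        y[i] = x[i] + a * y[i-1]
    `prologue` stands in for y[-1], the output just before the input starts.
    """
    y = []
    prev = prologue
    for xi in x:
        # Sequential dependency chain: y[i] cannot start before y[i-1] is done.
        prev = xi + a * prev
        y.append(prev)
    return y


# Impulse response: the feedback keeps the impulse alive, decaying by a each step.
print(causal_first_order([1.0, 0.0, 0.0, 0.0], a=0.5))  # [1.0, 0.5, 0.25, 0.125]
```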
Applications of recursive filtering
• B-Spline (or other) interpolation
• Recursive preprocessing step turns the input into coefficients
• Interpolation is then done from the coefficients
• Fast, wide, Gaussian-blur approximation
• Recursive filters turn the input into the blurred result
• Summed-area tables
Causality and order
• Recursive filters can be causal or anticausal
• Causal goes forward, anticausal in reverse direction
input
epilogue
output
• Filter order is simply the number r of feedbacks
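The bullets above can be sketched for a filter of any order r, with the anticausal filter written as the causal one run in reverse. The function names, the sign convention, and the layout of `prologue` and `epilogue` are assumptions chosen for this example.

```python
def causal(x, coeffs, prologue=None):
    """Order-r causal filter: y[i] = x[i] + sum_k coeffs[k] * y[i-1-k].

    r = len(coeffs) is the filter order (the number of feedbacks).
    `prologue` holds the r outputs preceding the input, oldest first
    (zeros by default).
    """
    r = len(coeffs)
    history = list(prologue) if prologue is not None else [0.0] * r
    y = []
    for xi in x:
        yi = xi + sum(c * history[-1 - k] for k, c in enumerate(coeffs))
        y.append(yi)
        history.append(yi)
    return y


def anticausal(x, coeffs, epilogue=None):
    """Same recurrence in the reverse direction: z[i] depends on z[i+1], ...

    `epilogue` holds the r outputs following the input, nearest first.
    """
    rev_prologue = None if epilogue is None else list(reversed(epilogue))
    return list(reversed(causal(list(reversed(x)), coeffs, rev_prologue)))


# Second-order example (r = 2): with both feedbacks equal to 1, the impulse
# response is the Fibonacci sequence, forward or backward.
print(causal([1, 0, 0, 0, 0], [1.0, 1.0]))      # [1.0, 1.0, 2.0, 3.0, 5.0]
print(anticausal([0, 0, 0, 0, 1], [1.0, 1.0]))  # [5.0, 3.0, 2.0, 1.0, 1.0]
```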
Filter sequences and separability
• Often, sequences of recursive filters are needed
• Independent columns
• Causal
• Anticausal
• Independent rows
• Causal
• Anticausal
output
stages
input
Algorithm RT
column processing
row processing
• The baseline algorithm
• Process columns in parallel, then rows in parallel
• Ruijters et al. 2010 “GPU prefilter […]”
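A sequential sketch of the baseline structure: every column is an independent task, then every row. It assumes a first-order filter with the same coefficient in all four passes and zero prologues and epilogues; the names `filter_rt`, `causal`, and `anticausal` are hypothetical and this is not the code of Ruijters et al.

```python
def causal(x, a):
    # y[i] = x[i] + a * y[i-1], with a zero prologue (assumed convention).
    y, prev = [], 0.0
    for xi in x:
        prev = xi + a * prev
        y.append(prev)
    return y


def anticausal(x, a):
    # Same recurrence, run from the last sample back to the first.
    return causal(x[::-1], a)[::-1]


def transpose(img):
    return [list(col) for col in zip(*img)]


def filter_rt(img, a):
    """Columns first, then rows; each line is filtered causally, then anticausally."""
    # Column pass: each list comprehension item is independent of the others,
    # so on a GPU it could be given to its own thread.
    cols = [anticausal(causal(col, a), a) for col in transpose(img)]
    # Row pass: same idea, one independent task per row.
    return [anticausal(causal(row, a), a) for row in transpose(cols)]


impulse = [[0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0]]
for row in filter_rt(impulse, 0.5):
    print(row)
```

The parallelism is limited to one task per column or per row, which is what the later slides set out to improve.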
First-order filter benchmarks
• Alg. RT is the baseline implementation
• Ruijters et al. 2010 “GPU prefilter […]”
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) versus input size (64² to 4096² pixels); curve: Alg. RT]
[Table: Alg. | Step complexity | Max. # of threads | Used bandwidth; row: RT]
Optimization roadmap
• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
[Table: Alg. | Step complexity | Max. # of threads | Used bandwidth; row: RT]
Increasing parallelism
• Similar to parallel prefix-sum algorithms
• Sengupta et al. 2007 “Scan primitives for GPU computing”
• Dotsenko et al. 2008 “Fast scan algorithms […]”
• Compute and store incomplete prologues
• Fix incomplete prologues
• Somewhat more complicated than a recursive invocation
• Use prologues to compute and store causal results
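The three steps above can be sketched in one dimension for a first-order filter. The convention `y[i] = x[i] + a * y[i-1]`, the block size `b`, and the function names are assumptions made for this example; the loops marked "parallel" are written sequentially here.

```python
def causal(x, a, prologue=0.0):
    y, prev = [], prologue
    for xi in x:
        prev = xi + a * prev
        y.append(prev)
    return y


def causal_blocked(x, a, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]

    # Step 1 (parallel over blocks): filter each block as if nothing came
    # before it, and keep only its last output: an incomplete prologue
    # for the next block.
    incomplete = [causal(blk, a)[-1] for blk in blocks]

    # Step 2 (one short pass over blocks): fix the incomplete prologues.
    # A prologue p entering a block of length n adds a**n * p to the
    # block's last output.
    prologues = [0.0]
    for blk, pbar in zip(blocks[:-1], incomplete[:-1]):
        prologues.append(pbar + a ** len(blk) * prologues[-1])

    # Step 3 (parallel over blocks): filter each block again, this time
    # starting from its corrected prologue, and store the causal results.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(causal(blk, a, p))
    return out


x = [float(i % 7) for i in range(20)]
print(causal_blocked(x, 0.5, 4) == causal(x, 0.5))
```

Each block is read twice, but the number of independent tasks grows from one per line to one per block.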
Fixing incomplete prologues
[Figure: an incomplete prologue is fixed using superposition and linearity]
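A small numeric check of why superposition and linearity are enough to fix a prologue, under the assumed first-order convention `y[i] = x[i] + a * y[i-1]` (names and values are illustrative only).

```python
def causal(x, a, prologue=0.0):
    y, prev = [], prologue
    for xi in x:
        prev = xi + a * prev
        y.append(prev)
    return y


a, p = 0.5, 8.0
block = [3.0, 1.0, 4.0, 1.0]

full = causal(block, a, p)                         # input and prologue together
from_input = causal(block, a, 0.0)                 # input alone, zero prologue
from_prologue = causal([0.0] * len(block), a, p)   # prologue alone, zero input

# Superposition: the two partial responses add up to the full response.
print(full == [u + v for u, v in zip(from_input, from_prologue)])

# Linearity: the prologue's contribution to the last output is a fixed
# multiple of the prologue, here a**n for a block of length n.
print(from_prologue[-1] == a ** len(block) * p)
```

So a block filtered with a zero prologue needs only one multiply-add on its last output once the true prologue is known.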
Algorithm 2
[Figure: stages from input to output, with "fix" steps]
• Adds block parallelism
• Sung et al. 1986 “Efficient […] recursive […]”, or
• Blelloch 1990 “Prefix sums […]”
• + tricks from GPU parallel scan algorithms
First-order filter benchmarks
• Alg. RT is the baseline implementation
• Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
• Sung et al. 1986 “Efficient […] recursive […]”
• Blelloch 1990 “Prefix sums […]”
• + tricks from GPU parallel scan algorithms
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) versus input size (64² to 4096² pixels); curves: Alg. RT, Alg. 2]
[Table: Alg. | Step complexity | Max. # of threads | Memory bandwidth; rows: RT, 2]
Optimization roadmap
• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
• FLOP/IO ratio of recursive filters is too low
• Can use even more FLOPs but must reduce IO
• To do so, we introduce overlapping
[Table: Alg. | Step complexity | Max. # of threads | Memory bandwidth; rows: 2, RT]
Causal-anticausal overlapping
• Start anticausal processing before causal is done
• Saves reading and writing causal results!
• Compute and store incomplete prologues & epilogues
• Fix incomplete prologues & twice-incomplete epilogues
• Twice-incomplete epilogues are trickier
• Use them to compute and store anticausal results
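A one-dimensional sketch of the overlapped scheme for first-order filters. Assumptions made for this example: the causal pass is `y[i] = x[i] + a * y[i-1]`, the anticausal pass is `z[i] = y[i] + c * z[i+1]`, the input length is a multiple of the block size `b`, and the ends have zero prologue and epilogue. All names are hypothetical.

```python
def causal(x, a, prologue=0.0):
    y, prev = [], prologue
    for xi in x:
        prev = xi + a * prev
        y.append(prev)
    return y


def anticausal(y, c, epilogue=0.0):
    return causal(y[::-1], c, epilogue)[::-1]


def overlapped(x, a, c, b):
    blocks = [x[i:i + b] for i in range(0, len(x), b)]
    n = len(blocks)

    # Stage 1 (parallel): per block, with zero prologue and zero epilogue,
    # keep the incomplete prologue and the twice-incomplete epilogue.
    pbar, ebar = [], []
    for blk in blocks:
        y = causal(blk, a)
        pbar.append(y[-1])
        ebar.append(anticausal(y, c)[0])

    # Stage 2: fix prologues front to back. p_in[m] enters block m.
    p_in = [0.0] * n
    for m in range(1, n):
        p_in[m] = pbar[m - 1] + a ** b * p_in[m - 1]

    # Stage 3: fix epilogues back to front. e_in[m] enters block m.
    # The epilogue leaving block m misses two things: the effect of the
    # block's prologue (factor s) and of its own epilogue (factor c**b).
    s = sum(c ** k * a ** (k + 1) for k in range(b))
    e_in = [0.0] * n
    for m in range(n - 1, 0, -1):
        e_in[m - 1] = ebar[m] + s * p_in[m] + c ** b * e_in[m]

    # Stage 4 (parallel): reload each block and run both passes back to
    # back; the causal result y is never written out.
    out = []
    for m, blk in enumerate(blocks):
        y = causal(blk, a, p_in[m])
        out.extend(anticausal(y, c, e_in[m]))
    return out


x = [float((3 * i) % 5) for i in range(12)]
reference = anticausal(causal(x, 0.5), 0.25)
result = overlapped(x, 0.5, 0.25, 4)
print(all(abs(u - v) < 1e-12 for u, v in zip(result, reference)))
```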
Fixing twice-incomplete epilogues
• Repeatedly apply linearity and superposition
• Tedious derivation, simple result
[Equation: the corrected epilogue is obtained from the twice-incomplete epilogue and the corrected prologue]
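For the first-order case the result can be written out in full. The notation here is an assumption made for this example, not the slides' own: blocks of length b, causal pass with coefficient a, anticausal pass with coefficient c, p_m the last causal output of block m, e_m the first anticausal output of block m, and bars marking values computed with zero prologue and zero epilogue.

```latex
\begin{aligned}
y_k &= x_k + a\,y_{k-1}, \qquad z_k = y_k + c\,z_{k+1} \\
p_m &= \bar{p}_m + a^{b}\,p_{m-1} \\
e_m &= \bar{e}_m
      + \Big(\sum_{k=0}^{b-1} c^{k}\,a^{k+1}\Big)\,p_{m-1}
      + c^{b}\,e_{m+1}
\end{aligned}
```

The middle term of the last line accounts for the prologue that the causal pass was missing, and the last term for the epilogue that the anticausal pass was missing, which is why the epilogue is called twice-incomplete.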
Algorithm 4
[Figure: stages from input to output, with "fix both" steps]
• Adds causal-anticausal overlapping
• Eliminates reading and writing causal results
• Both in column and in row processing
• Modest increase in computation
First-order filter benchmarks
• Alg. RT is the baseline implementation
• Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
• Sung et al. 1986 “Efficient […] recursive […]”
• Blelloch 1990 “Prefix sums […]”
• + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
• Eliminates 4hw of IO
• Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) versus input size (64² to 4096² pixels); curves: Alg. RT, Alg. 2, Alg. 4]
[Table: Alg. | Step complexity | Max. # of threads | Memory bandwidth; rows: RT, 2, 4]
Algorithm 5
[Figure: stages from input to output, with a "fix all!" step]
• Adds row-column overlapping
• Eliminates reading and writing column results
• Modest increase in computation
Start from input and global borders
Load blocks into shared memory
Compute & store incomplete borders
All borders in global memory
Fix incomplete borders
Fix twice-incomplete borders
Fix thrice-incomplete borders
Fix four-times-incomplete borders
Done fixing all borders
Load blocks into shared memory
Finish causal columns
Finish anticausal columns
Finish causal rows
Finish anticausal rows
Store results to global memory
Done!
Row-column overlapping rules
• Fixing thrice-incomplete row-prologues
• Fixing four-times-incomplete row-epilogues
First-order filter benchmarks
• Alg. RT is the baseline implementation
• Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
• Sung et al. 1986 “Efficient […] recursive […]”
• Blelloch 1990 “Prefix sums […]”
• + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
• Eliminates 4hw of IO
• Modest increase in computation
• Alg. 5 adds row-column overlapping
• Eliminates additional 2hw of IO
• Modest increase in computation
[Plot: Cubic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) versus input size (64² to 4096² pixels); curves: Alg. RT, Alg. 2, Alg. 4, Alg. 5]
[Table: Alg. | Step complexity | Max. # of threads | Memory bandwidth; rows: RT, 2, 4, 5]
Second-order filter benchmarks
• Alg. 4₂ uses causal-anticausal overlapping
• Alg. 5₂ adds row-column overlapping
• Added complexity outweighs IO reduction
• Balance will change (hardware, compiler, implementation)
[Plot: Quintic B-Spline Interpolation (GeForce GTX 480); throughput (GiP/s) versus input size (64² to 4096² pixels); curves: Alg. 4₂, Alg. 5₂]
[Table: Alg. | Step complexity | Max. # of threads | Memory bandwidth; rows: 4₂, 5₂]
Gaussian blur results
• CUFFT is in frequency domain
• […] complexity
• Overlapped recursive
• 3rd order approximation
• […] complexity
• van Vliet et al. 1998 “Recursive Gaussian derivative filters”
• Implemented as 5₁ fused with 4₂
• DIR is direct convolution
• […] complexity
• Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”
• Recursive approximation is faster
• Even for modest-size images
• Also for modest standard deviations
[Plot: Gaussian Blur (GeForce GTX 480); throughput (GiP/s) versus input size (64² to 4096² pixels); curves: Overlapped Recursive, DIR 2.5, DIR 5, DIR 10, CUFFT]
Summed-area table benchmarks
• First-order filter, unit coefficient, no anticausal component
• Harris et al. 2008, GPU Gems 3
• “Parallel prefix-scan […]”
• Multi-scan + transpose + multi-scan
• Implemented with CUDPP
• Hensley 2010, Gamefest
• “High-quality depth of field”
• Multi-wave method
• Our improvements
+ specialized row and column kernels
+ save only incomplete borders
+ fuse row and column stages
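A sketch of how saving only block borders and fusing the row and column stages could look for a summed-area table. It assumes square b x b blocks, image dimensions that are multiples of b, and zero borders outside the image; the names `sat_overlapped`, `block_pass`, and `sat_reference` are hypothetical, and the loops over blocks are written sequentially.

```python
def sat_reference(img):
    h, w = len(img), len(img[0])
    sat = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            sat[i][j] = (img[i][j]
                         + (sat[i - 1][j] if i else 0)
                         + (sat[i][j - 1] if j else 0)
                         - (sat[i - 1][j - 1] if i and j else 0))
    return sat


def block_pass(blk, col_prologue, row_prologue):
    """Running sums down the columns, then along the rows, of one block."""
    b = len(blk)
    y = [row[:] for row in blk]
    for j in range(b):
        prev = col_prologue[j]
        for i in range(b):
            prev += y[i][j]
            y[i][j] = prev
    v = [row[:] for row in y]
    for i in range(b):
        prev = row_prologue[i]
        for j in range(b):
            prev += v[i][j]
            v[i][j] = prev
    return y, v


def sat_overlapped(img, b):
    h, w = len(img), len(img[0])
    mb, nb = h // b, w // b
    zeros = [0] * b

    def block(m, n):
        return [row[n * b:(n + 1) * b] for row in img[m * b:(m + 1) * b]]

    # Stage 1 (parallel over blocks): store only the incomplete borders.
    pbar, qbar = {}, {}
    for m in range(mb):
        for n in range(nb):
            y, v = block_pass(block(m, n), zeros, zeros)
            pbar[m, n] = y[-1]                    # bottom row of column sums
            qbar[m, n] = [row[-1] for row in v]   # right column of row sums

    # Stage 2: fix column borders, top to bottom. P[m, n] enters block (m, n).
    P = {}
    for n in range(nb):
        P[0, n] = zeros
        for m in range(1, mb):
            P[m, n] = [p + t for p, t in zip(P[m - 1, n], pbar[m - 1, n])]

    # Stage 3: fix row borders, left to right. Each one also needs the
    # column border that the block to its left was missing in stage 1.
    Q = {}
    for m in range(mb):
        Q[m, 0] = zeros
        for n in range(1, nb):
            carry = sum(P[m, n - 1])
            Q[m, n] = [q + t + carry for q, t in zip(Q[m, n - 1], qbar[m, n - 1])]

    # Stage 4 (parallel over blocks): reload, finish columns and rows in one
    # go, and store the result. Column results are never written out.
    sat = [[0] * w for _ in range(h)]
    for m in range(mb):
        for n in range(nb):
            _, v = block_pass(block(m, n), P[m, n], Q[m, n])
            for i in range(b):
                sat[m * b + i][n * b:(n + 1) * b] = v[i]
    return sat


img = [[(3 * i + 5 * j) % 7 for j in range(8)] for i in range(4)]
print(sat_overlapped(img, 2) == sat_reference(img))
```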
• Overlapped SAT
• Row-column overlapping
[Plot: Summed-area Table (GeForce GTX 480); throughput (GiP/s) versus input size (64² to 4096² pixels); curves: Overlapped SAT, Improved Hensley [2010], Hensley [2010], Harris et al. [2008]]
Future work
• Volumetric processing
• Overlapping should generalize
• Not enough shared memory (yet?)
• CPU implementation
• Blocking should increase L1 cache effectiveness
• Is doubling amount of computation worth it?
• Solving general narrow-banded linear systems
• Overlapping back- and forward-substitution
Conclusions
• Recursive filters are useful in many applications
• Cubic and quintic B-Spline interpolation
• Gaussian-blur approximation
• Summed-area table computation
• We introduced parallel algorithms for GPUs
• Overlapping reduces IO requirements
• Leads to faster algorithms
• Code is available from project page
• Most is already there, rest is on the way
Questions?
Alg. RT (0.5 GiP/s): baseline
Alg. 2 (3 GiP/s): + block parallelism
Alg. 4 (5 GiP/s): + causal-anticausal overlapping
Alg. 5 (6 GiP/s): + row-column overlapping