GPU-Efficient Recursive Filtering and Summed-Area Tables
D. Nehab¹, A. Maximo¹, R. S. Lima², H. Hoppe³
¹IMPA, ²Digitok, ³Microsoft Research

Recursive filters
• Linear, shift-invariant filters
• But use feedback from earlier outputs
• Sequential dependency chain
[Diagram labels: prologue, input, output]

Applications of recursive filtering
• B-Spline (or other) interpolation
• Fast, wide, Gaussian-blur approximation
• Summed-area tables
[Diagram labels: input, recursive preprocessing step, coefficients, interpolation (from coefficients)]
[Diagram labels: input, recursive filters, blurred]

Causality and order
• Recursive filters can be causal or anticausal
• Causal goes forward, anticausal in reverse direction
• Filter order is simply the number r of feedbacks
[Diagram labels: input, epilogue, output]

Filter sequences and separability
• Often, sequences of recursive filters are needed
• Independent columns
  • Causal
  • Anticausal
• Independent rows
  • Causal
  • Anticausal

Algorithm RT
• The baseline algorithm
• Process columns in parallel, then rows in parallel
• Ruijters et al. 2010 “GPU prefilter […]”
[Diagram labels: input, stages (column processing, row processing), output]

First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
[Chart: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curve: Alg. RT]
[Table for Alg. RT: Step Complexity, Max. # of Threads Used, Bandwidth]

Optimization roadmap
• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
[Table for Alg. RT: Step Complexity, Max. # of Threads Used, Bandwidth]

Increasing parallelism
• Similar to parallel prefix-sum algorithms
  • Sengupta et al. 2007 “Scan primitives for GPU computing”
  • Dotsenko et al. 2008 “Fast scan algorithms […]”
• Compute and store incomplete prologues
• Fix incomplete prologues
  • Somewhat more complicated than a recursive invocation
• Use prologues to compute and store causal results
[Diagram: blocks with incomplete prologues marked ✗]

Fixing incomplete prologues
[Diagram labels: superposition, linearity]

Algorithm 2
• Adds block parallelism
  • Sung et al. 1986 “Efficient […] recursive […]”, or
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
[Diagram labels: input, stages (with fix steps), output]

First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
[Chart: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Alg. RT, Alg. 2]
[Table: Step Complexity, Max. # of Threads, Memory Bandwidth]

Optimization roadmap
• Modern GPUs have several hundred cores
• Latency-hiding requires many times more tasks
• Images are not large enough: must parallelize further
• FLOP/IO ratio of recursive filters is too low
• Can use even more FLOPs but must reduce IO
• To do so, we introduce overlapping
[Table for Alg. RT and Alg. 2: Step Complexity, Max. # of Threads, Memory Bandwidth]

Causal-anticausal overlapping
• Start anticausal processing before causal is done
• Saves reading and writing causal results!
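The prologue-fixing idea behind Algorithm 2 can be illustrated in one dimension. The sketch below is not the authors' CUDA code: it assumes a first-order causal filter written as y[i] = x[i] + a·y[i-1] (a sign convention chosen here for illustration), and the names causal_sequential and causal_blocked are hypothetical. Each block is first filtered with a zero prologue, the stored incomplete prologues are then corrected using linearity and superposition, and finally every block is filtered again with its corrected prologue.

```python
def causal_sequential(x, a, prologue=0.0):
    """Reference filter: y[i] = x[i] + a * y[i-1], with y[-1] = prologue."""
    y, prev = [], prologue
    for xi in x:
        prev = xi + a * prev
        y.append(prev)
    return y


def causal_blocked(x, a, b, prologue=0.0):
    """Same filter, computed in independent blocks of size b (assumed layout)."""
    if len(x) % b != 0:
        raise ValueError("input length must be a multiple of the block size")
    blocks = [x[i:i + b] for i in range(0, len(x), b)]

    # Stage 1 (parallel over blocks): filter each block with a zero prologue
    # and keep only its last output, the incomplete prologue of the next block.
    incomplete = [causal_sequential(blk, a)[-1] for blk in blocks]

    # Stage 2 (short pass over blocks): a prologue p entering a block adds
    # a**b * p to that block's last output, by linearity and superposition.
    prologues, p = [], prologue
    for last in incomplete:
        prologues.append(p)
        p = last + (a ** b) * p

    # Stage 3 (parallel over blocks): redo each block with its fixed prologue.
    out = []
    for blk, p in zip(blocks, prologues):
        out.extend(causal_sequential(blk, a, p))
    return out
```

Stages 1 and 3 have no dependencies between blocks, so only the short stage 2 remains sequential; this is the extra parallelism the slides refer to.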
Causal-anticausal overlapping (continued)
• Compute and store incomplete prologues & epilogues
• Fix incomplete prologues & twice-incomplete epilogues
  • Twice-incomplete epilogues are trickier
• Use them to compute and store anticausal results

Fixing twice-incomplete epilogues
• Repeatedly apply linearity and superposition
• Tedious derivation, simple result
[Diagram labels: corrected epilogue, corrected prologue, twice-incomplete epilogue]

Algorithm 4
• Adds causal-anticausal overlapping
  • Eliminates reading and writing causal results
  • Both in column and in row processing
  • Modest increase in computation
[Diagram labels: input, stages (fix both), output]

First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
[Chart: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Alg. RT, Alg. 2, Alg. 4]
[Table: Step Complexity, Max. # of Threads, Memory Bandwidth]

Algorithm 5
• Adds row-column overlapping
  • Eliminates reading and writing column results
  • Modest increase in computation
[Diagram labels: input, stages (fix all!), output]

Algorithm 5, animation steps
• Start from input and global borders
• Load blocks into shared memory
• Compute & store incomplete borders (×8)
• All borders in global memory
• Fix incomplete borders
• Fix twice-incomplete borders
• Fix thrice-incomplete borders
• Fix four-times-incomplete borders
• Done fixing all borders
• Load blocks into shared memory
• Finish causal columns
• Finish anticausal columns
• Finish causal rows
• Finish anticausal rows
• Store results to global memory
• Done!

Row-column overlapping rules
• Fixing thrice-incomplete row-prologues
• Fixing four-times-incomplete row-epilogues

First-order filter benchmarks
• Alg. RT is the baseline implementation
  • Ruijters et al. 2010 “GPU prefilter […]”
• Alg. 2 adds block parallelism & tricks
  • Sung et al. 1986 “Efficient […] recursive […]”
  • Blelloch 1990 “Prefix sums […]”
  • + tricks from GPU parallel scan algorithms
• Alg. 4 adds causal-anticausal overlapping
  • Eliminates 4hw of IO
  • Modest increase in computation
• Alg. 5 adds row-column overlapping
  • Eliminates additional 2hw of IO
  • Modest increase in computation
[Chart: Cubic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Alg. RT, Alg. 2, Alg. 4, Alg. 5]
[Table: Step Complexity, Max. # of Threads, Memory Bandwidth]

Second-order filter benchmarks
• Alg. 4₂ uses causal-anticausal overlapping
• Alg. 5₂ adds row-column overlapping
  • Added complexity outweighs IO reduction
  • Balance will change (hardware, compiler, implementation)
[Chart: Quintic B-Spline Interpolation (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Alg. 4₂, Alg. 5₂]
[Table: Step Complexity, Max. # of Threads, Memory Bandwidth]
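The causal-anticausal overlapping of Algorithm 4 can also be sketched in one dimension. This is an illustrative derivation, not the authors' formulation: it assumes first-order filters y[i] = x[i] + a·y[i-1] (causal) and z[i] = y[i] + c·z[i+1] (anticausal), zero global prologue and epilogue, and hypothetical function names. Each block runs both passes back to back on incomplete data and stores only two border values; the twice-incomplete epilogue is then fixed from the corrected prologue of its own block and the corrected epilogue of the next one.

```python
def causal(x, a, p=0.0):
    """y[i] = x[i] + a * y[i-1], with y[-1] = p (the prologue)."""
    y, prev = [], p
    for xi in x:
        prev = xi + a * prev
        y.append(prev)
    return y


def anticausal(y, c, e=0.0):
    """z[i] = y[i] + c * z[i+1], with z[len(y)] = e (the epilogue)."""
    z, nxt = [0.0] * len(y), e
    for i in reversed(range(len(y))):
        nxt = y[i] + c * nxt
        z[i] = nxt
    return z


def overlapped(x, a, c, b):
    """Causal then anticausal filtering in blocks of size b, reading x twice."""
    if len(x) % b != 0:
        raise ValueError("input length must be a multiple of the block size")
    blocks = [x[i:i + b] for i in range(0, len(x), b)]

    # Stage 1 (parallel): run both passes on each block with zero borders;
    # store the incomplete prologue and the twice-incomplete epilogue only.
    inc_p, inc_e = [], []
    for blk in blocks:
        ybar = causal(blk, a)
        inc_p.append(ybar[-1])
        inc_e.append(anticausal(ybar, c)[0])

    # Stage 2a: fix prologues front to back.
    prologues, p = [], 0.0
    for last in inc_p:
        prologues.append(p)
        p = last + (a ** b) * p

    # Stage 2b: fix epilogues back to front. The first anticausal output of a
    # block is zbar[0] + s * p + c**b * e, where s collects the effect of the
    # prologue p pushed through both passes.
    s = sum((c ** i) * (a ** (i + 1)) for i in range(b))
    epilogues, e = [0.0] * len(blocks), 0.0
    for k in reversed(range(len(blocks))):
        epilogues[k] = e
        e = inc_e[k] + s * prologues[k] + (c ** b) * e

    # Stage 3 (parallel): reload each block and finish both passes at once,
    # so the intermediate causal result is never written out.
    out = []
    for blk, p, e in zip(blocks, prologues, epilogues):
        out.extend(anticausal(causal(blk, a, p), c, e))
    return out
```

The result matches running the causal pass over the whole signal and then the anticausal pass, but the causal output never leaves the block, which is where the IO saving on the slides comes from.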
Gaussian blur results
• CUFFT is in frequency domain
  • […] complexity
• Overlapped recursive
  • 3rd order approximation
  • […] complexity
  • van Vliet et al. 1998 “Recursive Gaussian derivative filters”
  • Implemented as 5₁ fused with 4₂
• DIR is direct convolution
  • […] complexity
  • Podlozhnyuk 2007 whitepaper “Image convolution with CUDA”
• Recursive approximation is faster
  • Even for modest size images
  • Also modest standard-deviations
[Chart: Gaussian Blur (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Overlapped Recursive, DIR 2.5, DIR 5, DIR 10, CUFFT]

Summed-area table benchmarks
• First-order filter, unit coefficient, no anticausal component
• Harris et al. 2008, GPU Gems 3
  • “Parallel prefix-scan […]”
  • Multi-scan + transpose + multiscan
  • Implemented with CUDPP
• Hensley 2010, Gamefest
  • “High-quality depth of field”
  • Multi-wave method
• Our improvements
  • + specialized row and column kernels
  • + save only incomplete borders
  • + fuse row and column stages
• Overlapped SAT
  • Row-column overlapping
[Chart: Summed-area Table (GeForce GTX 480). Throughput (GiP/s) vs. input size (pixels), 64² to 4096². Curves: Overlapped SAT, Improved Hensley [2010], Hensley [2010], Harris et al. [2008]]

Future work
• Volumetric processing
  • Overlapping should generalize
  • Not enough shared memory (yet?)
• CPU implementation
  • Blocking should increase L1 cache effectiveness
  • Is doubling amount of computation worth it?
• Solving general narrow-banded linear systems
  • Overlapping back- and forward-substitution

Conclusions
• Recursive filters are useful in many applications
  • Cubic and quintic B-Spline interpolation
  • Gaussian-blur approximation
  • Summed-area table computation
• We introduced parallel algorithms for GPUs
  • Overlapping reduces IO requirements
  • Leads to faster algorithms
• Code is available from project page
  • Most is already there, rest is on the way

Questions?
• Alg. RT (0.5 GiP/s): baseline
• Alg. 2 (3 GiP/s): + block parallelism
• Alg. 4 (5 GiP/s): + causal-anticausal overlapping
• Alg. 5 (6 GiP/s): + row-column overlapping
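For readers unfamiliar with summed-area tables, the statement on the benchmark slide that a SAT is a first-order causal filter with unit coefficient and no anticausal component can be made concrete with a small sequential reference. This is a plain CPU sketch for checking results, not the overlapped GPU algorithm; the function names and the list-of-lists image layout are assumptions made for illustration.

```python
def summed_area_table(img):
    """Return the SAT of a rectangular image given as a list of rows."""
    h, w = len(img), len(img[0])
    sat = [list(row) for row in img]
    # Causal pass down each column: y[i] = x[i] + y[i-1].
    for j in range(w):
        for i in range(1, h):
            sat[i][j] += sat[i - 1][j]
    # Causal pass along each row, applied to the column results.
    for i in range(h):
        for j in range(1, w):
            sat[i][j] += sat[i][j - 1]
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of img[top..bottom][left..right] (inclusive) from four lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total
```

With the table in place, the sum over any axis-aligned rectangle costs at most four lookups regardless of its size, which is why SATs are used for effects such as the depth-of-field application cited on the slide.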