Fast Fourier Transforms on GPU

Download Report

Transcript Fast Fourier Transforms on GPU

Fast Fourier Transforms on
GPUs
Cory Quammen
April 11, 2007
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Overview
Applications exploiting DFT
Discrete Fourier transform
FFT algorithm basics
Papers
Performance comparison
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Applications
Signal processing
Waveform analysis
Band-pass filtering
Signal detection
Convolution
Signal filtering
Polynomial multiplication
Big integer multiplication
Interpolation
Padding with zeros in frequency domain implies
increased sampling rate in spatial (time) domain
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Applications
Solving PDEs
Finding nth derivative in spatial (time) domain
equivalent to multiplying by in in frequency
domain
Image compression
Analyze parts of an image, discarding highfrequency components
Medical image reconstruction
MRI and ultrasound
Volume rendering
Fourier slice projection theorem
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Discrete Fourier Transform
Transforms discrete, evenly-spaced
samples from spatial (or time) domain
to frequency domain
Expresses input signal as a
superposition of sinusoids
Maps complex vectors to complex
vectors
All frequencies in original data are
fully represented in transformed data
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Discrete Fourier Transform
Linear
Invertible - original data can be
fully recovered
Important - discrete Fourier
transform (DFT) is a
mathematical concept, not an
algorithm
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
DFT definition
Forward transform
N1
F(k)   f (n)W Nkn
n 0
where W N  ei2 / N , k  0,...,N 1
Inverse transform
1 N 1
kn
f
(k)

F(n)W

N

N
n 0
Difference is a change of sign and
scale factor

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Multi-dimensional DFT
N 2 1
N d 1

 
k1 n1
k2 n 2
kd n d



F(k1,k2,...,kd )  
W
W
...
W

f
(n
,n
,...,n
)
...


N
N
N
1
2
d
1
2
d




n1  0
n 2  0
nd
 
N1 1
where d is number of dimensions
Nested summations over each
dimension

Computationally,
simplifies to row-
column algorithm
Apply set of 1D DFTs in one
dimension, then subsequent sets of
1D DFTs in remaining dimensions
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Interpretation
Fourier transform often
characterized by two things
Amplitude
F(k)  a2  b2 where a is real, b is imaginary
Phase
b 
(F(k))  tan  
a 
1


The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
F(impulse)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
F(cos wave)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
F(white noise)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
F(Gaussian)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Naïve DFT computation
Implement directly
N 1
F(k)   f (n)W Nkn
n 0
Complexity O(N2)
Much
 too slow!
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Cooley-Tukey algorithm
Developed by Gauss in 1805, but
forgotten and developed
independently by Cooley and
Tukey, and others before them
Divide-and-conquer algorithm
reduces complexity to O(N log N)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Faster DFT computation
Apply divide-and-conquer
Split input into 2 subarrays
Even-index subarray
Odd-index subarray
Take DFT of subarrays
Combine results
Yields complexity O(N log N)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Danielson-Lanczos Lemma
Separate summation into even-odd parts
N 1
F(k)   f n W Nnk


n 0
N / 21
 f 2nW
n 0
N / 21

n 0
2nk
N

N / 21
 f 2n  1W
n 0
N / 21
k
o
N
n 0
f e (n)W N2nk  W
f
(2n 1)k
N
(n)W N2nk
Notice

WN2  e(i2 / N )2 ei2  /(N / 2)  WN / 2
[Moreland2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Therefore
2
(i2  / N )2
i2  /(N / 2)
W

e
e
 WN / 2
Recall
N
 F(k) 

N / 21

n 0
N / 21

n 0
f e (n)W N2nk  W Nk
f e (n)W Nnk/ 2  W Nk
N / 21
f
(n)W N2nk
o
(n)W Nnk/ 2
n 0
N / 21
f
n 0

F e (k)  W Nk F o (k)/N
  e
k o

F
(k)

W
N F (k)/N

[Moreland2003]
o
k  N /2
k  N /2
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Recursive implementation
% Recursive formulation of FFT algorithm. Very inefficient because
% of data reordering and replication.
function Fx = myfft(fx)
N = length(fx);
if (N == 1)
Fx = fx;
return;
end
% Coefficient
Wn = exp(-i*2*pi/N);
even = myfft(fx(1:2:N-1));
odd = myfft(fx(2:2:N));
exponents = (0:(N/2-1))';
firstHalf = even + (Wn.^exponents.*odd);
secondHalf = even - (Wn.^exponents.*odd);
Fx = [firstHalf; secondHalf];
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Recursive implementation
caveats
Requires stack space
No stack on GPU
Array reordering at every step
Too much memory traffic
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Iterative implementation
Reorder input
Details to follow
Start from smallest partition size first
Fourier transform of length-one vector is identity
Start with partition size p = 2
Combine results
Double partition size p
Repeat
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Input reordering
000
001
010
011
100
101
110
111
0
1
2
3
4
5
6
7
0
2
4
6
1
3
5
7
0
4
2
6
1
5
3
7
000
100
010
110
001
101
011
111
Reverse the bits!
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Iterative computation
0
1
2
3
4
5
6
7
0
2
4
6
1
3
5
7
0
4
2
6
1
5
3
7
Unweighted
Weighted
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Spitzer 2003
Implemented 1D FFT
Bit reversal is a lookup operation
Indices and weights for each pass are
precomputed and stored in
“scramble” texture
Pad real inputs to complex by setting
imaginary component to 0
[Spitzer2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Implementation and results
OpenGL and Cg
1D textures
Single 2048 FFT
FFTW
3.0GHz Pentium 4
NVIDIA GeForceFX
5900 Ultra
[Spitzer2003]
12µs (1.5 GFLOPS)
16µs (1.1 GFLOPS)
32µs w/read
(0.6 GFLOPS)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Performance limitations
Vertex processors not used (~5
GFLOPS of unused processing power)
Many OpenGL calls for ping-ponging
Each pass must finish before next one
starts
1D textures - not optimized in
hardware
Additional memory bandwidth for
index and weight lookups
Vector ALUs not exploited
[Spitzer2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Performance improvements
Packing complex numbers into 1 or
more textures
RG - first number, BA - second number
Two textures: RGBA1 - real, RGBA2 - imaginary
Batching
Fill 2D texture with 1D inputs
Keep graphics pipeline filled
Cuts ratio of driver calls to computation
substantially
[Spitzer2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Mitchell et al. 2003
Discusses FFT in context of
image processing and DirectX 9
Similar arrangement and
algorithm to [Spitzer2003]
Handles 2D FFTs
[Mitchell2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
FFT over columns
// Horizontal scramble first
SetSurfaceAsTexture( surfaceA); // Input image
SetRenderTarget( surfaceB);
LoadShader( HorizontalScramble);
SetTexture( ButterflyTexture[log2(width)]);
DrawQuad();
// Horizontal butterflies
LoadShader( HorizontalButterfly);
SetTexture( ButterflyTexture[log2(width)]);
for ( i = 0; i < log2( width); i++) {
SwapSurfacesAandB();
SetShaderConstant( “pass”, i/log2(width));
DrawQuad();
}
[Mitchell2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
FFT over rows
// Vertical scramble
SwapSurfacesAandB();
LoadShader( VerticalScramble);
SetTexture( ButterflyTexture[log2(height)]);
DrawQuad();
// Vertical butterflies
LoadShader( VerticalButterfly);
SetTexture( ButterflyTexture[log2(height)]);
for ( i = 0; i < log2( height); i++) {
SwapSurfacesAandB();
SetShaderConstant( “pass”, i/log2(height));
DrawQuad();
}
[Mitchell2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Moreland and Angel 2003
Goal was to enable a real-time
filtering stage during image
generation
Developed an indexing scheme
to eliminate input reordering
Optimized for real input signals
[Moreland2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Optimizations for real input
Most input images are real
Symmetry in Fourier transform of
real signals
F(k)  F * (N  k)
No need to add 0-valued
imaginary components
 half the memory
Uses
[Moreland2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Optimizations for real input
Can “pack” two reals into a
single complex number and apply
standard complex->complex FFT
Treat one half of 1D input as real,
other half as imaginary
h(k)  f (k)  ig(k)  H(k)  F(k)  iG(k)
Must untangle results afterwards

[Moreland2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Conceptual organization
2D Organization
1D Organization
N 1
0
N  1 N2  1 N2 N2  1 N  1
Real Values
Imaginary Values
...
N
2
No data reordering 
2D packing
Columns 0 and M/2 contain
reals
Other columns can be paired
up as real and imaginary
components

[Moreland2003]
1
N
2
N
2
1
...
1
0
0
M 1... M2  1 M2 M2  1... M  1
Real Values
Imaginary Values
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Program flow
Imag.
F
[Moreland2003]
I, G
R, G
I, F
Imaginary
Untangled
R, F
I, G
Real
F
R, G
Real
Tangled
Scale
Untangle
I, F
Imag.
G
Scale
R, F
Real
G
Pass
Imaginary
Tangled
FFT
Imaginary
Untangled
Untangle
Real
Untangled
Imag., Tangled
Imag.
F
Real
Untangled
Imag., Tangled
Scale
Real
F
Imag.
G
Real, Tangled
FFT
Pass
Images
Real
Tangled
Real
G
Untangle
Pass
Frequency Spectra
Imaginary
Tangled
FFT
Real, Tangled
Untangle
Scale
FFT
Pass
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Implementation
OpenGL and Cg
NVIDIA GeForce FX 5800 Ultra
Ping-ponging
[Moreland2003]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Results
Arithmetic-bound
2.5 GFLOPS
3.4 GB/sec
memory bandwidth
1024x1024
convolution
FFTW on 1.7GHz Xeon
took 0.625 seconds
Their implementation
took 2.7 seconds (with
data transfer)
[Moreland2003]
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Sumanaweera and Liu 2005
Can operate on two complex
inputs simultaneously
Targets 2D images
Columns first, then rows
Auto-tunes to balance load
among:
Vertex processors
Rasterizer
Fragment processors
[Sumanaweera2005]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Compute buffers
Uses one pbuffer with several draw buffers
Use separate buffers for real and imaginary
complex components
NVIDIA Quadro NV4x family has stereo
capability
Supports up to 8 buffers
GL_FRONT_LEFT, GL_BACK_LEFT, GL_FRONT_RIGHT,
GL_BACK_RIGHT
GL_AUX0, GL_AUX1, GL_AUX2, GL_AUX3
Allows parallel processing of two 2D images
2 complex images  2 components  (src + dst buffers)
[Sumanaweera2005]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Incorporating vertex
processors and rasterizer
In previous implementations,
vertex processor is basically idle
Transforms only 4 vertices for each pass
Indices and weights stored in textures
Enhancement
Fragments can be grouped together into
contiguous chunks within which indices
and angular argument of the weight vary
linearly
[Sumanaweera2005]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Linear interpolation
F0
F1
F2
F3
F4
F5
F6
F7
0
1
2
3
4
5
6
7
Note the linear variation in indices used to
form F0, F1, F2, F3
Can exploit interpolation hardware to
eliminate texture accesses for index lookups
Also works for angular arguments of
weights
[Sumanaweera2005]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Caveats
In early stages, must draw many
small rectangles
Vertex processors loaded more heavily
than fragment processors
Worse performance than single-quad
approach with index and weight lookups
sincos evaluation in hardware to
obtain weights
Probably enough arithmetic to hide
memory latency
[Sumanaweera2005]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Load balancing
Quads at step
1 2
…
log N
Switch between
approaches at
some stage decided
at runtime
Auto-tuning code
Index
lookups
[Sumanaweera2005]
Interpolated
indices
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Performance
Fragment
Only
Quadro NV40
GPU (Hz)
Vertex and
Fragment
Quadro NV40
GPU (Hz)
FFTW
2.5GHz P4
(Hz)
Gain over
CPU
(Column 3
/Column 4)
256256
430
473
355
1.33
512512
95
104
56
1.86
204832
387
426
230
1.85
204864
190
207
105
1.97
2048128
91
100
53
1.89
2048256
44
48
26
1.85
2048512
21
23
13
1.77
20481024
10
11
6
1.83
1024256
93
102
54
1.89
1024512
45
49
27
1.81
Image Size
[Sumanaweera2005]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Govindaraju et al. 2006
Emphasis is on cache-efficiency
Uses Stockham FFT formulation
Like [Moreland2003], does not perform initial
reordering step
Simpler indexing functions
Supports 1D FFTs
Data stored in 2D arrays
Better exploitation of GPU cache designed for 2D
access patterns
Makes larger problem sizes possible
Divides work up into tiles to improve
cache reuse
[Govindaraju2006]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Performance
Problem size - 4 million complex
numbers
On NVIDIA 7800GTX, 6464 tile size performs best
6.1 GFLOPS on NVIDIA 7900 GTX
32 GB/s memory bandwidth
4-5X faster than two 3.6GHz Xeon processors
w/Hyperthreading and two dual-core Opteron
280s running threaded FFTW
Claims to be 3X faster than libgpufft
from Stanford
[Govindaraju2006]
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Other implementations
CUFFT
CUDA’s black-box FFT implementation
Mark Harris reports 50 GFLOPS on 8800
GTX for more than 4096 elements in 1D
GPU FFT
Brook implementation
T. Jansen, B von Rymon-Lipinski, N.
Hanssen, E. Keeve, “Fourier volume
rendering on the GPU using a split-stream
FFT”, Proc. Vision, Modeling, and
Visualization, pp. 395-403, 2004.
9X faster on ATI Radeon 9800 than 2.6GHz
Pentium 4 for 10242 real image
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Other implementations
T. Schiwietz, T. Chang, P. Speier, and R.
Westermann, “MR image reconstruction
using the GPU”, Proc. SPIE 6142, pp. 12791290, 2006.
GPU-FFT on ATI X1600 XT vs. 3.0GHz Pentium 4
1.7X faster than FFTW for 10242 complex image
1.9X faster for 5122 complex image
O. Fialka and M. Cadik, “FFT and convolution
performance in image filtering on GPU”,
Proc. of Information Visualization, 2006.
NVIDIA GeForce 6600GT vs 2.6GHz Intel Celeron D
Convolution 3.4X faster for 2562 complex image
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
GPU FFT shootout
Tested on NVIDIA GeForce 8800 GTX
All tests include bus traffic to GPU
No readback
Complex inputs
GFLOPS derived from estimated FFTW
operation count
5 N log2(N) / time
N is product of dimension sizes
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
GPU FFT shootout
Library
GPUFFTW
CUDA
1D
16,384
ms (GFLOPS)
1D
8,388,608
ms (GLFOPS)
2D
10242
ms (GFLOPS)
2D
20482
ms (GLOPS)
169 (0.006)
198 (4.6)
200 (0.5)
288 (1.6)
0.701 (1.6)
◊
6.781 (15.5)
32 (14.4)
◊
◊
12.205 (8.6)
Cg error
Cg error
Cg error
Cg error
pbuffer error
pbuffer error
pbuffer error
pbuffer error
Mitchell2003
Moreland2003
Sumanaweera2005
◊ Dimension not supported
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
References
Spitzer, “Implementing a GPU-Efficient FFT”, SIGGRAPH
GPGPU Course, 2003.
K. Moreland and E. Angel, “The FFT on a GPU”, Graphics
Hardware, 2003.
J. Mitchell, M. Ansari, E. Hart, “Advanced Image Processing
with DirectX 9 Pixel Shaders”, in ShaderX2 - Shader
Programming Tips and Tricks, 2003.
T. Sumanaweera and D. Liu, “Medical Image
Reconstruction with the FFT”, GPU Gems 2, pp. 765-784,
2005.
N. Govindaraju, S. Larsen, J. Gray, and D. Manocha, “A
memory model for scientific algorithms on graphics
processors,” Proc. Supercomputing, p. 6, 2006.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL