
CLFFT: An FFT code generator for heterogeneous systems

Krishna G Pai
Rejith George Joseph
Girish Ravunnikutty

Agenda

• FFT
• Intro to CLFFT
• Brief intro to OpenCL
• Comparison of FFT algorithms
• Comparison with CUFFT
• Future work

Discrete Fourier Transform

• The DFT takes O(n^2) operations with a naive implementation.
• Fast Fourier Transforms (FFTs) are O(n log n) implementations of the DFT.

[Figure: DFT illustration; image from Intel.com]

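For reference (standard definition, not shown on the slide), the length-N DFT of a sequence x_0, ..., x_{N-1} is

    X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i n k / N}, \qquad k = 0, 1, \dots, N-1,

which costs O(N^2) when evaluated directly; the FFT factorizations discussed below reduce this to O(N log N).
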
Why FFTs?

• Lots of ongoing research:
  ◦ FFTW (http://fftw.org/)
  ◦ Spiral (http://spiral.net/)
• Spectral methods are one of the 13 Dwarfs of Parallel Computing.
• A rich set of algorithms, each optimal for a certain N.
• And, of course, wide applicability.

Our Approach

• FFTW generates code that adapts to a particular architecture (CPUs).
• Spiral does the same, but optimizes at compile time (also CPUs).
• Other research targets GPUs, most notably Govindaraju et al.
• Use all the available computing resources to make FFTs really fast!

Heterogeneous Computing

[Figure: Intel Core 2 Duo (CPU) and Nvidia Tesla (GPU)]

• Use both of these resources simultaneously.

Intro to CLFFT

• Future systems are going to be heterogeneous in nature (multi-core CPUs with GPGPUs as co-processors).
• Study various FFT algorithms and implement them on a GPGPU and on multi-core CPUs.
• Explore how FFTs can be scheduled across both of these computing resources, and the performance thus obtained.
• OpenCL is used to program the GPGPU and OpenMP to parallelize on the CPUs.

FFTs Studied

• SlowFFT (naive implementation)
• Cooley-Tukey (radix 2, for N = 2^k; see the sketch after this list)
• Stockham (radix 2, for N = 2^k)
• Sande-Tukey (radix 2, decimation in frequency, for N = 2^k)
• Bluestein's (radix 2, for any N)

Cooley-Tukey and SlowFFT are also parallelized with OpenMP.

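For readers unfamiliar with the radix-2 recursion named above, here is a minimal plain-C sketch of the decimation-in-time Cooley-Tukey FFT. It is illustrative only, not the CLFFT OpenCL kernel; the function name fft_radix2 and the driver in main are hypothetical.

    /* Minimal sketch of the recursive radix-2 decimation-in-time
     * Cooley-Tukey FFT for N = 2^k. The Stockham and Sande-Tukey
     * variants reorder the butterflies but share the same O(N log N)
     * structure. Illustrative only; not the CLFFT kernel. */
    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    static void fft_radix2(const double complex *in, double complex *out,
                           int n, int stride)
    {
        if (n == 1) {                 /* a single point is its own DFT */
            out[0] = in[0];
            return;
        }
        /* DFTs of the even- and odd-indexed subsequences */
        fft_radix2(in,          out,         n / 2, 2 * stride);
        fft_radix2(in + stride, out + n / 2, n / 2, 2 * stride);

        const double pi = acos(-1.0);
        for (int k = 0; k < n / 2; k++) {
            double complex w = cexp(-2.0 * pi * I * k / n);  /* twiddle factor */
            double complex e = out[k];              /* even half            */
            double complex o = w * out[k + n / 2];  /* odd half, twiddled    */
            out[k]         = e + o;                 /* butterfly             */
            out[k + n / 2] = e - o;
        }
    }

    int main(void)
    {
        enum { N = 8 };
        double complex x[N], X[N];
        for (int i = 0; i < N; i++) x[i] = i;       /* sample input 0..7 */
        fft_radix2(x, X, N, 1);
        for (int i = 0; i < N; i++)
            printf("X[%d] = %7.3f %+7.3fi\n", i, creal(X[i]), cimag(X[i]));
        return 0;
    }
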
Computational Parity

• An Intel Xeon delivers about 70 GFLOPS at peak performance.
• An Nvidia Tesla delivers about 933 GFLOPS.
• So there is not much computational parity on the HPC Tesla machines.
• There is better parity on laptops with GPGPUs.
• Thus more work can be shared between the CPU and the GPU.

(Source: Intel and Nvidia)

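A rough load-balancing estimate follows from these peak figures (my arithmetic, not on the slide): splitting work in proportion to peak throughput gives the CPU roughly

    \frac{70}{70 + 933} \approx 7\%

of the total on a Tesla-class machine, which is consistent with the heavy GPU bias noted later on the Work Flow slide.
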
OpenCL

• A standard for parallel programming of heterogeneous systems involving CPUs, GPU(s), CPU + GPU, the IBM Cell blade, etc.
• So we can have portability across various architectures without a very great performance penalty.*

* More on this when we compare matrix multiplication.

Differences w.r.t. CUDA

• There is no standalone compiler to produce binaries; we compile kernels at run time (see the host-code sketch after this list).
• Command queues are used for launching kernels and for memory operations.
• Device memory is managed via buffer objects, which provide richer functionality than in CUDA.
• A buffer object allows a host memory region to be used by the device directly.
• OpenCL requires memcpy between device and host to be explicitly synchronized.
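
To make these bullet points concrete, here is a minimal host-side sketch of the OpenCL model: a command queue, runtime compilation from source, a buffer object, and an explicitly synchronized (blocking) read-back. The "scale" kernel and all sizes are hypothetical; this is an illustrative sketch, not the CLFFT host code, and error checking is omitted.

    /* Host-side sketch of the OpenCL model described above (assumes one
     * platform with one GPU device; error checking omitted). */
    #include <CL/cl.h>
    #include <stdio.h>

    /* Kernel source compiled at run time: there is no offline binary. */
    static const char *src =
        "__kernel void scale(__global float *a, float s) { "
        "    int i = get_global_id(0);                     "
        "    a[i] = a[i] * s;                              "
        "}";

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        /* Every kernel launch and memory operation goes through a command queue. */
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Runtime compilation of the kernel source for this device. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        float data[16];
        for (int i = 0; i < 16; i++) data[i] = (float)i;

        /* Buffer object wrapping the data (CL_MEM_USE_HOST_PTR would instead
         * let the device work on the host region directly). */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);

        float s = 2.0f;
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(k, 1, sizeof(float), &s);

        size_t gsize = 16;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &gsize, NULL, 0, NULL, NULL);

        /* Blocking read: the device-to-host copy is synchronized explicitly. */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
        printf("data[3] = %f\n", data[3]);    /* expect 6.0 */

        clReleaseMemObject(buf);
        clReleaseKernel(k);
        clReleaseProgram(prog);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }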



OpenCL Implementations

• On August 5, 2009, AMD unveiled the first development tools for its OpenCL platform as part of its ATI Stream SDK v2.0 Beta Program.
• On August 28, 2009, Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL.
• On September 28, 2009, NVIDIA released OpenCL drivers and an SDK implementation.

Limitations

• Nvidia's OpenCL implementation supports only the GPU as an OpenCL device.
• The driver doesn't consider the CPU an OpenCL device.
• Hence an OpenCL kernel cannot be invoked on the CPU.
• We had to use OpenMP for the CPU.
• AMD's Stream OpenCL has support for OpenCL on the CPU.

Work Flow

• Currently, we split work between the CPU and the GPU(s) only for Cooley-Tukey.
• Cooley-Tukey here is radix-2.
• Results are merged.
• On Tesla, the bias is heavily in favor of GPU computation.
• From the host, one thread invokes the OpenMP kernel and other threads, one per GPU, invoke the OpenCL kernels (see the skeleton below).
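
The threading structure described above might look roughly like the following skeleton (my reconstruction, not the actual CLFFT driver). cpu_fft_openmp, gpu_fft_opencl, and NUM_GPUS are hypothetical stand-ins; the real OpenCL submission and the merge step are omitted.

    /* Skeleton of the work flow: one host thread runs the OpenMP CPU path,
     * while one host thread per GPU drives an OpenCL kernel. */
    #include <omp.h>
    #include <stdio.h>

    #define NUM_GPUS 2   /* hypothetical: number of OpenCL GPU devices */

    /* Stub for the OpenMP-parallel CPU Cooley-Tukey over its share of the data. */
    static void cpu_fft_openmp(void)
    {
        #pragma omp parallel
        printf("CPU FFT: OpenMP thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }

    /* Stub for the host thread that enqueues the OpenCL kernel on one GPU. */
    static void gpu_fft_opencl(int gpu_id)
    {
        printf("GPU FFT: submitting OpenCL kernel to device %d\n", gpu_id);
    }

    int main(void)
    {
        omp_set_nested(1);   /* let the CPU worker spawn its own OpenMP team */

        /* One thread for the CPU path plus one per GPU. */
        #pragma omp parallel num_threads(1 + NUM_GPUS)
        {
            int t = omp_get_thread_num();
            if (t == 0)
                cpu_fft_openmp();        /* CPU share of the butterflies      */
            else
                gpu_fft_opencl(t - 1);   /* each remaining thread owns one GPU */
        }
        /* ... results from the CPU and GPU partitions are merged here ... */
        return 0;
    }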

Comparison of all Radix 2

[Chart: "Power 2 FFT" — execution time in seconds vs. FFT size (32 to 16384) for CL-1GPU, ST-1GPU, and Santook-1GPU]

Comparison of Cooley-Tukey

[Chart: "Cooley-Tukey" — execution time in seconds for the CPU-GPU split versus GPU-only execution]

CLFFT vs CuFFT for Power of 2

[Chart: "Cooley-Tukey vs CUFFT" — execution time in seconds vs. FFT size (32 to 16384) for CL-2GPU-1 and CUFFT]

Performance Comparison

[Chart: "OpenCL vs CUDA" (matrix multiplication) — execution time in milliseconds, log scale, vs. matrix size (512x512 to 8192x8192) for CUDA and OpenCL]

CuFFT vs CLFFT for any n

[Chart: "CuFFT vs CLFFT for n from 513 to 615" — execution time in ms vs. n for Bluestein's CL and CUFFT]

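As background on why Bluestein's algorithm handles arbitrary n (a standard identity, not taken from the slides): the substitution 2nk = n^2 + k^2 - (k - n)^2 rewrites the DFT as a convolution,

    X_k = e^{-\pi i k^2 / N} \sum_{n=0}^{N-1} \left( x_n e^{-\pi i n^2 / N} \right) e^{\pi i (k - n)^2 / N},

which can be evaluated with zero-padded power-of-two FFTs; this is why the slides describe Bluestein's as a radix-2 method for any N.
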
Future Work

• Split-radix and mixed-radix algorithms.
• 2^p 3^q 5^r point FFTs.
• Winograd's prime-number FFT.
• Optimize the CPU implementations.
• Create a plan for a given n and implement it across multiple compute devices.

Thank You