Transcript 09-Parallelization_and_CUDA_libraries
Parallelization and CUDA libraries
Lei Zhou, Yafeng Yin, Hong Man
Outline
GPU & CUDA
Manual CUDA Coding
CUDA Libraries
FIR Realization
Auto-Parallelizing Tools
GPU & CUDA
GPUs are massively multithreaded, many-core chips
Hundreds of scalar processors
Tens of thousands of concurrent threads
Examples: GeForce 8800 GTX (128 cores), Tesla C1060 (240 cores)
CUDA is the acronym for Compute Unified Device Architecture
A parallel computing architecture developed by NVIDIA
The computing engine in the GPU
CUDA is accessible to software developers through industry-standard programming languages
Processing Flow
Serial code executes on the host while parallel code executes on
the device.
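This host/device split can be sketched as a minimal CUDA program (a hypothetical `scale` kernel for illustration; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Parallel code: each device thread scales one array element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    float h_data[1024];                    // host (CPU) buffer
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    // 1. Copy input from host memory to device memory.
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Launch the kernel: the serial host code continues while the
    //    device executes the parallel section.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // 3. Copy the result back to the host.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("h_data[10] = %f\n", h_data[10]);
    return 0;
}
```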
Manual CUDA Coding
Find parallel kernels
Improve data reuse inside kernels for better compute intensity
Access memory in a GPU-friendly pattern
Take advantage of the complex memory hierarchy that makes the GPU fast
Reduce the copy-in and copy-out transfers that pile up on the PCIe bus
Reduce memory usage on the GPU
Limit inter-block synchronization
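As a sketch of the "GPU-friendly memory access" point: consecutive threads should read consecutive addresses so the hardware can coalesce a warp's loads into few transactions. The two illustrative kernels below (not from the original slides) copy the same data, but the strided version generates far more memory traffic:

```cuda
// Coalesced: thread i reads element i, so neighboring threads touch
// neighboring addresses and one transaction can serve a whole warp.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: thread i reads element i * stride, so each thread in a warp
// touches a different cache line, multiplying the memory transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];
}
```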
CUDA Libraries
Basic CUDA computation libraries
CUBLAS
CUFFT
GPULib
Advanced CUDA computation libraries
CULA
MAGMA
VSIPL
Basic libraries
CUBLAS provides a set of functions for basic vector and matrix operations
matrix-vector copy, sort, dot product, Euclidean norm, etc.
CUFFT is the CUDA FFT library
cufftPlan1d(), cufftPlan2d(), cufftPlan3d()
GPULib provides a library of mathematical functions
addition, subtraction, multiplication, and division; unary functions including sin(), cos(), gamma(), and exp(); interpolation, array reshaping, array slicing, and reduction operations
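For example, the CUBLAS dot product mentioned above might be called like this (a sketch using the legacy CUBLAS API from the toolkit era of these slides; error checking omitted):

```cuda
#include <cublas.h>
#include <stdio.h>

int main(void)
{
    const int n = 4;
    float h_x[] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_y[] = {1.0f, 1.0f, 1.0f, 1.0f};
    float *d_x, *d_y;

    cublasInit();                                  // initialize CUBLAS
    cublasAlloc(n, sizeof(float), (void **)&d_x);  // device-side vectors
    cublasAlloc(n, sizeof(float), (void **)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);

    float dot = cublasSdot(n, d_x, 1, d_y, 1);     // computed on the GPU
    printf("dot = %f\n", dot);                     // 1 + 2 + 3 + 4

    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
    return 0;
}
```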
Advanced libraries
CULA: GPU-Accelerated Linear Algebra
provides LAPACK (Linear Algebra PACKage) functions on CUDA GPUs
MAGMA: Matrix Algebra on GPU and Multicore Architectures
a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid architectures and "multicore + GPU" systems
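As an illustration of the LAPACK-style interface these libraries expose, solving a small linear system with CULA might look like the sketch below. The header name and exact signatures are assumptions based on CULA's documented host interface and may differ by version:

```cuda
#include <cula.h>      // CULA host interface (assumed header name)
#include <stdio.h>

int main(void)
{
    // Solve A x = b for a 2x2 system, stored column-major as in LAPACK:
    // A = [2 1; 1 3], b = [5; 10]  =>  x = [1; 3]
    culaFloat a[] = {2.0f, 1.0f, 1.0f, 3.0f};
    culaFloat b[] = {5.0f, 10.0f};
    culaInt ipiv[2];

    culaInitialize();                      // bind the library to the GPU
    culaSgesv(2, 1, a, 2, ipiv, b, 2);     // LAPACK-style sgesv on the GPU
    culaShutdown();

    printf("x = [%f, %f]\n", b[0], b[1]);  // the solution overwrites b
    return 0;
}
```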
Advanced libraries - VSIPL
VSIPL: Vector, Signal, and Image Processing Library
Generalized matrix product
Fast FIR filtering
Correlation
Fast Fourier Transform
QR decomposition
Random number generation
Elementwise arithmetic, logical, and comparison
operators, linear algebra procedures
Example
// Allocate device memory for the filter kernel
Complex *d_filter_kernel;
cutilSafeCall(cudaMalloc((void **)&d_filter_kernel, mem_size));
// Copy host memory to device
cutilSafeCall(cudaMemcpy(d_filter_kernel, h_padded_filter_kernel,
                         mem_size, cudaMemcpyHostToDevice));
// Create the CUFFT plan
cufftHandle plan;
cufftSafeCall(cufftPlan1d(&plan, new_size, CUFFT_C2C, 1));
// Transform the signal in place (the filter kernel is transformed the same way)
cufftSafeCall(cufftExecC2C(plan, (cufftComplex *)d_signal,
                           (cufftComplex *)d_signal, CUFFT_FORWARD));
FIR Realization on CUDA
[Figure: mapping of CUDA threads onto the FIR computation]
CUDA Demo (FIR)
GPU: NVIDIA GeForce 8600 GT
CPU: Intel Duo CPU, 2.33 GHz
Software: Visual Studio 2005
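The FIR realization can be sketched as one CUDA thread per output sample. The demo's actual kernel was not reproduced in the slides; the following is a minimal version of that idea, with the filter taps held in constant memory:

```cuda
// y[i] = sum over k of h[k] * x[i - k], one thread per output sample.
#define NUM_TAPS 16

__constant__ float d_taps[NUM_TAPS];   // filter coefficients on the device

__global__ void fir(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float acc = 0.0f;
    for (int k = 0; k < NUM_TAPS; ++k)
        if (i - k >= 0)                // treat the signal as zero-padded
            acc += d_taps[k] * x[i - k];
    y[i] = acc;
}

// Host side (excerpt): upload the taps, then launch one thread per sample.
// cudaMemcpyToSymbol(d_taps, h_taps, NUM_TAPS * sizeof(float));
// fir<<<(n + 255) / 256, 256>>>(d_x, d_y, n);
```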
CUDA Demo (FIR)
FIR Performance
[Figure: FIR execution time in msec (0 to 5000) versus input length (1,000 to 10,000,000 samples), comparing CPU vs. CPU+GPU]
Auto-Parallelizing Tools
Par4All (open-source environment): C and Fortran to CUDA C
PGI Accelerator: Fortran and C to CUDA C auto-parallelizing compiler
CAPS HMPP: C and Fortran to CUDA C auto-parallelizing compiler
Goose: C to CUDA C auto-parallelizing compiler
NOAA F2C: Fortran to CUDA C translator