PPT - Caltech
CS179: GPU Programming
Lecture 10: GPU-Accelerated Libraries
Today
Some useful libraries:
cuRAND
cuBLAS
cuFFT
cuRAND
Oftentimes, we want random data
Simulations often need entropy to behave realistically
How to obtain on GPU?
No rand(), or simple equivalent
Could use a pseudo-random function with inputs based on thread properties
Ex.: float r = cos(999 * threadIdx.x + 123 * threadIdx.y)
Works okay, but not great
cuRAND
What we could do with our current tools:
Generate N random numbers on CPU
Allocate space on GPU
Memcpy to GPU
Not bad -- if we want to do this only once
Issues:
Number generation on the CPU is serial and blocks the host
Memcpy can be slow
Much better if the random data can live only on the GPU
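The three steps above can be sketched as follows. This is only an illustration of the "current tools" approach: the helper name make_host_randoms, the buffer size, and the seed are assumptions made for this example, not part of the lecture.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: fill a host vector with n floats in [0, 1]
// using the C library rand(). This loop runs serially on the CPU.
std::vector<float> make_host_randoms(size_t n, unsigned seed) {
    srand(seed);
    std::vector<float> h(n);
    for (size_t i = 0; i < n; ++i)
        h[i] = (float)rand() / (float)RAND_MAX;
    return h;
}

int main() {
    const size_t n = 1 << 20;                 // assumed size for the sketch
    const size_t bytes = n * sizeof(float);

    // 1. Generate N random numbers on the CPU.
    std::vector<float> h = make_host_randoms(n, 1234u);

    // 2. Allocate space on the GPU.
    float *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // 3. Memcpy to the GPU (blocks the host until the copy is done).
    cudaError_t err = cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));

    // ... kernels would read the random data from d here ...

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Both the generation loop and the copy happen before any kernel can start, which is the overhead the following slides try to remove.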
cuRAND
Solution: cuRAND
CUDA random number library
Works on both host and device
Lots of different distributions
Uniform, normal, log-normal, Poisson, etc.
cuRAND
Performance
cuRAND
Host API
Using on the host:
Call from host
Allocates memory on GPU
Generates random numbers on GPU
Several pseudorandom generators available
Several random distributions available
cuRAND
Host API
Functions to know:
curandCreateGenerator(&g, GEN_TYPE)
GEN_TYPE = CURAND_RNG_PSEUDO_DEFAULT,
CURAND_RNG_PSEUDO_XORWOW
Doesn’t particularly matter, differences are small
curandSetPseudoRandomGeneratorSeed(g, SEED)
Again, SEED doesn’t matter too much, just pick one (ex.: time(NULL))
curandGenerate______(…)
Depends on distribution
Ex.: curandGenerate(g, src, n),
curandGenerateNormal(g, src, n, mean, stddev)
curandDestroyGenerator(g)
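A minimal sketch of the host API calls listed above, assuming a build such as nvcc example.cu -lcurand. The wrapper name generate_uniform and the sizes are made up for this example. Two details come from the cuRAND host API itself rather than from the slide: the seed function is spelled curandSetPseudoRandomGeneratorSeed(), and a generator made with curandCreateGenerator() writes to device memory, so the output buffer is allocated with cudaMalloc().

```cuda
#include <cstdio>
#include <ctime>
#include <vector>
#include <cuda_runtime.h>
#include <curand.h>

// Hypothetical wrapper: generate n uniform floats in (0, 1] on the GPU,
// then copy them back to the host only so that we can inspect them.
bool generate_uniform(std::vector<float> &out, size_t n,
                      unsigned long long seed) {
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return false;

    curandGenerator_t g;
    if (curandCreateGenerator(&g, CURAND_RNG_PSEUDO_DEFAULT)
            != CURAND_STATUS_SUCCESS) {
        cudaFree(d);
        return false;
    }

    bool ok = curandSetPseudoRandomGeneratorSeed(g, seed)
                  == CURAND_STATUS_SUCCESS;
    // The output pointer is a device pointer.
    ok = ok && curandGenerateUniform(g, d, n) == CURAND_STATUS_SUCCESS;

    out.resize(n);
    // cudaMemcpy waits for the generation to finish before copying.
    ok = ok && cudaMemcpy(out.data(), d, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    curandDestroyGenerator(g);
    cudaFree(d);
    return ok;
}

int main() {
    std::vector<float> v;
    if (!generate_uniform(v, 8, (unsigned long long)time(NULL))) {
        fprintf(stderr, "cuRAND example failed\n");
        return 1;
    }
    for (size_t i = 0; i < v.size(); ++i) printf("%f\n", v[i]);
    return 0;
}
```

In real code the copy back to the host would normally be skipped, since the point is to leave the data on the GPU for a later kernel.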
cuRAND
Host API
curandGenerate() launches asynchronously
Much faster than serial CPU generation
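However, src in curandGenerate() must be a device pointer (from cudaMalloc()), not a host pointer!
Host pointers only work with generators made by curandCreateGeneratorHost(), which run on the CPU
All numbers must be generated and stored in global memory before the kernel runs
Introduces some undesired overhead
Might need more random numbers than fit in memory at once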
Solution: cuRAND device API
cuRAND
Device API
Supports RNG on kernels
Do not need to generate random data before kernel
We don’t have to copy and store all data at once
Stores RNG states completely on GPU
Host still needs to allocate the device memory for the states (cudaMalloc)
cuRAND
Device API
Example:
curandState *devStates;
cudaMalloc(&devStates,
sizeof(curandState) * nThreads);
kernel<<<gD, bD, sM>>>(devStates, …);
cudaFree(devStates); // don’t forget to free!
cuRAND
Device API
Example continued:
// On the device:
__global__ void kernel(curandState *states, …) {
int id = … // calculate thread id
curand_init(seed, id, 0, &states[id]);
// generate random value in range (0, 1]
v[id] = curand_uniform(&states[id]);
// transform to range (a, b]
v[id] = v[id] * (b - a) + a;
}
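Putting the host and device halves of the example together gives something like the sketch below. It is one possible complete version, not the lecture's own code: the names random_in_range and fill_random_range, the output array v, the launch configuration, and the seed are all assumptions filled in for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>

// Each thread initializes its own RNG state and draws one value.
__global__ void random_in_range(curandState *states, float *v, int n,
                                unsigned long long seed, float a, float b) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;
    // Same seed, different sequence number per thread.
    curand_init(seed, id, 0, &states[id]);
    float u = curand_uniform(&states[id]);   // in (0, 1]
    v[id] = u * (b - a) + a;                 // in (a, b], up to rounding
}

// Hypothetical host wrapper: allocate states and output, run the kernel,
// and copy the results back for inspection.
bool fill_random_range(std::vector<float> &out, int n, float a, float b,
                       unsigned long long seed) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    curandState *devStates = nullptr;
    float *d_v = nullptr;

    bool ok = cudaMalloc(&devStates, n * sizeof(curandState)) == cudaSuccess
           && cudaMalloc(&d_v, n * sizeof(float)) == cudaSuccess;
    if (ok) {
        random_in_range<<<blocks, threads>>>(devStates, d_v, n, seed, a, b);
        ok = cudaGetLastError() == cudaSuccess;
    }
    out.resize(n);
    ok = ok && cudaMemcpy(out.data(), d_v, n * sizeof(float),
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_v);
    cudaFree(devStates);   // don't forget to free!
    return ok;
}

int main() {
    std::vector<float> v;
    if (!fill_random_range(v, 1024, -2.0f, 3.0f, 1234ULL)) {
        fprintf(stderr, "device API example failed\n");
        return 1;
    }
    for (int i = 0; i < 4; ++i) printf("%f\n", v[i]);
    return 0;
}
```

Calling curand_init() in the same kernel that draws the numbers keeps the sketch short. When many numbers are drawn over several launches, a common pattern is to initialize the states once in a separate kernel and reuse them.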
cuRAND
Device API
Note the difference between cuRAND states and the actual
values
A state records where a thread is in its random sequence (set by seed, sequence number, and offset); it is not a random value itself
Numbers aren’t generated until
curand_<DISTRIBUTION>(&state) is called
cuRAND
Overview
Can generate numbers on either host or device
Whether using the host API or the device API, the host must allocate the device memory (cudaMalloc)
Many different generators and distributions available
Check out these for more details:
http://docs.nvidia.com/cuda/curand/host-api-overview.html
http://docs.nvidia.com/cuda/curand/device-api-overview.html
cuBLAS
Linear algebra is extremely important in many applications
Physics, engineering, mathematics, computer graphics, networking,
…
Anything STEM, really
Linear algebra systems are oftentimes HUGE
Ex.: Inverting a matrix of size 10^6 x 10^6 would take a while on a CPU…
Linear algebra systems are oftentimes parallelizable
Element a[0][0] doesn’t care about what a[1][0] will be, just what it was
Linear algebra is a perfect candidate for GPU
cuBLAS
cuBLAS: CUDA’s linear algebra library
Based on BLAS (Basic Linear Algebra Subprograms)
Supports all 152 standard BLAS routines
Works pretty similarly to BLAS
cuBLAS
Performance
cuBLAS
Several levels of BLAS:
BLAS1: Handles vector & vector-vector functions
Sum, min, max, etc.
Add, scale, dot, etc.
BLAS2: Handles matrix-vector functions
Multiplication, generally
BLAS3: Handles matrix-matrix functions
Multiplication, adding, etc.
cuBLAS
Using is fairly simple
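Call initialization before using any cuBLAS function
cublasInit() (legacy API; with cublas_v2.h, use cublasCreate(&handle))
Call whatever functions you need from host code, on data in device memory
Call shutdown after you’re done with cuBLAS
cublasShutdown() (legacy API; with cublas_v2.h, use cublasDestroy(handle))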
Check out the following for more info:
http://docs.nvidia.com/cuda/cublas/index.html
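As a concrete case of the init / call / shutdown pattern, here is a small BLAS1 sketch that computes y = alpha * x + y with cublasSaxpy(). It uses the handle-based cublas_v2.h API (cublasCreate() / cublasDestroy()) rather than cublasInit() / cublasShutdown(). The wrapper name saxpy_gpu and the test vectors are made up for this example, and the build line is assumed to be something like nvcc example.cu -lcublas.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical wrapper: y = alpha * x + y, computed on the GPU.
// cuBLAS functions are called from the host and work on device memory.
bool saxpy_gpu(float alpha, const std::vector<float> &x,
               std::vector<float> &y) {
    if (x.size() != y.size()) return false;
    const int n = (int)x.size();
    const size_t bytes = n * sizeof(float);

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return false;

    float *d_x = nullptr, *d_y = nullptr;
    bool ok = cudaMalloc(&d_x, bytes) == cudaSuccess
           && cudaMalloc(&d_y, bytes) == cudaSuccess;
    ok = ok && cudaMemcpy(d_x, x.data(), bytes,
                          cudaMemcpyHostToDevice) == cudaSuccess;
    ok = ok && cudaMemcpy(d_y, y.data(), bytes,
                          cudaMemcpyHostToDevice) == cudaSuccess;

    // alpha is passed by pointer; by default it lives on the host.
    // The two 1s are the strides between elements of x and y.
    ok = ok && cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1)
                   == CUBLAS_STATUS_SUCCESS;

    ok = ok && cudaMemcpy(y.data(), d_y, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(d_x);
    cudaFree(d_y);
    cublasDestroy(handle);
    return ok;
}

int main() {
    std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> y = {10.0f, 20.0f, 30.0f, 40.0f};
    if (!saxpy_gpu(2.0f, x, y)) {
        fprintf(stderr, "cuBLAS example failed\n");
        return 1;
    }
    for (size_t i = 0; i < y.size(); ++i) printf("%.1f\n", y[i]);
    return 0;
}
```

For such a tiny vector the copies dominate the runtime. The library pays off when the data is large or already resides on the GPU.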
cuBLAS
Alternative: cuSPARSE
Another CUDA linear algebra library
Generally works well when dealing with sparse matrices (most
entries are 0)
Also handles operations that mix sparse matrices with dense vectors
cuFFT
Another concept with lots of applications, scalability, and
parallelizability: the Fourier transform
Commonly used in physics, signal processing, etc.
Oftentimes needs to be real-time
Makes great use of GPU
cuFFT
Supports 1D, 2D, or 3D Fourier Transforms
1D transforms can have up to 128 million elements
Based on Cooley-Tukey and Bluestein FFT algorithms
Similar API to FFTW, if familiar
Thread-safe, streamed, asynchronous execution
Supports both in-place and out-of-place transforms
Supports real, complex, float, double data
cuFFT
Performance
cuFFT
Usage is fairly simple
Allocate space on the GPU
Same old cudaMalloc() call
Create a cuFFT plan
Tells dimension, sizes, and data types
cufftPlan3d(&plan, nx, ny, nz, TYPE)
TYPE = CUFFT_C2C, CUFFT_C2R, CUFFT_R2C (complex to complex, complex to real, real to
complex); double-precision versions are CUFFT_Z2Z, CUFFT_Z2D, CUFFT_D2Z
cuFFT
Execute the plan
cufftExecC2C(plan, in_data, out_data, CUFFT_FORWARD)
Replace C2C with your plan type (cufftExecR2C() and cufftExecC2R() take no direction argument)
Can replace CUFFT_FORWARD with CUFFT_INVERSE
Destroy plan, clean up data
cufftDestroy(plan)
cudaFree(in_data), cudaFree(out_data)
Check out more here:
http://docs.nvidia.com/cuda/cufft/index.html
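The allocate / plan / execute / destroy steps can be sketched with a 1D transform, which is easier to check by hand than the 3D plan above. The sketch runs a forward and then an inverse complex-to-complex FFT in place and measures how closely the input is recovered. cuFFT transforms are unnormalized, so the round trip returns the input multiplied by n. The function name fft_round_trip_error, the test signal, and the size are assumptions for this example; the assumed build line is nvcc example.cu -lcufft.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical check: forward + inverse C2C FFT of n points, in place.
// Returns the largest absolute error after dividing by n, or -1 on failure.
float fft_round_trip_error(int n) {
    const size_t bytes = n * sizeof(cufftComplex);
    std::vector<cufftComplex> h(n), r(n);
    for (int i = 0; i < n; ++i) {
        h[i].x = sinf(0.1f * i);   // real part: an arbitrary test signal
        h[i].y = 0.0f;             // imaginary part
    }

    // Allocate space on the GPU.
    cufftComplex *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return -1.0f;

    // Create a plan: n points, complex to complex, a batch of 1.
    cufftHandle plan;
    if (cufftPlan1d(&plan, n, CUFFT_C2C, 1) != CUFFT_SUCCESS) {
        cudaFree(d);
        return -1.0f;
    }

    bool ok = cudaMemcpy(d, h.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    // Execute the plan: same pointer for input and output = in-place.
    ok = ok && cufftExecC2C(plan, d, d, CUFFT_FORWARD) == CUFFT_SUCCESS;
    ok = ok && cufftExecC2C(plan, d, d, CUFFT_INVERSE) == CUFFT_SUCCESS;
    ok = ok && cudaMemcpy(r.data(), d, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;

    // Destroy the plan, clean up the data.
    cufftDestroy(plan);
    cudaFree(d);
    if (!ok) return -1.0f;

    float max_err = 0.0f;
    for (int i = 0; i < n; ++i) {
        float er = fabsf(r[i].x / n - h[i].x);
        float ei = fabsf(r[i].y / n - h[i].y);
        max_err = fmaxf(max_err, fmaxf(er, ei));
    }
    return max_err;
}

int main() {
    float err = fft_round_trip_error(256);
    if (err < 0.0f) {
        fprintf(stderr, "cuFFT example failed\n");
        return 1;
    }
    printf("max round-trip error: %g\n", err);
    return 0;
}
```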
GPU-Accelerated Libraries
Many more available
https://developer.nvidia.com/gpu-accelerated-libraries
OpenCV: Computer vision library (has GPU acceleration libraries)
NPP: Performance primitives library, helps with signal/image processing
Check them out!
Best practice for learning:
Check out documentation
Check out examples
Modify example code
Repeat above until familiar, then use in your own code!