7-20-2010-CUDA lib summary


CUDA Library and Demo

Yafeng Yin, Lei Zhou, Hong Man 07/21/2010

Outline

• Basic CUDA computation library
  – GPULib, CUBLAS, CUFFT
• Advanced CUDA computation library
  – CULA/MAGMA, VSIPL
• CUDA FIR Demo (UMD)
• Discuss and future work

Basic lib - GPULib

GPULib provides a library of mathematical functions:
– addition, subtraction, multiplication, and division, as well as unary functions, including sin(), cos(), gamma(), and exp()
– interpolation, array reshaping, array slicing, and reduction operations

Basic lib - CUBLAS

• BLAS: Basic Linear Algebra Subprograms
• CUBLAS provides a set of functions for basic vector and matrix operations, such as copy, swap, dot product, and Euclidean norm
  – Real data
    • Level 1 (vector-vector, O(N))
    • Level 2 (matrix-vector, O(N^2))
    • Level 3 (matrix-matrix, O(N^3))
  – Complex data
    • Level 1

CUBLAS - Level 2 functions

Function         Operation
cublasSgbmv()    y = alpha * op(A) * x + beta * y
cublasSgemv()    y = alpha * op(A) * x + beta * y
cublasSger()     A = alpha * x * y^T + A
cublasSsbmv()    y = alpha * A * x + beta * y
cublasSspmv()    y = alpha * A * x + beta * y
cublasSspr()     A = alpha * x * x^T + A
cublasSspr2()    A = alpha * x * y^T + alpha * y * x^T + A
cublasSsymv()    y = alpha * A * x + beta * y
cublasSsyr()     A = alpha * x * x^T + A
cublasSsyr2()    A = alpha * x * y^T + alpha * y * x^T + A
cublasStbmv()    x = op(A) * x
cublasStbsv()    solves op(A) * x = b, output x
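The operation y = alpha * op(A) * x + beta * y can be checked on the CPU with a small reference routine. The sketch below is plain C, not a CUBLAS call. The name ref_sgemv and its argument order are hypothetical, and the column-major layout with a leading dimension lda is assumed to follow the usual BLAS convention.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for y = alpha * op(A) * x + beta * y.
 * A is m x n, stored column-major with leading dimension lda.
 * trans == 0 selects op(A) = A, any other value selects op(A) = A^T.
 * ref_sgemv is a hypothetical helper, not part of CUBLAS. */
static void ref_sgemv(int trans, int m, int n, float alpha,
                      const float *A, int lda,
                      const float *x, float beta, float *y)
{
    assert(m > 0 && n > 0 && lda >= m);
    int rows = trans ? n : m;   /* length of y */
    int cols = trans ? m : n;   /* length of x */
    for (int i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < cols; ++j) {
            /* element (r, c) of A is stored at A[r + c * lda] */
            float a = trans ? A[j + (size_t)i * lda]
                            : A[i + (size_t)j * lda];
            acc += a * x[j];
        }
        y[i] = alpha * acc + beta * y[i];
    }
}
```

For the 2 x 3 matrix with rows (1, 2, 3) and (4, 5, 6), x = (1, 1, 1), y = (1, 1), alpha = 2 and beta = 3, the routine leaves y = (15, 33). A GPU result can be compared against such a reference element by element.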

Basic lib - CUFFT

CUFFT is the CUDA FFT library

Provides a simple interface for computing parallel FFT on an NVIDIA GPU

Allows users to leverage the floating-point power and parallelism of the GPU without having to develop a GPU-based FFT implementation

cufftPlan1d(), cufftPlan2d(), cufftPlan3d()

Create a 1D, 2D, or 3D FFT plan configuration for a specified signal size

Advanced lib – CULA and MAGMA

• CULA: GPU Accelerated Linear Algebra
  – provides LAPACK (Linear Algebra PACKage) functions on CUDA GPUs
• MAGMA: Matrix Algebra on GPU and Multicore Architectures
  – develops a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures and "Multicore+GPU" systems

Advanced lib - CULA functions

• Linear Equation Routines
  – Solve a general system of linear equations AX = B
• Orthogonal Factorizations
  – LQ, RQ factorization
• Least Squares Routines
• Symmetric and Non-Symmetric Eigenvalue Routines
• Singular Value Decomposition (SVD) Routines
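As a small illustration of what a linear equation routine computes, the sketch below solves A x = b for one right-hand side by Gaussian elimination with partial pivoting on the CPU. It is plain C and not the CULA API. The name ref_solve, its return convention, and the column-major layout are assumptions made for this example.

```c
#include <assert.h>

/* Solve A x = b for a dense n x n system by Gaussian elimination with
 * partial pivoting. A is column-major and is overwritten; b is overwritten
 * with the solution x. Returns 0 on success, -1 if a zero pivot is found.
 * ref_solve is a hypothetical helper, not a CULA or LAPACK routine. */
static int ref_solve(int n, double *A, double *b)
{
    assert(n > 0);
    for (int k = 0; k < n; ++k) {
        /* find the largest magnitude in column k at or below the diagonal */
        int p = k;
        double best = A[k + k * n] < 0 ? -A[k + k * n] : A[k + k * n];
        for (int i = k + 1; i < n; ++i) {
            double v = A[i + k * n] < 0 ? -A[i + k * n] : A[i + k * n];
            if (v > best) { best = v; p = i; }
        }
        if (best == 0.0)
            return -1;
        if (p != k) {               /* swap rows k and p of A and b */
            for (int j = 0; j < n; ++j) {
                double t = A[k + j * n];
                A[k + j * n] = A[p + j * n];
                A[p + j * n] = t;
            }
            double t = b[k];
            b[k] = b[p];
            b[p] = t;
        }
        for (int i = k + 1; i < n; ++i) {   /* eliminate below the pivot */
            double f = A[i + k * n] / A[k + k * n];
            for (int j = k; j < n; ++j)
                A[i + j * n] -= f * A[k + j * n];
            b[i] -= f * b[k];
        }
    }
    for (int i = n - 1; i >= 0; --i) {      /* back substitution */
        double s = b[i];
        for (int j = i + 1; j < n; ++j)
            s -= A[i + j * n] * b[j];
        b[i] = s / A[i + i * n];
    }
    return 0;
}
```

For the system 2x + y = 3, x + 3y = 5 the routine returns x = 0.8, y = 1.4 up to rounding. A GPU library would typically factor the matrix once and reuse the factors for many right-hand sides, which this sketch does not attempt.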

Advanced lib - MAGMA

• LAPACK on CUDA GPUs
  – LU, QR, and Cholesky factorizations in both real and complex arithmetic (single and double)
  – Linear solvers based on LU, QR, and Cholesky in real arithmetic (single and double)
  – Mixed-precision iterative refinement solvers based on LU, QR, and Cholesky in real arithmetic
  – Reduction to upper Hessenberg form in real arithmetic (single and double)
  – MAGMA BLAS in real arithmetic (single and double)

Advanced lib - VSIPL

VSIPL: Vector Signal Image Processing Library
– Generalized matrix product
– Fast FIR filtering
– Correlation
– Fast Fourier Transform
– QR decomposition
– Random number generation
– Elementwise arithmetic, logical, and comparison operators, linear algebra procedures

CUDA library Summary

• Basic vector or matrix computation: GPULib, CUBLAS, CUFFT
  – vector or matrix addition, subtraction, multiplication, and division
  – sin(), cos(), sort, dot product
• Libraries that can be used for signal processing: CULA/MAGMA, VSIPL
  – LU, QR, and Cholesky factorizations
  – SVD decomposition

CUDA Demo (FIR)

GPU: NVIDIA GeForce 8600 GT
CPU: Intel Duo CPU 2.33 GHz
Software: Visual Studio 2005

CUDA Demo (FIR)

Output NO    GPU Run Time (msec)    Memory Time (msec)    Total Time CPU + GPU    CPU Only Time (msec)
1000         0.312121               0.166641
10000        0.667264               0.284254
100000       4.210870               1.489784
1000000      39.460812              5.597150
10000000     391.816345             48.080204

CUDA Demo (FIR)

[Figure: FIR Performance. CPU vs. CPU+GPU for 1000, 10000, 100000, 1000000, and 10000000 outputs; vertical axis from 0 to 5000.]
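The CPU-only side of an FIR comparison like this one can be sketched as a direct-form filter loop. The code below is a minimal plain C illustration and is not the UMD demo code. The name fir_cpu and the zero-padding of samples before the start of the input are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR filter on the CPU:
 *   y[i] = sum over k of h[k] * x[i - k],  k = 0 .. taps-1,
 * where samples before x[0] are treated as zero.
 * fir_cpu is a hypothetical helper, not taken from the demo. */
static void fir_cpu(const float *x, size_t n,
                    const float *h, size_t taps, float *y)
{
    assert(taps > 0);
    for (size_t i = 0; i < n; ++i) {
        /* only taps that reach back to existing samples contribute */
        size_t kmax = (i + 1 < taps) ? i + 1 : taps;
        float acc = 0.0f;
        for (size_t k = 0; k < kmax; ++k)
            acc += h[k] * x[i - k];
        y[i] = acc;
    }
}
```

With x = (1, 2, 3, 4) and a two-tap averaging filter h = (0.5, 0.5), the output is (0.5, 1.5, 2.5, 3.5). Each output sample depends only on the input and the coefficients, not on other outputs, so the outer loop is the natural candidate for mapping one output to one GPU thread.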

Discuss and future work

• How to connect CUDA to the SSP re-hosting demo
• How to convert the sequentially executed code in a signal processing system to CUDA code
• How to translate the XML code to CUDA code to generate the CUDA input

Reference

• CUDA Zone: http://www.nvidia.com/object/cuda_home_new.html
• http://en.wikipedia.org/wiki/CUDA