
CUDA Specialized Libraries and Tools

CIS 665 – GPU Programming

CUDA Libraries and Tools

Overview

Parallel primitives: Scan, Reduce, Map, Sort
Specialized libraries: CUBLAS, CUFFT, HONEI, PyCUDA, Thrust, CUDPP, CULA
Development tools: Nexus, cuda-gdb, memcheck, Visual Profiler

CUDA Libraries and Tools

Overarching theme: PROGRAMMER PRODUCTIVITY !!!

Programmer productivity: rapidly develop complex applications, leverage parallel primitives, and encourage generic programming. Don't reinvent the wheel (e.g. one reduction to rule them all). High performance with minimal programmer effort.


REGISTER as a CUDA Developer

March 4 release! Can use in final projects!


References

Scan Primitives for GPU Computing, Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens
Presentation on scan primitives by Gary J. Katz, based on the article Parallel Prefix Sum (Scan) with CUDA, Harris, Sengupta, and Owens (GPU Gems 3, Chapter 39)
Supercomputing 2009 CUDA Tools presentation, Cohen
Thrust Introduction, Nathan Bell


CUDA Libraries and Tools


Parallel Primitives

Scan (Parallel Prefix Sum), Map, Reduce, Sort, ...

(Build algorithms in terms of primitives)


Parallel Primitives: Scan Prefix-Sum Example

in:  3 1 7 0 4 1 6 3
out: 0 3 4 11 11 15 16 22

Trivial Sequential Implementation

// Exclusive scan: out[i] holds the sum of in[0..i-1]; out[0] is the identity (0)
void scan(int* in, int* out, int n)
{
    out[0] = 0;
    for (int i = 1; i < n; i++)
        out[i] = in[i-1] + out[i-1];
}


Parallel Primitives: Scan

Definition: the scan operation takes a binary associative operator ⊕ with identity I and an array of n elements [a0, a1, ..., an-1], and returns the array [I, a0, (a0 ⊕ a1), ..., (a0 ⊕ a1 ⊕ ... ⊕ an-2)].

Types

inclusive, exclusive, forward, backward


Parallel Primitives

The all-prefix-sums operation on an array of data is commonly known as scan. The scan just defined is an exclusive scan, because each element j of the result is the sum of all elements up to but not including j in the input array. In an inclusive scan, all elements including j are summed. An exclusive scan can be generated from an inclusive scan by shifting the resulting array right by one element and inserting the identity. An inclusive scan can be generated from an exclusive scan by shifting the resulting array left and inserting at the end the sum of the last element of the scan and the last element of the input array.
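For the + operator with identity 0, both conversions are simple shifts. A sequential host-side sketch (names are illustrative, not from the slides):

// inclusive -> exclusive: shift right one element, insert the identity at the front
void inclusive_to_exclusive(const int* inc, int* exc, int n)
{
    exc[0] = 0;
    for (int i = 1; i < n; i++)
        exc[i] = inc[i-1];
}

// exclusive -> inclusive: shift left one element, then append the last
// exclusive value plus the last input element
void exclusive_to_inclusive(const int* exc, const int* in, int* inc, int n)
{
    for (int i = 0; i < n-1; i++)
        inc[i] = exc[i+1];
    inc[n-1] = exc[n-1] + in[n-1];
}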


Parallel Primitives

Exclusive scan: in: 3 1 7 0 4 1 6 3  →  out: 0 3 4 11 11 15 16 22
Inclusive scan: in: 3 1 7 0 4 1 6 3  →  out: 3 4 11 11 15 16 22 25


Parallel Primitives

For (d = 1; d < log 2 n; d++) for all k in parallel if( k >= 2 d ) x[out][k] = x[in][k – 2 d-1 ] + x[in][k] else x[out][k] = x[in][k] Complexity O(nlog 2 n) Not very work efficient!
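A host emulation of this double-buffered (ping-pong) scheme, assuming n is a power of two (an illustrative sketch, not the slides' kernel):

#include <vector>
#include <algorithm>

void naive_scan(int* x, int n)
{
    std::vector<int> in(x, x + n), out(n);
    for (int offset = 1; offset < n; offset *= 2)  // log2(n) passes; offset = 2^(d-1)
    {
        for (int k = 0; k < n; k++)                // this loop runs "in parallel" on the GPU
            out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];
        in.swap(out);                              // ping-pong the in/out buffers
    }
    std::copy(in.begin(), in.end(), x);            // result is an inclusive scan
}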


Parallel Primitives

Goal: a parallel scan that is O(n) instead of O(n log2 n). Solution: balanced trees. Build a binary tree on the input data and sweep it to and from the root.

A binary tree with n leaves has log2 n levels, and each level d has 2^d nodes. One add is performed per node, so a single traversal of the tree performs O(n) adds.


Parallel Primitives: O(n) unsegmented scan

Reduce/Up-Sweep:
for (d = 0; d < log2 n; d++)
    for all k = 0; k < n; k += 2^(d+1) in parallel
        x[k + 2^(d+1) - 1] = x[k + 2^d - 1] + x[k + 2^(d+1) - 1]

Down-Sweep:
x[n-1] = 0
for (d = log2 n - 1; d >= 0; d--)
    for all k = 0; k < n; k += 2^(d+1) in parallel
        t = x[k + 2^d - 1]
        x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
        x[k + 2^(d+1) - 1] = t + x[k + 2^(d+1) - 1]
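A host-side reference of the same up-sweep/down-sweep (Blelloch) scan, assuming n is a power of two (a sketch, not the slides' CUDA kernel). Each inner-loop iteration touches a disjoint pair of elements, which is exactly what makes it parallelizable:

#include <vector>

// In-place exclusive +-scan
void blelloch_scan(std::vector<int>& x)
{
    int n = (int)x.size();                        // assumed a power of two
    for (int d = 1; d < n; d <<= 1)               // up-sweep (reduce); d = 2^level
        for (int k = 0; k < n; k += 2*d)
            x[k + 2*d - 1] += x[k + d - 1];
    x[n-1] = 0;                                   // insert zero at the root
    for (int d = n/2; d >= 1; d >>= 1)            // down-sweep
        for (int k = 0; k < n; k += 2*d) {
            int t = x[k + d - 1];
            x[k + d - 1]   = x[k + 2*d - 1];      // left child gets parent's value
            x[k + 2*d - 1] = t + x[k + 2*d - 1];  // right child gets the partial sum
        }
}

On the example input [3 1 7 0 4 1 6 3] this produces [0 3 4 11 11 15 16 22], matching the exclusive scan shown earlier.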


Parallel Primitives Tree analogy

The tree we build is not an actual data structure, but a concept we use to determine what each thread does at each step of the traversal

[Figure: balanced-tree scan of x0..x7. The up-sweep builds the partial sums ∑(x0..x1), ∑(x0..x3), ∑(x4..x5), ∑(x0..x7) at internal nodes; zero is then inserted at the root, and the down-sweep produces the exclusive prefix sums 0, x0, ∑(x0..x1), ∑(x0..x2), ..., ∑(x0..x6).]

Parallel Primitives

Up-Sweep (Reduce): traverse the tree from leaves to root, computing partial sums at internal nodes. This is also known as a parallel reduction, because after this phase the root node (the last node in the array) holds the sum of all elements in the array.


Parallel Primitives

Down-Sweep: traverse back down the tree from the root, using the partial sums from the reduce phase to build the scan in place on the array. Start by inserting zero at the root; on each step, each node at the current level passes its own value to its left child, and the sum of its value and the former value of its left child to its right child.


Parallel Primitives: Segmented Scan

A convenient way to execute a scan independently over many sets of values. Inputs are a data vector and a flag vector, where a flag marks the first element of a segment.

About 3 times slower than unsegmented scan, but useful for building a broad variety of applications that are not possible with unsegmented scan alone.
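A sequential sketch of the semantics (illustrative; the parallel version builds the same result with the tree traversals above). The running sum simply restarts wherever a flag is set:

// Exclusive segmented +-scan: flags[i] == 1 marks the first element of a segment
void segmented_scan(const int* in, const int* flags, int* out, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        if (flags[i]) sum = 0;  // restart at each segment head
        out[i] = sum;
        sum += in[i];
    }
}

// e.g. in  = [3 1 7 | 0 4 1 | 6 3], flags = [1 0 0 1 0 0 1 0]
//      out = [0 3 4 | 0 0 4 | 0 6]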


Primitives built on scan

Enumerate: enumerate([t f f t f t t]) = [0 1 1 1 2 2 3]; an exclusive scan of the input vector.
Distribute (copy): distribute([a b c][d e]) = [a a a][d d]; an inclusive scan of the input vector.
Split and split-and-segment: split divides the input vector into two pieces, with all the elements marked false on the left side of the output vector and all the elements marked true on the right (see the sketch below).
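Split can be built from a single exclusive scan. A sequential reference sketch with illustrative names (the GPU version performs the scan and the scatter in parallel):

#include <vector>

// Stable split: elements whose flag is false move to the front of `out`,
// elements whose flag is true to the back, with relative order preserved.
void split(const std::vector<int>& in, const std::vector<bool>& flags,
           std::vector<int>& out)
{
    int n = (int)in.size();
    std::vector<int> f(n);
    int falses = 0;
    for (int i = 0; i < n; i++) {   // f = exclusive scan of the negated flags
        f[i] = falses;
        falses += !flags[i];
    }
    for (int i = 0; i < n; i++) {
        int t = i - f[i] + falses;  // destination if the flag is true
        out[flags[i] ? t : f[i]] = in[i];
    }
}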


Applications

Quicksort Sparse Matrix-Vector Multiply Tridiagonal Matrix Solvers and Fluid Simulation Radix Sort Stream Compaction Summed-Area Tables


Quicksort


Radix Sort Using Scan

Input keys:                 100 111 010 110 011 101 001 000
b (least significant bit):    0   1   0   0   1   1   1   0
e (1 for false sort keys):    1   0   1   1   0   0   0   1
f (exclusive scan of the 1s): 0   1   1   2   3   3   3   3
Total falses = e[n-1] + f[n-1] = 1 + 3 = 4
t = index - f + Total falses: 4   4   5   5   5   6   7   8
d = b ? t : f:                0   4   1   2   5   6   7   3
Scatter input using d as the scatter address:
                            100 010 110 000 111 011 101 001
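A sequential sketch of one pass over a given bit, following the e/f/t/d recipe above (illustrative; the GPU version computes e and f with a parallel scan and performs the scatter in parallel):

#include <vector>

void radix_pass(const std::vector<unsigned>& in, std::vector<unsigned>& out, int bit)
{
    int n = (int)in.size();
    std::vector<int> f(n);
    int totalFalses = 0;
    for (int i = 0; i < n; i++) {              // f = exclusive scan of e
        f[i] = totalFalses;
        totalFalses += !((in[i] >> bit) & 1);  // e[i] = 1 for false sort keys
    }
    for (int i = 0; i < n; i++) {
        bool b = (in[i] >> bit) & 1;
        int t = i - f[i] + totalFalses;        // destination for true keys
        out[b ? t : f[i]] = in[i];             // scatter using d = b ? t : f
    }
}

// A full sort calls radix_pass once per bit, least significant bit first.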

CUDA Specialized Libraries


CUDA Specialized Libraries: Thrust

Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity. Develop high-performance applications rapidly with Thrust!

“Standard Template Library for CUDA”: heavy use of C++ templates for efficiency.
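For example, sorting a large array on the GPU is a complete program in a few lines (a minimal sketch built from documented Thrust calls; the 2^20-element size is illustrative):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main(void)
{
    // generate 1M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device and sort it there
    thrust::device_vector<int> d_vec = h_vec;
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer sorted data back to the host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
    return 0;
}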


Facts and Figures

Thrust v1.0

Open source (Apache license), 1,100+ downloads
Development: 460+ unit tests, 25+ compiler bugs reported, 35k lines of code (including whitespace & comments)
Uses the CUDA Runtime API (essential for template generation)


Containers

Make common operations concise and readable: hides cudaMalloc & cudaMemcpy.

// allocate host vector with two elements

thrust::host_vector<int> h_vec(2);

// copy host vector to device

thrust::device_vector<int> d_vec = h_vec;

// manipulate device values from the host

d_vec[0] = 13;
d_vec[1] = 27;
std::cout << "sum: " << d_vec[0] + d_vec[1] << std::endl;


Containers

Compatible with STL containers (vector, list, map, ...), which eases integration.

// list container on host

std::list<int> h_list;
h_list.push_back(13);
h_list.push_back(27);

// copy list to device vector

thrust::device_vector<int> d_vec(h_list.size());
thrust::copy(h_list.begin(), h_list.end(), d_vec.begin());

// alternative method

thrust::device_vector<int> d_vec(h_list.begin(), h_list.end());


Iterators

Sequences defined by pair of iterators

// allocate device vector
thrust::device_vector<int> d_vec(4);

d_vec.begin(); // returns iterator at first element of d_vec
d_vec.end();   // returns iterator one past the last element of d_vec

// the [begin, end) pair defines a sequence of 4 elements


Iterators

Iterators act like pointers

// allocate device vector
thrust::device_vector<int> d_vec(4);

thrust::device_vector<int>::iterator begin = d_vec.begin();
thrust::device_vector<int>::iterator end   = d_vec.end();

int length = end - begin; // compute size of sequence [begin, end)

end = d_vec.begin() + 3;  // define a sequence of 3 elements


Iterators

Use iterators like pointers

// allocate device vector
thrust::device_vector<int> d_vec(4);
thrust::device_vector<int>::iterator begin = d_vec.begin();

*begin = 13;       // same as d_vec[0] = 13;
int temp = *begin; // same as temp = d_vec[0];

begin++;           // advance iterator one position
*begin = 25;       // same as d_vec[1] = 25;

Iterators

Track memory space (host/device); guides algorithm dispatch.

// initialize random values on host

thrust:: host_vector < int > h_vec(1000); thrust:: generate (h_vec.begin(), h_vec.end(), rand);

// copy values to device

thrust:: device_vector < int > d_vec = h_vec;

// compute sum on host

int h_sum = thrust::reduce(h_vec.begin(), h_vec.end());

// compute sum on device

int d_sum = thrust::reduce(d_vec.begin(), d_vec.end());


Iterators

Convertible to raw pointers

// allocate device vector

thrust:: device_vector < int > d_vec(4);

// obtain raw pointer to device vector’s memory

int* ptr = thrust::raw_pointer_cast(&d_vec[0]);

// use ptr in a CUDA C kernel

my_kernel<<>>(N, ptr);

// Note: ptr cannot be dereferenced on the host!


Iterators

Wrap raw pointers with device_ptr

int N = 10;

// raw pointer to device memory
int* raw_ptr;
cudaMalloc((void**)&raw_ptr, N * sizeof(int));

// wrap raw pointer with a device_ptr
thrust::device_ptr<int> dev_ptr(raw_ptr);

// use device_ptr in thrust algorithms
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);

// access device memory through device_ptr
dev_ptr[0] = 1;

// free memory
cudaFree(raw_ptr);


Namespaces

C++ supports namespaces. Thrust uses the thrust namespace (thrust::device_vector, thrust::copy); the STL uses the std namespace (std::vector, std::list). Namespaces avoid collisions: thrust::sort() vs. std::sort(). For brevity: using namespace thrust;


Recap

Containers: manage host & device memory, with automatic allocation and deallocation; simplify data transfers.
Iterators: behave like pointers, keep track of memory spaces, and are convertible to raw pointers.
Namespaces: avoid collisions.


C++ Background

Function templates

// function template to add numbers (type of T is variable)
template <typename T>
T add(T a, T b)
{
    return a + b;
}

// add integers
int x = 10; int y = 20; int z;
z = add<int>(x, y); // type of T explicitly specified
z = add(x, y);      // type of T determined automatically

// add floats
float x = 10.0f; float y = 20.0f; float z;
z = add<float>(x, y); // type of T explicitly specified
z = add(x, y);        // type of T determined automatically

C++ Background

Function objects (Functors)

// templated functor to add numbers
template <typename T>
class add
{
public:
    T operator()(T a, T b)
    {
        return a + b;
    }
};

int x = 10; int y = 20; int z;
add<int> func;  // create an add functor for T=int
z = func(x, y); // invoke functor on x and y

float x = 10; float y = 20; float z;
add<float> func; // create an add functor for T=float
z = func(x, y);  // invoke functor on x and y

Algorithms

Thrust provides many standard algorithms: transformations, reductions, prefix sums, sorting. The definitions are generic: they work over general types (built-in types such as int and float, as well as user-defined structures) and general operators (e.g. reduce with the plus operator, scan with the maximum operator).


Algorithms

General types and operators

// declare storage

device_vector<int>   i_vec = ...
device_vector<float> f_vec = ...

// sum of integers (equivalent calls)

reduce(i_vec.begin(), i_vec.end());
reduce(i_vec.begin(), i_vec.end(), 0, plus<int>());

// sum of floats (equivalent calls)

reduce(f_vec.begin(), f_vec.end());
reduce(f_vec.begin(), f_vec.end(), 0.0f, plus<float>());

// maximum of integers

reduce(i_vec.begin(), i_vec.end(), 0, maximum<int>());


Fancy Iterators

Behave like “normal” iterators: algorithms don't know the difference. Examples: constant_iterator, counting_iterator, transform_iterator, zip_iterator.


Fancy Iterators

constant_iterator: an infinite array filled with a constant value

// create iterators
constant_iterator<int> first(10);
constant_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 10
first[100] // returns 10

// sum of [first, last)
reduce(first, last); // returns 30 (i.e. 3 * 10)

Fancy Iterators

counting_iterator: an infinite array with sequential values

// create iterators
counting_iterator<int> first(10);
counting_iterator<int> last = first + 3;

first[0]   // returns 10
first[1]   // returns 11
first[100] // returns 110

// sum of [first, last)
reduce(first, last); // returns 33 (i.e. 10 + 11 + 12)

Fancy Iterators

transform_iterator: yields a transformed sequence; facilitates kernel fusion

[Figure: a function F(x) applied lazily over a sequence X, Y, Z]

Fancy Iterators

transform_iterator conserves memory capacity and bandwidth

// initialize vector
device_vector<int> vec(3);
vec[0] = 10; vec[1] = 20; vec[2] = 30;

// create iterators (types omitted)
first = make_transform_iterator(vec.begin(), negate<int>());
last  = make_transform_iterator(vec.end(),   negate<int>());

first[0] // returns -10
first[1] // returns -20
first[2] // returns -30

// sum of [first, last)
reduce(first, last); // returns -60 (i.e. -10 + -20 + -30)

Structure of Arrays (SoA)

Array of Structures (AoS): often does not obey coalescing rules (device_vector<float3>).
Structure of Arrays (SoA): obeys coalescing rules; components stored in separate arrays (device_vector<float> x, y, z;).
Example: rotating 3D vectors; SoA is 2.8x faster.


Structure of Arrays (SoA)

struct rotate_float3
{
    __host__ __device__
    float3 operator()(float3 v)
    {
        float x = v.x;
        float y = v.y;
        float z = v.z;
        float rx = 0.36f*x + 0.48f*y + -0.80f*z;
        float ry =-0.80f*x + 0.60f*y +  0.00f*z;
        float rz = 0.48f*x + 0.64f*y +  0.60f*z;
        return make_float3(rx, ry, rz);
    }
};

...
device_vector<float3> vec(N);
transform(vec.begin(), vec.end(), vec.begin(), rotate_float3());


Structure of Arrays (SoA)

struct rotate_tuple
{
    __host__ __device__
    tuple<float, float, float> operator()(tuple<float, float, float> v)
    {
        float x = get<0>(v);
        float y = get<1>(v);
        float z = get<2>(v);
        float rx = 0.36f*x + 0.48f*y + -0.80f*z;
        float ry =-0.80f*x + 0.60f*y +  0.00f*z;
        float rz = 0.48f*x + 0.64f*y +  0.60f*z;
        return make_tuple(rx, ry, rz);
    }
};

...
device_vector<float> x(N), y(N), z(N);
transform(make_zip_iterator(make_tuple(x.begin(), y.begin(), z.begin())),
          make_zip_iterator(make_tuple(x.end(),   y.end(),   z.end())),
          make_zip_iterator(make_tuple(x.begin(), y.begin(), z.begin())),
          rotate_tuple());


CUDA Specialized Libraries: PyCUDA

PyCUDA lets you access NVIDIA's CUDA parallel computation API from Python.

PyCUDA


PyCUDA - Differences

Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code. PyCUDA knows about dependencies, too, so (for example) it won't detach from a context before all memory allocated in it is also freed.
Convenience. Abstractions like pycuda.driver.SourceModule and pycuda.gpuarray.GPUArray make CUDA programming even more convenient than with NVIDIA's C-based runtime.
Completeness. PyCUDA puts the full power of CUDA's driver API at your disposal, if you wish.
Automatic error checking. All CUDA errors are automatically translated into Python exceptions.
Speed. PyCUDA's base layer is written in C++, so all the niceties above are virtually free.


PyCUDA - Example


CUDA Specialized Libraries: CUDPP

CUDPP: CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort, and parallel reduction.


CUDPP – Design Goals

Performance: CUDPP aims to provide best-of-class performance for simple primitives. Modularity: primitives can easily be included in other applications; CUDPP is provided as a library that other applications can link against.

CUDPP calls run on the GPU on GPU data. Thus they can be used as standalone calls on the GPU (on GPU data initialized by the calling application) and, more importantly, as GPU components in larger CPU/GPU applications.

CUDPP is implemented as 4 layers:

The Public Interface is the external library interface, which is the intended entry point for most applications. The public interface calls into the Application-Level API.

The Application-Level API comprises functions callable from CPU code. These functions execute code jointly on the CPU (host) and the GPU by calling into the Kernel-Level API below them.

The Kernel-Level API comprises functions that run entirely on the GPU across an entire grid of thread blocks. These functions may call into the CTA-Level API below them.

The CTA-Level API comprises functions that run entirely on the GPU within a single Cooperative Thread Array (CTA, a.k.a. thread block). These are low-level functions that implement core data-parallel algorithms, typically by processing data within shared (CUDA __shared__) memory.

Programmers may use any of the lower three CUDPP layers in their own programs by building the source directly into their application. However, the typical usage of CUDPP is to link to the library and invoke functions in the CUDPP Public Interface, as in the simpleCUDPP, satGL, and cudpp_testrig application examples included in the CUDPP distribution.

CUDPP

CUDPP_DLL CUDPPResult cudppSparseMatrixVectorMultiply(
    CUDPPHandle sparseMatrixHandle,
    void*       d_y,
    const void* d_x);

Perform matrix-vector multiply y = A*x for arbitrary sparse matrix A and vector x.


CUDPP - Example

CUDPPScanConfig config;
config.direction      = CUDPP_SCAN_FORWARD;
config.exclusivity    = CUDPP_SCAN_EXCLUSIVE;
config.op             = CUDPP_ADD;
config.datatype       = CUDPP_FLOAT;
config.maxNumElements = numElements;
config.maxNumRows     = 1;
config.rowPitch       = 0;

cudppInitializeScan(&config);
cudppScan(d_odata, d_idata, numElements, &config);


CUDPP + Thrust

To put it simply, CUDPP's interface is optimized for performance while Thrust is oriented towards productivity.

int main(void)
{
    CUT_DEVICE_INIT();
    unsigned int numElements = 32768;

    // allocate host memory and initialize it
    thrust::host_vector<float> h_idata(numElements);
    thrust::generate(h_idata.begin(), h_idata.end(), rand);

    // set up plan
    CUDPPConfiguration config;
    config.op        = CUDPP_ADD;
    config.datatype  = CUDPP_FLOAT;
    config.algorithm = CUDPP_SCAN;
    config.options   = CUDPP_OPTION_FORWARD | CUDPP_OPTION_EXCLUSIVE;

    CUDPPHandle scanplan = 0;
    CUDPPResult result = cudppPlan(&scanplan, config, numElements, 1, 0);
    if (CUDPP_SUCCESS != result)
    {
        printf("Error creating CUDPPPlan\n");
        exit(-1);
    }

    // run the scan (d_idata/d_odata are thrust::device_vectors; their
    // declarations were elided on the slide)
    cudppScan(scanplan,
              thrust::raw_pointer_cast(&d_odata[0]),
              thrust::raw_pointer_cast(&d_idata[0]),
              numElements);

CUDA Specialized Libraries: CUBLAS

CUBLAS: CUDA-based Basic Linear Algebra Subprograms. Provides routines such as SAXPY and the building blocks for conjugate gradient and other linear solvers.

3D reconstruction of planetary nebulae example.
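For example, a SAXPY (y = alpha*x + y) sketch using the legacy, pre-v2 CUBLAS API that this generation of the toolkit shipped (the helper name gpu_saxpy and the host arrays are illustrative):

#include <cublas.h>

// y = alpha*x + y on the GPU, via the legacy CUBLAS interface
void gpu_saxpy(int n, float alpha, const float* h_x, float* h_y)
{
    float *d_x, *d_y;
    cublasInit();                                       // initialize CUBLAS
    cublasAlloc(n, sizeof(float), (void**)&d_x);
    cublasAlloc(n, sizeof(float), (void**)&d_y);
    cublasSetVector(n, sizeof(float), h_x, 1, d_x, 1);  // host -> device
    cublasSetVector(n, sizeof(float), h_y, 1, d_y, 1);
    cublasSaxpy(n, alpha, d_x, 1, d_y, 1);              // the actual BLAS call
    cublasGetVector(n, sizeof(float), d_y, 1, h_y, 1);  // device -> host
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();
}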


CUBLAS


CUBLAS Features


CUBLAS: Performance – CPU vs GPU



CUBLAS

GPU variant 100 times faster than the CPU version. Matrix size is limited by graphics card memory and texture size.

Although taking advantage of sparse matrices would help reduce memory consumption, sparse matrix storage is not implemented by CUBLAS.


CUDA Specialized Libraries: CUFFT

CUDA-based Fast Fourier Transform library.

The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets. It is one of the most important and widely used numerical algorithms, with applications that include computational physics and general signal processing.


CUFFT


CUFFT

No. of elements<8192 slower than fftw >8192, 5x speedup over threaded fftw and 10x over serial fftw.


CUFFT: Example
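A minimal 1D complex-to-complex transform using the standard CUFFT calls (a sketch; the size NX and the in-place transform are illustrative choices):

#include <cufft.h>
#include <cuda_runtime.h>

#define NX 256

void fft_example(void)
{
    cufftComplex* data;
    cudaMalloc((void**)&data, sizeof(cufftComplex) * NX);
    // ... fill `data` with input samples ...

    cufftHandle plan;
    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);           // 1D C2C plan, batch of 1
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // in-place forward FFT

    cufftDestroy(plan);
    cudaFree(data);
}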


CUFFT: Performance – CPU vs GPU


CUDA Specialized Libraries: MAGMA

Matrix Algebra on GPU and Multicore Architectures The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current "Multicore+GPU" systems.


CUDA Specialized Libraries: CULA

CULA is EM Photonics' GPU-accelerated numerical linear algebra library that contains a growing list of LAPACK functions. LAPACK stands for Linear Algebra PACKage. It is an industry standard computational library that has been in development for over 15 years and provides a large number of routines for factorization, decomposition, system solvers, and eigenvalue problems.
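As a usage sketch, solving a dense system A*x = b with the Sgesv routine (this assumes CULA's LAPACK-style Basic interface; the header name and integer typedefs are taken from that interface and should be treated as assumptions):

#include <cula.h>   // assumed CULA Basic header

// Solve A*x = b for an n-by-n matrix A (column-major, as in LAPACK).
// On return, b is overwritten with the solution x.
void solve_dense(int n, culaFloat* A, culaFloat* b)
{
    culaInt* ipiv = new culaInt[n];     // pivot indices, as in LAPACK
    culaInitialize();
    culaSgesv(n, 1, A, n, ipiv, b, n);  // LU factorization + solve, one right-hand side
    culaShutdown();
    delete[] ipiv;
}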


CUDA Specialized Libraries: HONEI

A collection of libraries for numerical computations targeting multiple processor architectures


HONEI

HONEI is an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor.

The most important frontend library is libhoneila, HONEI's linear algebra library. It provides templated container classes for different matrix and vector types.

The numerics and math library libhoneimath contains high performance kernels for iterative linear system solvers as well as other useful components like interpolation and approximation.


CUDA Development Tools


CUDA Development Tools : CUDA-gdb

A simple debugger integrated into gdb.
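Typical workflow (the standard cuda-gdb usage): compile with nvcc -g -G so that both host (-g) and device (-G) code carry debug symbols, then run the application under cuda-gdb to set breakpoints and single-step inside kernels, e.g. cuda-gdb ./myapp.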


CUDA-gdb


CUDA Development Tools : MemCheck

Track memory accesses


CUDA Development Tools : Visual Profiler

Profile your CUDA code


CUDA Visual Profiler


CUDA Visual Profiler


CUDA Development Tools : Parallel Nsight 1.5 (code-named Nexus)

IDE for GPU Computing on Windows: Code Named Nexus


Debugger


> Debug CUDA C/C++ and DirectCompute kernels directly on GPU hardware > Examine thousands of threads executing in parallel using the familiar Locals, Watch, Memory and Breakpoints windows in Visual Studio > View GPU memory directly using the standard Memory windows in Visual Studio > Use conditional breakpoints to quickly identify and correct errors in massively parallel code > Identify memory access violations using the CUDA C/C++ Memory Checker

Analyzer (Profiler)

> Capture CPU and GPU level events, including: API calls, kernel launches, memory transfers and custom application annotations > Single correlated timeline displays all captured events > Timeline inspection tools allow for the examination of workload dependencies > Filter and sort captured events using specialized reporting views > Profile CUDA kernels using GPU performance counters


Graphics Development

> Debug all shaders directly on GPU hardware > Examine shaders executing in parallel using the familiar Locals, Watch, Memory and Breakpoints windows in Visual Studio > View and interact at the source code level with all shaders loaded by the application > Identify shaders that affect any given primitive or pixel using conditional breakpoints > Instantly debug any shader or graphics application


QUESTIONS