Compilers, Parallel Computing, and Grid Computing


Workshop 9: General purpose computing using
GPUs: Developing a hands-on
undergraduate course on CUDA programming
Dr. Barry Wilkinson, University of North Carolina Charlotte
Dr. Yaohang Li, Old Dominion University
SIGCSE 2011 - The 42nd ACM Technical Symposium
on Computer Science Education
Wednesday March 9, 2011, 7:00 pm - 10:00 pm
SIGCSE 2011 Workshop 9 Session1.ppt © 2010 B. Wilkinson Modification date: Feb 23, 2011
Session 1
7:15 pm - 8:25 pm  Session 1: Basic CUDA programming
Presentation (about 35 minutes)
• Kernel calls, data movement, predefined variables, thread organization, code examples
Hands-on experience using remote GPU server (about 35 minutes)
CUDA Program
A CUDA program contains code to be executed on the CPU and code to be executed on the GPU in one source file (in simple cases), producing one executable when compiled.
A CUDA kernel is a routine to be executed on the GPU -- a SIMT code sequence.
Kernel code is regular C, except that one typically needs to use the thread ID in expressions to ensure each thread accesses different data.
When a kernel call is reached in the code for the first time, it will be launched onto the GPU.
CPU and GPU Memories
• Separate memories on CPU
(host) and GPU (device)*
Usually need to
• Explicitly transfer data from CPU memory to GPU memory for the GPU computation, and
• Explicitly transfer results from GPU memory back to CPU memory
[Figure: CPU with its main memory and GPU with its global memory, with a copy from CPU to GPU and a copy from GPU back to CPU]
* assuming a separate GPU card. Integrated systems might share some memory.
Basic CUDA program structure
int main (int argc, char **argv ) {
1. Allocate memory space in device (GPU) for data
2. Allocate memory space in host (CPU) for data
3. Copy data to GPU
4. Call “kernel” routine to execute on GPU
(with CUDA syntax that defines the number of threads and their physical structure)
5. Transfer results from GPU to CPU
6. Free memory space in device (GPU)
7. Free memory space in host (CPU)
return 0;
}
1. Allocating memory space in
“device” (GPU) for data
Use CUDA malloc routines:
int size = N * sizeof(int);        // space for N integers
int *devA, *devB, *devC;           // devA, devB, devC ptrs to device memory
cudaMalloc( (void**)&devA, size );
cudaMalloc( (void**)&devB, size );
cudaMalloc( (void**)&devC, size );
Derived from Jason Sanders, "Introduction to CUDA C" GPU technology conference, Sept. 20, 2010.
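In practice one might also check the value returned by cudaMalloc (type cudaError_t); a minimal sketch, assuming stdio.h and stdlib.h are included (the error-handling style here is our addition, not from the workshop slides):
cudaError_t err = cudaMalloc( (void**)&devA, size );
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));   // human-readable error text
    exit(1);
}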
2. Allocating memory space in
“host” (CPU) for data
Use regular C malloc routines:
int *a, *b, *c;
…
a = (int*)malloc(size);
b = (int*)malloc(size);
c = (int*)malloc(size);
or statically declare variables:
#define N 256
…
int a[N], b[N], c[N];
3. Transferring data from host
(CPU) to device (GPU)
Use CUDA routine cudaMemcpy
cudaMemcpy( devA, A, size, cudaMemcpyHostToDevice);
cudaMemcpy( devB, B, size, cudaMemcpyHostToDevice);
where devA and devB are pointers to the destinations in device memory, and
A and B are pointers to the source data in host memory
4. Declaring “kernel” routine to
execute on device (GPU)
CUDA introduces a <<<…>>> syntax addition to C for kernel
calls:
myKernel<<< n, m >>>(arg1, … );
<<< … >>> contains thread organization for this particular kernel
call in two parameters, n and m:
For now, we will set n = 1, which says one block, and m = N, which says N threads in this block.
arg1, … -- arguments to routine myKernel, typically pointers to device memory obtained previously from cudaMalloc.
Kernel Routine
Defined using CUDA specifier __global__
Example – Adding two vectors A and B
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) {  // Kernel definition
    int i = threadIdx.x;          // threadIdx.x -- CUDA structure that provides thread ID in block
    C[i] = A[i] + B[i];
}
int main() {
    …
    // allocate device memory &
    // copy data to device
    // device mem. ptrs devA, devB, devC
    vecAdd<<<1, N>>>(devA, devB, devC);           // one block of N threads
    …
}
Each thread performs one pair-wise addition:
Thread 0: devC[0] = devA[0] + devB[0];
Thread 1: devC[1] = devA[1] + devB[1];
Thread 2: devC[2] = devA[2] + devB[2];
. . .
Loosely derived from CUDA C programming guide, v 3.2 , 2010, NVIDIA
5. Transferring data from device
(GPU) to host (CPU)
Use CUDA routine cudaMemcpy
cudaMemcpy( C, devC, size, cudaMemcpyDeviceToHost);
where devC is a pointer to the source in device memory and C is a pointer to the destination in host memory.
Free memory space
In “device” (GPU) -- use the CUDA cudaFree routine:
cudaFree( devA );
cudaFree( devB );
cudaFree( devC );
In “host” (CPU), if the host memory was allocated with malloc, use the regular C free routine:
free( a );
free( b );
free( c );
Complete CUDA program
Adding two vectors, A and B
N elements in A and B, and N threads
(without code to load the arrays with data)
#define N 256
__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
int main (int argc, char **argv ) {
    int size = N * sizeof(int);
    int *a, *b, *c;                  // host pointers
    int *devA, *devB, *devC;         // device pointers
    cudaMalloc( (void**)&devA, size );
    cudaMalloc( (void**)&devB, size );
    cudaMalloc( (void**)&devC, size );
    a = (int*)malloc(size); b = (int*)malloc(size); c = (int*)malloc(size);
    cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy( devB, b, size, cudaMemcpyHostToDevice);
    vecAdd<<<1, N>>>(devA, devB, devC);
    cudaMemcpy( c, devC, size, cudaMemcpyDeviceToHost);
    cudaFree( devA ); cudaFree( devB ); cudaFree( devC );
    free( a ); free( b ); free( c );
    return (0);
}
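The program above omits the code to load the arrays; a minimal sketch of what that host code might look like (the test values a[i] = i, b[i] = 2*i and the printf check are our own illustration, assuming stdio.h is included):
for (int i = 0; i < N; i++) {          // fill host arrays before the cudaMemcpy calls to the device
    a[i] = i;
    b[i] = 2 * i;
}
…
for (int i = 0; i < N; i++)            // after devC has been copied back into c
    if (c[i] != a[i] + b[i]) printf("error at element %d\n", i);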
So far, the organization of threads is one block of N threads.
GPUs are actually organized to execute blocks of threads, with the blocks arranged in 1 or 2 dimensions – the collection of blocks is called a grid.
The threads within a block can themselves be organized in 1-D, 2-D, or 3-D.
CUDA SIMT Thread Structure
Allows flexibility and efficiency in processing 1-D, 2-D, and 3-D data on the GPU.
[Figure: a grid of thread blocks – the grid can have 1 or 2 dimensions and each block can have 1, 2, or 3 dimensions; the structure is linked to the internal organization of the GPU, and threads in one block execute together.]
CUDA C programming guide, v 3.2, 2010, NVIDIA
Device characteristics -- some limitations
NVIDIA defines “compute capabilities” 1.0, 1.1, …, each with its own limits and supported features.
Compute capability 1.0:
Maximum number of threads per block                        = 512
Maximum size of x- and y-dimension of a thread block       = 512
Maximum size of each dimension of a grid of thread blocks  = 65535
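These limits can also be queried at run time with cudaGetDeviceProperties; a minimal sketch compiled with nvcc (use of device 0 is our assumption, not from the slides):
#include <stdio.h>
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Compute capability %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block = %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions  = %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dimensions   = %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}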
Defining Grid/Block Structure
Need to provide each kernel call with values for two key structures:
• Number of blocks in each dimension
• Threads per block in each dimension
myKernel<<< B, T >>>(arg1, … );
B – a structure that defines the number of blocks in grid in each
dimension (1D or 2D).
T – a structure that defines the number of threads in a block in each
dimension (1D, 2D, or 3D).
1-D grid and/or 1-D blocks
If you want a 1-D structure, you can use an integer for B and T in:
myKernel<<< B, T >>>(arg1, … );
B – an integer defining a 1-D grid of that size
T – an integer defining a 1-D block of that size
Example
myKernel<<< 1, 100 >>>(arg1, … );
CUDA Built-in Variables
for a 1-D grid and 1-D block
threadIdx.x -- “thread index” within block in “x” dimension
blockIdx.x -- “block index” within grid in “x” dimension
blockDim.x -- “block dimension” in “x” dimension
(i.e. number of threads in a block in the x dimension)
Full global thread ID in x dimension can be computed by:
x = blockIdx.x * blockDim.x + threadIdx.x;
Example -- x direction
A 1-D grid and 1-D block
4 blocks, each having 8 threads
[Figure: 4 blocks (blockIdx.x = 0, 1, 2, 3), each with 8 threads (threadIdx.x = 0 … 7); gridDim = 4 x 1, blockDim = 8 x 1; the highlighted thread, threadIdx.x = 2 in block 3, has global ID 26]
Global thread ID = blockIdx.x * blockDim.x + threadIdx.x
                 = 3 * 8 + 2 = thread 26 with linear global addressing
Derived from Jason Sanders, "Introduction to CUDA
C" GPU technology conference, Sept. 20, 2010.
Code example with a 1-D grid and blocks
Vector addition
#define N 2048        // size of vectors
#define T 256          // number of threads per block
__global__ void vecAdd(int *A, int *B, int *C) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    C[i] = A[i] + B[i];
}
int main (int argc, char **argv ) {
    …
    vecAdd<<<N/T, T>>>(devA, devB, devC);    // assumes N/T is an integer
    …
    return (0);
}
Note: __global__ is a CUDA function qualifier (__ is two underscores); a __global__ function must return void.
N/T is the number of blocks needed to map each vector across the grid, one element of each vector per thread; N/T is assumed to be an integer (a common variation for other cases is sketched below).
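When N is not a multiple of T, a common variation (our sketch, not from the slides) rounds the number of blocks up and guards the kernel with a bounds test; the extra parameter n is our addition:
__global__ void vecAdd(int *A, int *B, int *C, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)                          // surplus threads in the last block do nothing
        C[i] = A[i] + B[i];
}
…
int blocks = (N + T - 1) / T;           // ceiling of N/T
vecAdd<<<blocks, T>>>(devA, devB, devC, N);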
Higher dimensional grids/blocks
1-D grid and 1-D block suitable for processing one dimensional data
Higher dimensional grids and blocks convenient for higher
dimensional data:
Processing 2-D arrays might use a two dimensional grid and two
dimensional block
Might need higher dimensions because of the limitation on the size of a block in each dimension.
CUDA provides built-in variables and structures to define the number of blocks in a grid in each dimension and the number of threads in a block in each dimension.
Built-in CUDA data types and structures
to define multidimensional structures
dim3 – can be considered essentially as a CUDA-defined structure of unsigned integers x, y, z, i.e. roughly: struct dim3 { unsigned int x, y, z; };
Grid/Block Sizes
dim3 gridDim -- grid dimensions, x and y (z not used).
Number of blocks in grid = gridDim.x * gridDim.y
dim3 blockDim -- size of block dimensions x, y, and z.
Number of threads in a block = blockDim.x * blockDim.y * blockDim.z
Example Initializing Values
To set dimensions, use for example:
dim3 grid(16, 16);     // Grid -- 16 x 16 blocks
dim3 block(32, 32);    // Block -- 32 x 32 threads
myKernel<<<grid, block>>>(...);
which sets:
gridDim.x = 16
gridDim.y = 16
blockDim.x = 32
blockDim.y = 32
blockDim.z = 1
(although you do not initialize the CUDA structure elements that way yourself)
CUDA Built-in Variables
for Grid/Block Indices
uint3 – can be considered essentially as a CUDA-defined structure of unsigned integers x, y, z, i.e. roughly: struct uint3 { unsigned int x, y, z; };
Block index within grid:
uint3 blockIdx -- blockIdx.x, blockIdx.y (z not used)
Thread index within block:
uint3 threadIdx -- threadIdx.x, threadIdx.y, threadIdx.z
2-D Grids and 2-D blocks
[Figure: a 2-D grid of 2-D blocks; a thread's global x index is blockIdx.x * blockDim.x + threadIdx.x and its global y index is blockIdx.y * blockDim.y + threadIdx.y]
Flattening arrays onto linear
memory
Generally memory is allocated dynamically on the device (GPU), and we cannot use two-dimensional indices (e.g. A[row][column]) to access the array as we might otherwise.
Need to know how the array is laid out in memory and then compute the distance of an element from the beginning of the array.
C uses row-major order -- rows are stored one after the other in memory, i.e. row 0, then row 1, etc.
Flattening an array
[Figure: a 2-D array with N columns, rows stored one after the other]
Array element a[row][column] = a[offset], where
offset = column + row * N
N is the number of columns in the array, so row * N skips over the rows stored before the element's row.
With one thread per array
element
int col = blockIdx.x*blockDim.x+threadIdx.x;
int row = blockIdx.y*blockDim.y+threadIdx.y;
int index = col + row * N; // thread ID
A[index] = …
CUDA version using 2-D grid and 2-D blocks
Adding two arrays where one thread handles
one element in each array
#define N 2048        // size of arrays
__global__ void addMatrix (int *a, int *b, int *c) {
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int index = col + row * N;
    if ( col < N && row < N ) c[index] = a[index] + b[index];
}
int main() {
...
dim3 dimBlock (16,16);
dim3 dimGrid (N/dimBlock.x, N/dimBlock.y);
addMatrix<<<dimGrid, dimBlock>>>(devA, devB, devC);
…
}
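On the host side the 2-D arrays are typically flattened in the same way; a minimal sketch of the allocation and copy (the variable names a and devA follow the earlier examples, and the details are our own illustration):
int size = N * N * sizeof(int);          // N x N array stored in row-major order
int *a = (int*)malloc(size);
…
a[row*N + col] = …;                      // host access to element (row, col)
cudaMalloc( (void**)&devA, size );
cudaMemcpy( devA, a, size, cudaMemcpyHostToDevice);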
Compiling CUDA programs
“nvcc”
NVIDIA provides nvcc -- the NVIDIA CUDA “compiler
driver”.
Will separate out code for host and for device
Regular C/C++ compiler used for host (needs to be
available)
Programmer simply uses nvcc instead of gcc/cc compiler
on a Linux system
Command line options include options for GPU features
Compiling code - Linux
Command line:
nvcc -O3 -o <exe> <source_file> -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart
where:
-O3                         optimization level, if you want optimized code
-I/usr/local/cuda/include   directory for #include files
-L/usr/local/cuda/lib       directory for libraries
-lcuda -lcudart             libraries to be linked
CUDA source file that includes device code has the extension .cu
nvcc separates code for CPU and for GPU and compiles code.
Need regular C compiler installed for CPU.
Make file convenient – see next.
See “The CUDA Compiler Driver NVCC” from NVIDIA for more details
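For instance, the vector-addition example might be compiled and run as follows (the file name vecadd.cu is our own choice):
nvcc -O3 -o vecadd vecadd.cu -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcuda -lcudart
./vecadd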
Very simple sample Make file
NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm

# A regular C program
prog1:
	cc -o prog1 prog1.c -lm

# A C program with X11 graphics
prog2:
	cc -I/usr/openwin/include -o prog2 prog2.c -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm

# A CUDA program
prog3:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o prog3 prog3.cu

# A CUDA program with X11 graphics
prog4:
	$(NVCC) $(NVCCFLAGS) $(LFLAGS) -I/usr/openwin/include -o prog4 prog4.cu -L/usr/openwin/lib -L/usr/X11R6/lib -lX11 -lm
Compilation process
nvcc “wrapper” divides code into host and device parts:
nvcc -o prog prog.cu -I/includepath -L/libpath
Host part compiled by regular C compiler (gcc).
Device part compiled by NVIDIA “ptxas” assembler.
The two compiled parts are combined into one executable.
[Figure: nvcc passes the device part to ptxas and the host part to gcc; the resulting object files are combined into a single executable]
The executable file is a “fat” binary with both host and device code.
Executing Program
Simply type the name of the executable created by nvcc:
./prog1
The file includes all the code for host and for device in a “fat binary” file.
Host code starts running.
When a device kernel is first encountered, the GPU code is physically sent to the GPU and the function is launched on the GPU.
Hence the first launch will be slow!!
The run-time environment (cudart) controls memcpy timing and synchronization.
Ways to measure time of
execution
Generally, instrument the code: measure the time at two places and take the difference.
Ways to measure time:
• C clock() or time() routines
• CUDA “events” (seems the best way)
• CUDA SDK timer
Timing GPU Execution with CUDA events
Code
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);                   // create event objects
cudaEventCreate(&stop);
cudaEventRecord(start, 0);                 // record start event
.
.                                          // <-- time period being measured
.
cudaEventRecord(stop, 0);                  // record end event
cudaEventSynchronize(stop);                // wait for all device work to complete
cudaEventElapsedTime(&elapsedTime, start, stop);  // time between events
cudaEventDestroy(start);                   // destroy start event
cudaEventDestroy(stop);                    // destroy stop event
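For example, the events might be placed around the earlier vecAdd launch; a minimal sketch (the printf report, in milliseconds, is our own addition, assuming stdio.h and the variables declared above):
cudaEventRecord(start, 0);
vecAdd<<<N/T, T>>>(devA, devB, devC);      // kernel being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                // ensure the kernel has finished
cudaEventElapsedTime(&elapsedTime, start, stop);
printf("Kernel time = %f ms\n", elapsedTime);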
Recording Events
cudaEventRecord(event1, 0) records an “event” into the default “stream” (0).
The device will record a timestamp for the event when it reaches that event in the stream, that is, after all preceding operations have completed.
(With the default stream 0, this means after all preceding operations in the CUDA context have completed.)
NOTE: This operation is asynchronous and may return before recording the event!
Making event actually recorded
cudaEventSynchronize(event) -- waits until the named event has actually been recorded.
The event is recorded when all work done by the threads prior to the specified event has completed.
(Not strictly necessary if there is a synchronous CUDA call in the code.)
Measuring time between two events
cudaEventElapsedTime(&time, event1, event2) will
return (pointer argument) the time elapsed between
two events, in milliseconds.
Resolution approx ½ millisecond.
Timing measured using GPU clock.
Questions