Getting started with GPU programming


Introduction to GPU Programming
Volodymyr (Vlad) Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)
Tutorial Goals
• Become familiar with NVIDIA GPU architecture
• Become familiar with the NVIDIA GPU application development flow
• Be able to write and run a simple NVIDIA GPU application in CUDA
Tutorial Outline
• Introduction (15 minutes)
  – Why use Graphics Processing Units (GPUs) for general-purpose computing
  – Modern GPU architecture
    • NVIDIA
  – GPU programming
    • Libraries, CUDA C, OpenCL, PGI x64+GPU
• Hands-on: getting started with NCSA GPU cluster (15 minutes)
  – Cluster architecture overview
  – How to login and check out a node
  – How to compile and run an existing application
• Hands-on: Anatomy of a GPU application (25 minutes)
  – Host side
  – Device side
  – CUDA programming model
• Hands-on: Porting matrix multiplier to GPU (25 minutes)
• More on CUDA programming (40 minutes)
Introduction
• Why use Graphics Processing Units (GPUs) for general-purpose computing
• Modern GPU architecture
  – NVIDIA
• GPU programming
  – Libraries
  – CUDA C
  – OpenCL
  – PGI x64+GPU
Why GPUs?
[Chart: raw performance trends, September 2002 to March 2008. Peak GFLOP/s of NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285 at roughly 1000 GFLOP/s) versus Intel CPUs (Xeon quad-core 3 GHz, far below). Graph is courtesy of NVIDIA.]
Why GPUs?
[Chart: memory bandwidth trends, September 2002 to March 2008. GByte/s of NVIDIA GPUs from the 5800 up through the GTX 285 (roughly 160 GB/s). Graph is courtesy of NVIDIA.]
GPU vs. CPU Silicon Use
[Figure: comparison of how GPU and CPU dies allocate silicon; the GPU devotes far more area to arithmetic units and far less to cache and control. Graph is courtesy of NVIDIA.]
NVIDIA GPU Architecture
• A scalable array of multithreaded Streaming Multiprocessors (SMs); each SM consists of
  – 8 Scalar Processor (SP) cores
  – 2 special function units for transcendentals
  – A multithreaded instruction unit
  – On-chip shared memory
• GDDR3 SDRAM
• PCIe interface
[Figure is courtesy of NVIDIA]
NVIDIA GeForce9400M G GPU
• 16 streaming processors arranged as 2 streaming multiprocessors
• At 0.8 GHz this provides
  – 54 GFLOPS in single-precision (SP)
• 128-bit interface to off-chip GDDR3 memory
  – 21 GB/s bandwidth
[Block diagram: one TPC containing a geometry controller, an SMC, and two SMs; each SM has an I cache, MT issue unit, C cache, 8 SPs, 2 SFUs, and shared memory; texture units with a texture L1 cache; a 128-bit interconnect to the L2/ROP partitions and off-chip DRAM.]
NVIDIA Tesla C1060 GPU
• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
  – 1 TFLOPS SP
  – 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
  – 102 GB/s bandwidth
[Block diagram: 10 TPCs (TPC 1 through TPC 10), each containing a geometry controller, an SMC, and three SMs; each SM has an I cache, MT issue unit, C cache, 8 SPs, 2 SFUs, and shared memory; texture units with a texture L1 cache; a 512-bit memory interconnect to the L2/ROP partitions and off-chip DRAM channels.]
NVIDIA Tesla S1070 Computing Server
• 4 T10 GPUs
[Block diagram: four Tesla GPUs, each with 4 GB of GDDR3 SDRAM, paired through two NVIDIA switches onto two PCIe x16 links; the enclosure also houses system monitoring, thermal management, and the power supply. Graph is courtesy of NVIDIA.]
GPU Use/Programming
• GPU libraries
  – NVIDIA’s CUDA BLAS and FFT libraries
  – Many 3rd party libraries
• Low-abstraction, lightweight GPU programming toolkits
  – CUDA C
  – OpenCL
• High-abstraction, compiler-based tools
  – PGI x64+GPU
Getting Started with NCSA GPU Cluster
• Cluster architecture overview
• How to login and check out a node
• How to compile and run an existing application

NCSA AC GPU Cluster
[Photo: the AC GPU cluster at NCSA.]
GPU Cluster Architecture
• Servers: 32
• Accelerator Units: 32
  – CPU cores: 128
  – GPUs: 128
[Diagram: the head node ac fronts compute nodes ac01 through ac32.]
GPU Cluster Node Architecture
• HP xw9400 workstation
  – 2216 AMD Opteron 2.4 GHz, dual socket, dual core
  – 8 GB DDR2
  – InfiniBand QDR
• S1070 1U GPU Computing Server
  – 1.3 GHz Tesla T10 processors
  – 4x4 GB GDDR3 SDRAM
[Diagram: the compute node (HP xw9400 with QDR InfiniBand) attaches to the Tesla S1070 over two PCIe x16 links; each PCIe interface feeds two T10 processors, each with its own DRAM.]
Accessing the GPU Cluster
• Use Secure Shell (SSH) client to access AC
– `ssh [email protected]` (User: gpu001 - gpu200; Password: CHiPS-09)
– You will see something like this printed out:
See machine details and a technical report at:
http://www.ncsa.uiuc.edu/Projects/GPUcluster/
Machine Description and HOW TO USE. See: /usr/local/share/ac.readme
CUDA wrapper readme: /usr/local/share/cuda_wrapper.readme
*IMPORTANT* If you are using multiple GPU devices per host, be sure to understand how
the cuda_wrapper changes this system!!
…
July 03, 2009
Nvidia compute exclusive mode made default. If this breaks your application,
"touch ~/FORCE_NORMAL" to create an override for all your jobs.
Questions? Contact Jeremy Enos [email protected]
[gpuXYZ@ac ~]$ _
Installing Tutorial Examples
• Run this sequence to retrieve and install
tutorial examples:
cd
cp /tmp/chips_tutorial.tgz .
tar -xvzf chips_tutorial.tgz
cd chips_tutorial
ls
src1 src2 src3
Accessing the GPU Cluster
[Diagram: laptops 1 through 30 ("You are here") connect to the head node ac, which fronts compute nodes ac01 through ac32.]
Requesting a Cluster Node for Interactive Use
• Run `qstat` to see what other users do, just for the fun of it
• Run `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive use
  – You will see something like this printed out:

qsub: waiting for job 64424.acm to start
qsub: job 64424.acm ready

[gpuXYZ@acAB ~]$ _
Requesting a Cluster Node
[Diagram: same topology; after qsub returns, "You are here" points at one of the compute nodes ac01 through ac32.]
Checking GPU Characteristics
• Run `deviceQuery`

CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "Tesla C1060"
  CUDA Capability Major revision number:     1
  CUDA Capability Minor revision number:     3
  Total amount of global memory:             4294705152 bytes
  Number of multiprocessors:                 30
  Number of cores:                           240
  Total amount of constant memory:           65536 bytes
  Total amount of shared memory per block:   16384 bytes
  …
  Clock rate:                                1.30 GHz
  Compute mode:                              Exclusive (only one host thread at a time can use this device)
Compiling and Running an Existing Application
• cd chips_tutorial/src1
  – vecadd.c – reference C implementation
  – vecadd.cu – CUDA implementation
• Compile & run the CPU version

gcc vecadd.c -o vecadd_cpu
./vecadd_cpu
Running CPU vecAdd for 16384 elements
C[0]=2147483648.00 ...

• Compile & run the GPU version

nvcc vecadd.cu -o vecadd_gpu
./vecadd_gpu
Running GPU vecAdd for 16384 elements
C[0]=2147483648.00 ...
nvcc
• Any source file containing CUDA C language extensions must be compiled with nvcc
• nvcc is a compiler driver that invokes many other tools to accomplish the job
• Basic nvcc usage
  – nvcc <filename>.cu [-o <executable>]
    • Builds release mode
  – nvcc -deviceemu <filename>.cu
    • Builds device emulation mode (all code runs on the CPU)
  – The -g flag builds debug mode for the gdb debugger
Anatomy of a GPU Application
• Host side
• Device side
• CUDA programming model
CPU-Only Version
// computational kernel
void vecAdd(int N, float* A, float* B, float* C) {
  for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
}

int main(int argc, char **argv)
{
  int N = 16384; // default vector size

  // memory allocation
  float *A = (float*)malloc(N * sizeof(float));
  float *B = (float*)malloc(N * sizeof(float));
  float *C = (float*)malloc(N * sizeof(float));

  vecAdd(N, A, B, C); // kernel invocation

  free(A); free(B); free(C); // memory de-allocation
}
Adding GPU support
int main(int argc, char **argv)
{
  int N = 16384; // default vector size

  float *A = (float*)malloc(N * sizeof(float));
  float *B = (float*)malloc(N * sizeof(float));
  float *C = (float*)malloc(N * sizeof(float));

  // memory allocation on the GPU card
  float *devPtrA, *devPtrB, *devPtrC;
  cudaMalloc((void**)&devPtrA, N * sizeof(float));
  cudaMalloc((void**)&devPtrB, N * sizeof(float));
  cudaMalloc((void**)&devPtrC, N * sizeof(float));

  // copy data from the CPU (host) memory to the GPU (device) memory
  cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);
Adding GPU support (continued)
  // kernel invocation
  vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

  // copy results from device memory to the host memory
  cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

  // device memory de-allocation
  cudaFree(devPtrA);
  cudaFree(devPtrB);
  cudaFree(devPtrC);

  free(A); free(B); free(C);
}
GPU Kernel
• CPU version
void vecAdd(int N, float* A, float* B, float* C)
{
for (int i = 0; i < N; i++)
C[i] = A[i] + B[i];
}
• GPU version
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
}
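Note that the launch vecAdd<<<N/512, 512>>> assumes N is an exact multiple of the block size (16384 = 32 x 512). For arbitrary N, a common pattern (shown here as a sketch, not part of the tutorial sources) is to round the grid size up and guard the array access:

__global__ void vecAdd(int N, float* A, float* B, float* C)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N)            // guard: the last block may have extra threads
    C[i] = A[i] + B[i];
}

// ceiling division ensures enough blocks to cover all N elements
int threads = 512;
int blocks = (N + threads - 1) / threads;
vecAdd<<<blocks, threads>>>(N, devPtrA, devPtrB, devPtrC);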
CUDA Programming Model
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID (threadID) that it uses to compute memory addresses and make control decisions

…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…

• Threads are arranged as a grid of thread blocks
  – Threads within a block have access to a segment of shared memory
[Diagram: a grid of Thread Block 0, Thread Block 1, …, Thread Block N-1, each with its own shared memory.]
Kernel Invocation Syntax
The <<<…>>> brackets specify the grid and thread block dimensionality:

vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);
// 32 = number of thread blocks in the grid, 512 = number of threads per block

int i = blockIdx.x * blockDim.x + threadIdx.x;
// blockIdx.x = block ID within the grid, blockDim.x = number of threads per block,
// threadIdx.x = thread ID within the thread block

[Diagram: a grid of Thread Block 0 through Thread Block N-1, each with shared memory.]
Mapping Threads to the Hardware
• Blocks of threads are transparently assigned to SMs
  – A block of threads executes on one SM & does not migrate
  – Several blocks can reside concurrently on one SM
[Diagram: a kernel grid of Blocks 0–7 runs on a 2-SM device as four waves of two blocks, and on a 4-SM device as two waves of four blocks. Each block can execute in any order relative to other blocks. Slide is courtesy of NVIDIA.]
CUDA Programming Model
• A kernel is executed as a grid of thread blocks
  – A grid of blocks can be 1- or 2-dimensional
  – Thread blocks can be 1-, 2-, or 3-dimensional
• Different kernels can have different grid/block configurations
• Threads from the same block have access to a shared memory and their execution can be synchronized
[Diagram: the host launches Kernel 1 on Grid 1, a 3x2 arrangement of blocks, and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 4x2x2 arrangement of threads. Slide is courtesy of NVIDIA.]
GPU Memory Hierarchy
[Diagram: on the host, the CPU and its DRAM connect through the chipset to the device; device DRAM holds local, global, constant, and texture memory; on the GPU itself, each multiprocessor has registers, shared memory, and constant and texture caches.]

Memory   | Location | Cached | Access | Scope                  | Lifetime
Register | On-chip  | N/A    | R/W    | One thread             | Thread
Local    | Off-chip | No     | R/W    | One thread             | Thread
Shared   | On-chip  | N/A    | R/W    | All threads in a block | Block
Global   | Off-chip | No     | R/W    | All threads + host     | Application
Constant | Off-chip | Yes    | R      | All threads + host     | Application
Texture  | Off-chip | Yes    | R      | All threads + host     | Application
Porting matrix multiplier to GPU
• cd ../chips_tutorial/src2
• Compile & run CPU version
icc -O3 mmult.c -o mmult
./mmult
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
...
msec = 2215478 GFLOPS = 0.969
// a = b * c
void mmult(float *a, float *b, float *c, int N)
{
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      for (int i = 0; i < N; i++)
        a[i+j*N] += b[i+k*N]*c[k+j*N];
}

void minit(float *a, float *b, float *c, int N)
{
  for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++) {
      a[i+N*j] = 0.0f;
      b[i+N*j] = 1.0f;
      c[i+N*j] = 1.0f;
    }
}

void mprint(float *a, int N, int M)
{
  for (int j = 0; j < M; j++)
  {
    for (int i = 0; i < M; i++)
      printf("%.2f ", a[i+N*j]);
    printf("...\n");
  }
  printf("...\n");
}

int main(int argc, char* argv[])
{
  int N = 1024;
  struct timeval t1, t2;
  long msec1, msec2;
  float flop, mflop, gflop;

  float *a = (float *)malloc(N*N*sizeof(float));
  float *b = (float *)malloc(N*N*sizeof(float));
  float *c = (float *)malloc(N*N*sizeof(float));

  minit(a, b, c, N);

  gettimeofday(&t1, NULL);
  mmult(a, b, c, N); // a = b * c
  gettimeofday(&t2, NULL);

  msec1 = t1.tv_sec * 1000000 + t1.tv_usec;
  msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
  msec2 -= msec1;
  flop = N*N*N*2.0f;
  mflop = flop / msec2;
  gflop = mflop / 1000.0f;
  printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);

  mprint(a, N, 5);

  free(a);
  free(b);
  free(c);
}
Matrix Representation in Memory
for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    for (k = 0; k < n; ++k)
      a[i+n*j] += b[i+n*k] * c[k+n*j];

– Matrices are stored in column-major order
– For reference, the jki-ordered version runs at 1.7 GFLOPS on a 3 GHz Intel Xeon (single core)
Map this code:

for (i = 0; i < n; ++i)
  for (j = 0; j < n; ++j)
    for (k = 0; k < n; ++k)
      a[i+n*j] += b[i+n*k] * c[k+n*j];

[Diagram: A = B * C with a 3x2 matrix B and a 2x3 matrix C producing a 3x3 matrix A; for example, a1,2 = b1,1*c1,2 + b1,2*c2,2.]

into this (logical) architecture:

[Diagram: a grid of thread blocks. blockIdx.x numbers the blocks (0, 1, 2), threadIdx.x numbers the threads within a block (0–5), blockDim.x is the number of threads per block (6), and blockIdx.x * blockDim.x + threadIdx.x gives each thread a unique global index (0–17).]
dim3 grid(1024/32, 1024);   // 32x1024 grid of thread blocks, indexed by (blockIdx.x, blockIdx.y)
dim3 threads(32);           // blocks of 32x1x1 threads, indexed by (threadIdx.x)

// 32 threads per block and 32 thread blocks in x cover index i;
// 1024 thread blocks in y cover index j
{
  int i = blockIdx.x*32 + threadIdx.x;
  int j = blockIdx.y;
  float sum = 0.0f;
  for (int k = 0; k < n; k++)
    sum += b[i+n*k] * c[k+n*j];
  a[i+n*j] = sum;
}
Kernel
Original CPU kernel:

void mmult(float *a, float *b, float *c, int N)
{
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < N; k++)
        a[i+j*N] += b[i+k*N]*c[k+j*N];
}

GPU kernel:

__global__
void mmult(float *a, float *b, float *c, int N)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y;
  float sum = 0.0f;
  for (int k = 0; k < N; k++)
    sum += b[i+N*k] * c[k+N*j];
  a[i+N*j] = sum;
}

dim3 dimGrid(32, 1024);
dim3 dimBlock(32);
mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);
int main(int argc, char* argv[])
{
  int N = 1024;
  struct timeval t1, t2;
  long msec1, msec2;
  float flop, mflop, gflop;

  float *a = (float *)malloc(N*N*sizeof(float));
  float *b = (float *)malloc(N*N*sizeof(float));
  float *c = (float *)malloc(N*N*sizeof(float));

  minit(a, b, c, N);

  // allocate device memory
  float *devPtrA, *devPtrB, *devPtrC;
  cudaMalloc((void**)&devPtrA, N*N*sizeof(float));
  cudaMalloc((void**)&devPtrB, N*N*sizeof(float));
  cudaMalloc((void**)&devPtrC, N*N*sizeof(float));

  // copy input arrays to the device memory
  cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);
  gettimeofday(&t1, NULL);
  msec1 = t1.tv_sec * 1000000 + t1.tv_usec;

  // define grid and thread block sizes
  dim3 dimGrid(32, 1024);
  dim3 dimBlock(32);

  // launch GPU kernel
  mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

  // check for errors
  cudaError_t err = cudaGetLastError();
  if (cudaSuccess != err)
  {
    fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }

  // wait until GPU kernel is done
  cudaThreadSynchronize();

  gettimeofday(&t2, NULL);
  msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
  // copy results to host
  cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost);

  mprint(a, N, 5);

  // free device memory
  cudaFree(devPtrA);
  cudaFree(devPtrB);
  cudaFree(devPtrC);

  free(a);
  free(b);
  free(c);

  msec2 -= msec1;
  flop = N*N*N*2.0f;
  mflop = flop / msec2;
  gflop = mflop / 1000.0f;
  printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
}
Porting matrix multiplier to GPU
• Compile & run GPU version
nvcc mmult.cu -o mmult
./mmult
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
1024.00 1024.00 1024.00 1024.00 1024.00 ...
...
msec = 91363 GFLOPS = 23.505
More on CUDA Programming
• Language extensions
  – Function type qualifiers
  – Variable type qualifiers
  – Execution configuration
  – Built-in variables
• Common runtime components
  – Built-in vector types
• Device runtime components
  – Intrinsic functions
  – Synchronization and memory fencing functions
  – Atomic functions
• Host runtime components (runtime API only)
  – Device management
  – Memory management
  – Error handling
  – Debugging in the device emulation mode
• Exercise
Function Type Qualifiers
Declaration                    | Executed on the: | Only callable from the:
__device__ float DeviceFunc()  | device           | device
__global__ void KernelFunc()   | device           | host
__host__ float HostFunc()      | host             | host

__device__ and __global__ functions do not support recursion, cannot declare static variables inside their body, and cannot have a variable number of arguments
__device__ functions cannot have their address taken
__host__ and __device__ qualifiers can be used together, in which case the function is compiled for both
__global__ and __host__ qualifiers cannot be used together
A __global__ function must have void return type, its execution configuration must be specified, and the call is asynchronous
Variable Type Qualifiers
Declaration                              | Memory   | Scope | Lifetime
__device__ int GlobalVar;                | global   | grid  | application
__device__ __shared__ int SharedVar;     | shared   | block | block
__device__ __constant__ int ConstantVar; | constant | grid  | application
volatile int GlobalVar or SharedVar;     |          |       |

__shared__ and __constant__ variables have implied static storage
__device__, __shared__ and __constant__ variables cannot be defined using the extern keyword
__device__ and __constant__ variables are only allowed at file scope
__constant__ variables cannot be assigned to from the device; they are initialized from the host only
__shared__ variables cannot have an initialization as part of their declaration
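A minimal sketch (not from the original slides) showing the three qualifiers in use together; the kernel name and array sizes are illustrative:

__device__ int GlobalVar;               // global memory, application lifetime
__constant__ float ConstantVar[16];     // constant memory, written from the host

__global__ void qualifierDemo(float *out)
{
  __shared__ float SharedVar[32];       // one copy per thread block

  int i = threadIdx.x;
  SharedVar[i] = ConstantVar[i % 16];   // reads go through the constant cache
  __syncthreads();                      // make shared-memory writes visible block-wide
  out[blockIdx.x * blockDim.x + i] = SharedVar[i];
}

// host side: __constant__ variables are initialized with cudaMemcpyToSymbol
// float vals[16] = {0};
// cudaMemcpyToSymbol(ConstantVar, vals, sizeof(vals));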
Execution Configuration
A function declared as

__global__ void kernel(float* param);

must be called like this:

kernel<<<Dg, Db, Ns, S>>>(param);

where
• Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched;
• Db (type dim3) specifies the dimension and size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block;
• optional Ns (type size_t) specifies the number of bytes of shared memory dynamically allocated per block for this call, in addition to the statically allocated memory;
• optional S (type cudaStream_t) specifies the stream associated with this kernel call.
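A sketch of a launch that uses the optional Ns argument; the kernel name and sizes here are illustrative. The kernel declares an unsized extern __shared__ array whose size is supplied at launch time:

__global__ void kernel(float* param)
{
  extern __shared__ float buffer[];   // sized by Ns at launch time
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  buffer[threadIdx.x] = param[i];     // stage data in dynamic shared memory
  __syncthreads();
  // ... operate on buffer ...
}

dim3 Dg(64);                        // 64 blocks in the grid
dim3 Db(128);                       // 128 threads per block
size_t Ns = 128 * sizeof(float);    // bytes of dynamic shared memory per block
kernel<<<Dg, Db, Ns>>>(param);      // S omitted: the default stream is used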
Built-in Variables
Variable  | Type  | Description
gridDim   | dim3  | dimensions of the grid
blockIdx  | uint3 | block index within the grid
blockDim  | dim3  | dimensions of the block
threadIdx | uint3 | thread index within the block
warpSize  | int   | warp size in threads

It is not allowed to take the address of any of the built-in variables
It is not allowed to assign values to any of the built-in variables
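The built-in variables combine to give each thread a unique global index; a sketch for a 2D grid of 2D blocks (the kernel name is illustrative):

__global__ void kernel2d(float* data, int width)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column index
  int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row index
  data[y * width + x] = 0.0f;
}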
Built-in Vector Types
Vector types derived from basic integer and float types:
• char1, char2, char3, char4
• uchar1, uchar2, uchar3, uchar4
• short1, short2, short3, short4
• ushort1, ushort2, ushort3, ushort4
• int1, int2, int3, int4
• uint1, uint2, uint3 (dim3), uint4
• long1, long2, long3, long4
• ulong1, ulong2, ulong3, ulong4
• longlong1, longlong2
• float1, float2, float3, float4
• double1, double2

They are all structures, like this:

typedef struct {
  float x, y, z, w;
} float4;

They all come with a constructor function of the form make_<type name>, e.g.,

int2 make_int2(int x, int y);
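A few illustrative uses of the vector types and their constructors (a sketch, not from the slides):

float4 v = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
float norm2 = v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w;  // members accessed as .x .y .z .w

int2 coord = make_int2(10, 20);
dim3 block(16, 16);   // dim3 is uint3 with unspecified components defaulting to 1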
Intrinsic Functions
• Supported on the device only
• Start with __, as in __sinf(x)
• May end with a rounding-mode suffix:
  – _rn (round-to-nearest-even rounding mode)
  – _rz (round-towards-zero rounding mode)
  – _ru (round-up rounding mode)
  – _rd (round-down rounding mode)
  as in __fadd_rn(x,y)
• There are mathematical (__log10f(x)), type conversion (__int2float_rn(x)), type casting (__int_as_float(x)), and bit manipulation (__ffs(x)) functions
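A short device-side sketch contrasting intrinsics with their standard counterparts (the kernel name is illustrative):

__global__ void intrinsicsDemo(float* out, float x, float y)
{
  out[0] = sinf(x);           // standard device math function, more accurate
  out[1] = __sinf(x);         // intrinsic: faster, reduced accuracy
  out[2] = __fadd_rn(x, y);   // addition with round-to-nearest-even
  out[3] = __log10f(x);       // fast base-10 logarithm
  out[4] = __int2float_rn(7); // type conversion with explicit rounding mode
}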
Synchronization and Memory Fencing Functions

void __threadfence()
  Waits until all global and shared memory accesses made by the calling thread become visible: to all threads in the device for global memory accesses, and to all threads in the thread block for shared memory accesses.

void __threadfence_block()
  Waits until all global and shared memory accesses made by the calling thread become visible to all threads in the thread block.

void __syncthreads()
  Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads become visible to all threads in the block.
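__syncthreads() is what makes shared-memory cooperation within a block safe. A sketch of a block-level sum reduction (assuming blockDim.x is 256 and a power of two; not part of the tutorial sources):

__global__ void blockSum(float* in, float* out)
{
  __shared__ float s[256];
  int tid = threadIdx.x;
  s[tid] = in[blockIdx.x * blockDim.x + tid];
  __syncthreads();                          // all shared-memory loads complete

  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (tid < stride)
      s[tid] += s[tid + stride];            // fold the upper half onto the lower half
    __syncthreads();                        // wait before the next halving step
  }
  if (tid == 0) out[blockIdx.x] = s[0];     // one partial sum per block
}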
Atomic Functions
Function               | Operation
atomicAdd()            | new = old + val
atomicSub()            | new = old - val
atomicExch()           | new = val
atomicMin()            | new = min(old, val)
atomicMax()            | new = max(old, val)
atomicInc()            | new = ((old >= val) ? 0 : (old+1))
atomicDec()            | new = (((old == 0) | (old > val)) ? val : (old-1))
atomicCAS()            | new = (old == compare ? val : old)
atomic{And, Or, Xor}() | new = {(old & val), (old | val), (old ^ val)}

An atomic function performs a read-modify-write atomic operation on one 32-bit or one 64-bit word residing in global or shared memory. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.
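A typical use is a histogram, where many threads may increment the same bin at once; a sketch (atomicAdd on global memory requires compute capability 1.1 or higher):

__global__ void histogram(const unsigned char* data, int n, unsigned int* bins)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    atomicAdd(&bins[data[i]], 1u);   // the read-modify-write cannot be interleaved
}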
CUDA APIs
• Higher-level API: the CUDA runtime API
  – myKernel<<<Dg, Db>>>((unsigned char*)devPtr, width, height, pitch);
• Low-level API: the CUDA driver API
  – cuModuleLoad( &module, binfile );
  – cuModuleGetFunction( &func, module, "mmkernel" );
  – …
  – cuParamSetv( func, 0, &args, 48 );
  – cuParamSetSize( func, 48 );
  – cuFuncSetBlockShape( func, ts[0], ts[1], 1 );
  – cuLaunchGrid( func, gs[0], gs[1] );
Device Management
Function                  | Description
cudaGetDeviceCount()      | Returns the number of compute-capable devices
cudaGetDeviceProperties() | Returns information on the compute device
cudaSetDevice()           | Sets device to be used for GPU execution
cudaGetDevice()           | Returns the device currently being used
cudaChooseDevice()        | Selects device that best matches given criteria
Device Management Example
void cudaDeviceInit() {
  int devCount, device;
  cudaGetDeviceCount(&devCount);
  if (devCount == 0) {
    printf("No CUDA capable devices detected.\n");
    exit(EXIT_FAILURE);
  }
  for (device = 0; device < devCount; device++) {
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, device);
    // If a device of compute capability >= 1.3 is found, use it
    if (props.major > 1 || (props.major == 1 && props.minor >= 3)) break;
  }
  if (device == devCount) {
    printf("No device above 1.2 compute capability detected.\n");
    exit(EXIT_FAILURE);
  }
  else cudaSetDevice(device);
}
Memory Management
Function          | Description
cudaMalloc()      | Allocates memory on the GPU
cudaMallocPitch() | Allocates memory on the GPU for 2D arrays; may pad the allocation to meet alignment requirements
cudaFree()        | Frees the memory allocated on the GPU
cudaMallocArray() | Allocates an array on the GPU
cudaFreeArray()   | Frees an array allocated on the GPU
cudaMallocHost()  | Allocates page-locked memory on the host
cudaFreeHost()    | Frees page-locked memory on the host
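A sketch showing page-locked (pinned) host memory, which typically speeds up host-device transfers; N is illustrative:

float *h_data, *d_data;
cudaMallocHost((void**)&h_data, N * sizeof(float));   // pinned, instead of malloc()
cudaMalloc((void**)&d_data, N * sizeof(float));

cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

cudaFree(d_data);
cudaFreeHost(h_data);   // pinned memory must be released with cudaFreeHost()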
More on Memory Alignment
cudaMalloc(&dev_a, m*n*sizeof(float));
[Diagram: a 3x3 matrix stored column after column with no padding; matrix columns are not aligned at a 64-bit boundary.]

cudaMallocPitch(&dev_a, &n, n*sizeof(float), m);
[Diagram: the same matrix with each column padded; matrix columns are aligned at a 64-bit boundary.]

n is the allocated (aligned) size for the first dimension (the pitch), given the requested sizes of the two dimensions.
Memory Management Example
// host code (width and height added as kernel arguments so the device code is self-contained)
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
myKernel<<<100, 192>>>(devPtr, pitch, width, height);

// device code
__global__ void myKernel(float* devPtr, int pitch, int width, int height)
{
  for (int r = 0; r < height; ++r) {
    float* row = (float*)((char*)devPtr + r * pitch);   // step by pitch bytes per row
    for (int c = 0; c < width; ++c) {
      float element = row[c];
    }
  }
}
Memory Management
Function                 | Description
cudaMemset()             | Initializes or sets GPU memory to a value
cudaMemcpy()             | Copies data between host and the device
cudaMemcpyToArray(), cudaMemcpyFromArray(), cudaMemcpyArrayToArray() | Copy data to/from/between CUDA arrays
cudaMemcpyToSymbol(), cudaMemcpyFromSymbol() | Copy data to/from a CUDA symbol
cudaGetSymbolAddress()   | Finds the address associated with a CUDA symbol
cudaGetSymbolSize()      | Finds the size of the object associated with a CUDA symbol
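A sketch of the symbol-based calls; the variable names are illustrative:

__constant__ float coeffs[8];   // a device symbol in constant memory

// host side
float h_coeffs[8] = {0, 1, 2, 3, 4, 5, 6, 7};
cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));  // host -> constant memory

size_t bytes;
cudaGetSymbolSize(&bytes, coeffs);   // bytes == 8 * sizeof(float)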
Error Handling
All CUDA runtime API functions return an error code. The runtime maintains an error variable for each host thread that is overwritten by the error code every time an error occurs.

Function             | Description
cudaGetLastError()   | Returns the error variable and resets it to cudaSuccess
cudaGetErrorString() | Returns the message string for an error code

cudaError_t err = cudaGetLastError();
if (cudaSuccess != err) {
  fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
  exit(EXIT_FAILURE);
}
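Since every runtime call returns a cudaError_t, a common convenience (a sketch, not from the slides) is to wrap each call in a checking macro:

#define CUDA_CHECK(call)                                           \
  do {                                                             \
    cudaError_t e = (call);                                        \
    if (e != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",                 \
              __FILE__, __LINE__, cudaGetErrorString(e));          \
      exit(EXIT_FAILURE);                                          \
    }                                                              \
  } while (0)

// usage
CUDA_CHECK(cudaMalloc((void**)&devPtrA, N * sizeof(float)));
CUDA_CHECK(cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice));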
Exercise
• Port Mandelbrot set fractal renderer to CUDA
– Source is in ~/chips_tutorial/src3
• fractal.c – reference C implementation
• Makefile – make file
• fractal.cu.reference – CUDA implementation for reference
Reference C Implementation
void makefractal_cpu(unsigned char *image, int width, int height,
                     double xupper, double xlower, double yupper, double ylower)
{
  int x, y;
  double xinc = (xupper - xlower) / width;
  double yinc = (yupper - ylower) / height;

  for (y = 0; y < height; y++)
  {
    for (x = 0; x < width; x++)
    {
      image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
    }
  }
}
CUDA Kernel Implementation
__global__ void makefractal_gpu(unsigned char *image)
{
  // one thread block per pixel: the position and image size come from the launch grid
  int x = blockIdx.x;
  int y = blockIdx.y;
  int width = gridDim.x;
  int height = gridDim.y;
  double xupper = -0.74624, xlower = -0.74758, yupper = 0.10779, ylower = 0.10671;
  double xinc = (xupper - xlower) / width;
  double yinc = (yupper - ylower) / height;

  image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
}
Reference C Implementation
inline unsigned char iter(double a, double b)
{
  unsigned char i = 0;
  double c_x = 0, c_y = 0;
  double c_x_tmp, c_y_tmp;
  double D = 4.0;

  while ((c_x*c_x+c_y*c_y < D) && (i++ < 255))
  {
    c_x_tmp = c_x * c_x - c_y * c_y;
    c_y_tmp = 2 * c_y * c_x;
    c_x = a + c_x_tmp;
    c_y = b + c_y_tmp;
  }
  return i;
}

The Mandelbrot set is generated by iterating the complex function z² + c, where c is a constant:
z₁ = (z₀)² + c
z₂ = (z₁)² + c
z₃ = (z₂)² + c
and so forth. The sequence z₀, z₁, z₂, … is called the orbit of z₀ under iteration of z² + c. We stop iterating when the orbit starts to diverge, or when a maximum number of iterations is reached.
CUDA Kernel Implementation
inline __device__ unsigned char iter(double a, double b)
{
  unsigned char i = 0;
  double c_x = 0, c_y = 0;
  double c_x_tmp, c_y_tmp;
  double D = 4.0;

  while ((c_x*c_x+c_y*c_y < D) && (i++ < 255))
  {
    c_x_tmp = c_x * c_x - c_y * c_y;
    c_y_tmp = 2 * c_y * c_x;
    c_x = a + c_x_tmp;
    c_y = b + c_y_tmp;
  }
  return i;
}
Host Code
int width = 1024;
int height = 768;
unsigned char *image = NULL;
unsigned char *devImage;

image = (unsigned char*)malloc(width*height*sizeof(unsigned char));
cudaMalloc((void**)&devImage, width*height*sizeof(unsigned char));

dim3 dimGrid(width, height);
dim3 dimBlock(1);
makefractal_gpu<<<dimGrid, dimBlock>>>(devImage);

cudaMemcpy(image, devImage, width*height*sizeof(unsigned char), cudaMemcpyDeviceToHost);

free(image);
cudaFree(devImage);
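To inspect the result, one option (a sketch, not part of the tutorial sources) is to dump the image as a binary PGM file before the final free(image); width, height, and image are the host-code variables above:

FILE *f = fopen("fractal.pgm", "wb");
fprintf(f, "P5\n%d %d\n255\n", width, height);            // PGM header: 8-bit grayscale
fwrite(image, sizeof(unsigned char), width * height, f);  // one byte per pixel
fclose(f);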
Few Examples
• xupper = -0.74624, xlower = -0.74758, yupper = 0.10779, ylower = 0.10671
  – CPU time: 2.27 sec
  – GPU time: 0.29 sec
• xupper = -0.754534912109, xlower = -0.757077407837, yupper = 0.060144042969, ylower = 0.057710774740
  – CPU time: 1.5 sec
  – GPU time: 0.25 sec