SIGCSE 2011 - The 42nd ACM Technical Symposium on Computer
Science Education
Wednesday March 9, 2011, 7:00 pm - 10:00 pm
Workshop 9: General purpose computing using GPUs:
Developing a hands-on undergraduate course on
CUDA programming
Session 2
Dr. Barry Wilkinson, University of North Carolina Charlotte
Dr. Yaohang Li, Old Dominion University
1
SIGCSE 2011 Workshop 9 Session2.ppt Modification date: Feb 24, 2011 © 2011 B. Wilkinson
Session 2
In this session we will look at:
• Synchronization
• Device and host routines
• Memory hierarchy
In the hands-on part, we will try:
• Improving performance through memory coalescing
• Atomics
2
CUDA synchronization
CUDA provides a synchronization barrier routine for
those threads within each block
__syncthreads()
Used within a kernel.
Threads wait at this point until all threads in the block
have reached it and then they are all released.
NOTE: it only synchronizes with other threads in the same block.
3
__syncthreads() constraints
All threads must reach a particular __syncthreads()
routine or deadlock occurs.
Multiple __syncthreads() can be used in a kernel but each one is
unique.
Hence cannot have:
if ... __syncthreads();
else … __syncthreads();
and expect threads going through different paths to be synchronized.
They must all go through the if clause or all go through the else clause,
ideally (for efficiency) reaching the __syncthreads() at the same time.
4
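A minimal sketch of correct placement (the kernel name and the assumption that blockDim.x is 256 are hypothetical): every thread in the block reaches the same __syncthreads() call, which is not placed inside a divergent branch.

// Minimal sketch, assuming a hypothetical kernel that stages data in shared memory
// and is launched with 256 threads per block.
__global__ void stage_and_use(int *in, int *out) {
   __shared__ int buffer[256];                 // one element per thread in the block
   int tid = threadIdx.x;
   int gid = blockIdx.x * blockDim.x + tid;

   buffer[tid] = in[gid];                      // every thread writes its element
   __syncthreads();                            // barrier reached by all threads in the block

   out[gid] = buffer[(tid + 1) % blockDim.x];  // safe: all writes above have completed
}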
Global Kernel Barrier
Unfortunately no global kernel barrier routine available
Often we want to synchronize all threads in the computation
One possibility is to return from kernel and place a
synchronization point in the host code
Kernel launches are asynchronous, so a synchronous host call is needed,
such as cudaMemcpy, or an explicit synchronization using:
cudaThreadSynchronize()
which waits until all preceding commands in all “streams” have
completed.
5
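A minimal host-side sketch of using a kernel return as a global barrier (the kernel names phase1/phase2 and the launch parameters B and T are hypothetical):

// Minimal sketch, assuming hypothetical kernels phase1 and phase2 and grid/block sizes B and T.
phase1<<<B, T>>>(dev_a);      // asynchronous launch
cudaThreadSynchronize();      // host waits until all preceding device work has completed
phase2<<<B, T>>>(dev_a);      // every thread of phase1 has finished before phase2 starts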
Reasoning behind not having CUDA
global synchronization on GPU
Expensive to implement for a large number of GPU
processors
Allows blocks to be executed in any order on GPU
Can use different sizes of blocks depending upon the
resources of GPU – so-called “transparent scalability”
Note: Kernel launches efficiently implemented:
- Minimal hardware overhead, little software overhead
6
CUDA qualifiers for kernel routines
(placed before routine declaration; two underscores on each side)
Host = CPU, Device = GPU
__global__  indicates routine can only be called from host and only
executed on device (already seen). Must have void return type.
__device__  indicates routine can only be called from device and only
executed on device.
__host__  indicates routine can only be called from host and only
executed on host (generally only used in combination with __device__).
7
Note: a routine to be executed on the host cannot be called from the kernel.
__device__ routines
__device__ void swap (int *x, int *y) {     // executed on device, callable only from device code
   int temp;
   temp = *x;
   *x = *y;
   *y = temp;
}

__global__ void gpu_sort (int *a, int *b, int N) {
   …
   swap (&list[m], &list[j]);               // call to __device__ routine from the kernel
   …
}

int main (int argc, char *argv[]) {
   …
   gpu_sort<<< B, T >>>(dev_a, dev_b, N);
   …
   return 0;
}
8
Routines to be executed on
device
Generally cannot call C library routines from device!
However CUDA has math routines for device that are
equivalent to standard C math routines with the same
names, so in practice can call math routines such as
sin(x) – check CUDA docs before use.
Also CUDA has faster but less accurate GPU-only routines
implemented (with __-prefixed names)*
* See NVIDIA CUDA C Programming Guide for more details
9
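A minimal sketch contrasting the standard-named device math routine with its faster, less accurate __-prefixed intrinsic (the kernel and array names here are hypothetical):

// Minimal sketch, assuming float arrays d_in and d_out already allocated on the device.
__global__ void apply_sin(float *d_in, float *d_out, int n) {
   int i = blockIdx.x * blockDim.x + threadIdx.x;
   if (i < n) {
      d_out[i] = sinf(d_in[i]);        // standard-named device math routine
      // d_out[i] = __sinf(d_in[i]);   // faster, less accurate GPU-only intrinsic
   }
}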
GPU Memory Hierarchy
global memory, shared memory, and registers
[Diagram: host memory connected to device global, constant, and local memory;
each block within the grid has shared memory; each thread has registers.]
Registers: 32768 32-bit registers, divided across 32-thread "warps". The compiler
places variables declared in a kernel here if possible (not arrays).
Constant memory: for storing global constants.
Also a specialized read-only global memory called texture memory.
10
Host - Global Memory Connection
(separate GPU card)
[Diagram: host (CPU) connected to host memory over the memory bus, and to the
device (GPU) and its global memory over PCIe. Transfers are limited by the memory
and processor-memory connection bandwidths.]
Host memory bus: DDR 400, 3.2 GB/s
Processor interconnect: Hypertransport and Intel's Quickpath, currently 25.6 GB/s
Host-device link: PCIe x16, 4 GB/s; PCIe x16 Gen2, 8 GB/s peak
GPU bus: C2050 1030.4 GB/s; GTX 280 141.7 GB/s
Device global memory: GDDR5, 230 GB/s
For bulk transfers, after dynamic device memory allocation using cudaMalloc(),
use cudaMemcpy().
11
Declaring program variables for registers,
shared memory and global memory

Memory      Qualifier                   Scope     Lifetime
Registers   Automatic variables*        Thread    Kernel
            other than arrays
Local       Automatic array variables   Thread    Kernel
Shared      __shared__                  Block     Kernel
Global      __device__                  Grid      Application
Constant    __constant__                Grid      Application

*Automatic variables are allocated automatically when entering the scope of the
variable and deallocated when leaving that scope. In C, all variables declared
within a block are "automatic" by default; see
http://en.wikipedia.org/wiki/Automatic_variable
12
Device Global Memory
__device__ qualifier
For data available to all threads in the device.
Scope of grid and lifetime of application.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define N 1000
…
__device__ int A[N];     // declared outside function bodies

__global__ void kernel() {
   int tid = blockIdx.x * blockDim.x + threadIdx.x;
   A[tid] = …
   …
}

int main() {
   …
}

Can be loaded/read from the host if necessary using
cudaMemcpyToSymbol() and cudaMemcpyFromSymbol().
13
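A minimal host-side sketch of loading and reading the device symbol A above (the host array h_A and the launch parameters B and T are hypothetical):

// Minimal sketch, assuming the __device__ array A[N] above and a host array h_A[N].
int h_A[N];
…
cudaMemcpyToSymbol(A, h_A, N * sizeof(int));     // host -> device global variable A
kernel<<<B, T>>>();                              // B, T chosen by the caller (hypothetical)
cudaMemcpyFromSymbol(h_A, A, N * sizeof(int));   // device global variable A -> host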
Issues with using Global memory
• Long delays, slow
• Access congestion
• Cannot synchronize accesses
• Need to ensure no conflicts of accesses between threads
14
Shared Memory
Shared memory is on the GPU chip and very fast.
Separate data available to all threads in one block:
declared inside function bodies, with scope of block and lifetime of
the kernel call, so each block has its own array A[N].

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define N 1000
…
__global__ void kernel() {
   __shared__ int A[N];     // declared inside function body
   int tid = threadIdx.x;
   A[tid] = …
   …
}

int main() {
   …
}
15
Issues with Shared Memory
Host data first needs to be transferred to global memory and then to
shared memory.
Shared memory is not immediately synchronized after access.
Usually it is the writes that matter.
Use a CUDA memory fence routine before you read data that has
been altered.
Shared memory is very limited
(Fermi has up to 48 KB per GPU core (streaming multiprocessor), NOT per block).
Hence you may have to divide your data into "chunks".
16
Example uses of shared data
Where the data can be divided into independent parts:
Image processing
- Image can be divided into blocks and placed into
shared memory for processing
Block matrix multiplication
Sub-matrices can be stored in shared memory
17
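As an illustration of the block matrix multiplication idea, here is a minimal sketch of a tiled kernel that stages sub-matrices in shared memory (the kernel name, the TILE size, and the assumption that n is a multiple of TILE are all hypothetical):

#define TILE 16

// Minimal sketch: C = A * B for n x n matrices, assuming n is a multiple of TILE
// and the kernel is launched with dim3 blocks(n/TILE, n/TILE), dim3 threads(TILE, TILE).
__global__ void tiled_matmul(float *A, float *B, float *C, int n) {
   __shared__ float As[TILE][TILE];   // sub-matrix ("chunk") of A for this block
   __shared__ float Bs[TILE][TILE];   // sub-matrix ("chunk") of B for this block

   int row = blockIdx.y * TILE + threadIdx.y;
   int col = blockIdx.x * TILE + threadIdx.x;
   float sum = 0.0f;

   for (int m = 0; m < n / TILE; m++) {          // loop over tiles
      As[threadIdx.y][threadIdx.x] = A[row * n + (m * TILE + threadIdx.x)];
      Bs[threadIdx.y][threadIdx.x] = B[(m * TILE + threadIdx.y) * n + col];
      __syncthreads();                           // wait until both tiles are loaded

      for (int k = 0; k < TILE; k++)
         sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
      __syncthreads();                           // wait before overwriting the tiles
   }
   C[row * n + col] = sum;
}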
Session 2(a)
Hands-on session
Memory Coalescing
-- Combining separate memory accesses into one
combined access
Done by the GPU when the locations are sequential
locations in global memory banks.
18
Memory Banks
[Diagram: device (GPU) global memory organized as banks, with A[0] in memory
bank 1, A[1] in bank 2, A[2] in bank 3, A[3] in bank 4, …]
Consecutive locations are on successive memory banks.
The device can fetch A[0], A[1], A[2], A[3] … A[B-1] at
the same time, where there are B banks.
19
Ideally try to arrange for threads to access different memory
banks at the same time, and consecutive addresses.
A bad case would be:
• Thread 0 to access A[0], A[1], ... A[15]
• Thread 1 to access A[16], A[17], ... A[31]
• Thread 2 to access A[32], A[33], ... A[47]
… etc.
A good case would be:
• Thread 0 to access A[0], A[16], A[32], ...
• Thread 1 to access A[1], A[17], A[33], ...
• Thread 2 to access A[2], A[18], A[34], ...
… etc.
if there are 16 banks. Need to know that detail!
20
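A minimal sketch of the good pattern (the kernel name is hypothetical): each thread strides by the total number of threads, so at every step consecutive threads touch consecutive addresses and a warp's accesses can be combined.

// Minimal sketch, assuming partial[] has one entry per launched thread.
// At any given step thread 0 reads a[0], thread 1 reads a[1], thread 2 reads a[2], ...
__global__ void strided_sum(int *a, int *partial, int n) {
   int tid = blockIdx.x * blockDim.x + threadIdx.x;
   int nthreads = blockDim.x * gridDim.x;
   int sum = 0;
   for (int i = tid; i < n; i += nthreads)   // consecutive threads -> consecutive addresses
      sum += a[i];
   partial[tid] = sum;
}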
Warp
A "warp" in CUDA is a group of 32 threads that
operate in SIMT mode.
A "half warp" (16 threads) actually executes
simultaneously (on current GPUs).
Using knowledge of warps and how the memory is laid
out can improve code performance
21
Experiment
Simply load numbers into a two-dimensional array
Global threadID loaded into array element being accessed
so one can tell which thread accesses which location.
Loading could be done across rows or down columns.
The time of execution of each way is compared.
GPU structure -- one or more 2-D 32 x 32 blocks in a 2-D
grid.
22
One way
__global__ void gpu_Comput1 (int *h, int N, int T) {
   int col = threadIdx.x + blockDim.x * blockIdx.x;
   int row = threadIdx.y + blockDim.y * blockIdx.y;
   int threadID = col + row * N;      // thread ID
   int index = col + row * N;         // array index
   for (int t = 0; t < T; t++)        // loop to reduce other time effects
      h[index] = threadID;            // load array with global thread ID
}
The alternate way is part of the hands-on tasks.
23
Results: a grid of one block, 1000000 iterations, 32 x 32 array.
Speedup = 17.16
24
Cache memory
Recent GPUs have L1 and L2 cache memory, but
apparently without cache coherence, so it is up to the
programmer to ensure that.
Compute capability 1.3 Teslas do not have cache
memory.
Compute capability 2.0 Fermis have L1/L2 caches.
25
Session 2(b) hands-on session
Accessing Shared Data
Accessing shared data needs careful control.
Consider two threads, each of which is to add one to a shared data
item, x, with the instruction x = x + 1;

Time   Thread 1          Thread 2
       Read x
                         Read x
       Compute x + 1
                         Compute x + 1
       Write to x
                         Write to x

Different interleavings can produce different answers.
26
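A minimal sketch of the problem (the kernel and variable names are hypothetical): a plain, non-atomic increment of a shared counter performs exactly the read/compute/write sequence above, so updates from different threads can be lost.

// Minimal sketch of the race: many threads increment the same location without
// atomic protection, so some increments are silently lost and the final count
// can be less than the number of threads launched.
__device__ int counter = 0;

__global__ void racy_increment() {
   counter = counter + 1;     // read, add one, write back - interleavings can overlap
}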
Atomic Operations
CUDA atomic operations perform a read-modify-write operation
on a word in global or shared memory without interference from
other threads.
This ensures each thread has exclusive access to the shared
variable while completing its operation (if a write operation is involved).
Example CUDA Atomic
int atomicAdd(int* address, int val);
Adds val to memory location given by address, atomically (atomic
read-modify-write operation)
27
Example code
Note the use of __device__ for a global variable in the device.

#include <stdio.h>
#include <cuda.h>
#include <stdlib.h>

__device__ int gpu_Count = 0;    // global variable in device

__global__ void gpu_Counter() {
   atomicAdd(&gpu_Count, 1);     // could use atomicInc()
}

int main(void) {
   int cpu_Count;
   …
   gpu_Counter<<<B,T>>>();
   cudaMemcpyFromSymbol(&cpu_Count, "gpu_Count",
      sizeof(int), 0, cudaMemcpyDeviceToHost);   // read gpu_Count; this call is
                                                 // synchronous, so cudaThreadSynchronize()
                                                 // is not needed
   printf("Count = %d\n", cpu_Count);
   …
   return 0;
}
28
Compilation Notes
Atomics are only implemented on compute capability 1.1 and above;
extra features, such as floating-point add, appear on later versions.
Previous code will need to be compiled with -arch=sm_11 (or later)
compile flag
Makefile:
NVCC = /usr/local/cuda/bin/nvcc
CUDAPATH = /usr/local/cuda
NVCCFLAGS = -I$(CUDAPATH)/include -arch=sm_11
LFLAGS = -L$(CUDAPATH)/lib64 -lcuda -lcudart -lm
Counter:
$(NVCC) $(NVCCFLAGS) $(LFLAGS) -o Counter Counter.cu
29
Another Example
Computing a Histogram

__device__ int gpu_hist[10];     // histogram computed on gpu

__global__ void gpu_histogram(int *a, int N) {
   int *ptr;
   int tid = blockIdx.x * blockDim.x + threadIdx.x;
   int numberThreads = blockDim.x * gridDim.x;
   if (tid == 0)                       // maybe a better way, but may not be 10 tids
      for (int i = 0; i < 10; i++)
         gpu_hist[i] = 0;              // initialize histogram to all zeros
   __threadfence();                    // wait for memory to be cleared
   while (tid < N) {                   // if no. of threads less than N, threads reused
      ptr = &gpu_hist[a[tid]];
      atomicAdd(ptr, 1);
      tid += numberThreads;
   }
}
30
Some Results
31
Serializing Code
Use as few atomics as possible, as their use can serialize code.
Suppose all processes happen to reach their atomics together.
They will execute their atomics one after the other.
In that situation, execution time becomes almost that of a single
processor.
32
Questions