Intermediate GPGPU Programming in CUDA

Supada Laosooksathit
NVIDIA Hardware Architecture
[Figure: NVIDIA hardware architecture, including host memory and the GPU]
Recall
• 5 steps for CUDA Programming
– Initialize device
– Allocate device memory
– Copy data to device memory
– Execute kernel
– Copy data back from device memory
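A minimal sketch of these five steps in one program (the kernel myKernel and the sizes are placeholders):

#include <cuda_runtime.h>

__global__ void myKernel(float *d_data) {             /* placeholder kernel */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[i] += 1.0f;
}

int main(void) {
    const int N = 256;
    float h_data[256] = {0};

    cudaSetDevice(0);                                 /* 1. initialize device */

    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));  /* 2. allocate device memory */

    cudaMemcpy(d_data, h_data, N * sizeof(float),     /* 3. copy data to device */
               cudaMemcpyHostToDevice);

    myKernel<<<N / 64, 64>>>(d_data);                 /* 4. execute kernel */

    cudaMemcpy(h_data, d_data, N * sizeof(float),     /* 5. copy data back */
               cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return 0;
}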
Initialize Device Calls
• To select the device associated to the host thread
– cudaSetDevice(device)
– This function must be called before any __global__
function call; otherwise, device 0 is selected automatically.
• To get number of devices
– cudaGetDeviceCount(&devicecount)
• To retrieve a device’s properties
– cudaGetDeviceProperties(&deviceProp, device)
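A short sketch combining the three calls above (variable names are illustrative):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);        /* number of CUDA devices */

    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device);
        printf("Device %d: %s, compute capability %d.%d\n",
               device, deviceProp.name, deviceProp.major, deviceProp.minor);
    }

    cudaSetDevice(0);    /* select device 0 before any kernel launch */
    return 0;
}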
Hello World Example
• Allocate host and device memory
Hello World Example
• Host code
Hello World Example
• Kernel code
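The slide code itself is not reproduced in this transcript; a minimal sketch with the same three parts (memory allocation, host code, kernel code), in which every thread reports its block and thread ID, might look like this:

#include <stdio.h>

/* Kernel code: each thread records its block and thread ID */
__global__ void helloWorld(int *d_block, int *d_thread) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_block[i]  = blockIdx.x;
    d_thread[i] = threadIdx.x;
}

/* Host code */
int main(void) {
    const int blocks = 2, threads = 4, n = blocks * threads;
    int h_block[8], h_thread[8];
    int *d_block, *d_thread;

    /* Allocate device memory */
    cudaMalloc((void **)&d_block,  n * sizeof(int));
    cudaMalloc((void **)&d_thread, n * sizeof(int));

    helloWorld<<<blocks, threads>>>(d_block, d_thread);

    /* Copy results back and print */
    cudaMemcpy(h_block,  d_block,  n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_thread, d_thread, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("Hello World from block %d, thread %d\n", h_block[i], h_thread[i]);

    cudaFree(d_block);
    cudaFree(d_thread);
    return 0;
}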
To Try CUDA Programming
• SSH to 138.47.102.111
• Set environment variables in .bashrc in your home directory
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
• Copy the SDK from
/home/students/NVIDIA_GPU_Computing_SDK
• Compile the following directories
– NVIDIA_GPU_Computing_SDK/shared/
– NVIDIA_GPU_Computing_SDK/C/common/
• The sample programs are in
NVIDIA_GPU_Computing_SDK/C/src/
Demo
• Hello World
– Print out block and thread IDs
• Vector Add
– C = A + B
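A hedged sketch of the vector-add kernel used in the demo (names and launch configuration are illustrative):

__global__ void vectorAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               /* guard against out-of-range threads */
        C[i] = A[i] + B[i];
}

/* typical launch: vectorAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n); */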
NVIDIA Hardware Architecture
[Figure: NVIDIA hardware architecture, detail of a streaming multiprocessor (SM)]
Specifications of a Device

Specification       Compute Capability 1.3   Compute Capability 2.0
Warp size           32                       32
Max threads/block   512                      1024
Max blocks/grid     65535                    65535
Shared mem          16 KB/SM                 48 KB/SM
• For more details
– deviceQuery in CUDA SDK
– Appendix F in Programming Guide 4.0
Demo
• deviceQuery
– Show hardware specifications in detail
Memory Optimizations
• Reduce the time of memory transfer between
host and device
– Use asynchronous memory transfer (CUDA
streams)
– Use zero copy
• Reduce the number of transactions between
on-chip and off-chip memory
– Memory coalescing
• Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer
• Regular memory transfer is synchronous: cudaMemcpy() does not return until the copy is complete
Reduce Time of Host-Device Memory Transfer
• CUDA streams
– Allow kernel execution and memory copies to overlap
CUDA Streams Example
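The example code from these slides is not preserved here; a sketch of the usual pattern, assuming pinned host memory and two streams that each copy a chunk in, run the kernel, and copy the chunk back:

#include <cuda_runtime.h>

__global__ void process(float *d, int n) {            /* placeholder kernel */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int N = 1 << 20, CHUNK = N / 2;
    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, N * sizeof(float)); /* page-locked: required for async copies */
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&stream[s]);

    for (int s = 0; s < 2; ++s) {
        int off = s * CHUNK;
        cudaMemcpyAsync(d_data + off, h_data + off, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        process<<<CHUNK / 256, 256, 0, stream[s]>>>(d_data + off, CHUNK);
        cudaMemcpyAsync(h_data + off, d_data + off, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();    /* wait for both streams to finish */

    for (int s = 0; s < 2; ++s) cudaStreamDestroy(stream[s]);
    cudaFreeHost(h_data);
    cudaFree(d_data);
    return 0;
}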
GPU Timers
• CUDA Events
– An API
– Timestamped on the GPU clock
– Accurate for timing kernel executions
• CUDA timer calls
– Libraries implemented in CUDA SDK
CUDA Events Example
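The slide’s code is not in the transcript; a minimal sketch of timing a kernel with CUDA events (the kernel busyKernel is a placeholder workload):

#include <stdio.h>

__global__ void busyKernel(float *d, int n) {  /* placeholder workload */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main(void) {
    const int N = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                 /* record in the default stream */
    busyKernel<<<N / 256, 256>>>(d_data, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                /* wait until the stop event completes */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    /* elapsed time in milliseconds */
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}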
Demo
• simpleStreams
Reduce Time of Host-Device Memory Transfer
• Zero copy
– Allow device pointers to access page-locked host
memory directly
– Page-locked host memory is allocated by
cudaHostAlloc()
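A sketch of the zero-copy pattern (the kernel scale is illustrative): the device dereferences a mapped pointer into page-locked host memory, so no explicit cudaMemcpy is needed.

#include <stdio.h>

__global__ void scale(float *d, int n) {       /* placeholder kernel */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    cudaSetDeviceFlags(cudaDeviceMapHost);     /* enable mapped host memory first */

    const int N = 1024;
    float *h_data, *d_ptr;
    /* page-locked, mapped host allocation */
    cudaHostAlloc((void **)&h_data, N * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    /* device pointer aliasing the same host memory: no cudaMemcpy needed */
    cudaHostGetDevicePointer((void **)&d_ptr, h_data, 0);

    scale<<<N / 256, 256>>>(d_ptr, N);
    cudaDeviceSynchronize();                   /* kernel writes land in host memory */

    printf("h_data[1] = %f\n", h_data[1]);
    cudaFreeHost(h_data);
    return 0;
}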
Demo
• Zero copy
Reduce Number of On-chip and Off-chip
Memory Transactions
• Threads in a warp access global memory
• Memory coalescing
– Combines the warp’s accesses so that a whole segment of words is copied in a single transaction
Memory Coalescing
• Threads in a warp access global memory in a
straightforward way (one 4-byte word per thread)
Memory Coalescing
• Memory addresses are aligned in the same
segment but the accesses are not sequential
Memory Coalescing
• Memory addresses are not aligned in the
same segment
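An illustrative pair of kernels, not from the slides, contrasting a coalesced access pattern with a strided one:

/* Coalesced: consecutive threads read consecutive 4-byte words, so a
   warp's accesses combine into few memory transactions. */
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

/* Strided: consecutive threads touch words far apart, so the warp's
   accesses fall in different segments and cannot be combined. */
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}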
Shared Memory
• 16 banks for compute capability 1.x, 32 banks
for compute capability 2.x
• Helps in achieving memory coalescing
• Bank conflicts may occur
– Two or more threads in a warp access the same bank
– In compute capability 1.x, no broadcast
– In compute capability 2.x, the same word is
broadcast to all threads that request it
Bank Conflicts
[Figure: two access patterns with threads 0–3 and banks 0–3. Left: no bank conflict, each thread maps to a distinct bank. Right: a 2-way bank conflict, two threads map to the same bank.]
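An illustrative kernel, not from the slides, showing the row/column access patterns behind the figure and the usual padding fix:

/* Launch with dim3 block(32, 32). */
__global__ void bankDemo(const float *in, float *out) {
    __shared__ float tile[32][33];  /* 33 columns: the +1 padding shifts each
                                       row by one bank, avoiding conflicts */
    int x = threadIdx.x, y = threadIdx.y;

    /* Row access: the threads of a warp touch 32 different banks (no conflict) */
    tile[y][x] = in[y * 32 + x];
    __syncthreads();

    /* Column access: without the padding, all 32 threads of a warp would hit
       the same bank (a 32-way conflict on compute capability 2.x) */
    out[x * 32 + y] = tile[x][y];
}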
Matrix Multiplication Example
• Shared memory reduces accesses to global memory
– A is read only (B.width/BLOCK_SIZE) times from global memory
– B is read only (A.height/BLOCK_SIZE) times from global memory
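A sketch of the tiled kernel along the lines of the shared-memory matrix multiply in the CUDA Programming Guide (square matrices with n divisible by BLOCK_SIZE, for brevity):

#define BLOCK_SIZE 16

/* C = A * B; launch with dim3 block(BLOCK_SIZE, BLOCK_SIZE) and
   dim3 grid(n / BLOCK_SIZE, n / BLOCK_SIZE). */
__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    /* March block-sized tiles of A and B through shared memory */
    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();               /* tile fully loaded */

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               /* done with this tile */
    }
    C[row * n + col] = sum;
}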
Demo
• Matrix Multiplication
– With and without shared memory
– Different block sizes
Control Flow
• if, switch, do, for, while
• Branch divergence in a warp
– Threads in a warp follow different execution paths
• The different execution paths are serialized
• Increases the number of instructions executed in that warp
Branch Divergence
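An illustrative pair of kernels, not from the slides: the first splits every warp across the branch, the second keeps the branch condition uniform within each warp:

/* Divergent: threadIdx.x % 2 splits every warp, so both paths of the
   branch are executed serially by each warp. */
__global__ void divergent(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) d[i] += 1.0f;
    else                      d[i] -= 1.0f;
}

/* Warp-aligned: the condition is uniform within each warp of 32 threads,
   so no warp diverges. */
__global__ void warpAligned(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0) d[i] += 1.0f;
    else                             d[i] -= 1.0f;
}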
Summary
• 5 steps for CUDA Programming
• NVIDIA Hardware Architecture
– Memory hierarchy: global memory, shared
memory, register file
– Specifications of a device: block, warp, thread, SM
Summary
• Memory optimization
– Reduce overhead due to host-device memory
transfer with CUDA streams, Zero copy
– Reduce the number of transactions between on-chip and off-chip memory by utilizing memory
coalescing (shared memory)
– Try to avoid bank conflicts in shared memory
• Control flow
– Try to avoid branch divergence in a warp
References
• http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
• http://developer.nvidia.com/cuda-toolkit