CUDA Programming - VLSI Signal Processing Lab, EE, NCTU


Basic CUDA Programming
Shin-Kai Chen
[email protected]
VLSI Signal Processing Laboratory
Department of Electronics Engineering
National Chiao Tung University
What will you learn in
this lab?
• Concept of multicore accelerator
• Multithreaded/multicore programming
• Memory optimization
Slides
• Mostly from Prof. Wen-Mei Hwu of UIUC
– http://courses.ece.uiuc.edu/ece498/al/Syllabus.html
CUDA – Hardware?
Software?
[Figure: the software side – an application uses CUDA to launch kernels on the device; each kernel (Kernel 1, Kernel 2) runs as a grid of thread blocks such as Block (0,0) … Block (1,1), and each block holds threads numbered 0, 1, 2, … m. Figure 3.2: An Example of CUDA Thread Organization. Courtesy: NVIDIA]
[Figure: the hardware side – the platform seen by the host: an input assembler and thread execution manager feeding streaming multiprocessors with parallel data caches, texture units, and load/store units, all connected to global memory]
Host-Device Architecture
[Figure: a CPU (host) connected to a GPU with local DRAM (device)]
G80 CUDA mode – A Device
Example
[Figure: the G80 device in CUDA mode – the host and input assembler feed a thread execution manager; below it, an array of streaming multiprocessors, each with its parallel data cache, plus texture and load/store units connecting them to global memory]
Functional Units in G80
• Streaming Multiprocessor (SM)
– 1 instruction decoder ( 1 instruction / 4 cycles )
– 8 streaming processors (SP)
– Shared memory
[Figure: two SMs (SM 0 and SM 1); each has a multithreaded instruction unit (MT IU), its SPs, and its own shared memory, and runs threads t0 t1 t2 … tm from the blocks assigned to it]
Setup CUDA for
Windows
CUDA Environment Setup
• Get a GPU that supports CUDA
– http://www.nvidia.com/object/cuda_learn_products.html
• Download CUDA
– http://www.nvidia.com/object/cuda_get.html
• CUDA driver
• CUDA toolkit
• CUDA SDK (optional)
• Install CUDA
• Test CUDA
– Device Query
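If the SDK's Device Query sample is not at hand, a few runtime API calls give a quick sanity check that the driver and toolkit are installed correctly. A minimal sketch (not the SDK sample itself):

#include <cuda_runtime.h>
#include <stdio.h>

int main() {
  int count = 0;
  cudaGetDeviceCount( &count );              // number of CUDA-capable devices
  printf( "CUDA devices: %d\n", count );
  for ( int i = 0 ; i < count ; i++ ) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, i );     // basic properties of device i
    printf( "Device %d: %s, %d multiprocessors, %lu bytes of global memory\n",
            i, prop.name, prop.multiProcessorCount,
            (unsigned long) prop.totalGlobalMem );
  }
  return 0;
}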
Setup CUDA for Visual
Studio
• From scratch
– http://forums.nvidia.com/index.php?showtopic=30273
• CUDA VS Wizard
– http://sourceforge.net/projects/cudavswizard/
• Modified from existing project
Lab1: First CUDA
Program
CUDA Computing Model
[Figure: the CUDA computing model – serial code runs on the host; the host transfers data to the device and launches a kernel; the kernel's parallel code runs on the device; results are transferred back and the host resumes serial code. The memory transfer / launch kernel / memory transfer sequence repeats for each kernel.]
Data Manipulation between
Host and Device
• cudaError_t cudaMalloc( void** devPtr, size_t count )
– Allocates count bytes of linear memory on the device and
returns a pointer to the allocated memory in *devPtr
• cudaError_t cudaMemcpy( void* dst, const void* src, size_t
count, enum cudaMemcpyKind kind)
– Copies count bytes from the memory area pointed to by src to the
memory area pointed to by dst
– kind indicates the type of memory transfer
• cudaMemcpyHostToHost
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice
• cudaError_t cudaFree( void* devPtr )
– Frees the memory space pointed to by devPtr
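Each of these calls returns a cudaError_t, so it is worth checking the result in real code. A minimal sketch of such checking (the names checkCuda and roundtrip are only for illustration, not part of the lab code):

#include <cuda_runtime.h>
#include <stdio.h>

// Print a message if a CUDA runtime call did not succeed
static void checkCuda( cudaError_t err, const char *msg ) {
  if ( err != cudaSuccess )
    printf( "%s failed: %s\n", msg, cudaGetErrorString(err) );
}

// Example use around an allocate / copy / free round trip
void roundtrip( int *A, int n ) {
  int *dA = NULL;
  checkCuda( cudaMalloc( (void**) &dA, sizeof(int)*n ), "cudaMalloc" );
  checkCuda( cudaMemcpy( dA, A, sizeof(int)*n, cudaMemcpyHostToDevice ), "cudaMemcpy H2D" );
  checkCuda( cudaMemcpy( A, dA, sizeof(int)*n, cudaMemcpyDeviceToHost ), "cudaMemcpy D2H" );
  checkCuda( cudaFree( dA ), "cudaFree" );
}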
Example
• Functionality:
– Given an integer array A holding 8192 elements
– For each element in array A, calculate A[i]*256 and leave the result in B[i]

float GPU_kernel(int *B, int *A) {
  // Create Pointers for Memory Space on Device
  int *dA, *dB;
  // Allocate Memory Space on Device
  cudaMalloc( (void**) &dA, sizeof(int)*SIZE );
  cudaMalloc( (void**) &dB, sizeof(int)*SIZE );
  // Copy Data to be Calculated
  cudaMemcpy( dA, A, sizeof(int)*SIZE, cudaMemcpyHostToDevice );
  cudaMemcpy( dB, B, sizeof(int)*SIZE, cudaMemcpyHostToDevice );
  // Launch Kernel
  cuda_kernel<<<1,1>>>(dB, dA);
  // Copy Output Back
  cudaMemcpy( B, dB, sizeof(int)*SIZE, cudaMemcpyDeviceToHost );
  // Free Memory Spaces on Device
  cudaFree( dA );
  cudaFree( dB );
}
Now, go and finish your first
CUDA program !!!
• Download
http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
• Open project with Visual C++ 2008
( lab1/cuda_lab/cuda_lab.vcproj )
– main.cu
• Random input generation, output validation,
result reporting
– device.cu
• Launch GPU kernel, GPU kernel code
– parameter.h
• Fill in appropriate APIs
– GPU_kernel() in device.cu
Lab2: Make the Parallel
Code Faster
Parallel Processing in
CUDA
• Parallel code can be partitioned into blocks and
threads
– cuda_kernel<<<nBlk, nTid>>>(…)
• Multiple tasks will be initialized, each with a
different block id and thread id
• The tasks are dynamically scheduled
– Tasks within the same block will be scheduled on the
same streaming multiprocessor
• Each task takes care of a single data partition
according to its block id and thread id
Locate Data Partition by
Built-in Variables
• Built-in Variables
– gridDim
• x, y
– blockIdx
• x, y
– blockDim
• x, y, z
– threadIdx
• x, y, z
[Figure: the thread organization of Figure 3.2 again – a grid of blocks and blocks of threads; gridDim and blockDim give the dimensions, while blockIdx and threadIdx identify the current block and thread. Courtesy: NVIDIA]
Data Partition for Previous
Example
When processing 64 integer data:
cuda_kernel<<<2, 2>>>(…)
TASK 0: blockIdx.x = 0, threadIdx.x = 0
TASK 1: blockIdx.x = 0, threadIdx.x = 1
TASK 2: blockIdx.x = 1, threadIdx.x = 0
TASK 3: blockIdx.x = 1, threadIdx.x = 1

Each task computes its own contiguous partition (head and length) of the array:

int total_task = gridDim.x * blockDim.x;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
int length = SIZE / total_task;
int head = task_sn * length;
Processing Single Data
Partition
__global__ void cuda_kernel ( int *B, int *A ) {
  int total_task = gridDim.x * blockDim.x;
  int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
  int length = SIZE / total_task;
  int head = task_sn * length;
  // Each task processes its own contiguous partition of length elements
  for ( int i = head ; i < head + length ; i++ ) {
    B[i] = A[i] * 256;
  }
  return;
}
Parallelize Your
Program !!!
• Partition kernel into threads
– Increase nTid from 1 to 512
– Keep nBlk = 1
• Group threads into blocks
– Adjust nBlk and see if it helps
• Maintain total number of threads below 512,
e.g. nBlk * nTid < 512
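One way to compare different nBlk / nTid settings is to time the kernel launch with CUDA events. A minimal sketch, assuming dA and dB have already been allocated and filled as in Lab 1 (the helper name time_launch is only illustrative, and the lab's main.cu may already report timing for you):

float time_launch( int nBlk, int nTid, int *dB, int *dA ) {
  cudaEvent_t start, stop;
  cudaEventCreate( &start );
  cudaEventCreate( &stop );

  cudaEventRecord( start, 0 );
  cuda_kernel<<<nBlk, nTid>>>( dB, dA );    // the Lab 2 kernel
  cudaEventRecord( stop, 0 );
  cudaEventSynchronize( stop );             // wait for the kernel to finish

  float ms = 0.0f;
  cudaEventElapsedTime( &ms, start, stop ); // elapsed time in milliseconds
  cudaEventDestroy( start );
  cudaEventDestroy( stop );
  return ms;
}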
Lab3: Resolve Memory
Contention
Parallel Memory
Architecture
• Memory is divided into
banks to achieve high
bandwidth
• Each bank can service one
address per cycle
• Successive 32-bit words are
assigned to successive
banks
[Figure: memory organized as 16 banks, BANK0 through BANK15]
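With 16 banks and 32-bit words, the bank that holds A[i] is simply i modulo 16. A small illustration (not lab code) of why the Lab 2 access pattern collides while consecutive indices do not:

// Bank index of the 32-bit word A[i] on a 16-bank memory
int bank_of( int i ) { return i % 16; }
// bank_of(0) == bank_of(16) == bank_of(32) == bank_of(48) == 0  -> same bank
// bank_of(0), bank_of(1), bank_of(2), bank_of(3)                -> four different banks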
Lab2 Review
When processing 64 integer data:
cuda_kernel<<<1, 4>>>(…)
[Figure: with the contiguous partitioning from Lab 2, in iteration 1 THREAD0–THREAD3 access A[0], A[16], A[32], A[48], and in iteration 2 they access A[1], A[17], A[33], A[49]; in each iteration all four addresses fall into the same bank – CONFLICT!]
How about Interleaved
Accessing?
When processing 64 integer data:
cuda_kernel<<<1, 4>>>(…)
[Figure: with interleaved accessing, in iteration 1 THREAD0–THREAD3 access A[0], A[1], A[2], A[3], and in iteration 2 they access A[4], A[5], A[6], A[7]; the addresses fall into different banks – NO CONFLICT]
Implementation of
Interleaved Accessing
cuda_kernel<<<1, 4>>>(…)
[Figure: each thread starts at its own head element and steps through the array by a fixed stripe]
• head = task_sn
• stripe = total_task
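Putting head = task_sn and a stride of total_task into the Lab 2 kernel gives roughly the following sketch, assuming SIZE from parameter.h and the A[i]*256 operation from Lab 1 (your device.cu may differ in details):

__global__ void cuda_kernel ( int *B, int *A ) {
  int total_task = gridDim.x * blockDim.x;
  int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
  // Interleaved access: task_sn handles elements
  // task_sn, task_sn + total_task, task_sn + 2*total_task, ...
  for ( int i = task_sn ; i < SIZE ; i += total_task ) {
    B[i] = A[i] * 256;
  }
  return;
}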
Improve Your
Program !!!
• Modify the original kernel code to access
data in an interleaved manner
– cuda_kernel() in device.cu
• Adjust nBlk and nTid as in Lab 2
and examine the effect
– Maintain the total number of threads below
512, e.g. nBlk * nTid < 512
Thank You
• http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
• Final project issue
• Group issue