CUDA Programming Model Overview
Hardware Hierarchy and Optimization
Yukai Hung
[email protected]
Department of Mathematics
National Taiwan University
CUDA Development Environment
CUDA Development Environment
[figure: CUDA software stack]
- one API level for developers who want low-level control
- one API level for developers who want high-level convenience
- both share the same back-end compiler and optimization technology
3
CUDA Development Environment
4
CUDA Development Environment
CUDA source files must be compiled by nvcc
- creates host object code directly
- creates device object code directly
- creates Parallel Thread eXecution (PTX) source code directly
- can compile into device emulation mode for host simulation
Executable CUDA code requires two libraries
- dynamic linking with the CUDA core library
- dynamic linking with the CUDA runtime library
5
CUDA Development Environment
CUDA emulation mode
- uses the host to simulate kernel execution
- no device or CUDA driver is required
- each device thread is emulated with a host thread
Running in device emulation mode allows you to
- use the host's native debugger support
- call any host function from device code and vice versa
- access any device-specific data from host code and vice versa
- detect deadlock situations caused by improper synchronization
6
CUDA Development Environment
Linux and Windows environment
7
CUDA Development Environment
[figure: CUDA compilation flow]
- a C/C++ CUDA application is split by NVCC into CPU code and PTX code (the virtual layer)
- a PTX-to-target compiler then translates PTX into target code for a specific GPU such as the G80 (the physical layer)
- for example, the CUDA statements
  float4 me=gx[gtid];
  me.x=me.x+me.y*me.z;
  compile to PTX such as
  ld.global.v4.f32 {$f1,$f3,$f5,$f7},[$r9+0];
  mad.f32 $f1,$f5,$f3,$f1;
8
CUDA Development Environment
How to install CUDA?
How to compile a CUDA file?
How to compile a CUDA file into PTX?
9
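For reference, a few typical nvcc command lines that answer the questions above; the file name app.cu is illustrative, and the device-emulation flag only exists in the older toolkits this material targets:
nvcc -o app app.cu             # compile and link a CUDA source file
nvcc -c app.cu                 # compile to an object file only
nvcc -ptx app.cu               # emit the PTX intermediate code
nvcc -deviceemu -o app app.cu  # build in device emulation mode (pre-CUDA 3.1 toolkits)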
CUDA Development Environment
Windows Nexus IDE
- the first IDE for massively parallel thread applications
- accelerates coprocessing application development
- complete Visual Studio integrated development environment
10
CUDA Development Environment
Linux and Mac Visual Profiler
11
Start from Matrix-Matrix Multiplication
naïve and tiled algorithm
Naïve Matrix-Matrix Multiplication
Each thread calculates one result element
- read one row and one column from global memory
- perform inner product and store back to global memory
[figure: matrices M, N, and result P, each WIDTH x WIDTH]
13
Naïve Matrix-Matrix Multiplication
//compute matrix matrix multiplication P=M*N
__global__ void SGEMM(float* d_A,float* d_B,float* d_C)
{
float value;
float Melement;
float Nelement;
//calculate each thread global index (xidx,yidx) and the matrix dimension dim
...
value=0.0f;
//each thread performs its inner product
for(int loop=0;loop<dim;loop++)
{
Melement=d_A[yidx*dim+loop];
Nelement=d_B[loop*dim+xidx];
value=value+Melement*Nelement;
}
//store the final result into the P matrix
d_C[yidx*dim+xidx]=value;
}
14
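The index calculation elided above is typically a two-dimensional global index; a minimal sketch, assuming one thread per output element and that the names xidx, yidx, and dim match the listing above:
//hypothetical index setup for the naïve kernel above
int xidx=blockIdx.x*blockDim.x+threadIdx.x;  //column of the output element
int yidx=blockIdx.y*blockDim.y+threadIdx.y;  //row of the output element
int dim=WIDTH;                               //dimension of the square matrices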
Naïve Matrix-Matrix Multiplication
Analyze the global memory usage
- each thread computes one element of the result matrix
- it loads one row of matrix M and one column of matrix N
- it performs one multiply and one add for each pair of elements
- the compute to global memory access ratio is close to 1:1
Analyze the global memory bandwidth
- 4 bytes x 346.5 GFLOPs = 1386 GB/s is required to reach peak
- 86.4 GB/s limits the achievable performance to about 21.6 GFLOPs
- the actual naïve code runs at about 15 GFLOPs
15
Naïve Matrix-Matrix Multiplication
Need to drastically cut down global memory accesses
- each row of matrix M is read many times
- each column of matrix N is read many times
[figure: matrices M, N, P of size WIDTH x WIDTH; rows of M and columns of N are re-read by many threads]
16
Tiled Matrix-Matrix Multiplication
Use a tiled algorithm to reuse matrix data
- each block computes one square sub-matrix Psub of the result
- each thread computes one element of that sub-matrix
[figure: matrices M, N, P of size WIDTH x WIDTH partitioned into TILE_WIDTH x TILE_WIDTH tiles; Psub is the sub-matrix computed by one block]
17
Tiled Matrix-Matrix Multiplication
Each block should have many threads
- tile size 16 creates 16x16=256 threads per block
- a 1024x1024 matrix needs 64x64=4096 blocks on the grid
Each thread block loads two sub-matrices and computes its results
- loads 2x256=512 floats from global memory to shared memory
- performs 256x(2x16)=8192 operations on the shared memory data
- global memory bandwidth is no longer a limiting factor
- the compute to global memory access ratio is close to 16:1
18
Tiled Matrix-Matrix Multiplication
//compute matrix matrix multiplication P=M*N
void SGEMM(float* h_A,float* h_B,float* h_C)
{
dim3 gridSize;
dim3 blockSize;
//setup the execution configuration
blockSize.x=TILED_WIDTH;
blockSize.y=TILED_WIDTH;
gridSize.x=WIDTH/TILED_WIDTH;
gridSize.y=WIDTH/TILED_WIDTH;
//allocate device memory d_A,d_B,d_C and copy h_A,h_B to the device
...
//launch device matrix-matrix multiplication kernel
sgemm<<<gridSize,blockSize>>>(d_A,d_B,d_C);
//copy the result back to h_C and free the device memory
...
}
19
Tiled Matrix-Matrix Multiplication
//compute matrix matrix multiplication P=M*N
__global__ void sgemm(float* M,float* N,float* P)
{
//setup the block index and thread index
int bx=blockIdx.x;
int by=blockIdx.y;
int tx=threadIdx.x;
int ty=threadIdx.y;
float value=0.0f;
//declare the shared memory per block
__shared__ float Mshared[TILED_WIDTH][TILED_WIDTH];
__shared__ float Nshared[TILED_WIDTH][TILED_WIDTH];
...
20
Tiled Matrix-Matrix Multiplication
//loop over all sub-matrices on matrix M and N
//required to compute block sub-matrix results
for(int loop=0;loop<WIDTH/TILED_WIDTH;loop++)
{
//get the pointers to the current sub-matrices of M and N
float* Msub=GetSubMatrix(M,loop,bx,by,tx,ty,WIDTH);
float* Nsub=GetSubMatrix(N,loop,bx,by,tx,ty,WIDTH);
//each thread loads one element into shared memory
Mshared[ty][tx]=GetMatrixElement(Msub,tx,ty);
Nshared[ty][tx]=GetMatrixElement(Nsub,tx,ty);
//synchronize to make sure the sub-matrices are loaded
//into shared memory before computing partial results
__syncthreads();
21
Tiled Matrix-Matrix Multiplication
void __syncthreads() function
- synchronizes all threads in the same block
- execution resumes when every thread has reached the barrier
- used to avoid RAW/WAR/WAW hazards when accessing memory (see the sketch below)
22
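A tiny illustration of why the barrier matters, assuming a hypothetical kernel (the names input and output are not from the slides) in which every thread reads a neighbour's shared-memory slot; without the barrier that read would be a read-after-write hazard:
//hypothetical kernel: each thread adds its right neighbour's value
__global__ void neighbour_sum(float* input,float* output)
{
    __shared__ float buffer[256];                  //assumes blockDim.x == 256
    int tid=threadIdx.x;
    int gid=blockIdx.x*blockDim.x+tid;
    buffer[tid]=input[gid];                        //each thread writes its own slot
    __syncthreads();                               //wait until every write is visible
    output[gid]=buffer[tid]+buffer[(tid+1)%256];   //now safe to read another thread's slot
}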
Tiled Matrix-Matrix Multiplication
//each thread computes one element of the sub-matrix
for(int loope=0;loope<TILED_WIDTH;loope++)
value=value+Mshared[ty][loope]*Nshared[loope][tx];
//synchronize to make sure that the preceding
//computation is done before loading two new
//sub-matrices of M and N in the next iteration
__syncthreads();
}
//get the pointer to the block sub-matrix P
//(the tile of P depends only on the block indices bx and by)
float* Psub=GetSubMatrix(P,0,bx,by,tx,ty,WIDTH);
//save the block sub-matrix result to global memory
SetMatrixElement(Psub,value,bx,by,tx,ty,WIDTH);
return;
}
23
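Because the GetSubMatrix and GetMatrixElement helpers are not shown, a self-contained sketch of the same tiled kernel may be easier to follow; it assumes square WIDTH x WIDTH row-major matrices with WIDTH divisible by TILED_WIDTH:
//self-contained tiled kernel, equivalent in spirit to the listing above
__global__ void sgemm_tiled(float* M,float* N,float* P)
{
    __shared__ float Mshared[TILED_WIDTH][TILED_WIDTH];
    __shared__ float Nshared[TILED_WIDTH][TILED_WIDTH];
    int row=blockIdx.y*TILED_WIDTH+threadIdx.y;   //global row of the P element
    int col=blockIdx.x*TILED_WIDTH+threadIdx.x;   //global column of the P element
    float value=0.0f;
    for(int loop=0;loop<WIDTH/TILED_WIDTH;loop++)
    {
        //each thread loads one element of the current M and N tiles
        Mshared[threadIdx.y][threadIdx.x]=M[row*WIDTH+loop*TILED_WIDTH+threadIdx.x];
        Nshared[threadIdx.y][threadIdx.x]=N[(loop*TILED_WIDTH+threadIdx.y)*WIDTH+col];
        __syncthreads();
        //partial inner product over the tile held in shared memory
        for(int k=0;k<TILED_WIDTH;k++)
            value=value+Mshared[threadIdx.y][k]*Nshared[k][threadIdx.x];
        __syncthreads();
    }
    //store the final result into the P matrix
    P[row*WIDTH+col]=value;
}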
Tiled Matrix-Matrix Multiplication
24
Tiled Matrix-Matrix Multiplication
What is the limitation on the matrix tile size?
- the memory access to compute ratio depends on the tile size
- a larger tile size gives a higher reuse ratio
- the ratio is limited by the shared memory size and the block size
What is the performance difference between the two algorithms?
- naïve matrix-matrix multiplication runs at about 15 GFLOPs
- tiled matrix-matrix multiplication runs at about 90 GFLOPs
What kind of tiled algorithm is used in the Intel Math Kernel Library?
25
Matrix-Matrix Multiplication
[figure: memory hierarchy comparison - the GPU version stages data from device memory into shared memory, the CPU version stages data from host memory through the L3 and L2 caches]
26
Matrix-Matrix Multiplication
27
Matrix-Matrix Multiplication
How about the memory access pattern?
[figure: matrices M, N, P of size WIDTH x WIDTH]
28
Tiled Matrix-Matrix Multiplication
How to accelerate the current tiled algorithm? (a sketch follows below)
- prefetch the next sub-matrix data while computing
- overlap the loading and computing time to hide latency
29
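A minimal sketch of the prefetching idea applied to the self-contained tiled kernel above; the next tile is fetched into registers while the current tile in shared memory is being used (the variable names are illustrative):
//hypothetical prefetching variant of the tiled loop
float nextM=M[row*WIDTH+threadIdx.x];              //prefetch the first M element
float nextN=N[threadIdx.y*WIDTH+col];              //prefetch the first N element
for(int loop=0;loop<WIDTH/TILED_WIDTH;loop++)
{
    //move the prefetched elements into shared memory
    Mshared[threadIdx.y][threadIdx.x]=nextM;
    Nshared[threadIdx.y][threadIdx.x]=nextN;
    __syncthreads();
    //issue the loads for the next tile before computing on the current one
    if(loop+1<WIDTH/TILED_WIDTH)
    {
        nextM=M[row*WIDTH+(loop+1)*TILED_WIDTH+threadIdx.x];
        nextN=N[((loop+1)*TILED_WIDTH+threadIdx.y)*WIDTH+col];
    }
    for(int k=0;k<TILED_WIDTH;k++)
        value=value+Mshared[threadIdx.y][k]*Nshared[k][threadIdx.x];
    __syncthreads();
}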
Recommended Programming Strategy
Global memory resides in device memory
- much slower access than shared memory
A profitable way of programming the device is to tile the data
- divide data into subsets that fit into shared memory
- handle each data subset with one thread block by:
1: load the subset from global memory to shared memory
use multiple threads to exploit memory-level parallelism
2: perform computation on the subset from shared memory
3: store final results from shared memory to global memory
30
Same Idea for Matrix Transpose
naïve and tiled algorithm
Naïve Matrix Transpose
Each thread handles one element of the matrix
- loads one element along a row of the original matrix
- stores it along a column of the output matrix
[figure: WIDTH x WIDTH input and output matrices]
32
Naïve Matrix Transpose
[figure: load data from global memory with stride = 1; store data into global memory with stride = dimension]
33
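A minimal sketch of the naïve transpose, assuming square WIDTH x WIDTH row-major matrices and one thread per element; the read is unit-stride along the row while the write is strided by the matrix dimension:
//hypothetical naïve transpose kernel
__global__ void transpose_naive(float* idata,float* odata)
{
    int xidx=blockIdx.x*blockDim.x+threadIdx.x;  //column of the input element
    int yidx=blockIdx.y*blockDim.y+threadIdx.y;  //row of the input element
    //read with stride 1, write with stride WIDTH
    odata[xidx*WIDTH+yidx]=idata[yidx*WIDTH+xidx];
}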
Tiled Matrix Transpose
Consider the tiled algorithm on the matrix!
[figure: the matrix of elements (0,0)...(2,2) shown when loading data from memory and when storing data into memory]
34
Tiled Matrix Transpose
Consider one of the thread blocks!
- load data from global memory to shared memory
- transpose the tile within shared memory
- store data from shared memory to global memory
[figure: a tile of elements (0,0)...(2,2) is copied into shared memory, transposed there, and written back to global memory]
35
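A minimal sketch of the tiled transpose, assuming a TILE_DIM x TILE_DIM thread block (TILE_DIM is a hypothetical constant) and WIDTH divisible by TILE_DIM; both the global read and the global write now move with unit stride along a row:
#define TILE_DIM 16
//hypothetical tiled transpose kernel
__global__ void transpose_tiled(float* idata,float* odata)
{
    __shared__ float tile[TILE_DIM][TILE_DIM+1];  //+1 column of padding avoids bank conflicts
    int xidx=blockIdx.x*TILE_DIM+threadIdx.x;
    int yidx=blockIdx.y*TILE_DIM+threadIdx.y;
    //load one tile row by row into shared memory (unit-stride read)
    tile[threadIdx.y][threadIdx.x]=idata[yidx*WIDTH+xidx];
    __syncthreads();
    //write the transposed tile to the mirrored block position (unit-stride write)
    xidx=blockIdx.y*TILE_DIM+threadIdx.x;
    yidx=blockIdx.x*TILE_DIM+threadIdx.y;
    odata[yidx*WIDTH+xidx]=tile[threadIdx.x][threadIdx.y];
}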
Hardware Hierarchy Architecture Details
Hardware Hierarchy Architecture
the work distribution unit distributes all blocks to the shader multiprocessors
[figure: G80 GPU block diagram - Host, Input Assembler, Vtx/Pixel Thread Issue, Setup/Rstr/ZCull, Work Distribution, shader processors (SP), texture filter (TF) units, L1/L2 caches, and frame buffer (FB) partitions]
37
Hardware Hierarchy Architecture
this means you need sufficient blocks to fill all the shader multiprocessors
all blocks are scheduled round-robin according to the number of shaders
38
Hardware Hierarchy Architecture
[figure: the same G80 block diagram, highlighting the texture filter (TF) units]
this unit performs graphics texture operations: texture filtering and pixel interpolation
39
Hardware Hierarchy Architecture
the shader processor array (SPA) is on chip
the number of shaders depends on the hardware generation
[figure: the same G80 block diagram, highlighting the shader processor array]
40
Hardware Hierarchy Architecture
a shader multiprocessor (SM) is a group of shaders within the SPA
each thread block is scheduled onto one SM
[figure: the same G80 block diagram, highlighting one shader multiprocessor]
41
Hardware Hierarchy Architecture
a shader multiprocessor (SM) within the SPA
each thread block is scheduled onto one SM
[figure: the same G80 block diagram, shader multiprocessor detail]
42
Hardware Hierarchy Architecture
a shader processor (SP) handles regular arithmetic
each thread is scheduled onto one shader processor
43
Hardware Hierarchy Architecture
the low-precision special function unit (SFU)
supports the intrinsic functions in hardware
44
Hardware Hierarchy Architecture
Intrinsic functions are supported by the hardware SFU
- intrinsic functions mirror the traditional math functions
- they are less accurate but faster versions
- their names are prefixed with a double underscore, e.g. __function()
Selectable rounding modes for arithmetic operations
- functions suffixed with _rn use round-to-nearest-even
- functions suffixed with _rz use round-towards-zero
- functions suffixed with _rd use round-down
- functions suffixed with _ru use round-up
SFU and SP units are two independent execution units
- SFU and SP operations can proceed simultaneously (see the sketch below)
45
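A small sketch contrasting the accurate math calls with their intrinsic counterparts; __sinf, __expf, __fdividef, and __fmul_rz are standard CUDA device intrinsics, while the kernel itself and its names are illustrative only:
//hypothetical kernel comparing intrinsic and standard math functions
__global__ void intrinsic_demo(float* data)
{
    int idx=blockIdx.x*blockDim.x+threadIdx.x;
    float x=data[idx];
    float accurate=sinf(x)*expf(x)+x/3.0f;              //standard, more accurate versions
    float fast=__sinf(x)*__expf(x)+__fdividef(x,3.0f);  //intrinsic versions, run on the SFU
    float rounded=__fmul_rz(x,x);                       //multiply with round-towards-zero
    data[idx]=accurate+fast+rounded;
}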
Hardware Hierarchy Architecture
on-chip shared memory on each SM shares data among the threads of one block
the instruction unit fetches one instruction and dispatches it to all threads in the same warp
46
Warp Architecture
A warp is purely a hardware concept, unrelated to the software design
- designed to handle variable block sizes
- designed to handle divergence efficiently
- designed to hide memory access latency
Each block is divided into several warps
- each warp contains 32 threads on current hardware
- the warp size is chosen for hardware reasons
Each shader multiprocessor executes only one warp at a time
- each SM monitors and switches among several resident warps
- synchronization is not necessary among threads in the same warp
47
Warp Architecture
[figure: block size 256 is divided into 8 warps; block size 280 is divided into 9 warps, with 8 threads of the last warp idle]
48
Warp Architecture
Shader multiprocessor scheduling
[figure: blocks 0 and 1 are resident on one shader multiprocessor; their warps (block 0 warps 0-3, block 1 warps 0-3) are interleaved by the scheduler]
49
Warp Architecture
Shader multiprocessor synchronization
[figure: the same resident warps, illustrating synchronization among the warps of each block]
50
Warp Architecture
The warp scheduler on each shader is very smart!!
51
Warp Architecture
SM hardware implements zero-overhead warp scheduling
- warps whose next instruction has its operands ready for
  consumption are eligible for execution (tracked via scoreboarding)
- eligible warps are selected for execution according to a
  prioritized scheduling policy (a warp scheduling priority table)
Each thread block is partitioned into several warps
- the partitioning is always the same across devices
- the warp size may change from generation to generation
- thread indices within a warp are consecutive and increasing (see the sketch below)
All threads in the same warp execute the same instruction
- explicit synchronization is not necessary within a warp
52
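A small sketch of how the consecutive partitioning maps thread indices to warps, using the built-in warpSize variable; the kernel and array names are illustrative:
//hypothetical kernel recording which warp and lane each thread belongs to
__global__ void warp_layout(int* warpIds,int* laneIds)
{
    int tid=threadIdx.x;           //linear thread index within a 1D block
    warpIds[tid]=tid/warpSize;     //threads 0-31 form warp 0, threads 32-63 warp 1, ...
    laneIds[tid]=tid%warpSize;     //position of the thread inside its warp
}
For multidimensional blocks the same rule applies to the linearized index threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y.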
Warp Architecture
[figure: the block scheduler distributes blocks 1-12 across the shader multiprocessors]
53
Warp Architecture
[figure: the warp scheduler on one SM selects among the resident warps via scoreboarding]
54
Warp Architecture
Why is the warp size equal to 32 threads?
- it depends on the shader instruction dispatch hardware
- dispatching the same instruction to a whole warp takes 4 clock cycles
- each warp thread therefore gets its next instruction every 4 clock cycles
- 8 concurrently executing threads (one per SP) x 4 clock cycles = 32 threads in one warp
55
Warp Architecture
[figure: the same instruction k is issued to the warp over clock cycles 1-4, followed by instruction k+1]
56
Warp Architecture
How to hide the global memory access latency?
- suppose one global access is needed for every n instructions
- more than 200/(4n) warps are needed to fully tolerate the roughly 200-cycle latency
  (for example, with n = 4 about 200/16 ≈ 13 warps are needed)
[figure: warps 0-5 take turns executing while the others wait on memory]
57
Warp Architecture
How to hide the global memory access latency?
- suppose one global access is needed for every n instructions
- more than 200/(4n) warps are needed to fully tolerate latency
[figure: the same execution timeline with only warps 0-3]
58
Control Flow
Control Flow
Each instruction is issued per 32 threads, i.e. one warp
The main performance concern with branching is divergence
- threads within a single warp take different control paths
- the different execution paths within a warp are serialized
- the control paths taken by the threads of one warp are
  traversed one at a time until there are no more
Divergence in one warp does not affect other warps
60
Control Flow
//divergent condition
if(threadIdx.x<2)
{......}
else
{......}
- this creates two different control paths for threads in a block
- the branch granularity is less than the warp size, so divergence occurs (only in the first warp)
- threads 0 and 1 follow a different path than the rest of the threads in warp 0
[figure: only warp 0 diverges; warps 1-3 are unaffected]
61
Control Flow
//no divergent condition
if(threadIdx.x/WARP_SIZE<2)
{......}
else
{......}
- this creates two different control paths for threads in a block
- the branch granularity is a whole multiple of the warp size
- all threads in any given warp follow the same control path, so there is no divergence
[figure: warps 0 and 1 take one path, warps 2 and 3 take the other; no warp diverges]
62
Control Flow
The profiler counts events per shader multiprocessor
- counts branches and divergent branches
- counts warp serializations and instructions
63
Parallel Vector Reduction
Parallel Vector Reduction
Given any array, reduce it to a single value in parallel
Some examples:
- sum reduction: sum of all values in the array
- max reduction: maximum of all values in the array
65
Parallel Vector Reduction
Assume an in-place reduction using shared memory
- the original vector is in device global memory
- shared memory is used to hold the partial sums
- each iteration brings the partial sums closer to the final sum
- the final partial sum ends up in element 0 of shared memory
  (combining the per-block results is sketched below)
66
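Because each block only produces one partial sum, a second step is needed to combine the per-block results. A minimal host-side sketch, assuming a hypothetical reduce_kernel(in,out,count) in which each block sums up to blockSize of its elements (guarding reads past count) and writes the partial result to out[blockIdx.x]:
//hypothetical host-side driver: reduce n elements to a single value
float reduce(float* d_input,float* d_buf0,float* d_buf1,int n,int blockSize)
{
    int blocks=(n+blockSize-1)/blockSize;
    reduce_kernel<<<blocks,blockSize>>>(d_input,d_buf0,n);  //first pass over the raw input
    //ping-pong between two buffers until one partial sum remains
    float* src=d_buf0;
    float* dst=d_buf1;
    while(blocks>1)
    {
        int count=blocks;
        blocks=(blocks+blockSize-1)/blockSize;
        reduce_kernel<<<blocks,blockSize>>>(src,dst,count);
        float* tmp=src; src=dst; dst=tmp;
    }
    float result;
    cudaMemcpy(&result,src,sizeof(float),cudaMemcpyDeviceToHost);
    return result;
}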
Parallel Vector Reduction
67
Parallel Vector Reduction
//naïve parallel vector sum reduction
//raw data is loaded into shared memory
__shared__ float array[size];
....
//outer loop for parallel reduction
for(int stride=1;stride<blockDim.x;stride=stride*2)
{
//check the thread is active or not
if(threadIdx.x%(2*stride)==0)
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];
//make sure all active threads are done
__syncthreads();
}
68
Parallel Vector Reduction
[table: kernel time and bandwidth for this version]
69
Parallel Vector Reduction
Each warp traverses two control paths sequentially
- threads that do not perform the addition still cost extra
  cycles because of how divergence is implemented
No more than half of the threads are active at any time
- all odd-index threads are disabled right from the beginning
- on average fewer than 1/4 of the threads are active across all warps
- after the 5th iteration, entire warps in each block are disabled;
  this removes divergence but wastes resources
70
Parallel Vector Reduction
//naïve parallel vector summation reduction
//raw data is loaded into shared memory
__shared__ float array[size];
BAD: divergence due to interleaved branch decisions
//outer loop for parallel reduction
for(int stride=1;stride<blockDim.x;stride=stride*2)
{
//check the thread is active or not
if(threadIdx.x%(2*stride)==0)
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];
//make sure all active threads are done
__syncthreads();
}
71
Parallel Vector Reduction
//naïve parallel vector summation reduction
//raw data is loaded into shared memory
__shared__ float array[size];
first step: take out the mod operation
//outer loop for parallel reduction
for(int stride=1;stride<blockDim.x;stride=stride*2)
{
//map active threads to consecutive thread indices
int index=2*stride*threadIdx.x;
if(index<blockDim.x)
array[index]=array[index]+array[index+stride];
//make sure all active threads are done
__syncthreads();
}
72
Parallel Vector Reduction
[table: time, bandwidth, step speedup, and cumulative speedup for this version]
73
Parallel Vector Reduction
74
Parallel Vector Reduction
//naïve parallel vector summation reduction
//raw data is loaded into shared memory
__shared__ float array[size];
//outer loop for parallel reduction
for(int stride=blockDim.x/2;stride>0;stride=stride>>1)
{
//check the thread is active or not
if(threadIdx.x<stride)
array[threadIdx.x]=array[threadIdx.x]
+array[threadIdx.x+stride];
//make sure all active threads are done
__syncthreads();
}
75
Parallel Vector Reduction
[table: time, bandwidth, step speedup, and cumulative speedup for this version]
76
Parallel Vector Reduction
Idle thread problem
- half of the threads are idle on the first loop iteration!!
//raw data is loaded into shared memory
__shared__ float array[size];
Idle: only half of the threads are active, and this is wasteful!!
//outer loop for parallel reduction
for(int stride=blockDim.x/2;stride>0;stride=stride>>1)
{
//check the thread is active or not
if(threadIdx.x<stride)
array[threadIdx.x]+=array[threadIdx.x+stride];
//make sure all active threads are done
__syncthreads();
}
77
Parallel Vector Reduction
Halve the block size
- replace the original single load with two loads and a first add
- add the first two elements as the first level of the reduction
[figure: each thread block now covers twice the block size of input data]
78
Parallel Vector Reduction
Original data loading
//raw data is loaded into shared memory
__shared__ float array[size];
//load data from global memory to shared memory
array[threadIdx.x]= ivector[threadIdx.x];
New data loading: halve the original block size
//raw data is loaded into shared memory
__shared__ float array[size];
//load data from global memory to shared memory
array[threadIdx.x]= ivector[threadIdx.x]
+ ivector[threadIdx.x+blockDim.x];
79
Parallel Vector Reduction
[table: time, bandwidth, step speedup, and cumulative speedup for this version]
80
Parallel Vector Reduction
At 17 GB/s we are still far from bandwidth bound
- we know the reduction has low arithmetic intensity
- therefore a likely bottleneck is instruction overhead
- consider the address arithmetic and the loop overhead
The number of active threads decreases as the reduction proceeds
- only one warp is active once the stride is 32 or less
- instructions are SIMD-synchronous within a warp
  (the block synchronization can be removed for the last warp)
Strategy: unroll the last 6 iterations of the inner loop
81
Parallel Vector Reduction
//outer loop for parallel reduction
for(int stride=blockDim.x/2;stride>32;stride=stride>>1)
{
//check whether the thread is active or not
if(threadIdx.x<stride)
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+stride];
//make sure all active threads are done
__syncthreads();
}
//unroll the last 6 iterations and skip the synchronization
//(removes the loop instructions and the block synchronization;
// no divergent branch occurs within the last active warp)
if(threadIdx.x<32)
{
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+32];
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+16];
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 8];
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 4];
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 2];
array[threadIdx.x]=array[threadIdx.x]+array[threadIdx.x+ 1];
}
82
Parallel Vector Reduction
[table: time, bandwidth, step speedup, and cumulative speedup for this version]
Note: this saves useless work in all warps, not just the last one!!
Without unrolling, all warps execute every for-loop and if instruction
83
Parallel Vector Reduction
Completely unroll the reduction loop
- the block size is limited to 512 threads
- we stick to power-of-2 block sizes
- so the number of iterations is known in advance
It is easy to unroll for a fixed block size
- but we need a clever way to unroll for every possible size
- how can we unroll all possible block sizes at compile time?
Templates to the rescue!!
- CUDA supports C++ template parameters on device functions
84
Parallel Vector Reduction
Specify block size as a function template parameter
//use template to unroll all possible block size
template<int blockSize>
__global__ void reduce(float* ivector,float* ovector)
How to invoke the template kernel for different block sizes?
//dispatch on the runtime block size; each case instantiates one template version
switch(blockSize)
{
case 512: reduce<512><<<blockNumber,blockSize>>>(ivector,ovector); break;
case 256: reduce<256><<<blockNumber,blockSize>>>(ivector,ovector); break;
case 128: reduce<128><<<blockNumber,blockSize>>>(ivector,ovector); break;
case  64: reduce< 64><<<blockNumber,blockSize>>>(ivector,ovector); break;
case  32: reduce< 32><<<blockNumber,blockSize>>>(ivector,ovector); break;
case  16: reduce< 16><<<blockNumber,blockSize>>>(ivector,ovector); break;
case   8: reduce<  8><<<blockNumber,blockSize>>>(ivector,ovector); break;
case   4: reduce<  4><<<blockNumber,blockSize>>>(ivector,ovector); break;
case   2: reduce<  2><<<blockNumber,blockSize>>>(ivector,ovector); break;
case   1: reduce<  1><<<blockNumber,blockSize>>>(ivector,ovector); break;
}
85
Parallel Vector Reduction
//unroll the first 3 iterations conditionally
if(blockSize>=512)
{
if(threadIdx.x<256)
array[threadIdx.x]+=array[threadIdx.x+256];
__syncthreads();
}
if(blockSize>=256)
{
if(threadIdx.x<128)
array[threadIdx.x]+=array[threadIdx.x+128];
__syncthreads();
}
if(blockSize>=128)
{
if(threadIdx.x<64)
array[threadIdx.x]+=array[threadIdx.x+64];
__syncthreads();
}
86
Parallel Vector Reduction
//unroll the last active warp conditionally
if(threadIdx.x<32)
{
if(blockSize>=64) array[threadIdx.x]+=array[threadIdx.x+32];
if(blockSize>=32) array[threadIdx.x]+=array[threadIdx.x+16];
if(blockSize>=16) array[threadIdx.x]+=array[threadIdx.x+ 8];
if(blockSize>= 8) array[threadIdx.x]+=array[threadIdx.x+ 4];
if(blockSize>= 4) array[threadIdx.x]+=array[threadIdx.x+ 2];
if(blockSize>= 2) array[threadIdx.x]+=array[threadIdx.x+ 1];
}
Note: all the blockSize comparisons are evaluated at compile time!!
This results in a very efficient inner loop with no loop instructions
87
Parallel Vector Reduction
[table: time, bandwidth, step speedup, and cumulative speedup for this version]
88
Parallel Vector Reduction
Combine sequential and parallel reduction
- each thread loads and sums multiple elements into shared memory
- the final parallel part performs a tree-based reduction in shared memory
[figure: each block sweeps over several block-sized chunks of the input before the tree reduction]
89
Parallel Vector Reduction
//original data loading: two loads and one add
array[threadIdx.x]=vector[threadIdx.x]
+vector[threadIdx.x+blockDim.x];
//new data loading: each thread accumulates multiple pairs of elements
//(maxn is the total element count; a single block is assumed here for simplicity)
int index=threadIdx.x;
array[threadIdx.x]=0.0f;
while(index<maxn)
{
array[threadIdx.x]=array[threadIdx.x]
+vector[index]+vector[index+blockSize];
index=index+(blockSize<<1);
}
90
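Putting the pieces together, a sketch of the complete templated kernel combining the sequential loads, the compile-time unrolled tree, and the warp-synchronous final stage; it assumes a power-of-2 blockSize of at most 512, adds an element-count parameter n, and declares the shared array volatile so the warp-synchronous updates are not cached in registers:
//sketch of the complete templated reduction kernel
template<int blockSize>
__global__ void reduce(float* ivector,float* ovector,int n)
{
    __shared__ volatile float array[blockSize];
    int tid=threadIdx.x;
    int index=blockIdx.x*(blockSize*2)+tid;
    int gridSize=blockSize*2*gridDim.x;
    //sequential part: each thread sums many elements with a grid-sized stride
    float sum=0.0f;
    while(index<n)
    {
        sum+=ivector[index];
        if(index+blockSize<n) sum+=ivector[index+blockSize];
        index+=gridSize;
    }
    array[tid]=sum;
    __syncthreads();
    //tree-based part with the first iterations unrolled at compile time
    if(blockSize>=512){ if(tid<256) array[tid]+=array[tid+256]; __syncthreads(); }
    if(blockSize>=256){ if(tid<128) array[tid]+=array[tid+128]; __syncthreads(); }
    if(blockSize>=128){ if(tid< 64) array[tid]+=array[tid+ 64]; __syncthreads(); }
    //last warp: warp-synchronous, no __syncthreads() needed
    if(tid<32)
    {
        if(blockSize>=64) array[tid]+=array[tid+32];
        if(blockSize>=32) array[tid]+=array[tid+16];
        if(blockSize>=16) array[tid]+=array[tid+ 8];
        if(blockSize>= 8) array[tid]+=array[tid+ 4];
        if(blockSize>= 4) array[tid]+=array[tid+ 2];
        if(blockSize>= 2) array[tid]+=array[tid+ 1];
    }
    //thread 0 writes this block's partial sum back to global memory
    if(tid==0) ovector[blockIdx.x]=array[0];
}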
Parallel Vector Reduction
[table: time, bandwidth, step speedup, and cumulative speedup for this version]
91
Parallel Vector Reduction
Why is this approach also suitable for traditional parallel architectures?
92
Matrix-Matrix Multiplication
How about unrolling the matrix-matrix multiplication?
93
Reference
- Mark Harris, http://www.markmark.net/
- Wei-Chao Chen, http://www.cs.unc.edu/~ciao/
- Wen-Mei Hwu, http://impact.crhc.illinois.edu/people/current/hwu.php
94