Introduction to GPU Programming for EDA John F. Croix Cadence Design Systems, Inc. Sunil P.
Download
Report
Transcript Introduction to GPU Programming for EDA John F. Croix Cadence Design Systems, Inc. Sunil P.
Introduction to GPU
Programming for EDA
John F. Croix
Cadence Design Systems, Inc.
Sunil P. Khatri
Texas A&M University
Acknowledgements: NVIDIA, Nascentric Inc., Accelicon Inc.
Students: Kanupriya Gulati, Vinay Karkala, Kalyana Bollapalli
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
2
Outline
GPU Architecture Overview
Evolution and architecture
Peak performance
GPU and CPU interaction – practical considerations
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
3
GPU Evolution
In the early days, graphics accelerators were primitive
With VLSI technology scaling, the GPU was born
Many programmable processors to handle graphics rendering tasks
Increased peak memory bandwidths and peak performance
Goal was faster and more realistic rendering for gaming applications
Recently, several scientific communities began to leverage
these GPUs
Acceleration of graphics rendering tasks for (CRT) displays
Many hardwired graphics acceleration units
Initially used graphics APIs like OpenGL and DirectX for these tasks
GPU vendors recognized this interest
Development of C-like programming environments such as CUDA
Development of GPU architectures tuned for scientific computations
4
GPU Introduction
A GPU is essentially a commodity stream processor
GPUs, owing to their massively parallel architecture, have been
used to accelerate
Highly parallel (100s of processor cores)
Very fast (>900 GFLOPS of peak performance)
Operates in a SIMD manner. This is a key restriction
Multiple processors operate in lock-step (same instruction) but on different
data
Image/stream processing, data compression, numerical algorithms
Recently they have been used to accelerate CAD algorithms as well.
Inexpensive, off-the-shelf cards like the NVIDIA Quadro FX / 280
GTX GPU achieve impressive performance
933 GFLOPs peak performance
240 SIMD cores partitioned into 30 Multiprocessors (MPs)
4GB (Quadro) and 1GB (GTX 280) device memory with 142 GB/s bandwidth
1.4 GHz GPU operating frequency
Programmed with Compute Unified Device Architecture (CUDA) framework
5
GPU Architecture
In the GTX 280, there are 10 Thread Processing
Clusters (TPCs)
Each has 3 Streaming Multiprocessors (SMs), which we will
refer to as multiprocessors (MPs)
Each MP has 8 Streaming Processors (SPs) or Thread
Processors (TPs). We will refer to these as processors.
240 processors and 30 MPs in all!
One double-precision FP unit per SM
Source : NVIDIA
6
GPU vs CPU:
NVIDIA 280 vs Intel i7 860
CPU1
GPU
Registers
16,384 (32-bit) /
multi-processor3
128 reservation stations
Peak memory bandwidth
141.7 Gb/sec
21 Gb/sec
Peak GFLOPs
562 (float)/
77 (double)
50 (double)
Cores
240
4/8 (hyperthreaded)
Processor Clock (MHz)
1296
2800
Memory
1Gb
16Gb
Shared memory
16Kb/TPC2
N/A
Virtual memory
None
1http://ark.intel.com/Product.aspx?id=41316
2TPC
330
= Thread Processing Cluster (24 cores)
multi-processors in a 280
7
GPU vs CPU Peak Performance Trends
GPU peak performance has grown aggressively.
Hardware has kept up with Moore’s law
Source : NVIDIA
8
GPU Programming Model
The GPU is viewed as a compute device that:
Is a coprocessor (slave) to the CPU (host)
Has its own DRAM (device memory) but no virtual memory
Entire design instance may not fit on the GPU!
Kernel is a CPU-callable function. Thread is an instance of a kernel.
GPU runs many threads in parallel.
Device
Host
(CPU)
(GPU)
Kernel
Threads
(instances of
the kernel)
PCIe
Device
Memory
9
Data Transfers (CPUGPU)
GPUs and CPUs communicate via a PCIe bus
Graphics applications usually require
This communication is expensive and should be minimized for target applications
Initial data to be sent from CPU to GPU
Single transfer of processed data from GPU to CPU
General purpose computations usually require
Multiple transfers between CPU and GPU (since conditional checks on CPU)
Possibility of saturating the PCIe bus and reducing the achievable performance
Device
(GPU)
Host
(CPU)
Kernel
Threads
(instances of
the kernel)
PCIe
Device
Memory
10
GPU Threads v/s CPU Threads
GPU threads:
CPU threads:
Lightweight, small creation and scheduling overhead,
extremely fast hardware context switching
Need to issue 1000s of GPU threads to hide global
memory latencies (600-800 cycles)
Heavyweight, large scheduling overhead, slow context
switching
Multi-GPU usage requires invocation of multiple CPU
threads
Each CPU thread creates a GPU context
Context swapping is required for a CPU thread to access
GPU memory allocated by another CPU thread
11
Device Memory Space Overview
Each thread runs on a SP and has:
R/W per-thread registers (on-chip)
Limit usage (max 16K/MP)
R/W per-thread local memory (off)
R/W per-block shared memory (on)
Need to avoid bank conflicts
R/W per-grid global memory (off)
Not cached, 600-800 cycle read
Latency hidden by parallelism
and fast context switches
Main means for data transfer
from host and device
Host
Coalescing recommended
RO per-grid cached constant
and texture memory (off)
The host can R/W global,
constant and texture
memories (visible to all threads)
(Device) Grid
Block (0, 0)
Block (1, 0)
Shared Memory
Registers
Registers
Shared Memory
Registers
Registers
Thread (0, 0) Thread (1, 0)
Thread (0, 0) Thread (1, 0)
Local
Memory
Local
Memory
Local
Memory
Local
Memory
Global
Memory
Constant
Memory
Texture
Memory
Source : “NVIDIA CUDA Programming Guide” version 1.1
Outline
GPU Architecture Overview
GPU Programming
CPU threads
Conditional and Loop processing
Floating point
General GPU program structure
CUDA and OpenCL
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
13
CPU Threading
CPU
All threads are equivalent
Read/write concurrently to the same memory
Synchronization primitives required to avoid collisions
GPU (NVIDIA)
Each CPU thread maintains a unique context
GPU resources (e.g. memory, code modules, address
space) are context-specific
Each CPU thread can access a single context at once
Contexts must be exchanged between CPU threads to
share GPU resources between CPU threads
Contexts use reference counting and are automatically
destroyed
14
SIMD Conditional Processing
Unlike threads in a CPU-based program, SIMD programs cannot
follow different execution paths
Ideal scenario:
In divergent paths, some processors execute the then-block and
others the else-block
All GPU threads follow the same execution path
All processors active continuously
Program flow cannot actually diverge. All instructions are executed
The then- and else- blocks are both executed
A bit is used to enable/disable processors based on the block being
executed
Parallelism is reduced, impacting performance
15
Idle Processors
Idle CPU processors can be dynamically rescheduled
by OS
SIMD processors are not actually idle
All processors scheduled are following identical execution
paths
Disabled (idle) processors are unavailable for other work
and cannot be rescheduled
Effective utilization of processors is the programmer’s
responsibility
Scheduling is an art, not necessarily a science
Techniques will vary from chip to chip
16
Conditional Processing
…
If (condition)
{
…
}
else
{
…
}
…
17
Nested Conditional Processing
…
If (condition)
{
if (condition2)
{
…
}
else
{
…
}
}
else
{
…
}
…
18
Loop Processing
…
while (condition)
{
if (cond2)
{
…
}
}
…
19
The Cost of Memory Access
Registers are extremely fast, but are a limited resource
Cached memories also tend to be small
For large data sets, global memory provides read & write access
Accesses take between 600 and 800 clock cycles
Accesses are *not* cached
To hide memory latency, the hardware provides fast context switches when
memory is accessed
However, there must be enough computational work to do to hide the high
cost of memory access
Programmers need to be smart
Compilers often don’t provide the necessary optimizations when optimizing
for speed instead of code size
It can sometimes be cheaper to recompute a result than perform a memory
read/write
20
Conditional Processing
...
if (condition)
{
...
float a = someVar;
...
}
else
{
...
float a = someVar;
...
}
...
Access & Swap
Access & Swap
…
float a = someVar;
if (condition)
{
…
}
else
{
…
}
…
Access & Swap
21
Floating Point
GPUs are optimized for 32bit accesses
64-bit double-precision
values fetched from
memory as two 32-bit
quantities
May impact performance in
the event of memory bank
conflicts
One double-precision unit
per multi-processor1
1http://www.ddj.com/hpc-high-performance-computing/210102115
22
OpenCL vs CUDA
CUDA uses early code binding
Code is compiled with normal C/C++/FORTRAN (beta)
source code
Need CUDA occupancy calculator to determine number of
threads based on resource utilization
Library support: BLAS & FFT & DPT
OpenCL
Late binding of OpenCL code to executable
OpenCL compiler/linker embedded within application
No need for CUDA occupancy calculator
Only supports C
No libraries
23
CUDA Occupancy Calculator
24
OpenCL vs CUDA
25
General Program Structure
Initialize GPU
Create GPU context
Build GPU program
Allocate GPU memory
Transfer data from CPU to GPU
Invoke GPU functions
Transfer data from GPU to CPU
Deallocate GPU memory
Finalize GPU usage
26
Create GPU Context
CUDA
Context creation is implicit in single-threaded programs
Multiple contexts can be explicitly created
Each thread maintains a context stack
Top context is current context
Threads
Contexts can be swapped between threads
A thread can only have one context active at a time (stack)
A context cannot be shared simultaneously between
threads
OpenCL
All commands explicitly associated with a context
Must create a command queue to invoke
27
Initialize GPU
CUDA
cudaGetDeviceCount()
cudaSetDevice()
cudaGetDeviceProperties()
CUDA::CUDA(int Device) : Base()
{
mValid = false;
int DeviceCount;
cudaGetDeviceCount( &DeviceCount );
if (!DeviceCount)
{
return;
}
Device = Device == -1 ? DeviceCount - 1 : Device;
cudaSetDevice( Device );
mValid = true;
}
28
Initialize GPU
OpenCL
Context must be built before anything can be done on
the GPU
All commands are with respect to a given context
OpenCL::OpenCL(int Device) : Base()
{
init();
// Initialize class pointers to NULL
cl_int RC;
mGPUContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &RC );
size_t Bytes;
RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &Bytes );
int NumDevices = Bytes / sizeof( cl_device_id );
cl_device_id *Devices = new cl_device_id[ NumDevices ];
RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, Bytes, Devices, NULL );
mCommandQueue = clCreateCommandQueue( mGPUContext, Devices[ Device ], 0, &RC );
size_t MaxWorkItemSizes[ 256 ];
RC = clGetDeviceInfo( Devices[ Device ], CL_DEVICE_MAX_WORK_ITEM_SIZES,
sizeof( MaxWorkItemSizes ), MaxWorkItemSizes, NULL );
mMaxWorkItems = MaxWorkItemSizes[ 0 ];
mMaxWorkItemsMask = ~(mMaxWorkItems - 1);
29
Build GPU Program
CUDA
GPU code is compiled using nvcc compiler
Object code is statically bound to CPU executable
GPU code is intrinsically part of the program
Mapping of problem to threads performed at compile time
30
Build GPU Program
OpenCL
GPU code is bound at runtime to the GPU
OpenCL compiler is part of executable
Code can be source code or object code
Source code can be dynamically generated by the program
Can be stored in an external file
// Continued from constructor
char *code = shrFindFilePath( ”code.cl", "." );
size_t CodeLength = 0;
char *Source = oclLoadProgSource( myCode, "", &CodeLength );
const char *SourceCode = Source;
mProgram = clCreateProgramWithSource( mGPUContext, 1, &SourceCode,
&CodeLength, &RC );
RC = clBuildProgram( mProgram, 0, NULL, NULL, NULL, NULL );
std::free( code );
std::free( Source );
mValid = RC == CL_SUCCESS;
}
31
Allocate/Deallocate GPU Memory
CUDA
Most frequently used allocator: cudaMalloc()
Returns a memory pointer to GPU memory
Memory pointer cannot be used by CPU directly
Passed to GPU calls
void *CUDA::malloc(size_t Bytes)
{
void *Memory;
cudaError_t RC = cudaMalloc( &Memory,
Bytes );
return( RC == cudaSuccess ? Memory :
NULL );
}
void CUDA::free(void *Memory)
{
if (Memory)
{
cudaFree( Memory );
}
}
32
Allocate/Deallocate GPU Memory
OpenCL
Like all things, memory allocation explicitly performed
within a context
void *OpenCL::malloc(size_t NumBytes)
{
size_t Size = NumBytes / 32 + (NumBytes & 31 ? 1 : 0);
cl_int RC;
cl_mem Memory = clCreateBuffer( mGPUContext, CL_MEM_READ_WRITE,
Size, NULL, &RC );
return( RC == CL_SUCCESS ? Memory : NULL );
}
void OpenCL::free(void *Memory)
{
if (Memory)
{
cl_mem *Ptr = reinterpret_cast<cl_mem>( Memory );
clReleaseMemObject( Memory );
}
}
33
CPU/GPU Data Transfer
Data moved across PCIe bus
CUDA
Data transfer accomplished via cudaMemcpy() routine
Implicit synchronization point
Direction is determined by enumeration
cudaMemcpyHostToDevice
Non-blocking copies are available
cudaMemcpyDeviceToHost
Allocated memory can be bound to texture memory
cudaBindTexture
OpenCL
Memory transfer via clEnqueueWriteBuffer() and clEnqueueReadBuffer()
Synchronization controlled by parameters to calls
Default is non-blocking
34
Call GPU Functions (Kernels)
Functions in CPU are executed when invoked
GPU function calls from CPU create execution queue
CPU does not wait until GPU function completes –
command is simply queued
GPU executes commands on the queue using its own
ordering
Synchronization points cause CPU to stall to wait for GPU
return
CUDA
cudaThreadSynchronize()
35
GPU Function Calls
GPU function calls have an associated
dimensionality (which can be 1D, 2D or 3D)
CUDA
Extended language syntax to include problem dimension
Syntax
function<<<dimBlock,dimGrid>>>( arguments );
OpenCL
Must explicitly put function arguments into context
clSetKernelArg()
Invoke kernel using the context
Kernel retrieves arguments from context automatically
36
GPU Cleanup/Termination
CUDA
Manages most cleanup operations automatically as a
context is destroyed
OpenCL
Provides low-level APIs for deallocation of all resources
Invoked in order opposite to invocation
clReleaseKernel()
clReleaseProgram()
clReleaseCommandQueue()
clReleaseContext()
37
Thread Batching: Grids and Blocks
A kernel is executed as a grid of
thread blocks (aka blocks)
A thread block is a batch of
threads that can cooperate
with each other by:
Synchronizing their execution
Diverging execution results in
performance loss
Efficiently sharing data through a
low latency shared memory
Two threads from two different
blocks cannot cooperate
Host
Device
Grid 1
Kernel
1
Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
Grid 2
Kernel
2
Block (1, 1)
Thread
(0, 0)
Thread
(1, 0)
Thread
(2, 0)
Thread
(3, 0)
Thread
(4, 0)
Thread
(0, 1)
Thread
(1, 1)
Thread
(2, 1)
Thread
(3, 1)
Thread
(4, 1)
Thread
(0, 2)
Thread
(1, 2)
Thread
(2, 2)
Thread
(3, 2)
Thread
(4, 2)
Source : “NVIDIA CUDA Programming Guide” version 1.1
Block and Thread IDs
Threads and blocks have IDs
So each thread can identify what data
they will operate on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
Simplifies memory
addressing when processing
multidimensional data
Image processing
Solving PDEs on volumes
Other problems with underlying 1D,
2D or 3D geometry
Device
Grid 1
Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
Block (1, 1)
Thread
(0, 0)
Thread
(1, 0)
Thread
(2, 0)
Thread
(3, 0)
Thread
(4, 0)
Thread
(0, 1)
Thread
(1, 1)
Thread
(2, 1)
Thread
(3, 1)
Thread
(4, 1)
Thread
(0, 2)
Thread
(1, 2)
Thread
(2, 2)
Thread
(3, 2)
Thread
(4, 2)
Source : “NVIDIA CUDA Programming Guide” version 1.1
GPU Kernels
Each function is passed data to create a unique ID
Data typically specifies “spatial coordinates” of function execution processor
within the hardware
The ID is used to coordinate data access
Ensures that two threads’ accesses do not collide
CUDA function types
__global__
Callable by CPU
Cannot be called by GPU
__device__
Callable by other GPU functions
Cannot be called by CPU
CUDA expands these as inline functions via nvcc
Adds to function resource utilization
40
OpenCL Kernel Invocation
Use C++ templates to simplify argument handling
template<typename T> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, T Arg)
{
return( clSetKernelArg( Kernel, Pos, sizeof( T ), &Arg ) );
}
template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, size_t SharedSize)
{
// This routine, unlike the others, sets up shared memory by passing
// NULL in as the pointer to the variable.
return( clSetKernelArg( Kernel, Pos, SharedSize, NULL ) );
}
template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, int Arg)
{
cl_int ArgInt = Arg;
return( clSetKernelArg( Kernel, Pos, sizeof( ArgInt ), &ArgInt ) );
}
template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, float Arg)
{
cl_float ArgFloat = Arg;
return( clSetKernelArg( Kernel, Pos, sizeof( ArgFloat ), &ArgFloat ) );
}
...
template<typename T0> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0)
{
return( setArg( Kernel, 0, Arg0 ) );
}
template<typename T0, typename T1> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1)
{
return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) );
}
template<typename T0, typename T1, typename T2> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1, T2 Arg2)
{
return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) | setArg( Kernel, 2, Arg2 ) );
}
...
41
OpenCL Kernel Invocation
BLAS-like example
CUDA provides BLAS library; OpenCL doesn’t
Must write own BLAS routines in OpenCL to port between
the two easily
swap() function swaps contents of 2 vectors with differing
vector strides
void OpenCL::blasSswap(int n, float *x, int incx, float *y, int incy)
{
if (!checkBLASKernel( &mSswapKernel, "Sswap" ))
{
return;
}
mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
if (x && y)
{
if (setArgs( mSswapKernel, n, x, incx, y, incy ) == CL_SUCCESS)
{
executeBLASKernel( mSswapKernel, n );
}
}
}
42
OpenCL Kernel Invocation
BLAS support functions
bool OpenCL::checkBLASKernel(cl_kernel *Kernel, const char *KernelName)
{
if (!mValid)
{
mLastBLASStatus = Base::BLAS_NOT_INITIALIZED;
return( false );
}
if (!(*Kernel))
{
cl_int RC;
*Kernel = clCreateKernel( mProgram, KernelName, &RC );
if (RC != CL_SUCCESS)
{
mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
return( false );
}
}
return( true );
}
inline void OpenCL::executeBLASKernel(cl_kernel Kernel, int n)
{
size_t Size = n;
size_t GlobalWorkSize = Size & mMaxWorkItemsMask;
if (Size & ~mMaxWorkItemsMask)
{
GlobalWorkSize += mMaxWorkItems;
}
cl_int RC = clEnqueueNDRangeKernel( mCommandQueue, Kernel, 1, NULL, &GlobalWorkSize,
&mMaxWorkItems, 0, NULL, NULL );
clFinish( mCommandQueue );
mLastBLASStatus = (RC == CL_SUCCESS) ? Base::BLAS_SUCCESS : Base::BLAS_EXECUTION_FAILED;
}
43
OpenCL Kernels
BLAS SSWAP example
__kernel void Sswap(__global int n, __global float *x,
__global int incx, __global float *y,
__global int incy)
{
const unsigned GID = get_global_id( 0 );
if (GID < n)
{
int lx = (incx >= 0) ? 0 : ((1 - n) * incx);
int ly = (incy >= 0) ? 0 : ((1 - n) * incy);
float temp = y[ ly + GID * incy ];
y[ ly + GID * incy ] = x[ lx + GID * incx ];
x[ lx + GID * incx ] = temp;
}
}
http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf
44
CUDA Kernels
CPU
#include “kernel.cu”
...
{
const unsigned int size_x = 256;
const unsigned int size_y = 4096;
...
dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
cudaThreadSynchronize();
...
}
GPU (kernel.cu)
#define BLOCK_DIM 16
__global__ void transpose_naive(float *odata, float* idata, int width, int height)
{
unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
if (xIndex < width && yIndex < height)
{
unsigned int index_in = xIndex + width * yIndex;
unsigned int index_out = yIndex + height * xIndex;
odata[index_out] = idata[index_in];
}
}
45
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Streams and Pinned Memory
Thread Scheduling
Parallel reduction
Program partitioning
Simultaneous graphics and algorithm processing
Case Studies
Conclusion
Q&A
46
Streams
Data1
Sequence of commands
that execute serially
Allow overlapping of
memory transfers and
kernel computations
from different streams
Data2
Data1
H→D Transfers
Data2
Data1
Data2
D→H Transfers
Kernel Computation
Hides data transfer cost
Implementable in CUDA devices
with compute capability ≥ 1.1
Host memory must be of type
‘pinned’
Data1
Data2
Data1
Data2
Data1
Data2
H→D Transfers
Kernel Computation
D→H Transfers
47
Pinned Memory
Memory on the host that is mapped to device’s address space and
thus accessible directly by a kernel
Has several advantages
There is no need to allocate a block in device memory and copy data between
this block and the block in host memory; data transfers are implicitly
performed as needed by the kernel
Bandwidth between host and device memories is higher
Write-combining Memory
Type of pinned memory where individual writes are aggregated into a larger
write operation
Avoids internal L1, L2 cache writes making more cache available for rest of the
application
Is not snooped during transfers across the PCI Express bus, which can improve
transfer performance by up to 40%
48
Threads and Scheduling in GPU
GPU consists of “multiprocessors”,
each of which has many processors
A kernel is executed as a grid of blocks
Thread block is a batch of threads that
cooperate with each other by:
Synchronizing their execution
Diverging execution results in
performance loss
Efficiently sharing data through a low
latency shared memory
All threads of a block reside on the
same multiprocessor (max 1024/MP)
Number of blocks a multiprocessor can
process at once depends on register
and shared memory usage per thread
Host
Device
Grid 1
Kernel 1
Block
(0, 0)
Block
(1, 0)
Block
(2, 0)
Block
(0, 1)
Block
(1, 1)
Block
(2, 1)
Grid 2
Kernel 2
Block (1, 1)
Thread Thread Thread Thread Thread
(0, 0) (1, 0) (2, 0) (3, 0) (4, 0)
Thread Thread Thread Thread Thread
(0, 1) (1, 1) (2, 1) (3, 1) (4, 1)
Thread Thread Thread Thread Thread
(0, 2) (1, 2) (2, 2) (3, 2) (4, 2)
Source : “NVIDIA CUDA Programming Guide” version 1.1
Threads and Scheduling in GPU (contd…)
Before execution a block is split into warps
Half-warp is either first 16 or second 16 threads of a warp
Full efficiency is realized when all 16 threads of a half-warp
agree on their execution path
Branch divergence occurs if threads of a half-warp diverge
via a data dependent conditional branch
A warp is a set of 32 threads which execute the same instruction on a
MP
The half-warp serially executes each branch path taken, ignoring the
result from threads that are not on that path
Increases kernel execution time
Warps of the same block are executed in time sliced fashion
50
Program Parallelism
The GPU is designed to address applications that are
data-parallel
Parallelism is an inherent factor to determine
suitability of a problem for GPU applications
In fact, applications in which enough parallelism cannot be
exposed may be slower on a GPU in comparison to a single
threaded CPU
Since the same program is executed for each data
element, there is no sophisticated flow control
Conditional checks need to be done on the CPU
Reduce the output of all threads, transfer reduced result
to CPU which tests condition and appropriately issues
further GPU threads
Can be expensive since transfers are done over the PCIe
bus!
Parallel Reduction
Perform a reduction of the data before transferring to the CPU
Tree based reduction approach used within each thread block
3
7
1
4
4
0
5
7
3
6
1
9
Example of tree based SUM
14
11
25
Reduction decomposed into multiple kernels to reduce number of
threads issued in the later stages of tree based reduction
Level 0
syncThreads()
Level 1
52
Parallel Reduction (contd…)
Types of optimization for efficient parallel reduction
Algorithmic optimizations
Avoid divergent warps
Avoid shared memory bank conflicts – sequential
addressing
First addition during global load – halves the number of
blocks
Code optimizations
Loop unrolling
Multiple adds per thread to increase ‘arithmetic intensity’
of kernels (high ratio of computation in kernel to global read
and writes)
53
Parallel Reduction (contd…)
Example of tree based reduced sum
Shared Memory
10
Thread IDs
0
11
1
8
-1
2
1
7
-2
-1
-2
3
5
6
4
0
18
0
-2
8
7
-1
6
5
-5
2
7
10
-3
9
-2
8
5
-5
0
11
7
11
0
2
14
12
11
2
2
12
8
-3
9
7
13
11
2
2
8
0
24
-3
8
4
1
-2
1
7
-1
6
-2
8
5
17
-3
9
7
13
11
2
2
1
7
-1
6
-2
8
5
17
-3
9
7
13
11
2
2
0
41
54
Parallel Reduction (contd…)
Warp divergence removed
0
1
2
3
4
5
6
7
0
1
2
3
4
Shared Memory
10
1
8
-1
0
-2
3
5
-2
-3
2
7
0
Thread IDs
0
Bank IDs
11
1
1
7
-1
0
18
3
2
-2
-2
8
5
1
1
7
-1
6
-5
-3
9
-2
8
5
-5
11
7
11
6
7
0
2
7
6
11
2
2
3
2
-3
9
7
13
11
2
2
1
0
24
5
4
5
1
7
-1
6
-2
8
5
17
-3
9
7
13
11
2
2
1
7
-1
6
-2
8
5
17
-3
9
7
13
11
2
2
0
41
55
Parallel Reduction (contd…)
Sequential Addressing
Shared Memory
Thread IDs
10
1
8
0
1
2
11
1
0
1
18
1
0
1
24
1
1
-1
0
-2
3
5
-2
-3
2
7
0
11
0
2
4
5
6
7
-2
-2
8
5
-5
-3
9
7
11
11
2
2
-1
6
-2
8
5
-5
-3
9
7
13
11
2
2
7
-1
6
-2
8
5
17
-3
9
7
13
11
2
2
7
-1
6
-2
8
5
17
-3
9
7
13
11
2
2
7
2
7
3
-1
3
0
41
56
Program Partitioning
Assume a subroutine S is invoked N times in an application
Registers y
Time T
Registers x
1
2
3
N
A multiprocessor of the GPU has 16K registers, then maximum
parallelism = 16K/x
Since GPU can do fast hardware context switches between the threads,
which share the 16K registers
However, data transfers between kernels will become a
significant overhead with increase in number of partitions
57
Simultaneous Graphics and Algorithm
Processing
If the same GPU is used for graphics and algorithmic
processing
GPU resources may be saturated by graphics application, leaving
little bandwidth for other applications
The fixed size of GPU memory (without swap space) may cause
application launch failure
Graphics tasks may cause cache pollution which may cause
erratic runtimes for general purpose applications
Run “warm up code” to flush out caches
A single kernel execution cannot be longer than 5 seconds
Using a separate GPU for graphics and computation avoids the
above listed problems
58
Outline
GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Boolean Satisfiability
Fast SPICE model evaluation
Fault Simulation
SSTA
Conclusion
Q&A
59
Guidelines for GPU Acceleration for Software
Current GPUs have an expensive communication link
to the host. Data transfers should be minimized
Streams should be used to overlap communication
and computation
Partition kernels to increase parallelism that can be
leveraged
Full efficiency is realized when all 16 threads of a
half-warp agree on their execution path
Reduce warp divergence
Avoid bank conflicts when using shared memory
Kernels should have high arithmetic intensity
60
Case Studies
Two approaches for accelerating an algorithm on the GPU
Re-architecting approach
Applicable when the problem does not have inherent SIMD nature
May require significant algorithmic modifications
Examples:
Boolean Satisfiability
Fault Dictionary Computation (not covered in this talk, slides at end)
Porting approach
Applicable when problem runtime is dominated by a subroutine,
multiple invocations of which operate upon independent data
Partition the subroutine into GPU kernels
Examples:
Accelerating SPICE by porting model evaluation on the GPU
Fault Simulation
Monte Carlo based statistical static timing analysis (SSTA)
61
Boolean Satisfiability (SAT)
Given a Boolean formula in conjunctive normal form (CNF)
Either find a satisfying truth assignment of all variables
Or prove that there is no satisfying assignment
f = ( x + z )( y + z )( x’ + y’ + z’ )
x = true
y = true
Decisions: x =Clause
true y = true
Positive
Negative
The unassigned literal z’ gets
implied because
of the unit clause rule
Literal
Literal
Implication: z = false
Iterative application of the unit clause rule is called Boolean constant
propagation (BCP)
Recent BCP based SAT solvers incorporate conflict driven learning
A learned clause represents the search space that has been pruned
62
Approach
Complete Approaches for SAT
Stochastic Approaches for SAT
Are exact, but algorithms do not easily lend themselves to parallel
implementations. Examples: GRASP, zChaff , CirCUs, MiniSAT
Can execute at high speeds, are scalable, but are not exact. Examples:
Survey Propagation, WalkSAT, RandomSAT
Present a hybrid procedure for SAT
Retains the best features of complete and stochastic approaches
Proposed algorithm is based on MiniSAT (implemented on the CPU)
The variable ordering heuristic of MiniSAT is enhanced by a survey
propagation (SP) based procedure, which is implemented on the GPU
Proposed approach is called MESP (MiniSAT enhanced with SP)
Next few slides:
Discuss the GPU based SP implementation
Describe our MESP approach
MESP
SP
MiniSAT
63
Survey Propagation (SP) based SAT
Factor Graph - graphical representation of a SAT instance
Variable nodes (variables)
Function nodes (clauses)
Is a tree if it has no cycles
( x y )( x' z )
x
y
z
SP is an algorithm in which ‘agreement’ between clauses and
variables is reached by sending probabilistic ‘messages’ along
edges of the factor graph (message passing)
Pros: highly scalable, parallelizable, exact for factor graphs that are
trees
Cons: incomplete for non-tree factor graphs
64
Survey Propagation Equations
Notation
, β are clauses; i, j are variables
V +(i) set of all clauses where i appears in the positive ‘+’ form
V -(i) set of all clauses where i appears in the negative ‘-’ form
ηα→i is a warning (a probability) from clause to variable i
Let i be in the ‘+’ form in
η’s and π’s are iteratively computed until convergence
During Computation
p
V (i)\
(1 i ) q
V (i)\
After Convergence
(1 i )
p
V (i)
(1 i )
u
i
p(1 q)
i p(1 q)
s
i
q(1 p)
i q(1 p)
*i pq
*i pq
uj
i j \ i u
j sj *j
W
()
i
i
i i *i
q
V (i)
W
()
i
(1 i )
i
i i *i
65
Survey Propagation Flowchart
Randomly initialize ηα→i
Fixed
variables &
satisfied
clauses
(ignored)
Compute π
new
Compute ηα→i
Declare
non-convergence
Y
N
new
C = Σ [ | ηα→i
- ηα→i | ≤ ε?0:1]
N
new
ηα→i←ηα→i
it++>max
N
If
contradiction,
report and
quit
C==0
Y
Call WalkSAT to determine
satisfying assignment
Y
Σ(ηα→i ) ≈ 0
N
Compute W (biases)
Sort variables in
decreasing order of W’s
Sorted
List
Fix first x% of
variables
66
Survey Propagation on the GPU
Implemented GPU kernels for the following
Compute π’s, for all variables (V ) in parallel
Compute η’s, for all clauses (C ) in parallel
In particular, computes ηα→i for each variable i in clause α
new
Check convergence (Σ[(ηα→i
- ηα→i ) ≤ ε?0:1]) using a
reduced ‘integer’ add operation over all literals in all clauses
Compute Σ( ηα→i ) (to determine if non- trivial convergence)
using a reduced ‘float’ add operation
Compute W’s, for all variables in parallel
Parallel ‘bitonic’ sort to find the largest x% of the W’s
CPU performs conditional checks, fixes variables and
executes WalkSAT
67
Data Structure on the GPU
|V |
2
1
Clause #
Literal #
Polarity
Per Variable Data (Static)
|C |
2
1
Variable #
Polarity
Per Clause Data (Static)
|C |
2
1
ηα→i
η’s Written by Clauses Read by Variables
1
2
|V |
ππ+
With 1 GB of Global memory, the 280
GTX GPU can fit instances with upto
10M clauses and 1M variables
π’s Written by Variables Read by Clauses
68
Survey Propagation on the GPU
Memory transfers between GPU and CPU
Single transfer for static per variable and per clause data
During the computation of π and η, there are no transfers
at all. All intermediate data is stored in the global memory
of the GPU
After convergence is detected, the sorted list of variables in
decreasing order of biases is transferred (GPU → CPU)
After the graph is simplified, the following are updated
(CPU → GPU)
Variables that are fixed (don’t contribute to η computation)
Clauses that are satisfied (don’t contribute to π computation)
69
Results (GPU based SP)
Inst. Name
# VARs.
# CLs.
MiniSAT
B05
Ours
Speedup
Runtime in seconds
Random_1
20,000
83,999
>2 hrs.
3009.67
172.87
17.41X
Random_2
16,000
67,199
>2 hrs.
1729.48
110.60
15.63X
Random_3
12,000
50,399
>2 hrs.
1002.48
57.98
17.29X
Random_4
8,000
33,599
>2 hrs.
369.61
5.82
63.80X
Random_5
4,000
16,799
>2 hrs.
65.01
3.69
17.62X
Uf200-07
200
860
0.15
0.20
0.08
2.50X
hole10
187
792
1.30
Contrdn.
Contrdn.
Uf200-018
200
860
0.19
No Conv
No Conv
Avg. (over 20)
MESP is compared against
22.37X
Braunstein et al. 2005 (B05) and MiniSAT which were executed on a 3.6 GHz,
3GB Intel machine running Linux
Manolios et al. 2006 (M06), which uses OpenGL on NVIDIA GTX 7900 (512 MB
memory , 128 cores, 750MHz) to implement survey propagation
For hard random instances MESP shows a 22× speedup over B05
M06 reports a 9× speedup over B05
70
MESP
SAT instance is read into MiniSAT and on the GPU (executing SP)
MiniSAT is first invoked on the instance and after it has made
some progress, it invokes GPU-based SP. MiniSAT transfers to SP
Augment the current clause database in GPU-based SP with 3
sets of learned clauses (LC) C1, C2 and C3 . L is num. of literals in LC
The current assignments and
A subset of the current learned clauses
C1 (0 < L ≤ 10) ; C2 (10 < L ≤ 25); C3 (25 < L ≤ 50)
Statically allocate enough space in GPU’s Global Memory to store
8K clauses in C1, C2 and C3 each
Messages computed over all clauses (η) are now computed in 4
separate kernels, one for each set of clauses (C1, C2, C3 and C*)
On convergence, SP (in MESP) fixes variables for which the
absolute bias difference |W (+) - W (-)| < τ
71
MESP
MiniSAT decides the next variable to assign based on Variable State
Independent Decaying Sum (VSIDS) heuristic
VSIDS chooses next decision variable with the highest activity
Activity is the variable occurrence count, with a higher weight on the
variables of the more recently added learned clauses
Activity of the variables in the learned clauses is incremented by FM
In MESP, GPU-based SP invocation can return with the following outcomes
SP converges and fixes certain variables, S
SP converges, fixes S and determines factor
graph is a tree, invokes WalkSAT. If WalkSAT
finds assignment, instance is solved. Else fixed
variables in S are returned to MiniSAT
SP converges but does not fix any variable
SP does not converge/reports contradiction
MiniSAT updates
activity of variables in
S by FSP
MiniSAT continues the
search
72
MESP
MiniSAT
(complete)
Survey Propagation
(stochastic)
Current Assignments
Subset of Learned Clauses
Initial search
GPU attempts to converge
on the SP messages
Continues
search using
updated
activities
Activity
Table
CPU
GPU
GPU works in conjunction
with CPU to fix variables
CPU instructs GPU to
ignore fixed variables and
satisfied clauses
MiniSAT’s Decision Tree
CPU
Activity updated for the variables
S that are fixed in SP
GPU
73
Results
S/U
Instance
K SAT
3 SAT
Speedup
#
VARs.
#
CLs.
MiniSAT
(k)
#
VARs.
#
CLs.
MiniSAT
(3)
MESP
MiniSAT
(k)
MiniSAT
(3)
1394694p
S
327932
1283772
29.84
530027
1890057
39.58
15.28
1.95 X
2.59 X
AProVE07
U
78607
208911
110.39
104732
287286
166.25
95.91
1.15 X
1.73 X
eijk.bs4863
S
140089
530249
487.98
234412
813218
619.03
181.86
2.68 X
3.40 X
:
:
:
:
:
:
:
:
:
:
:
eijk.S298
U
73222
283211
8.42
136731
473738
10.01
8.47
0.99 X
1.18 X
1.64 X
1.92 X
Avg (over 13)
MESP approach on GTX 280 GPU card on an Intel i7 CPU with 2.6 GHz, 9GB RAM,
and running Linux. MiniSAT run on the same CPU. Runtime in seconds
D = 1% of Number of Variables; FSP = FM = 1; C = 20; τ = 0.01
The learned clauses on the GPU were updated at every 5th invocation of SP
Up to 24K learned clauses
None of these instances were solved in MESP by an invocation to WalkSAT
74
Summary
MESP is a GPU enhanced variable ordering heuristic for SAT
GPU based survey propagation
π’s for all variables and η’s for all clauses computed in parallel
Check convergence using a reduced ‘integer’ add operation over all
literals in all clauses
Test whether non-trivial convergence uses a reduced ‘float’ add
operation
Compute biases for all variables in parallel
Parallel ‘bitonic’ sort to find the largest x% of the biases
Survey propagation enhances the variable ordering in MESP
Augment clause database on GPU with 3 sets of learned clauses
η’s for all clauses computed in 4 different kernels
On average MESP is
64% (92%) faster than MiniSAT on original (3-SAT) instance
75
SPICE Model Evaluation on a GPU
SPICE is the de facto industry standard for VLSI circuit
simulations
Significant motivation for accelerating SPICE simulations
without losing accuracy
Accelerate the computationally expensive portion of SPICE –
transistor model evaluation – on a GPU
Proposed approach is integrated into a commercial SPICE
accelerator tool OmegaSIM
Increasing complexity and size of VLSI circuits
Increasing impact of process variations on the electrical behavior of
circuits
Require Monte Carlo based simulations
Already 10-1000x faster than traditional SPICE implementations
With the proposed approach integrated, OmegaSIM achieves
a further speedup of 2.36X (3.07X) on average (max)
Approach
Profiled SPICE simulations over several benchmarks
75% of time spent in BSIM3 device model evaluations
Billions of calls to device model evaluation routines
Every device in the circuit is evaluated for every time step
Possibly repeatedly until the Newton Raphson loop for solving
non-linear equations converges
Asymptotic speedup of 4X considering Amdahl’s law.
These calls are parallelizable
Since they are independent of each other
Each call performs identical computations on different data
Conform to the GPU’s SIMD operating paradigm
Approach
CDFG-guided manual partitioning of BSIM3 evaluation code
Limitation on the available hardware resources
Registers (8192/per multiprocessor)
Shared Memory (16KB/per multiprocessor)
Bandwidth to global memory (max. sustainable is ~80 GB/s)
If entire BSIM3 model is implemented as a single kernel
Number of threads that can be issued in parallel are not enough
If BSIM3 code is partitioned into many (small) kernels
Requires large amounts of data transfer across kernels
To hide global memory access latency
Done using global memory (not cached)
Negatively impacts performance
Proposed approach
Creates CDFG of the BSIM3 equations
Uses maximally disconnected components of this graph as different
kernels, considering the above hardware limitations
Approach
Take GPU memory constraints into account
Global Memory
Texture Memory
Used to store intermediate data – which is generated by one kernel and
needed by another (instead of transferring this data to host)
Used for storing ‘runtime parameters’
Device parameters that remain unchanged throughout the simulation
Advantages
It is cached, unlike global memory
No coalescing requirements, unlike global memory
No bank conflicts, such as possible in shared memory
CUDA’s efficient built in texture fetching routines are used
Small texture memory loading overhead is easily amortized
Constant Memory used for storing physical constants
Most efficient when all threads access the same data
Experiments
Proposed approach is implemented and integrated into a
commercial SPICE accelerator tool – OmegaSIM
Hardware used:
CPU: Intel Core 2 Quad, 2.4 GHz, 4GB RAM
GPU: GeForce 8800 GTS, 128 Processors, 675 MHz, 512 MB RAM
Comparing BSIM3 model evaluation alone
# Eval.
GPU runtimes (ms)
CPU runtimes (ms)
Speedup
Proc.
Tran.
Tot.
1M
81.17
196.48
277.65
8975.63
32.33X
2M
184.91
258.79
443.7
18086.29
40.76X
Experiments - Complete SPICE Sim
Ckt. Name
#
Trans.
Total #
Evals.
CPU-alone
GPU+CPU
Industrial_1
324
1.86 X 107
49.96
34.06
1.47X
Industrial_2
1098
2.62 X 109
118.69
38.65
3.07X
Industrial_3
1098
4.30 X108
725.35
281.5
2.58X
Buf_1
500
1.62 X 107
27.45
20.26
1.35X
Buf_2
1000
5.22 X 107
111.5
48.19
2.31X
Buf_3
2000
2.13 X 108
486.6
164.96
2.95X
ClockTree_1
1922
1.86 X 108
345.69
132.59
2.61X
ClockTree_2
7682
1.92 X 108
458.98
182.88
2.51X
Avg.
OmegaSIM (s)
Speedup
2.36X
With increase in number of transistors, speedup obtained is higher
More device evaluation calls made in parallel, latencies are better hidden
High accuracy with single precision floating point implementation
Over 1M device evals. avg. (max.) error of 2.88 X 10-26 (9.0 X 10-22) Amp.
Newer devices with double precision capability already in market
Conclusions
Significant interest in accelerating SPICE
75% of the SPICE runtime spent in BSIM3 model evaluation – allows
asymptotic speedup of 4X
Our approach of accelerating model evaluation using GPUs has been
integrated with a commercial fast SPICE tool
BSIM3 model evaluation can be sped up by 30-40X over 1M-2M calls
Take GPU memory constraints into account
Obtained speedup of 2.36 X on average
Global Memory used to store intermediate data
Texture Memory used for storing ‘runtime parameters’
Constant Memory used for storing physical constants
Carefully partition kernels since
If entire BSIM3 model is implemented as a single kernel
Number of threads that can be issued in parallel are not enough to hide
global memory access latency
If BSIM3 code is partitioned into many (small) kernels
Requires large amounts of data transfer across kernels done using global
memory
Introduction – Fault Simulation
Fault Simulation (FS) is crucial in the VLSI design flow
Given a digital design and a set of vectors V, FS evaluates the number of
stuck at faults (Fsim) tested by applying V
The ratio of Fsim/Ftotal is a measure of fault coverage
Current designs have millions of logic gates
The number of faulty variations are proportional to design size
Each of these variations needs to be simulated for the V vectors
Therefore, it is important to explore ways to accelerate FS
The ideal FS approach should be
Fast
Scalable &
Cost effective
83
Approach
Implement a look up table (LUT) based FS
0 1 2 3
0
All gates’ LUTs stored in texture memory (cached)
LUTs of all library gates fit in texture cache
To avoid cache misses during lookup
Individual k-input gate LUT requires 2k entries
Each gate’s LUT entries are located at a fixed offset in the texture memory as
shown above
Gate output is obtained by
accessing the memory at the “gate offset + input value”
Example: output of AND2 gate when inputs are ‘1’ and ‘0’
84
Approach
Evaluate two vectors for the same gate in a single
thread
1/2/3/4 input gates require 4/16/64/256 entries in LUT
respectively
Our library consists of an INV and 2/3/4 input AND, NAND,
NOR and OR gates.
Hence total memory required for all LUTs is 1348 words
This fits in the texture memory cache (8KB per MP)
Exploit both fault and pattern parallelism
Fault Parallel
All gates at a fixed topological level are evaluated in parallel
Pattern Parallel
Simulations for any gate, for different patterns, are done in
parallel
85
Approach
Faulty
vector vector
2
1
vector
N
Faulty circuit
value
for vector 1
Good circuit
value
for vector 1
In practice, simulations for any gate, for different patterns, are done
in 2 phases, for all the faults which lie in its TFI only
Good
Phase 1 : Good circuit simulation. Results returned to CPU
Phase 2 : Faulty circuit simulation. CPU does not schedule a stuck-at-v
fault in a pattern which has v as the good circuit value
Fault injection also performed in parallel
86
Approach – Fault Simulation
Injection
typedef struct __align__(16){
int offset; // Gate type’s offset
int a, b, c, d; // Input values
int m0, m1; // Mask variables
} threadData;
m0 m1
Meaning
-
11
Stuck-a-1 Mask
11
00
No Fault Injection
00
00
Stuck-at-0 Mask
87
Approach – Fault Detection
typedef struct __align__(16){
int offset; // Gate type’s offset
int a, b, c, d; // Input values
int Good_Circuit_threadID; // Good circuit simulation thread ID
} threadData_Detect;
88
Approach
We maximize GPU performance by ensuring that
No data dependency exists between threads issued in parallel
The same instructions, on different data are executed by all threads
We adapt to specific G80 memory constraints
LUT stored in texture memory. Key advantages are:
Texture memory is cached
Total LUT size easily fits into available cache size of 8KB/MP
No memory coalescing requirements
Efficient built-in texture fetching routines available in CUDA
Non-zero time taken to load texture memory, but cost easily
amortized
Global memory writes for level i gates (and reads for level i+1
gates) are performed in a coalesced fashion
89
Results
Circuit
#Gates
#Faults
Comm.
GPU
Speed Up
PROJ.
Speed Up
s9234_1
1462
3883
6.190
0.134
46.067
0.022
275.754
s35932
14828
34628
51.920
1.390
37.352
0.260
199.723
s5378
1907
4821
8.390
0.155
54.052
0.025
333.344
s13207
2195
5735
14.980
0.260
57.648
0.047
320.997
:
:
:
:
:
:
:
:
b22
35280
86205
17.130
1.504
11.390
0.225
75.970
Avg (30 ckts.)
~47X
~300X
FS on 280 GTX runtimes compared to a commercial fault simulator for 30 IWLS and
ITC benchmarks
32 K patterns were simulated for all 30 circuits
CPU times obtained on a 1.5 GHz 1.5 GB UltraSPARCIV+ processor running Solaris 9
GPU time includes
Data transfer time between the GPU and CPU (both directions)
CPU → GPU : 32 K patterns, LUT data
GPU → CPU : 32 K good circuit evals. for all gates, array Detect
Processing time on the GPU
Time spent by CPU to issue good/faulty gate evaluation calls
Time spent for loading the LUTs
Conclusions
Fault simulation is accelerated using GPUs
Maximize GPU performance by ensuring that
No data dependency exists between threads issued in parallel
The same instructions, on different data are executed by all
threads
Adapt to specific G280 memory constraints
LUT stored in texture memory
Global memory writes for level i gates (and reads for level i+1
gates) are performed in a coalesced fashion
When using a Single 280 GTX GPU
Implement a pattern and fault parallel technique
47X speedup compared to commercial FS engine
When projected for a 1U NVIDIA Tesla Server
300X speedup is possible over the commercial engine
91
Introduction - SSTA
Static timing analysis (STA) is heavily used in VLSI design to estimate
circuit delay
Impact of process variations on circuit delay is increasing
Therefore, statistical STA (SSTA) was proposed
Monte Carlo (MC) based SSTA accounts for variations by
Generating N delay samples for each gate (random variable)
Executing STA for each sample
Aggregating results to generate full circuit delay under variations
MC based SSTA has several advantages over block based and path
based SSTA
It includes the effect of variations while estimating circuit delay
High accuracy, simplicity and compatibility to fabrication line data
Main disadvantage is extremely high runtime cost
92
Approach – STA
STA at a gate
a
c
Over all inputs compute the MAX of the SUM of
b
Input arrival time for input i and
Pin-to-output (P2O) rising (or falling) delay from pin i to output
For example, let
Atifall (Atirise) denote the arrival time of a falling(rising) signal at node i
MAX (D11→00, D11→01) (MAX (D11→00, D11→10)) denotes the P2O rising delay from
a to c (b to c)
ATcrise = MAX [(ATafall + MAX (D11→00 , D11→01)), ((ATbfall + MAX (D11→00 , D11→10))]
STA at a gate on the GPU
The P2O rising (or falling) delay from every input to output is stored in a lookup
table (LUT) in texture memory of GPU
For an n-input gate, do the following
Fetch n pin-to-output rising (or falling) delays from texture memory Using the
gate type offset, pin number and falling/rising delay information
n SUM computations Of the pin-to-output delay and input arrival time
n-1 MAX computations CUDA only supports 2 operand MAX operations
93
Approach – SSTA
SSTA at a gate
Need (µ , σ) for the 2n Gaussian distributions of the pin-to-output
rising and falling delay values for n inputs
Store (µ , σ) for every input in the LUT
As opposed to storing the nominal delay, as for STA
Mersenne Twister (MT) pseudo random number generator is used
The uniformly distributed random number sequences are then
transformed into the normal distribution N(0,1)
Using the Box-Muller transformations (BM)
Delay of a sample = µ + k · σ
Both algorithms, MT and BM kernels are available with the CUDA
software development kit (SDK)
For a circuit, SSTA is performed topologically from inputs to outputs
Delays of gates at logic depth i are computed, and stored in global memory
Gates at logic higher depths use this data as their input arrival times
94
Experiments – SSTA
MC based SSTA on 280 GTX runtimes compared to a CPU based
implementation for 30 large IWLS and ITC benchmarks
Monte Carlo analysis performed by using 64 K samples for all 30 circuits
CPU runtimes are computed
GPU time includes data transfer time
On 3.6 GHz, 3GB RAM Intel processor running Linux
CPU → GPU :
arrival time at each primary input
µ and σ for all pin-to-output delays of all gates
GPU → CPU:
64K delay values at each primary output
GPU time also includes the time spent in the MT and BM kernels, and
loading texture memory
Computation results have been verified for correctness
For the SLI Quad system, the runtimes are obtained by scaling the
processing times only
Transfer times are included as well (not scaled)
95
Results – SSTA
Runtime (s)
Circuit
Speedup
GPU
SLI QUAD
CPU
GPU
SLI QUAD
s9234_1
8.11
2.92
6621.16
816.64
2269.11
s35932
46.50
18.14
36174.56
778.00
1993.97
s38584
47.24
17.24
38270.72
810.19
2219.98
s13207
14.55
6.21
10633.48
731.07
1712.24
:
:
:
:
:
:
b22_1
51.50
15.51
45909.95
891.51
2959.80
~818X
~2405
Avg. (30 Ckts.)
When using a single 280 GTX GPU
~818X speedup in MC based SSTA is obtained
The SSTA runtimes are projected on a Quad GPU system
~2405X speedup is possible
96
Conclusions
We accelerate MC based SSTA using graphics processors
We take maximal advantage of the GPU’s
Raw computational power and
Huge memory bandwidths
Maximize GPU performance by ensuring that
No data dependency exists between threads issued in parallel
The same instructions, on different data are executed by all
threads
Adapt to specific G280 memory constraints
LUT stored in texture memory
Global memory writes for level i gates (and reads for
level i+1 gates) are performed in a coalesced fashion
Summary
We discussed the GPU platform, and its use in highperformance EDA applications, with case studies.
Outlined the GPU memory and processing
constraints induced by the GPU architecture.
Presented programming guidelines with sample code
fragments
Suggested tips to maximize performance of GPUbased code
Discussed case studies of EDA algorithms, and
pointed out how the code was architected for
maximum performance.
Resources
General
CUDA references
www.gpgpu.org
www.nvidia.com/object/cuda_home.html
Supported platforms: Windows, MacOS, Linux
OpenCL stuff
General OpenCL information: www.khronos.org
Apple: developer.apple.com/mac/snowleopard/opencl.html
Amd: developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx
Nvidia: http://developer.nvidia.com/object/opencl-download.html
Thank You
Fault Table Generation
Two key steps in VLSI testing and debug
Fault detection: Differentiates a faulty design from a fault free design
Fault diagnosis: Identifies and isolates a fault, to analyze the defect
causing the faulty behavior
Both detection and diagnosis require precalculated fault table
Whether vector vj can detect fault fi
Stored as matrix [aij], where aij = 1(0) if fault fi is (not) detected by
vector vj
Implemented pattern parallel approach on the GPU
Simulate several patterns simultaneously
Other parallel efforts require dynamic load balancing
Algorithm parallel: Partition fault list across many processors
Model parallel: Partition circuit into components, each assigned to one
or more processors
101
Approach
FSIM [Lee et al. 91] is an efficient fault simulator
FSIM+ is FSIM modified to compute a fault table
FSIM and FSIM+ both
Are pattern parallel, run on a single core microprocessor
Simulate a circuit in a forward levelized manner
Prune off unnecessary simulations early
New approach (GFTABLE) is an enhancement of FSIM+
Target hardware is a GPU – SIMD machine
Issue thousands of threads (T) in parallel
‘word_size × T’ patterns (packet width) computed in parallel
Hardware and software constraints are maximally satisfied
CUDA specific (memory, device utilization)
Only CPU can launch a kernel or perform efficient conditional tests
Minimize (expensive) transfers between the GPU and CPU
102
Approach
CPT
s-a-0
s-a-1
✗
✗
Stems:
Fanout Nets
p is the dominator of k Primary Outputs
CD(k) p is also the immediate dominator of k
Stem region (SR)
All gates on any path from a stem to its immediate dominator
Fanout free region (FFR)
A subcircuit induced by cutting off the fanout branches of each stem
Such subcircuits form a partition of the original netlist
103
Approach
Sensitive input
Only input of a gate driving the dominant logic value (DLV)
All inputs, when all inputs driving DLV
Critical line is the line driving the sensitive input
Critical path tracing (CPT)
Determine paths of critical lines in FFR(k) by backtracking
from the output of the FFR(k) towards its inputs
104
Approach
FS
s-a-0
1 ✗
0 ✗
s-a-1
0 ✗
D(k, p)
FD(a s-a-0, p)
FD(c s-a-0, p)
FD(b s-a-1, p)
s-a-0
Detectabilities
D(a,k) = 0
D(b,k) = 1
D(i,k) = 1
D(c,k) = 0
D(j,k) = 0
FD(a s-a-0, k) = 0
FD(c s-a-0, k) = 0
FD(b s-a-1, k) = 1
CD(k) = 0
CD(k) = 1
FD(a s-a-0, p) = FD(a s-a-0, k) D(k,p)
105
Approach
FSIM+(N) {
Initialize
while v < N do
Generate test vector; v = v + packet_width;
Perform fault free simulation
for each stem s, compute CD(s)
if (CD(s)), then compute D(s,t)
if (D(s,t)), then compute and store fault detectabilities
end while
}
106
Approach
Generate test vectors; v = v + packet_width
Mersenne Twister (MT) pseudo random number generator
Long period, efficient use of memory, good distribution properties
Perform fault free simulation
Logic_simulation_AND_2(int* MEM, int z, int a, int b) {
tx = my_thread_id
MEM[tx + z X T] = MEM[tx + a X T]
·
MEM[tx + b X T]
}
All threads evaluate the same gate for different patterns
Sort gates topologically from inputs to outputs
Fault free data for the first L gates, for all patterns, is stored in the
global memory
Avoids the need to transfer this data from the CPU
107
Approach
for each stem s, compute CD(s)
CD(s) is computed using CPT
Launch T threads in parallel
In FSIM+, gates that are not driving
any critical lines are not backtracked
on during CPT
In GFTABLE, all gates are backtracked on during CPT
The test (if gates are driving critical lines) does not help prune
99.99% of gates, due to the large packet width – T × 32 bits (T = 16K)
Large packet width is necessary in order to
Take advantage of the immense parallelism on the GPU
Reduce overhead of kernel launch or global memory access
108
Approach
if CD(s), then Compute D(s,t)
Explicit fault simulation, in the forward levelized manner
From stem s to its immediate dominator t (or PO)
Input is CD(s) XORed with fault free value at s
Injecting faults which are upstream from s and observable at s
Example:
CD(k) = 0010, Fault Free Value at k = 0000
Input applied at k = 0010 0000 = 0010
Fault simulation yields p = 0010
Fault free value at p = 0000
Therefore D(k, p) = 0010 0000 = 0010
In FSIM+, simulation of the fanout of a gate g is scheduled
Only if output at g is different from its fault free value
109
Approach
if CD(s), then Compute D(s,t)
On the GPU, a bitwise XOR operation is performed on T words of the
current output (gate evaluation) and fault free data
For the test (if result of XOR is all zero) perform a hybrid of a depth-first and
breadth-first approach
Divide T long array (of the XOR’s output) into groups of
size Q (256)
0
Compute reduced OR of data in each group into single
word which is transferred to the CPU
Avoid bank conflicts and divergent executions
Minimize global memory access latencies
Employ loop unrolling in reduction code
At the first non-zero value found on the CPU, return false
Q=3
0
0
0
0
0
1
1
0
1
1
1
These values returned to CPU
Perform this test after simulating G (20) gates
All conditional tests, on CD(s) or D(s,t), are performed in a similar manner
110
Approach
if D(s,t), then compute and store fault detectabilities
For all faults in FFR(s) of the current stem s
If fault fi is detectable at the stem s and
If stem s is detectable at a primary output, then
Fault fi is (globally) detectable
The ith row of the fault table is accordingly updated
In GFTABLE
Detectabilities are computed on the GPU
Fault Detectabilities are immediately transferred to the CPU
‘word_size × T’ bits transferred
The entire fault table is never stored on the GPU
111
Results
Circuit
FSIM+
(s)
GFTABLE
(s)
GFTABLE
v/s FSIM+
GFTABLE-TESLA
(s)
GFTABLE-TESLA
v/s FSIM+
b14
1502.47
100.87
14.90 X
17.65
85.12 X
b20
4992.73
319.82
15.61 X
55.97
89.21 X
6319.47
399.34
15.82 X
69.88
90.43 X
:
b22
Avg (20 ckts)
15.68 X
89.57 X
GFTABLE implemented on NVIDIA Quadro FX 5800
T = 16 K, word_size = 32, L = 32 K, Q = 256, G = 20
To use the global memory effectively, the FAULT_LIST is partitioned into
subsets of 1K faults, and GFTABLE executed iteratively
Allows GFTABLE to operate on circuits with arbitrary number of gates
and faults
FSIM+ run on a 32-bit, 3.6 GHz Intel CPU with 3GB RAM, running Linux
Projected runtime to NVIDIA Tesla system (8 GPUs) is 90× faster
112
Summary
Fault table is required by fault detection and diagnosis
Compute time of a fault table is very high
Fault table generation is accelerated using GPUs
All data parallel computations are performed on the GPU
All conditional statements are evaluated on the CPU
Entire fault table is never stored on the GPU’s memory
To handle larger circuits
Tree based reduced OR operation performed on the GPU, before
transferring test data to the CPU
Global memory stores a subset of the fault free data
Fault list is partitioned
Experimental results show
15× speedup over FSIM+ using a single Quadro FX 5800
Potential speedup over FSIM+ is 90× when using a Tesla GPU system
113