Introduction to GPU
Programming for EDA
John F. Croix
Cadence Design Systems, Inc.
Sunil P. Khatri
Texas A&M University
Acknowledgements: NVIDIA, Nascentric Inc., Accelicon Inc.
Students: Kanupriya Gulati, Vinay Karkala, Kalyana Bollapalli
Outline

GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
2
Outline

GPU Architecture Overview
  Evolution and architecture
  Peak performance
  GPU and CPU interaction – practical considerations
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
3
GPU Evolution

In the early days, graphics accelerators were primitive
  Acceleration of graphics rendering tasks for (CRT) displays
  Many hardwired graphics acceleration units
With VLSI technology scaling, the GPU was born
  Many programmable processors to handle graphics rendering tasks
  Increased peak memory bandwidths and peak performance
  Goal was faster and more realistic rendering for gaming applications
Recently, several scientific communities began to leverage these GPUs
  Initially used graphics APIs like OpenGL and DirectX for these tasks
GPU vendors recognized this interest
  Development of C-like programming environments such as CUDA
  Development of GPU architectures tuned for scientific computations
4
GPU Introduction

A GPU is essentially a commodity stream processor
  Highly parallel (100s of processor cores)
  Very fast (>900 GFLOPS of peak performance)
  Operates in a SIMD manner; this is a key restriction
  Multiple processors operate in lock-step (same instruction) but on different data
GPUs, owing to their massively parallel architecture, have been used to accelerate
  Image/stream processing, data compression, numerical algorithms
  Recently they have been used to accelerate CAD algorithms as well
Inexpensive, off-the-shelf cards like the NVIDIA Quadro FX / GTX 280 GPUs achieve impressive performance
  933 GFLOPs peak performance
  240 SIMD cores partitioned into 30 Multiprocessors (MPs)
  4GB (Quadro) and 1GB (GTX 280) device memory with 142 GB/s bandwidth
  1.4 GHz GPU operating frequency
  Programmed with the Compute Unified Device Architecture (CUDA) framework
5
GPU Architecture

In the GTX 280, there are 10 Thread Processing Clusters (TPCs)
  Each has 3 Streaming Multiprocessors (SMs), which we will refer to as multiprocessors (MPs)
  Each MP has 8 Streaming Processors (SPs) or Thread Processors (TPs). We will refer to these as processors.
  240 processors and 30 MPs in all!
  One double-precision FP unit per SM
Source : NVIDIA
6
GPU vs CPU: NVIDIA GTX 280 vs Intel i7 860

                        | GPU                                | CPU¹
Registers               | 16,384 (32-bit) / multiprocessor³  | 128 reservation stations
Peak memory bandwidth   | 141.7 GB/s                         | 21 GB/s
Peak GFLOPs             | 562 (float) / 77 (double)          | 50 (double)
Cores                   | 240                                | 4 / 8 (hyperthreaded)
Processor clock (MHz)   | 1296                               | 2800
Memory                  | 1 GB                               | 16 GB
Shared memory           | 16 KB / TPC²                       | N/A
Virtual memory          | None                               | Yes

¹ http://ark.intel.com/Product.aspx?id=41316
² TPC = Thread Processing Cluster (24 cores)
³ 30 multiprocessors in a GTX 280
7
GPU vs CPU Peak Performance Trends


GPU peak performance has grown aggressively.
Hardware has kept up with Moore’s law
Source : NVIDIA
8
GPU Programming Model

The GPU is viewed as a compute device that:
  Is a coprocessor (slave) to the CPU (host)
  Has its own DRAM (device memory) but no virtual memory
    Entire design instance may not fit on the GPU!
Kernel is a CPU-callable function. Thread is an instance of a kernel.
GPU runs many threads in parallel.

[Figure: the host (CPU) launches a kernel across the PCIe bus; the device (GPU) runs many threads (instances of the kernel) against its own device memory]
9
Data Transfers (CPU ↔ GPU)

GPUs and CPUs communicate via a PCIe bus
  This communication is expensive and should be minimized for target applications
Graphics applications usually require
  Initial data to be sent from CPU to GPU
  Single transfer of processed data from GPU to CPU
General purpose computations usually require
  Multiple transfers between CPU and GPU (since conditional checks run on the CPU)
  Possibility of saturating the PCIe bus and reducing the achievable performance

[Figure: host (CPU) and device (GPU) exchanging kernel threads and data over PCIe, with the GPU's device memory on the device side]
10
GPU Threads vs CPU Threads

GPU threads:
  Lightweight: small creation and scheduling overhead, extremely fast hardware context switching
  Need to issue 1000s of GPU threads to hide global memory latencies (600-800 cycles)
CPU threads:
  Heavyweight: large scheduling overhead, slow context switching
Multi-GPU usage requires invocation of multiple CPU threads
  Each CPU thread creates a GPU context
  Context swapping is required for a CPU thread to access GPU memory allocated by another CPU thread
11
Device Memory Space Overview

Each thread runs on a SP and has:
  R/W per-thread registers (on-chip)
    Limit usage (max 16K/MP)
  R/W per-thread local memory (off-chip)
  R/W per-block shared memory (on-chip)
    Need to avoid bank conflicts
  R/W per-grid global memory (off-chip)
    Not cached, 600-800 cycle read
    Latency hidden by parallelism and fast context switches
    Main means for data transfer between host and device
    Coalescing recommended
  RO per-grid cached constant and texture memory (off-chip)
The host can R/W global, constant and texture memories (visible to all threads)

[Figure: a grid of blocks on the device; each thread has registers and local memory, each block has shared memory, and all threads share the global, constant and texture memories, which the host can also access]
Source : “NVIDIA CUDA Programming Guide” version 1.1
Outline

GPU Architecture Overview
GPU Programming
  CPU threads
  Conditional and loop processing
  Floating point
  General GPU program structure
  CUDA and OpenCL
Algorithm Acceleration Guidelines
Case Studies
Conclusion
Q&A
13
CPU Threading

CPU




All threads are equivalent
Read/write concurrently to the same memory
Synchronization primitives required to avoid collisions
GPU (NVIDIA)





Each CPU thread maintains a unique context
GPU resources (e.g. memory, code modules, address
space) are context-specific
Each CPU thread can access a single context at once
Contexts must be exchanged between CPU threads to
share GPU resources between CPU threads
Contexts use reference counting and are automatically
destroyed
14
SIMD Conditional Processing

Unlike threads in a CPU-based program, SIMD programs cannot follow different execution paths
Ideal scenario:
  All GPU threads follow the same execution path
  All processors active continuously
In divergent paths, some processors execute the then-block and others the else-block
  Program flow cannot actually diverge; all instructions are executed
  The then- and else-blocks are both executed
  A bit is used to enable/disable processors based on the block being executed
  Parallelism is reduced, impacting performance
15
Idle Processors


Idle CPU processors can be dynamically rescheduled
by OS
SIMD processors are not actually idle




All processors scheduled are following identical execution
paths
Disabled (idle) processors are unavailable for other work
and cannot be rescheduled
Effective utilization of processors is the programmer’s
responsibility
Scheduling is an art, not necessarily a science

Techniques will vary from chip to chip
16
Conditional Processing
…
if (condition)
{
    …
}
else
{
    …
}
…
17
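As a concrete illustration of the divergence behavior described above, the following CUDA sketch (not from the original slides; the kernel and variable names are hypothetical) contains an if/else whose two blocks are serialized within a warp: while one block executes, the threads belonging to the other block are simply disabled.

// Minimal CUDA sketch of SIMD conditional processing (illustrative only).
__global__ void branchKernel(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n)
        return;

    float v = in[tid];
    if (v > 0.0f)            // threads with v <= 0 in the same warp sit idle here
        out[tid] = v * 2.0f;
    else                     // threads with v > 0 in the same warp sit idle here
        out[tid] = -v;
}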
Nested Conditional Processing
…
if (condition)
{
    if (condition2)
    {
        …
    }
    else
    {
        …
    }
}
else
{
    …
}
…
18
Loop Processing
…
while (condition)
{
if (cond2)
{
…
}
}
…
19
The Cost of Memory Access



Registers are extremely fast, but are a limited resource
Cached memories also tend to be small
For large data sets, global memory provides read & write access





Accesses take between 600 and 800 clock cycles
Accesses are *not* cached
To hide memory latency, the hardware provides fast context switches when
memory is accessed
However, there must be enough computational work to do to hide the high
cost of memory access
Programmers need to be smart


Compilers often don’t provide the necessary optimizations when optimizing
for speed instead of code size
It can sometimes be cheaper to recompute a result than perform a memory
read/write
20
Conditional Processing

Original: the load of someVar is duplicated in both branches, so each branch pays its own global memory access (and the context swap that hides it).

...
if (condition)
{
    ...
    float a = someVar;   // access & swap
    ...
}
else
{
    ...
    float a = someVar;   // access & swap
    ...
}
...

Better: hoist the common load above the branch so only one access & swap is paid.

…
float a = someVar;       // access & swap
if (condition)
{
    …
}
else
{
    …
}
…
21
Floating Point

GPUs are optimized for 32-bit accesses
64-bit double-precision values are fetched from memory as two 32-bit quantities
  May impact performance in the event of memory bank conflicts
One double-precision unit per multiprocessor¹

¹ http://www.ddj.com/hpc-high-performance-computing/210102115
22
OpenCL vs CUDA

CUDA uses early code binding
  Code is compiled with normal C/C++/FORTRAN (beta) source code
  Need the CUDA occupancy calculator to determine the number of threads, based on resource utilization
  Library support: BLAS & FFT & DPT
OpenCL
  Late binding of OpenCL code to executable
  OpenCL compiler/linker embedded within application
    No need for the CUDA occupancy calculator
  Only supports C
  No libraries
23
CUDA Occupancy Calculator
24
OpenCL vs CUDA
25
General Program Structure

Initialize GPU
Create GPU context
Build GPU program
Allocate GPU memory
Transfer data from CPU to GPU
Invoke GPU functions
Transfer data from GPU to CPU
Deallocate GPU memory
Finalize GPU usage
26
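A minimal CUDA sketch of this overall flow is shown below (illustrative only, not from the slides; the kernel name, buffer sizes and scaling factor are hypothetical, and error checking is omitted).

#include <cuda_runtime.h>
#include <cstdlib>

__global__ void scale(float *data, float factor, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *hostBuf = (float *)std::malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) hostBuf[i] = 1.0f;

    cudaSetDevice(0);                                   // initialize GPU (CUDA context created implicitly)
    float *devBuf;
    cudaMalloc((void **)&devBuf, n * sizeof(float));    // allocate GPU memory
    cudaMemcpy(devBuf, hostBuf, n * sizeof(float),
               cudaMemcpyHostToDevice);                 // transfer CPU -> GPU

    scale<<<(n + 255) / 256, 256>>>(devBuf, 2.0f, n);   // invoke GPU function
    cudaThreadSynchronize();                            // wait for the kernel to finish

    cudaMemcpy(hostBuf, devBuf, n * sizeof(float),
               cudaMemcpyDeviceToHost);                 // transfer GPU -> CPU
    cudaFree(devBuf);                                   // deallocate GPU memory
    std::free(hostBuf);
    return 0;                                           // context cleanup happens at process exit
}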
Create GPU Context

CUDA
  Context creation is implicit in single-threaded programs
  Multiple contexts can be explicitly created
  Each thread maintains a context stack
    Top context is the current context
  Contexts can be swapped between threads
    A thread can only have one context active at a time (stack)
    A context cannot be shared simultaneously between threads
OpenCL
  All commands explicitly associated with a context
  Must create a command queue to invoke commands
27
Initialize GPU

CUDA
  cudaGetDeviceCount()
  cudaSetDevice()
  cudaGetDeviceProperties()

CUDA::CUDA(int Device) : Base()
{
    mValid = false;
    int DeviceCount;
    cudaGetDeviceCount( &DeviceCount );
    if (!DeviceCount)
    {
        return;
    }
    Device = Device == -1 ? DeviceCount - 1 : Device;
    cudaSetDevice( Device );
    mValid = true;
}
28
Initialize GPU

OpenCL
  Context must be built before anything can be done on the GPU
  All commands are with respect to a given context

OpenCL::OpenCL(int Device) : Base()
{
    init();   // Initialize class pointers to NULL
    cl_int RC;
    mGPUContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &RC );
    size_t Bytes;
    RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &Bytes );
    int NumDevices = Bytes / sizeof( cl_device_id );
    cl_device_id *Devices = new cl_device_id[ NumDevices ];
    RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, Bytes, Devices, NULL );
    mCommandQueue = clCreateCommandQueue( mGPUContext, Devices[ Device ], 0, &RC );
    size_t MaxWorkItemSizes[ 256 ];
    RC = clGetDeviceInfo( Devices[ Device ], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                          sizeof( MaxWorkItemSizes ), MaxWorkItemSizes, NULL );
    mMaxWorkItems = MaxWorkItemSizes[ 0 ];
    mMaxWorkItemsMask = ~(mMaxWorkItems - 1);
29
Build GPU Program

CUDA


GPU code is compiled using nvcc compiler
Object code is statically bound to CPU executable
GPU code is intrinsically part of the program
 Mapping of problem to threads performed at compile time

30
Build GPU Program

OpenCL
  GPU code is bound at runtime to the GPU
    OpenCL compiler is part of the executable
  Code can be source code or object code
    Source code can be dynamically generated by the program
    Can be stored in an external file

    // Continued from constructor
    char *code = shrFindFilePath( "code.cl", "." );
    size_t CodeLength = 0;
    char *Source = oclLoadProgSource( code, "", &CodeLength );
    const char *SourceCode = Source;
    mProgram = clCreateProgramWithSource( mGPUContext, 1, &SourceCode,
                                          &CodeLength, &RC );
    RC = clBuildProgram( mProgram, 0, NULL, NULL, NULL, NULL );
    std::free( code );
    std::free( Source );
    mValid = RC == CL_SUCCESS;
}
31
Allocate/Deallocate GPU Memory

CUDA
  Most frequently used allocator: cudaMalloc()
  Returns a memory pointer to GPU memory
  Memory pointer cannot be used by the CPU directly
    Passed to GPU calls

void *CUDA::malloc(size_t Bytes)
{
    void *Memory;
    cudaError_t RC = cudaMalloc( &Memory, Bytes );
    return( RC == cudaSuccess ? Memory : NULL );
}

void CUDA::free(void *Memory)
{
    if (Memory)
    {
        cudaFree( Memory );
    }
}
32
Allocate/Deallocate GPU Memory

OpenCL
  Like all things, memory allocation is explicitly performed within a context

void *OpenCL::malloc(size_t NumBytes)
{
    size_t Size = (NumBytes + 31) & ~size_t( 31 );   // round the request up to a 32-byte multiple
    cl_int RC;
    cl_mem Memory = clCreateBuffer( mGPUContext, CL_MEM_READ_WRITE,
                                    Size, NULL, &RC );
    return( RC == CL_SUCCESS ? Memory : NULL );
}

void OpenCL::free(void *Memory)
{
    if (Memory)
    {
        clReleaseMemObject( reinterpret_cast<cl_mem>( Memory ) );
    }
}
33
CPU/GPU Data Transfer

Data moved across the PCIe bus
CUDA
  Data transfer accomplished via the cudaMemcpy() routine
    Implicit synchronization point
    Non-blocking copies are available
  Direction is determined by an enumeration
    cudaMemcpyHostToDevice
    cudaMemcpyDeviceToHost
  Allocated memory can be bound to texture memory
    cudaBindTexture
OpenCL
  Memory transfer via clEnqueueWriteBuffer() and clEnqueueReadBuffer()
  Synchronization controlled by parameters to the calls
    Default is non-blocking
34
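A hedged sketch of both transfer styles described above (illustrative only; the buffer and queue names are hypothetical and assumed to have been created as on the earlier slides):

#include <cuda_runtime.h>
#include <CL/cl.h>

// CUDA: cudaMemcpy() blocks until the copy completes.
void cudaRoundTrip(float *hostBuf, float *devBuf, size_t N)
{
    cudaMemcpy(devBuf, hostBuf, N * sizeof(float), cudaMemcpyHostToDevice);
    // ... launch kernels here ...
    cudaMemcpy(hostBuf, devBuf, N * sizeof(float), cudaMemcpyDeviceToHost);
}

// OpenCL: CL_TRUE requests a blocking transfer; pass CL_FALSE for an
// asynchronous transfer and synchronize later (e.g. with clFinish()).
void openclRoundTrip(cl_command_queue queue, cl_mem devBuf, float *hostBuf, size_t N)
{
    clEnqueueWriteBuffer(queue, devBuf, CL_TRUE, 0, N * sizeof(float), hostBuf, 0, NULL, NULL);
    // ... enqueue kernels here ...
    clEnqueueReadBuffer(queue, devBuf, CL_TRUE, 0, N * sizeof(float), hostBuf, 0, NULL, NULL);
}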
Call GPU Functions (Kernels)


Functions in CPU are executed when invoked
GPU function calls from CPU create execution queue




CPU does not wait until GPU function completes –
command is simply queued
GPU executes commands on the queue using its own
ordering
Synchronization points cause CPU to stall to wait for GPU
return
CUDA

cudaThreadSynchronize()
35
GPU Function Calls


GPU function calls have an associated
dimensionality (which can be 1D, 2D or 3D)
CUDA


Extended language syntax to include problem dimension
Syntax
function<<<dimBlock,dimGrid>>>( arguments );

OpenCL

Must explicitly put function arguments into context
clSetKernelArg()

Invoke kernel using the context

Kernel retrieves arguments from context automatically
36
GPU Cleanup/Termination

CUDA


Manages most cleanup operations automatically as a
context is destroyed
OpenCL


Provides low-level APIs for deallocation of all resources
Invoked in order opposite to invocation
clReleaseKernel()
clReleaseProgram()
clReleaseCommandQueue()
clReleaseContext()
37
Thread Batching: Grids and Blocks

A kernel is executed as a grid of thread blocks (aka blocks)
A thread block is a batch of threads that can cooperate with each other by:
  Synchronizing their execution
    Diverging execution results in performance loss
  Efficiently sharing data through a low latency shared memory
Two threads from two different blocks cannot cooperate

[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2D arrangement of blocks, and each block (e.g. Block (1,1)) is a 2D arrangement of threads]
Source : “NVIDIA CUDA Programming Guide” version 1.1
Block and Thread IDs


Threads and blocks have IDs
 So each thread can identify what data
they will operate on
 Block ID: 1D or 2D
 Thread ID: 1D, 2D, or 3D
Simplifies memory
addressing when processing
multidimensional data
 Image processing
 Solving PDEs on volumes
 Other problems with underlying 1D,
2D or 3D geometry
[Figure: a grid of blocks on the device, with Block (1,1) expanded into its 2D arrangement of threads, illustrating block IDs and thread IDs]
Source : “NVIDIA CUDA Programming Guide” version 1.1
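For example, a kernel typically combines its block and thread IDs into a global index before touching memory; the sketch below (illustrative kernels, not from the slides) shows the usual 1D and 2D patterns.

// 1D: one element per thread.
__global__ void addOne1D(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

// 2D: one pixel per thread of a width x height image stored row-major.
__global__ void addOne2D(float *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] += 1.0f;
}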
GPU Kernels

Each function is passed data to create a unique ID




Data typically specifies “spatial coordinates” of function execution processor
within the hardware
The ID is used to coordinate data access
Ensures that two threads’ accesses do not collide
CUDA function types

__global__
 Callable by CPU


Cannot be called by GPU
__device__
 Callable by other GPU functions

Cannot be called by CPU

CUDA expands these as inline functions via nvcc

Adds to function resource utilization
40
OpenCL Kernel Invocation

Use C++ templates to simplify argument handling
template<typename T> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, T Arg)
{
return( clSetKernelArg( Kernel, Pos, sizeof( T ), &Arg ) );
}
template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, size_t SharedSize)
{
// This routine, unlike the others, sets up shared memory by passing
// NULL in as the pointer to the variable.
return( clSetKernelArg( Kernel, Pos, SharedSize, NULL ) );
}
template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, int Arg)
{
cl_int ArgInt = Arg;
return( clSetKernelArg( Kernel, Pos, sizeof( ArgInt ), &ArgInt ) );
}
template<> inline cl_int setArg(cl_kernel Kernel, unsigned Pos, float Arg)
{
cl_float ArgFloat = Arg;
return( clSetKernelArg( Kernel, Pos, sizeof( ArgFloat ), &ArgFloat ) );
}
...
template<typename T0> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0)
{
return( setArg( Kernel, 0, Arg0 ) );
}
template<typename T0, typename T1> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1)
{
return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) );
}
template<typename T0, typename T1, typename T2> inline cl_int setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1, T2 Arg2)
{
return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) | setArg( Kernel, 2, Arg2 ) );
}
...
41
OpenCL Kernel Invocation

BLAS-like example



CUDA provides BLAS library; OpenCL doesn’t
Must write own BLAS routines in OpenCL to port between
the two easily
swap() function swaps contents of 2 vectors with differing
vector strides
void OpenCL::blasSswap(int n, float *x, int incx, float *y, int incy)
{
if (!checkBLASKernel( &mSswapKernel, "Sswap" ))
{
return;
}
mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
if (x && y)
{
if (setArgs( mSswapKernel, n, x, incx, y, incy ) == CL_SUCCESS)
{
executeBLASKernel( mSswapKernel, n );
}
}
}
42
OpenCL Kernel Invocation

BLAS support functions
bool OpenCL::checkBLASKernel(cl_kernel *Kernel, const char *KernelName)
{
if (!mValid)
{
mLastBLASStatus = Base::BLAS_NOT_INITIALIZED;
return( false );
}
if (!(*Kernel))
{
cl_int RC;
*Kernel = clCreateKernel( mProgram, KernelName, &RC );
if (RC != CL_SUCCESS)
{
mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
return( false );
}
}
return( true );
}
inline void OpenCL::executeBLASKernel(cl_kernel Kernel, int n)
{
size_t Size = n;
size_t GlobalWorkSize = Size & mMaxWorkItemsMask;
if (Size & ~mMaxWorkItemsMask)
{
GlobalWorkSize += mMaxWorkItems;
}
cl_int RC = clEnqueueNDRangeKernel( mCommandQueue, Kernel, 1, NULL, &GlobalWorkSize,
&mMaxWorkItems, 0, NULL, NULL );
clFinish( mCommandQueue );
mLastBLASStatus = (RC == CL_SUCCESS) ? Base::BLAS_SUCCESS : Base::BLAS_EXECUTION_FAILED;
}
43
OpenCL Kernels

BLAS SSWAP example
__kernel void Sswap(__global int n, __global float *x,
__global int incx, __global float *y,
__global int incy)
{
const unsigned GID = get_global_id( 0 );
if (GID < n)
{
int lx = (incx >= 0) ? 0 : ((1 - n) * incx);
int ly = (incy >= 0) ? 0 : ((1 - n) * incy);
float temp = y[ ly + GID * incy ];
y[ ly + GID * incy ] = x[ lx + GID * incx ];
x[ lx + GID * incx ] = temp;
}
}
http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf
44
CUDA Kernels

CPU
#include "kernel.cu"
...
{
const unsigned int size_x = 256;
const unsigned int size_y = 4096;
...
dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
cudaThreadSynchronize();
...
}

GPU (kernel.cu)
#define BLOCK_DIM 16
__global__ void transpose_naive(float *odata, float* idata, int width, int height)
{
unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
if (xIndex < width && yIndex < height)
{
unsigned int index_in = xIndex + width * yIndex;
unsigned int index_out = yIndex + height * xIndex;
odata[index_out] = idata[index_in];
}
}
45
Outline

GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
  Streams and pinned memory
  Thread scheduling
  Parallel reduction
  Program partitioning
  Simultaneous graphics and algorithm processing
Case Studies
Conclusion
Q&A
46
Streams

A stream is a sequence of commands that execute serially
Allow overlapping of memory transfers and kernel computations from different streams
  Hides the data transfer cost
Implementable on CUDA devices with compute capability ≥ 1.1
Host memory must be of type 'pinned'

[Figure: without streams, the H→D transfers, kernel computation and D→H transfers for Data1 and Data2 run back-to-back; with streams, the transfers of one chunk overlap the kernel computation of the other]
47
Pinned Memory

Memory on the host that is mapped to the device's address space and thus accessible directly by a kernel
Has several advantages
  There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel
  Bandwidth between host and device memories is higher
Write-combining memory
  Type of pinned memory where individual writes are aggregated into a larger write operation
  Avoids internal L1, L2 cache writes, making more cache available for the rest of the application
  Is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%
48
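A hedged CUDA sketch of the stream pattern above (illustrative only; the kernel, chunk sizes and two-stream split are hypothetical): the host buffers are allocated as pinned memory so that cudaMemcpyAsync() in one stream can overlap with kernel execution in the other.

#include <cuda_runtime.h>

__global__ void process(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;                    // placeholder work
}

void runChunks(int chunk, int numChunks)
{
    float *pinnedIn, *pinnedOut, *dev;
    cudaMallocHost((void **)&pinnedIn,  (size_t)chunk * numChunks * sizeof(float));  // pinned host memory
    cudaMallocHost((void **)&pinnedOut, (size_t)chunk * numChunks * sizeof(float));
    cudaMalloc((void **)&dev,           (size_t)chunk * numChunks * sizeof(float));
    // ... fill pinnedIn ...

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int c = 0; c < numChunks; c++)
    {
        cudaStream_t s = streams[c % 2];
        size_t off = (size_t)c * chunk;
        // The H->D copy, kernel and D->H copy of chunk c are queued in stream s;
        // copies in one stream overlap with kernels in the other stream.
        cudaMemcpyAsync(dev + off, pinnedIn + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        process<<<(chunk + 255) / 256, 256, 0, s>>>(dev + off, chunk);
        cudaMemcpyAsync(pinnedOut + off, dev + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
    }
    cudaThreadSynchronize();                    // wait for both streams

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(dev);
    cudaFreeHost(pinnedIn);
    cudaFreeHost(pinnedOut);
}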
Threads and Scheduling in GPU

GPU consists of "multiprocessors", each of which has many processors
A kernel is executed as a grid of blocks
A thread block is a batch of threads that cooperate with each other by:
  Synchronizing their execution
    Diverging execution results in performance loss
  Efficiently sharing data through a low latency shared memory
All threads of a block reside on the same multiprocessor (max 1024/MP)
Number of blocks a multiprocessor can process at once depends on register and shared memory usage per thread

[Figure: kernels launched by the host execute as grids of blocks on the device; each block is a 2D arrangement of threads]
Source : “NVIDIA CUDA Programming Guide” version 1.1
Threads and Scheduling in GPU (contd…)

Before execution, a block is split into warps
  A warp is a set of 32 threads which execute the same instruction on an MP
  A half-warp is either the first 16 or the second 16 threads of a warp
    Full efficiency is realized when all 16 threads of a half-warp agree on their execution path
Branch divergence occurs if threads of a half-warp diverge via a data dependent conditional branch
  The half-warp serially executes each branch path taken, ignoring the results from threads that are not on that path
  Increases kernel execution time
Warps of the same block are executed in time sliced fashion
50
Program Parallelism


The GPU is designed to address applications that are
data-parallel
Parallelism is an inherent factor to determine
suitability of a problem for GPU applications



In fact, applications in which enough parallelism cannot be
exposed may be slower on a GPU in comparison to a single
threaded CPU
Since the same program is executed for each data
element, there is no sophisticated flow control
Conditional checks need to be done on the CPU


Reduce the output of all threads, transfer reduced result
to CPU which tests condition and appropriately issues
further GPU threads
Can be expensive since transfers are done over the PCIe
bus!
Parallel Reduction

Perform a reduction of the data before transferring it to the CPU
Tree based reduction approach used within each thread block
Reduction decomposed into multiple kernels to reduce the number of threads issued in the later stages of the tree based reduction

[Figure: example of a tree based SUM over a 12-element array — adjacent elements are added at Level 0, syncThreads() is called, and the partial sums are combined at Level 1]
52
Parallel Reduction (contd…)

Types of optimization for efficient parallel reduction
  Algorithmic optimizations
    Avoid divergent warps
    Avoid shared memory bank conflicts – sequential addressing
    First addition during global load – halves the number of blocks
  Code optimizations
    Loop unrolling
    Multiple adds per thread to increase the 'arithmetic intensity' of kernels (high ratio of computation in the kernel to global reads and writes)
53
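A minimal CUDA sketch of one reduction kernel with sequential addressing (based on the standard shared-memory pattern; names are illustrative, and the loop-unrolling and first-add-during-load optimizations listed above are omitted for brevity):

__global__ void reduceSum(const float *in, float *blockSums, int n)
{
    extern __shared__ float sdata[];            // one float per thread (size set at launch)
    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;        // load one element per thread
    __syncthreads();

    // Sequential addressing: the active threads stay contiguous, so there is
    // no warp divergence and no shared-memory bank conflicts.
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];       // per-block partial sum
}

The kernel would be launched with a dynamic shared-memory size of blockDim.x * sizeof(float); the per-block partial sums are then reduced by a second kernel invocation (or on the CPU), as described on the previous slide.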
Parallel Reduction (contd…)

Example of the tree based reduced sum

[Figure: worked example of the reduction over a 16-element shared-memory array, showing the shared-memory contents and the thread IDs active at each step (interleaved addressing); the array reduces to the final sum 41 held by thread 0]
54
Parallel Reduction (contd…)

Warp divergence removed

[Figure: the same reduction example with the thread-to-data mapping rearranged so that the active threads are contiguous (thread IDs and bank IDs shown at each step); the array again reduces to 41, now without divergent warps]
55
Parallel Reduction (contd…)

Sequential addressing

[Figure: the reduction example with sequential addressing — each active thread adds an element from the second half of the active range to its own, halving the range at every step until thread 0 holds the final sum 41; this layout also avoids shared memory bank conflicts]
56
Program Partitioning

Assume a subroutine S is invoked N times in an application

[Figure: the N invocations (1, 2, 3, …, N) of S mapped onto GPU threads, annotated with the runtime T and the per-thread register usage (x registers, reduced to y registers after partitioning)]

A multiprocessor of the GPU has 16K registers, so the maximum parallelism = 16K/x
  The GPU can do fast hardware context switches between the threads, which share the 16K registers
However, data transfers between kernels will become a significant overhead as the number of partitions increases
57
Simultaneous Graphics and Algorithm
Processing

If the same GPU is used for graphics and algorithmic
processing





GPU resources may be saturated by graphics application, leaving
little bandwidth for other applications
The fixed size of GPU memory (without swap space) may cause
application launch failure
Graphics tasks may cause cache pollution which may cause
erratic runtimes for general purpose applications
 Run “warm up code” to flush out caches
A single kernel execution cannot be longer than 5 seconds
Using a separate GPU for graphics and computation avoids the
above listed problems
58
Outline

GPU Architecture Overview
GPU Programming
Algorithm Acceleration Guidelines
Case Studies
  Boolean Satisfiability
  Fast SPICE model evaluation
  Fault Simulation
  SSTA
Conclusion
Q&A
59
Guidelines for GPU Acceleration for Software




Current GPUs have an expensive communication link
to the host. Data transfers should be minimized
Streams should be used to overlap communication
and computation
Partition kernels to increase parallelism that can be
leveraged
Full efficiency is realized when all 16 threads of a
half-warp agree on their execution path



Reduce warp divergence
Avoid bank conflicts when using shared memory
Kernels should have high arithmetic intensity
60
Case Studies

Two approaches for accelerating an algorithm on the GPU

Re-architecting approach
Applicable when the problem does not have inherent SIMD nature
 May require significant algorithmic modifications
 Examples:




Boolean Satisfiability
Fault Dictionary Computation (not covered in this talk, slides at end)
Porting approach
Applicable when problem runtime is dominated by a subroutine,
multiple invocations of which operate upon independent data
 Partition the subroutine into GPU kernels
 Examples:




Accelerating SPICE by porting model evaluation on the GPU
Fault Simulation
Monte Carlo based statistical static timing analysis (SSTA)
61
Boolean Satisfiability (SAT)

Given a Boolean formula in conjunctive normal form (CNF)
  Either find a satisfying truth assignment of all variables
  Or prove that there is no satisfying assignment

f = ( x + z )( y + z )( x' + y' + z' )

Decisions: x = true, y = true
  The unassigned literal z' gets implied because of the unit clause rule
  Implication: z = false
Iterative application of the unit clause rule is called Boolean constraint propagation (BCP)
Recent BCP based SAT solvers incorporate conflict driven learning
  A learned clause represents the search space that has been pruned
62
Approach

Complete approaches for SAT
  Are exact, but the algorithms do not easily lend themselves to parallel implementations. Examples: GRASP, zChaff, CirCUs, MiniSAT
Stochastic approaches for SAT
  Can execute at high speeds, are scalable, but are not exact. Examples: Survey Propagation, WalkSAT, RandomSAT
Present a hybrid procedure for SAT
  Retains the best features of complete and stochastic approaches
  Proposed algorithm is based on MiniSAT (implemented on the CPU)
  The variable ordering heuristic of MiniSAT is enhanced by a survey propagation (SP) based procedure, which is implemented on the GPU
  Proposed approach is called MESP (MiniSAT enhanced with SP)
Next few slides:
  Discuss the GPU based SP implementation
  Describe our MESP approach

[Figure: MESP = MiniSAT (CPU) + SP (GPU)]
63
Survey Propagation (SP) based SAT

Factor graph – a graphical representation of a SAT instance
  Variable nodes (variables)
  Function nodes (clauses)
  Is a tree if it has no cycles

[Figure: factor graph of ( x + y )( x' + z ) — two function nodes connected to the variable nodes x, y and z]

SP is an algorithm in which 'agreement' between clauses and variables is reached by sending probabilistic 'messages' along edges of the factor graph (message passing)
  Pros: highly scalable, parallelizable, exact for factor graphs that are trees
  Cons: incomplete for non-tree factor graphs
64
Survey Propagation Equations

Notation
  α, β are clauses; i, j are variables
  V+(i): set of all clauses where i appears in the positive '+' form
  V−(i): set of all clauses where i appears in the negative '−' form
  η_{α→i} is a warning (a probability) from clause α to variable i
    Let i be in the '+' form in α
  η's and π's are iteratively computed until convergence

During computation:
  $p = \prod_{\beta \in V^{+}(i)\setminus\alpha} (1 - \eta_{\beta\to i}), \qquad q = \prod_{\beta \in V^{-}(i)\setminus\alpha} (1 - \eta_{\beta\to i})$
  $\pi^{u}_{i\to\alpha} = p\,(1-q), \qquad \pi^{s}_{i\to\alpha} = q\,(1-p), \qquad \pi^{*}_{i\to\alpha} = p\,q$
  $\eta_{\alpha\to i} = \prod_{j \in V(\alpha)\setminus i} \frac{\pi^{u}_{j\to\alpha}}{\pi^{u}_{j\to\alpha} + \pi^{s}_{j\to\alpha} + \pi^{*}_{j\to\alpha}}$

After convergence:
  $p = \prod_{\beta \in V^{+}(i)} (1 - \eta_{\beta\to i}), \qquad q = \prod_{\beta \in V^{-}(i)} (1 - \eta_{\beta\to i})$
  $\pi^{u}_{i} = p\,(1-q), \qquad \pi^{s}_{i} = q\,(1-p), \qquad \pi^{*}_{i} = p\,q$
  $W^{(+)}_{i} = \frac{\pi^{u}_{i}}{\pi^{u}_{i} + \pi^{s}_{i} + \pi^{*}_{i}}, \qquad W^{(-)}_{i} = \frac{\pi^{s}_{i}}{\pi^{u}_{i} + \pi^{s}_{i} + \pi^{*}_{i}}$
65
Survey Propagation Flowchart

[Flowchart] Randomly initialize the η_{α→i}; then iterate: compute the π's, compute the new η_{α→i}, and evaluate C = Σ [ |η_{α→i}^new − η_{α→i}| ≤ ε ? 0 : 1 ]. If C ≠ 0, set η_{α→i} ← η_{α→i}^new and repeat; if the iteration count exceeds the maximum, declare non-convergence. On convergence (C == 0): if Σ(η_{α→i}) ≈ 0, call WalkSAT to determine a satisfying assignment; otherwise compute the biases W, sort the variables in decreasing order of W, and fix the first x% of variables. Fixed variables and satisfied clauses are ignored in subsequent iterations; if a contradiction is found, report it and quit.
66
Survey Propagation on the GPU

Implemented GPU kernels for the following
  Compute π's, for all variables (V) in parallel
  Compute η's, for all clauses (C) in parallel
    In particular, computes η_{α→i} for each variable i in clause α
  Check convergence (Σ [ |η_{α→i}^new − η_{α→i}| ≤ ε ? 0 : 1 ]) using a reduced 'integer' add operation over all literals in all clauses
  Compute Σ(η_{α→i}) (to determine if the convergence is non-trivial) using a reduced 'float' add operation
  Compute W's, for all variables in parallel
  Parallel 'bitonic' sort to find the largest x% of the W's
CPU performs conditional checks, fixes variables and executes WalkSAT
67
Data Structure on the GPU

Per-variable data (static), |V| entries: clause #, literal #, polarity
Per-clause data (static), |C| entries: variable #, polarity
η's: |C| entries (η_{α→i}), written by clauses, read by variables
π's: |V| entries (π+, π−), written by variables, read by clauses

With 1 GB of global memory, the GTX 280 GPU can fit instances with up to 10M clauses and 1M variables
68
Survey Propagation on the GPU

Memory transfers between GPU and CPU




Single transfer for static per variable and per clause data
During the computation of π and η, there are no transfers
at all. All intermediate data is stored in the global memory
of the GPU
After convergence is detected, the sorted list of variables in
decreasing order of biases is transferred (GPU → CPU)
After the graph is simplified, the following are updated
(CPU → GPU)
Variables that are fixed (don’t contribute to η computation)
 Clauses that are satisfied (don’t contribute to π computation)

69
Results (GPU based SP)

Runtime in seconds

Inst. Name | # VARs. | # CLs. | MiniSAT | B05      | Ours     | Speedup
Random_1   | 20,000  | 83,999 | >2 hrs. | 3009.67  | 172.87   | 17.41X
Random_2   | 16,000  | 67,199 | >2 hrs. | 1729.48  | 110.60   | 15.63X
Random_3   | 12,000  | 50,399 | >2 hrs. | 1002.48  |  57.98   | 17.29X
Random_4   |  8,000  | 33,599 | >2 hrs. |  369.61  |   5.82   | 63.80X
Random_5   |  4,000  | 16,799 | >2 hrs. |   65.01  |   3.69   | 17.62X
Uf200-07   |    200  |    860 |    0.15 |    0.20  |   0.08   |  2.50X
hole10     |    187  |    792 |    1.30 | Contrdn. | Contrdn. |
Uf200-018  |    200  |    860 |    0.19 | No Conv  | No Conv  |
Avg. (over 20)                                                | 22.37X

MESP is compared against
  Braunstein et al. 2005 (B05) and MiniSAT, which were executed on a 3.6 GHz, 3GB Intel machine running Linux
  Manolios et al. 2006 (M06), which uses OpenGL on an NVIDIA GTX 7900 (512 MB memory, 128 cores, 750 MHz) to implement survey propagation
For hard random instances MESP shows a 22× speedup over B05
  M06 reports a 9× speedup over B05
70
MESP

SAT instance is read into MiniSAT and onto the GPU (executing SP)
MiniSAT is first invoked on the instance and, after it has made some progress, it invokes GPU-based SP. MiniSAT transfers to SP
  The current assignments and
  A subset of the current learned clauses
Augment the current clause database in GPU-based SP with 3 sets of learned clauses (LC) C1, C2 and C3. L is the number of literals in an LC
  C1 (0 < L ≤ 10); C2 (10 < L ≤ 25); C3 (25 < L ≤ 50)
  Statically allocate enough space in the GPU's global memory to store 8K clauses in each of C1, C2 and C3
Messages computed over all clauses (η) are now computed in 4 separate kernels, one for each set of clauses (C1, C2, C3 and C*)
On convergence, SP (in MESP) fixes variables for which the absolute bias difference |W(+) − W(−)| < τ
71
MESP

MiniSAT decides the next variable to assign based on the Variable State Independent Decaying Sum (VSIDS) heuristic
  VSIDS chooses the next decision variable with the highest activity
  Activity is the variable occurrence count, with a higher weight on the variables of the more recently added learned clauses
  Activity of the variables in the learned clauses is incremented by FM
In MESP, a GPU-based SP invocation can return with the following outcomes:
  SP converges and fixes certain variables, S → MiniSAT updates the activity of the variables in S by FSP
  SP converges, fixes S and determines that the factor graph is a tree, and invokes WalkSAT. If WalkSAT finds an assignment, the instance is solved; else the fixed variables in S are returned to MiniSAT
  SP converges but does not fix any variable → MiniSAT continues the search
  SP does not converge / reports a contradiction → MiniSAT continues the search
72
MESP

[Figure: interaction between MiniSAT (complete, on the CPU) and Survey Propagation (stochastic, on the GPU). MiniSAT performs the initial search and passes the current assignments and a subset of learned clauses to the GPU; the GPU attempts to converge on the SP messages and works in conjunction with the CPU to fix variables; the CPU instructs the GPU to ignore fixed variables and satisfied clauses; the activity table is updated for the variables S fixed by SP, and MiniSAT continues its search on its decision tree using the updated activities]
73
Results

Runtime in seconds

Instance    | S/U | K-SAT # VARs. | K-SAT # CLs. | MiniSAT(k) | 3-SAT # VARs. | 3-SAT # CLs. | MiniSAT(3) | MESP   | Speedup vs MiniSAT(k) | Speedup vs MiniSAT(3)
1394694p    | S   | 327932        | 1283772      | 29.84      | 530027        | 1890057      | 39.58      | 15.28  | 1.95 X | 2.59 X
AProVE07    | U   | 78607         | 208911       | 110.39     | 104732        | 287286       | 166.25     | 95.91  | 1.15 X | 1.73 X
eijk.bs4863 | S   | 140089        | 530249       | 487.98     | 234412        | 813218       | 619.03     | 181.86 | 2.68 X | 3.40 X
:           | :   | :             | :            | :          | :             | :            | :          | :      | :      | :
eijk.S298   | U   | 73222         | 283211       | 8.42       | 136731        | 473738       | 10.01      | 8.47   | 0.99 X | 1.18 X
Avg (over 13)                                                                                                       | 1.64 X | 1.92 X

MESP approach run on a GTX 280 GPU card on an Intel i7 CPU with 2.6 GHz, 9GB RAM, running Linux. MiniSAT run on the same CPU. Runtime in seconds
D = 1% of the number of variables; FSP = FM = 1; C = 20; τ = 0.01
The learned clauses on the GPU were updated at every 5th invocation of SP
  Up to 24K learned clauses
None of these instances were solved in MESP by an invocation of WalkSAT
74
Summary

MESP is a GPU enhanced variable ordering heuristic for SAT

GPU based survey propagation
π’s for all variables and η’s for all clauses computed in parallel
 Check convergence using a reduced ‘integer’ add operation over all
literals in all clauses
 Test whether non-trivial convergence uses a reduced ‘float’ add
operation
 Compute biases for all variables in parallel
 Parallel ‘bitonic’ sort to find the largest x% of the biases


Survey propagation enhances the variable ordering in MESP
Augment clause database on GPU with 3 sets of learned clauses
 η’s for all clauses computed in 4 different kernels


On average MESP is

64% (92%) faster than MiniSAT on original (3-SAT) instance
75
SPICE Model Evaluation on a GPU

SPICE is the de facto industry standard for VLSI circuit simulations
Significant motivation for accelerating SPICE simulations without losing accuracy
  Increasing complexity and size of VLSI circuits
  Increasing impact of process variations on the electrical behavior of circuits
    Requires Monte Carlo based simulations
Accelerate the computationally expensive portion of SPICE – transistor model evaluation – on a GPU
Proposed approach is integrated into a commercial SPICE accelerator tool, OmegaSIM
  Already 10-1000x faster than traditional SPICE implementations
  With the proposed approach integrated, OmegaSIM achieves a further speedup of 2.36X (3.07X) on average (max)
Approach

Profiled SPICE simulations over several benchmarks




75% of time spent in BSIM3 device model evaluations
Billions of calls to device model evaluation routines
 Every device in the circuit is evaluated for every time step
 Possibly repeatedly until the Newton Raphson loop for solving
non-linear equations converges
Asymptotic speedup of 4X considering Amdahl’s law.
These calls are parallelizable


Since they are independent of each other
Each call performs identical computations on different data
 Conform to the GPU’s SIMD operating paradigm
Approach

CDFG-guided manual partitioning of BSIM3 evaluation code

Limitation on the available hardware resources




Registers (8192/per multiprocessor)
Shared Memory (16KB/per multiprocessor)
Bandwidth to global memory (max. sustainable is ~80 GB/s)
If entire BSIM3 model is implemented as a single kernel

Number of threads that can be issued in parallel are not enough


If BSIM3 code is partitioned into many (small) kernels

Requires large amounts of data transfer across kernels



To hide global memory access latency
Done using global memory (not cached)
Negatively impacts performance
Proposed approach


Creates CDFG of the BSIM3 equations
Uses maximally disconnected components of this graph as different
kernels, considering the above hardware limitations
Approach

Take GPU memory constraints into account

Global Memory


Texture Memory



Used to store intermediate data – which is generated by one kernel and
needed by another (instead of transferring this data to host)
Used for storing ‘runtime parameters’
 Device parameters that remain unchanged throughout the simulation
Advantages
 It is cached, unlike global memory
 No coalescing requirements, unlike global memory
 No bank conflicts, such as possible in shared memory
 CUDA’s efficient built in texture fetching routines are used
 Small texture memory loading overhead is easily amortized
Constant Memory used for storing physical constants

Most efficient when all threads access the same data
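As an illustration of this memory mapping (a sketch under the texture-reference API of that CUDA generation, not the actual OmegaSIM code; all names and the arithmetic are hypothetical):

// Per-device 'runtime parameters' bound to a 1D texture (cached, no coalescing
// constraints); physical constants placed in constant memory.
texture<float, 1, cudaReadModeElementType> paramTex;
__constant__ float physConst[8];

__global__ void evalDevices(float *result, int paramsPerDevice, int numDevices)
{
    int d = blockIdx.x * blockDim.x + threadIdx.x;
    if (d >= numDevices)
        return;
    float p0 = tex1Dfetch(paramTex, d * paramsPerDevice + 0);   // cached texture fetch
    float p1 = tex1Dfetch(paramTex, d * paramsPerDevice + 1);
    result[d] = physConst[0] * p0 + physConst[1] * p1;          // placeholder arithmetic
}

// Host side (sketch):
//   cudaMemcpyToSymbol(physConst, hostConsts, sizeof(hostConsts));
//   cudaBindTexture(0, paramTex, devParams, numDevices * paramsPerDevice * sizeof(float));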
Experiments

Proposed approach is implemented and integrated into a commercial SPICE accelerator tool – OmegaSIM
Hardware used:
  CPU: Intel Core 2 Quad, 2.4 GHz, 4GB RAM
  GPU: GeForce 8800 GTS, 128 processors, 675 MHz, 512 MB RAM

Comparing BSIM3 model evaluation alone (runtimes in ms):

# Evals. | GPU Proc. | GPU Tran. | GPU Tot. | CPU      | Speedup
1M       |  81.17    | 196.48    | 277.65   |  8975.63 | 32.33X
2M       | 184.91    | 258.79    | 443.7    | 18086.29 | 40.76X
Experiments - Complete SPICE Simulation

OmegaSIM runtimes in seconds:

Ckt. Name    | # Trans. | Total # Evals. | CPU-alone | GPU+CPU | Speedup
Industrial_1 |   324    | 1.86 X 10^7    |  49.96    |  34.06  | 1.47X
Industrial_2 |  1098    | 2.62 X 10^9    | 118.69    |  38.65  | 3.07X
Industrial_3 |  1098    | 4.30 X 10^8    | 725.35    | 281.5   | 2.58X
Buf_1        |   500    | 1.62 X 10^7    |  27.45    |  20.26  | 1.35X
Buf_2        |  1000    | 5.22 X 10^7    | 111.5     |  48.19  | 2.31X
Buf_3        |  2000    | 2.13 X 10^8    | 486.6     | 164.96  | 2.95X
ClockTree_1  |  1922    | 1.86 X 10^8    | 345.69    | 132.59  | 2.61X
ClockTree_2  |  7682    | 1.92 X 10^8    | 458.98    | 182.88  | 2.51X
Avg.                                                           | 2.36X

With an increase in the number of transistors, the speedup obtained is higher
  More device evaluation calls are made in parallel, so latencies are better hidden
High accuracy with a single precision floating point implementation
  Over 1M device evals., avg. (max.) error of 2.88 X 10^-26 (9.0 X 10^-22) Amp.
  Newer devices with double precision capability already in market
Conclusions



Significant interest in accelerating SPICE
75% of the SPICE runtime spent in BSIM3 model evaluation – allows
asymptotic speedup of 4X
Our approach of accelerating model evaluation using GPUs has been
integrated with a commercial fast SPICE tool



BSIM3 model evaluation can be sped up by 30-40X over 1M-2M calls
Take GPU memory constraints into account




Obtained speedup of 2.36 X on average
Global Memory used to store intermediate data
Texture Memory used for storing ‘runtime parameters’
Constant Memory used for storing physical constants
Carefully partition kernels since

If entire BSIM3 model is implemented as a single kernel


Number of threads that can be issued in parallel are not enough to hide
global memory access latency
If BSIM3 code is partitioned into many (small) kernels

Requires large amounts of data transfer across kernels done using global
memory
Introduction – Fault Simulation

Fault Simulation (FS) is crucial in the VLSI design flow



Given a digital design and a set of vectors V, FS evaluates the number of
stuck at faults (Fsim) tested by applying V
The ratio of Fsim/Ftotal is a measure of fault coverage
Current designs have millions of logic gates


The number of faulty variations are proportional to design size
Each of these variations needs to be simulated for the V vectors

Therefore, it is important to explore ways to accelerate FS

The ideal FS approach should be



Fast
Scalable &
Cost effective
83
Approach

Implement a look up table (LUT) based FS
All gates' LUTs stored in texture memory (cached)
  LUTs of all library gates fit in the texture cache
    To avoid cache misses during lookup
  An individual k-input gate LUT requires 2^k entries
  Each gate's LUT entries are located at a fixed offset in the texture memory
  Gate output is obtained by accessing the memory at "gate offset + input value"
    Example: the output of an AND2 gate when its inputs are '1' and '0' is read from entry 2 (binary 10) of the AND2 LUT
84
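A hedged sketch of the lookup described above (illustrative only; the encoding, texture and function names are hypothetical): the truth tables of all library gates live in one linear array bound to a texture, and a gate's output is fetched at its offset plus the integer formed from its input values.

// All gate LUTs packed into one int array, bound on the host with cudaBindTexture().
texture<int, 1, cudaReadModeElementType> lutTex;

// Example for a 2-input gate: offset is the gate type's base index in the LUT;
// the two input bits form the index 0..3 within that gate's 4-entry table.
__device__ int evalGate2(int offset, int a, int b)
{
    return tex1Dfetch(lutTex, offset + (a << 1) + b);
}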
Approach

Evaluate two vectors for the same gate in a single
thread


1/2/3/4 input gates require 4/16/64/256 entries in LUT
respectively
Our library consists of an INV and 2/3/4 input AND, NAND,
NOR and OR gates.



Hence total memory required for all LUTs is 1348 words
This fits in the texture memory cache (8KB per MP)
Exploit both fault and pattern parallelism

Fault Parallel


All gates at a fixed topological level are evaluated in parallel
Pattern Parallel

Simulations for any gate, for different patterns, are done in
parallel
85
Approach

[Figure: for each gate, a packet of patterns (vector 1, vector 2, …, vector N) holds the good circuit value and the faulty circuit value for each vector]

In practice, simulations for any gate, for different patterns, are done in 2 phases, for all the faults which lie in its TFI only
  Phase 1: Good circuit simulation. Results returned to the CPU
  Phase 2: Faulty circuit simulation. The CPU does not schedule a stuck-at-v fault in a pattern which has v as the good circuit value
Fault injection is also performed in parallel
86
Approach – Fault Simulation: Injection

typedef struct __align__(16) {
    int offset;      // Gate type's offset
    int a, b, c, d;  // Input values
    int m0, m1;      // Mask variables
} threadData;

m0 | m1 | Meaning
-  | 11 | Stuck-at-1 mask
11 | 00 | No fault injection
00 | 00 | Stuck-at-0 mask
87
Approach – Fault Detection
typedef struct __align__(16){
int offset; // Gate type’s offset
int a, b, c, d; // Input values
int Good_Circuit_threadID; // Good circuit simulation thread ID
} threadData_Detect;
88
Approach

We maximize GPU performance by ensuring that



No data dependency exists between threads issued in parallel
The same instructions, on different data are executed by all threads
We adapt to specific G80 memory constraints

LUT stored in texture memory. Key advantages are:






Texture memory is cached
Total LUT size easily fits into available cache size of 8KB/MP
No memory coalescing requirements
Efficient built-in texture fetching routines available in CUDA
Non-zero time taken to load texture memory, but cost easily
amortized
Global memory writes for level i gates (and reads for level i+1
gates) are performed in a coalesced fashion
89
Results

Runtimes in seconds:

Circuit | # Gates | # Faults | Comm. FS | GPU   | Speedup | Proj. (Tesla) | Speedup
s9234_1 |  1462   |  3883    |  6.190   | 0.134 | 46.067  | 0.022         | 275.754
s35932  | 14828   | 34628    | 51.920   | 1.390 | 37.352  | 0.260         | 199.723
s5378   |  1907   |  4821    |  8.390   | 0.155 | 54.052  | 0.025         | 333.344
s13207  |  2195   |  5735    | 14.980   | 0.260 | 57.648  | 0.047         | 320.997
:       | :       | :        | :        | :     | :       | :             | :
b22     | 35280   | 86205    | 17.130   | 1.504 | 11.390  | 0.225         | 75.970
Avg (30 ckts.)                                  | ~47X    |               | ~300X

FS on the GTX 280: runtimes compared to a commercial fault simulator for 30 IWLS and ITC benchmarks
32 K patterns were simulated for all 30 circuits
CPU times obtained on a 1.5 GHz, 1.5 GB UltraSPARC-IV+ processor running Solaris 9
GPU time includes
  Data transfer time between the GPU and CPU (both directions)
    CPU → GPU: 32 K patterns, LUT data
    GPU → CPU: 32 K good circuit evals. for all gates, array Detect
  Processing time on the GPU
  Time spent by the CPU to issue good/faulty gate evaluation calls
  Time spent loading the LUTs
Conclusions

Fault simulation is accelerated using GPUs


Maximize GPU performance by ensuring that




No data dependency exists between threads issued in parallel
The same instructions, on different data are executed by all
threads
Adapt to specific G280 memory constraints
 LUT stored in texture memory
 Global memory writes for level i gates (and reads for level i+1
gates) are performed in a coalesced fashion
When using a Single 280 GTX GPU


Implement a pattern and fault parallel technique
47X speedup compared to commercial FS engine
When projected for a 1U NVIDIA Tesla Server

300X speedup is possible over the commercial engine
91
Introduction - SSTA



Static timing analysis (STA) is heavily used in VLSI design to estimate
circuit delay
Impact of process variations on circuit delay is increasing
Therefore, statistical STA (SSTA) was proposed


Monte Carlo (MC) based SSTA accounts for variations by




Generating N delay samples for each gate (random variable)
Executing STA for each sample
Aggregating results to generate full circuit delay under variations
MC based SSTA has several advantages over block based and path
based SSTA


It includes the effect of variations while estimating circuit delay
High accuracy, simplicity and compatibility to fabrication line data
Main disadvantage is extremely high runtime cost
92
Approach – STA

STA at a gate (example: a 2-input gate with inputs a, b and output c)
  Over all inputs, compute the MAX of the SUM of
    the input arrival time for input i, and
    the pin-to-output (P2O) rising (or falling) delay from pin i to the output
  For example, let
    AT_i^fall (AT_i^rise) denote the arrival time of a falling (rising) signal at node i
    MAX(D_{11→00}, D_{11→01}) (MAX(D_{11→00}, D_{11→10})) denote the P2O rising delay from a to c (b to c)
  AT_c^rise = MAX [ (AT_a^fall + MAX(D_{11→00}, D_{11→01})), (AT_b^fall + MAX(D_{11→00}, D_{11→10})) ]
STA at a gate on the GPU
  The P2O rising (or falling) delay from every input to the output is stored in a lookup table (LUT) in the texture memory of the GPU
  For an n-input gate, do the following
    Fetch n pin-to-output rising (or falling) delays from texture memory, using the gate type offset, pin number and falling/rising delay information
    n SUM computations of the pin-to-output delay and input arrival time
    n−1 MAX computations (CUDA only supports 2-operand MAX operations)
93
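A sketch of the per-gate arrival-time computation for the 2-input example above (illustrative only; the function and parameter names are hypothetical, and the delays are assumed to have been fetched from the texture LUT as described):

// Rising arrival time at output c of a 2-input gate, given the falling arrival
// times at inputs a and b and the pin-to-output delay entries from the LUT.
__device__ float atRiseAtC(float atA_fall, float atB_fall,
                           float d11_00, float d11_01, float d11_10)
{
    float dA = fmaxf(d11_00, d11_01);            // P2O rising delay from a to c
    float dB = fmaxf(d11_00, d11_10);            // P2O rising delay from b to c
    return fmaxf(atA_fall + dA, atB_fall + dB);  // MAX over inputs of (arrival + delay)
}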
Approach – SSTA

SSTA at a gate







Need (µ , σ) for the 2n Gaussian distributions of the pin-to-output
rising and falling delay values for n inputs
Store (µ , σ) for every input in the LUT
 As opposed to storing the nominal delay, as for STA
Mersenne Twister (MT) pseudo random number generator is used
The uniformly distributed random number sequences are then
transformed into the normal distribution N(0,1)
 Using the Box-Muller transformations (BM)
Delay of a sample = µ + k · σ
Both algorithms, MT and BM kernels are available with the CUDA
software development kit (SDK)
For a circuit, SSTA is performed topologically from inputs to outputs


Delays of gates at logic depth i are computed, and stored in global memory
Gates at logic higher depths use this data as their input arrival times
94
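A sketch of the per-sample delay computation (illustrative only; it assumes the N(0,1) variates k have already been produced by the Mersenne Twister + Box-Muller kernels from the CUDA SDK and stored in an array called normals):

// Each thread draws one Monte Carlo sample of a pin-to-output delay:
// delay = mu + k * sigma, where k ~ N(0,1).
__global__ void sampleDelays(const float *mu, const float *sigma,
                             const float *normals, float *delays, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        delays[i] = mu[i] + normals[i] * sigma[i];
}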
Experiments – SSTA



MC based SSTA on 280 GTX runtimes compared to a CPU based
implementation for 30 large IWLS and ITC benchmarks
Monte Carlo analysis performed by using 64 K samples for all 30 circuits
CPU runtimes are computed


GPU time includes data transfer time





On 3.6 GHz, 3GB RAM Intel processor running Linux
CPU → GPU :
 arrival time at each primary input
 µ and σ for all pin-to-output delays of all gates
GPU → CPU:
 64K delay values at each primary output
GPU time also includes the time spent in the MT and BM kernels, and
loading texture memory
Computation results have been verified for correctness
For the SLI Quad system, the runtimes are obtained by scaling the
processing times only

Transfer times are included as well (not scaled)
95
Results – SSTA

Runtime (s) and speedup:

Circuit | GPU   | SLI Quad | CPU      | Speedup (GPU) | Speedup (SLI Quad)
s9234_1 |  8.11 |  2.92    |  6621.16 | 816.64        | 2269.11
s35932  | 46.50 | 18.14    | 36174.56 | 778.00        | 1993.97
s38584  | 47.24 | 17.24    | 38270.72 | 810.19        | 2219.98
s13207  | 14.55 |  6.21    | 10633.48 | 731.07        | 1712.24
:       | :     | :        | :        | :             | :
b22_1   | 51.50 | 15.51    | 45909.95 | 891.51        | 2959.80
Avg. (30 Ckts.)                       | ~818X         | ~2405X

When using a single GTX 280 GPU
  ~818X speedup in MC based SSTA is obtained
The SSTA runtimes are projected on a Quad GPU system
  ~2405X speedup is possible
96
Conclusions


We accelerate MC based SSTA using graphics processors
We take maximal advantage of the GPU’s



Raw computational power and
Huge memory bandwidths
Maximize GPU performance by ensuring that



No data dependency exists between threads issued in parallel
The same instructions, on different data are executed by all
threads
Adapt to specific G280 memory constraints
LUT stored in texture memory
 Global memory writes for level i gates (and reads for
level i+1 gates) are performed in a coalesced fashion

Summary

We discussed the GPU platform and its use in high-performance EDA applications, with case studies.
Outlined the GPU memory and processing constraints induced by the GPU architecture.
Presented programming guidelines with sample code fragments.
Suggested tips to maximize performance of GPU-based code.
Discussed case studies of EDA algorithms, and pointed out how the code was architected for maximum performance.
Resources

General
  www.gpgpu.org
CUDA references
  www.nvidia.com/object/cuda_home.html
  Supported platforms: Windows, MacOS, Linux
OpenCL stuff
  General OpenCL information: www.khronos.org
  Apple: developer.apple.com/mac/snowleopard/opencl.html
  AMD: developer.amd.com/gpu/ATIStreamSDK/pages/TutorialOpenCL.aspx
  NVIDIA: http://developer.nvidia.com/object/opencl-download.html
Thank You
Fault Table Generation

Two key steps in VLSI testing and debug


Fault detection: Differentiates a faulty design from a fault free design
Fault diagnosis: Identifies and isolates a fault, to analyze the defect
causing the faulty behavior

Both detection and diagnosis require precalculated fault table



Whether vector vj can detect fault fi
Stored as matrix [aij], where aij = 1(0) if fault fi is (not) detected by
vector vj
Implemented pattern parallel approach on the GPU


Simulate several patterns simultaneously
Other parallel efforts require dynamic load balancing


Algorithm parallel: Partition fault list across many processors
Model parallel: Partition circuit into components, each assigned to one
or more processors
101
Approach



FSIM [Lee et al. 91] is an efficient fault simulator
FSIM+ is FSIM modified to compute a fault table
FSIM and FSIM+ both




Are pattern parallel, run on a single core microprocessor
Simulate a circuit in a forward levelized manner
Prune off unnecessary simulations early
New approach (GFTABLE) is an enhancement of FSIM+



Target hardware is a GPU – SIMD machine
Issue thousands of threads (T) in parallel
 ‘word_size × T’ patterns (packet width) computed in parallel
Hardware and software constraints are maximally satisfied



CUDA specific (memory, device utilization)
Only CPU can launch a kernel or perform efficient conditional tests
Minimize (expensive) transfers between the GPU and CPU
102
Approach

[Figure: example circuit fragment used for critical path tracing (CPT), with s-a-0 and s-a-1 faults marked ✗; stems are fanout nets; p is the dominator of k with respect to the primary outputs, and p is also the immediate dominator of k; CD(k) is shown at stem k]

Stem region (SR)
  All gates on any path from a stem to its immediate dominator
Fanout free region (FFR)
  A subcircuit induced by cutting off the fanout branches of each stem
  Such subcircuits form a partition of the original netlist
103
Approach

Sensitive input




Only input of a gate driving the dominant logic value (DLV)
All inputs, when all inputs driving DLV
Critical line is the line driving the sensitive input
Critical path tracing (CPT)

Determine paths of critical lines in FFR(k) by backtracking
from the output of the FFR(k) towards its inputs
104
Approach

[Figure: fault simulation (FS) example — stuck-at faults in FFR(k), with the resulting detectabilities at stem k and its immediate dominator p, and the cumulative detectability CD(k)]

Detectabilities
  D(a,k) = 0, D(b,k) = 1, D(c,k) = 0, D(i,k) = 1, D(j,k) = 0
  FD(a s-a-0, k) = 0, FD(c s-a-0, k) = 0, FD(b s-a-1, k) = 1
  FD(a s-a-0, p) = FD(a s-a-0, k) · D(k,p)
105
Approach
FSIM+(N) {
    Initialize
    while v < N do
        Generate test vector; v = v + packet_width
        Perform fault free simulation
        for each stem s, compute CD(s)
        if (CD(s)), then compute D(s,t)
        if (D(s,t)), then compute and store fault detectabilities
    end while
}
106
Approach
Generate test vectors; v = v + packet_width

Mersenne Twister (MT) pseudo random number generator

Long period, efficient use of memory, good distribution properties
Perform fault free simulation
Logic_simulation_AND_2(int* MEM, int z, int a, int b) {
    tx = my_thread_id
    MEM[tx + z * T] = MEM[tx + a * T] & MEM[tx + b * T]   // bitwise AND of the packed patterns
}


All threads evaluate the same gate for different patterns
Sort gates topologically from inputs to outputs


Fault free data for the first L gates, for all patterns, is stored in the
global memory
Avoids the need to transfer this data from the CPU
107
Approach
for each stem s, compute CD(s)

CD(s) is computed using CPT
Launch T threads in parallel
In FSIM+, gates that are not driving
any critical lines are not backtracked
on during CPT

In GFTABLE, all gates are backtracked on during CPT


The test (if gates are driving critical lines) does not help prune
99.99% of gates, due to the large packet width – T × 32 bits (T = 16K)
 Large packet width is necessary in order to



Take advantage of the immense parallelism on the GPU
Reduce overhead of kernel launch or global memory access
108
Approach
if CD(s), then Compute D(s,t)

Explicit fault simulation, in the forward levelized manner
From stem s to its immediate dominator t (or PO)
 Input is CD(s) XORed with fault free value at s




Injecting faults which are upstream from s and observable at s
Example:
 CD(k) = 0010, Fault Free Value at k = 0000
 Input applied at k = 0010 ⊕ 0000 = 0010
 Fault simulation yields p = 0010
 Fault free value at p = 0000
 Therefore D(k, p) = 0010 ⊕ 0000 = 0010
In FSIM+, simulation of the fanout of a gate g is scheduled

Only if output at g is different from its fault free value
109
Approach

if CD(s), then compute D(s,t)

On the GPU, a bitwise XOR operation is performed on the T words of the current output (gate evaluation) and the fault free data
For the test (is the result of the XOR all zero?), perform a hybrid of a depth-first and breadth-first approach
  Divide the T-long array (of the XOR's output) into groups of size Q (256)
  Compute the reduced OR of the data in each group into a single word, which is transferred to the CPU
    Avoid bank conflicts and divergent executions
    Minimize global memory access latencies
    Employ loop unrolling in the reduction code
  At the first non-zero value found on the CPU, return false
Perform this test after simulating G (20) gates
All conditional tests, on CD(s) or D(s,t), are performed in a similar manner

[Figure: a T-long array split into groups of size Q = 3; the reduced OR of each group (0, 0, 1, 1, …) is returned to the CPU]
110
Approach
if D(s,t), then compute and store fault detectabilities

For all faults in FFR(s) of the current stem s
If fault fi is detectable at the stem s and
 If stem s is detectable at a primary output, then
 Fault fi is (globally) detectable
 The ith row of the fault table is accordingly updated


In GFTABLE

Detectabilities are computed on the GPU
 Fault Detectabilities are immediately transferred to the CPU
 ‘word_size × T’ bits transferred
 The entire fault table is never stored on the GPU
111
Results

Circuit | FSIM+ (s) | GFTABLE (s) | GFTABLE vs FSIM+ | GFTABLE-TESLA (s) | GFTABLE-TESLA vs FSIM+
b14     | 1502.47   | 100.87      | 14.90 X          | 17.65             | 85.12 X
b20     | 4992.73   | 319.82      | 15.61 X          | 55.97             | 89.21 X
b22     | 6319.47   | 399.34      | 15.82 X          | 69.88             | 90.43 X
:       | :         | :           | :                | :                 | :
Avg (20 ckts)                     | 15.68 X          |                   | 89.57 X

GFTABLE implemented on an NVIDIA Quadro FX 5800
  T = 16 K, word_size = 32, L = 32 K, Q = 256, G = 20
  To use the global memory effectively, the FAULT_LIST is partitioned into subsets of 1K faults, and GFTABLE executed iteratively
    Allows GFTABLE to operate on circuits with an arbitrary number of gates and faults
FSIM+ run on a 32-bit, 3.6 GHz Intel CPU with 3GB RAM, running Linux
Projected runtime on an NVIDIA Tesla system (8 GPUs) is 90× faster
112
Summary

Fault table is required by fault detection and diagnosis


Compute time of a fault table is very high
Fault table generation is accelerated using GPUs


All data parallel computations are performed on the GPU
All conditional statements are evaluated on the CPU



Entire fault table is never stored on the GPU’s memory
To handle larger circuits



Tree based reduced OR operation performed on the GPU, before
transferring test data to the CPU
Global memory stores a subset of the fault free data
Fault list is partitioned
Experimental results show


15× speedup over FSIM+ using a single Quadro FX 5800
Potential speedup over FSIM+ is 90× when using a Tesla GPU system
113