Transcript Document

Towards Acceleration of Fault Simulation Using Graphics Processing Units
Kanupriya Gulati
Sunil P. Khatri
Department of ECE
Texas A&M University, College Station
Outline
- Introduction
- Technical Specifications of the GPU
- CUDA Programming Model
- Approach
- Experimental Setup and Results
- Conclusions
Introduction
- Fault Simulation (FS) is crucial in the VLSI design flow
  - Given a digital design and a set of vectors V, FS evaluates the number of stuck-at faults (Fsim) tested by applying V
  - The ratio Fsim/Ftotal is a measure of fault coverage
- Current designs have millions of logic gates
  - The number of faulty variations is proportional to the design size
  - Each of these variations needs to be simulated for the V vectors
- Therefore, it is important to explore ways to accelerate FS
- The ideal FS approach should be
  - Fast
  - Scalable
  - Cost effective
Introduction
- We accelerate FS using graphics processing units (GPUs)
  - By exploiting fault-parallel and pattern-parallel approaches
- A GPU is essentially a commodity stream processor
  - Highly parallel
  - Very fast
  - Operating paradigm is SIMD (Single Instruction, Multiple Data)
- GPUs, owing to their massively parallel architecture, have been used to accelerate
  - Image/stream processing
  - Data compression
  - Numerical algorithms
    - LU decomposition, FFT, etc.
Introduction

We implemented our approach on the



We used the Compute Unified Device Architecture
(CUDA) framework


Open source C-like GPU programming and interfacing tool
When using a single 8800 GTX GPU card



NVIDIA GeForce 8800 GTX GPU
By careful engineering, we maximally harness the GPU’s
 Raw computational power and
 Huge memory bandwidth
~35X speedup is obtained compared to a commercial FS tool
Accounts for CPU processing and data transfer times as well
Our runtimes are projected for the NVIDIA Tesla server


Can house up to 8 GPU devices
~238X speedup is possible compared to the commercial engine
Outline
- Introduction
- Technical Specifications of the GPU
- CUDA Programming Model
- Approach
- Experimental Setup and Results
- Conclusions
GPU – A Massively Parallel Processor
Source: "NVIDIA CUDA Programming Guide", version 1.1
GeForce 8800 GTX Technical Specs.
- 367 GFLOPS peak performance for certain applications
  - 25-50 times that of current high-end microprocessors
  - Up to 265 GFLOPS sustained performance
- Massively parallel: 128 SIMD processor cores
  - Partitioned into 16 Multiprocessors (MPs)
- Massively threaded: sustains 1000s of threads per application
- 768 MB device memory
- 1.4 GHz clock frequency
  - vs. a CPU at ~4 GHz
- 86.4 GB/sec memory bandwidth
  - vs. a CPU at 8 GB/sec on the front side bus
- A 1U Tesla server from NVIDIA can house up to 8 GPUs
Outline
- Introduction
- Technical Specifications of the GPU
- CUDA Programming Model
- Approach
- Experimental Setup and Results
- Conclusions
CUDA Programming Model
- The GPU is viewed as a compute device that:
  - Is a coprocessor to the CPU, or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
[Figure: the host (CPU) connected via PCIe to the device (GPU) and its device memory; threads are instances of a kernel running on the device]
CUDA Programming Model

Data-parallel portions of an application are executed on
the device in parallel on many threads



Kernel : code routine executed on GPU
Thread : instance of a kernel
Differences between GPU and CPU threads


GPU threads are extremely lightweight
 Very little creation overhead
GPU needs 1000s of threads to achieve full parallelism
 Allows memory access latencies to be hidden
 Multi-core CPUs require fewer threads, but the available
parallelism is lower
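For concreteness, a minimal CUDA sketch of the kernel/thread distinction follows; the kernel, array, and launch sizes are illustrative and not taken from the talk.

#include <cuda_runtime.h>

// Kernel: a code routine executed on the GPU; each thread is one instance of it
__global__ void scaleKernel(float *data, float factor, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // unique thread index
    if (tid < n)
        data[tid] *= factor;                           // each thread handles one element
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));   // device memory for the data
    // ... copy inputs into d_data with cudaMemcpy ...
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);   // launches 1000s of lightweight threads
    cudaThreadSynchronize();                           // wait for the kernel (CUDA 1.x-era call)
    cudaFree(d_data);
    return 0;
}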
Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks (aka blocks)
  - All threads within a block share a portion of data memory
- A thread block is a batch of threads that can cooperate with each other by (see the sketch after this slide):
  - Synchronizing their execution
    - For hazard-free common memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, Block (0,0) through Block (2,1), and each block, e.g. Block (1,1), is an array of threads Thread (0,0) through Thread (4,2)]
Source: "NVIDIA CUDA Programming Guide", version 1.1
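A minimal sketch (not from the talk) of threads in one block cooperating through shared memory and barrier synchronization; the kernel and sizes are illustrative.

#define BLOCK 128

// Threads in the same block share data and synchronize; threads in
// different blocks cannot cooperate this way.
__global__ void reverseInBlock(int *data) {
    __shared__ int buf[BLOCK];                        // low-latency per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = data[i];                       // each thread loads one element
    __syncthreads();                                  // barrier: hazard-free shared accesses
    data[i] = buf[blockDim.x - 1 - threadIdx.x];      // read a value written by another thread
}

// Launched as a grid of thread blocks (data length assumed a multiple of BLOCK):
//   reverseInBlock<<<numBlocks, BLOCK>>>(d_data);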
Block and Thread IDs
- Threads and blocks have IDs
  - So each thread can identify what data it will operate on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- This simplifies memory addressing when processing multidimensional data (see the sketch after this slide)
  - Image processing
  - Solving PDEs on volumes
  - Other problems with an underlying 1D, 2D or 3D geometry
[Figure: Grid 1 as a 2D array of blocks, Block (0,0) through Block (2,1); Block (1,1) expanded into a 2D array of threads, Thread (0,0) through Thread (4,2)]
Source: "NVIDIA CUDA Programming Guide", version 1.1
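As an illustration (not from the talk), a thread processing a 2D image can derive its pixel coordinates directly from its block and thread IDs; the kernel and names below are assumptions.

__global__ void invertImage(unsigned char *img, int width, int height) {
    // 2D block and thread IDs map naturally onto the image geometry
    int x = blockIdx.x * blockDim.x + threadIdx.x;    // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;    // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

// Example launch with a 2D grid of 2D blocks:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   invertImage<<<grid, block>>>(d_img, width, height);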
Device Memory Space Overview
- Each thread has:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read-only per-grid constant memory
  - Read-only per-grid texture memory
- The host can R/W the global, constant and texture memories
[Figure: the device grid with per-thread registers and local memory, per-block shared memory, and grid-wide global, constant and texture memories accessible from the host]
Source: "NVIDIA CUDA Programming Guide", version 1.1
Device Memory Space Usage
- Register usage per thread should be minimized (max. 8192 registers/MP)
- Shared memory is organized in banks
  - Avoid bank conflicts
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Coalescing recommended
- Texture and constant memories
  - Cached memories
  - Initialized by the host
  - Contents visible to all threads
  - (A host-side setup sketch follows this slide)
[Figure: same device memory hierarchy as on the previous slide]
Source: "NVIDIA CUDA Programming Guide", version 1.1
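A small host-side sketch (illustrative names, not from the talk) of how global and constant memory are typically set up from the host:

#include <cuda_runtime.h>

__constant__ int c_params[16];                        // read-only per-grid constant memory, cached

int main() {
    int h_params[16] = {0};
    int *d_buf;
    cudaMalloc((void **)&d_buf, 1024 * sizeof(int));               // global memory on the device
    cudaMemcpyToSymbol(c_params, h_params, sizeof(h_params));      // host initializes constant memory
    // Host <-> device transfers of R/W data go through global memory:
    //   cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    //   ... launch kernels; threads should access d_buf in a coalesced pattern ...
    //   cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
    return 0;
}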
Outline
- Introduction
- Technical Specifications of the GPU
- CUDA Programming Model
- Approach
- Experimental Setup and Results
- Conclusions
Approach
- We implement a look-up table (LUT) based FS
[Figure: each gate's LUT entries stored at consecutive texture-memory locations, indexed 0, 1, 2, 3, ...]
- All gates' LUTs are stored in texture memory (cached)
- The LUTs of all library gates fit in the texture cache
  - To avoid cache misses during lookup
- An individual k-input gate LUT requires 2^k entries
- Each gate's LUT entries are located at a fixed offset in the texture memory, as shown above
- The gate output is obtained by accessing the memory at "gate offset + input value" (a sketch follows this slide)
  - Example: output of an AND2 gate when its inputs are '0' and '1'
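A sketch of this lookup, using the legacy CUDA texture-reference API contemporaneous with the talk (since removed in CUDA 12); the offset name and the input-bit packing are assumptions, not the authors' code.

// LUT entries of all library gates stored linearly in texture memory;
// e.g. a 2-input AND gate occupies 4 entries {0, 0, 0, 1} at a fixed offset
texture<int, 1, cudaReadModeElementType> texLUT;      // cached texture memory

__device__ int evalGate(int gateOffset, int inputValue) {
    // gate output = LUT[gate offset + input value]
    return tex1Dfetch(texLUT, gateOffset + inputValue);
}

// Example: for an AND2 gate at offset AND2_OFFSET with inputs a = 0, b = 1,
// inputValue = (a << 1) | b = 1, so evalGate(AND2_OFFSET, 1) returns 0.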
Approach
- In practice we evaluate two vectors for the same gate in a single thread
  - 1/2/3/4-input gates therefore require 4/16/64/256 LUT entries, respectively
- Our library consists of an INV and 2/3/4-input AND, NAND, NOR and OR gates
  - Hence the total memory required for all LUTs is 1×4 + 4×16 + 4×64 + 4×256 = 1348 words
  - This fits in the texture memory cache (8 KB per MP)
- We exploit both fault and pattern parallelism
Approach – Fault Parallelism
[Figure: a levelized circuit from primary inputs to primary outputs, with gates arranged by logic level 1, 2, 3, ..., L]
- All gates at a fixed topological level are evaluated in parallel (see the scheduling sketch after this slide).
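A host-side sketch of this levelized scheduling; the data structures, kernel name, and launch parameters are assumptions.

// One kernel launch per topological level: all gates in a level are
// independent, so their threads can run in parallel.
__global__ void evalGatesAtLevel(int *gateData, int numGates);   // evaluation kernel, defined elsewhere

void simulateLevelized(int **levelGateData, const int *gatesPerLevel, int numLevels) {
    for (int lev = 0; lev < numLevels; lev++) {
        int n = gatesPerLevel[lev];
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        // gates at level lev only read results of levels < lev,
        // which are already in global memory at this point
        evalGatesAtLevel<<<blocks, threads>>>(levelGateData[lev], n);
    }
    cudaThreadSynchronize();
}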
Approach – Pattern Parallelism
[Figure: the good circuit and the faulty circuit simulated for vectors 1, 2, ..., N; e.g. the good circuit value and the faulty circuit value for vector 1]
- Simulations for any gate, for different patterns, are done in parallel, in 2 phases
  - Phase 1: Good circuit simulation. Results are returned to the CPU.
  - Phase 2: Faulty circuit simulation.
    - The CPU does not schedule a stuck-at-v fault in a pattern which has v as the good circuit value (a CPU-side sketch of this check follows the slide)
    - Each gate is simulated for all the faults which lie in its TFI
- Fault injection is also performed in parallel
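A CPU-side sketch of the Phase 2 scheduling check; the array layout and the scheduleFaultySim helper are hypothetical.

void scheduleFaultySim(int fault, int pattern);       // hypothetical helper: queues one faulty-circuit evaluation

// A stuck-at-v fault is only paired with patterns whose good-circuit value
// on the faulty line differs from v; otherwise the fault cannot be excited.
void schedulePhase2(int numFaults, int numPatterns,
                    const int *faultLine, const int *faultValue,
                    const int *goodValue /* indexed as [line * numPatterns + pattern] */) {
    for (int f = 0; f < numFaults; f++)
        for (int p = 0; p < numPatterns; p++)
            if (goodValue[faultLine[f] * numPatterns + p] != faultValue[f])
                scheduleFaultySim(f, p);
}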
Approach – Logic Simulation
typedef struct __align__(16) {
    int offset;      // gate type's offset into the LUT
    int a, b, c, d;  // input values
    int m0, m1;      // mask variables (used for fault injection)
} threadData;
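A sketch of a gate-evaluation thread built around this structure; the 2-input indexing shown is an assumption, and mask application is deferred to the fault-injection slide. The authors' kernel handles up to 4 inputs and two packed vectors per thread.

texture<int, 1, cudaReadModeElementType> texLUT;      // all gate LUTs, cached in texture memory

__global__ void logicSimKernel(const threadData *td, int *output, int numThreads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;
    threadData t = td[tid];
    int idx = (t.a << 1) | t.b;                       // LUT index from the input values (2-input case)
    output[tid] = tex1Dfetch(texLUT, t.offset + idx); // gate output = LUT[gate offset + input value]
}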
Approach – Fault Injection
typedef struct __align__(16) {
    int offset;      // gate type's offset into the LUT
    int a, b, c, d;  // input values
    int m0, m1;      // mask variables
} threadData;

m0   m1   Meaning
-    11   Stuck-at-1 mask
11   00   No fault injection
00   00   Stuck-at-0 mask
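One masking scheme consistent with this table (an assumption about how m0 and m1 are applied; the authors' exact expression may differ): the simulated value is ANDed with m0 and ORed with m1, so either the true value, a forced 0, or a forced 1 is produced without any branching.

// faulty value = (simulated value AND m0) OR m1
//   m0 = all 1s,     m1 = all 0s -> value passes through (no fault injection)
//   m0 = all 0s,     m1 = all 0s -> forced to 0 (stuck-at-0)
//   m0 = don't care, m1 = all 1s -> forced to 1 (stuck-at-1)
__device__ int injectFault(int value, int m0, int m1) {
    return (value & m0) | m1;
}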
Approach – Fault Detection
typedef struct __align__(16) {
    int offset;                 // gate type's offset into the LUT
    int a, b, c, d;             // input values
    int Good_Circuit_threadID;  // good circuit simulation thread ID
} threadData_Detect;
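A sketch of how detection might use this structure, assuming (consistent with the slides) that a fault is detected when a faulty primary-output value differs from the good circuit value computed by the referenced thread; the array names are illustrative.

__global__ void faultDetectKernel(const threadData_Detect *td, const int *faultyValue,
                                  const int *goodValue, int *Detect, int numThreads) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numThreads) return;
    threadData_Detect t = td[tid];
    // detected if the faulty value at this primary output differs from the
    // good circuit value produced by the corresponding good-simulation thread
    Detect[tid] = faultyValue[tid] ^ goodValue[t.Good_Circuit_threadID];
}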
Approach – Recap
- The CPU schedules the good and faulty gate evaluations
- Different threads perform, in parallel (for 2 vectors of a gate):
  - Gate evaluation (logic simulation) for good or faulty vectors
  - Fault injection
  - Fault detection, for gates at the last topological level only
- We maximize GPU performance by:
  - Ensuring no data dependency exists between threads issued in parallel
  - Ensuring that the same instructions are executed by all threads, but on different data
    - Conforms to the SIMD architecture of GPUs
Maximizing Performance
- We adapt to specific G80 memory constraints
- The LUT is stored in texture memory. Key advantages are:
  - Texture memory is cached
  - The total LUT size easily fits into the available cache size of 8 KB/MP
  - No memory coalescing requirements
  - Efficient built-in texture fetching routines are available in CUDA
  - Non-zero time is taken to load the texture memory, but this cost is easily amortized
- Global memory writes for level i gates (and reads for level i+1 gates) are performed in a coalesced fashion (see the sketch after this slide)
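A sketch of the coalesced access pattern (illustrative layout): when consecutive threads read and write consecutive 32-bit words, the hardware combines the accesses into a few wide memory transactions.

__global__ void copyLevelValues(const int *levelInValues, int *levelOutValues, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    // thread tid reads word tid and writes word tid: consecutive threads touch
    // consecutive addresses, so both the read and the write are coalesced
    levelOutValues[tid] = levelInValues[tid];         // (gate evaluation omitted in this sketch)
}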
Outline
- Introduction
- Technical Specifications of the GPU
- CUDA Programming Model
- Approach
- Experimental Setup and Results
- Conclusions
Experimental Setup
- FS runtimes on the 8800 GTX are compared to a commercial fault simulator for 30 IWLS and ITC benchmarks
- 32 K patterns were simulated for all 30 circuits
- CPU times were obtained on a 1.5 GHz, 1.5 GB UltraSPARC-IV+ processor running Solaris 9
- OUR time includes
  - Data transfer time between the GPU and CPU (both directions)
    - CPU → GPU: 32 K patterns, LUT data
    - GPU → CPU: 32 K good circuit evaluations for all gates, array Detect
  - Processing time on the GPU
  - Time spent by the CPU to issue good/faulty gate evaluation calls
Results

Circuit          #Gates   #Faults   OURS (s)   COMM. (s)   Speed Up
s9234_1            1261      2202      2.043      26.740     13.089
s35932            10537     24256      7.883     265.590     33.691
s5378              1682      3543      1.961      31.950     16.290
s13207             1594      3032      0.656      52.590     80.160
b22               34060     55077     58.330     252.040      4.167
b17_1             51340    120639     14.840     736.670     41.232
b10                 407       767      0.340       4.020     11.834
b02                  52      3114      0.028       1.280     45.911
:                     :         :          :           :          :
Avg (30 Ckts.)                                                34.879

- Computation results have been verified
- On average, over the 30 benchmarks, a ~35X speedup is obtained
Results (1U Tesla Server)

Circuit          #Gates   #Faults   PROJ. (s)   COMM. (s)   Speed Up
s9234_1            1261      2202       0.282      26.740      94.953
s35932            10537     24256       0.802     265.590     567.941
s5378              1682      3543       0.271      31.950     117.716
s13207             1594      3032       0.091      52.590     579.453
b22               34060     55077      57.969     252.040       4.348
b17_1             51340    120639      14.335     736.670      51.391
b10                 407       767       0.051       4.020      78.494
b02                  52      3114       0.003       1.280     367.288
:                     :         :           :           :           :
Avg (30 Ckts.)                                                238.185

- The NVIDIA Tesla 1U Server can house up to 8 GPUs
- Runtimes are obtained by scaling the GPU processing times only
- Transfer times and CPU processing times are included, without scaling
- On average, a ~240X speedup is obtained
Outline
- Introduction
- Technical Specifications of the GPU
- CUDA Programming Model
- Approach
- Experimental Setup and Results
- Conclusions
Conclusions
- We have accelerated FS using GPUs
  - We implement a pattern- and fault-parallel technique
- By careful engineering, we maximally harness the GPU's
  - Raw computational power and
  - Huge memory bandwidth
- When using a single 8800 GTX GPU
  - ~35X speedup compared to the commercial FS engine
- When projected for a 1U NVIDIA Tesla Server
  - ~238X speedup is possible over the commercial engine
- Future work includes exploring parallel fault simulation on the GPU
Thank You