GPUDet: A Deterministic GPU Architecture Hadi Jooybar1, Wilson Fung1, Mike O’Connor2, Joseph Devietti3, Tor M.

Download Report

Transcript GPUDet: A Deterministic GPU Architecture Hadi Jooybar1, Wilson Fung1, Mike O’Connor2, Joseph Devietti3, Tor M.

GPUDet: A Deterministic GPU Architecture
Hadi Jooybar1, Wilson Fung1, Mike O’Connor2,
Joseph Devietti3, Tor M. Aamodt1
1The
University of British Columbia
2AMD Research
3University of Washington
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
1
• GPUs are …
• Fast
• Energy efficient
• Commodity hardware
But…
× Mostly use for certain range of applications
Why?
Communication among concurrent threads
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
1000s of
Threads
2
Motivation
0
V0
__global__ void BFS_step_kernel(...) {
1
if( active[tid] ) {
2
3
4
active[tid] = false;
visited[tid] = true;
foreach (int id = neighbour_nodes){
5
6
7
8
if( visited[id] == false ){
cost[id] = cost[tid] + 1;
active[id] = true;
*over = true;
9
V1
Cost = 1
Active = 1
V2
1
Cost = 2
Active = 1
} } } }
BFS algorithm
Published in HiPC
2007
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
3
Motivation
What about debuggers?!
The bug may appear occasionally or in different places in each run.
OMG!
I willWhere
debugwas
it this
that
time
bug?!
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
4
GPUDet
 Strong Determinism (hardware proposal)
 Same Outputs
 Same Execution Path
 Makes the program easier to
 Debug
 Test
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
5
Motivation
0
V0
__global__ void BFS_step_kernel(...) {
1
if( active[tid] ) {
2
3
4
active[tid] = false;
visited[tid] = true;
foreach (int id = neighbour_nodes){
5
6
7
8
if( visited[id] == false ){
cost[id] = cost[tid] + 1;
active[id] = true;
*over = true;
9
V1
Cost = 1
Active = 1
V2
Cost = 2
Active = 1
} } } }
BFS algorithm
Published in HiPC
2007
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
6
GPUDet
 Strong Determinism
 Same Outputs
 Same Execution Path
 Makes the program easier to
 Debug
 Test
×Our
There
is no
free
lunch Deterministic
goal
is to
provide
× Performance Overhead
Execution on GPU architectures with
acceptable performance overhead
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
7
GPU Architecture
DRAM
CPU
Kernel
launch
L2 Cache
x = input[threadID];
y= func(x);
output[threadID] = y;
workgroup
workgroup
workgroup
012
Compute Unit
L1 Cache
Memory Unit
ALU
ALU
ALU
Workgroups
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
8
Outline
• Introduction
• GPU Architecture
• Challenges
• Deterministic Execution with GPUDet
• GPUDet Optimizations
• Workgroup-Aware Quantum Formation
• Deterministic parallel commit using Z-Buffer Unit
• Compute Unit level serialization
• Results and Conclusion
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
9
Deterministic GPU Execution Challenges
TT00
TT11
TT22
TT33
…
T0
T1
T2
T3
Isolation
Quantum 0Communication
Normal Execution
Isolation
Quantum nCommunication
• Isolation mechanism
• Provide method to pause execution of a thread
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
10
Deterministic GPU Execution Challenges
• Isolation mechanism
…
• Lack of private caches
• Lack of cache coherency
• Provide method to pause execution of a thread
• Single Instruction Multiple Threads (SIMT)
• Potential deadlock condition
• Major changes in control flow hardware
• Performance overhead
workgroupn
wavefront
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
11
Deterministic GPU Execution Challenges
• Very large number of threads
• Expensive global synchronization
• Expensive serialization
• Different program properties
• Large number of short running threads
• Frequent workgroup synchronization
• Less locality in intra thread memory accesses
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
12
Outline
• Introduction
• GPU Architecture
• Challenges
• Deterministic Execution with GPUDet
• GPUDet Optimizations
• Workgroup-Aware Quantum Formation
• Deterministic parallel commit using Z-Buffer Unit
• Compute Unit level serialization
• Results and Conclusion
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
13
Deterministic Execution of a Wavefront
if (tid < 16)
x[tid%2] = tid;
T0
T1
T2
T15
x[0] = 0
x[1] = 1
x[0] = 2
…
x[1] = 15
Execution of one wavefront is deterministic
Coalescing Unit
Data Race
Address x
Mask
v
v
-
-
-
-
-
- … -
Data
14 15
-
-
-
-
-
- … -
x[0] = 14
x[1] = 15
Not modified
To memory
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
14
Deterministic GPU Execution Challenges
T0
T1
T2
T3
…
Isolation
Communication
• Isolation mechanism
T0
T1
T2
T3
Isolation
Communication
wavefront granularity
• Provide method to pause execution of a thread
not a challenge anymore
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
15
Read
Only
• GPUDet-Basic
Reaching Quantum
Boundary
Global Memory
Load OpOp
Atomic
1.
2.
3.
4.
5.
Wavefronts
Instruction Count
Atomic Operations
… Fences
Memory
Workgroup Barriers
Execution Complete
Commit
Store Buffers
Local Memory
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
16
Outline
• Introduction
• GPU Architecture
• Challenges
• Deterministic Execution with GPUDet
• GPUDet Optimizations
• Workgroup-Aware Quantum Formation
• Deterministic parallel commit using Z-Buffer Unit
• Compute Unit level serialization
• Results and Conclusion
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
17
Workgroup-Aware Quantum Formation
• Extra global synchronizations
Reducing number of synchronizations
Avoid unnecessary quantum termination
Load Imbalance
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
18
Workgroup-Aware Quantum Formation
%of Termination Reasons
100%
Atomic Operations
Instruction Count
Execution Complete
Workgroup Barriers
All reach a workgroup
barrier
80%
60%
Continue execution in
40% parallel mode
the
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
CLopt
ATM
HT
SRAD
LPS
LIB
HOTSP
CP
CFD
BFSf
Workgroup-Aware Decision
Making
Quanta
are finished by
workgroup barriers
BFSr
0%
AES
20%
19
Workgroup-Aware Quantum Formation
80%
Atomic Operations
Instruction Count
Execution Complete
Workgroup Barriers
60%
Finish execution of the
Deterministic workgroup partitioning
Kernel function
40%
GPUDet: A Deterministic GPU Architecture
CLopt
ATM
HT
SRAD
LPS
LIB
HOTSP
CP
CFD
BFSf
BFSr
20%
0%
Hadi Jooybar
Workgroup-Aware Decision Making
AES
%of Termination Reasons
100%
20
Deterministic
Z-Buffer
Unit Parallel Commit using the Z-Buffer Unit
Depth Buffer
8∞ 8∞ 58
∞ 58
∞ 8∞ 8
∞
8∞ 8∞ 58
∞ 58
∞ 58
∞ 8
∞
7∞ 57
∞ 57
∞ 58
∞ 58
∞ 5
∞
8
7∞ 57
∞ 57
∞ 58
∞ 58
∞ 5
∞
8
57
∞ 57
∞ 57
∞ 58
∞ 58
∞ 5
∞
8
Store Buffer Contents ≈ Color Values
Wavefront ID ≈ Depth Values
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
21
Compute Unit Level Serialization
• GPUs preserve Point to Point Ordering
A
A
Serialization is only
among compute units
A
A
A
A
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
22
Outline
• Introduction
• GPU Architecture
• Challenges
• Deterministic Execution with GPUDet
• GPUDet Optimizations
• Workgroup-Aware Quantum Formation
• Deterministic parallel commit using Z-Buffer Unit
• Compute Unit level serialization
• Results and Conclusion
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
23
Results
Applications
with atomic
operations
• GPGPU-Sim 3.0.2
5
4
Parallel Mode
2
GPUDet: A Deterministic GPU Architecture
CLopt
ATM
HT
SRAD
LPS
LIB
HOTSP
CP
CFD
BFSf
BFSr
1
0
Hadi Jooybar
Serial Mode
Commit Mode
3
AES
Normalized
Execution Time
2x Slowdown
24
GPUDet-base
Quantum Formation
Workgroup Barrier
End of the Kernel
20% Performance Improvement
19%
for application
for application
with small
withkernel
barriers
functions
4
3
2
GPUDet: A Deterministic GPU Architecture
AVG
CLopt
ATM
HT
SRAD
LPS
LIB
CP
CFD
HOTSP
Hadi Jooybar
BFSf
0
BFSr
1
AES
Normalized Execution Time
5
25
Deterministic Parallel Commit using the Z-Buffer Unit
Commit Mode
Parallel & Serial Modes
8
60% Performance Improvement
6
on Average
4
AES
Hadi Jooybar
BFSr
BFSf
CFD
CP
HOTSP
LIB
LPS
GPUDet: A Deterministic GPU Architecture
SRAD
HT
ATM
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
Locking
Z-Buffer
0
Locking
2
Z-Buffer
Normalized Execution Time
10
Clopt
26
CLopt
Hadi Jooybar
HT
Serial Mode
Parallel&Commit Modes
CU-Ser
W-Ser
CU-Ser
W-Ser
CU-Ser
14
12
10
8
6
4
2
0
6.1x Performance
Improvement in
Serial Mode
W-Ser
Normalize Execution Time
Compute Unit Level Serialization
ATM
GPUDet: A Deterministic GPU Architecture
27
Conclusion
• Encourages programmers to use GPUs in broader
range of applications
• Exploits GPU characteristics to reduce performance
overhead
• Deterministic execution within a wavefront
• Workgroup-aware quantum formation
• Deterministic parallel commit using Z-Buffer Unit
• Compute Unit level serialization
Questions?
Hadi Jooybar
GPUDet: A Deterministic GPU Architecture
28