Efficient and Easily Programmable
Accelerator Architectures
Tor Aamodt
University of British Columbia
PPL Retreat, 31 May 2013
Decreasing cost per unit computation

[Images: Intel 4004 (1971), IBM 5150 (1981), iPhone (2007), a modern datacenter.]

Advancing Computer Systems without Technology Progress
DARPA/ISAT Workshop, March 26-27, 2012
Mark Hill & Christos Kozyrakis
[Figure: Ease of Programming (y-axis) vs. Hardware Efficiency (x-axis). Points, from most programmable and least efficient to least programmable and most efficient: Single-Core OoO Superscalar CPU, Brawny (OoO) Multicore, Wimpy (In-order) Multicore, 16K-thread SIMT Accelerator, ASIC. "Better" lies toward the top right: how to get there?]
Heterogeneity helps…

[Figure: the same Ease of Programming vs. Hardware Efficiency axes; a heterogeneous design improves on the individual points.]
Review: Amdahl's Law

$$\text{Improvement}_{\text{overall}} = \frac{1}{\text{Fraction}_{\text{hard}} + \dfrac{1 - \text{Fraction}_{\text{hard}}}{\text{Improvement}_{\text{easy}}}}$$

The fraction that is hard to accelerate limits the overall improvement, no matter how large the improvement on the easy-to-accelerate fraction becomes.
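A quick worked example (numbers chosen for illustration, not from the talk): if 25% of the work is hard to accelerate and the easy fraction is sped up 100x,

$$\text{Improvement}_{\text{overall}} = \frac{1}{0.25 + \frac{1 - 0.25}{100}} = \frac{1}{0.2575} \approx 3.9\times$$

so the hard fraction caps the overall gain at under 4x.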
What defines the division between hard and easy?

$$\text{Fraction}_{\text{hard}} = f(\text{problem}, \text{prog. model}, \text{SW budget})$$
Goal: make more of the problem easy to accelerate.

[Figure: the "easy to accelerate" share under Acc. Arch 1 vs. a larger share under Acc. Arch 2.]
[Figure: Ease of Programming vs. Hardware Efficiency again, with an arrow marked "?" pointing toward designs that are better on both axes.]
Increase Accelerator Efficiency (x-axis)
• Control Flow
• Memory

Improve Accelerator Programmability (y-axis)
• Easier coding
• Fewer bugs
• Easier debugging
SIMT Execution (MIMD on SIMD)
(Levinthal, SIGGRAPH '84)

foo[] = {4,8,12,16};
A: n = foo[tid.x];
B: if (n > 10)
C:     …;
   else
D:     …;
E: …

[Figure: branch divergence over time. All four threads execute A and B with active mask 1111. Threads 3 and 4 (n = 12, 16) take path C with mask 0011 while threads 1 and 2 are masked off; threads 1 and 2 (n = 4, 8) then execute path D with mask 1100. The warp reconverges at E, with the per-path PCs (B, C, D, E) and active masks tracked on the SIMT stack.]
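A minimal runnable CUDA version of the slide's example; the branch bodies (elided as "…" on the slide) are hypothetical fillers, and the kernel and output names are mine.

// Launch with one 4-thread warp: divergent<<<1, 4>>>(out);
__device__ const int foo[4] = {4, 8, 12, 16};

__global__ void divergent(int *out) {
    int n = foo[threadIdx.x];      // A: every lane active (mask 1111)
    if (n > 10)                    // B: lanes holding 12 and 16 take the branch
        out[threadIdx.x] = n * 2;  // C: runs with active mask 0011
    else
        out[threadIdx.x] = n + 1;  // D: runs with active mask 1100
    // E: reconvergence point, mask 1111 again
}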
Dynamic Warp Formation
(Fung: MICRO 2007, HPCA 2011)

[Figure: three 4-thread warps (threads 1-4, 5-8, 9-12) execute A and B fully populated, then diverge. Path C runs with partial masks (1 2 -- --), (5 -- 7 8), (-- -- 11 12); path D with (-- -- 3 4), (-- 6 -- --), (9 10 -- --). When warps reissue (e.g., after a memory latency), dynamic warp formation packs threads waiting at the same PC into denser warps, e.g., C: (1 2 7 8) and C: (5 -- 11 12), raising SIMD efficiency in this example from 58% to 88%. All warps reconverge fully populated at E.]

22% average speedup [HPCA'11]
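A host-side C++ sketch of the packing idea (illustrative only: real dynamic warp formation operates in the issue stage and must also keep each thread in its home SIMD lane, which this sketch ignores):

#include <algorithm>
#include <vector>

const size_t WARP_SIZE = 4;  // 4-thread warps, as in the slide's example

// Given the thread IDs waiting at one PC (e.g., all threads on path C),
// regroup them into as few warps as possible.
std::vector<std::vector<int>> pack(const std::vector<int> &threads_at_pc) {
    std::vector<std::vector<int>> warps;
    for (size_t i = 0; i < threads_at_pc.size(); i += WARP_SIZE) {
        size_t end = std::min(i + WARP_SIZE, threads_at_pc.size());
        warps.emplace_back(threads_at_pc.begin() + i, threads_at_pc.begin() + end);
    }
    return warps;
}
// e.g., seven threads waiting at C repack into two warps instead of three.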
Memory
Scheduler affects access pattern

[Figure: Warp 0 repeatedly loads A,B,C,D…; Warp 1 loads Z,Y,X,W…. With a Round-Robin scheduler, the warp scheduler interleaves the two warps, so the cache must hold both working sets (A,B,C,D and Z,Y,X,W) at once. With a Greedy-Then-Oldest scheduler, Warp 0's repeated ld A,B,C,D issue back-to-back and hit in the cache before Warp 1's ld Z,Y,X,W gets a chance to evict them.]
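A sketch of the two policies (illustrative; it assumes warps are indexed oldest-first and that a real scheduler re-evaluates readiness every cycle):

#include <vector>

// Round-robin: rotate to the next ready warp, interleaving memory streams.
int select_rr(const std::vector<bool> &ready, int last) {
    int n = (int)ready.size();
    for (int i = 1; i <= n; ++i)
        if (ready[(last + i) % n]) return (last + i) % n;
    return -1;  // no warp ready this cycle
}

// Greedy-then-oldest: stay on the same warp until it stalls, then pick the
// oldest ready warp. This keeps one warp's working set hot in the cache.
int select_gto(const std::vector<bool> &ready, int last) {
    if (last >= 0 && ready[last]) return last;   // greedy: same warp again
    for (int w = 0; w < (int)ready.size(); ++w)  // index 0 = oldest warp
        if (ready[w]) return w;
    return -1;
}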
Use scheduler to shape access pattern

Cache-Conscious Wavefront Scheduling
(Rogers: MICRO 2012, Top Picks 2013)

[Figure: even a Greedy-Then-Oldest scheduler thrashes once the combined working sets of the warps it interleaves (A,B,C,D plus Z,Y,X,W) exceed the cache. CCWS tracks the working set size per warp and limits how many warps the scheduler interleaves so that each scheduled warp's working set stays cache-resident.]

63% perf. improvement
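A sketch of the throttling intuition (my simplification, not Rogers et al.'s actual lost-locality scoring mechanism):

#include <vector>

// Allow a warp to issue loads only if its working set, plus those of all
// older warps, still fits in the cache; younger warps are held back so the
// warps already running keep their locality.
bool may_issue_load(int warp_id,
                    const std::vector<int> &working_set_lines,  // estimate per warp
                    int cache_lines) {
    int demand = 0;
    for (int w = 0; w <= warp_id; ++w)  // warp 0 is the oldest
        demand += working_set_lines[w];
    return demand <= cache_lines;
}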
Easier coding
Accelerator Coherence Challenges

Challenges of introducing coherence messages on a GPU:
1. Traffic: transferring messages
2. Storage: tracking messages
3. Complexity: managing races between messages

GPU cache coherence without coherence messages? YES – using global time.

Temporal Coherence (Singh: HPCA 2013)
Related: Library Cache Coherence
[Figure: each L1 line carries a local timestamp and each L2 line a global timestamp, both compared against a single synchronized global time. An L1 copy is VALID only while its local timestamp > global time; once an L2 line's global timestamp < global time, no valid L1 copies of it can exist anywhere. Shown state: Core 1's and Core 2's L1Ds and the L2 bank all hold A=0 with timestamp 0.]
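A sketch of the two timestamp checks (illustrative data structures, not the hardware implementation):

#include <cstdint>

uint32_t global_time = 0;  // single global time, visible to all cores and L2

struct L1Line { int data; uint32_t local_ts;  };  // lease on this copy
struct L2Line { int data; uint32_t global_ts; };  // latest lease handed out

// An L1 copy is valid only while its timestamp has not expired.
bool l1_valid(const L1Line &line) {
    return line.local_ts > global_time;   // "> global time -> VALID"
}

// Once the L2 line's timestamp has expired, no valid L1 copies can exist,
// so a write can complete without sending any invalidation messages.
bool write_safe(const L2Line &line) {
    return line.global_ts < global_time;  // "< global time -> no L1 copies"
}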
Temporal Coherence Example

[Figure: Core 1 loads A at T=0 and caches A=0 in its L1D with timestamp 10; the L2 bank records the same timestamp. Core 2 then writes A=1: the L2 completes the write once global time has passed the timestamp (T=11 > 10), by which point Core 1's copy has already self-invalidated, and by T=15 the L2 holds A=1. No coherence messages are exchanged at any point.]
Complexity

[Figure: L1 and L2 controller state diagrams. The MESI L1 and L2 state machines are far larger than those of a non-coherent L1/L2 or of TC-Weak, which adds little over the non-coherent design.]
Interconnect Traffic

[Bar chart: interconnect traffic of MESI, NO-COH, GPU-VI and TC-Weak on applications that do not require coherence; y-axis 0.00-1.50, with the MESI bar clipped at 2.3.]
• TC-Weak reduces traffic by 53% over MESI and 23% over GPU-VI
• Lower traffic than a 16x-sized 32-way directory

Performance

[Bar chart: speedup of MESI, NO-L1, GPU-VI and TC-Weak on applications that require coherence; y-axis 0.0-2.0.]
• TC-Weak with a simple predictor performs 85% better than disabling L1 caches
Fewer bugs
Lifetime of Accelerator Application Development

[Figure: development time split between Functionality and Performance. With fine-grained locking, a large share of the time goes into getting the locking right and tuning it; with transactional memory, functionality comes quickly, and the open question ("?") is how much time the performance half takes.]
Are TM and GPUs Incompatible?
GPU uarch is very different from a multicore CPU.

KILO TM [Fung MICRO'11, Top Picks'12]
• Hardware TM for GPUs
• Half the performance of fine-grained locking
• Chip area overhead of 0.5%
Hardware TM for GPUs
Challenge #1: SIMD Hardware

On GPUs, scalar threads in a warp/wavefront execute in lockstep.

A warp with 4 scalar threads:
...
TxBegin
LD r2,[B]
ADD r2,r2,2
ST r2,[A]
TxCommit
...

[Figure: threads T0-T3 enter the transaction together, but at TxCommit some threads are Committed while others are Aborted: branch divergence!]
KILO TM – Solution to Challenge #1: SIMD Hardware

• Treat transaction abort like a loop: aborted lanes branch back to TxBegin and re-execute
• Extend the SIMT stack to track which lanes have committed and which must retry

...
TxBegin          ← Abort branches back here
LD r2,[B]
ADD r2,r2,2
ST r2,[A]
TxCommit
...
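In pseudocode, the extended SIMT stack makes a transaction behave like a do-while loop; tx_begin and tx_commit are hypothetical names for KILO TM's hardware operations, and mem/A/B are illustrative:

do {
    tx_begin();            // push a transaction entry on the SIMT stack
    int r2 = mem[B];       // LD  r2,[B]
    r2 = r2 + 2;           // ADD r2,r2,2
    mem[A] = r2;           // ST  r2,[A]
} while (!tx_commit());    // lanes that abort loop back and re-execute;
                           // lanes that commit go inactive until the whole
                           // warp reconverges past the transaction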
Hardware TM for GPUs
Challenge #2: Transaction Rollback

[Figure: a CPU core has 10s of registers, so checkpointing the register file at TX entry and restoring it at TX abort is cheap. A GPU core (SM) runs many warps against a 32K-register file, about 2MB of register file storage across the chip, so a hardware register file checkpoint is impractical.]
KILO TM – Solution to Challenge #2: Transaction Rollback

• Software register checkpoint
• In most transactions a register's first appearance is as a destination, i.e., it is overwritten before being read (the transaction is idempotent), so it needs no checkpoint
• A TX in Barnes Hut needs to checkpoint only 2 registers

TxBegin
LD r2,[B]     ← r2 overwritten on first appearance
ADD r2,r2,2
ST r2,[A]
TxCommit
Hardware TM for GPUs
Challenge #3: Conflict Detection

Existing HTMs use the cache coherence protocol
• Not available on (current) GPUs
• No private data cache per thread

Signatures?
• 1024 bits / thread
• 3.8MB / 30k threads (30,000 × 1024 bits ≈ 3.8 MB)
KILO TM: Value-Based Conflict Detection

TX1: atomic {B = A + 1}
TxBegin
LD r1,[A]
ADD r1,r1,1
ST r1,[B]
TxCommit
Private memory – Read-Log: A=1, Write-Log: B=2

TX2: atomic {A = B + 2}
TxBegin
LD r2,[B]
ADD r2,r2,2
ST r2,[A]
TxCommit
Private memory – Read-Log: B=0, Write-Log: A=2

Global memory: A=1, B=0 (B=2 once TX1 commits)

Self-validation + abort:
• Only detects the existence of a conflict (not its identity)
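A sketch of self-validation at commit (illustrative; KILO TM does this in hardware commit units, not per-thread C code):

struct LogEntry { int *addr; int seen; };  // address and value observed in TX

// Re-read every address in the read-log and compare with the value the
// transaction observed. Any mismatch means some conflicting write has
// committed: we learn a conflict exists, but not which TX caused it.
bool self_validate(const LogEntry *read_log, int n) {
    for (int i = 0; i < n; ++i)
        if (*read_log[i].addr != read_log[i].seen)
            return false;  // abort and re-execute
    return true;           // safe to publish the write-log
}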
Easier debugging
__global__ void BFS_step_kernel(...) {
  if( active[tid] ) {
    active[tid] = false;
    visited[tid] = true;
    foreach (int id = neighbour_nodes){   // pseudocode: iterate over tid's neighbours
      if( visited[id] == false ){
        // Race: two parents can pass the visited[id] check together and
        // both write level[id], so the stored level depends on timing.
        level[id] = level[tid] + 1;
        active[id] = true;
} } } }

BFS algorithm published in HiPC 2007: different results over multiple executions.

[Figure: a small graph where V0 reaches V1 and V2; depending on which parent wins the race, a node may end up with level = 1 or level = 2, with active = 1 set by different parents. Chart, "Result Variation (Kepler)": fraction of runs producing different results (0%-100%) vs. # edges (20,000-50,000).]
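For contrast, a hedged sketch of how this particular race could be removed at the source level with an atomic; the CSR graph layout (row_off, col_idx) is a hypothetical representation, and GPUDet's point, below, is to get determinism without such rewrites:

// Assumes level[] is initialized to INT_MAX for unreached nodes.
__global__ void BFS_step_atomic(int n, const int *row_off, const int *col_idx,
                                bool *active, bool *visited, int *level) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && active[tid]) {
        active[tid] = false;
        visited[tid] = true;
        for (int e = row_off[tid]; e < row_off[tid + 1]; ++e) {
            int id = col_idx[e];
            if (!visited[id]) {
                // atomicMin makes the final level independent of which
                // parent wins the race between the check and the write.
                // A stale visited[id] read now costs only redundant work.
                atomicMin(&level[id], level[tid] + 1);
                active[id] = true;
            }
        }
    }
}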
GPUDet (Jooybar: ASPLOS 2013)

[Figure: deterministic GPU execution. Within a quantum, wavefronts treat global memory as read-only and buffer their stores in per-wavefront store buffers in local memory; at a quantum boundary, the buffers commit in a fixed order. A wavefront reaches a quantum boundary on:
1. Instruction count
2. Atomic operations
3. Memory fences
4. Workgroup barriers
5. Execution complete]

2x slowdown on average.
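A host-side sketch of the store-buffer idea (my simplification; GPUDet implements this in hardware with per-wavefront buffers in local memory):

#include <map>
#include <vector>

struct StoreBuffer { std::map<int*, int> pending; };  // per-wavefront buffer

// Within a quantum, a load sees this wavefront's own buffered store if one
// exists; otherwise global memory, which is read-only until commit.
int det_load(const StoreBuffer &sb, int *addr) {
    auto it = sb.pending.find(addr);
    return it != sb.pending.end() ? it->second : *addr;
}

// Stores are buffered instead of being published immediately.
void det_store(StoreBuffer &sb, int *addr, int value) { sb.pending[addr] = value; }

// At a quantum boundary, buffers drain in a fixed wavefront order, so the
// final memory state is the same on every run.
void commit_quantum(std::vector<StoreBuffer> &buffers) {
    for (StoreBuffer &sb : buffers) {
        for (auto &kv : sb.pending) *kv.first = kv.second;
        sb.pending.clear();
    }
}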
Summary

• Start from an efficient architecture and try to improve programmability
• Get efficiency and keep programmers happy
Thanks!
Questions?