Dynamic Warp Formation and Scheduling for GPU Control Flow


Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow

Wilson W. L. Fung
Ivan Sham
George Yuan
Tor M. Aamodt

Electrical and Computer Engineering
University of British Columbia
MICRO-40, Dec 5, 2007
Motivation

- GPU: a massively parallel architecture
- SIMD pipeline: the most computation out of the least silicon/energy
  [Chart: GFLOPS vs. year, 2001-2008, log scale from 1 to 1000: GPU throughput far outpaces CPU-Scalar and CPU-SSE]
- Goal: apply the GPU to non-graphics computing
  - Many challenges
  - This talk: a hardware mechanism for efficient control flow
Wilson Fung, Ivan Sham,
George Yuan, Tor Aamodt
Dynamic Warp Formation and Scheduling
for Efficient GPU Control Flow
2
Programming Model

- Modern graphics pipeline (OpenGL/DirectX): Vertex Shader, then Pixel Shader
- CUDA-like programming model
  - Hide the SIMD pipeline from the programmer
  - Single-Program-Multiple-Data (SPMD)
  - Programmer expresses parallelism using threads
  - ~Stream processing
Programming Model

- Warp = threads grouped into a SIMD instruction
- From the Oxford Dictionary:
  Warp: in the textile industry, "the threads stretched lengthwise in a loom to be crossed by the weft".
The Problem: Control Flow

- The GPU uses a SIMD pipeline to save area on control logic
  - Scalar threads are grouped into warps
- Branch divergence occurs when threads inside a warp branch to different execution paths
  [Figure: a branch splits the warp's threads between Path A and Path B]
- 50.5% performance loss with SIMD width = 16
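The divergence penalty can be illustrated with a small simulation (an illustrative sketch, not the paper's simulator): when a warp's threads take both sides of a branch, the SIMD pipeline must issue each path serially under an active mask, so a fully divergent if/else costs the sum of both paths.

```python
def simd_branch_cost(mask_taken, cost_a, cost_b):
    """Cycles for an if/else on a SIMD machine with per-lane masking.

    Each side of the branch is issued once for the whole warp if any
    lane takes it; masked-off lanes do no useful work.
    """
    cycles = 0
    if any(mask_taken):       # some lanes execute path A
        cycles += cost_a
    if not all(mask_taken):   # some lanes execute path B
        cycles += cost_b
    return cycles

# A 16-wide warp where half the lanes take each path pays for both paths:
mask = [i % 2 == 0 for i in range(16)]
print(simd_branch_cost(mask, 10, 10))  # 20 cycles, vs. 10 with no divergence
```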
Dynamic Warp Formation

- Consider multiple warps: opportunity?
  [Figure: threads from several warps that branch the same way at the branch can be merged into fuller warps on Path A and Path B]
- 20.7% speedup with 4.7% area increase
Outline

- Introduction
- Baseline Architecture
- Branch Divergence
- Dynamic Warp Formation and Scheduling
- Experimental Results
- Related Work
- Conclusion
Baseline Architecture

[Figure: the CPU spawns work onto the GPU and continues; the GPU runs it across multiple shader cores connected through an interconnection network to memory controllers and GDDR3 memory, signalling "done" back to the CPU, which may spawn further GPU work over time]
SIMD Execution of Scalar Threads

- All threads run the same kernel
- Warp = threads grouped into a SIMD instruction
  [Figure: scalar threads W, X, Y, and Z share a common PC as one thread warp; thread warps 3, 7, and 8 queue for the SIMD pipeline]
Latency Hiding via Fine-Grain Multithreading

- Interleave warp execution to hide latencies
- Register values of all threads stay in the register file
- Need 100~1000 threads
  - Graphics has millions of pixels
  [Figure: SIMD pipeline with I-Fetch, Decode, register files, ALUs, D-Cache (all hit?), and Writeback; thread warps 3, 7, and 8 are available for scheduling while thread warps 1, 2, and 6 are accessing the memory hierarchy after a miss]
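The "100~1000 threads" figure follows from a Little's-law style argument (a back-of-the-envelope sketch, not a number from the talk): to keep the pipeline busy across a memory latency of L cycles while issuing one warp instruction per cycle, roughly L warps must be in flight.

```python
def warps_needed(mem_latency_cycles, issue_rate=1):
    """Little's law: in-flight work = latency x throughput.

    To cover `mem_latency_cycles` of memory latency while issuing
    `issue_rate` warp-instructions per cycle, about this many warps
    must be ready to run.
    """
    return mem_latency_cycles * issue_rate

# Assuming (for illustration) a 200-cycle DRAM latency and 16-wide warps:
print(warps_needed(200) * 16)  # threads needed to fully hide the latency
```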
SPMD Execution on SIMD Hardware: The Branch Divergence Problem

[Figure: control-flow graph in which block A branches to B and F; B branches to C and D, which reconverge at E; E and F reconverge at G. A thread warp of four threads (1-4) shares a common PC until the branch]
Baseline: PDOM

[Figure: per-warp reconvergence stack of (Reconv. PC, Next PC, Active Mask) entries. Executing A with mask 1111, the divergent branch in B pushes one entry per side below the reconvergence point: (G, E, 1111), then (E, D, 0110), then (E, C, 1001) at the TOS. The warp runs C with mask 1001, pops to D with mask 0110, then reconverges at E with the full mask 1111 before continuing to G. Timeline: A, B, C, D, E, G, then back to A]
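The reconvergence stack above can be sketched as a tiny simulation (an illustrative model of immediate post-dominator reconvergence, not the paper's hardware): each stack entry records where to go next, which lanes are active, and where they rejoin; an entry runs until its PC reaches its reconvergence PC, then the next entry is popped.

```python
def pdom_execute(warp_size=4):
    """Toy PDOM reconvergence stack for the slide's example CFG:
    A -> B -> {C, D} -> E -> G.

    Each stack entry is (next_pc, active_mask, reconv_pc).
    Returns the sequence of (block, mask) actually issued.
    """
    full = (1 << warp_size) - 1            # 0b1111
    trace = [("A", full), ("B", full)]     # executed before the divergence
    stack = [("G", full, None),            # after the outer reconvergence
             ("E", full, "G"),             # reconvergence point of the branch
             ("D", 0b0110, "E"),           # threads 2,3 take D
             ("C", 0b1001, "E")]           # threads 1,4 take C (TOS)
    succ = {"C": "E", "D": "E", "E": "G", "G": None}  # fall-through successors
    while stack:
        pc, mask, reconv = stack.pop()
        while pc is not None and pc != reconv:
            trace.append((pc, mask))
            pc = succ[pc]
    return trace

print(pdom_execute())  # issues A, B, C, D, E, G with the masks above
```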
Dynamic Warp Formation: Key Idea

- Idea: form new warps at divergence
- With enough threads branching to each path, full new warps can be created
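The key idea can be sketched in a few lines (a simplified software model, ignoring lane placement and timing): group scalar threads by the PC they branched to, packing threads from different original warps into new, fuller warps.

```python
from collections import defaultdict

def form_warps(threads, warp_size=4):
    """Toy dynamic warp formation: regroup scalar threads by their
    next PC so that threads from *different* original warps that
    branched the same way fill new warps together.

    `threads` is a list of (thread_id, next_pc) pairs.
    """
    by_pc = defaultdict(list)
    for tid, pc in threads:
        by_pc[pc].append(tid)
    warps = []
    for pc, tids in by_pc.items():
        for i in range(0, len(tids), warp_size):
            warps.append((pc, tids[i:i + warp_size]))
    return warps

# Two 4-wide warps each diverge 2/2; DWF packs them into two full warps
# instead of four half-empty ones:
threads = [(0, "C"), (1, "D"), (2, "D"), (3, "C"),   # warp x
           (4, "D"), (5, "C"), (6, "C"), (7, "D")]   # warp y
print(form_warps(threads))
```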
Dynamic Warp Formation: Example

[Figure: warps x and y both execute basic block A (masks x/1111, y/1111) and diverge down the CFG; a new warp is created from scalar threads of both warp x and warp y executing at basic block D, and the warps reconverge at G with full masks. Timeline comparison: the baseline issues A, A, B, B, C, C, D, D, E, E, F, F, G, G, A, A, while dynamic warp formation merges the partially filled warps and issues A, A, B, B, C, D, E, E, F, G, G, A, A, finishing sooner]
Dynamic Warp Formation: Hardware Implementation

[Figure: thread scheduler datapath. When a warp executes a branch (e.g., A: BEQ R2, B), the taken and not-taken outcomes each update a Warp Update Register holding the thread IDs, the target PC, and a request bit vector. Each target PC indexes a PC-Warp LUT whose entries (PC, occupancy vector OCC, index IDX) point into the Warp Pool, where warps under formation accumulate thread IDs and a priority; the Warp Allocator creates a new pool entry when no warp exists for that PC or the existing one is full. The Issue Logic picks a formed warp, fetches from the I-cache, decodes, and reads (TID, Reg#) operands from per-lane register files RF 1-4 feeding ALUs 1-4 through commit/writeback, with threads placed so that there is no lane conflict]
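The scheduler's datapath can be sketched in software (a simplified model of the structures named above; sizes and the FIFO issue policy are illustrative, not the paper's exact design, and lane occupancy tracking is omitted):

```python
class WarpFormationScheduler:
    """Toy model of the PC-Warp LUT + Warp Pool datapath."""

    def __init__(self, warp_size=4):
        self.warp_size = warp_size
        self.pc_warp_lut = {}   # target PC -> index of the warp being formed
        self.warp_pool = []     # list of [pc, thread_ids]

    def update(self, target_pc, tids):
        """Merge threads branching to `target_pc` into the warp being
        formed for that PC (the role of a Warp Update Register writing
        through the PC-Warp LUT)."""
        for tid in tids:
            idx = self.pc_warp_lut.get(target_pc)
            if idx is None or len(self.warp_pool[idx][1]) == self.warp_size:
                self.warp_pool.append([target_pc, []])   # warp allocator
                idx = len(self.warp_pool) - 1
                self.pc_warp_lut[target_pc] = idx
            self.warp_pool[idx][1].append(tid)

    def issue(self):
        """Issue logic: pop the oldest formed warp (FIFO for simplicity)."""
        return self.warp_pool.pop(0) if self.warp_pool else None

sched = WarpFormationScheduler()
sched.update("B", [1, 2, 7])   # taken side of one diverging warp
sched.update("C", [3, 4])      # not-taken side
sched.update("B", [5, 8])      # taken side of another warp: merged into "B"
print(sched.issue())           # the "B" warp filled from two source warps
```

A real implementation would also consult the OCC occupancy vector so each thread lands in a free register-file lane, avoiding lane conflicts.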
Methodology

- Created a new cycle-accurate simulator from SimpleScalar (version 3.0d)
- Selected benchmarks from SPEC CPU2006, SPLASH-2, and the CUDA demos
  - Manually parallelized
  - Programming model similar to CUDA
Experimental Results

[Chart: IPC from 0 to 128 for hmmer, lbm, Black, Bitonic, FFT, LU, Matrix, and the harmonic mean (HM), comparing the PDOM baseline, dynamic warp formation, and MIMD]
Dynamic Warp Scheduling

[Chart: IPC from 0 to 128 for hmmer, lbm, Black, Bitonic, FFT, LU, Matrix, and HM under the baseline and the DMaj, DMin, DTime, DPdPri, and DPC scheduling policies]
- Lane conflicts ignored (~5% difference)
Area Estimation

- CACTI 4.2 (90 nm process)
- Size of scheduler = 2.471 mm²
- 8 × 2.471 mm² + 2.628 mm² = 22.39 mm²
  - 4.7% of a GeForce 8800 GTX (~480 mm²)
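The slide's arithmetic checks out: eight per-core schedulers plus a second term (its role is not stated on the slide), as a fraction of the approximate 8800 GTX die area.

```python
scheduler_mm2 = 2.471   # per-core scheduler area (CACTI 4.2, 90 nm)
other_mm2 = 2.628       # remaining structure from the slide's sum
die_mm2 = 480.0         # approximate GeForce 8800 GTX die area

total = 8 * scheduler_mm2 + other_mm2
print(round(total, 2))                   # 22.4 mm^2
print(round(100 * total / die_mm2, 1))   # 4.7 (percent of the die)
```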
Related Work

- Predication
  - Converts control dependency into data dependency
- Lorie and Strong
  - JOIN and ELSE instructions at the beginning of divergence
- Cervini
  - Abstract/software proposal for "regrouping" on an SMT processor
- Liquid SIMD (Clark et al.)
  - Forms SIMD instructions from scalar instructions
- Conditional Routing (Kapasi)
  - Code transformed into multiple kernels to eliminate branches
Conclusion

- Branch divergence can significantly degrade a GPU's performance
  - 50.5% performance loss with SIMD width = 16
- Dynamic warp formation & scheduling
  - 20.7% better on average than reconvergence
  - 4.7% area cost
- Future work
  - Warp scheduling: area vs. performance tradeoff
Thank You.
Questions?
Shared Memory

- Banked local memory accessible by all threads within a shader core (a block)
- Idea: break each load/store into two micro-code operations:
  - Address calculation
  - Memory access
- After address calculation, use a bit vector to track bank accesses, just like lane conflicts in the scheduler
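The bit-vector idea can be sketched as follows (an illustrative model of bank-conflict detection; the address-mod-banks mapping and bank count are assumptions, not details from the slide):

```python
def bank_conflict_vector(addresses, num_banks=16):
    """Track which banks a warp's memory accesses touch.

    Returns (bank_bitvector, conflict), where `conflict` is True if
    two lanes hit the same bank and the access must be serialized.
    """
    bits = 0
    conflict = False
    for addr in addresses:
        bank = 1 << (addr % num_banks)   # simple address-to-bank mapping
        if bits & bank:
            conflict = True              # bank already claimed this cycle
        bits |= bank
    return bits, conflict

print(bank_conflict_vector([0, 1, 2, 3]))    # distinct banks: no conflict
print(bank_conflict_vector([0, 16, 2, 3]))   # addresses 0 and 16 share bank 0
```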