Stream Processors vs. GPUs
Bill Dally
Computer Systems Laboratory
Stanford University
August 8, 2004
The Big Picture
• VLSI technology: arithmetic is cheap, bandwidth is expensive, and cost grows with distance
• Applications have producer-consumer locality, data parallelism, and ILP
• A stream processor exploits locality with local update and exploits DP and ILP
[Figure: a 0.5 mm 64-bit FPU (50 pJ/FLOP) drawn to scale on a 12 mm, 90 nm, 1 GHz, $200 chip, with the distance reachable in 1 clock marked; an irregular-grid application that processes the cells and faces of a subgrid and updates it in place; and a stream processor built from clusters (Cluster 1 … Cluster N-1), each with LRFs, a cluster switch, and an SRF lane, joined by a global switch.]
What is a stream processor? And how is it different from a GPU?
[Cartoon: two people, one saying "My project is a stream processor", the other "My project is a stream processor too".]
What is a stream processor?
[Figure: the bandwidth hierarchy of a stream processor. Off-chip DRAM banks connect through the chip pins and router at 16 GB/s; on-chip cache banks sit behind the memory switch at 64 GB/s; the SRF lanes supply 512 GB/s; and the LRFs feeding the ALUs in each cluster (through the cluster switch) supply 3,840 GB/s. The chip crossings at each level are annotated "10k switch", "1k switch", and "100 wire".]
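The point of the figure is the taper: using only the per-level numbers quoted on this slide, LRF, SRF, and cache bandwidth stand at roughly 240x, 32x, and 4x the off-chip pin bandwidth. A minimal sketch that just tabulates those ratios (level names are paraphrased from the figure):

```cpp
#include <cstdio>

// Bandwidths (GB/s) as quoted on this slide, from the ALU-adjacent LRFs
// down to the off-chip pins.
struct Level { const char* name; double gbps; };

int main() {
    const Level hier[] = {
        {"LRFs (per-cluster local register files)", 3840.0},
        {"SRF lanes",                                512.0},
        {"Cache banks / memory switch",               64.0},
        {"Chip pins and router (DRAM)",               16.0},
    };
    const double pins = 16.0;  // off-chip bandwidth used as the baseline
    for (const Level& l : hier)
        std::printf("%-42s %7.0f GB/s  (%4.0fx pin bandwidth)\n",
                    l.name, l.gbps, l.gbps / pins);
    return 0;
}
```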
Mapping an Application
[Dataflow: Grid → Gather → Subgrid; a Face Index stream drives Process Faces and a Grid Index stream drives Process Cells; the updated Subgrid is written back by Scatter → Grid.]
Abstraction of irregular-grid FEM code written by Tim Barth (NASA Ames).
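A rough sketch of how this dataflow might look as code. This is illustrative C++ pseudocode, not Tim Barth's FEM code or the actual Stream-C source; the types, kernel bodies, and two-faces-per-cell connectivity are invented for the example:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical element types standing in for the real FEM data.
struct Cell { double state; };
struct Face { double flux;  };

// Placeholder kernel bodies -- the real kernels do the physics; here they
// only mark where the arithmetic runs out of the local register files.
Face processFace(const Cell& a, const Cell& b) { return {b.state - a.state}; }
Cell processCell(const Cell& c, const Face& f0, const Face& f1) {
    return {c.state + 0.5 * (f0.flux + f1.flux)};
}

// One pass over a subgrid (the working subset of the grid held in the SRF).
void stepSubgrid(const std::vector<Cell>& grid,
                 const std::vector<std::size_t>& gridIndex,   // gather indices
                 const std::vector<std::pair<std::size_t, std::size_t>>& faceIndex,
                 std::vector<Cell>& gridOut) {
    // Gather: pull the subgrid out of the full grid (memory -> SRF).
    std::vector<Cell> subgrid;
    for (std::size_t i : gridIndex) subgrid.push_back(grid[i]);

    // Process Faces: produce a stream of fluxes (SRF -> LRF -> SRF).
    std::vector<Face> faces;
    for (auto [a, b] : faceIndex)
        faces.push_back(processFace(subgrid[a], subgrid[b]));

    // Process Cells: consume the face stream and update each cell
    // (connectivity simplified here to two faces per cell).
    for (std::size_t c = 0; c < subgrid.size(); ++c)
        subgrid[c] = processCell(subgrid[c], faces[2 * c], faces[2 * c + 1]);

    // Scatter: write the updated subgrid back to the grid (SRF -> memory).
    for (std::size_t k = 0; k < gridIndex.size(); ++k)
        gridOut[gridIndex[k]] = subgrid[k];
}
```

The structure mirrors the figure: the gather pulls the working subgrid on chip, the two kernels stream over it, and the scatter writes the result back.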
Kernel Locality and ILP
• Kernel locality and ILP within each kernel
• Uses the multiple ALUs and LRFs in each cluster
[Figure: the Process Faces and Process Cells kernels from the FEM dataflow run out of the LRFs (3,840 GB/s) beside each cluster's SRF lane (512 GB/s).]
Producer-Consumer Locality
• Producer-consumer locality between kernels
• Uses the SRF lane local to each cluster
[Figure: the intermediate streams passed from Gather to Process Faces to Process Cells to Scatter are held in the SRF lanes (512 GB/s) rather than sent off chip.]
Local Update of Working Set
Chip
Crossing(s)
10k
switch
1k
switch
100
wire
LRF
DRAM
Bank
Cache
Bank
SRF
Lane
CL
SW
LRF
16GB/s
Chip
Pins
and
Router
64GB/s
M
SW
512GB/s
3,840GB/s
LRF
Updates of subgrid DRAM
(working
Bank
subset of grid)
Grid
Gather
Cache
Bank
CL
SW
LRF
Global access to SRF banks
across clusters
Subgrid
SRF
Lane
Grid
Subgrid
Scatter
Grid
Index
Process
Cells
Stream GP2: 8
Face
Index
Process
Faces
August 8, 2004
Data Parallelism
• Operate on multiple grid (subgrid) points in parallel
• Exploit multiple clusters
[Figure: the same FEM dataflow spread across the clusters, each with its own LRFs, cluster switch, and SRF lane behind the shared cache banks, memory switch, and DRAM.]
P & L: Parallelism and Locality
• Data parallelism x ILP – uses lots of ALUs
  – Over 2K ALUs can be used productively (Khailany et al., HPCA '03)
  – DP much less expensive (area and power) than TLP
  – Can still handle conditionals efficiently (Kapasi et al., MICRO '00)
  – Distributed local registers much less expensive than global (Rixner et al., HPCA '00)
• Kernel locality + producer-consumer locality + local working-set update
  – Most references are local – LRF, local SRF, or global SRF
  – Very few references to off-chip memory
  – Explicit SRF accesses more efficient than implicit cache accesses (enables scheduling)
• It takes good compilers to map to a register hierarchy well
  – (Mattson et al., ASPLOS '00; Kapasi et al., WMSP '01)
Register Hierarchy
[Figure: the register hierarchy. The LRFs capture kernel locality (DRF efficiency, 3A); the SRF, reached through the cluster and global switches, captures producer-consumer locality (.03A – .3A) and provides R/W access to the working set (SRF efficiency, .01A – .1A); the cache and memory switch handle off-chip DRAM traffic (< .01A).]
So? What about GPUs?
[Figure: GPU pipeline – Vertex Shader → Rasterizer → Pixel Shader → Compositor → Frame Buffer, with a read-only Texture Cache fed by an address unit.]
• Can stream data through the shaders, but…
• R/W access only to local registers in the shaders
• All other writes must cycle through the frame buffer
• Can't capture producer-consumer locality or the working set
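A toy model (not a real graphics API) of why that matters: when two kernels communicate through the frame buffer, the intermediate stream is written to and read back from off-chip memory, whereas a stream processor keeps it in the SRF. The off-chip word counts below are for this contrived 1K-element example only:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Count off-chip words moved when kernel 2 consumes kernel 1's output
// (a) through the frame buffer, as on a 2004-era GPU, versus
// (b) through an on-chip SRF, as on a stream processor.
struct OffChip {
    std::vector<double> mem;
    long words = 0;
    void write(std::size_t i, double v) { mem[i] = v; ++words; }
    double read(std::size_t i) { ++words; return mem[i]; }
};

int main() {
    const std::size_t n = 1 << 10;
    OffChip dram;
    dram.mem.resize(2 * n);
    for (std::size_t i = 0; i < n; ++i) dram.write(i, double(i));  // input data

    // (a) GPU-style: kernel 1 writes its results to the frame buffer
    //     (off-chip), kernel 2 reads them back.
    long before = dram.words;
    for (std::size_t i = 0; i < n; ++i) dram.write(n + i, dram.read(i) * 2);  // kernel 1
    double acc_gpu = 0;
    for (std::size_t i = 0; i < n; ++i) acc_gpu += dram.read(n + i) + 1;      // kernel 2
    long gpu_traffic = dram.words - before;

    // (b) Stream-style: the intermediate stays in the SRF (on-chip), so only
    //     the initial reads touch memory.
    before = dram.words;
    std::vector<double> srf(n);                                   // on-chip buffer
    for (std::size_t i = 0; i < n; ++i) srf[i] = dram.read(i) * 2;  // kernel 1
    double acc_spu = 0;
    for (std::size_t i = 0; i < n; ++i) acc_spu += srf[i] + 1;       // kernel 2
    long spu_traffic = dram.words - before;

    std::printf("off-chip words: GPU path %ld, stream path %ld (same result: %s)\n",
                gpu_traffic, spu_traffic, acc_gpu == acc_spu ? "yes" : "no");
    return 0;
}
```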
Stream Processor vs GPU
[Figure: the two register hierarchies side by side. Stream processor – kernel locality in the LRFs (DRF efficiency, 3A), producer-consumer locality (.03A – .3A) and R/W working-set access (SRF efficiency, .01A – .1A) in the SRF behind the global and memory switches, and < .01A of off-chip DRAM traffic. GPU – kernel locality in the shader registers (3A) and read-only table lookups through the texture cache (.01A – .1A), but producer-consumer and R/W working-set traffic (.03A – .3A) must go off chip through the frame buffer.]
Our experience with stream processors
For signal/image processing and for scientific computing
Imagine Prototype
• Imagine
  – Stream processor for image and signal processing
  – 16 mm die in 0.18 µm TI process
  – 21M transistors
  – Collaboration with TI ASIC
• Software tools based on Stream-C/Kernel-C
  – Stream scheduler
  – Communication scheduling
• Many applications
  – 3 graphics pipelines
  – Image-processing apps – depth, MPEG
  – 3G cellphone (Rice)
  – STAP
  – IPv6, VPN
Bandwidth Demand of Media Applications
Power Dissipation
[Figure: two pie charts of power by component, one for Imagine and one for SP64. Slice labels as extracted: Cluster ALUs 31% / 42%; Cluster LRFs 38%; Cluster LRFs, Switch, and Control 21%; Cluster Switch & Control 12%; SRF SRAMs + SBs 8% / 15%; Clock Tree 4% / 11%; MBANKs 3% / 4%; UC SRAMs 3%; UC 2%; Other 1% / 5%.]
• Imagine (0.18 µm – 48 FP ALUs)
  – 3.1 W, 132 MHz, 1.5 V (measured)
• SP64 (90 nm – 1280 16-bit ALUs)
  – 5 W, 640 MHz, 0.8 V (estimated)
  – 160 MOPS/mW, >10 GOPS/mm²
• High-confidence power estimates:
  – Power dissipation is dominated (>90%) by very predictable sources – RFs, ALUs, switches between ALUs, and clocks
  – SRAM/RF datasheets from the 90 nm process for LRFs, SRF SRAMs + SBs, and UC
  – Post-synthesis measurements for cluster ALUs
  – Detailed floorplan provides switch power
  – MBANKs and clock tree scaled from Imagine
SP8 vs Programmable Competition
[Scatter plot: programmable GMACS/$ (0.01–10, log scale) versus programmable GMACS/W (1–100, log scale) for SPI SP8, SPI SP8-LV, Xilinx Virtex-II Pro, TI TMS320C6414T, TI DM642, ADI ADSP-TS201S, Intrinsity FastMATH, Cradle CT3400, Equator BSP-16, Mathstar SOA13D40-01, PicoChip PC101, Morpho MRC6011, and Intel MXP5800.]
Architecture of a Streaming Supercomputer
• Node: stream processor (128 FPUs, 128 GFLOPS) with 16 x DRDRAM – 2 GBytes at 16 GBytes/s
• Board: 16 nodes – 1K FPUs, 2 TFLOPS, 32 GBytes; each node connects to the on-board network at 16 GBytes/s (32+32 pairs)
• Backplane: 32 boards – 512 nodes, 64K FPUs, 64 TFLOPS, 1 TByte; each board connects to the intra-cabinet network at 64 GBytes/s (128+128 pairs, 6" Teradyne GbX)
• Each backplane connects to the inter-cabinet network at 1 TBytes/s (2K+2K links, ribbon fiber, E/O–O/E)
• All links 5 Gb/s per pair or fiber; all bandwidths full duplex; bisection 32 TBytes/s
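The board and backplane figures follow from the node numbers by straightforward multiplication:

```latex
\begin{align*}
\text{Board: }     & 16 \times 128\ \text{GFLOPS} = 2\ \text{TFLOPS}; \quad 16 \times 2\ \text{GB} = 32\ \text{GB} \\
\text{Backplane: } & 32 \times 2\ \text{TFLOPS} = 64\ \text{TFLOPS}; \quad 32 \times 32\ \text{GB} = 1\ \text{TB}
\end{align*}
```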
Merrimac Processor
[Floorplan: a 12.5 mm x 10.2 mm die with 16 compute clusters and cache banks around a microcontroller, two MIPS64 20Kc scalar cores, the memory switch with address generators and reorder buffers, forward ECC, 16 RDRAM interfaces, and the network interface. Each cluster (about 2.3 mm x 1.6 mm) contains four 64-bit FP/INT MADD units, banks of 64-word register files, and an 8K-word SRF bank.]
• 90 nm tech (1 V)
• ASIC technology
• 1 GHz (20 FO4)
• 128 GOPS
• Inter-cluster switch between clusters
• 127.5 mm² (small, ~12 x 10)
  – Stanford Imagine is 16 mm x 16 mm
  – MIT Raw is 18 mm x 18 mm
• 25 Watts (P4 = 75 W)
  – ~41 W with memories
Merrimac Power Estimates
• Microcontroller: 0.6 W
• Scalar CPU: 2.2 W
• MADD ALU: 7.0 W
• LRF: 1.3 W
• SRF: 1.4 W
• Switches: 1.3 W
• Cache: 1 W
• DRAM: 16 W
• Memory controller: 5 W
• Network controller: 5 W
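As a consistency check (the grouping into "chip" and "node" is mine, not stated on the slide), the itemized estimates line up with the 25 W and ~41 W figures quoted on the previous slide:

```latex
\begin{align*}
P_{\text{chip}} &\approx 0.6 + 2.2 + 7.0 + 1.3 + 1.4 + 1.3 + 1 + 5 + 5 = 24.8\ \text{W} \;(\approx 25\ \text{W}) \\
P_{\text{node}} &\approx 24.8 + 16\ \text{(DRAM)} = 40.8\ \text{W} \;(\approx 41\ \text{W with memories})
\end{align*}
```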
Scientific Programs Stream Well

Application                       Sustained   FP Ops /   LRF Refs         SRF Refs       Mem Refs
                                  GFLOPS      Mem Ref
StreamFEM3D¹ (Euler, quadratic)   31.6        17.1       153.0M (95.0%)   6.3M (3.9%)    1.8M (1.1%)
StreamFEM3D¹ (MHD, constant)      39.2        13.8       186.5M (99.4%)   7.7M (0.4%)    2.8M (0.2%)
StreamMD¹ (grid algorithm)        14.2²       12.1²      90.2M (97.5%)    1.6M (1.7%)    0.7M (0.8%)
GROMACS¹                          22.0²       7.1²       181M (95.4%)     5.3M (2.8%)    3.4M (1.8%)
StreamFLO                         12.9²       7.4²       234.3M (95.7%)   7.2M (2.9%)    3.4M (1.4%)

1. Simulated on a machine with 64 GFLOPS peak performance.
2. The low numbers are a result of many divide and square-root operations.
Software tools efficiently map programs to stream processors
Stream Compiler Achieves Near-Optimum Kernel Performance
[Figure: VLIW schedules for the ComputeCellInterior kernel from the StreamFEM application – a single iteration and the software-pipelined schedule, over roughly 120 cycles.]
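Software pipelining in miniature, to show what the "software pipeline" schedule is doing. This is not the compiler's schedule for ComputeCellInterior; it is a two-stage toy in which the scale of iteration i overlaps the load of iteration i+1:

```cpp
#include <cstdio>

int main() {
    const int n = 8;
    double a[n], b[n];
    for (int i = 0; i < n; ++i) a[i] = i;

    // Original loop: for each i, load a[i], scale it, store to b[i].
    // Software-pipelined version: overlap the "scale" of iteration i with
    // the "load" of iteration i+1 so both units stay busy every cycle.
    double loaded = a[0];                 // prologue: first load
    for (int i = 0; i < n - 1; ++i) {
        double scaled = loaded * 2.0;     // stage 2 of iteration i
        loaded = a[i + 1];                // stage 1 of iteration i+1 (overlapped)
        b[i] = scaled;                    // stage 3 of iteration i
    }
    b[n - 1] = loaded * 2.0;              // epilogue: finish the last iteration

    for (int i = 0; i < n; ++i) std::printf("%g ", b[i]);  // 0 2 4 ... 14
    std::printf("\n");
    return 0;
}
```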
Stream Compiler Reduces Bandwidth Demand Compared to Caching
[Dataflow from the StreamFEM application: kernels Compute Flux States, Compute Numerical Flux, Gather Cell Numerical Flux, Compute Cell Interior, and Advance Cell, connected by streams of Element Faces, Gathered Elements, Face Geometry, Flux States, Numerical Flux, Cell Numerical Flux, Cell Geometry, Cell Orientations, and Elements (current and new). The Master Element is read-only table-lookup data.]
Alternatives to stream processors: what are the issues?
Many proposed architectures
• All are tiled – arrays of ALU/reg blocks
• Issues are
– Local data storage (register hierarchy)
– Control (time vs. space multiplexing, aspect ratio)
• Includes mix of ALUs (VLIW or not, aspect ratio)
– Latency hiding mechanisms
• Real issue is programming model and compiler technology
Data Storage

                           Local Regs    Global Regs        Tile RAM     Chip RAM
                           (per ALU)     (per tile)
Stream Processor           LRF           LRF via Clust SW   SRF          SRF via IC SW
Processor Array (SM, Raw)  Regs          –                  RAM          RAM via network
GPU                        Regs          –                  –            R/O T$
Use                        Kernel        Kernel             ProdCon      Working Set,
                           Locality      Locality           Locality     Tables
Control
• Really a question of aspect ratio
  – DP vs. ILP vs. TLP (data vs. instruction vs. thread parallelism)
• Data parallelism is the least expensive
  – Amortizes the area and power of control (instruction fetch, decode, etc.)
  – Perfectly load-balances the computation (see next slide)
  – Simplifies synchronization
  – Can handle conditionals efficiently
• Instruction-level parallelism is the next least expensive
  – Single sequencer
  – Simple synchronization and scheduling
• Bottom line: threads are expensive – do a lot in one thread: SIMD x VLIW (see the sketch below)
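A toy rendering of "SIMD x VLIW" (the cluster and slot counts are illustrative, not the Imagine or Merrimac configuration): one sequencer issues each VLIW bundle once, and every cluster applies it to its own data, so fetch and decode cost is amortized over clusters x slots ALUs:

```cpp
#include <cstdio>

constexpr int kClusters = 16;  // data parallelism: one SIMD lane per cluster
constexpr int kSlots    = 4;   // ILP: ALU slots filled by the VLIW compiler
constexpr int kBundles  = 2;   // length of the toy kernel

using Op = double (*)(double); // one ALU operation

int main() {
    // The kernel: kBundles VLIW instructions, each with an op per ALU slot.
    // There is one copy, read by one sequencer.
    const Op kernel[kBundles][kSlots] = {
        {[](double x) { return x + 1; }, [](double x) { return x * 2; },
         [](double x) { return x - 3; }, [](double x) { return x * x; }},
        {[](double x) { return x / 2; }, [](double x) { return x + 7; },
         [](double x) { return -x; },    [](double x) { return x * 3; }},
    };

    double lrf[kClusters][kSlots] = {};                  // per-cluster register state
    for (int c = 0; c < kClusters; ++c)
        for (int s = 0; s < kSlots; ++s) lrf[c][s] = c;  // each cluster holds its own data

    for (int i = 0; i < kBundles; ++i)          // fetched and decoded once per bundle...
        for (int c = 0; c < kClusters; ++c)     // ...executed by every cluster (SIMD)...
            for (int s = 0; s < kSlots; ++s)    // ...across all ALU slots (VLIW)
                lrf[c][s] = kernel[i][s](lrf[c][s]);

    std::printf("cluster 0: %g %g %g %g\n", lrf[0][0], lrf[0][1], lrf[0][2], lrf[0][3]);
    return 0;
}
```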
Time-multiplexing vs. Space-multiplexing
• Time multiplexed – all clusters execute the same kernel, each operating on a different stream element, until an entire stream has been processed
• Space multiplexed – each tile executes a different kernel, forwarding results to the next tile
[Figure: four clusters stepping together through K1, K2, K3, K4 over time (time multiplexed) versus four tiles each pinned to one of K1–K4 with elements flowing between them (space multiplexed).]
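A sketch of the two schedules as loop nests (the kernels are placeholders): time multiplexing is "for each kernel, for each element", so every cluster is always running the same kernel; space multiplexing is "for each element, for each kernel", with each kernel pinned to its own tile. The results are identical; what differs is which hardware is busy when:

```cpp
#include <cstdio>
#include <vector>

using Kernel = double (*)(double);

// Time multiplexed: every cluster runs the SAME kernel, each on a different
// element, until the entire stream has been processed; then the next kernel.
void timeMultiplexed(std::vector<double>& stream, const std::vector<Kernel>& kernels) {
    for (Kernel k : kernels)
        for (double& e : stream)     // conceptually spread across the clusters
            e = k(e);
}

// Space multiplexed: each tile owns ONE kernel; elements flow tile to tile.
void spaceMultiplexed(std::vector<double>& stream, const std::vector<Kernel>& kernels) {
    for (double& e : stream)         // an element advances through the pipeline
        for (Kernel k : kernels)     // tile i applies kernel i, forwards the result
            e = k(e);
}

int main() {
    std::vector<Kernel> ks = {
        [](double x) { return x + 1; },   // K1
        [](double x) { return x * 2; },   // K2
        [](double x) { return x - 3; },   // K3
        [](double x) { return x / 4; },   // K4
    };
    std::vector<double> a = {1, 2, 3, 4}, b = a;
    timeMultiplexed(a, ks);
    spaceMultiplexed(b, ks);
    std::printf("same results: %s\n", a == b ? "yes" : "no");  // the schedules differ, not the math
    return 0;
}
```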
Load Imbalance in OpenGL Pipeline vs Scene
Latency Hiding Mechanisms
• When waiting 100s of cycles on a memory access you should:
  a) Stall
     – Clearly the wrong answer: all resources are idled for the full access latency.
  b) Multithread
     – This keeps resources busy, but at a very high cost. The full data state (registers) and control state (PC, PSW, etc.) of each thread must be replicated, and in the worst case a very large number of threads is required to keep the memory pipeline full: N = B x T (Little's Law).
  c) Stream
     – The correct answer. This keeps resources busy at minimal cost: only live state is stored (in the SRF) while waiting on remote references, and no control state is replicated. Much lower cost to hide a given amount of latency.
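Little's Law with illustrative numbers (the slide gives only N = B x T; the bandwidth and latency here are hypothetical): to sustain B = 8 words/cycle against T = 500 cycles of memory latency, a machine needs

```latex
\[ N = B \times T = 8\ \tfrac{\text{words}}{\text{cycle}} \times 500\ \text{cycles} = 4000\ \text{words in flight.} \]
```

A multithreaded machine covers that by replicating whole thread contexts; a stream processor issues the gather as one bulk stream transfer and holds only the live data in the SRF.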
Summary of Tiled Architecture Issues
• Local data storage
  – Expose a deep register hierarchy:
    • Local registers – for kernel locality
    • Local RAM arrays (SRFs) – for producer/consumer locality (R/W access)
    • Global RAM arrays (SRFs) – for the working set (global R/W access)
    • Caches – as backup
• Control
  – Exploit the parallelism where it is least expensive:
  – 2- to 8-way VLIW, then
  – data parallel (with an efficient conditional mechanism), then
  – thread parallel
• Latency hiding
  – Keep execution resources busy with a minimum of state
• Compiler technology
  – Expose communication and optimize it
Summary
• The problem is bandwidth – arithmetic is cheap
• Programs exhibit P & L
  – Parallelism – Data, Kernel-ILP, Thread
  – Locality – Kernel, ProdCon, WorkingSet
• A stream processor exploits this P & L
  – Exposed register hierarchy to exploit locality
    • LRF, LRFs, SRF, SRFs, $, M
  – Clusters x ALUs to exploit DP x ILP
• GPUs have the parallelism, but limited locality
  – No R/W storage between registers and DRAM
• Demonstrated on several applications
  – Embedded applications – modems, codecs, beamforming, graphics, …
  – Scientific applications – FLO, MD, FEM
  – 50% of peak, GOPS/mm² better than ASICs
• Many tiled architectures
  – Issues are storage, control, and latency hiding
  – Major issue is programming model and compiler technology
[Figure: the bandwidth-hierarchy diagram (DRAM banks, chip pins and router, memory switch, cache banks, SRF lanes, cluster switches, LRFs) and a kernel pipeline of convolve and SAD stages; the ALU and cluster arrays shown 1D here may be laid out as 2D arrays.]