Stream Processors vs. GPUs
Bill Dally
Computer Systems Laboratory, Stanford University
August 8, 2004
(Stream GP2: 1)
The Big Picture (Stream GP2: 2)
• VLSI technology: arithmetic is cheap, bandwidth is expensive, and cost grows with distance.
  [Figure: a 0.5 mm 64-bit FPU (to scale, 50 pJ/FLOP) on a 12 mm, 90 nm, 1 GHz, $200 chip, with the distance reachable in 1 clock marked.]
• Applications exhibit producer-consumer locality, data parallelism, and ILP.
  [Figure: Process Cells and Update Faces kernels passing a Subgrid between them.]
• A stream processor exploits locality with update, and exploits DP and ILP.
  [Figure: clusters 1 … N-1, each with LRFs, a cluster switch (CL SW), and an SRF lane, joined by a global switch.]

What is a stream processor? And how is it different from a GPU? (Stream GP2: 3)
("My project is a stream processor." "My project is a stream processor too.")

What is a stream processor? (Stream GP2: 4)
[Figure: the storage hierarchy and its bandwidths. Off-chip DRAM banks reach the chip through its pins and router at 16 GB/s; on-chip cache banks supply 64 GB/s through the memory switch (M SW); SRF lanes supply 512 GB/s; the LRFs behind each cluster switch (CL SW) supply 3,840 GB/s. The interconnect levels are labeled 100 (wire), 1k (switch), and 10k (chip crossing(s)).]

Mapping an Application (Stream GP2: 5)
[Figure: stream graph. The Grid is gathered, via a Grid Index, into a Subgrid; Process Cells and Process Faces kernels (the latter driven by a Face Index) operate on the subgrid; results are scattered back to the Grid.]
An abstraction of the irregular-grid FEM code written by Tim Barth (NASA Ames).

Kernel Locality and ILP (Stream GP2: 6)
• Kernel locality and ILP within each kernel.
• Uses the multiple ALUs and LRFs in each cluster.
[Figure: the Process Cells and Process Faces kernels mapped onto the LRFs and cluster switch of a cluster, fed by its SRF lane (512 GB/s at the SRF, 3,840 GB/s at the LRFs).]

Producer-Consumer Locality (Stream GP2: 7)
• Producer-consumer locality between kernels.
• Uses the SRF lane local to each cluster.
[Figure: the Subgrid stream produced by Process Cells is consumed by Process Faces through the local SRF lane.]

Local Update of Working Set (Stream GP2: 8)
• Updates of the subgrid (a working subset of the grid).
• Global access to SRF banks across clusters.
[Figure: the gather and scatter of the subgrid mapped onto the full hierarchy – DRAM banks (16 GB/s), cache banks (64 GB/s), SRF lanes (512 GB/s), LRFs (3,840 GB/s).]

Data Parallelism (Stream GP2: 9)
• Operate on multiple grid (subgrid) points in parallel.
• Exploit multiple clusters.
[Figure: the same stream graph executing across multiple clusters in parallel.]

P & L: Parallelism and Locality (Stream GP2: 10)
• Data parallelism x ILP – uses lots of ALUs
  – Over 2K ALUs can be used productively (Khailany et al., HPCA '03)
  – DP is much less expensive (area and power) than TLP
  – Can still handle conditionals efficiently (Kapasi et al., MICRO '00)
  – Distributed local registers are much less expensive than global registers (Rixner et al., HPCA '00)
• Kernel locality + producer-consumer locality + local working-set update
  – Most references are local – to the LRF, the local SRF, or the global SRF
  – Very few references go to off-chip memory
  – Explicit SRF accesses are more efficient than implicit cache accesses (they enable scheduling)
  – (A sketch after this slide illustrates how the intermediate stream stays on chip.)
• It takes good compilers to map to a register hierarchy well (Mattson et al., ASPLOS '00; Kapasi et al., WMSP '01)
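To make the locality argument concrete, here is a minimal C++ sketch — my own illustration, not Tim Barth's FEM code and not actual Stream-C/Kernel-C — of the gather → process-cells → process-faces → scatter pipeline of slide 5. The block size, buffer names, and kernel arithmetic are all hypothetical; the point is only that the stream produced by one kernel is consumed by the next out of a small on-chip-sized buffer, so the large grid array is touched once on the gather and once on the scatter.

// Hypothetical sketch: kernel-to-kernel intermediates live in small buffers that
// stand in for an SRF lane; the big grid array (standing in for DRAM) is read
// once (gather) and written once (scatter) per strip.
#include <cstdio>
#include <vector>

constexpr int kBlock = 256;                  // strip size that would fit an SRF lane

struct Cell { float state; };

int main() {
    const int n = 1 << 16;
    std::vector<Cell> grid(n, Cell{1.0f});   // stands in for the grid in DRAM
    std::vector<int> grid_index(n);          // irregular gather/scatter index
    for (int i = 0; i < n; ++i) grid_index[i] = (i * 7919) % n;  // a permutation

    float subgrid[kBlock];                   // stands in for an SRF lane
    float faces[kBlock];                     // producer-consumer intermediate

    for (int base = 0; base < n; base += kBlock) {
        // Gather: the only off-chip reads in the strip.
        for (int i = 0; i < kBlock; ++i)
            subgrid[i] = grid[grid_index[base + i]].state;

        // Kernel 1 ("Process Cells"): touches only on-chip data.
        for (int i = 0; i < kBlock; ++i)
            subgrid[i] = subgrid[i] * 0.5f + 1.0f;

        // Kernel 2 ("Process Faces"): consumes kernel 1's output straight from the
        // on-chip buffer -- producer-consumer locality, no memory traffic.
        for (int i = 0; i < kBlock; ++i)
            faces[i] = subgrid[i] - subgrid[(i + 1) % kBlock];

        // Local update of the working set (read-modify-write within the "SRF").
        for (int i = 0; i < kBlock; ++i)
            subgrid[i] += 0.25f * faces[i];

        // Scatter: the only off-chip writes in the strip.
        for (int i = 0; i < kBlock; ++i)
            grid[grid_index[base + i]].state = subgrid[i];
    }
    std::printf("grid[0].state = %f\n", grid[0].state);
    return 0;
}

On a stream processor the subgrid and faces buffers would live in the SRF lane and the per-element temporaries in the LRFs; here ordinary arrays simply stand in for them.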
Register Hierarchy (Stream GP2: 11)
[Figure: the register hierarchy – DRAM, cache ($), memory switch (M SW), SRF, global switch (G SW), SRF lanes, cluster switch (CL SW), LRFs – annotated with the locality each level captures and its bandwidth in units of the arithmetic rate A: the LRFs capture kernel locality (DRF efficiency, 3A); the SRF captures producer-consumer locality (.03A – .3A) and provides R/W access to the working set (SRF efficiency, .01A – .1A); off-chip DRAM supplies < .01A.]

So? What about GPUs? (Stream GP2: 12)
[Figure: the GPU pipeline – Vertex Shader → Rasterizer → Pixel Shader → Compositer → Frame Buffer – with a read-only Texture Cache feeding the shaders.]
• Can stream data through the shaders, but…
• R/W access only to the local registers within a shader.
• All other writes must cycle through the frame buffer.
• Can't capture producer-consumer locality or the working set.

Stream Processor vs GPU (Stream GP2: 13)
[Figure: the two hierarchies side by side. Stream processor: DRAM → $ → M SW → SRF → G SW → SRF → CL SW → LRF, with off-chip supplying < .01A, the SRF capturing producer-consumer locality (.03A – .3A) and R/W working-set access (SRF efficiency, .01A – .1A), and the LRFs capturing kernel locality (DRF efficiency, 3A). GPU: the producer-consumer and R/W working-set traffic (.03A – .3A) must go off chip, the texture cache (T$) offers only read-only table lookup (.01A – .1A), and the LRFs capture kernel locality (3A).]
(A sketch after this slide contrasts the two data paths.)
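To illustrate slide 12's point that a 2004-era GPU can only pass kernel results through the frame buffer, here is a small C++ sketch — an illustrative model, not any real graphics API — that computes the same two-kernel result both ways and tallies the words that cross the chip boundary under a deliberately crude cost model. The function names and the counting scheme are my own assumptions.

// Hypothetical model: every touch of the "frame buffer" or input/output arrays
// counts as one off-chip word; data held in local variables counts as free.
#include <cstdio>
#include <vector>

static long off_chip_words = 0;               // running tally of off-chip traffic

float shade1(float x) { return x * 0.5f + 1.0f; }
float shade2(float x) { return x * x; }

// GPU-style multipass: pass 1 writes its output to the frame buffer (off chip),
// pass 2 reads it back as a texture.
void multipass(const std::vector<float>& in, std::vector<float>& out) {
    std::vector<float> framebuffer(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        framebuffer[i] = shade1(in[i]);       // write intermediate to frame buffer
        off_chip_words += 2;                  // read input + write intermediate
    }
    for (size_t i = 0; i < in.size(); ++i) {
        out[i] = shade2(framebuffer[i]);      // read intermediate, write result
        off_chip_words += 2;
    }
}

// Stream-processor-style: the intermediate lives in on-chip (SRF-like) storage,
// so only the input read and the final write touch memory.
void streamed(const std::vector<float>& in, std::vector<float>& out) {
    for (size_t i = 0; i < in.size(); ++i) {
        float tmp = shade1(in[i]);            // stays in registers / SRF
        out[i] = shade2(tmp);
        off_chip_words += 2;                  // read input + write result only
    }
}

int main() {
    std::vector<float> in(1 << 20, 1.0f), out(in.size());
    multipass(in, out); long gpu = off_chip_words; off_chip_words = 0;
    streamed(in, out);  long sp  = off_chip_words;
    std::printf("multipass: %ld words off chip, streamed: %ld\n", gpu, sp);
    return 0;
}

In this model the multipass version moves 4 words per element off chip while the streamed version moves 2; with deeper kernel chains the gap grows, which is the producer-consumer locality a read-only texture cache cannot capture.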
Our experience with stream processors (Stream GP2: 14)
For signal/image processing, and for scientific computing.

Imagine Prototype (Stream GP2: 15)
• Imagine
  – Stream processor for image and signal processing
  – 16 mm die in a 0.18 µm TI process
  – 21M transistors
  – Collaboration with TI ASIC
• Software tools based on Stream-C / Kernel-C
  – Stream scheduler
  – Communication scheduling
• Many applications
  – 3 graphics pipelines
  – Image-processing apps – depth, MPEG
  – 3G cellphone (Rice)
  – STAP
  – IPv6, VPN

Bandwidth Demand of Media Applications (Stream GP2: 16)
[Figure.]

SPI Power Dissipation (Stream GP2: 17)
[Two power-breakdown pie charts, one per chip. One chart: Cluster ALUs 42%, Cluster LRFs, Switch, and Control 21%, SRF SRAMs + SBs 15%, Clock Tree 11%, Other 5%, MBANKs 3%, UC SRAMs 3%. The other: Cluster LRFs 38%, Cluster ALUs 31%, Cluster Switch & Control 12%, SRF SRAMs + SBs 8%, Clock Tree 4%, Other 4%, UC 2%, MBANKs 1%.]
• Imagine (0.18 µm – 48 FP ALUs)
  – 3.1 W, 132 MHz, 1.5 V (measured)
• SP64 (90 nm – 1280 16-bit ALUs)
  – 5 W, 640 MHz, 0.8 V (estimated)
  – 160 MOPS/mW, >10 GOPS/mm2
• High-confidence power estimates
  – Power dissipation is dominated (>90%) by very predictable sources – RFs, ALUs, switches between ALUs, and clocks
  – SRAM/RF datasheets from the 90 nm process for the LRFs, SRF SRAMs + SBs, and UC
  – Post-synthesis measurements for the cluster ALUs
  – A detailed floorplan provides the switch power
  – MBANKs and clock tree scaled from Imagine

SPI SP8 vs Programmable Competition (Stream GP2: 18)
[Scatter plot: programmable GMACS/W (1–100, horizontal) versus programmable GMACS/$ (0.01–10, vertical) for the SPI SP8 and SP8-LV against the Xilinx Virtex-II Pro, TI TMS320C6414T, TI DM642, ADI ADSP-TS201S, Intrinsity FastMATH, Cradle CT3400, Equator BSP-16, Mathstar SOA13D40-01, PicoChip PC101, Morpho MRC6011, and Intel MXP5800.]

Architecture of a Streaming Supercomputer (Stream GP2: 19)
[Figure: system hierarchy.
 – Node: a stream processor (128 FPUs, 128 GFLOPS) with 16 x DRDRAM (2 GBytes, 16 GBytes/s) and a 16 GBytes/s (32+32 pairs) link into the on-board network.
 – Board: 16 nodes – 1K FPUs, 2 TFLOPS, 32 GBytes – with an on-board network of 64 GBytes/s (128+128 pairs) connecting to the backplane over 6" Teradyne GbX.
 – Backplane: 32 boards – 512 nodes, 64K FPUs, 64 TFLOPS, 1 TByte – with a 1 TBytes/s intra-cabinet network (2K+2K links) and E/O / O/E conversion onto a ribbon-fiber inter-cabinet network.
 All links run at 5 Gb/s per pair or fiber; all bandwidths are full duplex; bisection bandwidth is 32 TBytes/s.]

Merrimac Processor (Stream GP2: 20)
[Floorplan, 12.5 mm x 10.2 mm: an array of clusters (each 1.6 mm x 2.3 mm, containing an 8K-word SRF bank, four FP/INT 64-bit MADD units, and 64-word register files), cache banks, a microcontroller, memory switch, address generators, reorder buffers, a MIPS64 20Kc scalar processor, the network interface with forward ECC, and 16 RDRAM interfaces.]
• 90 nm technology (1 V), ASIC technology
• 1 GHz (20 FO4)
• 128 GOPs
• Inter-cluster switch between clusters
• 127.5 mm2 (small, ~12 x 10 mm)
  – Stanford Imagine is 16 mm x 16 mm
  – MIT Raw is 18 mm x 18 mm
• 25 Watts (P4 = 75 W)
  – ~41 W with memories

Merrimac Power Estimates (Stream GP2: 21)
  Microcontroller      0.6 W
  Scalar CPU           2.2 W
  MADD ALU             7.0 W
  LRF                  1.3 W
  SRF                  1.4 W
  Switches             1.3 W
  Cache                1 W
  Memory Controller    5 W
  Network Controller   5 W
  DRAM                 16 W

Scientific Programs Stream Well (Stream GP2: 22)

  Application                       Sustained GFLOPS   FP Ops / Mem Ref   LRF Refs          SRF Refs       Mem Refs
  StreamFEM3D (Euler, quadratic)[1]      31.6               17.1          153.0M (95.0%)    6.3M (3.9%)    1.8M (1.1%)
  StreamFEM3D (MHD, constant)[1]         39.2               13.8          186.5M (99.4%)    7.7M (0.4%)    2.8M (0.2%)
  StreamMD (grid algorithm)[1]           14.2[2]            12.1[2]        90.2M (97.5%)    1.6M (1.7%)    0.7M (0.8%)
  GROMACS[1]                             22.0[2]             7.1[2]         181M (95.4%)    5.3M (2.8%)    3.4M (1.8%)
  StreamFLO                              12.9[2]             7.4[2]        234.3M (95.7%)   7.2M (2.9%)    3.4M (1.4%)

  [1] Simulated on a machine with 64 GFLOPS peak performance.
  [2] The low numbers are a result of many divide and square-root operations.
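A quick back-of-envelope check — my arithmetic on the table above, not a figure from the talk — shows how sharply the register hierarchy filters traffic for the first row (StreamFEM3D, Euler, quadratic):

\[
\frac{\text{LRF refs}}{\text{Mem refs}} = \frac{153.0\,\text{M}}{1.8\,\text{M}} \approx 85,
\qquad
\frac{\text{Mem refs}}{\text{total refs}} = \frac{1.8\,\text{M}}{153.0\,\text{M} + 6.3\,\text{M} + 1.8\,\text{M}} \approx 1.1\%.
\]

Roughly one reference in ninety leaves the chip, consistent with the bandwidth hierarchy of slide 4 (3,840 GB/s at the LRFs versus 16 GB/s off chip).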
Software tools efficiently map programs to stream processors (Stream GP2: 23)

Stream Compiler Achieves Near Optimum Kernel Performance (Stream GP2: 24)
[Figure: ALU schedules for the ComputeCellInterior kernel from the StreamFEM application – a single iteration and the software-pipelined schedule are shown.]

Stream Compiler Reduces Bandwidth Demand Compared to Caching (Stream GP2: 25)
[Figure: stream graph of the StreamFEM application. Kernels: Compute Flux States, Compute Numerical Flux, Gather Cell, Compute Cell Interior, Advance Cell. Streams: Gathered Elements, Element Faces, Face Geometry, Numerical Flux, Cell Geometry, Cell Orientations, Elements (Current), Elements (New). Read-only table-lookup data (the master element) is shared by the kernels.]

Alternatives to stream processors – what are the issues? (Stream GP2: 26)

Many proposed architectures (Stream GP2: 27)
• All are tiled – arrays of ALU/register blocks
• The issues are
  – Local data storage (register hierarchy)
  – Control (time vs. space multiplexing, aspect ratio)
    • Includes the mix of ALUs (VLIW or not, aspect ratio)
  – Latency-hiding mechanisms
• The real issue is programming model and compiler technology

Data Storage (Stream GP2: 28)

                             Local Regs (per ALU)   Global Regs (per tile)   Tile RAM           Chip RAM
  Stream Processor           LRF                    LRF via Clust SW         SRF                SRF via IC SW
  Processor Array (SM, Raw)  Regs                   –                        RAM                RAM via network
  GPU                        Regs                   –                        –                  R/O T$ (tables)
  Use                        Kernel Locality        Kernel Locality          ProdCon Locality   Working Set

Control (Stream GP2: 29)
• Really a question of aspect ratio
  – DP vs ILP vs TLP (data vs instruction vs thread parallelism)
• Data parallelism is the least expensive
  – Amortizes the area and power of control (instruction fetch, decode, etc.)
  – Perfectly load-balances the computation (see the next slide)
  – Simplifies synchronization
  – Can handle conditionals efficiently
• Instruction-level parallelism is the next least expensive
  – Single sequencer
  – Simple synchronization and scheduling
• Bottom line: threads are expensive – do a lot in one thread (SIMD x VLIW)

Time-multiplexing vs. Space-multiplexing (Stream GP2: 30)
• Time multiplexed – all clusters execute the same kernel, each operating on a different stream element, until an entire stream has been processed.
• Space multiplexed – each tile executes a different kernel, forwarding its results to the next tile.
[Figure: time-multiplexed clusters 0–3 each run K1, then K2, K3, and K4 over the whole stream; space-multiplexed tiles 0–3 run K1, K2, K3, and K4 respectively, with elements flowing from tile to tile over time.]

Load Imbalance in OpenGL Pipeline vs Scene (Stream GP2: 31)
[Figure.]

Latency Hiding Mechanisms (Stream GP2: 32–35)
• When waiting 100s of cycles on a memory access, you should:
  a) Stall – clearly the wrong answer; all resources are idled for the full access latency.
  b) Multithread – keeps resources busy, but at a very high cost. The full data state (registers) and control state (PC, PSW, etc.) of each thread must be replicated, and in the worst case a very large number of threads is required to keep the memory pipeline full: N = B x T (Little's Law).
  c) Stream – the correct answer. Keeps resources busy at minimal cost: only live state is stored (in the SRF) while waiting on remote references, and no control state is replicated. Much lower cost to hide a given amount of latency. (A worked example follows this slide.)
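To make the N = B x T argument concrete, here is a worked example with assumed numbers — the bandwidth and latency below are illustrative, not figures from the talk:

\[
N = B \times T = 8\ \frac{\text{words}}{\text{cycle}} \times 500\ \text{cycles} = 4{,}000\ \text{words in flight}.
\]

A multithreaded machine must find enough independent threads to keep those 4,000 accesses outstanding, replicating registers, PC, and PSW for every one of them; a stream processor instead issues thousands of stream-record loads into the SRF with no per-record control state, hiding the same latency with far less replicated state.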
Summary of Tiled Architecture Issues (Stream GP2: 36)
• Local data storage – expose a deep register hierarchy
  – Local registers – for kernel locality
  – Local RAM arrays (SRFs) – for producer/consumer locality (R/W access)
  – Global RAM arrays (SRFs) – for the working set (global R/W access)
  – Caches – as backup
• Control – exploit the parallelism where it is least expensive
  – 2- to 8-way VLIW, then
  – data parallel (with an efficient conditional mechanism), then
  – thread parallel
• Latency hiding – keep execution resources busy with a minimum of state
• Compiler technology – expose communication and optimize it

Summary (Stream GP2: 37)
• The problem is bandwidth – arithmetic is cheap
• Programs exhibit P & L
  – Parallelism – Data, Kernel-ILP, Thread
  – Locality – Kernel, ProdCon, WorkingSet
  [Figure: an example kernel pipeline – Convolve, Convolve, SAD, Convolve, Convolve.]
• A stream processor exploits this P & L
  – Exposed register hierarchy to exploit locality: LRF, LRFs, SRF, SRFs, $, M
  – Clusters x ALUs to exploit DP x ILP
  [Figure: the register hierarchy again – LRFs, cluster switch, SRF lanes, cache banks, DRAM banks, chip pins and router. The ALU and cluster arrays shown 1D here may be laid out as 2D arrays.]
• GPUs have the parallelism, but limited locality
  – No R/W storage between registers and DRAM
• Demonstrated on several applications
  – Embedded applications – modems, codecs, beamforming, graphics, …
  – Scientific applications – FLO, MD, FEM
  – 50% of peak, GOPS/mm2 better than ASICs
• Many tiled architectures
  – Issues are storage, control, and latency hiding
  – The major issue is programming model and compiler technology