Exascale Opportunities and Challenges: Why Do We Care?
Kathy Yelick
Associate Laboratory Director for Computing Sciences and NERSC Center Director, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley

High Performance Computing in Science
• Science at Scale
• Science through Volume
• Science in Data

Science at Scale: Simulations Aid the Design of Energy-Efficient Devices
• Combustion simulations improve future designs
 – Model fluid flow, burning, and chemistry
 – Use advanced math algorithms
 – Require petascale systems today
• Simulations reveal features not visible in lab experiments
• Energy-efficient, low-emissions technology licensed by industry
• Exascale computing is needed to design for alternative fuels and new devices

Science at Scale: Impacts of Climate Change
• Warming ocean and the Antarctic ice sheet are key to sea level rise
 – Previous climate models were inadequate
• Adaptive Mesh Refinement (AMR) resolves the ice-ocean interface
 – BISICLES Pine Island Glacier simulation: mesh resolution is crucial for grounding-line behavior
 – Dynamics need very fine resolution (AMR), yet Antarctica is still very large (scalability)
• Ongoing collaboration to couple ice sheet and ocean models
 – 19M hours at NERSC
• Exascale machines are needed to improve detail in the models, including ice and clouds
(Figures: Antarctic ice speed, where AMR enables sub-1 km resolution, using NERSC's Hopper; enhanced POP ocean model solution for coupling to ice.)

Science in Data: From Simulation to Image Analysis
• LBNL computing on data was key in 4 of the 10 breakthroughs of the decade: 3 genomics problems plus the CMB
• Data rates from experimental devices will require exascale-volume computing
 – Cost of sequencing is falling faster than Moore's Law
 – Rate and density of CCDs are growing faster than Moore's Law
 – Computing demand grows faster than the data; O(n^2) algorithms are common
 – Computing performance is growing slower than Moore's Law
(Chart: projected rates, as increase over 2010, for sequencers, detectors, processors, and memory, 2010-2015.)

Science through Volume: Screening Drugs to Batteries
• Large numbers of simulations covering a variety of related materials, chemicals, proteins, …
• Dynameomics Database: improve understanding of disease and drug design, e.g., 11,000 protein unfolding simulations stored in a public database
• Materials Genome: cut in half the 18 years from design to manufacturing, e.g., 20,000 potential battery materials stored in a database
(Figure: today's batteries, the voltage limit, and interesting candidate materials.)

Science in Data: Image Analysis in Astronomy
• Data analysis in the 2006 Nobel Prize: measurement of temperature patterns; Smoot and Mather's 1992 COBE experiment showed the anisotropy of the Cosmic Microwave Background
• Simulations used in the 2011 Prize: Type Ia supernovae used as "standard candles" to measure distance
• More recently, astrophysicists discovered an early, nearby supernova
 – The Palomar Transient Factory runs machine learning algorithms on ~300 GB/night delivered by the ESnet "science network"
 – A rare glimpse of a supernova within hours of explosion, 20M light years away
 – Telescopes worldwide were redirected to catch images
(Images: the supernova field on 23, 24, and 25 August.)

HPC Has Moved Scientists through Difficult Technology Transitions
(Chart: application performance growth from Gordon Bell Prizes, 1.E+08 to 1.E+18 operations/sec, 1990-2020, with the "attack of the killer micros" transition marked.)

HPC: From Vector Supercomputers to Massively Parallel Systems
• Vector machines were programmed by "annotating" serial programs
• MPP and cluster systems are programmed by completely rethinking algorithms and software for parallelism
(Chart: TOP500 systems by architecture, 1993-2011 — SIMD, single processor, SMP, constellation, cluster, MPP — with industrial use between 25% and 50%.)

Scientists Need to Undertake Another Difficult Technology Transition
(Chart: application performance growth from Gordon Bell Prizes, 1990-2020, extrapolated toward a first exascale application — a billion-billion operations per second — as the "attack of the killer cellphones?": the rest of the computing world gets parallelism.)

The Exascale Challenge: Energy Efficiency

Energy Cost Challenge for Computing Facilities
• At ~$1M per MW, energy costs are substantial
 – 1 petaflop in 2010 used 3 MW
 – 1 exaflop in 2018 is possible in 200 MW with "usual" scaling
• Exascale design = energy-constrained design
(Chart: power under "usual" scaling vs. the goal, 2005-2020.)

PUE of Data Centers
• PUE = total facility power / computer power, i.e., the facility overhead on top of the computer itself
• But is this what we want to measure?
(Figure: current facility vs. new design.)
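To put these power and efficiency numbers together, here is a rough back-of-the-envelope sketch (my illustration, not from the slides: it assumes the ~$1M per MW figure is per megawatt-year, and it borrows the PUE values quoted later in the talk, 1.1-1.3 possible and 1.8 typical).

```c
/* Back-of-the-envelope facility cost sketch (illustrative, not from the talk).
 * PUE = total facility power / computer (IT) power, so the power you pay for
 * is IT power * PUE. The dollar figure assumes ~$1M per MW-year.
 */
#include <stdio.h>

static void facility_cost(const char *label, double it_power_mw, double pue) {
    double total_mw       = it_power_mw * pue;   /* facility power incl. cooling etc. */
    double dollars_per_yr = total_mw * 1.0e6;    /* assumed ~$1M per MW-year          */
    printf("%-24s IT %6.1f MW  PUE %.1f  facility %6.1f MW  ~$%.0fM/year\n",
           label, it_power_mw, pue, total_mw, dollars_per_yr / 1.0e6);
}

int main(void) {
    facility_cost("1 PF system (2010)",   3.0, 1.8);   /* "1.8 typical" PUE           */
    facility_cost("1 PF, efficient site", 3.0, 1.1);   /* "1.1-1.3 is possible"       */
    facility_cost("1 EF, usual scaling", 200.0, 1.1);  /* 200 MW exaflop -> ~$220M/yr */
    return 0;
}
```

Even at an aggressive PUE, a 200 MW exaflop system would cost on the order of $200M per year to power, which is why exascale design is energy-constrained design.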
How to Measure Efficiency?
• For scientific computing centers, the metric should be science output per Watt
 – NERSC in 2010 ran at 450 publications per MW-year
 – But that number drops with each new machine
• Next best: application performance per Watt
 – The newest, largest machine is best: lower energy and cost per core
 – Goes up with Moore's Law
• Race-to-halt generally minimizes energy use
(Chart: cost per core-hour, up to about $0.09, broken into center, sysadmin, and power & cooling costs for an old HPC system, a cluster, and a new HPC system.)

Power vs. Energy
• Two related (but different!) problems
 – Minimize peak power: keep machines from exceeding facility power and melting chips
 – Energy efficiency: minimize Joules per science publication
• Race-to-halt to minimize energy (sketched below)
 – Leakage current is nearly 50% of power
 – Finish as quickly as possible, maximizing simultaneous hardware usage
• Dynamic clock speed scaling
 – Under hardware control to implement power caps and thermal limits; software will probably adapt to this rather than control it
• Dark silicon
 – More transistors than you can afford to power; more likely to have specialized hardware
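The race-to-halt argument can be made concrete with a toy energy model (a hedged sketch, not from the talk: it assumes static/leakage power is about half of peak power and is paid for as long as the job runs, while dynamic power scales roughly linearly with clock speed because voltage scaling is largely exhausted; the numbers are illustrative only).

```c
/* Toy "race-to-halt" energy model (illustrative sketch, not from the talk).
 * Assumptions (hypothetical): static/leakage power is ~50% of peak and does
 * not scale with clock speed; dynamic power scales roughly linearly with the
 * clock-frequency scale s; the amount of work is fixed, so time scales as 1/s.
 */
#include <stdio.h>

static double energy(double s, double static_frac) {
    double run_time = 1.0 / s;                    /* normalized time: 1.0 at full speed  */
    double power    = static_frac                 /* leakage burns regardless of speed   */
                    + (1.0 - static_frac) * s;    /* dynamic part shrinks with the clock */
    return power * run_time;                      /* normalized energy (1.0 at s = 1)    */
}

int main(void) {
    const double leak = 0.5;                      /* "leakage is nearly 50% of power"    */
    for (double s = 0.5; s <= 1.001; s += 0.25)
        printf("clock scale %.2f -> relative energy %.2f\n", s, energy(s, leak));
    /* Prints ~1.50, ~1.17, ~1.00: with heavy leakage, finishing as fast as
     * possible (s = 1) minimizes energy -- the race-to-halt argument.       */
    return 0;
}
```

With leakage near half of total power, slowing the clock stretches out how long the leakage is paid for by more than it saves in dynamic energy, so finishing as quickly as possible minimizes Joules per job.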
Selecting Effective Machines for Science
• The goal is to maximize application performance
• That is hard to predict if the apps don't yet exist and architectures change fundamentally
(Table: science areas — accelerator modeling, astrophysics, chemistry, climate, combustion, fusion, QCD, materials — marked against the algorithm classes each requires: dense linear algebra, sparse linear algebra, spectral, particle, structured-grid, and unstructured-grid methods.)
(Chart: LINPACK percent of peak, 0-120%, by interconnect type — custom, Gigabit Ethernet, InfiniBand, Myrinet, ….)

Anticipating and Influencing Future Hardware Design

New Processor Designs are Needed to Save Energy
• Cell phone processor: 0.1 Watt, 4 Gflop/s; server processor: 100 Watts, 50 Gflop/s
• Server processors have been designed for performance, not energy
 – Graphics processors are 10-100x more efficient
 – Embedded processors are 100-1000x more efficient
 – We need manycore chips with thousands of cores

The Amdahl Case for Heterogeneity
• F is the fraction of time spent in parallel work; 1 - F is serial
• Consider a chip with area for 256 "thin" cores, and a single "fat" core built from some of that thin-core area; assume the fat/thin speedup is the square root of the area advantage
(Chart: asymmetric speedup, 0-250, vs. size of the fat core in thin-core units — from 1, i.e., 256 small cores, through 64, i.e., 193 cores: one fat core plus 192 thin ones, to 256, i.e., 1 fat core — for F = 0.5, 0.9, 0.975, 0.99, 0.999.)
Heterogeneity analysis by Mark Hill, U. Wisconsin; a sketch of the model follows.
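The curves on that slide follow an Amdahl-style asymmetric-multicore model in the spirit of Hill's analysis; the sketch below is my reconstruction of that standard model, not code from the talk. It assumes area for n = 256 thin cores, a fat core built from r thin-core units that runs sqrt(r) times faster than a thin core, serial work running only on the fat core, and parallel work running on the fat core plus the remaining n - r thin cores.

```c
/* Asymmetric-multicore speedup in the spirit of Hill & Marty's Amdahl model
 * (a reconstruction for illustration; parameters mirror the slide: area for
 * 256 thin cores, fat-core speedup = sqrt of the area it uses).
 */
#include <stdio.h>
#include <math.h>

/* r = thin-core units spent on one fat core; f = parallel fraction of time */
static double asym_speedup(double f, int n, int r) {
    double perf_fat = sqrt((double)r);            /* fat core: sqrt of area advantage    */
    double serial   = (1.0 - f) / perf_fat;       /* serial part runs on the fat core    */
    double parallel = f / (perf_fat + (n - r));   /* parallel part uses fat + thin cores */
    return 1.0 / (serial + parallel);
}

int main(void) {
    const int n = 256;
    const double fs[] = {0.5, 0.9, 0.975, 0.99, 0.999};
    const int    rs[] = {1, 4, 16, 64, 256};      /* x-axis points from the slide        */
    printf("%8s", "r =");
    for (int j = 0; j < 5; j++) printf("%8d", rs[j]);
    printf("\n");
    for (int i = 0; i < 5; i++) {
        printf("F=%.3f ", fs[i]);
        for (int j = 0; j < 5; j++) printf("%8.1f", asym_speedup(fs[i], n, rs[j]));
        printf("\n");
    }
    return 0;
}
```

Relative to 256 pure thin cores, spending some of the area on one fat core pays off for every F shown: it accelerates the serial fraction without giving up much parallel throughput, which is the Amdahl case for heterogeneity.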
Energy Efficiency of Applications
(Chart: performance and power efficiency across cache-based architectures — Gainestown, Barcelona, Victoria Falls — and local-store-based ones — Cell blade, GTX280, and GTX280 plus host. K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick; BDK11 book.)

What Heterogeneity Means to Me
• The case for heterogeneity
 – Many small cores are needed for energy efficiency and power density; they could have their own PC or use a wide SIMD unit
 – At least one fat core is needed for running the OS
• Local store, explicitly managed memory hierarchy
 – More efficient (get only what you need) and simpler to implement in hardware
• Co-processor interface between CPU and accelerator
 – Market: GPUs are separate chips for specific domains
 – Control: why are the minority CPUs in charge?
 – Communication: the bus is a significant bottleneck
 – Do we really have to do this? Isn't parallel programming hard enough?

Swim Lanes for Exascale
• Multicore (riding Moore's Law), GPUs/accelerators, and massive manycore with independent cores
• These may converge
• How to organize the lightweight cores and the (faster) heavyweight cores is key
• How much data parallelism vs. how many independent threads?

Co-Design for Exascale
(Diagram: co-design applications, all exascale applications, co-design of hardware & software, new exascale software, and exascale architectures.)

Green Flash Demo
• CSU atmospheric model ported to a low-power core design
 – Dual-core Tensilica processors running the atmospheric model at 25 MHz
 – MPI routines ported to the custom Tensilica interconnect
• Memory and processor stats available for performance analysis
• Emulation performance advantage: 250x speedup over a functional software simulator
• Actual code running, not a representative benchmark
John Shalf and Dave Donofrio, LBNL
(Figure: icosahedral mesh for algorithm scaling.)

RAMP: Enabling Manycore Architecture Research
• ISIS builds on the Berkeley RAMP project; RAMP Gold, shown here, models 64 cores of SPARC v8 with shared memory on a $750 board, has a hardware FPU and MMU, and boots an OS
• Chisel is a hardware description language based on Scala, a modern object-oriented/functional language that compiles to the JVM
(Diagram: a Chisel design description goes through the Chisel compiler to C++ code and a C++ compiler for a C++ simulator, to FPGA Verilog and FPGA tools for FPGA emulation, and to ASIC Verilog and ASIC tools for a GDS layout.)
• ISIS: rapid, accurate FPGA emulation of manycore chips
• Spans VLSI design and simulation and includes chip fabrication
 – Trains students in real design trade-offs and in power and area costs
• Mapping RTL to FPGAs for algorithm/software co-design
 – 100x faster than software simulators and more accurate
PIs: John Wawrzynek and Krste Asanovic, UC Berkeley

New Processors Mean New Software
• Exascale will have chips with thousands of tiny processor cores and a few large ones
• The architecture is an open question:
 – a sea of embedded cores with heavyweight "service" nodes, or
 – lightweight cores as accelerators to CPUs
• Low-power memory and storage technology are key
(Chart: system power split across interconnect, memory, and processors — roughly 130 MW with server processors, 75 MW with manycore, and 25 MW with manycore plus low-power memory and interconnect.)

Memory is Not Keeping Pace
• Technology trends work against constant or increasing memory per core
 – Memory density is doubling every three years; processor logic every two
 – Storage costs (dollars/Mbyte) are dropping only gradually compared to logic costs
• Question: can you double concurrency without doubling memory?
(Charts: cost of computation vs. memory; sources: David Turek, IBM, and IBM.)

The Future of Software Design and Programming Models
• Memory model
• Control model
• Resilience

What's Wrong with Flat MPI?
• We can run 1 MPI process per core
 – This works now (circa 2008) for quad-core on Franklin
• How long will it continue working?
 – 4-8 cores? Probably. 128-1024 cores? Probably not.
• What is the problem?
 – Latency: some copying is required by the semantics
 – Memory utilization: partitioning data into separate address spaces requires some replication
  • How big is your per-core subgrid? At 10x10x10, over half of the points are surface points, probably replicated
 – Memory bandwidth: extra state means extra bandwidth
 – Weak scaling will not save us; there is not enough memory per core
• This means a "new" model for most NERSC users

Autotuning: Write Code Generators Rather than Compilers
• Autotuners are code generators plus search algorithms that find the best code
• They avoid the compiler problems of dependence analysis and approximate performance models
• Functional portability comes from generating C; performance portability comes from search at install time (a sketch follows)
(Chart: performance of autotuned matrix multiply, e.g., the Atlas autotuner — code generator plus search — on machines such as the HP 712/80i.)
• The generator produces BLAS kernels — matrix-vector multiply, triangular solve, matrix multiply — specialized to the dimensions n, m, yielding a specialized BLAS library
• BLAS = Basic Linear Algebra Subroutines: matrix multiply, etc.
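To make "code generator plus search" concrete, here is a minimal, hypothetical sketch (my illustration, not ATLAS): a single matrix-multiply kernel is parameterized by a cache-blocking factor, each candidate is timed on the target machine, and the fastest is kept as the install-time choice.

```c
/* Minimal autotuning sketch (hypothetical, not ATLAS): generate candidate
 * kernels -- here one matrix multiply parameterized by a blocking factor --
 * time each on the target machine, and keep the fastest.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 256

static void matmul_blocked(const double *A, const double *B, double *C, int bs) {
    for (int i = 0; i < N * N; i++) C[i] = 0.0;
    for (int ii = 0; ii < N; ii += bs)              /* block loops chosen by the tuner */
        for (int kk = 0; kk < N; kk += bs)
            for (int jj = 0; jj < N; jj += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++)
                        for (int j = jj; j < jj + bs; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void) {
    double *A = malloc(N * N * sizeof *A), *B = malloc(N * N * sizeof *B);
    double *C = malloc(N * N * sizeof *C);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    int candidates[] = {8, 16, 32, 64, 128};        /* search space of block sizes */
    int best_bs = candidates[0];
    double best_time = 1e30;
    for (int c = 0; c < 5; c++) {
        clock_t t0 = clock();
        matmul_blocked(A, B, C, candidates[c]);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %3d: %.3f s\n", candidates[c], secs);
        if (secs < best_time) { best_time = secs; best_bs = candidates[c]; }
    }
    printf("install-time choice: block size %d\n", best_bs);
    free(A); free(B); free(C);
    return 0;
}
```

A production autotuner searches a much larger space (register blocking, loop orders, generated source variants), but the structure is the same: generate, run, measure, keep the winner.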
Resilience at Exascale
• More analysis is needed on which faults are most likely and what their impact is
 – Node/component failures, O(1 day): kills a job, hopefully not the system
 – System-wide outages, O(1 month): kills all jobs, O(hours) to restart
 – Weakest links: the network and the file system
• How much to virtualize?
 – Detection of errors visible on demand
 – Automatic recovery: maybe

Errors Can Turn into Performance Problems
• Fault resilience introduces inhomogeneity in execution rates (error correction is not instantaneous)
Slide source: John Shalf

Algorithms to Optimize for Communication

Where Does the Power Go?
(Chart: picojoules per operation, on a log scale from 1 to 10,000, for on-chip/CMP communication, intranode/SMP communication, and intranode/MPI communication, now and projected for 2018.)

Communication-Avoiding Algorithms
• Sparse iterative (Krylov subspace) methods
 – Nearest-neighbor communication on a mesh
 – Dominated by the time to read the matrix (edges) from DRAM
 – Plus (small) communication and global synchronization events at each step
• Can we lower data-movement costs?
 – Take k steps with one matrix read from DRAM and one communication phase (see the sketch after these results)
 – Serial: O(1) moves of data vs. O(k)
 – Parallel: O(log p) messages vs. O(k log p)
• Can we make communication provably optimal?
 – Communication both to DRAM and between cores
 – Minimize independent accesses ("latency") and minimize data volume ("bandwidth")
Joint work with Jim Demmel, Mark Hoemmen, and Marghoob Mohiyuddin

The Bigger Kernel (Akx) Runs at Faster Speed than the Simpler One (Ax)
(Chart: speedups on an 8-core Intel Clovertown. Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick.)
• The "monomial" basis [Ax, …, A^k x] fails to converge; a different polynomial basis does converge

Communication-Avoiding Krylov Method (GMRES)
(Chart: performance on an 8-core Clovertown.)
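A minimal sketch of the k-step "matrix powers" (Akx) idea behind these results, for a 1D three-point stencil — my illustration of the blocking concept under simplifying assumptions, not the CA-GMRES code: a process that fetches k layers of ghost values in a single communication phase can then apply the matrix k times with no further communication, trading a little redundant data for far fewer messages and matrix reads.

```c
/* Matrix-powers ("Akx") sketch on a 1D three-point stencil (illustrative only).
 * A process owning LOCAL rows that has fetched K ghost values on each side
 * can compute x, Ax, A^2 x, ..., A^K x with no further communication: each
 * application of A shrinks the valid region by one entry on each side, so
 * K ghost layers buy K steps -- one communication phase instead of K.
 */
#include <stdio.h>

#define K      3             /* steps per communication phase              */
#define LOCAL  8             /* rows owned by this (hypothetical) process  */
#define W      (LOCAL + 2*K) /* owned rows plus K ghosts on each side      */

int main(void) {
    double v[K + 1][W];
    /* v[0] holds x on the owned rows plus the K ghost values that would
     * arrive in the single communication phase (values made up here).     */
    for (int i = 0; i < W; i++) v[0][i] = (i % 2) ? 1.0 : -1.0;

    for (int j = 1; j <= K; j++) {
        /* After j steps, entries j..W-j-1 are still valid; the rest would
         * need neighbors' data and are simply not computed.               */
        for (int i = j; i < W - j; i++)
            v[j][i] = 2.0 * v[j-1][i] - v[j-1][i-1] - v[j-1][i+1];  /* A = 1D Laplacian */
    }

    /* All owned entries of A^K x (indices K .. K+LOCAL-1) are now available. */
    printf("A^%d x on owned rows:", K);
    for (int i = K; i < K + LOCAL; i++) printf(" %g", v[K][i]);
    printf("\n");
    return 0;
}
```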
Aren't Clouds the Solution?

Power Limits Computing Performance Growth
• Goal: increase performance without increasing power
 – This is the first time power is the problem; centralization makes it more obvious
• Approaches
 – Manycore, photonics, low-power memory, …
 – Algorithms that avoid data movement
• Risks
 – Building ineffective systems: not resilient, unprogrammable, unbalanced
• Program structure
 – Design to the workload: interdisciplinary teams for applications, hardware, and software
(Chart: processor performance vs. year of introduction, 1985-2020, showing a growing "expectation gap"; the processor industry is running at "maneuvering speed" - David Liddle.)

HPC Commercial Cloud Results
• Commercial HPC clouds catch up with clusters if set up as a shared cluster
 – High-speed network (10 GigE) and no over-subscription
 – Some slowdown remains from virtualization
(Chart: runtime relative to a supercomputer — up to 53x on a commercial cloud — for the commercial cloud, Magellan, and an optimized EC2 beta configuration.)
Keith Jackson, Lavanya Ramakrishna, John Shalf, Harvey Wasserman

TCP is Slower than IB Even at Modest Concurrency
(Chart: HPCC PingPong latency in microseconds, 0-250, vs. number of cores from 32 to 1024, for IB, TCP over IB, 10G TCP over Ethernet, Amazon Cluster Compute 10G TCP over Ethernet in a VM, and 1G TCP over Ethernet; lower is better, with roughly a 40x gap.)

Network Hardware and Protocol Matter (PARATEC)
(Chart: PARATEC performance vs. number of cores from 32 to 1024 for IB, TCP over IB, 10G TCP over Ethernet, and 1G TCP over Ethernet; higher is better, and TCP can't keep up.)

Virtualization Penalty is Substantial (PARATEC)
(Chart: PARATEC performance vs. number of cores from 32 to 1024 for IB, 10G TCP over Ethernet, and Amazon Cluster Compute 10G TCP over Ethernet in a VM; the virtualization overhead increases with core count.)

Public Clouds Compared to Private HPC Centers
• Compute systems (1.38B hours): $180,900,000
• HPSS (17 PB): $12,200,000
• File systems (2 PB): $2,500,000
• Total (annual cost): $195,600,000
• Over-estimate: these are "list" prices, but…
• Under-estimate: this doesn't include the measured performance slowdown of 2x-10x, and it still captures only about 65% of NERSC's $55M annual budget — no consulting staff, no administration, no support

Factors in Price: HPC Center vs. Public Cloud ("$" marks a cost disadvantage)
• Utilization (30% private cluster, 90% HPC center, 60%? cloud); note: this trades off against wait times and elasticity — $$
• Cost of people; the largest machines have the lowest people cost per core — $
• Cost of power; advantage from placement of the center and bulk purchasing — $$
• Energy efficiency (a PUE of 1.1-1.3 is possible; 1.8 is typical)
• Cost of specialized hardware (interconnect) — $
• Cost of commodity hardware — $
• Profit — $$$

Where is Moore's Law (Cores/$) in Commercial Clouds?
• The cost of a small instance at Amazon dropped 18% over 5 years
• Cores per socket increased 2x-5x at roughly constant cost
• NERSC's cost per core dropped by about 10x (20K to 200K cores over 2007-2011)
(Chart: increase in cores per dollar or per socket relative to 2006, 2006-2010, for Amazon small instances, Intel cores, and AMD cores.)

Challenges to Exascale Performance Growth
1) System power is the primary constraint
2) Concurrency (1000x today)
3) Memory bandwidth and capacity are not keeping pace
4) Processor architecture is open, but likely heterogeneous
5) Programming model: heroic compilers will not hide this
6) Algorithms need to minimize data movement, not flops
7) I/O bandwidth is unlikely to keep pace with machine speed
8) Reliability and resiliency will be critical at this scale
9) Bisection bandwidth is limited by cost and energy
Unlike the last 20 years, most of these (1-7) are equally important across scales, e.g., for 1000 one-petaflop machines.

Conclusions
• "Exascale" is about continuing growth in computing performance for science
 – Energy efficiency is key
 – Job size is irrelevant
• Success means:
 – Influencing the market: HPC, technical computing, clouds, general purpose
 – Getting more science from data and computing
• Failure means:
 – A few big machines for a few big applications
• Not all computing problems are exascale, but they should all be exascale-technology aware

Thank You!