Exascale Opportunities and Challenges: Why Do We Care?
Kathy Yelick
Associate Laboratory Director for Computing Sciences and NERSC Center Director, Lawrence Berkeley National Laboratory
EECS Professor, UC Berkeley

High Performance Computing in Science
• Science at Scale
• Science through Volume
• Science in Data

Science at Scale: Simulations Aid the Design of Energy-Efficient Devices
• Combustion simulations improve future designs
 – Model fluid flow, burning, and chemistry
 – Use advanced math algorithms
 – Require petascale systems today
• Simulations reveal features not visible in lab experiments
• Energy-efficient, low-emissions technology licensed by industry
• Exascale computing is needed to design for alternative fuels and new devices

Science at Scale: Impacts of Climate Change
• Warming ocean and the Antarctic ice sheet are key to sea level rise
 – Previous climate models were inadequate
• Adaptive Mesh Refinement (AMR) resolves the ice-ocean interface
 – BISICLES Pine Island Glacier simulation: mesh resolution is crucial for grounding-line behavior
 – Dynamics need very fine resolution (AMR), yet Antarctica is still very large (scalability)
• Ongoing collaboration to couple ice sheet and ocean models
 – 19M hours at NERSC
• Exascale machines are needed to improve detail in the models, including ice and clouds
(Figures: Antarctic ice speed, where AMR enables sub-1 km resolution, using NERSC's Hopper; enhanced POP ocean model solution for coupling to ice.)

Science in Data: From Simulation to Image Analysis
• LBNL computing on data was key in 4 of the 10 breakthroughs of the decade: 3 genomics problems plus the CMB
• Data rates from experimental devices will require exascale-volume computing
 – Cost of sequencing is falling faster than Moore's Law
 – Rate and density of CCDs are growing faster than Moore's Law
 – Computing demand grows faster than the data; O(n^2) algorithms are common
 – Computing performance is growing slower than Moore's Law
(Chart: projected rates, as increase over 2010, for sequencers, detectors, processors, and memory, 2010-2015.)

Science through Volume: Screening Drugs to Batteries
• Large numbers of simulations covering a variety of related materials, chemicals, proteins, …
• Dynameomics Database: improve understanding of disease and drug design, e.g., 11,000 protein unfolding simulations stored in a public database
• Materials Genome: cut in half the 18 years from design to manufacturing, e.g., 20,000 potential battery materials stored in a database
(Figure: today's batteries, the voltage limit, and interesting candidate materials.)

Science in Data: Image Analysis in Astronomy
• Data analysis in the 2006 Nobel Prize: measurement of temperature patterns; Smoot and Mather's 1992 COBE experiment showed the anisotropy of the Cosmic Microwave Background
• Simulations used in the 2011 Prize: Type Ia supernovae used as "standard candles" to measure distance
• More recently, astrophysicists discovered an early, nearby supernova
 – The Palomar Transient Factory runs machine learning algorithms on ~300 GB/night delivered by the ESnet "science network"
 – A rare glimpse of a supernova within hours of explosion, 20M light years away
 – Telescopes worldwide were redirected to catch images
(Images: the supernova field on 23, 24, and 25 August.)

HPC Has Moved Scientists through Difficult Technology Transitions
(Chart: application performance growth from Gordon Bell Prizes, 1.E+08 to 1.E+18 operations/sec, 1990-2020, with the "attack of the killer micros" transition marked.)

HPC: From Vector Supercomputers to Massively Parallel Systems
• Vector machines were programmed by "annotating" serial programs
• MPP and cluster systems are programmed by completely rethinking algorithms and software for parallelism
(Chart: TOP500 systems by architecture, 1993-2011 — SIMD, single processor, SMP, constellation, cluster, MPP — with industrial use between 25% and 50%.)

Scientists Need to Undertake Another Difficult Technology Transition
(Chart: application performance growth from Gordon Bell Prizes, 1990-2020, extrapolated toward a first exascale application — a billion-billion operations per second — as the "attack of the killer cellphones?": the rest of the computing world gets parallelism.)

The Exascale Challenge: Energy Efficiency

Energy Cost Challenge for Computing Facilities
• At ~$1M per MW, energy costs are substantial
 – 1 petaflop in 2010 used 3 MW
 – 1 exaflop in 2018 is possible in 200 MW with "usual" scaling
• Exascale design = energy-constrained design
(Chart: power under "usual" scaling vs. the goal, 2005-2020.)

PUE of Data Centers
• PUE = total facility power / computer power, i.e., the facility overhead on top of the computer itself
• But is this what we want to measure?
(Figure: current facility vs. new design.)
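To put these power and efficiency numbers together, here is a rough back-of-the-envelope sketch (my illustration, not from the slides: it assumes the ~$1M per MW figure is per megawatt-year, and it borrows the PUE values quoted later in the talk, 1.1-1.3 possible and 1.8 typical).

```c
/* Back-of-the-envelope facility cost sketch (illustrative, not from the talk).
 * PUE = total facility power / computer (IT) power, so the power you pay for
 * is IT power * PUE. The dollar figure assumes ~$1M per MW-year.
 */
#include <stdio.h>

static void facility_cost(const char *label, double it_power_mw, double pue) {
    double total_mw       = it_power_mw * pue;   /* facility power incl. cooling etc. */
    double dollars_per_yr = total_mw * 1.0e6;    /* assumed ~$1M per MW-year          */
    printf("%-24s IT %6.1f MW  PUE %.1f  facility %6.1f MW  ~$%.0fM/year\n",
           label, it_power_mw, pue, total_mw, dollars_per_yr / 1.0e6);
}

int main(void) {
    facility_cost("1 PF system (2010)",   3.0, 1.8);   /* "1.8 typical" PUE           */
    facility_cost("1 PF, efficient site", 3.0, 1.1);   /* "1.1-1.3 is possible"       */
    facility_cost("1 EF, usual scaling", 200.0, 1.1);  /* 200 MW exaflop -> ~$220M/yr */
    return 0;
}
```

Even at an aggressive PUE, a 200 MW exaflop system would cost on the order of $200M per year to power, which is why exascale design is energy-constrained design.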
How to Measure Efficiency?
• For scientific computing centers, the metric should be science output per Watt
 – NERSC in 2010 ran at 450 publications per MW-year
 – But that number drops with each new machine
• Next best: application performance per Watt
 – The newest, largest machine is best: lower energy and cost per core
 – Goes up with Moore's Law
• Race-to-halt generally minimizes energy use
(Chart: cost per core-hour, up to about $0.09, broken into center, sysadmin, and power & cooling costs for an old HPC system, a cluster, and a new HPC system.)

Power vs. Energy
• Two related (but different!) problems
 – Minimize peak power: keep machines from exceeding facility power and melting chips
 – Energy efficiency: minimize Joules per science publication
• Race-to-halt to minimize energy (sketched below)
 – Leakage current is nearly 50% of power
 – Finish as quickly as possible, maximizing simultaneous hardware usage
• Dynamic clock speed scaling
 – Under hardware control to implement power caps and thermal limits; software will probably adapt to this rather than control it
• Dark silicon
 – More transistors than you can afford to power; more likely to have specialized hardware
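The race-to-halt argument can be made concrete with a toy energy model (a hedged sketch, not from the talk: it assumes static/leakage power is about half of peak power and is paid for as long as the job runs, while dynamic power scales roughly linearly with clock speed because voltage scaling is largely exhausted; the numbers are illustrative only).

```c
/* Toy "race-to-halt" energy model (illustrative sketch, not from the talk).
 * Assumptions (hypothetical): static/leakage power is ~50% of peak and does
 * not scale with clock speed; dynamic power scales roughly linearly with the
 * clock-frequency scale s; the amount of work is fixed, so time scales as 1/s.
 */
#include <stdio.h>

static double energy(double s, double static_frac) {
    double run_time = 1.0 / s;                    /* normalized time: 1.0 at full speed  */
    double power    = static_frac                 /* leakage burns regardless of speed   */
                    + (1.0 - static_frac) * s;    /* dynamic part shrinks with the clock */
    return power * run_time;                      /* normalized energy (1.0 at s = 1)    */
}

int main(void) {
    const double leak = 0.5;                      /* "leakage is nearly 50% of power"    */
    for (double s = 0.5; s <= 1.001; s += 0.25)
        printf("clock scale %.2f -> relative energy %.2f\n", s, energy(s, leak));
    /* Prints ~1.50, ~1.17, ~1.00: with heavy leakage, finishing as fast as
     * possible (s = 1) minimizes energy -- the race-to-halt argument.       */
    return 0;
}
```

With leakage near half of total power, slowing the clock stretches out how long the leakage is paid for by more than it saves in dynamic energy, so finishing as quickly as possible minimizes Joules per job.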
Selecting Effective Machines for Science
• The goal is to maximize application performance
• That is hard to predict if the apps don't yet exist and architectures change fundamentally
(Table: science areas — accelerator modeling, astrophysics, chemistry, climate, combustion, fusion, QCD, materials — marked against the algorithm classes each requires: dense linear algebra, sparse linear algebra, spectral, particle, structured-grid, and unstructured-grid methods.)
(Chart: LINPACK percent of peak, 0-120%, by interconnect type — custom, Gigabit Ethernet, InfiniBand, Myrinet, ….)

Anticipating and Influencing Future Hardware Design

New Processor Designs are Needed to Save Energy
• Cell phone processor: 0.1 Watt, 4 Gflop/s; server processor: 100 Watts, 50 Gflop/s
• Server processors have been designed for performance, not energy
 – Graphics processors are 10-100x more efficient
 – Embedded processors are 100-1000x more efficient
 – We need manycore chips with thousands of cores

The Amdahl Case for Heterogeneity
• F is the fraction of time spent in parallel work; 1 - F is serial
• Consider a chip with area for 256 "thin" cores, and a single "fat" core built from some of that thin-core area; assume the fat/thin speedup is the square root of the area advantage
(Chart: asymmetric speedup, 0-250, vs. size of the fat core in thin-core units — from 1, i.e., 256 small cores, through 64, i.e., 193 cores: one fat core plus 192 thin ones, to 256, i.e., 1 fat core — for F = 0.5, 0.9, 0.975, 0.99, 0.999.)
Heterogeneity analysis by Mark Hill, U. Wisconsin; a sketch of the model follows.
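The curves on that slide follow an Amdahl-style asymmetric-multicore model in the spirit of Hill's analysis; the sketch below is my reconstruction of that standard model, not code from the talk. It assumes area for n = 256 thin cores, a fat core built from r thin-core units that runs sqrt(r) times faster than a thin core, serial work running only on the fat core, and parallel work running on the fat core plus the remaining n - r thin cores.

```c
/* Asymmetric-multicore speedup in the spirit of Hill & Marty's Amdahl model
 * (a reconstruction for illustration; parameters mirror the slide: area for
 * 256 thin cores, fat-core speedup = sqrt of the area it uses).
 */
#include <stdio.h>
#include <math.h>

/* r = thin-core units spent on one fat core; f = parallel fraction of time */
static double asym_speedup(double f, int n, int r) {
    double perf_fat = sqrt((double)r);            /* fat core: sqrt of area advantage    */
    double serial   = (1.0 - f) / perf_fat;       /* serial part runs on the fat core    */
    double parallel = f / (perf_fat + (n - r));   /* parallel part uses fat + thin cores */
    return 1.0 / (serial + parallel);
}

int main(void) {
    const int n = 256;
    const double fs[] = {0.5, 0.9, 0.975, 0.99, 0.999};
    const int    rs[] = {1, 4, 16, 64, 256};      /* x-axis points from the slide        */
    printf("%8s", "r =");
    for (int j = 0; j < 5; j++) printf("%8d", rs[j]);
    printf("\n");
    for (int i = 0; i < 5; i++) {
        printf("F=%.3f ", fs[i]);
        for (int j = 0; j < 5; j++) printf("%8.1f", asym_speedup(fs[i], n, rs[j]));
        printf("\n");
    }
    return 0;
}
```

Relative to 256 pure thin cores, spending some of the area on one fat core pays off for every F shown: it accelerates the serial fraction without giving up much parallel throughput, which is the Amdahl case for heterogeneity.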
Energy Efficiency of Applications
(Chart: performance and power efficiency across cache-based architectures — Gainestown, Barcelona, Victoria Falls — and local-store-based ones — Cell blade, GTX280, and GTX280 plus host. K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick; BDK11 book.)

What Heterogeneity Means to Me
• The case for heterogeneity
 – Many small cores are needed for energy efficiency and power density; they could have their own PC or use a wide SIMD unit
 – At least one fat core is needed for running the OS
• Local store, explicitly managed memory hierarchy
 – More efficient (get only what you need) and simpler to implement in hardware
• Co-processor interface between CPU and accelerator
 – Market: GPUs are separate chips for specific domains
 – Control: why are the minority CPUs in charge?
 – Communication: the bus is a significant bottleneck
 – Do we really have to do this? Isn't parallel programming hard enough?

Swim Lanes for Exascale
• Multicore (riding Moore's Law), GPUs/accelerators, and massive manycore with independent cores
• These may converge
• How to organize the lightweight cores and the (faster) heavyweight cores is key
• How much data parallelism vs. how many independent threads?

Co-Design for Exascale
(Diagram: co-design applications, all exascale applications, co-design of hardware & software, new exascale software, and exascale architectures.)

Green Flash Demo
• CSU atmospheric model ported to a low-power core design
 – Dual-core Tensilica processors running the atmospheric model at 25 MHz
 – MPI routines ported to the custom Tensilica interconnect
• Memory and processor stats available for performance analysis
• Emulation performance advantage: 250x speedup over a functional software simulator
• Actual code running, not a representative benchmark
John Shalf and Dave Donofrio, LBNL
(Figure: icosahedral mesh for algorithm scaling.)

RAMP: Enabling Manycore Architecture Research
• ISIS builds on the Berkeley RAMP project; RAMP Gold, shown here, models 64 cores of SPARC v8 with shared memory on a $750 board, has a hardware FPU and MMU, and boots an OS
• Chisel is a hardware description language based on Scala, a modern object-oriented/functional language that compiles to the JVM
(Diagram: a Chisel design description goes through the Chisel compiler to C++ code and a C++ compiler for a C++ simulator, to FPGA Verilog and FPGA tools for FPGA emulation, and to ASIC Verilog and ASIC tools for a GDS layout.)
• ISIS: rapid, accurate FPGA emulation of manycore chips
• Spans VLSI design and simulation and includes chip fabrication
 – Trains students in real design trade-offs and in power and area costs
• Mapping RTL to FPGAs for algorithm/software co-design
 – 100x faster than software simulators and more accurate
PIs: John Wawrzynek and Krste Asanovic, UC Berkeley

New Processors Mean New Software
• Exascale will have chips with thousands of tiny processor cores and a few large ones
• The architecture is an open question:
 – a sea of embedded cores with heavyweight "service" nodes, or
 – lightweight cores as accelerators to CPUs
• Low-power memory and storage technology are key
(Chart: system power split across interconnect, memory, and processors — roughly 130 MW with server processors, 75 MW with manycore, and 25 MW with manycore plus low-power memory and interconnect.)

Memory is Not Keeping Pace
• Technology trends work against constant or increasing memory per core
 – Memory density is doubling every three years; processor logic every two
 – Storage costs (dollars/Mbyte) are dropping only gradually compared to logic costs
• Question: can you double concurrency without doubling memory?
(Charts: cost of computation vs. memory; sources: David Turek, IBM, and IBM.)

The Future of Software Design and Programming Models
• Memory model
• Control model
• Resilience

What's Wrong with Flat MPI?
• We can run 1 MPI process per core
 – This works now (circa 2008) for quad-core on Franklin
• How long will it continue working?
 – 4-8 cores? Probably. 128-1024 cores? Probably not.
• What is the problem?
 – Latency: some copying is required by the semantics
 – Memory utilization: partitioning data into separate address spaces requires some replication
  • How big is your per-core subgrid? At 10x10x10, over half of the points are surface points, probably replicated
 – Memory bandwidth: extra state means extra bandwidth
 – Weak scaling will not save us; there is not enough memory per core
• This means a "new" model for most NERSC users

Autotuning: Write Code Generators Rather than Compilers
• Autotuners are code generators plus search algorithms that find the best code
• They avoid the compiler problems of dependence analysis and approximate performance models
• Functional portability comes from generating C; performance portability comes from search at install time (a sketch follows)
(Chart: performance of autotuned matrix multiply, e.g., the Atlas autotuner — code generator plus search — on machines such as the HP 712/80i.)
• The generator produces BLAS kernels — matrix-vector multiply, triangular solve, matrix multiply — specialized to the dimensions n, m, yielding a specialized BLAS library
• BLAS = Basic Linear Algebra Subroutines: matrix multiply, etc.
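To make "code generator plus search" concrete, here is a minimal, hypothetical sketch (my illustration, not ATLAS): a single matrix-multiply kernel is parameterized by a cache-blocking factor, each candidate is timed on the target machine, and the fastest is kept as the install-time choice.

```c
/* Minimal autotuning sketch (hypothetical, not ATLAS): generate candidate
 * kernels -- here one matrix multiply parameterized by a blocking factor --
 * time each on the target machine, and keep the fastest.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 256

static void matmul_blocked(const double *A, const double *B, double *C, int bs) {
    for (int i = 0; i < N * N; i++) C[i] = 0.0;
    for (int ii = 0; ii < N; ii += bs)              /* block loops chosen by the tuner */
        for (int kk = 0; kk < N; kk += bs)
            for (int jj = 0; jj < N; jj += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++)
                        for (int j = jj; j < jj + bs; j++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
}

int main(void) {
    double *A = malloc(N * N * sizeof *A), *B = malloc(N * N * sizeof *B);
    double *C = malloc(N * N * sizeof *C);
    for (int i = 0; i < N * N; i++) { A[i] = 1.0; B[i] = 2.0; }

    int candidates[] = {8, 16, 32, 64, 128};        /* search space of block sizes */
    int best_bs = candidates[0];
    double best_time = 1e30;
    for (int c = 0; c < 5; c++) {
        clock_t t0 = clock();
        matmul_blocked(A, B, C, candidates[c]);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("block size %3d: %.3f s\n", candidates[c], secs);
        if (secs < best_time) { best_time = secs; best_bs = candidates[c]; }
    }
    printf("install-time choice: block size %d\n", best_bs);
    free(A); free(B); free(C);
    return 0;
}
```

A production autotuner searches a much larger space (register blocking, loop orders, generated source variants), but the structure is the same: generate, run, measure, keep the winner.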
Resilience at Exascale
• More analysis is needed on which faults are most likely and what their impact is
 – Node/component failures, O(1 day): kills a job, hopefully not the system
 – System-wide outages, O(1 month): kills all jobs, O(hours) to restart
 – Weakest links: the network and the file system
• How much to virtualize?
 – Detection of errors visible on demand
 – Automatic recovery: maybe

Errors Can Turn into Performance Problems
• Fault resilience introduces inhomogeneity in execution rates (error correction is not instantaneous)
Slide source: John Shalf

Algorithms to Optimize for Communication

Where Does the Power Go?
(Chart: picojoules per operation, on a log scale from 1 to 10,000, for on-chip/CMP communication, intranode/SMP communication, and intranode/MPI communication, now and projected for 2018.)

Communication-Avoiding Algorithms
• Sparse iterative (Krylov subspace) methods
 – Nearest-neighbor communication on a mesh
 – Dominated by the time to read the matrix (edges) from DRAM
 – Plus (small) communication and global synchronization events at each step
• Can we lower data-movement costs?
 – Take k steps with one matrix read from DRAM and one communication phase (see the sketch after these results)
 – Serial: O(1) moves of data vs. O(k)
 – Parallel: O(log p) messages vs. O(k log p)
• Can we make communication provably optimal?
 – Communication both to DRAM and between cores
 – Minimize independent accesses ("latency") and minimize data volume ("bandwidth")
Joint work with Jim Demmel, Mark Hoemmen, and Marghoob Mohiyuddin

The Bigger Kernel (Akx) Runs at Faster Speed than the Simpler One (Ax)
(Chart: speedups on an 8-core Intel Clovertown. Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick.)
• The "monomial" basis [Ax, …, A^k x] fails to converge; a different polynomial basis does converge

Communication-Avoiding Krylov Method (GMRES)
(Chart: performance on an 8-core Clovertown.)
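A minimal sketch of the k-step "matrix powers" (Akx) idea behind these results, for a 1D three-point stencil — my illustration of the blocking concept under simplifying assumptions, not the CA-GMRES code: a process that fetches k layers of ghost values in a single communication phase can then apply the matrix k times with no further communication, trading a little redundant data for far fewer messages and matrix reads.

```c
/* Matrix-powers ("Akx") sketch on a 1D three-point stencil (illustrative only).
 * A process owning LOCAL rows that has fetched K ghost values on each side
 * can compute x, Ax, A^2 x, ..., A^K x with no further communication: each
 * application of A shrinks the valid region by one entry on each side, so
 * K ghost layers buy K steps -- one communication phase instead of K.
 */
#include <stdio.h>

#define K      3             /* steps per communication phase              */
#define LOCAL  8             /* rows owned by this (hypothetical) process  */
#define W      (LOCAL + 2*K) /* owned rows plus K ghosts on each side      */

int main(void) {
    double v[K + 1][W];
    /* v[0] holds x on the owned rows plus the K ghost values that would
     * arrive in the single communication phase (values made up here).     */
    for (int i = 0; i < W; i++) v[0][i] = (i % 2) ? 1.0 : -1.0;

    for (int j = 1; j <= K; j++) {
        /* After j steps, entries j..W-j-1 are still valid; the rest would
         * need neighbors' data and are simply not computed.               */
        for (int i = j; i < W - j; i++)
            v[j][i] = 2.0 * v[j-1][i] - v[j-1][i-1] - v[j-1][i+1];  /* A = 1D Laplacian */
    }

    /* All owned entries of A^K x (indices K .. K+LOCAL-1) are now available. */
    printf("A^%d x on owned rows:", K);
    for (int i = K; i < K + LOCAL; i++) printf(" %g", v[K][i]);
    printf("\n");
    return 0;
}
```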
Aren't Clouds the Solution?

Power Limits Computing Performance Growth
• Goal: increase performance without increasing power
 – This is the first time power is the problem; centralization makes it more obvious
• Approaches
 – Manycore, photonics, low-power memory, …
 – Algorithms that avoid data movement
• Risks
 – Building ineffective systems: not resilient, unprogrammable, unbalanced
• Program structure
 – Design to the workload: interdisciplinary teams for applications, hardware, and software
(Chart: processor performance vs. year of introduction, 1985-2020, showing a growing "expectation gap"; the processor industry is running at "maneuvering speed" - David Liddle.)

HPC Commercial Cloud Results
• Commercial HPC clouds catch up with clusters if set up as a shared cluster
 – High-speed network (10 GigE) and no over-subscription
 – Some slowdown remains from virtualization
(Chart: runtime relative to a supercomputer — up to 53x on a commercial cloud — for the commercial cloud, Magellan, and an optimized EC2 beta configuration.)
Keith Jackson, Lavanya Ramakrishna, John Shalf, Harvey Wasserman

TCP is Slower than IB Even at Modest Concurrency
(Chart: HPCC PingPong latency in microseconds, 0-250, vs. number of cores from 32 to 1024, for IB, TCP over IB, 10G TCP over Ethernet, Amazon Cluster Compute 10G TCP over Ethernet in a VM, and 1G TCP over Ethernet; lower is better, with roughly a 40x gap.)

Network Hardware and Protocol Matter (PARATEC)
(Chart: PARATEC performance vs. number of cores from 32 to 1024 for IB, TCP over IB, 10G TCP over Ethernet, and 1G TCP over Ethernet; higher is better, and TCP can't keep up.)

Virtualization Penalty is Substantial (PARATEC)
(Chart: PARATEC performance vs. number of cores from 32 to 1024 for IB, 10G TCP over Ethernet, and Amazon Cluster Compute 10G TCP over Ethernet in a VM; the virtualization overhead increases with core count.)

Public Clouds Compared to Private HPC Centers
• Compute systems (1.38B hours): $180,900,000
• HPSS (17 PB): $12,200,000
• File systems (2 PB): $2,500,000
• Total (annual cost): $195,600,000
• Over-estimate: these are "list" prices, but…
• Under-estimate: this doesn't include the measured performance slowdown of 2x-10x, and it still captures only about 65% of NERSC's $55M annual budget — no consulting staff, no administration, no support

Factors in Price: HPC Center vs. Public Cloud ("$" marks a cost disadvantage)
• Utilization (30% private cluster, 90% HPC center, 60%? cloud); note: this trades off against wait times and elasticity — $$
• Cost of people; the largest machines have the lowest people cost per core — $
• Cost of power; advantage from placement of the center and bulk purchasing — $$
• Energy efficiency (a PUE of 1.1-1.3 is possible; 1.8 is typical)
• Cost of specialized hardware (interconnect) — $
• Cost of commodity hardware — $
• Profit — $$$

Where is Moore's Law (Cores/$) in Commercial Clouds?
• The cost of a small instance at Amazon dropped 18% over 5 years
• Cores per socket increased 2x-5x at roughly constant cost
• NERSC's cost per core dropped by about 10x (20K to 200K cores over 2007-2011)
(Chart: increase in cores per dollar or per socket relative to 2006, 2006-2010, for Amazon small instances, Intel cores, and AMD cores.)

Challenges to Exascale Performance Growth
1) System power is the primary constraint
2) Concurrency (1000x today)
3) Memory bandwidth and capacity are not keeping pace
4) Processor architecture is open, but likely heterogeneous
5) Programming model: heroic compilers will not hide this
6) Algorithms need to minimize data movement, not flops
7) I/O bandwidth is unlikely to keep pace with machine speed
8) Reliability and resiliency will be critical at this scale
9) Bisection bandwidth is limited by cost and energy
Unlike the last 20 years, most of these (1-7) are equally important across scales, e.g., for 1000 one-petaflop machines.

Conclusions
• "Exascale" is about continuing growth in computing performance for science
 – Energy efficiency is key
 – Job size is irrelevant
• Success means:
 – Influencing the market: HPC, technical computing, clouds, general purpose
 – Getting more science from data and computing
• Failure means:
 – A few big machines for a few big applications
• Not all computing problems are exascale, but they should all be exascale-technology aware

Thank You!