Transcript - Enea

Where the World Stands on Supercomputing
Jack Dongarra
University of Tennessee
Oak Ridge National Laboratory
University of Manchester
3/14/2014
H. Meuer, H. Simon, E. Strohmaier, & J. Dongarra
- Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from LINPACK MPP; Ax=b, dense problem; TPP performance (rate as a function of problem size)
- Updated twice a year:
  - SC'xy in the States in November
  - Meeting in Germany in June
- All data available from www.top500.org
Performance Development of HPC Over the Last 20 Years
[Chart: TOP500 performance, 1993-2013, log scale from 100 Mflop/s to 100 Pflop/s.
 2013: SUM of all 500 systems 224 PFlop/s; N=1 at 33.9 PFlop/s; N=500 at 118 TFlop/s.
 1993: SUM 1.17 TFlop/s; N=1 at 59.7 GFlop/s; N=500 at 400 MFlop/s.
 Reference points: "My Laptop" (70 Gflop/s), "My iPad2 & iPhone 4s" (1.02 Gflop/s).
 A "6-8 years" lag is marked between the N=1 and N=500 curves.]
State of Supercomputing in 2014
• Pflops computing fully established with 31 systems.
• Three technology architecture possibilities, or swim lanes, are thriving:
  - Commodity (e.g. Intel)
  - Commodity + accelerator (e.g. GPUs)
  - Special purpose lightweight cores (e.g. IBM BG, ARM)
• Interest in supercomputing is now worldwide, and growing in many new markets
  (over 50% of Top500 computers are in industry).
• Exascale projects exist in many countries and regions.
November 2013: The TOP10

Rank | Site | Computer | Country | Cores | Rmax [Pflops] | % of Peak | Power [MW] | MFlops/Watt
1   | National University of Defense Technology | Tianhe-2, NUDT, Xeon 12C 2.2GHz + Intel Xeon Phi (57c) + Custom | China   | 3,120,000 | 33.9 | 62  | 17.8 | 1905
2   | DOE / OS, Oak Ridge Nat Lab               | Titan, Cray XK7 (16C) + Nvidia Kepler GPU (14c) + Custom        | USA     | 560,640   | 17.6 | 65  | 8.3  | 2120
3   | DOE / NNSA, Livermore Nat Lab             | Sequoia, BlueGene/Q (16c) + custom                              | USA     | 1,572,864 | 17.2 | 85  | 7.9  | 2063
4   | RIKEN Advanced Inst for Comp Sci          | K computer, Fujitsu SPARC64 VIIIfx (8c) + Custom                | Japan   | 705,024   | 10.5 | 93  | 12.7 | 827
5   | DOE / OS, Argonne Nat Lab                 | Mira, BlueGene/Q (16c) + Custom                                 | USA     | 786,432   | 8.16 | 85  | 3.95 | 2066
6   | Swiss CSCS                                | Piz Daint, Cray XC30, Xeon 8C + Nvidia Kepler (14c) + Custom    | Swiss   | 115,984   | 6.27 | 81  | 2.3  | 2726
7   | Texas Advanced Computing Center           | Stampede, Dell Intel (8c) + Intel Xeon Phi (61c) + IB           | USA     | 204,900   | 2.66 | 61  | 3.3  | 806
8   | Forschungszentrum Juelich (FZJ)           | JuQUEEN, BlueGene/Q, Power BQC 16C 1.6GHz + Custom              | Germany | 458,752   | 5.01 | 85  | 2.30 | 2178
9   | DOE / NNSA, Livermore Nat Lab             | Vulcan, BlueGene/Q, Power BQC 16C 1.6GHz + Custom               | USA     | 393,216   | 4.29 | 85  | 1.97 | 2177
10  | Leibniz Rechenzentrum                     | SuperMUC, Intel (8c) + IB                                       | Germany | 147,456   | 2.90 | 91* | 3.42 | 848
500 | Banking                                   | HP                                                              | USA     | 22,212    | .118 | 50  |      |
Accelerators (53 systems)
[Chart: number of TOP500 systems with accelerators, 2006-2013.
 November 2013 breakdown: Intel MIC (13), Clearspeed CSX600 (0), ATI GPU (2),
 IBM PowerXCell 8i (0), NVIDIA 2070 (4), NVIDIA 2050 (7), NVIDIA 2090 (11), NVIDIA K20 (16).]
By country: 19 US, 9 China, 6 Japan, 4 Russia, 2 France, 2 Germany, 2 India, 2 Brazil,
2 Switzerland, 1 Italy, 1 Poland, 1 Australia, 1 Saudi Arabia, 1 South Korea, 1 Spain, 1 UK
Top500: Performance Share of Accelerators
53 of the 500 systems provide 35% of the accumulated performance.
[Chart: fraction of total TOP500 performance contributed by accelerators, 2006-2013,
 rising from 0% toward roughly 35-40%.]

For the Top 500: Rank at which Half of Total Performance is Accumulated
[Chart: number of systems needed to reach half of the list's total performance, 1994-2012,
 together with the accumulated Pflop/s across the November 2013 ranking.]
The Top 16 computers have half of the computing power of the Top 500.
Commodity plus Accelerator Today

Commodity: Intel Xeon
  8 cores, 3 GHz, 8*4 ops/cycle
  96 Gflop/s (DP)

Accelerator (GPU): Nvidia K20X "Kepler"
  2688 "Cuda cores", .732 GHz, 2688*2/3 ops/cycle
  1.31 Tflop/s (DP)
  192 Cuda cores/SMX, 6 GB memory

Interconnect: PCI-e Gen2/3, 16 lane
  64 Gb/s (8 GB/s), 1 GW/s
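As a quick sanity check on the peak numbers quoted above, nominal double-precision peak is just execution units times operations per cycle times clock rate. A minimal Python sketch of that arithmetic, using only the figures from this slide (the 2/3 DP ops per CUDA core per cycle is the slide's own accounting for the K20X):

    # Nominal DP peak = units * ops_per_cycle * clock (GHz) -> Gflop/s
    def peak_gflops(units, ops_per_cycle, ghz):
        return units * ops_per_cycle * ghz

    cpu = peak_gflops(8, 4, 3.0)               # Xeon: 8 cores * 4 DP ops/cycle * 3 GHz
    gpu = peak_gflops(2688, 2.0 / 3.0, 0.732)  # K20X: 2688 CUDA cores * 2/3 DP ops/cycle * 0.732 GHz
    print(cpu, "Gflop/s CPU")                  # 96.0
    print(gpu / 1000, "Tflop/s GPU")           # ~1.31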
Countries Share
Absolute counts: US: 267, China: 63, Japan: 28, UK: 23, France: 22, Germany: 20

Top500 From Italy

Linpack Efficiency
[Chart, repeated over four slides: Linpack efficiency (Rmax/Rpeak, 0-100%) plotted against
 TOP500 rank 1-500.]
#1 System on the Top500 Over the Past 21 Years
(16 machines in that club)

Top500 List        Computer                                    r_max (Tflop/s)  n_max       Hours  MW
6/93 (1)           TMC CM-5/1024                               .060             52,224      0.4
11/93 (1)          Fujitsu Numerical Wind Tunnel               .124             31,920      0.1    1.
6/94 (1)           Intel XP/S140                               .143             55,700      0.2
11/94 - 11/95 (3)  Fujitsu Numerical Wind Tunnel               .170             42,000      0.1    1.
6/96 (1)           Hitachi SR2201/1024                         .220             138,240     2.2
11/96 (1)          Hitachi CP-PACS/2048                        .368             103,680     0.6
6/97 - 6/00 (7)    Intel ASCI Red                              2.38             362,880     3.7    .85
11/00 - 11/01 (3)  IBM ASCI White, SP Power3 375 MHz           7.23             518,096     3.6
6/02 - 6/04 (5)    NEC Earth-Simulator                         35.9             1,000,000   5.2    6.4
11/04 - 11/07 (7)  IBM BlueGene/L                              478.             1,000,000   0.4    1.4
6/08 - 6/09 (3)    IBM Roadrunner - PowerXCell 8i 3.2 GHz      1,105.           2,329,599   2.1    2.3
11/09 - 6/10 (2)   Cray Jaguar - XT5-HE 2.6 GHz                1,759.           5,474,272   17.3   6.9
11/10 (1)          NUDT Tianhe-1A, X5670 2.93 GHz NVIDIA       2,566.           3,600,000   3.4    4.0
6/11 - 11/11 (2)   Fujitsu K computer, SPARC64 VIIIfx          10,510.          11,870,208  29.5   9.9
6/12 (1)           IBM Sequoia BlueGene/Q                      16,324.          12,681,215  23.1   7.9
11/12 (1)          Cray XK7 Titan, AMD + NVIDIA Kepler         17,590.          4,423,680   0.9    8.2
6/13 - 11/13 (2)   NUDT Tianhe-2, Intel IvyBridge & Xeon Phi   33,862.          9,960,000   5.4    17.8
Performance Development in Top500
[Chart: N=1 and N=500 performance trends extrapolated from 1994 to 2020, log scale from
 100 Mflop/s to 1 Eflop/s; the trend line reaches 1 Eflop/s around the end of the decade.]

Today's #1 System

Systems                   2014 (Tianhe-2)                       2020-2022                Difference Today & Exa
System peak               55 Pflop/s                            1 Eflop/s                ~20x
Power                     18 MW (3 Gflops/W)                    ~20 MW (50 Gflops/W)     O(1) (~15x)
System memory             1.4 PB (1.024 PB CPU + .384 PB CoP)   32 - 64 PB               ~50x
Node performance          3.43 TF/s (.4 CPU + 3 CoP)            1.2 or 15 TF/s           O(1)
Node concurrency          24 cores CPU + 171 cores CoP          O(1k) or 10k             ~5x - ~50x
Node interconnect BW      6.36 GB/s                             200-400 GB/s             ~40x
System size (nodes)       16,000                                O(100,000) or O(1M)      ~6x - ~60x
Total concurrency         3.12 M (12.48M threads, 4/core)       O(billion)               ~100x
MTTF                      Few / day                             O(<1 day)                O(?)
Exascale System Architecture with a cap of $200M and 20 MW
(Two slides repeating the table above; the second labels the target column 2022 and revises
the MTTF entry from O(<1 day) to "Many / day".)
DOE Exascale Computing Initiative: Proposed Timeline
[Timeline chart, FY2012-FY2024, with overlapping tracks for:
 - Research & Development, Platform Acquisitions, Application Development
 - Future Computer Systems: Pathway Towards Exascale
 - Science, Engineering and Defense Applications
 - Exascale Co-Design: driving the design of exascale HW and SW
 - Fast Forward and Design Forward vendor programs
 - System Design Phase, Prototype Build Phase, Path Forward Phase
 - Software Technology: Programming Environment, Resiliency, OS & Runtimes
 - Extreme Scale Research Programs (SC/ASCR & NNSA/ASC): Fundamental Technology
 Platform milestones: P0; P1 with a Node Prototype around 2018; P2 with a Petascale
 Prototype around 2019-2020; an Exascale Prototype around 2022-2023.]
EU Funded: CRESTA, DEEP, & Mont-Blanc
♦ The CRESTA, DEEP and Mont-Blanc projects have a combined funding of 25 M Euros.
♦ Each will study different aspects of the exascale challenge using a co-design model
  spanning hardware, systemware and software applications.
♦ This funding represents the first in a sustained investment in exascale research by Europe.
♦ CRESTA: focuses on software and co-design, not hardware.
♦ DEEP: Computer + Booster Nodes.
♦ Mont-Blanc: lightweight, energy-efficient, ARM-based processors.

Major Changes to Software & Algorithms
• Must rethink the design of our algorithms and software
  - Another disruptive technology
    - Similar to what happened with cluster computing and message passing
  - Rethink and rewrite the applications, algorithms, and software
  - Data movement is expensive
  - Flop/s are cheap, so are provisioned in excess

Summary
• Major challenges are ahead for extreme computing:
  - Parallelism O(10^9)
    - Programming issues
  - Hybrid architectures
    - Peak and HPL may be very misleading
    - Nowhere near close to peak for most apps
  - Fault tolerance
    - Today the Sequoia BG/Q node failure rate is 1.25 failures/day
  - Power
    - 50 Gflops/W needed (today at 2 Gflops/W)
• We will need completely new approaches and technologies to reach the exascale level.

Evolution Over the Last 30 Years
♦ Initially, commodity PCs were decentralized systems.
♦ As the chip manufacturing process shrank to less than a micron, they started to integrate
  features on-die:
  - 1989: FPU (Intel 80486DX)
  - 1999: SRAM (Intel Pentium III)
  - 2009: GPU (AMD Fusion)
  - 2016: DRAM on chip (3D stacking)
Future Systems May Be Composed of Different Kinds of Cores
[Diagram: conventional DRAM chips behind a memory controller (address/data bus) versus
 3D-stacked DRAM on the memory controller, giving lower latency and higher bandwidth.]

Future Chip Design

Critical Issues at Peta & Exascale for Algorithm and Software Design
• Synchronization-reducing algorithms
  - Break the fork-join model
• Communication-reducing algorithms
  - Use methods which have a lower bound on communication
• Mixed precision methods
  - 2x speed of ops and 2x speed for data movement
• Autotuning
  - Today's machines are too complicated; build "smarts" into software to adapt to the hardware
• Fault resilient algorithms
  - Implement algorithms that can recover from failures/bit flips
• Reproducibility of results
  - Today we can't guarantee this. We understand the issues, but some of our "colleagues"
    have a hard time with this.
HPL - Good Things
♦ Easy to run
♦ Easy to understand
♦ Easy to check results
♦ Stresses certain parts of the system
♦ Historical database of performance information
♦ Good community outreach tool
♦ "Understandable" to the outside world
♦ "If your computer doesn't perform well on the LINPACK Benchmark, you will probably be
  disappointed with the performance of your application on the computer."

HPL - Bad Things
♦ The LINPACK Benchmark is 36 years old
  - TOP500 (HPL) is 20.5 years old
♦ Floating point-intensive: performs O(n^3) floating point operations and moves O(n^2) data
♦ No longer so strongly correlated to real apps
♦ Reports peak flops (although hybrid systems see only 1/2 to 2/3 of peak)
♦ Encourages poor choices in architectural features
♦ Overall usability of a system is not measured
♦ Used as a marketing tool
♦ Decisions on acquisition made on one number
♦ Benchmarking for days wastes a valuable resource

Goals for New Benchmark
♦ Augment the TOP500 listing with a benchmark that correlates with important scientific and
  technical apps not well represented by HPL.
♦ Encourage vendors to focus on architecture features needed for high performance on those
  important scientific and technical apps:
  - Stress a balance of floating point and communication bandwidth and latency
  - Reward investment in high performance collective ops
  - Reward investment in high performance point-to-point messages of various sizes
  - Reward investment in local memory system performance
  - Reward investment in parallel runtimes that facilitate intra-node parallelism
♦ Provide an outreach/communication tool
  - Easy to understand
  - Easy to optimize
  - Easy to implement, run, and check results
♦ Provide a historical database of performance information
  - The new benchmark should have longevity
Proposal: HPCG
♦ High Performance Conjugate Gradient (HPCG).
♦ Solves Ax=b: A large and sparse, b known, x computed.
♦ An optimized implementation of PCG contains essential computational and communication
  patterns that are prevalent in a variety of methods for discretization and numerical
  solution of PDEs.
♦ Patterns:
  - Dense and sparse computations
  - Dense and sparse collectives
  - Data-driven parallelism (unstructured sparse triangular solves)
♦ Strong verification and validation properties (via spectral properties of CG).
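To make the Ax=b, sparse, PCG pattern concrete, here is a minimal preconditioned conjugate gradient loop in Python/NumPy. The diagonal (Jacobi) preconditioner and the tridiagonal test matrix are illustrative assumptions only; HPCG itself uses a symmetric Gauss-Seidel preconditioner on a specific 27-point stencil problem, so this is a sketch of the pattern, not the benchmark code.

    import numpy as np
    import scipy.sparse as sp

    def pcg(A, b, M_inv, tol=1e-8, maxit=500):
        """Preconditioned CG for a symmetric positive definite A; M_inv applies the preconditioner."""
        x = np.zeros_like(b)
        r = b - A @ x
        z = M_inv(r)
        p = z.copy()
        rz = r @ z
        for _ in range(maxit):
            Ap = A @ p                      # sparse matrix-vector product
            alpha = rz / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            if np.linalg.norm(r) < tol * np.linalg.norm(b):
                break
            z = M_inv(r)                    # preconditioner application
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return x

    # Diagonally dominant tridiagonal matrix as a stand-in sparse SPD test problem.
    n = 1000
    A = sp.diags([-1.0, 3.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)
    d = A.diagonal()
    x = pcg(A, b, lambda r: r / d)          # Jacobi preconditioner
    print(np.linalg.norm(A @ x - b))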
Collaborators / Software / Support
• PLASMA   http://icl.cs.utk.edu/plasma/
• MAGMA    http://icl.cs.utk.edu/magma/
• Quark (RT for shared memory)   http://icl.cs.utk.edu/quark/
• PaRSEC (Parallel Runtime Scheduling and Execution Control)   http://icl.cs.utk.edu/parsec/
• Collaborating partners:
  University of Tennessee, Knoxville
  University of California, Berkeley
  University of Colorado, Denver

Big Data
"Big data is like teenage sex:
 everyone talks about it,
 nobody really knows how to do it,
 everyone thinks everyone else is doing it,
 so everyone claims they are doing it..."
   - Dan Ariely

Conclusions
♦ For the last decade or more, the research investment strategy has been overwhelmingly
  biased in favor of hardware.
♦ This strategy needs to be rebalanced: barriers to progress are increasingly on the
  software side.
• The high performance ecosystem is out of balance:
  - Hardware, OS, compilers, software, algorithms, applications
  - No Moore's Law for software, algorithms and applications

Broad Community Support and Development of the Exascale Initiative Since 2007
http://science.energy.gov/ascr/news-and-resources/program-documents/
♦ Town Hall Meetings, April-June 2007
♦ Scientific Grand Challenges Workshops, Nov. 2008 - Oct. 2009
  - Climate Science (11/08)
  - High Energy Physics (12/08)
  - Nuclear Physics (1/09)
  - Fusion Energy (3/09)
  - Nuclear Energy (5/09)
  - Biology (8/09)
  - Material Science and Chemistry (8/09)
  - National Security (10/09)
  - Cross-cutting technologies (2/10)
♦ Exascale Steering Committee
  - "Denver" vendor NDA visits (8/09)
  - SC09 vendor feedback meetings
  - Extreme Architecture and Technology Workshop (12/09)
♦ International Exascale Software Project
  - Santa Fe, NM (4/09); Paris, France (6/09); Tsukuba, Japan (10/09); Oxford (4/10);
    Maui (10/10); San Francisco (4/11); Cologne (10/11); Kobe (4/12)
(Slide themes: Mission Imperatives, Fundamental Science)

Future Systems May Be Composed of Different Kinds of Cores
[Diagram repeated from earlier: conventional DRAM chips versus 3D-stacked DRAM, lower
 latency and higher bandwidth.]
Parallelization of QR Factorization
Parallelize the update (dgemm / dlarfb):
• Easy, and done in any reasonable software.
• This is the 2/3 n^3 term in the FLOP count.
• Can be done "efficiently" with LAPACK + multithreaded BLAS.
[Diagram: panel factorization (dgeqf2 + dlarft) of A(1) producing V and R, followed by the
 update of the remaining submatrix (dlarfb) giving A(2).]
Fork-join parallelism; bulk synchronous processing.

Synchronization (in LAPACK LU)
[Diagram: Step 1, Step 2, Step 3, Step 4, ... each step a fork-join, bulk synchronous
 processing phase.]
Goal: allow for delayed update, out of order, asynchronous, dataflow execution.
Data Layout is Critical
• Tile data layout, where each data tile is contiguous in memory.
• The computation is decomposed into several fine-grained tasks, which better fit the
  memory of the small core caches.
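A minimal sketch of the idea: copy an n x n matrix from the usual contiguous layout into nb x nb tiles, each stored contiguously so that a task touches one cache-friendly block. PLASMA's own layout-translation routines are more general; the function name and the tile size nb here are illustrative assumptions only.

    import numpy as np

    def to_tile_layout(A, nb):
        """Return a (p, q) grid of nb-by-nb tiles, each contiguous in memory."""
        n, m = A.shape
        p, q = n // nb, m // nb
        tiles = np.empty((p, q, nb, nb), order="C")
        for i in range(p):
            for j in range(q):
                tiles[i, j] = A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
        return tiles

    A = np.arange(16.0).reshape(4, 4)
    T = to_tile_layout(A, 2)
    print(T[1, 0])          # bottom-left 2x2 tile, stored contiguously inside T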
PLASMA: Parallel Linear Algebra Software for Multicore Architectures
• Objectives
  - High utilization of each core
  - Scaling to large numbers of cores
  - Shared or distributed memory
• Methodology
  - Dynamic DAG scheduling (QUARK)
  - Explicit parallelism
  - Implicit communication
  - Fine granularity / block data layout
• Arbitrary DAG with dynamic scheduling
[Diagram: Cholesky on a 4x4 tile matrix; fork-join parallelism versus DAG-scheduled
 parallelism over time.]

Synchronization Reducing Algorithms
[Execution trace of a tile QR factorization: regular trace, factorization steps pipelined,
 stalling only due to natural load imbalance; dynamic, out-of-order execution; fine grain
 tasks; independent block operations. The colored area over the rectangle is the efficiency.
 Tile QR factorization; matrix size 4000x4000, tile size 200;
 8-socket, 6-core (48 cores total) AMD Istanbul 2.8 GHz.]

PowerPack 2.0
The PowerPack platform consists of software and hardware instrumentation.
Kirk Cameron, Virginia Tech; http://scape.cs.vt.edu/software/powerpack-2-0/

Power for QR Factorization
[Power traces comparing four implementations:
 - LAPACK's QR factorization (fork-join based)
 - MKL's QR factorization (fork-join based)
 - PLASMA's conventional QR factorization (DAG based)
 - PLASMA's communication reducing QR factorization (DAG based)
 Dual-socket quad-core Intel Xeon E5462 (Harpertown) @ 2.80 GHz (8 cores total) with MKL BLAS;
 the matrix is very tall and skinny (m x n = 1,152,000 by 288).]

Performance: Least Squares
[Chart captioned "Performance of the LU factorization (flop/s)".]

Performance: Singular Values
[Chart captioned "Performance of the LU factorization (flop/s)".]

Performance: Eigenvalues
[Chart.]

Experiments on Large Core Machines

Pipelining: Cholesky Inversion
3 steps: factor, invert L, multiply L's (POTRF, TRTRI and LAUUM).
48 cores; the matrix is 4000 x 4000, tile size is 200 x 200.
POTRF+TRTRI+LAUUM: 25 (7t-3); Cholesky factorization alone: 3t-2; pipelined: 18 (3t+6).
Toward Fast Eigensolver
Flop/s formula: (n^3/3) / time; higher is faster.
Keeneland system, using one node: 3 NVIDIA GPUs (M2090 @ 1.1 GHz, 5.4 GB) and
2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB).

Standard approach - characteristics:
• Too many BLAS-2 ops
• Relies on panel factorization
• Bulk sync phases
• Memory bound algorithm

With GPU acceleration - characteristics:
• BLAS-2 GEMV moved to the GPU
• Accelerate the algorithm by doing all BLAS-3 on the GPU
• Still bulk sync phases
• Still a memory bound algorithm

Two-Stage Approach to Tridiagonal Form (Communication Reducing)
• Reduction to band
  - On multicore + GPUs
  - Performance as in the one-sided factorizations [derived from fast Level 3 BLAS]
• Band to tridiagonal
  - Leads to "irregular" (bulge chasing) computation
  - Done very efficiently on multicore
  - GPUs are used to assemble the orthogonal Q from the transformations
    [needed to find the eigenvectors]

Two-stage approach - characteristics:
• Stage 1: BLAS-3, increasing computational intensity
• Stage 2: BLAS-1.5, new cache friendly kernel
• 4X/12X faster than the standard approach
• Bottleneck: if all eigenvectors are required, there is one extra back transformation cost

A. Haidar, S. Tomov, J. Dongarra, T. Schulthess, and R. Solca, A novel hybrid CPU-GPU
generalized eigensolver for electronic structure calculations based on fine grained memory
aware tasks, ICL Technical report, 03/2012.
Communication Avoiding Algorithms
• Goal: algorithms that communicate as little as possible.
• Jim Demmel and company have been working on algorithms that obtain a provable minimum of
  communication. (M. Anderson, yesterday)
• Direct methods (BLAS, LU, QR, SVD, other decompositions)
  - Communication lower bounds for all these problems
  - Algorithms that attain them (all dense linear algebra, some sparse)
• Iterative methods: Krylov subspace methods for Ax=b, Ax=λx
  - Communication lower bounds, and algorithms that attain them (depending on sparsity
    structure)
• For QR factorization they can show:

Communication Reducing QR Factorization
[Chart. Quad-socket, quad-core Intel Xeon EMT64 E7340 at 2.39 GHz; theoretical peak is
 153.2 Gflop/s with 16 cores; matrix size 51200 by 3200.]

Mixed Precision Methods
• Mixed precision: use the lowest precision required to achieve a given accuracy outcome.
  - Improves runtime, reduces power consumption, lowers data movement.
  - Reformulate to find a correction to the solution, rather than the solution itself;
    Δx rather than x.

Idea Goes Something Like This…
• Exploit 32 bit floating point as much as possible
  - Especially for the bulk of the computation
• Correct or update the solution with selective use of 64 bit floating point to provide a
  refined result
• Intuitively:
  - Compute a 32 bit result,
  - Calculate a correction to the 32 bit result using selected higher precision, and
  - Perform the update of the 32 bit result with the correction using high precision.
Mixed-Precision Iterative Refinement
• Iterative refinement for dense systems, Ax = b, can work this way:

    L U = lu(A)                    SINGLE   O(n^3)
    x = L\(U\b)                    SINGLE   O(n^2)
    r = b - Ax                     DOUBLE   O(n^2)
    WHILE || r || not small enough
        z = L\(U\r)                SINGLE   O(n^2)
        x = x + z                  DOUBLE   O(n^1)
        r = b - Ax                 DOUBLE   O(n^2)
    END

  - Wilkinson, Moler, Stewart, & Higham provide an error bound for single precision
    floating point results when using double precision floating point.
  - It can be shown that using this approach we can compute the solution to 64-bit floating
    point precision.
    • Requires extra storage; the total is 1.5 times normal
    • O(n^3) work is done in lower precision
    • O(n^2) work is done in high precision
    • Problems arise if the matrix is ill-conditioned in single precision, O(10^8)
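A minimal NumPy/SciPy sketch of the loop above: the LU factorization and the triangular solves run in float32, while the residual and the update are computed in float64. SciPy's lu_factor/lu_solve stand in for the L\(U\b) notation on the slide; a production code would keep the single-precision factors on the accelerator, and the well-conditioned test matrix is an assumption for the demo.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def mixed_precision_solve(A, b, tol=1e-12, maxit=30):
        A32 = A.astype(np.float32)
        lu, piv = lu_factor(A32)                       # O(n^3) work in single precision
        x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
        for _ in range(maxit):
            r = b - A @ x                              # residual in double precision, O(n^2)
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            z = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
            x += z                                     # update in double precision
        return x

    n = 1000
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
    b = rng.standard_normal(n)
    x = mixed_precision_solve(A, b)
    print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))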
Ax = b on FERMI
[Chart: Gflop/s versus matrix size (960 to 13120) for single precision and double precision
 solves. Tesla C2050: 448 CUDA cores @ 1.15 GHz; SP/DP peak is 1030 / 515 GFlop/s.]

Ax = b on FERMI
[Chart: Gflop/s versus matrix size (960 to 13120) for single precision, mixed precision,
 and double precision solves. Tesla C2050: 448 CUDA cores @ 1.15 GHz; SP/DP peak is
 1030 / 515 GFlop/s.]
Similar results for Cholesky & QR factorizations.
Reproducibility
• For example, Σ x_i computed in parallel can't guarantee the order of operations.
• Lack of reproducibility is due to floating point non-associativity and algorithmic
  adaptivity (including autotuning) in efficient production mode.
• Bit-level reproducibility may be unnecessarily expensive most of the time.
• Force routine adoption of uncertainty quantification:
  - Given the many unresolvable uncertainties in program inputs, bound the error in the
    outputs in terms of errors in the inputs.
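One way to see the point: the same numbers summed in a different order give a slightly different floating point result, and compensated summation is one hedge against the size of that discrepancy. A small sketch, not a prescription; the data and the two "schedules" (forward and reversed order) are assumptions for the demo.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.standard_normal(10**6) * 1e8

    def naive_sum(a):
        s = 0.0
        for v in a:
            s += v
        return s

    def kahan_sum(a):
        # Compensated summation: carries the rounding error forward explicitly.
        s, c = 0.0, 0.0
        for v in a:
            y = v - c
            t = s + y
            c = (t - s) - y
            s = t
        return s

    # Same data, two summation orders: the naive sums typically differ in the last bits,
    # while the compensated sums generally agree to much higher accuracy.
    print(naive_sum(x) - naive_sum(x[::-1]))
    print(kahan_sum(x) - kahan_sum(x[::-1]))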
Conclusions
• For the last decade or more, the research investment strategy has been overwhelmingly
  biased in favor of hardware.
• This strategy needs to be rebalanced: barriers to progress are increasingly on the
  software side.
• The high performance ecosystem is out of balance:
  - Hardware, OS, compilers, software, algorithms, applications
  - No Moore's Law for software, algorithms and applications

Clusters with GPUs (Cholesky)
Use 12 cores and 3 GPUs per node; input size = 34560*sqrt(NumberNodes).
[Charts: overall Tflop/s and Tflop/s per node versus number of nodes (1 to 100), comparing
 the distributed-GPU code against the DGEMM upper bound and mkl_scalapack 10.3.]
On the Keeneland system: 100 nodes; each node has two 6-core Intel Westmere CPUs and three
Nvidia Fermi GPUs. Software used: Intel MKL 10.3.5, CUDA 4.0, OpenMPI 1.5.1, PLASMA 2.4.1.

Sparse Direct Solver and Iterative Refinement
MUMPS package, based on a multifrontal approach which generates small dense matrix
multiplies. Opteron with the Intel compiler.
[Chart: speedup over double precision (up to about 2x) for single precision and for single
 precision plus iterative refinement, across matrices from Tim Davis's collection,
 n = 100K - 3M.]
Sparse Iterative Methods (PCG)
• Outer/inner iteration scheme: outer iterations in 64 bit floating point, inner iterations
  in 32 bit floating point.

Mixed Precision Computations for Sparse Inner/Outer-type Iterative Solvers
[Charts: speedups of mixed precision inner-SP/outer-DP iterative methods versus full DP
 (CG2, GMRES2, PCG2, and PGMRES2 with diagonal preconditioning; higher is better), and the
 corresponding iteration counts (lower is better), for matrix sizes 6,021; 18,000; 39,000;
 120,000; 240,000 and varying condition numbers.
 Machine: Intel Woodcrest (3 GHz, 1333 MHz bus).
 Stopping criterion: residual reduction relative to r0 of 10^-12.]

Standard QR Block Reduction
• We have an m x n matrix A that we want to reduce to upper triangular form.
• Block Householder transformations Q1^T, Q2^T, Q3^T are applied panel by panel, yielding R.
• A = Q1 Q2 Q3 R = QR
Communication Avoiding QR Example
[Animation over several slides: the matrix is split into four domains, each of which
 performs an independent Domain_Tile_QR; the partial factorizations are then combined.]
A. Pothen and P. Raghavan. Distributed orthogonal factorization. In The 3rd Conference on
Hypercube Concurrent Computers and Applications, volume II, Applications, pages 1610-1620,
Pasadena, CA, Jan. 1988. ACM. Penn State.
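A minimal sketch of the communication-avoiding idea in the figure: each domain factors its own block of rows independently, then the small R factors are combined pairwise up a reduction tree, so the number of combine steps grows with the log of the number of domains. This toy reconstructs only the R factor and keeps no Q, unlike a real TSQR/CAQR code; the function name and block counts are assumptions for the demo.

    import numpy as np

    def tsqr_R(A, ndomains):
        """R factor of a tall-skinny A via local QRs plus a pairwise reduction tree."""
        blocks = np.array_split(A, ndomains, axis=0)
        Rs = [np.linalg.qr(b, mode="r") for b in blocks]      # independent domain QRs
        while len(Rs) > 1:                                    # combine two R factors at a time
            nxt = []
            for i in range(0, len(Rs) - 1, 2):
                nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]), mode="r"))
            if len(Rs) % 2:                                   # odd one out passes through
                nxt.append(Rs[-1])
            Rs = nxt
        return Rs[0]

    A = np.random.default_rng(0).standard_normal((4096, 32))
    R_tree = tsqr_R(A, 8)
    R_ref = np.linalg.qr(A, mode="r")
    # R is unique up to the signs of its rows; compare magnitudes.
    print(np.allclose(np.abs(R_tree), np.abs(R_ref), atol=1e-8))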
PLASMA / DPLASMA
[Table: PLASMA uses the QUARK runtime and DPLASMA uses DAGuE; each factorization is built
 from four tile kernels:
 Cholesky: POTRF, SYRK, GEMM, TRSM
 LU:       GETRF, GESSM, TSTRF, SSSSM
 QR:       GEQRT, LARFB, TSQRT, SSRFB]

Example: Cholesky 4x4
The runtime uses the symbolic information from the compiler to make scheduling, message
passing, and runtime decisions.
• Data distribution: regular or irregular
• Task priorities
• No left-looking or right-looking variants; more adaptive or opportunistic

Software Stack
[Diagram: the PLASMA distribution consists of PLASMA, QUARK, and core BLAS, layered on
 LAPACK, CBLAS, (C)LAPACK, and BLAS (commercial or Netlib), plus hwloc and POSIX threads.]
QUARK  - QUeuing And Runtime for Kernels
LAPACK - Linear Algebra PACKage
BLAS   - Basic Linear Algebra Subroutines
hwloc  - hardware locality

Big DAGs: No Global Critical Path
• DAGs get very big, very fast.
  - So windows of active tasks are used; this means no global critical path.
  - Matrix of NB x NB tiles; O(NB^3) operations.
• NB = 100 gives 1 million tasks.
PLASMA Local Scheduling - Dynamic Scheduling: Sliding Window
[Animation over four slides: tile LU factorization, 10 x 10 tiles, ~300 tasks, ~100 task
 window; the window of active tasks slides over the DAG as tasks complete.]

Exascale (10^18 Flop/s) Systems: Two Possible Swim Lanes
• Light weight processors (think BG/P)
  - ~1 GHz processor (10^9)
  - ~1 Kilo cores/socket (10^3)
  - ~1 Mega sockets/system (10^6)
  Socket level: cores scale out for planar geometry.
• Hybrid systems (think GPU based)
  - ~1 GHz processor (10^9)
  - ~10 Kilo FPUs/socket (10^4)
  - ~100 Kilo sockets/system (10^5)
  Node level: 3D packaging.
The High Cost of Data Movement
• Flop/s, or percentage of peak flop/s, become much less relevant.

Approximate power costs (in picoJoules):
                      2011      2018
DP FMADD flop         100 pJ    10 pJ
DP DRAM read          4800 pJ   1920 pJ
Local interconnect    7500 pJ   2500 pJ
Cross system          9000 pJ   3500 pJ
Source: John Shalf, LBNL

• Algorithms & software: minimize data movement; perform more work per unit of data
  movement.
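Using the 2011 column as a back-of-the-envelope, data movement dominates unless each operand fetched from DRAM is reused many times. A small sketch of that arithmetic under simplifying assumptions (the ~3n^2 words for a matrix multiply ignore cache misses, and the costs are the approximate figures from the table):

    # Approximate 2011 energy costs from the table above, in picojoules.
    FLOP_PJ, DRAM_PJ = 100.0, 4800.0

    def pj_per_flop(flops, dram_words):
        return (flops * FLOP_PJ + dram_words * DRAM_PJ) / flops

    # Memory-bound: a dot product of length n streams 2n words for 2n flops.
    n = 10**6
    print(pj_per_flop(2*n, 2*n))        # ~4900 pJ/flop: dominated by DRAM traffic

    # Compute-bound: an n x n matrix multiply does 2n^3 flops on ~3n^2 words,
    # so each word read from DRAM is reused O(n) times.
    n = 1000
    print(pj_per_flop(2*n**3, 3*n**2))  # ~107 pJ/flop: close to the flop cost itself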
Factors that Necessitate Redesign of Our Software
• Steepness of the ascent from terascale to petascale to exascale
• Extreme parallelism and hybrid design
  - Preparing for million/billion way parallelism
• Tightening memory/bandwidth bottleneck
  - Limits on power/clock speed imply multicore
  - Reducing communication will become much more intense
  - Memory per core changes; the byte-to-flop ratio will change
• Necessary fault tolerance
  - MTTF will drop
  - Checkpoint/restart has limitations
The software infrastructure does not exist today.

Emerging Computer Architectures
• Are needed by applications.
• Applications are given (as a function of time).
• Architectures are given (as a function of time).
• Algorithms and software must be adapted or created to bridge to computer architectures
  for the sake of the complex applications.

Three Design Points Today
• Gigascale laptop: uninode-multicore
  (Your iPhone and iPad are Gflop/s devices)
• Terascale deskside: multinode-multicore
• Petascale center: multinode-multicore

Three Design Points for Tomorrow
♦ Terascale laptop: manycore
♦ Petascale deskside: manynode-manycore
♦ Exascale center: manynode-manycore

Challenges of Using GPUs
• High levels of parallelism
  Many GPU cores [e.g. a Tesla C2050 (Fermi) has 448 CUDA cores]
• Hybrid/heterogeneous architectures
  Match algorithmic requirements to architectural strengths
  [e.g. small, non-parallelizable tasks run on the CPU; large, parallelizable ones on the GPU]
• Compute vs communication gap
  Exponentially growing gap; persistent challenge
  [Processor speed improves 59% per year, memory bandwidth 23%, latency 5.5%]
  [On all levels; e.g. a GPU Tesla S1070 (4 x C1060) has compute power of O(1,000) Gflop/s
   but the GPUs communicate through the CPU using an O(1) GB/s connection]
Matrix Algebra on GPU and Multicore Architectures (MAGMA)
MAGMA: a new generation of linear algebra (LA) libraries to achieve the fastest possible
time to an accurate solution on hybrid/heterogeneous architectures.
Homepage: http://icl.cs.utk.edu/magma/
MAGMA & LAPACK
- MAGMA uses LAPACK and extends its functionality to hybrid systems (with GPUs);
- MAGMA is designed to be similar to LAPACK in functionality, data storage and interface;
- MAGMA leverages years of experience in developing open source LA software packages like
  LAPACK, ScaLAPACK, BLAS, ATLAS, and PLASMA.
MAGMA developers/collaborators
- U of Tennessee, Knoxville; U of California, Berkeley; U of Colorado, Denver
- INRIA Bordeaux - Sud Ouest & INRIA Paris - Saclay, France; KAUST, Saudi Arabia
- Community effort [similar to the development of LAPACK / ScaLAPACK]

Hybridization Methodology
MAGMA uses a HYBRIDIZATION methodology based on:
- Representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES
  among them
- Properly SCHEDULING the tasks' execution over the multicore and GPU hardware components
Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs).
Successfully applied to fundamental linear algebra algorithms:
- One- and two-sided factorizations and solvers
- Iterative linear and eigen-solvers
Faster, cheaper, better?
- High-level
- Leveraging prior developments
- Exceeding in performance homogeneous solutions

Accelerating Dense Linear Algebra with GPUs
[Charts: LU factorization in double precision (for solving a dense linear system) and
 Hessenberg factorization in DP (for the general eigenvalue problem), comparing a hybrid
 GPU node against a large CPU-only system.]
GPU system: Fermi C2050 [448 CUDA cores @ 1.15 GHz] + Intel Q9300 [4 cores @ 2.50 GHz];
  DP peak 515 + 40 GFlop/s; system cost ~ $3,000; power* ~ 220 W.
CPU system: AMD Istanbul [8 sockets x 6 cores (48 cores) @ 2.8 GHz];
  DP peak 538 GFlop/s; system cost ~ $30,000; power* ~ 1,022 W.
* Computation consumed power rate (total system rate minus idle rate), measured with a
  KILL A WATT PS, Model P430.
Architecture of Heterogeneous Multi-core and Multi-GPU Systems
Architecture of a Keeneland compute node:
- Two Intel Xeon 2.8 GHz 6-core X5660 processors (Westmere)
- Three NVIDIA Fermi M2070 GPUs

Cholesky Factorization (DP)
• Weak scalability on many nodes (Keeneland)
• Input sizes: 34560, 46080, 69120, 92160, 138240, 184320, 276480, 460800
[Charts over two slides; the largest run reaches about 75 Tflops.]

MAGMA Software Stack
[Diagram, from single GPU to multi-GPU to distributed:
 - distributed: Tile & LAPACK algorithms with DAGuE (MAGNUM / Rectangular / PLASMA tile
   algorithms)
 - multi-GPU: PLASMA / QUARK scheduler
 - single GPU: LAPACK algorithms and tile kernels; MAGMA 1.0, MAGMA SPARSE, MAGMA BLAS
 Underneath: LAPACK, BLAS, CUDA.
 Supported on Linux, Windows, Mac OS X | C/C++, Fortran | Matlab, Python.]

MAGMA 1.0
• 32 algorithms are developed (total: 122 routines)
  - LU, LL^T, QR, LQ, symmetric eigenvalue, non-symmetric eigenvalue, SVD
  - Every algorithm is in 4 precisions (s/c/d/z, denoted by X)
  - There are 3 mixed precision algorithms (zc & ds, denoted by XX)
  - These are hybrid algorithms, expressed in terms of BLAS
  - Support is for a single CUDA-enabled NVIDIA GPU, either Tesla or Fermi

MAGMA BLAS
• A subset of GPU BLAS, optimized for Tesla and Fermi GPUs

Mixed Precision
• Single precision is 2X faster than double precision
  - With GP-GPUs, up to 8x
• Power saving issues
• Reduced data motion
  - 32 bit data instead of 64 bit data
• Higher locality in cache
  - More data items in cache
(The next several slides repeat material shown earlier in the deck: "Idea Goes Something
Like This…", "Mixed-Precision Iterative Refinement" (two slides), "Ax = b" on the Fermi GPU
(two slides), "Mixed Precision Methods", "Communication Avoiding Algorithms",
"Software Stack", and "Reproducibility".)
Three Ideas for Fault Tolerant Linear Algebra Algorithms
• Lossless diskless checkpointing for iterative methods
  - Checksum maintained in active processors
  - On failure, roll back to the checkpoint and continue
  - No lost data
• Lossy approach for iterative methods
  - No checkpoint of computed data maintained
  - On failure, approximate the missing data and carry on
  - Data is lost, but an approximation is used to recover
• Checkpoint-less methods for dense algorithms
  - Checksum maintained as part of the computation
  - No roll back needed; no lost data
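A minimal sketch of the checksum idea behind the checkpoint-less approach: append a checksum row that is the sum of the data rows, so that if one block of rows is lost it can be rebuilt from the checksum without rolling back. This toy only demonstrates the encode/recover step, not an actual fault-tolerant factorization, and the data sizes are assumptions for the demo.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 6))                # 4 "processor rows" of data

    # Encode: checksum row = sum of all data rows, kept on a spare processor.
    checksum = A.sum(axis=0)

    # Simulate losing the data held by processor 2.
    lost = 2
    surviving = np.delete(A, lost, axis=0)

    # Recover: the missing row is the checksum minus the surviving rows.
    recovered = checksum - surviving.sum(axis=0)
    print(np.allclose(recovered, A[lost]))         # True (up to roundoff)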
PLASMA People
• Current team: Dulceneia Becker, Henricus Bouwmeester, Jack Dongarra, Mathieu Faverge,
  Bilel Hadri, Azzam Haidar, Blake Haugen, Jakub Kurzak, Julien Langou, Hatem Ltaief,
  Piotr Łuszczek
• Past members: Emmanuel Agullo, Wesley Alvaro, Alfredo Buttari
• Outside contributors: Fred Gustavson, Lars Karlsson, Bo Kågström
A copy of the slides is on my website. Google "dongarra".

First…
• Thanks to a number of people who have helped with this work:
  Emmanuel Agullo, George Bosilca, Aurelien Bouteiller, Anthony Danalis, Jim Demmel,
  Tingxing "Tim" Dong, Mathieu Faverge, Azzam Haidar, Thomas Herault, Mitch Horton,
  Jakub Kurzak, Julien Langou, Julie Langou, Pierre Lemarinier, Piotr Luszczek,
  Hatem Ltaief, Stanimire Tomov, Asim YarKhan, …
• Much of what I will describe has been done before, at least in theory.
28 Supercomputers in the UK

Rank | Site                         | Computer                                                       | Cores  | Rmax (Tflop/s)
24   | University of Edinburgh      | Cray XE6, 12-core 2.1 GHz                                      | 44,376 | 279
65   | Atomic Weapons Establishment | Bullx B500 Cluster, Xeon X56xx 2.8 GHz, QDR Infiniband         | 12,936 | 124
69   | ECMWF                        | Power 575, p6 4.7 GHz, Infiniband                              | 8,320  | 115
70   | ECMWF                        | Power 575, p6 4.7 GHz, Infiniband                              | 8,320  | 115
93   | University of Edinburgh      | Cray XT4, 2.3 GHz                                              | 12,288 | 95
154  | University of Southampton    | iDataPlex, Xeon QC 2.26 GHz, Infiniband, Windows HPC2008 R2    | 8,000  | 66
160  | IT Service Provider          | Cluster Platform 4000 BL685c G7, Opteron 12C 2.2 GHz, GigE     | 14,556 | 65
186  | IT Service Provider          | Cluster Platform 3000 BL460c G7, Xeon X5670 2.93 GHz, GigE     | 9,768  | 59
190  | Computacenter (UK) LTD       | Cluster Platform 3000 BL460c G1, Xeon L5420 2.5 GHz, GigE      | 11,280 | 58
191  | Classified                   | xSeries x3650 Cluster, Xeon QC GT 2.66 GHz, Infiniband         | 6,368  | 58
211  | Classified                   | BladeCenter HS22 Cluster, WM Xeon 6-core 2.66 GHz, Infiniband  | 5,880  | 55
212  | Classified                   | BladeCenter HS22 Cluster, WM Xeon 6-core 2.66 GHz, Infiniband  | 5,880  | 55
213  | Classified                   | BladeCenter HS22 Cluster, WM Xeon 6-core 2.66 GHz, Infiniband  | 5,880  | 55
228  | IT Service Provider          | Cluster Platform 4000 BL685c G7, Opteron 12C 2.1 GHz, GigE     | 12,552 | 54
233  | Financial Institution        | iDataPlex, Xeon X56xx 6C 2.66 GHz, GigE                        | 9,480  | 53
234  | Financial Institution        | iDataPlex, Xeon X56xx 6C 2.66 GHz, GigE                        | 9,480  | 53
278  | UK Meteorological Office     | Power 575, p6 4.7 GHz, Infiniband                              | 3,520  | 51
279  | UK Meteorological Office     | Power 575, p6 4.7 GHz, Infiniband                              | 3,520  | 51
339  | Computacenter (UK) LTD       | Cluster Platform 3000 BL460c, Xeon 54xx 3.0 GHz, GigEthernet   | 7,560  | 47
351  | Asda Stores                  | BladeCenter HS22 Cluster, WM Xeon 6-core 2.93 GHz, GigE        | 8,352  | 47
365  | Financial Services           | xSeries x3650M2 Cluster, Xeon QC E55xx 2.53 GHz, GigE          | 8,096  | 46
404  | Financial Institution        | BladeCenter HS22 Cluster, Xeon QC GT 2.53 GHz, GigEthernet     | 7,872  | 44
405  | Financial Institution        | BladeCenter HS22 Cluster, Xeon QC GT 2.53 GHz, GigEthernet     | 7,872  | 44
415  | Bank                         | xSeries x3650M3, Xeon X56xx 2.93 GHz, GigE                     | 7,728  | 43
416  | Bank                         | xSeries x3650M3, Xeon X56xx 2.93 GHz, GigE                     | 7,728  | 43
482  | IT Service Provider          | Cluster Platform 3000 BL460c G6, Xeon L5520 2.26 GHz, GigE     | 8,568  | 40
484  | IT Service Provider          | Cluster Platform 3000 BL460c G6, Xeon X5670 2.93 GHz, 10G      | 4,392  | 40
Programming Model Approaches
• Hierarchical approach (intra-node + inter-node)
• Part I: Inter-node model for communicating between nodes
  - MPI scaling to millions of nodes: importance high; risk low
  - One-sided communication scaling: importance medium; risk low
• Part II: Intra-node model for on-chip concurrency
  - Overriding risk: no single path for node architecture
  - OpenMP, Pthreads: high risk (may not be feasible with node architectures); high payoff
    (already in some applications)
  - New API, extended PGAS, or CUDA/OpenCL to handle hierarchies of memories and cores:
    medium risk (reflects architecture directions); medium payoff (reprogramming of node code)
• Unified approach: a single high level model for the entire system
  - High risk; high payoff for new codes, new application domains

Programming models require a dual approach.
• Hierarchical approach: intra-node + inter-node
• Part I: Inter-node model for communicating between nodes (something old ...)
  - MPI scaling to millions of nodes: importance high; risk low; provides a path for
    incremental progress
  - One-sided communication scaling: importance medium; risk low
• Part II: Intra-node model for on-chip concurrency (something new ...)
  - Overriding risk: no single path for node architecture
  - OpenMP, Pthreads: high risk (may not be feasible with node architectures); high payoff
    (already in some applications)
  - New API, extended PGAS, or CUDA/OpenCL to handle hierarchies of memories and cores:
    medium risk (reflects architecture directions); medium payoff (reprogramming of node code)
• Unified approach: a single high level model for the entire system
  - High risk; high payoff for new codes, new application domains
Power Profiles
PLASMA LU solver, double precision versus mixed precision; N = 8400, using 4 cores.
Two dual-core 1.8 GHz AMD Opteron processors; theoretical peak 14.4 Gflops per node;
DGEMM using 4 threads: 12.94 Gflops; PLASMA 2.3.1, GotoBLAS2.
[Power traces (Watts versus time in seconds) for System, CPU, Memory, Disk, and
 Motherboard, for the DP run and for the mixed precision run.]

                                                      PLASMA DP   PLASMA Mixed
Time to solution (s)                                  39.5        22.8
GFLOPS                                                10.01       17.37
Accuracy ||Ax - b|| / ((||A|| ||x|| + ||b||) N ε)     2.0E-02     1.3E-01
Iterations                                            -           7
System energy (KJ)                                    10852.8     6314.8

The High Cost of Data Movement
• Flop/s, or percentage of peak flop/s, become much less relevant.

Approximate power costs (in picoJoules):
                                   2011      2018
DP FMADD flop                      100 pJ    10 pJ
DP DRAM read                       2000 pJ   1000 pJ
DP copper link traverse (short)    1000 pJ   100 pJ
DP optical link traverse (long)    3000 pJ   500 pJ

• "Nothing you can't spell will ever work." - Will Rogers
Prioritization of Critical Path and Noncritical Tasks
• DAG scheduling of critical path tasks
• Allows taking advantage of asynchronicity between major steps and adaptive load balancing
  for noncritical tasks

Synchronization Avoiding Methods

In the States: Co-Design Centers & Exascale Software Center
• Co-Design Centers
  - The co-design process is where system architects, application software designers,
    applied mathematicians, and computer scientists work together to produce a
    computational science discovery environment.
• Exascale Software Center
  - Deliver high quality system software for exascale platforms (~2015, ~2018)
  - Identify required software capabilities
  - Identify gaps
  - Design and develop open-source software components
    - Both: evolve existing components, develop new ones
    - Includes maintainability, support, verification
  - Ensure functionality, stability, and performance
  - Collaborate with platform vendors to integrate software

Increasing the Level of Asynchronous Behavior
• DAG-level description of methods
  - Express parallelism explicitly in DAGs, so that tasks can be scheduled dynamically,
    massive parallelism supported, and common optimization techniques applied to increase
    throughput.
• A scheduler is needed
• Standards
• LAPACK-style LU/LL^T/QR today: Step 1, Step 2, Step 3, Step 4, ...
  - Fork-join, bulk synchronous processing
Tiled Operations & Look Ahead
• Break each task into smaller operations: tiles.
• Unwind the outer loop.

Scaling for LU
[Chart: performance versus matrix size.]

If We Had A Small Matrix Problem
• We would generate the DAG, find the critical path and execute it.
• The DAG is too large to generate ahead of time:
  - Not explicitly generated
  - Dynamically generate the DAG as we go
• Machines will have a large number of cores in a distributed fashion:
  - Will have to engage in message passing
  - Distributed management
  - Locally have a run time system

The DAGs are Large
• Here is the DAG for a factorization on a 20 x 20 matrix.
• For a large matrix, say O(10^6), the DAG is huge.
• Many challenges for the software.

PLASMA Scheduling - Dynamic Scheduling: Sliding Window
[Animation over four slides: tile LU factorization, 10 x 10 tiles, 300 tasks,
 100 task window.]
DAG and Scheduling
• The DAG is dynamically generated and implicit.
• Everything is designed for distributed memory systems.
• A runtime system runs on each node or core.
• Run time:
  - Bin 1: see if new data has arrived
  - Bin 2: see if new dependences are satisfied; if so, move the task to Bin 3
  - Bin 3: execute a task that's ready; notify children of completion; send data to children
  - If there is no work, do work stealing

Some Questions
• What's the best way to represent the DAG?
• What's the best approach to dynamically generating the DAG?
• What run time system should we use?
  - We will probably build something that we would target to the underlying system's RTS.
• What about work stealing?
  - Can we do better than nearest neighbor work stealing?
• What does the program look like?
  - Experimenting with Cilk, Charm++, UPC, Intel Threads
  - I would like to reuse as much of the existing software as possible.
PLASMA Scheduling - Dynamic Scheduling with QUARK
• Sequential algorithm definition
• Side-effect-free tasks
• Directions of arguments (IN, OUT, INOUT)
• Runtime resolution of data hazards (RaW, WaR, WaW)
• Implicit construction of the DAG
• Processing of the tasks by a sliding window
An old concept:
• Jade (Stanford University)
• SMP Superscalar (Barcelona Supercomputing Center)
• StarPU (INRIA)
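A toy version of what such a runtime does: tasks are inserted sequentially with IN/OUT argument tags, and the runtime serializes only where a read-after-write, write-after-read, or write-after-write hazard exists; everything else may run concurrently. QUARK's real API is in C and handles far more (tile locality, priorities, windows), so the class and task names below are illustrative assumptions, a sketch of the hazard bookkeeping only.

    from collections import defaultdict

    class ToyRuntime:
        """Build a DAG from sequential task insertion with IN/OUT data tags."""
        def __init__(self):
            self.tasks = []                       # (name, deps)
            self.last_writer = {}                 # data -> task id of last writer
            self.readers = defaultdict(list)      # data -> readers since last write

        def insert(self, name, reads=(), writes=()):
            tid, deps = len(self.tasks), set()
            for d in reads:                       # RaW: wait for the last writer
                if d in self.last_writer:
                    deps.add(self.last_writer[d])
            for d in writes:                      # WaR and WaW hazards
                deps.update(self.readers[d])
                if d in self.last_writer:
                    deps.add(self.last_writer[d])
            for d in reads:
                self.readers[d].append(tid)
            for d in writes:
                self.last_writer[d] = tid
                self.readers[d] = []
            self.tasks.append((name, sorted(deps)))
            return tid

    rt = ToyRuntime()
    rt.insert("POTRF(A00)", writes=["A00"])
    rt.insert("TRSM(A00,A10)", reads=["A00"], writes=["A10"])
    rt.insert("TRSM(A00,A20)", reads=["A00"], writes=["A20"])   # independent of the first TRSM
    rt.insert("SYRK(A10,A11)", reads=["A10"], writes=["A11"])
    for tid, (name, deps) in enumerate(rt.tasks):
        print(tid, name, "depends on", deps)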
PLASMA Scheduling - Dynamic Scheduling: Tile LU Trace
• Regular trace
• Factorization steps pipelined
• Stalling only due to natural load imbalance
8-socket, 6-core (48 cores total) AMD Istanbul 2.8 GHz

Redesign
• Asynchronicity
  - Avoid fork-join (bulk sync design)
• Dynamic scheduling
  - Out-of-order execution
• Fine granularity
  - Independent block operations
• Locality of reference
  - Data storage: block data layout

Communication Reducing Methods

Experimental Results
• On two cluster machines and a Cray XT5 system:
  - Cluster 1: 2 cores per node (Grig at UTK)
  - Cluster 2: 8 cores per node (Newton at UTK)
  - Cray XT5: 12 cores per node (Jaguar at ORNL)
• In comparison with the vendors' ScaLAPACK library
• Take as input a tall and skinny matrix
Strong Scalability on Jaguar
• Fixed-size input for an increasing number of cores
• Each node has 2 sockets, 6-core AMD Opteron 2.6 GHz per socket
[Chart: GFLOPS for Tile CA-QR versus ScaLAPACK on 1 to 384 cores.]

Weak Scalability on Jaguar
• The input size increases as the number of cores increases
• Each node has 2 sockets, 6-core AMD Opteron per socket
[Chart: GFLOPS per core for Tile CA-QR versus ScaLAPACK on 1 to 3072 cores, with the peak
 and dgemm rates shown as upper bounds.]

Applying Tile CAQR to General-Size Matrices
• Each node has 2 sockets, 6-core AMD Opteron per socket
• Using 16 nodes on Jaguar (i.e., 192 cores)
[Chart: GFLOPS for Tile CA-QR versus ScaLAPACK as the number of tile columns grows from 8
 to 512; crossover point on matrices with 512 rows.]

[Figure: idle processes P0 and P1; GMRES speedups on an 8-core Clovertown.]
Communication-Avoiding Iterative Methods
• Iterative solvers:
  - Dominant cost of many apps (up to 80+% of runtime).
• Exascale challenges for iterative solvers:
  - Collectives, synchronization.
  - Memory latency/BW.
  - Not viable on exascale systems in present forms.
• Communication-avoiding (s-step) iterative solvers:
  - Idea: perform s steps in bulk (s = 5 or more):
    - s times fewer synchronizations.
    - s times fewer data transfers: better latency/BW.
  - Problem: numerical accuracy of orthogonalization.
• TSQR implementation:
  - 2-level parallelism (inter- and intra-node).
  - Memory hierarchy optimizations.
  - Flexible node-level scheduling via Intel Threading Building Blocks.
  - Generic scalar data type: supports mixed and extended precision.
• TSQR capability:
  - Critical for exascale solvers.
  - Part of the Trilinos scalable multicore capabilities.
  - Helps all iterative solvers in Trilinos (available to external libraries, too).
• Staffing: Mark Hoemmen (lead, post-doc, UC Berkeley), M. Heroux.
• Part of the Trilinos 10.6 release, Sep 2010.
[Chart legend: LAPACK (serial) versus MGS (threaded modified Gram-Schmidt) versus TSQR.]
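A minimal sketch of the s-step idea referenced above: generate s Krylov basis vectors with one sweep over the matrix, then orthogonalize the whole block in a single step (a plain QR stands in for TSQR here), instead of one global synchronization per vector as in classical Gram-Schmidt. Real s-step solvers use better-conditioned bases (Newton or Chebyshev) than the monomial one below, and the test matrix is an assumption for the demo.

    import numpy as np
    import scipy.sparse as sp

    def s_step_basis(A, v, s):
        """Monomial Krylov block [v, Av, ..., A^s v], then one block orthogonalization."""
        V = np.empty((A.shape[0], s + 1))
        V[:, 0] = v / np.linalg.norm(v)
        for j in range(s):                    # the "matrix powers" sweep: no dot products here
            V[:, j + 1] = A @ V[:, j]
        Q, _ = np.linalg.qr(V)                # one communication-heavy step instead of s+1
        return Q

    n = 2000
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    Q = s_step_basis(A, np.ones(n), s=5)
    print(np.linalg.norm(Q.T @ Q - np.eye(6)))   # orthonormal to roundoff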
Mixed Precision Methods

Exploiting Mixed Precision Computations
• Single precision is faster than DP because:
  - Higher parallelism within floating point units
    - 4 ops/cycle (usually) instead of 2 ops/cycle
  - Reduced data motion
    - 32 bit data instead of 64 bit data
  - Higher locality in cache
    - More data items in cache

Fault Overcoming Methods

Autotuning

Automatic Performance Tuning
• Writing high performance software is hard.
• Ideal: get a high fraction of peak performance from one algorithm.
• Reality: the best algorithm (and its implementation) can depend strongly on the problem,
  computer architecture, compiler, ...
  - The best choice can depend on knowing a lot of applied mathematics and computer science.
  - It changes with each new hardware and compiler release.
• Automatic performance tuning:
  - Use machine time in place of human time for tuning
  - Search over possible implementations
  - Use performance models to restrict the search space
  - Past successes: ATLAS, FFTW, Spiral, Open-MPI

How to Deal with Complexity?
• Many parameters in the code need to be optimized.
• Software adaptivity is the key for applications to effectively use available resources
  whose complexity is exponentially increasing.
[Diagram of the ATLAS tuning flow: detect hardware parameters (L1 size, NR, MulAdd, L*);
 the ATLAS search engine (MMSearch) drives the ATLAS matrix-multiply code generator (MMCase)
 with NB, MU/NU/KU, xFetch, MulAdd, and latency; compile, execute, and measure the MFLOPS
 of the MiniMMM source.]
Auto-Tuning
The best algorithm implementation can depend strongly on the problem, computer
architecture, compiler, ...
There are 2 main approaches:
- Model-driven optimization
  [Analytical models for various parameters; heavily used in the compilers community;
   may not give optimal results]
- Empirical optimization
  [Generate a large number of code versions and run them on a given platform to determine
   the best performing one; effectiveness depends on the chosen parameters to optimize and
   the search heuristics used]
The natural approach is to combine them in a hybrid approach
[1st model-driven, to limit the search space for a 2nd empirical part]
[Another aspect is adaptivity: to treat cases where tuning cannot be restricted to
 optimizations at design, installation, or compile time]
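A toy version of the empirical side: time a few candidate block sizes for a blocked matrix multiply and keep the fastest, which is the kind of search ATLAS runs (over many more parameters) at install time. The candidate list, kernel, and problem size here are illustrative assumptions, not what any production autotuner actually searches.

    import time
    import numpy as np

    def blocked_matmul(A, B, nb):
        n = A.shape[0]
        C = np.zeros_like(A)
        for i in range(0, n, nb):
            for j in range(0, n, nb):
                for k in range(0, n, nb):
                    C[i:i+nb, j:j+nb] += A[i:i+nb, k:k+nb] @ B[k:k+nb, j:j+nb]
        return C

    def autotune(n=768, candidates=(32, 64, 128, 256)):
        rng = np.random.default_rng(0)
        A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
        timings = {}
        for nb in candidates:                  # empirical search over one parameter
            t0 = time.perf_counter()
            blocked_matmul(A, B, nb)
            timings[nb] = time.perf_counter() - t0
        return min(timings, key=timings.get), timings

    best, timings = autotune()
    print("best block size:", best, timings)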
165
International Community
Effort
166
ƒ We believe this needs to be an international
collaboration for various reasons including:
• The scale of investment
• The need for international input on requirements
p
Asians, and others are working
g on
• US, Europeans,
their own software that should be part of a larger
vision for HPC.
• No global evaluation of key missing components
• Hardware features are uncoordinated with
software development
www.exascale.org
Outline
•
•
•
•
•
Push towards Exascale
Science drivers
IESP and EESI work
Importance of doing it now
Worldwide HPC and challenges
167
Moore’s Law Reinterpreted
p
• Number of cores per chip doubles
every 2 year,
year while clock speed
decreases (not increases).
ƒ Need to deal with systems with millions
of concurrent threads
• Future generation will have billions of
threads!
ƒ Need to be able to easily replace interchip
p parallelism
p
with intro-chip
p
parallelism
• Number of threads of execution
doubles every 2 year
10+ Pflop/s Systems Planned in the States
♦ DOE funded: Titan at ORNL, based on a Cray design with accelerators, 20 Pflop/s, 2012
♦ DOE funded: Sequoia at Lawrence Livermore Nat. Lab, based on IBM's BG/Q, 20 Pflop/s, 2012
♦ DOE funded: BG/Q at Argonne National Lab, based on IBM's BG/Q, 10 Pflop/s, 2012
♦ NSF funded: Blue Waters at University of Illinois UC, based on IBM's Power 7 processor,
  10 Pflop/s, 2012

Roadmap Components
www.exascale.org

Exascale Software Center (in 1 slide)
• Scope
  - Deliver high quality system software for exascale platforms (~2015, ~2018)
  - Identify software gaps, research & develop solutions, test and support deployment
  - Increase the productivity and capability, and reduce the risk, of exascale deployments
• Cost:
  - Applied R&D: ~10-20 distributed teams of 3 to 7 people each
  - Large, primarily centralized QA, integration, and verification center
• Schedule overview
  - 2010 - Q1 2011: planning and technical reviews
  - April 2011: launch the Exascale Software Center!
  - 2014, 2017: software ready for integration for the 2015 and 2018 systems, respectively

Scaling
- Strong scaling: fixed problem size.
  • Data on each node decreases as the number of nodes increases.
- Weak scaling: fixed data size on each node.
  • Problem size increases as the number of nodes increases.
Potential System Architecture Targets

System attributes            2010       "2015"                   "2018"
System peak                  2 Peta     200 Petaflop/sec         1 Exaflop/sec
Power                        6 MW       15 MW                    20 MW
System memory                0.3 PB     5 PB                     32-64 PB
Node performance             125 GF     0.5 TF or 7 TF           1 TF or 10 TF
Node memory BW               25 GB/s    0.1 TB/sec or 1 TB/sec   0.4 TB/sec or 4 TB/sec
Node concurrency             12         O(100) or O(1,000)       O(1,000) or O(10,000)
System size (nodes)          18,700     50,000 or 5,000          1,000,000 or 100,000
Total node interconnect BW   1.5 GB/s   20 GB/sec                200 GB/sec
MTTI                         days       O(1 day)                 O(1 day)
Moore's Law Reinterpreted
• The number of cores per chip will double every two years.
• Clock speed will not increase (and may possibly decrease) because of power:
    Power ∝ Voltage^2 * Frequency
    Voltage ∝ Frequency
    Power ∝ Frequency^3
• Need to deal with systems with millions of concurrent threads.
• Need to deal with inter-chip parallelism as well as intra-chip parallelism.
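One consequence of the cubic relationship, under the idealized model above: running twice as many cores at half the clock delivers the same nominal throughput for roughly a quarter of the dynamic power, which is the economics behind the core-count explosion. A small sketch of that arithmetic (the model and the two configurations are assumptions for illustration):

    # Idealized dynamic power model: power ~ cores * f^3 (voltage tracks frequency).
    def relative_power(cores, freq):
        return cores * freq**3

    def relative_throughput(cores, freq):
        return cores * freq

    base = (1, 1.0)                # one core at full clock
    wide = (2, 0.5)                # two cores at half clock
    print(relative_throughput(*wide) / relative_throughput(*base))   # 1.0: same work rate
    print(relative_power(*wide) / relative_power(*base))             # 0.25: quarter the power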
Future Computer Systems
• Most likely a hybrid design
• Think standard multicore chips plus accelerators (GPUs)
• Today accelerators are attached; the next generation will be more integrated
• Intel's Larrabee? Now called "Knights Corner", with "Knights Ferry" to come
  - 48 x86 cores
• AMD's Fusion in 2011 - 2013
  - Multicore with embedded ATI graphics
• Nvidia's plans?

What's Next?
[Diagram of possible chip designs: all large cores; mixed large and small cores; many small
 cores; all small cores; many floating-point cores; different classes of chips for home,
 games/graphics, business, and scientific markets.]

MAGMA Software
• Available through MAGMA's homepage: http://icl.cs.utk.edu/magma/
• Included are the 3 one-sided matrix factorizations
• Iterative refinement algorithm (mixed precision)
• Standard (LAPACK) data layout and accuracy
• Two LAPACK-style interfaces
  - CPU interface: both input and output are on the CPU
  - GPU interface: both input and output are on the GPU
• This release is intended for a single GPU
Today's Fastest Computer

System attributes            2011 (Fujitsu K)   "2015"                   "2018"                   Difference 2011 & 2018
System peak                  8.7 Pflop/s        200 Pflop/s              1 Eflop/sec              O(100)  (115x)
Power                        10 MW              15 MW                    ~20 MW
System memory                1.6 PB             5 PB                     32-64 PB                 O(10)
Node performance             128 GF             0.5 TF or 7 TF           1 TF or 10 TF            O(10) - O(100)
Node memory BW               64 GB/s            0.1 TB/sec or 1 TB/sec   0.4 TB/sec or 4 TB/sec   O(100)  (62x)
Node concurrency             8                  O(100) or O(1,000)       O(1,000) or O(10,000)    O(100) - O(1000)
Total concurrency            548,352            O(10^8)                  O(10^9)                  O(1000)  (1823x)
Total node interconnect BW   20 GB/s            20 GB/sec                200 GB/sec               O(10)
MTTI                         days               O(1 day)                 O(1 day)                 -O(10)

Potential System Architecture Targets with $200M and 20 MW caps
(Same table as the previous slide.)