Oskar@Linz Keynote Speech


Transcript: Oskar@Linz Keynote Speech

Computing in Space
PRACE Keynote, Linz
Oskar Mencer, April 2014
Thinking Fast and Slow
Daniel Kahneman
Nobel Prize in Economics, 2002
14 × 27 = ?
Kahneman splits thinking into:
• System 1: fast, hard to control (its quick estimate: about 300)
• System 2: slow, easier to control (its worked answer: 378)
Assembly-line computing in action:
• SYSTEM 1: x86 cores
• SYSTEM 2: flexible memory plus logic, with optimal encoding, a low-latency memory system and high-throughput memory
• The goal: minimize data movement
Temporal Computing (1D)
• A program is a sequence of instructions
• Performance is dominated by:
– Memory latency
– ALU availability
[Figure: a CPU attached to memory, with a timeline repeated for each instruction: get instruction, read data, compute, write result; the actual computation time is only a small slice of the total time.]
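A minimal sequential C sketch of this 1D model (my illustration, not from the slides): each result needs its own read, compute and write, one after the other, so memory latency and ALU availability take turns on the critical path.

#include <stddef.h>

/* Temporal (1D) computing: one instruction stream, one result at a time.
 * Each iteration reads its data, computes, and writes its result before
 * the next one can proceed. */
void scale_add_temporal(float *result, const float *data, float a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float x = data[i];       /* read data i    */
        float y = a * x + 1.0f;  /* compute        */
        result[i] = y;           /* write result i */
    }
}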
Spatial Computing (2D)
• Synchronous data movement
[Figure: a 2D fabric of ALUs with control and buffering; data streams in on one side and results stream out on the other. Reading data [1..N], computation and writing results [1..N] overlap in a pipeline.]
• Performance is throughput dominated
Computing in Time vs Computing in Space

Computing in Time:
• 512 controlflow cores
• 2 GHz
• 10KB on-chip SRAM
• 8GB on-board DRAM
• 1 result every 100* clock cycles

Computing in Space:
• 10,000* dataflow cores
• 200 MHz
• 5MB on-chip SRAM (>10TB/s)
• 96GB of DRAM per DFE
• 1 result every clock cycle

*depending on application!

=> *200x faster per manycore card
=> *10x less power
=> *10x bigger problems per node
=> *10x fewer nodes needed
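A rough sanity check of the 200x figure, using the numbers above and taking the asterisked values at face value (my arithmetic, not part of the original deck):

Computing in Time: 512 cores × 2 GHz × 1/100 results per cycle ≈ 1.0 × 10^10 results/s
Computing in Space: 10,000 cores × 200 MHz × 1 result per cycle = 2.0 × 10^12 results/s
Ratio: roughly 200x per card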
OpenSPL in Practice
The new CME Electronic Trading Gateway will be going live in March 2014!
Webinar Page: http://www.cmegroup.com/education/new-ilink-architecture-webinar.html
CME Group Inc. (Chicago Mercantile Exchange) is one of the largest options and futures exchanges. It owns and operates large derivatives and futures exchanges in Chicago and New York City, as well as online trading platforms. It also owns the Dow Jones stock and financial indexes, and CME Clearing Services, which provides settlement and clearing of exchange trades. [from Wikipedia]
Maxeler Seismic Imaging Platform
• Maxeler provides hardware plus application software for seismic modeling
• MaxSkins allow access to Ultrafast Modelling and RTM for research and development of RTM and Full Waveform Inversion (FWI) from MATLAB, Python, R, C/C++ and Fortran
• Bonus: MaxGenFD is a MaxCompiler plugin that lets the user specify any 3D finite difference problem, including the PDE, coefficients, boundary conditions, etc., and automatically generates a fully parallelized implementation for a whole rack of Maxeler MPC nodes (a sketch of such a stencil follows after the list below)
Application areas:
• O&G
• Weather
• 3D PDE Solvers
• High Energy Physics
• Medical Imaging
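To make "3D finite difference problem" concrete, here is a minimal sequential C sketch of a 7-point stencil sweep, the kind of kernel such a tool would generate in parallelized form (an illustration of the problem class only; the slides do not show MaxGenFD's actual input or output):

#include <stddef.h>

#define IDX(i, j, k, ny, nz) (((size_t)(i) * (ny) + (j)) * (nz) + (k))

/* One Jacobi-style sweep of a 7-point finite-difference stencil over the
 * interior of an nx x ny x nz grid, with coefficients c0 and c1. */
void stencil7(float *out, const float *in, float c0, float c1,
              int nx, int ny, int nz)
{
    for (int i = 1; i < nx - 1; i++)
        for (int j = 1; j < ny - 1; j++)
            for (int k = 1; k < nz - 1; k++)
                out[IDX(i, j, k, ny, nz)] =
                    c0 * in[IDX(i, j, k, ny, nz)] +
                    c1 * (in[IDX(i - 1, j, k, ny, nz)] + in[IDX(i + 1, j, k, ny, nz)] +
                          in[IDX(i, j - 1, k, ny, nz)] + in[IDX(i, j + 1, k, ny, nz)] +
                          in[IDX(i, j, k - 1, ny, nz)] + in[IDX(i, j, k + 1, ny, nz)]);
}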
Example: a dataflow graph generated by MaxCompiler, with 4,866 static dataflow cores in one chip. Mission Impossible?
Computing in Space - Why Now?
• Semiconductor technology is ready
– Within ten years (2003 to 2013) the number of transistors on a chip went up from 400M (Itanium 2) to 5 billion (Xeon Phi)
• Memory performance isn't keeping up
– Memory density has followed the trend set by Moore's law
– But memory latency has increased from 10s to 100s of CPU clock cycles
– As a result, on-die cache as a share of die area increased from 15% (at 1µm) to 40% (at 32nm)
– The memory latency gap could eliminate most of the benefits of CPU improvements
• Petascale challenges (10^15 FLOPS)
– Clock frequencies stagnated in the few-GHz range
– Energy usage and power wastage of modern HPC systems are becoming a huge economic burden that cannot be ignored any longer
– Requirements for annual performance improvements grow steadily
– Programmers continue to rely on sequential execution (the 1D approach)
• For affordable petascale systems → a novel approach is needed
OpenSPL Example: x² + 30

SCSVar x = io.input("x", scsInt(32));
SCSVar result = x * x + 30;
io.output("y", result, scsInt(32));

[Dataflow graph: x feeds both inputs of a multiplier, the product feeds an adder together with the constant 30, and the sum is output as y.]
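Read as a stream program, the three lines above describe what happens to every element flowing through the chip. A sequential C equivalent (illustrative only, not from the slides) is:

#include <stdint.h>
#include <stddef.h>

/* y[i] = x[i] * x[i] + 30 for every element of the stream. */
void square_plus_30(int32_t *y, const int32_t *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * x[i] + 30;
}

On the DFE the loop disappears: the multiplier and the adder exist as physical units, and a new element can enter the pipeline every clock cycle.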
OpenSPL Example: Moving Average

Y(n) = (X(n-1) + X(n) + X(n+1)) / 3

SCSVar x = io.input("x", scsFloat(7,17));
SCSVar prev = stream.offset(x, -1);
SCSVar next = stream.offset(x, 1);
SCSVar sum = prev + x + next;
SCSVar result = sum / 3;
io.output("y", result, scsFloat(7,17));
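For comparison, a minimal sequential C sketch of the same computation (illustrative only; the slides do not say how the DFE treats the stream boundaries at offsets -1 and +1, so the two edge elements are simply skipped here):

#include <stddef.h>

/* y[i] = (x[i-1] + x[i] + x[i+1]) / 3 for the interior points; the
 * x[i-1] and x[i+1] reads are what stream.offset(x, -1) and
 * stream.offset(x, 1) express on the stream. */
void moving_average3(float *y, const float *x, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++)
        y[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0f;
}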
OpenSPL Example: Choices

SCSVar x = io.input("x", scsUInt(24));
SCSVar result = (x > 10) ? x + 1 : x - 1;
io.output("y", result, scsUInt(24));

[Dataflow graph: x is compared against the constant 10; the result of the comparison selects between x + 1 and x - 1, and the selected value is output as y.]
OpenSPL and MaxAcademy
17 lectures/exercises, Theory and Practice of Computing in Space
LECTURE 1: Concepts for Computing in Space
LECTURE 2: Converting Temporal Code to Graphs
LECTURE 3: Computing, Storage and Networking
LECTURE 4: OpenSPL
LECTURE 5: Dataflow Engines (DFEs)
LECTURE 6: Programming DFEs (Basics)
LECTURE 7: Programming DFEs (Advanced)
LECTURE 8: Programming DFEs (Dynamic and multiple kernels)
LECTURE 9: Application Case Studies I
LECTURE 10: Making things go fast
LECTURE 11: Numerics
LECTURE 12: Application Case Studies II
LECTURE 13: System Perspective
LECTURE 14: Verifying Results
LECTURE 15: Performance Modelling
LECTURE 16: Economics of Computing in Space
LECTURE 17: Summary and Conclusions
Maxeler Dataflow Engine Platforms
• High Density DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• The Dataflow Appliance: dense compute with 8 DFEs, 384GB of RAM and dynamic allocation of DFEs to CPU servers with zero-copy RDMA access
• The Low Latency Appliance: Intel Xeon CPUs and 1-2 DFEs with direct links to up to six 10Gbit Ethernet connections

Bringing Scalability and Efficiency to the Datacenter
3000³ Modeling*
Compared to 32 3GHz x86 cores parallelized using MPI.
[Chart: equivalent CPU cores (up to roughly 2,000) versus number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz and 70Hz peak frequency.]
*presented at SEG 2010.
8 full Intel racks (~100kWatts) => 2 MaxNodes (2U) Maxeler system (<1kWatt)
Typical Scalability of Sparse Matrix Solvers
[Charts: relative speed versus number of cores for the Eclipse E300 2 Mcell benchmark (2-node Nehalem 2.93 GHz) and the Visage geomechanics FEM benchmark (2-node Westmere 3.06 GHz).]
Sparse Matrix Solving
O. Lindtjorn et al., 2010
• Given matrix A and vector b, find vector x in: Ax = b
• Typically memory bound, not parallelisable
• 1 MaxNode achieved 20-40x the performance of an x86 node
[Chart: speedup per 1U node (up to roughly 50x) versus compression ratio for the GREE0A and 1new01 matrices, achieved through domain-specific compression (address and data encoding).]
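The slides do not spell out the address and data encoding; one common form of address compression for sparse matrices (an illustration only, not necessarily Maxeler's scheme) is delta-encoding each row's sorted column indices so they fit in fewer bits:

#include <stdint.h>
#include <stddef.h>

/* Store sorted column indices as deltas from the previous index; small
 * deltas fit into narrow fields, which cuts the memory traffic per
 * non-zero and raises the effective bandwidth. */
void delta_encode(uint16_t *out, const uint32_t *cols, size_t nnz)
{
    uint32_t prev = 0;
    for (size_t i = 0; i < nnz; i++) {
        out[i] = (uint16_t)(cols[i] - prev);  /* assumes each delta fits in 16 bits */
        prev = cols[i];
    }
}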
Global Weather Simulation
• Atmospheric equations
• Equations: Shallow Water Equations (SWEs)

∂Q/∂t + (1/Λ)·∂(ΛF¹)/∂x¹ + (1/Λ)·∂(ΛF²)/∂x² + S = 0

[L. Gan, H. Fu, W. Luk, C. Yang, W. Xue, X. Huang, Y. Zhang, and G. Yang, "Accelerating solvers for global atmospheric equations through mixed-precision data flow engine", FPL 2013]
Always double-precision needed?
• Range analysis to track the absolute values of all variables
[Figure: the solver pipeline with stages annotated as fixed-point or reduced-precision.]
What about error vs area tradeoffs?
• Bit-accurate simulations for different bit-width configurations
• Accuracy validation
[Chao Yang, Wei Xue, Haohuan Fu, Lin Gan, et al., "A Peta-scalable CPU-GPU Algorithm for Global Atmospheric Simulations", PPoPP 2013]
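A minimal C sketch of the range-analysis idea (my illustration, not the tool from the papers above): track the largest absolute value a variable takes, then derive how many integer bits a signed fixed-point format needs, leaving the rest of the word for the fraction.

#include <math.h>
#include <stddef.h>

/* Number of integer bits (excluding the sign bit) needed to represent
 * values up to max_abs in a signed fixed-point format. */
static int integer_bits_needed(double max_abs)
{
    if (max_abs < 1.0)
        return 0;
    return (int)floor(log2(max_abs)) + 1;
}

/* Track the range over a set of samples and report how many fraction
 * bits remain in a word of word_bits bits (one bit reserved for sign). */
int fraction_bits(const double *samples, size_t n, int word_bits)
{
    double max_abs = 0.0;
    for (size_t i = 0; i < n; i++) {
        double a = fabs(samples[i]);
        if (a > max_abs)
            max_abs = a;
    }
    return word_bits - 1 - integer_bits_needed(max_abs);
}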
And there is also performance gain
Platform         Performance   Speedup
6-core CPU       4.66K         1x
Tianhe-1A node   110.38K       23x
MaxWorkstation   468.1K        100x
MaxNode          1.54M         330x
Meshsize: 1024 × 1024 × 6
MaxNode speedup over a Tianhe-1A node: 14x
And power efficiency too
Platform         Efficiency   Speedup
6-core CPU       20.71        1x
Tianhe-1A node   306.6        14.8x
MaxWorkstation   2.52K        121.6x
MaxNode          3K           144.9x
Meshsize: 1024 × 1024 × 6
MaxNode is 9 times more power efficient (than a Tianhe-1A node)
Weather and climate models on DFEs
Which one is better? Finer grids and higher precision are obviously preferred, but the computational requirements will increase → power usage → $$
What about using reduced precision? (15 bits instead of 64-bit double-precision floating point)

Weather models precision comparison

What about 15 days of simulation?
Surface pressure after 15 days of simulation for the double-precision and the reduced-precision runs (quality of the simulation hardly reduced).
MAX-UP: Astro Chemistry
[Figure: CPU and DFE implementations compared.]

Does it work?
Test problem:
• 2D linear advection
• 4th-order Runge-Kutta
• Regular torus mesh
• Gaussian bump
The bump is advected across the torus mesh; after 20 timesteps it should be back where it started (a minimal 1D sketch of this setup follows below).
[Figure: the bump at t=20.]
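A minimal sequential C sketch of the test problem (illustrative only; the slides use a 2D torus mesh, reduced here to a 1D periodic grid to keep the structure visible): linear advection du/dt = -v du/dx advanced with classical 4th-order Runge-Kutta and a central-difference space derivative.

#include <stddef.h>

#define N 256  /* grid points on the periodic ("torus") domain */

/* Right-hand side of du/dt = -v du/dx with a central difference. */
static void rhs(double *dudt, const double *u, double v, double dx)
{
    for (size_t i = 0; i < N; i++) {
        size_t ip = (i + 1) % N;        /* periodic neighbours */
        size_t im = (i + N - 1) % N;
        dudt[i] = -v * (u[ip] - u[im]) / (2.0 * dx);
    }
}

/* One classical RK4 step; after enough steps the bump returns to its
 * starting position, which is the correctness check in the slides. */
void rk4_step(double *u, double v, double dx, double dt)
{
    double k1[N], k2[N], k3[N], k4[N], tmp[N];

    rhs(k1, u, v, dx);
    for (size_t i = 0; i < N; i++) tmp[i] = u[i] + 0.5 * dt * k1[i];
    rhs(k2, tmp, v, dx);
    for (size_t i = 0; i < N; i++) tmp[i] = u[i] + 0.5 * dt * k2[i];
    rhs(k3, tmp, v, dx);
    for (size_t i = 0; i < N; i++) tmp[i] = u[i] + dt * k3[i];
    rhs(k4, tmp, v, dx);
    for (size_t i = 0; i < N; i++)
        u[i] += dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
}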
CFD Performance
Max3A workstation with a Xilinx Virtex-6 475T plus a 4-core i7.
For this 2D linear advection test problem we achieve ca. 450M degree-of-freedom updates per second.
For comparison, a GPU implementation (of a Navier-Stokes solver) …
CFD Conclusions
• You really can do unstructured meshes on a dataflow accelerator
• You really can max out the DRAM bandwidth
• You really can get exciting performance
• You have to work pretty hard, or build on the work of others
• This was not an acceleration project: we designed a generic architecture for a family of problems
We’re Hiring
Candidate Profiles
Acceleration Architect (UK)
Application Engineer (USA)
System Administrator (UK)
Senior PCB Designer (UK)
Hardware Engineer (UK)
Networking Engineer (UK)
Electronics Technician (UK)