Transcript Slide 1

V. Milutinović, G. Rakocevic, S. Stojanović, and Z. Sustran
University of Belgrade
Oskar Mencer
Imperial College, London
Oliver Pell
Maxeler Technologies
Goran Dimic
Mihajlo Pupin Institute
1
Modify the Big Data eGov algorithms
to achieve, for the same hardware price:
a) a speed-up of 20-200
b) monthly electricity bills reduced 20 times
c) a size 20 times smaller
Absolutely all results achieved with:
a) all hardware produced in Europe (UK)
b) all software written by programmers
from the EU and WB
c) primary project beneficiaries in the EU;
secondary project beneficiaries worldwide
(this is a potential export product for the EU)
ControlFlow:
 Top500 ranks using Linpack (Japanese K,…)
DataFlow:
 Coarse Grain (HEP) vs. Fine Grain (Maxeler)
Compiling below the machine-code level brings speedups,
as well as lower power, size, and cost.
The price to pay:
The machine is more difficult to program.
Consequently:
Ideal for WORM (Write Once, Run Many) applications :)
Examples using Maxeler:
GeoPhysics (20-40), Banking (200-1000, with JP Morgan 20%),
M&C (New York City), Datamining (Google), …, eGov
6
7
t_CPU = N * N_OPS * C_CPU * T_clkCPU / N_coresCPU
t_GPU = N * N_OPS * C_GPU * T_clkGPU / N_coresGPU
t_DF = N_OPS * C_DF * T_clkDF + (N - 1) * T_clkDF / N_DF
Assumptions:
1. Software includes enough parallelism to keep all cores busy
2. The only limiting factor is the number of cores.
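The three latency models above can be plugged with numbers directly; a minimal sketch, where all parameter values are invented for illustration (they are not measurements of any real CPU, GPU, or DFE):

```python
def t_cpu(n, n_ops, c_cpu, t_clk, n_cores):
    # Control flow: every operation of every datum costs core cycles.
    return n * n_ops * c_cpu * t_clk / n_cores

def t_gpu(n, n_ops, c_gpu, t_clk, n_cores):
    # Same model as the CPU, just many more (slower-clocked) cores.
    return n * n_ops * c_gpu * t_clk / n_cores

def t_df(n, n_ops, c_df, t_clk, n_df):
    # Dataflow: pay the pipeline latency once, then stream one result
    # per tick, split across N_DF parallel pipelines.
    return n_ops * c_df * t_clk + (n - 1) * t_clk / n_df

# All parameter values below are invented for the example.
N, N_OPS = 10**9, 100
print(t_cpu(N, N_OPS, c_cpu=4, t_clk=1 / 3e9, n_cores=8))     # ~16.7 s
print(t_gpu(N, N_OPS, c_gpu=4, t_clk=1 / 1e9, n_cores=2000))  # ~0.2 s
print(t_df(N, N_OPS, c_df=1, t_clk=1 / 200e6, n_df=4))        # ~1.25 s
```

Note how, for large N, the dataflow term is dominated by (N - 1) * T_clk / N_DF: one result per clock tick per pipeline, which is why a DFE can win even at a 10x lower clock frequency.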
8
DualCore?
Which way are the horses
going?
9
 Is it possible
to use 2,000 chickens instead of two horses?
 Which is better, in reality and anecdotally?
10
2 x 1000 chickens (CUDA and rCUDA)
11
How about 2 000 000 ants?
12
[Figure: Big Data input pressed into results, like fruit into marmalade]
13
 Factor: 20 to 200
MultiCore/ManyCore (machine-level code)
vs. DataFlow (gate-transfer level)
14
 Factor: 20
MultiCore/ManyCore
Dataflow
15
 Factor: 20
MultiCore/ManyCore vs. DataFlow,
each split into Data Processing and Process Control
16
 MultiCore:
 Explain what to do to the driver
 Caches, instruction buffers, and predictors needed
 ManyCore:
 Explain what to do to many sub-drivers
 Reduced caches and instruction buffers needed
 DataFlow:
 Make a field of processing gates: 1 C file + 5 Java files
 No caches, etc. (300 students/year: BGD, BCN, LjU,…)
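The "field of processing gates" idea can be mimicked in software. Below is a toy pipeline simulator (not MaxJ, and not any Maxeler API; all names are invented) that streams data through a chain of gates, one item per tick, and shows why the total time is pipeline_depth + (N - 1) ticks, matching the t_DF formula:

```python
def run_pipeline(stages, stream):
    """Push `stream` through a chain of single-value `stages`, one tick
    at a time; returns (outputs, ticks)."""
    depth = len(stages)
    regs = [None] * depth          # one pipeline register per gate
    out, ticks = [], 0
    pending = list(stream)
    while pending or any(r is not None for r in regs):
        # advance the pipeline by one tick, last register first,
        # so each gate reads its predecessor's previous value
        for i in range(depth - 1, 0, -1):
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](pending.pop(0)) if pending else None
        ticks += 1
        if regs[-1] is not None:   # a result leaves the last gate
            out.append(regs[-1])
            regs[-1] = None
    return out, ticks

stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
outputs, ticks = run_pipeline(stages, range(8))
```

After the pipeline fills (3 ticks here), one result emerges per tick; there is no instruction fetch, cache, or predictor anywhere in the model.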
17
 MultiCore:
 Business as usual
 ManyCore:
 More difficult
 DataFlow:
 Much more difficult
 Debugging both application and configuration code
18
 MultiCore/ManyCore:
 Several minutes
 DataFlow:
 Several hours for the real hardware
 Fortunately, only several minutes for the simulator
 The simulator supports
both the large JPMorgan machine
and the smallest “University Support” machine
 Good news:
 Tabula@2GHz
19
20
 MultiCore:
 Horse stable
 ManyCore:
 Chicken house
 DataFlow:
 Ant hole
21
 MultiCore:
 Haystack
 ManyCore:
 Cornbits
 DataFlow:
 Crumbs
22
Small Data: Toy Benchmarks (e.g., Linpack)
23
Medium Data
(benchmarks
favoring Nvidia
over Intel, …)
24
Big Data
25
 Revisiting all major Big Data DM algorithms
 Massive static parallelism at low clock frequencies
 Concurrency and communication
 Concurrency between millions of tiny cores is difficult;
“jitter” between cores will harm performance
at synchronization points.
 Reliability and fault tolerance
 10-100x fewer nodes, so failures occur much less often
 Memory bandwidth and FLOP/byte ratio
 Optimize data choreography, data movement,
and the algorithmic computation.
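The FLOP/byte point can be made concrete with a back-of-the-envelope estimate for a 7-point 3D stencil, a typical finite-difference kernel; all counts below are illustrative assumptions (worst case, no cache reuse), not measurements:

```python
def stencil_intensity(flops_per_point=13,    # 7 mults + 6 adds (assumed)
                      reads_per_point=7,     # no cache reuse (worst case)
                      writes_per_point=1,
                      bytes_per_value=8):    # float64
    # Arithmetic intensity = floating-point ops per byte moved to/from DRAM.
    bytes_moved = (reads_per_point + writes_per_point) * bytes_per_value
    return flops_per_point / bytes_moved

intensity = stencil_intensity()          # 13/64 ≈ 0.20 FLOP/byte
# A machine with, say, 100 GFLOP/s and 50 GB/s (assumed numbers) needs
# >= 2 FLOP/byte to be compute-bound, so this kernel is memory-bound there.
machine_balance = 100e9 / 50e9           # FLOP/byte
memory_bound = intensity < machine_balance
```

Data choreography on a DFE attacks exactly this gap: on-chip buffering reuses each value many times, raising the effective FLOP/byte far above the naive estimate.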
26
Maxeler Hardware
CPUs plus DFEs:
Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
DFEs shared over Infiniband:
Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
Low-latency connectivity:
Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
MaxWorkstation:
Desktop development system
MaxCloud:
On-demand scalable accelerated compute resource, hosted in London
27
Major Classes of DM Algorithms,
from the Computational Perspective
1. Coarse grained, stateful
– CPU requires DFE for minutes or hours
2. Fine grained, stateless transactional
– CPU requires DFE for ms to s
– Many short computations
3. Fine grained, transactional with shared database
– CPU utilizes DFE for ms to s
– Many short computations, accessing common database data
28
Coarse Grained: FD Wave Modeling
29
[Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (0-70 Hz); equivalent CPU cores (0-2,000) vs. number of MAX2 cards (1, 4, 8) for 15, 30, 45, and 70 Hz peak frequencies]
• Long runtime, but:
• Memory requirements change dramatically based on modelled frequency
• Number of DFEs allocated to a CPU process can be easily varied to increase available memory
• Streaming compression
• Boundary data exchanged over chassis MaxRing
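At its core, FD wave modeling streams a stencil update over a grid, time step after time step. A minimal 1D CPU-side sketch of that kernel (grid size, pulse shape, and step sizes are assumed for illustration; real runs are 3D and far larger):

```python
import math

def fd_wave_1d(n=200, steps=100, c=1.0, dx=1.0, dt=0.5):
    r2 = (c * dt / dx) ** 2        # squared CFL number (must be <= 1)
    prev = [0.0] * n
    curr = [math.exp(-0.1 * (i - n // 2) ** 2) for i in range(n)]  # pulse
    for _ in range(steps):
        nxt = [0.0] * n            # fixed (zero) boundaries
        for i in range(1, n - 1):
            # second-order update: a 3-point stencil per grid cell
            nxt[i] = (2 * curr[i] - prev[i]
                      + r2 * (curr[i + 1] - 2 * curr[i] + curr[i - 1]))
        prev, curr = curr, nxt
    return curr

u = fd_wave_1d()
```

On a DFE the inner loop becomes a pipeline of gates with grid points streaming through one per tick, which is why memory capacity (not FLOP/s) ends up deciding how many DFEs a given modelled frequency needs.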
Fine Grained, Stateless: BSOP Control
• Analyse > 1,000,000 scenarios
• Many CPU processes run on many DFEs
– Each transaction executes on any DFE in the assigned group atomically
• ~50x MPC-X vs. multi-core x86 node
[Figure: many CPU processes (tail analysis on CPU, fed by market and instruments data) issue transactions to a group of DFEs; each DFE loops over instruments, generates random numbers, samples the underliers, and prices the instruments using Black-Scholes; instrument values are returned to the CPUs]
30
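The per-instrument work just described (random numbers, underlier sampling, Black-Scholes pricing) can be sketched as a plain Monte Carlo pricer. This is an illustrative CPU reference, not Maxeler or JP Morgan code; all parameter values are assumptions:

```python
import math
import random

def mc_call_price(s0, k, r, sigma, t, paths=100_000, seed=42):
    """Monte Carlo price of a European call under Black-Scholes dynamics."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma ** 2) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(paths):
        z = rng.gauss(0.0, 1.0)                # random number generator
        s_t = s0 * math.exp(drift + vol * z)   # sample the underlier
        payoff_sum += max(s_t - k, 0.0)        # call payoff
    return math.exp(-r * t) * payoff_sum / paths

price = mc_call_price(s0=100, k=100, r=0.05, sigma=0.2, t=1.0)
```

Each path is independent, so on a DFE the loop maps onto deep pipelines producing one path per clock tick, which is where the ~50x node-for-node figure comes from.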
Fine Grained, Shared Data: Monitoring
• DFE DRAM contains the database to be searched
• CPUs issue transactions find(x, db)
• Complex search function
– Text search against documents
– Shortest distance to coordinate (multi-dimensional)
– Smith-Waterman sequence alignment for genomes
• Any CPU runs on any DFE
that has been loaded with the database
– MaxelerOS may add or remove DFEs
from the processing group to balance system demands
– New DFEs must be loaded with the search DB before use
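One of the search functions named above, Smith-Waterman local alignment, sketched as a CPU reference; the scoring values (match +2, mismatch -1, gap -2) are assumed for illustration:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                    # local alignment resets at 0
                          h[i - 1][j - 1] + s,  # match / mismatch
                          h[i - 1][j] + gap,    # gap in b
                          h[i][j - 1] + gap)    # gap in a
            best = max(best, h[i][j])
    return best

score = smith_waterman("ACACACTA", "AGCACACA")
```

The cells along each anti-diagonal of h depend only on the previous two anti-diagonals, so on a DFE they are computed in parallel by a systolic field of gates while the database streams out of DFE DRAM.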
31
P. Marchetti et al, 2010
Trace Stacking: Speed-up 217
• DM for Monitoring and Control in Seismic processing
• Velocity independent / data driven method
to obtain a stack of traces, based on 8 parameters
– Search for every sample of each output trace
t²_hyp = ( t0 + (2/v0) wᵀ m )²
+ (2 t0 / v0) ( mᵀ H_zy K_N H_zyᵀ m + hᵀ H_zy K_NIP H_zyᵀ h )
2 parameters (emergence angle & azimuth)
3 Normal Wave front parameters (K_N,11; K_N,12; K_N,22)
3 NIP Wave front parameters (K_NIP,11; K_NIP,12; K_NIP,22)
32
Conclusion: Nota Bene
This is about algorithmic changes,
to maximize
the algorithm-to-architecture match!
The winning paradigm
of Big Data eGov?
33
The TriPeak
BSC
+ Imperial College
+ Maxeler
+ Belgrade
34
The TriPeak
MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)
Maxeler = A FineGrain DataFlow (FPGA)
How about a happy marriage
of MontBlanc and Maxeler?
In each happy marriage,
it is known who does what :)
The Big Data eGov algorithms:
What part goes to MontBlanc and what to Maxeler?
35
Core of the Symbiotic Success:
An intelligent DM algorithmic scheduler,
partially implemented for compile time,
and partially for run time.
At compile time:
Checking what part of the code fits where
(MontBlanc or Maxeler).
At run time:
Rechecking the compile time decision,
based on the current data values.
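A minimal sketch of such a two-phase scheduler; the fit table, kernel names, and the size threshold are all invented for illustration (the slide only proposes the idea, it does not specify them):

```python
COMPILE_TIME_FIT = {
    # kernel name -> preferred target, decided from static code structure
    "dense_streaming_stencil": "Maxeler",    # regular, deeply pipelinable
    "irregular_graph_walk": "MontBlanc",     # data-dependent control flow
}

def schedule(kernel, data_size):
    """Run-time recheck of the compile-time decision, based on the
    current data (here: just its size)."""
    target = COMPILE_TIME_FIT.get(kernel, "MontBlanc")
    # Small inputs never amortize DFE pipeline fill and transfer cost,
    # so fall back to the CPU/GPU side (threshold is an assumption).
    if target == "Maxeler" and data_size < 100_000:
        target = "MontBlanc"
    return target

print(schedule("dense_streaming_stencil", 10**9))  # Maxeler
print(schedule("dense_streaming_stencil", 10**3))  # MontBlanc
print(schedule("irregular_graph_walk", 10**9))     # MontBlanc
```

The compile-time table answers "what part of the code fits where"; the run-time check revisits that answer once the actual data values (sizes) are known.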
36
37
© H. Maurer
38
39
Q&A
[email protected]
40