Trends and Perspectives for HPC infrastructures



Carlo Cavazzoni, CINECA
outline
- HPC resources in Europe (PRACE)
- Today's HPC architectures
- Technology trends
- CINECA roadmap (towards 50 PFlops)
- EuroExa project
The PRACE RI provides access to distributed, persistent, pan-European, world-class HPC computing and data management resources and services. Expertise in the efficient use of the resources is available through participating centres throughout Europe. Available resources are announced for each Call for Proposals.
The PRACE resource pyramid:
- European: Tier-0
- National: Tier-1
- Local: Tier-2

Peer-reviewed open access:
- PRACE Projects (Tier-0)
- PRACE Preparatory (Tier-0)
- DECI Projects (Tier-1)
TIER-0 Systems, PRACE regular calls
- CURIE (GENCI, Fr): Bull cluster, Intel Xeon nodes, Nvidia cards, Infiniband network
- FERMI (CINECA, It) & JUQUEEN (Juelich, D): IBM BGQ, Power processors, custom 5D torus network
- MARENOSTRUM (BSC, S): IBM iDataPlex, Intel Xeon nodes, Infiniband network
- HERMIT (HLRS, D): Cray XE6, AMD processors, custom 3D torus network, 1 PFlops
- SuperMUC (LRZ, D): IBM iDataPlex, Intel Xeon nodes, Infiniband network
TIER-1 Systems, DECI calls

DECI site | Machine name | System type | Chip | Peak perf. (TFlops) | GPU cards
Bulgaria (NCSA) | EA "ECNIS" | IBM BG/P | PowerPC 450 | 27 |
Czech Republic (VSB-TUO) | Anselm | Bull Bullx | Intel Sandy Bridge-EP | 66 | 23 nVIDIA Tesla + 4 Intel Xeon Phi P5110
Finland (CSC) | Sisu | Cray XC30 | Intel Sandy Bridge | 244.9 |
France (CINES) | Jade | SGI ICE EX8200 | Intel Quad-Core E5472/X5560 | 267.88 |
France (IDRIS) | Babel | IBM BG/P | PowerPC 450 | 139 |
Germany (Jülich) | JuRoPA | Intel cluster | Intel Xeon X5570 | 207 |
Germany (RZG) | Genius | IBM BG/P | PowerPC 450 | 54 |
Germany (RZG) | - | iDataPlex | Intel Sandy Bridge | 200 |
Ireland (ICHEC) | Stokes | SGI ICE 8200EX | Intel Xeon E5650 | - |
Italy (CINECA) | PLX | iDataPlex | Intel Westmere | 293 | 548 nVIDIA Tesla M2070/M2070Q
Norway (SIGMA) | Abel | MegWare cluster | Intel Sandy Bridge | 260 |
Poland (WCSS) | Supernova | Cluster | Intel Westmere-EP | 51.58 |
Poland (PSNC) | chimera | SGI UV1000 | Intel Xeon E7-8837 | 21.8 |
Poland (PSNC) | cane | AMD+GPU cluster | AMD Opteron 6234 | 224.3 | 334 nVIDIA Tesla M2050
Poland (ICM) | boreasz | IBM Power 775 (Power7) | IBM Power7 | 74.5 |
Poland (Cyfronet) | Zeus-gpgpu | Linux cluster | Intel Xeon X5670/E5645 | 136.8 | 48 M2050 / 160 M2090
Spain (BSC) | MinoTauro | Bull CUDA cluster | Intel Xeon E5649 | 182 | 256 nVIDIA Tesla M2090
Sweden (PDC) | Lindgren | Cray XE6 | AMD Opteron | 305 |
Switzerland (CSCS) | Monte Rosa | Cray XE6 | AMD Opteron | 402 |
The Netherlands (SARA) | Huygens | IBM pSeries 575 | Power 6 | 65 |
Turkey (UYBHM) | Karadeniz | HP cluster | Intel Xeon 5550 | 2.5 |
UK (EPCC) | HECToR | Cray XE6 | AMD Opteron | 829.03 |
UK (ICE-CSE) | ICE Advance | IBM BG/Q | PowerPC A2 | 1250 |
HPC Architectures
Two models:

Hybrid:
- Server class processors / server class nodes
- Special purpose nodes
- Accelerator devices: Nvidia, Intel, AMD, FPGA

Homogeneous:
- Server class nodes: standard processors
- Special purpose nodes: special purpose processors

Networks
- Standard / switched: Infiniband
- Special purpose / topology: BGQ, CRAY, TOFU (Fujitsu), TH Express-2 (Tianhe-2)
Programming Models
Fundamental paradigms:
- Message passing
- Multi-threading
Consolidated standard: MPI & OpenMP
New task-based programming models
Special purpose for accelerators:
- CUDA
- Intel offload directives
- OpenACC, OpenCL, etc.
- NO consolidated standard
Scripting:
- Python
(A minimal hybrid MPI + OpenMP sketch follows this list.)
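To make the consolidated MPI & OpenMP standard concrete, here is a minimal hybrid sketch (not from the talk): MPI ranks handle message passing between nodes, while OpenMP threads share memory within a node. The file name and build line mentioned below are illustrative assumptions.

```c
/* Minimal hybrid MPI + OpenMP sketch: one MPI rank per node (message
   passing), several OpenMP threads per rank (multi-threading). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nranks;

    MPI_Init(&argc, &argv);                  /* message passing layer */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel                     /* threading inside the node */
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        printf("rank %d/%d, thread %d/%d\n", rank, nranks, tid, nthreads);
    }

    MPI_Finalize();
    return 0;
}
```

A typical (site-dependent) build and launch would look like `mpicc -fopenmp hybrid.c -o hybrid` followed by `mpirun -np <ranks> ./hybrid`, with `OMP_NUM_THREADS` set to the number of cores per node; the exact compiler wrapper and launcher depend on the installed MPI stack.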
Roadmap to Exascale
(architectural trends)

Dennard scaling law (downscaling), old VLSI gen. vs. new VLSI gen.:

Ideal Dennard scaling (does not hold anymore!):
L' = L / 2
V' = V / 2
F' = 2 * F
D' = 1 / L'^2 = 4 * D
P' = P

The core frequency and performance no longer grow following Moore's law.

What happens with current VLSI generations:
L' = L / 2
V' ~ V
F' ~ 2 * F
D' = 1 / L'^2 = 4 * D
P' = 4 * P

The power crisis!
Increase the number of cores to keep the evolution of the architectures on Moore's law.
The programming crisis!
(An illustrative derivation of the power scaling follows below.)
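The two P' results above can be checked with the standard dynamic-power relation, P_device ∝ C·V²·F, with chip power proportional to the device density D. This derivation is not on the slide; it is an illustrative reconstruction that additionally assumes the capacitance C scales with the feature size L.

```latex
% Illustrative check of the slide's scaling relations (assumes C' = C/2).
\[
\begin{aligned}
\text{Dennard scaling:}\quad
  & C' = \tfrac{C}{2},\; V' = \tfrac{V}{2},\; F' = 2F,\; D' = 4D \\
  & P' \propto 4D \cdot \tfrac{C}{2}\left(\tfrac{V}{2}\right)^{2}(2F)
      = D\,C\,V^{2}F \;\Rightarrow\; P' = P \\[4pt]
\text{Voltage no longer scales:}\quad
  & C' = \tfrac{C}{2},\; V' \approx V,\; F' \approx 2F,\; D' = 4D \\
  & P' \propto 4D \cdot \tfrac{C}{2}\,V^{2}(2F)
      = 4\,D\,C\,V^{2}F \;\Rightarrow\; P' = 4P
\end{aligned}
\]
```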
Moore’s Law
Economic and market law
Stacy Smith, Intel's chief financial officer, later gave some more detail on the economic benefits of staying in the Moore's Law race.
The cost per chip “is going down more than the capital intensity is going up,” Smith said,
suggesting Intel’s profit margins should not suffer because of heavy capital spending. “This is the
economic beauty of Moore’s Law.”
And Intel has a good handle on the next production shift, shrinking circuitry to 10 nanometers.
Holt said the company has test chips running on that technology. “We are projecting similar kinds
of improvements in cost out to 10 nanometers,” he said.
So, despite the challenges, Holt could not be induced to say there’s any looming end to Moore’s
Law, the invention race that has been a key driver of electronics innovation since first defined by
Intel’s co-founder in the mid-1960s.
(From the WSJ)

It is all about the number of chips per Si wafer!
But!
14 nm VLSI vs. a 0.54 nm Si lattice: ~300 atoms!
There are still 4~6 cycles (or technology generations) left until we reach 11 ~ 5.5 nm technologies, at which point we will reach the downscaling limit, in some year between 2020 and 2030 (H. Iwai, IWJT2008).
What about Applications?
In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1 / (1 − P), where P is the parallel fraction.
Example: on 1,000,000 cores this requires P = 0.999999, i.e. a serial fraction of only 0.000001 (a numeric sketch follows below).
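A small numeric sketch of this bound, using the slide's own numbers (the function name is an illustrative choice, not from the talk):

```c
/* Amdahl's law: speedup(N) = 1 / ((1 - P) + P / N);
   the N -> infinity limit is 1 / (1 - P), as quoted above. */
#include <stdio.h>

static double amdahl_speedup(double P, double ncores)
{
    return 1.0 / ((1.0 - P) + P / ncores);
}

int main(void)
{
    double P = 0.999999;   /* parallel fraction (serial fraction 1e-6) */
    double N = 1.0e6;      /* one million cores, as in the example     */

    printf("speedup on %.0f cores   : %.0f\n", N, amdahl_speedup(P, N));
    printf("asymptotic limit 1/(1-P): %.0f\n", 1.0 / (1.0 - P));
    return 0;
}
```

Even with P = 0.999999, the achievable speedup on 10^6 cores is only about 500,000, half of the asymptotic limit of 10^6, which is the point of the example.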
Architectural trends
- Peak performance: Moore's law
- FPU performance: Dennard's law
- Number of FPUs: Moore + Dennard
- Application parallelism: Amdahl's law
HPC Architectures
Two models:
- Hybrid, but…
- Homogeneous, but…

Which 100 PFlops systems will we see… my guess:
- IBM (hybrid): Power8 + Nvidia GPU
- Cray (homo/hybrid): with Intel only!
- Intel (hybrid): Xeon + MIC
- ARM (homo): ARM chips only, but…
- Nvidia/ARM (hybrid): ARM + Nvidia
- Fujitsu (homo): SPARC, high density, low power
- China (homo/hybrid): with Intel only
- Room for AMD console chips
Chip Architecture
Strongly market driven: mobile, TV sets, screens, video/image processing.
- Intel: new architectures to compete with ARM; less Xeon, but PHI
- ARM: main focus on low-power mobile chips (Qualcomm, Texas Instruments, Nvidia, ST, etc.); new HPC and server markets
- NVIDIA: GPU alone will not last long; ARM+GPU, Power+GPU
- Power: embedded market; Power+GPU the only chance for HPC
- AMD: console market; still some chance for HPC
CINECA Roadmaps
Roadmap to 50 PFlops (time line: 2013 - 2020)

Power consumption:
- EURORA 50 KW, PLX 350 KW, BGQ 1000 KW + ENI
- EURORA or PLX upgrade 400 KW; BGQ 1000 KW; data repository 200 KW; ENI

R&D:
- Eurora
- EuroExa STM / ARM board
- EuroExa STM / ARM prototype
- PCP proto, 1 PF in a rack
- EuroExa STM / ARM PF platform
- ETP proto towards exascale board

Deployment:
- Eurora industrial prototype, 150 TF
- Eurora or PLX upgrade, 1 PF peak / 350 TF scalar
- Multi-petaflop system
- Tier-1 towards exascale
- Tier-0 50 PF
Tier 1 CINECA
Procurement Q2014

High-level system requirements:
- Electrical power consumption: 400 KW
- Physical size of the system: 5 racks
- System peak performance (CPU+GPU): on the order of 1 PFlops
- System peak performance (CPU only): on the order of 300 TFlops
Tier 1 CINECA
High-level system requirements:
- CPU architecture: Intel Xeon Ivy Bridge
- Number of cores per CPU: 8 @ >3 GHz, or 12 @ 2.4 GHz
  The choice of frequency and core count depends on the socket TDP, on the system density and on the cooling capacity.
- Number of servers: 500 - 600
  ( Peak perf = 600 * 2 sockets * 12 cores * 3 GHz * 8 Flop/clock = 345 TFlops )
  The number of servers may depend on cost or on the configuration geometry, i.e. on the number of CPU-only nodes and of CPU+GPU nodes.
- GPU architecture: Nvidia K40
- Number of GPUs: >500
  ( Peak perf = 700 * 1.43 TFlops = 1 PFlops )
  The number of GPU cards may depend on cost or on the configuration geometry, i.e. on the number of CPU-only nodes and of CPU+GPU nodes.
(A small sketch reproducing these peak estimates follows below.)
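Both peak estimates are simple products of unit counts and per-unit throughput. A minimal sketch reproducing the slide's arithmetic, using only the numbers given above:

```c
/* Reproduce the peak-performance estimates quoted in the requirements:
   CPU: 600 servers * 2 sockets * 12 cores * 3 GHz * 8 Flop/clock
   GPU: 700 Nvidia K40 cards * 1.43 TFlops each                      */
#include <stdio.h>

int main(void)
{
    double cpu_tflops = 600 * 2 * 12 * 3.0e9 * 8 / 1.0e12;  /* ~345.6 TFlops */
    double gpu_tflops = 700 * 1.43;                          /* ~1001 TFlops  */

    printf("CPU-only peak: %.1f TFlops\n", cpu_tflops);
    printf("GPU peak     : %.1f TFlops (about 1 PFlops)\n", gpu_tflops);
    return 0;
}
```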
Tier 1 CINECA
High-level system requirements:
- Identified vendors: IBM, Eurotech
- DRAM memory: 1 GByte/core
  The option of a subset of nodes with a larger amount of memory will be requested.
- Local non-volatile memory: >500 GByte
  SSD/HD depending on cost and on the system configuration.
- Cooling: liquid cooling system with a free-cooling option
- Scratch disk space: >300 TByte (provided by CINECA)
Thank you