
Many Integrated Core Prototype
G. Erbacci – CINECA
PRACE Autumn School 2012 on Massively Parallel Architectures and Molecular Simulations
Sofia, 24-28 September 2012
Outline
• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC
2
Many Integrated Core Prototype
• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC
3
HPC at CINECA
CINECA: National Supercomputing Centre in Italy
• manages the HPC infrastructure
• provides support to Italian and European researchers
• promotes technology transfer initiatives for industry
• CINECA is a Hosting Member in PRACE
– PLX: Linux cluster with GPUs (Tier-1 in PRACE)
– FERMI: IBM BG/Q (Tier-0 in PRACE)
4
PLX@CINECA
IBM Linux cluster
Processor type: 2 six-core Intel Xeon X5645 (Westmere) @ 2.40 GHz, 12 MB cache
N. of nodes / cores: 274 / 3288
RAM: 48 GB per compute node (14 TB in total)
Internal network: Infiniband with 4x QDR switches (40 Gbps)
Accelerators: 2 nVIDIA M2070 GPUs per node (548 GPUs in total)
Peak performance: 32 TFlops (CPUs)
565 TFlops SP (GPUs)
283 TFlops DP (GPUs)
5
FERMI@CINECA
Architecture: 10 BG/Q frames
Model: IBM BG/Q
Processor type: IBM PowerA2 @ 1.6 GHz
Computing cores: 163840
Computing nodes: 10240
RAM: 1 GByte / core (163 TByte in total)
Internal network: 5D Torus
Disk space: 2 PByte of scratch space
Peak performance: 2 PFlop/s
N. 7 in the Top 500 ranking (June 2012)
National and PRACE Tier-0 calls
6
CINECA HPC Infrastructure
7
Computational Sciences
Computational science (together with theory and experimentation) is the "third pillar" of scientific inquiry, enabling researchers to build and test models of complex phenomena.
Quick evolution of innovation:
- Instantaneous communication
- Geographically distributed work
- Increased productivity
- More data everywhere
- Increasing problem complexity
- Innovation happens worldwide
8
Technology Evolution
More data everywhere:
Radar, satellites, CAT scans, sensors, micro-arrays, weather models, the human genome.
The size and resolution of the problems scientists address today are
limited only by the size of the data they can reasonably work with.
There is a constantly increasing demand for faster
processing on bigger data.
Increasing problem complexity
Partly driven by the ability to handle bigger data, but also by the
requirements and opportunities brought by new technologies. For
example, new kinds of medical scans create new computational
challenges.
HPC Evolution
As technology allows scientists to handle bigger datasets and faster
computations, they push to solve harder problems.
In turn, the new class of problems drives the next cycle of technology
innovation.
9
Top 500: some facts
1976: Cray 1 installed at Los Alamos: peak performance 160 MegaFlop/s (10^6 flop/s)
1993 (1st edition of the Top 500): N. 1 at 59.7 GFlop/s
1997: Teraflop/s barrier (10^12 flop/s)
2008: Petaflop/s barrier (10^15 flop/s): Roadrunner (LANL), Rmax 1026 TFlop/s, Rpeak 1375 TFlop/s;
a hybrid system: 6562 dual-core AMD Opteron processors accelerated with 12240 IBM Cell processors (98 TByte of RAM)
2012 (June): 16.3 Petaflop/s: Lawrence Livermore's Sequoia supercomputer, a BlueGene/Q with 1,572,864 cores
- 4 European systems in the Top 10
- Total combined performance of all 500 systems has grown to 123.02 PFlop/s, compared to 74.2 PFlop/s six months ago
- 57 systems use accelerators
- Toward Exascale
10
Dennard Scaling law (MOSFET)
Classical Dennard scaling (does not hold anymore!):
  L' = L / 2
  V' = V / 2
  F' = F * 2
  D' = 1 / L^2 = 4D
  P' = P
What happens today:
  L' = L / 2
  V' = ~V
  F' = ~F * 2
  D' = 1 / L^2 = 4D
  P' = 4P
The power crisis! The core frequency and performance no longer grow following Moore's law.
CPU + Accelerator to keep the architecture evolution on the Moore's law track.
The programming crisis!
11
Roadmap to Exascale (architectural trends)
12
Heterogeneous Multi-core Architecture
• Combines different types of processors
  – Each optimized for a different operational modality
  – Performance: the synthesis favors superior performance for complex computation exhibiting distinct modalities
• Purpose-designed accelerators
  – Integrated to significantly speed up some critical aspect of one or more important classes of computation
  – IBM Cell architecture, ClearSpeed SIMD attached array processor, ...
• Conventional co-processors
  – Graphical processing units (GPU)
  – Network controllers (NIC)
  – Many Integrated Cores (MIC)
  – Efforts underway to apply existing special-purpose components to general applications
13
Accelerators
A set (one or more) of very simple execution units that can perform a limited set of operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), they can accelerate the "nominal" speed of a system.
[Diagram: CPU optimized for single-thread performance, accelerator for throughput; evolution from physical integration of CPU and accelerator to architectural integration.]
14
nVIDIA GPU
The Fermi implementation packs 512 processor cores.
15
ATI FireStream, AMD GPU
2012: new Graphics Core Next ("GCN") architecture, with a new instruction set and a new SIMD design.
16
Intel MIC (Knights Ferry)
17
Real HPC Crisis is with Software
A supercomputer application and its software are usually much longer-lived than the hardware
- Hardware life: typically four to five years at most
- Fortran and C are still the main programming models
Programming is stuck
- Arguably it hasn't changed much since the 70s
Software is a major cost component of modern technologies
- The tradition in HPC system procurement is to assume that the software is free
It's time for a change
- Complexity is rising dramatically
- Challenges for the applications on Petaflop systems
- Improvement of existing codes will become complex and partly impossible
- The use of O(100K) cores implies a dramatic optimization effort
- New paradigms, such as the support of a hundred threads in one node, imply new parallelization strategies
- The implementation of new parallel programming methods in existing large applications does not always have a promising perspective
There is a need for new community codes
18
What about parallel applications?
• In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).
• The maximum speedup tends to 1/(1−P), where P is the parallel fraction.
• Example: 1,000,000 cores, P = 0.999999, serial fraction = 0.000001.
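In formula form (a standard restatement of Amdahl's law, with N cores and parallel fraction P):

    S(N) = \frac{1}{(1 - P) + \frac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - P}

For the example above, P = 0.999999 gives an asymptotic limit of 10^6; on N = 10^6 cores the achievable speedup is S(10^6) = 1 / (10^{-6} + 0.999999 \times 10^{-6}) \approx 5 \times 10^5, i.e. only about half of the ideal linear speedup.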
19
Trends
Application trends: scalar applications, vector, distributed memory, shared memory, hybrid codes.
Programming models: MPP systems with message passing (MPI), multi-core nodes (OpenMP), accelerators (GPGPU, FPGA: CUDA, OpenCL), hybrid codes.
20
Many Integrated Core Prototype
• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC
21
EURORA Prototype
• Evolution of the AURORA architecture by Eurotech (http://www.eurotech.com/)
  – Aurora rack: 256 nodes, 512 CPUs
  – 101 TFlops @ 100 kW
  – liquid cooled
• CPU: Xeon Sandy Bridge (SB)
  – Up to one full cabinet (128 nodes + 256 accelerators)
• Accelerator: Intel Many Integrated Cores (MIC)
• Network architecture: IB and Torus interconnect
  – Low latency / high bandwidth interconnect
• Cooling: hot water
22
EURORA chassis
16 node cards, or 8 node cards + 16 accelerators
1 rack = 16 chassis
Eurora rack:
- Physical dimensions: 2133 mm (48U) h, 1095 mm w, 1500 mm d
- Weight (full rack with cooling, fully loaded with water): 2000 kg
- Power/cooling typical requirements: 120-130 kW @ 48 Vdc
23
EURORA node
• 2 Intel Xeon E5
• 2 Intel MIC or 2 nVidia Kepler
• 16 GByte DDR3 1.6 GHz
• SSD disk
24
Node card mockup
• Presented at ISC12
• Can host MIC and K20 cards
• Thermal analysis and validation performed
25
EURORA Network
3D Torus custom network:
- FPGA (Altera Stratix V)
- EXTOLL, APENET
- Ad-hoc MPI subset
InfiniBand FDR:
- Mellanox ConnectX3
- MPI + filesystem
- Synchronization
26
Cooling
• Hot water, 50-80 °C
• Temperature gap 3-5 °C
• No rotating fans
• Cold plates – direct on-component liquid cooling
• Dry chillers
• Free cooling
• Temperature sensors – performance is downgraded if required
• System isolation
• Quick disconnect
27
EURORA prototype (Node Accelerator)
EURopean many integrated cORe Architecture
Goal: evaluate a new architecture for next generation Tier-0 system
Partners:
- CINECA, Italy
- GRNET, Greece
- IPB, Serbia
- NCSA, Bulgaria
Vendor:
Eurotech, Italy
28
EURORA Installation Plan
29
HW Procurement
• Contract with EUROTECH signed in July
  – 64 compute cards
  – 128 Xeon SandyBridge 3.1 GHz
  – 16 GByte DDR3 1600 MHz per node
  – 160 GByte SSD per node
  – 1 FPGA (Altera Stratix V) per node
  – IB FDR
  – 128 accelerator cards
    • Intel KNC (or NVIDIA K20)
  – Thermal sensors network
30
HW Procurement and Facility
• Contract with EUROTECH signed in July
• Integration in the Facility
– First assessment of the location with EUROTECH in May
– First project of integration completed
• Estimated cost higher than budgeted
– Second assessment with EUROTECH in September (before the
end)
– Procurement of the technology:
• Dry coolers, pipes and pumps, exchanger, tanks, filters
31
Some Applications
• www.quantum-espresso.org
• www.gromacs.org
32
EURORA Programming Models
• Message Passing (MPI)
• Shared Memory (OpenMP, TBB)
• MIC offload (pragmas) / native
• Hybrid: MPI + OpenMP + MIC extensions / OpenCL (a minimal sketch follows below)
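A minimal sketch of the hybrid MPI + OpenMP model, assuming one MPI rank per node or device with an OpenMP thread team inside; the same source can be built for the host or, with -mmic, natively for the MIC:

    /* Hybrid MPI + OpenMP sketch (illustrative): each rank spawns an
       OpenMP team and reports its placement. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());

        MPI_Finalize();
        return 0;
    }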
33
ACCELERATORS
• First K20 and KNC (dense form factor) samples in September
• KNC standard expansion module already available to start the work on software
34
Software
• Installation of the KNC software kit
• Test of the compiler and of the node card HW
• First simple (MPI + OpenMP) application test
• First MIC-to-MIC MPI communication test
  – Intel MPI
  – within the same node
• Test of the affinity
35
ACCESS
• Access will be granted upon request to the partners of the prototype project.
• Other requests will be evaluated case by case.
• We are working to grant early access to the KNC board already installed.
36
Expected results
• Validate the node card design;
• Density in the order of 500 TFlops/rack (BG/Q is 200 TFlops/rack);
• 3D Torus network scalability and performance vs InfiniBand;
• Power Usage Effectiveness (PUE) close to or less than 1.1 (free cooling most of the year);
• Programming model for the MIC accelerator;
• Improved application efficiency with respect to multi-core clusters;
• Bridge the gap with exascale machines.
37
Many Integrated Core Prototype
• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC
38
MIC Architecture
• MIC: Many Integrated Core
• Knights Corner co-processor
• Intel Xeon Phi co-processor
  – 22 nm technology
  – > 50 Intel Architecture cores
  – connected by a high-performance on-die bidirectional interconnect
  – I/O bus: PCIe
  – Memory type: GDDR5, with > 2x the bandwidth of KNF
  – Memory size: 8 GB of GDDR5 memory
  – Peak performance: > 1 TFlop/s (DP)
  – Single Linux image per chip
39
MIC Intel Xeon Phi Ring
Each microprocessor core is a fully functional, in-order core capable of running IA instructions independently of the other cores.
Hardware multi-threaded cores: each core can concurrently run instructions from four processes or threads.
The Ring Interconnect connects all the components together on the chip.
40
The Processor Core
- Fetches and decodes instructions from four hardware thread execution contexts
- Executes the x86 ISA and the Knights Corner vector instructions
- The core can execute 2 instructions per clock cycle, one per pipe
- 32 KB, 8-way set associative L1 instruction and data caches
- Core Ring Interface (CRI)
- L2 cache
- Memory controllers (which access external memory devices to read and write data)
- PCI Express client: the system interface to the host CPU or PCI Express switch
41
The vector processing unit
A vector processing unit (VPU) is associated with each core.
This is primarily a sixteen-element-wide SIMD engine, operating on 512-bit vector registers.
Gather / Scatter unit
Vector mask support
42
Vector processing

    do i = 1, N
       A(i) = B(i) + C(i)
    end do

The loop maps onto the vector operation V0 <- V1 + V2 on the floating-point add functional unit.
[Pipeline diagram: at each clock period (CP 0 ... CP 7) a new pair of operands B(i), C(i) enters the add pipeline; once the pipeline is full, the first result B(1) + C(1) emerges, followed by one result per clock.]
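The same loop in C, as a hedged sketch (function and array names are illustrative): with the Intel compiler, the loop body is a natural candidate for the 512-bit VPU, which packs 16 single-precision or 8 double-precision elements per vector instruction.

    /* Illustrative C version of the vectorizable loop above; the compiler
       can auto-vectorize it for the 512-bit VPU. */
    void vec_add(float *restrict a, const float *restrict b,
                 const float *restrict c, int n)
    {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }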
43
The L2 Cache
• Each core has a 512 KB L2 cache
• The L2 cache is part of the Core-Ring Interface block
• The L2 cache is private to the core: each core acts as a stand-alone core with 512 KB of total L2 cache space
• Other cores cannot directly use it as a cache
• 512 KB x > 50 cores -> > 25 MB of L2 on Knights Corner
• Tag Directory on each core, not private to the core
• A simplified way to view the many cores in Knights Corner is as a chip-level symmetric multiprocessor (SMP): > 50 such cores share a high-speed on-die interconnect.
44
The Ring Interconnect
• Knights Corner has 10 rings (5 in each direction):
  – the BL ring carries the data
  – 2 AR rings carry addresses
  – 2 AK rings carry coherence information
• Knights Corner can send data across the ring once per clock per controller
45
Many Integrated Core Prototype
• HPC evolution
• The Eurora Prototype
• MIC architecture
• Programming MIC
46
Compute modes vision
A spectrum of compute modes, from Xeon Centric to MIC Centric:
- Xeon Hosted: general purpose serial and parallel computing
- Scalar Co-processing: codes with highly-parallel phases
- Symmetric: codes with balanced needs
- Parallel Co-processing: highly parallel codes with scalar phases
- MIC Hosted: highly-parallel codes
47
Execution modes (host Xeon connected to the MIC over PCIe):
- Xeon Native: Main( ), Foo( ) and MPI_*( ) all run on the Xeon
- Xeon-hosted, MIC co-processed (offload): Main( ) and MPI_*( ) run on the Xeon, Foo( ) is offloaded to the MIC
- Autonomous mode: Main( ), Foo( ) and MPI_*( ) run on both the Xeon and the MIC
- MIC-hosted, Xeon co-processed: Main( ) and MPI_*( ) run on the MIC, Foo( ) is offloaded to the Xeon
- MIC Native: Main( ), Foo( ) and MPI_*( ) all run on the MIC
Offload: Intel C, C++, Fortran Compiler
Native: -mmic
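As an illustration of the Xeon-hosted / MIC-coprocessed mode, here is a minimal sketch using the Intel compiler's offload pragma with OpenMP inside the offloaded region; the array names and sizes are illustrative, not taken from the slides.

    /* Hedged offload sketch: the marked region runs on the MIC card when
       one is available; the in/out clauses move the arrays over PCIe. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        float a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f * i; }

        #pragma offload target(mic) in(b, c) out(a)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; i++)
                a[i] = b[i] + c[i];
        }

        printf("a[%d] = %f\n", N - 1, a[N - 1]);
        return 0;
    }

For a native run, the same computation can instead be built directly with -mmic (no offload pragma needed) and executed on the coprocessor.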
48
MPI programming models for MIC
49
Offload Model
This model is characterized by MPI communications taking place only between the host processors. The co-processors are used exclusively through the offload capabilities of products like the Intel C, C++, and Fortran Compilers for the Intel MIC Architecture, the Intel Math Kernel Library (MKL), etc.
MPI on host devices with offload to co-processors: this mode of operation is already supported by the Intel MPI Library for Linux OS.
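A minimal sketch of this model, assuming one MPI rank per host processor, each offloading its local kernel to the attached coprocessor (names, sizes, and the placeholder kernel are illustrative):

    /* Offload MPI model sketch: MPI ranks live on the host Xeons only;
       each rank offloads its compute-intensive part to the local MIC,
       while MPI traffic stays host-to-host. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N (1 << 20)

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *x = malloc(N * sizeof *x);
        for (int i = 0; i < N; i++) x[i] = 1.0;

        #pragma offload target(mic) in(rank) inout(x : length(N))
        {
            for (int i = 0; i < N; i++)
                x[i] = (x[i] + rank) * 0.5;   /* placeholder kernel */
        }

        double local = x[0], global;
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        free(x);
        MPI_Finalize();
        return 0;
    }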
50
Symmetric Model
The MPI processes reside on both the host and the MIC devices.
This model involves both the host CPUs and the co-processors in the execution of the MPI processes and the related MPI communications.
Message passing is supported inside the co-processor, inside the host node, and between the co-processor and the host.
This is the most general MPI view of an essentially heterogeneous cluster.
51
Co-processor-only Model (or MPI Native)
The MPI processes reside on the MIC co-processors only.
The MPI libraries, the application, and the other needed libraries are uploaded to the co-processors.
An application can be launched from the host or from the co-processor.
This can be seen as a specific case of the symmetric model.
52
53
FARM: Air quality Model
3D Eulerian chemical-transport model (CTM); Fortran 77/90.
• Used to study the transport, chemical conversion and deposition of atmospheric pollutants;
• Manages multiple nested grids with different resolutions;
• Can be compiled in four different ways:
  - Serial;
  - OpenMP;
  - MPI (Master-Worker strategy);
  - Hybrid (OpenMP + MPI).
• Good candidate to be tested on MIC:
  - Native compilation: all FARM processes run on the MIC.
  - Symmetric compilation: the Master process runs on the host; the Worker processes run on the MIC.
54
Native compilation
55
Native compilation
56
Porting on MIC
Pros:
• Compilation requires only an additional Intel compiler flag (-mmic);
• Scalability tests: fast and smooth;
• Quick analysis with the Intel tools (VTune, ITAC – Intel Trace Analyzer and Collector);
• Porting time: one day, including validation of the numerical results, for an expert developer of FARM with good knowledge of the Intel compiler BUT only a basic knowledge of MIC;
• Best scalability with OpenMP and Hybrid.
Cons:
• MPI_Init routine problem: CPU time increases with the number of processes;
• The same problem appears when using two MICs together;
• A detailed analysis of the OpenMP threads is currently not available;
• Execution time depends strongly on code vectorization, so the compiler's vectorization capabilities and the code structure are key to obtaining satisfactory overall performance.
57
Conclusions
• Hybrid clusters can bridge the gap with exascale machines
• Test and monitor technologies that could influence the design of Exascale
systems
• Introduce fault tolerance at system and application level
• New hybrid programming models
– Focus on the applications
– Big Data Challenge
– Uncertainties Challenge
– coupling Architecture-Algorithm-Application
58