Transcript Document

Trends and Challenges
in
High Performance Computing
Hai Xiang Lin
Delft Institute of Applied Mathematics
Delft University of Technology
What is HPC?
- Supercomputers are computers which are typically 100 times or more faster than a PC or a workstation.
- High Performance Computing refers to computer applications running on parallel and distributed supercomputers in order to solve large-scale or complex models within an acceptable time (huge computing speed and huge memory storage).
Computational Science & Engineering

- Computational Science & Engineering as a third paradigm for scientific research is becoming increasingly important (the traditional paradigms being analytical (theory) and experimental);
- HPC is the driving force behind the rise of this third paradigm (although CSE is not necessarily connected to HPC).
Hunger for more computing power

- Tremendous increase in speed:
  - Clock speed: 10^6
  - Parallel processing: 10^5
  - Efficient algorithms: 10^6
- Computational scientists and engineers demand ever more computing power.
Outline

- Architectures
- Software and Tools
- Algorithms and Applications
Evolution of supercomputers

- Vector supercomputers (70's, 80's)
  - Very expensive! (Cray 2 (1986): 1 Gflops)
- Massively Parallel Computers (90's)
  - Still expensive (Intel ASCI Red (1997): 1 Tflops)
- Distributed Cluster Computers (late 90's - )
  - Cheap (using off-the-shelf components), and
  - Easy to produce
  - (IBM Roadrunner (2008): 1 Pflops)
- What's next?
  - 1 Exaflops in 2019? (cycle of 11 years)
Hardware trends: no more doubling of clock speed every 2~3 years
- CMOS devices are hitting a scaling wall
- Power components:
  - Active power
  - Passive power
    - Gate leakage
    - Sub-threshold leakage (source-drain leakage)
- Net: further improvements require changes in structure/materials
[Figure: power density (W/cm^2) versus gate length (microns), showing active power, passive power, and the air-cooling limit]
Source: P. Hofstee, IBM, Europar 09 keynote
Microprocessor Trends
- Single-thread performance is power limited
- Multi-core extends throughput performance
- Hybrid extends performance and efficiency
[Figure: performance versus power for single-thread, multi-core, and hybrid designs]
Hardware trends: moving towards multi-core and accelerators (e.g., Cell, GPU, …)
- Multi-core: e.g., IBM Cell BE: 1 PPE + 8 SPEs
- GPU: e.g., Nvidia G200 (240 cores) and GF100 (512 cores)
- A "supercomputer" is now affordable for everyone, e.g., a PC + a 1 Tflops GPU
- The size keeps increasing: the largest supercomputer will soon have more than 1 million processors/cores (e.g., IBM Sequoia (2012): 1.6 million Power processor cores, 1.6 Pbytes of memory and 20 Pflops)
- Power consumption is becoming an important metric (watts/Gflops) for (HPC) computers
Geographical distribution of
Supercomputing power (Tflops)
Figure: John West, InsideHPC
HPC Cluster Directions
(according to Peter Hofstee, IBM)
Software and Tools Challenges

- From the mid 70s to the mid 90s, data-parallel languages and the SIMD execution model were popular, together with the vector computers.
  - Automatic vectorization for array-type operations is quite well developed (see the sketch below).
- For MPPs and clusters, an efficient automatic parallelizing compiler has not been developed to this day.
  - Optimizing data distribution and automatically detecting task parallelism turn out to be very hard problems to tackle.
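A minimal, hypothetical illustration (not from the talk) of the kind of loop a vectorizing compiler handles well, next to one it cannot vectorize directly:

```c
/* Hypothetical example: compile e.g. with gcc -O3 -fopt-info-vec to see
 * which of the two loops the compiler vectorizes. */
#include <stddef.h>

/* Independent iterations over arrays: easily auto-vectorized (SIMD). */
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Loop-carried dependence (y[i] depends on y[i-1]): straightforward
 * vectorization is not possible without restructuring the algorithm. */
void prefix_sum(size_t n, float *y)
{
    for (size_t i = 1; i < n; i++)
        y[i] = y[i] + y[i - 1];
}
```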
Software and Tools Challenges (con.)
- Current situation: OpenMP works for SMP systems with a small number of processors/cores. For large systems and distributed-memory systems, data distribution/communication must be done manually by the programmer, mostly with MPI (see the sketch below).
- Programming GPU-type accelerators using CUDA, OpenCL etc. bears some resemblance to programming vector processors in the old days: very high performance for certain types of operations, but the programmability and applicability are somewhat limited.
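A minimal sketch (not from the talk) of this contrast: the same dot product written with a single OpenMP pragma for shared memory, and for distributed memory, where the programmer has already distributed the data over the ranks and must combine the partial results with explicit MPI communication.

```c
/* Hypothetical sketch: shared-memory (OpenMP) vs. distributed-memory
 * (MPI) versions of a dot product. */
#include <mpi.h>
#include <omp.h>

/* Shared memory: one pragma; the runtime splits the loop over cores. */
double dot_openmp(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Distributed memory: each rank owns n_local elements of x and y
 * (the data distribution is the programmer's responsibility); the
 * partial sums are combined with an explicit collective. */
double dot_mpi(int n_local, const double *x, const double *y)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}
```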
The programming difficulty is getting worse
- In contrast to the fast development of hardware, the development of parallel compilers and programming models lags behind.
- Moving towards larger and larger systems enlarges this problem even further:
  - Heterogeneity
  - Debugging
  - …

DOE report on Exascale supercomputing [4]

“The shift from faster processors to
multicore processors is as disruptive as
the shift from vector to distributed
memory supercomputers 15 years ago.
That change required complete
restructuring of scientific application
codes, which took years of effort. The
shift to multicore exascale systems will
require applications to exploit million-way
parallelism. This ‘scalability challenge’
affects all aspects of the use of HPC. It is
critical that work begin today if the
software ecosystem is to be ready for the
arrival of exascale systems in the coming
decade”
The big challenge requires a concerted international effort
- IESP - International Exascale Software Project [5].
Applications

Applications which require Exaflops computing power, examples ([4],[6]):
- Climate and atmospheric modelling
- Astrophysics
- Energy research (e.g., combustion and fusion)
- Biology (genetics, molecular dynamics, …)
- …
Are there applications which can use 1 million processors?
- Parallelism is inherent in nature
- Serialization is a way we deal with complexity
- Some mathematical and computational models may have to be reconsidered
Algorithms


- Algorithms with a large degree of parallelism are essential
- Data locality is important for efficiency:
  - Data movement at the cache level(s)
  - Data movement (communication) between processors/nodes
- Growing gap between processor and memory speed
Memory hierarchy: NUMA
In order to reduce the big delay of directly accessing main or remote memory, it is necessary to:
- Optimize the data movement, maximizing reuse of data already in the fastest memory (see the blocking sketch below);
- Minimize data movement between 'remote' memories (communication).
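A standard way to maximize such reuse is loop blocking (tiling); the sketch below (generic, not from the talk) does this for a matrix-matrix product, with a hypothetical block size BS chosen so that the tiles fit in the fastest cache.

```c
/* Generic cache-blocking sketch for C = C + A*B on n x n row-major
 * matrices.  BS is a hypothetical tuning parameter: once a BS x BS tile
 * of A or B is in cache it is reused many times before being evicted. */
#define BS 64

void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* Multiply one pair of tiles at a time. */
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```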
Scale change requires change in algorithms
- It is well known that sometimes an algorithm with a higher degree of parallelism is preferred over an 'optimal' algorithm (optimal in the sense of the number of operations);
- In order to reduce data movement, we need to consider restructuring existing algorithms.
An example: Krylov iterative methods
James Demmel et al., "Avoiding communication in Sparse Matrix Computations", Proc. IPDPS, April 2008.
- In each iteration of a Krylov method, e.g. CG or GMRES, typically an SpMV (Sparse Matrix-Vector Multiplication) is computed:
  y ← y + A x, where A is a sparse matrix,
  i.e., for each a_ij ≠ 0: y_i = y_i + a_ij * x_j
- SpMV has low computational intensity: each a_ij is used only once, with no reuse at all (see the sketch below).
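For reference, a minimal CSR (compressed sparse row) SpMV kernel (a generic sketch, not code from the cited paper); each nonzero value is loaded from memory once and used in a single multiply-add, which is why the kernel is bound by memory bandwidth rather than by flops.

```c
/* Generic CSR SpMV sketch: y = y + A*x for a sparse n x n matrix A. */
void spmv_csr(int n,               /* number of rows                  */
              const int *row_ptr,  /* size n+1: start of each row     */
              const int *col_idx,  /* column index of each nonzero    */
              const double *val,   /* value of each nonzero a_ij      */
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = y[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* each val[k] used once */
        y[i] = sum;
    }
}
```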
An example: Krylov iterative method (con.)
- Consider the operation across a number of iterations, where the "matrix powers kernel" [x, Ax, A^2x, …, A^kx] is computed.
- Computing all these terms at the same time minimizes the data movement of A, at the cost of some redundant work (a naive, non-optimized version is sketched below).
- Speedups of up to 7x, and 22x across the Grid.
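A naive formulation of the matrix powers kernel, for comparison (a sketch building on the hypothetical spmv_csr above; the communication-avoiding variants in the cited paper reorganize this computation so that blocks of A are read far fewer times):

```c
#include <string.h>

/* Forward declaration of the hypothetical CSR SpMV sketch above. */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y);

/* Naive matrix powers kernel: V[j] = A^(j+1) * x for j = 0..k-1,
 * computed by k repeated SpMVs.  Each step streams all of A from
 * memory again, which is exactly the data movement the
 * communication-avoiding formulation tries to reduce. */
void matrix_powers(int n, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x,
                   int k, double **V /* k vectors of length n */)
{
    const double *prev = x;
    for (int j = 0; j < k; j++) {
        memset(V[j], 0, (size_t)n * sizeof(double));    /* V[j] = 0       */
        spmv_csr(n, row_ptr, col_idx, val, prev, V[j]); /* V[j] += A*prev */
        prev = V[j];
    }
}
```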
Example: Generating parallel operations by graph transformations
[Lin2001] A Unifying Graph Model for Designing Parallel Algorithms for Tridiagonal Systems, Parallel Computing, Vol. 27, 2001.
[Lin2004] Graph Transformation and Designing Parallel Sparse Matrix Algorithms beyond Data Dependence Analysis, Scientific Programming, Vol. 12, 2004.
- This may yet be a step too far; the first step should be an automatic parallelizing compiler (by detecting parallelism and optimizing data locality).