
Scientific Computations on Modern Parallel Vector Systems
Leonid Oliker, Jonathan Carter, Andrew Canning, John Shalf
Lawrence Berkeley National Laboratories
Stephane Ethier
Princeton Plasma Physics Laboratory
http://crd.lbl.gov/~oliker
Overview
 Superscalar cache-based architectures dominate HPC market
 Leading architectures are commodity-based SMPs due to generality and perception of
cost effectiveness
 Growing gap between peak & sustained performance is well known in scientific
computing
 Modern parallel vector systems may bridge this gap for many important applications
 In April 2002, the Earth Simulator (ES) became operational:
Peak ES performance > all DOE and DOD systems combined
Demonstrated high sustained performance on demanding scientific apps
 Conducting evaluation study of scientific applications on modern vector systems
 09/2003 MOU between ES and NERSC was completed
First visit to ES center: December 8th-17th, 2003 (ES remote access not available)
First international team to conduct performance evaluation study at ES
 Examining best mapping between demanding applications and leading HPC systems: one size does not fit all
Vector Paradigm
 High memory bandwidth
• Allows systems to effectively feed ALUs (high byte to flop ratio)
 Flexible memory addressing modes
• Supports fine grained strided and irregular data access
 Vector Registers
• Hide memory latency via deep pipelining of memory load/stores
 Vector ISA
• Single instruction specifies large number of identical operations
 Vector architectures allow for:
• Reduced control complexity
• Efficient utilization of large numbers of computational resources
• Potential for automatic discovery of parallelism
However: most effective only if sufficient regularity is discoverable in the program structure
• Performance suffers even if a small % of the code is non-vectorizable (Amdahl's Law); see the sketch below
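To make the last two points concrete, here is a minimal Fortran sketch (illustrative only, not taken from the codes studied): the first loop has independent iterations and vectorizes, while the second carries a dependence and falls back to scalar execution, the kind of residue Amdahl's Law makes costly.

  ! Minimal sketch (illustrative, not from the codes studied).
  ! The first loop has independent iterations, so one vector instruction
  ! can cover many elements (256 per instruction on the SX-6/ES); the
  ! second loop carries a dependence (y(i) needs y(i-1)) and must run
  ! scalar - the kind of residue Amdahl's Law makes expensive.
  program vec_demo
    implicit none
    integer, parameter :: n = 100000
    real(8) :: x(n), y(n)
    integer :: i
    x = 1.0d0
    y = 0.0d0
    do i = 1, n                  ! vectorizable
       y(i) = y(i) + 2.5d0 * x(i)
    end do
    do i = 2, n                  ! loop-carried dependence: runs scalar
       y(i) = y(i) + y(i-1)
    end do
    print *, y(n)
  end program vec_demo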
Architectural Comparison
Node Type  Where  CPU/Node  Clock (MHz)  Peak (Gflop/s)  Mem BW (GB/s)  Ratio (B/F)  Netwk BW (GB/s/P)  Bisect BW (B/F)  MPI Latency (usec)  Network Topology
Power3     NERSC  16        375          1.5             1.0            0.47         0.13               0.087            16.3                Fat-tree
Power4     ORNL   32        1300         5.2             2.3            0.44         0.13               0.025            7.0                 Fat-tree
Altix      ORNL   2         1500         6.0             6.4            1.1          0.40               0.067            2.8                 Fat-tree
ES         ESC    8         500          8.0             32.0           4.0          1.5                0.19             5.6                 Crossbar
X1         ORNL   4         800          12.8            34.1           2.7          6.3                0.088            7.3                 2D-torus
Custom vector architectures have:
• High memory bandwidth relative to peak
• Superior interconnect: latency, point-to-point, and bisection bandwidth
Overall, ES appears to be the most balanced architecture, while Altix shows the best balance among the superscalar architectures
A key 'balance point' for vector systems is the scalar:vector ratio
Applications studied
Code      Discipline        Lines    Method              Description
LBMHD     Plasma Physics    1,500    grid based          Lattice Boltzmann approach for magneto-hydrodynamics
CACTUS    Astrophysics      100,000  grid based          Solves Einstein's equations of general relativity
PARATEC   Material Science  50,000   Fourier space/grid  Density Functional Theory electronic structure code
GTC       Magnetic Fusion   5,000    particle based      Particle-in-cell method for gyrokinetic Vlasov-Poisson equation

 Applications chosen with potential to run at ultrascale
• Computations contain abundant data parallelism
 ES runs require minimal parallelization and vectorization hurdles
• Codes originally designed for superscalar systems
• Ported onto a single node of the SX-6; first multi-node experiments performed at the ESC
Plasma Physics: LBMHD
 LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
 Performs 2D simulation of high temperature plasma
 Evolves from initial conditions, decaying to form current sheets
 2D spatial grid is coupled to an octagonal streaming lattice
 Block distributed over a 2D processor grid
(Figure: current density decay of two cross-shaped structures)
 Main computational components:
• Collision: requires coefficients for the local gridpoint only, no communication
• Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
• Interpolation: step required between the spatial and stream lattices
 Developed by George Vahala's group at the College of William and Mary; ported by Jonathan Carter
LBMHD: Porting Details
(Figure, left: octagonal streaming lattice coupled with square spatial grid; right: example of a diagonal streaming vector updating three spatial cells)
 Collision routine rewritten:
• For ES, loop ordering switched so the gridpoint loop (~1000 iterations) is innermost rather than the velocity or magnetic field loops (~10 iterations); see the sketch below
• X1 compiler made this transformation automatically: multistreaming the outer loop and vectorizing (via strip mining) the inner loop
• Temporary arrays padded to reduce bank conflicts
 Stream routine performs well:
• Array shift operations, block copies, 3rd-degree polynomial evaluation
 Boundary value exchange:
• MPI_Isend, MPI_Irecv pairs
• Further work: plan to use ES "global memory" to remove message copies
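The collision loop interchange mentioned above can be sketched as follows; the array and variable names are hypothetical, not the actual LBMHD source.

  ! Hypothetical sketch of the collision loop interchange (names and
  ! shapes are illustrative, not the actual LBMHD source).  Keeping the
  ! short velocity loop (~10 iterations) outermost and the long gridpoint
  ! loop (~1000 iterations) innermost gives the compiler long vectors;
  ! the X1 compiler additionally multistreams the outer loop and
  ! strip-mines the inner one.
  subroutine collision(f, coef, ngrid, nvel)
    implicit none
    integer, intent(in) :: ngrid, nvel          ! ~1000 gridpoints, ~10 velocities
    real(8), intent(in)    :: coef(nvel)
    real(8), intent(inout) :: f(ngrid, nvel)
    integer :: i, iv
    do iv = 1, nvel              ! short loop stays outside
       do i = 1, ngrid           ! long, unit-stride loop vectorizes
          f(i, iv) = coef(iv) * f(i, iv)
       end do
    end do
  end subroutine collision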
LBMHD: Performance
                   Power3           Power4           Altix            ES               X1
Data Size   P      Gflops/P %peak   Gflops/P %peak   Gflops/P %peak   Gflops/P %peak   Gflops/P %peak
4096x4096   16     0.11     7%      0.28     5%      0.60     10%     4.6      58%     4.3      34%
            64     0.14     9%      0.30     6%      0.62     10%     4.3      54%     4.4      34%
            256    0.14     9%      0.28     5%      --       --      3.2      40%     --       --
8192x8192   64     0.11     7%      0.27     5%      0.65     11%     4.6      58%     4.5      35%
            256    0.12     8%      0.28     5%      --       --      4.3      53%     2.7      21%
            1024   0.11     7%      --       --      --       --      3.3      41%     --       --

 ES achieves highest performance to date: over 3.3 Tflop/s for P=1024
• X1 comparable absolute speed up to P=64 (lower % of peak)
• But performs 1.5X slower at P=256 (decreased scalability)
 CAF improved X1 to slightly exceed ES at P=64 (up to 4.70 Gflops/P)
 ES is 44X, 16X, and 7X faster than Power3, Power4, and Altix
• Low computational intensity (1.5) and high memory requirement (30GB) hurt scalar performance
 Altix best among the scalars due to high memory bandwidth and fast interconnect
LBMHD on X1
MPI vs CAF
Data Size   P      X1-MPI Gflops/P  %peak   X1-CAF Gflops/P  %peak
4096²       16     4.32             34%     4.55             36%
            64     4.35             34%     4.26             33%
8192²       64     4.48             35%     4.70             37%
            256    2.70             21%     2.91             23%
 X1 well-suited for one-sided parallel languages (globally addressable memory)
• MPI hinders this feature and requires scalar tag matching
 CAF allows much simpler coding of the boundary exchange (array subscripting):
• feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
 MPI requires non-contiguous data copies into a buffer, unpacked at the destination (contrasted in the sketch below)
 Since communication is only about 10% of LBMHD runtime, overall improvement is slight
 However, for P=64 on 4096² performance degrades. Tradeoffs:
• CAF reduced total message volume 3X (eliminates user and system buffer copies)
• But CAF used more numerous and smaller messages
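A hedged sketch of the two exchange styles (array names, extents, and the packing are illustrative, not the actual LBMHD source): the MPI version packs the non-contiguous boundary slice into a buffer, posts Isend/Irecv, and unpacks at the destination, while the CAF version is the single one-sided assignment shown above.

  ! Illustrative MPI boundary exchange (names and extents are hypothetical).
  subroutine exchange_mpi(feq, ista, iend, jsta, jend, iprev, inext, comm)
    use mpi
    implicit none
    integer, intent(in) :: ista, iend, jsta, jend, iprev, inext, comm
    real(8), intent(inout) :: feq(ista-1:iend+1, jsta:jend, 1)
    real(8) :: sendbuf(jend-jsta+1), recvbuf(jend-jsta+1)
    integer :: req(2), stats(MPI_STATUS_SIZE, 2), ierr, n
    n = jend - jsta + 1
    sendbuf = feq(iend, jsta:jend, 1)              ! pack non-contiguous slice
    call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, iprev, 0, comm, req(1), ierr)
    call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, inext, 0, comm, req(2), ierr)
    call MPI_Waitall(2, req, stats, ierr)
    feq(ista-1, jsta:jend, 1) = recvbuf            ! unpack at destination
  end subroutine exchange_mpi

  ! The equivalent CAF exchange is one one-sided array assignment, with no
  ! explicit buffers or tag matching:
  !   feq(ista-1, jsta:jend, 1) = feq(iend, jsta:jend, 1)[iprev, myrankj]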
Astrophysics: CACTUS
 Numerical solution of Einstein's equations from the theory of general relativity
 Among the most complex in physics: set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms
(Figure: visualization of grazing collision of two black holes)
 CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes
 Evolves PDEs on a regular grid using finite differences (see the stencil sketch below)
 Uses ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along the time dimension
 Exciting new field about to be born: Gravitational Wave Astronomy - fundamentally new information about the Universe
 Gravitational waves: ripples in spacetime curvature, caused by matter motion, causing distances to change
 Developed at Max Planck Institute, vectorized by John Shalf
 Communication only at boundaries; expect high parallel efficiency
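A toy finite-difference update (illustrative only; CACTUS evolves the far larger Einstein system, not this scalar equation) shows why the achievable vector length follows the x extent of the local grid, as noted in the performance discussion below.

  ! Toy finite-difference sweep (illustrative, not actual CACTUS code).
  ! The innermost i loop runs over the unit-stride x dimension, so the
  ! vector length the hardware sees is set by the local x extent.
  subroutine evolve(u, unew, nx, ny, nz, c)
    implicit none
    integer, intent(in) :: nx, ny, nz
    real(8), intent(in)  :: u(nx, ny, nz), c
    real(8), intent(out) :: unew(nx, ny, nz)
    integer :: i, j, k
    unew = u
    do k = 2, nz - 1
       do j = 2, ny - 1
          do i = 2, nx - 1                 ! vectorized loop: length ~ nx
             unew(i,j,k) = u(i,j,k) + c * ( u(i+1,j,k) + u(i-1,j,k)  &
                         + u(i,j+1,k) + u(i,j-1,k)                   &
                         + u(i,j,k+1) + u(i,j,k-1) - 6.0d0*u(i,j,k) )
          end do
       end do
    end do
  end subroutine evolve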
CACTUS: Performance
                       Power3           Power4           Altix            ES               X1
Problem Size     P     Gflops/P %peak   Gflops/P %peak   Gflops/P %peak   Gflops/P %peak   Gflops/P %peak
80x80x80         16    0.31     21%     0.58     11%     0.89     15%     1.5      18%     0.54     4%
per processor    64    0.22     14%     0.50     10%     0.70     12%     1.4      17%     0.43     3%
                 256   0.22     14%     0.48     9%      --       --      1.4      17%     0.41     3%
250x80x80        16    0.10     6%      0.56     11%     0.51     9%      2.8      35%     0.81     6%
per processor    64    0.08     6%      --       --      0.42     7%      2.7      34%     0.72     6%
                 256   0.07     5%      --       --      --       --      2.7      34%     0.68     5%

 Vector performance related to x-dimension (vector length)
 Excellent scaling on ES using fixed data size per processor (weak scaling)
 Scalar performance better on smaller problem size (cache effects)
 X1 surprisingly poor (4X slower than ES): low scalar:vector ratio
 ES achieves fastest performance to date: 45X faster than Power3!
• Unvectorized boundary condition required 15% of runtime on ES and 30+% on X1
• < 5% for the scalar version: unvectorized code can quickly dominate cost
 Poor superscalar performance despite high computational intensity
• Register spilling due to large number of loop variables
• Prefetch engines inhibited due to multi-layer ghost zone calculations
Material Science: PARATEC
 PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials & a plane wave basis set
 Uses Density Functional Theory to calculate structure & electronic properties of new materials
 DFT calculations are one of the largest consumers of supercomputer cycles in the world
(Figure: induced current and charge density in crystallized glycine)
 Uses all-band CG approach to obtain wavefunction of electrons
 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
 Part of calculation in real space, the rest in Fourier space
• Uses specialized 3D FFT to transform the wavefunction
 Computationally intensive - generally obtains a high percentage of peak
 Developed by Andrew Canning with Louie and Cohen's groups (UCB, LBNL)
PARATEC: Wavefunction Transpose
(Figure: panels (a)-(f) illustrating the parallel 3D FFT wavefunction transpose)
 Transpose from Fourier to real space
 3D FFT done via 3 sets of 1D FFTs and 2 transposes (sketched below)
 Most communication in the global transpose (b) to (c); little communication (d) to (e)
 Many FFTs done at the same time to avoid latency issues
 Only non-zero elements communicated/calculated
 Much faster than vendor-supplied 3D FFT
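A self-contained sketch of the decomposition referenced above: a 3D transform factors into three sets of 1D transforms, one per axis. The naive O(n²) 1D DFT below is only for illustration; PARATEC calls optimized 1D FFT libraries, and in the parallel code the two global transposes between the sets, which make each successive axis locally held and unit-stride, carry most of the communication.

  ! Sketch only: dft_1d is a naive O(n^2) transform used to keep the
  ! example self-contained; it is not PARATEC's FFT.
  program fft3d_sketch
    implicit none
    integer, parameter :: n = 8
    complex(8) :: a(n, n, n)
    integer :: i, j, k
    a = (1.0d0, 0.0d0)
    do k = 1, n                     ! set 1: 1D transforms along x
       do j = 1, n
          call dft_1d(a(:, j, k), n)
       end do
    end do
    do k = 1, n                     ! set 2: along y (after the first
       do i = 1, n                  ! global transpose in the parallel code)
          call dft_1d(a(i, :, k), n)
       end do
    end do
    do j = 1, n                     ! set 3: along z (after the second transpose)
       do i = 1, n
          call dft_1d(a(i, j, :), n)
       end do
    end do
    print *, a(1, 1, 1)
  contains
    subroutine dft_1d(x, m)
      integer, intent(in) :: m
      complex(8), intent(inout) :: x(m)
      complex(8) :: y(m)
      real(8), parameter :: pi = 3.141592653589793d0
      integer :: p, q
      do p = 1, m
         y(p) = (0.0d0, 0.0d0)
         do q = 1, m
            y(p) = y(p) + x(q) * exp( (0.0d0, -1.0d0) * 2.0d0 * pi          &
                                      * real((p-1)*(q-1), 8) / real(m, 8) )
         end do
      end do
      x = y
    end subroutine dft_1d
  end program fft3d_sketch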
PARATEC: Performance
                 Power3           Power4           Altix            ES               X1
Data Size   P    Gflops/P %peak   Gflops/P %peak   Gflops/P %peak   Gflops/P %peak   Gflops/P %peak
432 Atom    32   0.95     63%     2.0      39%     3.7      62%     4.7      60%     3.0      24%
            64   0.85     57%     1.7      33%     3.2      54%     4.7      59%     2.6      20%
            128  0.74     49%     1.5      29%     --       --      4.7      59%     1.9      15%
            256  0.57     38%     1.1      21%     --       --      4.2      52%     --       --
            512  0.41     28%     --       --      --       --      3.4      42%     --       --
686 Atom    128  --       --      --       --      --       --      4.9      62%     3.0      24%
            256  --       --      --       --      --       --      4.6      57%     1.3      10%

 ES achieves fastest performance to date! Over 2 Tflop/s on 1024 processors
 Main advantage for this type of code is a fast interconnect system
 Scalar architectures generally perform well due to high computational intensity
• Power3, Power4, and Altix are 8X, 4X, and 1.5X slower than ES
 Limited scalability due to increasing cost of global transpose and reduced vector length
• Plan to run a larger problem size on next ES visit
 X1 3.5X slower than ES (although peak is 50% higher)
• Non-vectorizable code can be much more expensive on X1 (32:1 vs 8:1 scalar:vector ratio)
• Lower bisection bandwidth to computation ratio
 Vector architectures allow the opportunity to simulate systems not possible on scalar platforms
Magnetic Fusion: GTC
 Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
 Goal of magnetic fusion is a burning plasma power plant producing cleaner energy
(Figure: 3D visualization of electrostatic potential in magnetic fusion device)
 GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
 PIC scales as N instead of N²: particles interact with the electromagnetic field on a grid
 Allows solving equations of particle motion with ODEs (instead of nonlinear PDEs)
 Main computational tasks:
• Scatter: deposit particle charge to nearest grid points
• Solve: Poisson equation to get potential at each point
• Gather: calculate force based on neighbors' potentials
• Move: particles by solving equations of motion
• Shift: particles moved outside the local domain
 Developed at Princeton Plasma Physics Laboratory, vectorized by Stephane Ethier
GTC: Scatter operation
 Particle charge is deposited among the nearest grid points
 Force is calculated based on neighbors' potentials, then the particle is moved accordingly
 Several particles can contribute to the same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
 Solution: VLEN copies of the charge deposition array with a reduction after the main loop (sketched below)
• However, this greatly increases the memory footprint (8X)
 Since particles are randomly localized, the scatter also hinders cache reuse
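An illustrative sketch of the work-array replication described above (names are hypothetical, not the actual GTC source): each of VLEN copies of the charge array accumulates from a different vector lane, so the iterations of one vector instruction never touch the same location, and the copies are summed after the loop.

  ! Hypothetical sketch of vectorizing the PIC charge scatter (not the
  ! actual GTC source).  Within each group of vlen consecutive particles
  ! the lane index l is distinct, so no two elements of one vector
  ! instruction update the same (l, gridpoint) location; in practice a
  ! compiler directive asserts this independence.  The vlen copies are
  ! what inflate the memory footprint (~8X in GTC).
  subroutine deposit_charge(rho, igrid, q, npart, ngrid)
    implicit none
    integer, parameter :: vlen = 256            ! hardware vector length on ES
    integer, intent(in) :: npart, ngrid
    integer, intent(in) :: igrid(npart)         ! nearest gridpoint of each particle
    real(8), intent(in) :: q(npart)             ! particle charge
    real(8), intent(out) :: rho(ngrid)
    real(8) :: rho_v(vlen, ngrid)               ! one copy per vector lane
    integer :: ip, l
    rho_v = 0.0d0
    do ip = 1, npart
       l = mod(ip - 1, vlen) + 1
       rho_v(l, igrid(ip)) = rho_v(l, igrid(ip)) + q(ip)
    end do
    rho = sum(rho_v, dim=1)                     ! reduce the copies after the loop
  end subroutine deposit_charge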
GTC: Performance
                           Power3           Power4           Altix            ES               X1
Number of Particles  P     Gflops/P %peak   Gflops/P %peak   Gflops/P %peak   Gflops/P %peak   Gflops/P %peak
10/cell (20M)        32    0.13     9%      0.29     5%      0.29     5%      1.15     14%     1.00     8%
                     64    0.13     9%      0.32     5%      0.26     4%      1.00     13%     0.80     6%
100/cell (200M)      32    0.13     9%      0.29     5%      0.33     6%      1.62     20%     1.50     12%
                     64    0.13     9%      0.29     5%      0.31     5%      1.56     20%     1.36     11%
                     1024  0.06     4%      --       --      --       --      --       --      --       --

 ES achieves fastest performance of any tested architecture!
• First time the code achieved 20% of peak, compared with less than 10% on superscalar systems
• Vector hybrid (OpenMP) parallelism not possible due to increased memory requirements
• P=64 on ES is 1.6X faster than P=1024 on Power3!
• Reduced scalability due to decreasing vector length, not MPI performance
 Non-vectorizable code portions expensive on X1
• Before vectorization, the shift routine accounted for 11% of ES and 54% of X1 overhead
 Larger tests could not be performed at ES due to parallelization/vectorization hurdles
• Currently developing a new version with increased particle decomposition
 Advantage of ES for PIC codes may reside in higher statistical resolution simulations
• Greater speed allows more particles per cell
Overview
          % peak (P=64)                           Speedup of ES vs. (P=Max avail)
Code      Pwr3   Pwr4   Altix   ES     X1         Pwr3   Pwr4   Altix   X1
LBMHD     7%     5%     11%     58%    37%        30.6   15.3   7.2     1.5
CACTUS    6%     11%    7%      34%    6%         45.0   5.1    6.4     4.0
GTC       9%     6%     5%      20%    11%        9.4    4.3    4.1     1.1
PARATEC   57%    33%    54%     58%    20%        8.2    3.9    1.4     3.9
Average                                           23.3   7.2    4.8     2.6

 Tremendous potential of vector architectures: 4 codes running faster than ever before
 Vector systems allow resolution not possible with scalar architectures (regardless of # of processors)
• Opportunity to perform scientific runs at unprecedented scale
 ES shows high raw and much higher sustained performance compared with X1
• Limited X1-specific optimization - optimal programming approach still unclear (CAF, etc.)
• Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
 Evaluation codes contain sufficient regularity in computation for high vector performance
• GTC an example of code at odds with data parallelism
• Much more difficult to evaluate codes poorly suited for vectorization
 Vectors potentially at odds with emerging techniques (irregular, multi-physics, multi-scale)
 Plan to expand scope of application domains/methods, and examine latest HPC architectures
Second ES visit
 Evaluate high-concurrency PARATEC performance using a large-scale Quantum Dot simulation
 Evaluate CACTUS performance using updated vectorization of the radiation boundary condition
 Evaluate MADCAP performance using a newly optimized version, without global file system requirements and with improved I/O behavior
 Examine 3D version of LBMHD, and explore optimization strategies
 Evaluate GTC performance using updated vectorization of the shift routine as well as a new particle decomposition approach designed to increase concurrency
 Evaluate performance of FVCAM3 (Finite Volume atmospheric model) at high concurrencies and resolutions (1 x 1.25, 0.5 x 0.625, 0.25 x 0.375 degree)
Papers available at http://crd.lbl.gov/~oliker