Scientific Computations on Modern Parallel Vector Systems
Leonid Oliker
Julian Borrill, Jonathan Carter,
Andrew Canning, John Shalf, David Skinner
Lawrence Berkeley National Laboratory
Stephane Ethier
Princeton Plasma Physics Laboratory
http://crd.lbl.gov/~oliker
Architectural Comparison
Node Type  Where  CPU/Node  Clock (MHz)  Peak (GFlop/s)  Mem BW (GB/s)  Peak (byte/flop)  Netwk BW (GB/s/P)  Bisect BW (byte/flop)  MPI Latency (usec)  Network Topology
Power3     NERSC  16        375          1.5             1.0            0.47              0.13               0.087                  16.3                Fat-tree
Power4     ORNL   32        1300         5.2             2.3            0.44              0.13               0.025                  7.0                 Fat-tree
Altix      ORNL   2         1500         6.0             6.4            1.1               0.40               0.067                  2.8                 Fat-tree
ES         ESC    8         500          8.0             32.0           4.0               1.5                0.19                   5.6                 Crossbar
X1         ORNL   4         800          12.8            34.1           2.7               6.3                0.088                  7.3                 2D-torus
Custom vector architectures have
• High memory bandwidth relative to peak
• Superior interconnect: latency, point-to-point, and bisection bandwidth
Overall ES appears to be the most balanced architecture,
while Altix shows the best balance among the superscalar architectures
A key ‘balance point’ for vector systems is the scalar:vector ratio
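The byte/flop balance figure in the comparison is memory bandwidth divided by peak floating-point rate. The sketch below recomputes it for four of the systems from the values listed above; the function name `bytes_per_flop` and the dictionary layout are illustrative choices, not part of the original study.

```python
# Machine balance: bytes of memory bandwidth available per peak flop.
# Input values come from the architectural comparison; names are illustrative.

def bytes_per_flop(mem_bw_gbs, peak_gflops):
    """Return memory bandwidth (GB/s) divided by peak rate (GFlop/s)."""
    return mem_bw_gbs / peak_gflops

# (memory bandwidth in GB/s, peak in GFlop/s) per processor
systems = {
    "Power4": (2.3, 5.2),
    "Altix": (6.4, 6.0),
    "ES": (32.0, 8.0),
    "X1": (34.1, 12.8),
}

balance = {name: bytes_per_flop(bw, peak) for name, (bw, peak) in systems.items()}

# Print from most to least balanced
for name, value in sorted(balance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:7s} {value:.2f} byte/flop")
```

On these numbers the two vector systems have several times more memory bandwidth per flop than the superscalar ones, which is the first bullet above.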
Applications studied
LBMHD: Plasma Physics, 1,500 lines, grid based
Lattice Boltzmann approach for magneto-hydrodynamics
CACTUS: Astrophysics, 100,000 lines, grid based
Solves Einstein’s equations of general relativity
PARATEC: Material Science, 50,000 lines, Fourier space/grid
Density Functional Theory electronic structures codes
GTC: Magnetic Fusion, 5,000 lines, particle based
Particle in cell method for gyrokinetic Vlasov-Poisson equation
MADbench: Cosmology, 2,000 lines, dense linear algebra
Maximum likelihood two-point angular correlation, I/O intensive
Applications chosen with potential to run at ultrascale
Computations contain abundant data parallelism
• ES runs must meet minimum parallelization and vectorization requirements
Codes originally designed for superscalar systems
Ported onto single node of SX6, first multi-node experiments performed at ESC
Plasma Physics: LBMHD
LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
Performs 2D simulation of high-temperature plasma
Evolves from initial conditions, decaying to form current sheets
2D spatial grid is coupled to octagonal streaming lattice
Block distributed over 2D processor grid
[Figure: current density decay of two cross-shaped structures]
Main computational components:
Collision: requires coefficients for local gridpoint only, no communication
Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
Interpolation: step required between spatial and stream lattices
Developed by George Vahala’s group at the College of William and Mary, ported by Jonathan Carter
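The stream component can be pictured as shifting every gridpoint value one cell along a lattice direction. The sketch below is a single-domain illustration only: periodic wrap-around stands in for the MPI boundary exchange, the octagonal lattice and its interpolation step are left out, and the function name `stream` is hypothetical.

```python
# Illustrative stream step on one 2D domain.
# Assumption: periodic wrap-around replaces the MPI boundary exchange.

def stream(field, dx, dy):
    """Shift a 2D field by (dx, dy) cells with periodic wrap-around."""
    ny, nx = len(field), len(field[0])
    return [[field[(j - dy) % ny][(i - dx) % nx] for i in range(nx)]
            for j in range(ny)]

grid = [[0, 0, 0],
        [0, 1, 0],
        [0, 0, 0]]

# Diagonal move: the value at row 1, column 1 travels to row 2, column 2
moved = stream(grid, 1, 1)

# Streaming back along the opposite direction restores the original field
restored = stream(moved, -1, -1)
```

In a block-distributed run, the values that wrap around here would instead be sent to the neighboring processor.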
LBMHD: Porting Details
[Figure: (left) octagonal streaming lattice coupled with square spatial grid; (right) example of diagonal streaming vector updating three spatial cells]
Collision routine rewritten:
For ES, loop ordering switched so the gridpoint loop (~1000 iterations) is inner rather than the velocity or magnetic field loops (~10 iterations)
X1 compiler made this transformation automatically: multistreaming the outer loop and vectorizing (via strip mining) the inner loop
Temporary arrays padded to reduce bank conflicts
Stream routine performs well:
Array shift operations, block copies, 3rd-degree polynomial evaluation
Boundary value exchange:
MPI_Isend, MPI_Irecv pairs
Further work: plan to use ES "global memory" to remove message copies
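The collision loop reordering described above can be illustrated with a toy kernel. The weighted sum below is a hypothetical stand-in for the real collision arithmetic, and Python does not vectorize either version; the point is only that interchanging the loops leaves the result unchanged while making the long gridpoint loop the innermost one.

```python
# Loop interchange sketch. The kernel (a weighted sum over velocities)
# is a made-up placeholder, not the LBMHD collision operator.

NPOINTS = 1000   # long loop over gridpoints
NVEL = 9         # short loop over velocity directions

def collide_velocity_inner(f, weights):
    """Original ordering: the short velocity loop is innermost."""
    out = [0.0] * len(f)
    for p in range(len(f)):
        for v in range(len(weights)):
            out[p] += weights[v] * f[p][v]
    return out

def collide_gridpoint_inner(f, weights):
    """Interchanged ordering: the long gridpoint loop is innermost."""
    out = [0.0] * len(f)
    for v in range(len(weights)):
        w = weights[v]
        for p in range(len(f)):   # ~1000 iterations: the vectorizable loop
            out[p] += w * f[p][v]
    return out

f = [[0.5 * (p + 1) + v for v in range(NVEL)] for p in range(NPOINTS)]
weights = [1.0 / (v + 1) for v in range(NVEL)]

a = collide_velocity_inner(f, weights)
b = collide_gridpoint_inner(f, weights)
```

Each `out[p]` accumulates its terms in the same velocity order in both versions, so the two results match exactly.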
Material Science: PARATEC
PARATEC performs first-principles quantum mechanical total energy calculation using pseudopotentials & plane wave basis set
Density Functional Theory to calc structure & electronic properties of new materials
DFT calcs are one of the largest consumers of supercomputer cycles in the world
[Figure: induced current and charge density in crystallized glycine]
Uses all-band CG approach to obtain wavefunction of electrons
33% 3D FFT, 33% BLAS3, 33% hand-coded F90
Part of calculation in real space, the rest in Fourier space
• Uses specialized 3D FFT to transform wavefunction
Computationally intensive - generally obtains high percentage of peak
Developed by Andrew Canning with Louie and Cohen’s groups (UCB, LBNL)
PARATEC: Wavefunction Transpose
[Figure: panels (a) to (f), transpose from Fourier to real space]
3D FFT done via 3 sets of 1D FFTs and 2 transposes
Most communication in global transpose (b) to (c), little communication (d) to (e)
Many FFTs done at the same time to avoid latency issues
Only non-zero elements communicated/calculated
Much faster than vendor 3D-FFT
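The "3 sets of 1D FFTs and 2 transposes" structure can be checked on a tiny array. This is a serial sketch under loud assumptions: a naive O(n^2) DFT stands in for a real FFT, the transposes are plain in-memory axis rotations rather than global communication, and no zero elements are skipped. All function names are hypothetical.

```python
import cmath

def dft_1d(v):
    """Naive 1D discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def dft_last_axis(a):
    """Apply the 1D transform along the last axis of a 3D nested list."""
    return [[dft_1d(row) for row in plane] for plane in a]

def rotate_axes(a):
    """Transpose so that b[k][i][j] = a[i][j][k]."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for j in range(n1)] for i in range(n0)]
            for k in range(n2)]

def fft3d_by_transposes(a):
    """3 sets of 1D transforms and 2 transposes; output layout is [ky][kz][kx]."""
    a = dft_last_axis(a)      # transform along z -> [x][y][kz]
    a = rotate_axes(a)        # transpose         -> [kz][x][y]
    a = dft_last_axis(a)      # transform along y -> [kz][x][ky]
    a = rotate_axes(a)        # transpose         -> [ky][kz][x]
    return dft_last_axis(a)   # transform along x -> [ky][kz][kx]

def dft3d_direct(a):
    """Reference: direct triple-sum 3D DFT with output layout [kx][ky][kz]."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    def coeff(kx, ky, kz):
        return sum(a[x][y][z] * cmath.exp(
                       -2j * cmath.pi * (kx * x / n0 + ky * y / n1 + kz * z / n2))
                   for x in range(n0) for y in range(n1) for z in range(n2))
    return [[[coeff(kx, ky, kz) for kz in range(n2)] for ky in range(n1)]
            for kx in range(n0)]

N0, N1, N2 = 2, 3, 4
field = [[[complex(x + 2 * y - z, 0.5 * x * y * z) for z in range(N2)]
          for y in range(N1)] for x in range(N0)]

result = fft3d_by_transposes(field)
reference = dft3d_direct(field)
```

The result is left in the transposed layout rather than rotated back, which is why only two transposes are needed.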
Astrophysics: CACTUS
Numerical solution of Einstein’s equations from theory of general relativity
Among most complex in physics: set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms
[Figure: visualization of grazing collision of two black holes]
CACTUS evolves these equations to simulate high gravitational fluxes, such as collision of two black holes
Evolves PDEs on regular grid using finite differences
Uses ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along time dimension
Exciting new field about to be born: Gravitational Wave Astronomy - fundamentally new information about Universe
Gravitational waves: ripples in spacetime curvature, caused by matter motion, causing distances to change
Developed at Max Planck Institute, vectorized by John Shalf
Communication at boundaries
Expect high parallel efficiency
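Finite differences on a regular grid need only nearest-neighbor data, so a decomposed domain communicates only at subdomain boundaries. The 1D sketch below is an illustration under stated assumptions: a second-derivative stencil stands in for the far larger Einstein-equation kernels, and a list copy of one ghost point stands in for the MPI exchange.

```python
# 1D illustration of a centered finite-difference stencil with ghost points.
# Assumption: copying one value between lists replaces the boundary exchange.

def second_derivative(u, h):
    """Second-order centered difference at the interior points of u."""
    return [(u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h)
            for i in range(1, len(u) - 1)]

h = 0.5
u = [(i * h) ** 2 for i in range(10)]   # f(x) = x^2, second derivative is 2

# Whole-domain result at points 1..8
whole = second_derivative(u, h)

# Two subdomains, each padded with one ghost point taken from its neighbor
left = u[:5] + [u[5]]     # owns points 0..4, ghost copy of point 5
right = [u[4]] + u[5:]    # owns points 5..9, ghost copy of point 4
split = second_derivative(left, h) + second_derivative(right, h)
```

Only one value per subdomain face crosses the boundary per step, while the interior work grows with the subdomain size, which is why high parallel efficiency is expected.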
Magnetic Fusion: GTC
[Figure: 3D visualization of electrostatic potential in magnetic fusion device]
Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
Goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
GTC solves 3D gyroaveraged gyrokinetic system w/ particle-in-cell approach (PIC)
PIC scales as N instead of N^2: particles interact w/ electromagnetic field on grid
Allows solving equation of particle motion with ODEs (instead of nonlinear PDEs)
Main computational tasks:
Scatter: deposit particle charge to nearest point
Solve: Poisson eqn to get potential for each point
Gather: calc force based on neighbors’ potential
Move: particles by solving eqn of motion
Shift: particles moved outside local domain
Developed at Princeton Plasma Physics Laboratory, vectorized by Stephane Ethier
GTC: Scatter operation
Particle charge deposited amongst nearest grid points
Calculate force based on neighbors’ potential, then move particle accordingly
Several particles can contribute to same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
Solution: VLEN copies of charge deposition array with reduction after main loop
• However, greatly increases memory footprint (8X)
Since particles are randomly localized, scatter also hinders cache reuse
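The replicated-array workaround can be sketched as follows. This is a simplified illustration, not the GTC routine: each particle deposits all of its charge on a single grid point, `VLEN` is set to a small made-up value, and the loop over lanes is ordinary serial Python that only mimics what one vector instruction would do.

```python
# Scatter with per-lane private copies, then a reduction.
# Assumptions: nearest-grid-point deposit only; VLEN = 4 is illustrative.

VLEN = 4

def deposit_serial(cells, charges, ngrid):
    """Reference scatter: particles update the shared grid one at a time."""
    grid = [0.0] * ngrid
    for c, q in zip(cells, charges):
        grid[c] += q
    return grid

def deposit_with_copies(cells, charges, ngrid, vlen=VLEN):
    """Each vector lane writes to its own copy, so lanes never collide."""
    copies = [[0.0] * ngrid for _ in range(vlen)]   # vlen times the memory
    for start in range(0, len(cells), vlen):
        strip = zip(cells[start:start + vlen], charges[start:start + vlen])
        for lane, (c, q) in enumerate(strip):
            copies[lane][c] += q
    # Reduction after the main loop: sum the private copies
    return [sum(copy[g] for copy in copies) for g in range(ngrid)]

cells = [0, 2, 2, 1, 2, 0, 3, 2, 1]
charges = [0.5, 1.0, 0.25, 2.0, 0.5, 1.5, 1.0, 0.75, 0.25]

serial = deposit_serial(cells, charges, 4)
replicated = deposit_with_copies(cells, charges, 4)
```

Particles 1 and 2 both hit grid point 2 within the same strip; with a shared array that would be a write conflict, while here they land in different copies.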
Cosmology: MADbench
Microwave Anisotropy Dataset Computational Analysis Package
Optimal general algorithm for extracting key cosmological data from Cosmic Microwave Background Radiation (CMB)
CMB encodes fundamental parameters of cosmology: Universe geometry, expansion rate, number of neutrino species
Preserves full complexity of underlying scientific problem
Calculates maximum likelihood two-point angular correlation function
Recasts problem in dense linear algebra: ScaLAPACK
Steps include: mat-mat, mat-vec, chol decomp, redistribution
High I/O requirement - due to out-of-core nature of calculation
Developed at NERSC/CRD by Julian Borrill
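The "chol decomp" step listed above factors a symmetric positive-definite matrix so that linear systems can be solved with two triangular sweeps. The sketch below is a small pure-Python stand-in for what MADbench delegates to ScaLAPACK; the 3x3 matrix and right-hand side are made-up values, and the function names are hypothetical.

```python
import math

def cholesky(c):
    """Return lower-triangular L with L * L^T = c (c symmetric positive definite)."""
    n = len(c)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                l[i][j] = math.sqrt(c[i][i] - s)
            else:
                l[i][j] = (c[i][j] - s) / l[j][j]
    return l

def solve_with_cholesky(l, b):
    """Solve (L * L^T) x = b by forward then backward substitution."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(l[i][k] * y[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x

# Made-up symmetric positive-definite matrix and data vector
C = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
d = [2.0, 1.0, 3.0]

L = cholesky(C)
x = solve_with_cholesky(L, d)
```

At MADbench scale the matrices are distributed across processors and do not fit in memory at once, which is where the redistribution and I/O costs come from.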
CMB Data Analysis
CMB analysis moves
from the time domain - observations - O(10^12)
to the pixel domain - maps - O(10^8)
to the multipole domain - power spectra - O(10^4)
calculating the compressed data and their reduced error bars at each step.
MADbench: Performance Characterization
[Figure: % of theoretical peak (0% to 100%) for Sbg, ES, Phx, and Cmb at P=16 to P=1024; bars show Calc, Calc+MPI, Calc+MPI+I/O, and Calc+MPI+I/O+Rmp]
In-depth analysis shows performance contribution of each component for evaluated architectures
Identifies system-specific balance and opportunities for optimization
Results show that I/O has more effect on ES than Seaborg, due to ratio between I/O performance and peak ALU speed
Demonstrated IPM capabilities to measure MPI overhead on variety of architectures without the need to recompile, at a trivial runtime overhead (<1%)
Overview
          % peak (P=64)                      Speedup ES vs. (P=Max avail)
Code      Pwr3   Pwr4   Altix  ES     X1     Pwr3   Pwr4   Altix  X1
LBMHD     7%     5%     11%    58%    37%    30.6   15.3   7.2    1.5
CACTUS    6%     11%    7%     34%    6%     45.0   5.1    6.4    4.0
GTC       9%     6%     5%     20%    11%    9.4    4.3    4.1    1.1
PARATEC   57%    33%    54%    58%    20%    8.2    3.9    1.4    3.9
MADbench  49%    ---    19%    37%    17%    6.3    ---    3.5    1.4
Average                                      19.9   7.2    4.5    2.4
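The Average row can be recomputed from the per-code speedups. The sketch below assumes the missing MADbench entry for Power4 is simply excluded from that column's mean; the dictionary layout and function name are illustrative.

```python
# Recompute the average ES speedup per system from the overview values.
# Assumption: a missing entry (None) is left out of that column's mean.

speedup_es_vs = {
    "LBMHD":    {"Pwr3": 30.6, "Pwr4": 15.3, "Altix": 7.2, "X1": 1.5},
    "CACTUS":   {"Pwr3": 45.0, "Pwr4": 5.1,  "Altix": 6.4, "X1": 4.0},
    "GTC":      {"Pwr3": 9.4,  "Pwr4": 4.3,  "Altix": 4.1, "X1": 1.1},
    "PARATEC":  {"Pwr3": 8.2,  "Pwr4": 3.9,  "Altix": 1.4, "X1": 3.9},
    "MADbench": {"Pwr3": 6.3,  "Pwr4": None, "Altix": 3.5, "X1": 1.4},
}

def average_speedup(table, system):
    """Arithmetic mean of the available speedups against one system."""
    values = [row[system] for row in table.values() if row[system] is not None]
    return sum(values) / len(values)

averages = {s: average_speedup(speedup_es_vs, s)
            for s in ("Pwr3", "Pwr4", "Altix", "X1")}

for system, value in averages.items():
    print(f"ES vs. {system}: {value:.2f}x")
```

Under that assumption the means come out at 19.90, 7.15, 4.52, and 2.38, consistent with the 19.9, 7.2, 4.5, and 2.4 listed.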
Tremendous potential of vector architectures: 5 codes running faster than ever before
Vector systems allow resolution not possible with scalar arch (regardless of # procs)
Opportunity to perform scientific runs at unprecedented scale
ES shows high raw and much higher sustained performance compared with X1
Evaluation codes contain sufficient regularity in computation for high vector performance
Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
GTC: example of code at odds with data parallelism
Important to characterize full application including I/O effects
Much more difficult to evaluate codes poorly suited for vectorization
Vectors potentially at odds w/ emerging techniques (irregular, multi-physics, multi-scale)
Plan to expand scope of application domains/methods, and examine latest HPC architectures
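The cost of non-vectorizable segments can be estimated with an Amdahl-style model. This model is an assumption added for illustration, not a measurement from the study: a fraction of the work runs on the vector unit at the stated vector:scalar speed ratio (8 or 32) and the remainder runs at scalar speed.

```python
# Amdahl-style estimate of vector speedup with a scalar remainder.
# Assumption: only two execution speeds, scalar and vector.

def effective_speedup(vector_fraction, vector_to_scalar_ratio):
    """Speedup over all-scalar execution when only part of the work vectorizes."""
    scalar_time = 1.0 - vector_fraction
    vector_time = vector_fraction / vector_to_scalar_ratio
    return 1.0 / (scalar_time + vector_time)

for ratio in (8, 32):
    for fraction in (0.50, 0.90, 0.99):
        speedup = effective_speedup(fraction, ratio)
        print(f"ratio {ratio:2d}:1, {fraction:.0%} vectorized -> "
              f"{speedup:.1f}x of a possible {ratio}x")
```

With a 32:1 ratio, even a 10% scalar remainder keeps the achieved speedup below a quarter of the vector unit's potential, which is why poorly vectorized codes fare so badly.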