Performance Characteristics of a Cosmology Package on Leading HPC Architectures Leonid Oliker

Performance Characteristics
of a Cosmology Package
on Leading HPC Architectures
Leonid Oliker
http://crd.lbl.gov/~oliker
Julian Borrill, Jonathan Carter
Lawrence Berkeley National Laboratory
Overview
 Superscalar cache-based architectures dominate HPC market
 Leading architectures are commodity-based SMPs due to generality and perception of
cost effectiveness
 Growing gap between peak & sustained performance is well known in scientific
computing
 Modern parallel vector systems may bridge this gap for many important applications
 In April 2002, the Earth Simulator (ES) became operational:
Peak ES performance > all DOE and DOD systems combined
Demonstrated high sustained performance on demanding scientific apps
 Conducting evaluation study of scientific applications on modern vector systems
 09/2003 MOU between ES and NERSC was completed
First visit to ES center: Dec 2003, second visit Oct 2004 (no remote access)
First international team to conduct performance evaluation study at ES
 Examining best mapping between demanding applications and leading HPC systems: one size does not fit all
Vector Paradigm
 High memory bandwidth
• Allows systems to effectively feed ALUs (high byte to flop ratio)
 Flexible memory addressing modes
• Supports fine grained strided and irregular data access
 Vector Registers
• Hide memory latency via deep pipelining of memory load/stores
 Vector ISA
• Single instruction specifies large number of identical operations
 Vector architectures allow for:
• Reduced control complexity
• Efficient utilization of large numbers of computational resources
• Potential for automatic discovery of parallelism
However: most effective only if sufficient regularity is discoverable in the program structure
• Performance suffers even if a small % of the code is non-vectorizable (Amdahl's Law)
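As an illustration (not part of the original slides), the C sketch below contrasts a loop a vectorizing compiler can map onto vector instructions with a loop-carried recurrence that forces scalar execution; the function names and strides are hypothetical.

#include <stddef.h>

/* Regular, independent iterations with (possibly strided) access: each
 * element of y depends only on its own index, so the loop maps directly
 * onto vector load/multiply/add/store instructions.                     */
void axpy_strided(size_t n, size_t stride, double a,
                  const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i * stride] += a * x[i * stride];
}

/* First-order linear recurrence: iteration i needs the result of
 * iteration i-1, so the loop cannot be vectorized and runs at scalar
 * speed; even a small fraction of such code limits the overall vector
 * speedup (Amdahl's Law).                                               */
void recurrence(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 1; i < n; i++)
        y[i] = a * y[i - 1] + x[i];
}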
Architectural Comparison
Node Type | Where | CPU/Node | Clock (MHz) | Peak (GFlop/s) | Mem BW (GB/s) | Mem BW (byte/flop) | Netwk BW (GB/s/P) | Bisect BW (byte/flop) | MPI Latency (usec) | Network Topology
Power3    | NERSC | 16       | 375         | 1.5            | 1.0           | 0.47               | 0.13              | 0.087                 | 16.3               | Fat-tree
Power4    | ORNL  | 32       | 1300        | 5.2            | 2.3           | 0.44               | 0.13              | 0.025                 | 7.0                | Fat-tree
Altix     | ORNL  | 2        | 1500        | 6.0            | 6.4           | 1.1                | 0.40              | 0.067                 | 2.8                | Fat-tree
ES        | ESC   | 8        | 500         | 8.0            | 32.0          | 4.0                | 1.5               | 0.19                  | 5.6                | Crossbar
X1        | ORNL  | 4        | 800         | 12.8           | 34.1          | 2.7                | 6.3               | 0.088                 | 7.3                | 2D-torus
Custom vector architectures have:
• High memory bandwidth relative to peak
• Superior interconnect: latency, point-to-point, and bisection bandwidth
Another key balance point is I/O performance:
Seaborg I/O: 16 GPFS servers, each with 32 GB main memory (for caching & metadata);
I/O uses the switch fabric, sharing bandwidth with message-passing traffic
ES I/O: each group of 16 nodes has a pool of RAID disks attached via a fibre channel switch (each node has a separate filesystem)
Previous ES visit
          --------- % peak (P=64) ---------    Speedup ES vs. (P=max avail)
Code      Pwr3   Pwr4   Altix   ES     X1      Pwr3    Pwr4   Altix   X1
LBMHD     7%     5%     11%     58%    37%     30.6    15.3   7.2     1.5
CACTUS    6%     11%    7%      34%    6%      45.0    5.1    6.4     4.0
GTC       9%     6%     5%      20%    11%     9.4     4.3    4.1     1.1
PARATEC   57%    33%    54%     58%    20%     8.2     3.9    1.4     3.9
Average                                        23.3    7.2    4.8     2.6
 Tremendous potential of vector architectures: 4 codes running faster than ever before
 Vector systems allow resolution not possible with scalar architectures (regardless of # procs)
 Opportunity to perform scientific runs at unprecedented scale
• Evaluation codes contain sufficient regularity in computation for high vector performance
• However, none of the tested codes contained significant I/O requirements
The Cosmic Microwave Background
 The CMB is a snapshot of the Universe when it first became neutral, 400,000 years after the Big Bang.
 After the Big Bang, the expansion of space cooled the Universe sufficiently for charged electrons and nuclei to combine.
Cosmic - primordial photons filling all of space.
Microwave - redshifted by the expansion of the Universe from 3000K to 3K.
Background - coming from “behind” all astrophysical sources.
CMB Science
The CMB is a unique probe of the very early Universe.
Tiny fluctuations in its temperature & polarization encode
- the fundamental parameters of cosmology
• Universe geometry, expansion rate, number of neutrino species, ionization history, dark matter, cosmological constant
- ultra-high energy physics beyond the Standard Model
CMB Data Analysis
CMB analysis moves
from the time domain - observations - O(10^12)
to the pixel domain - maps - O(10^8)
to the multipole domain - power spectra - O(10^4)
calculating the compressed data and their reduced error bars at each step.
MADCAP: Performance
        Power3             Power4             ES                 X1
P       Gflops/P  %peak    Gflops/P  %peak    Gflops/P  %peak    Gflops/P  %peak
16      0.62      41%      1.5       29%      2.2       27%      4.1       32%
64      0.54      36%      0.81      16%      1.9       23%      2.0       16%
 Porting: ScaLAPACK plus a rewrite of the Legendre polynomial recursion so that large batches are computed in the inner loop (see the sketch below)
 Original ES visit: only partially ported due to the code's requirement of a global file system
 Could not meet minimum parallelization and vectorization thresholds for ES
 All systems sustain a relatively low % of peak considering MADCAP's BLAS3 ops
 Further work was performed for MADbench to reduce I/O, remove system calls, and remove the global file system requirement
 New results collected from the recent ES visit in October 2004
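A minimal sketch (not the actual MADCAP source) of the restructured Legendre recursion referenced above: the short recursion over multipole l sits in the outer loop, while the long, independent loop over a batch of pixel-pair angles sits in the inner loop, which is the form vector hardware needs. The function name and batch layout are illustrative assumptions.

/* pl[l][j] = P_l(x[j]) for l = 0..lmax over a batch of nbatch angles,
 * using the standard recurrence l*P_l = (2l-1)*x*P_{l-1} - (l-1)*P_{l-2}. */
void legendre_batch(int lmax, int nbatch, const double *x, double **pl)
{
    for (int j = 0; j < nbatch; j++) {       /* seed l = 0 and l = 1          */
        pl[0][j] = 1.0;
        pl[1][j] = x[j];
    }
    for (int l = 2; l <= lmax; l++) {        /* short scalar outer loop       */
        double a = (2.0 * l - 1.0) / l;
        double b = (l - 1.0) / l;
        for (int j = 0; j < nbatch; j++)     /* long, vectorizable inner loop */
            pl[l][j] = a * x[j] * pl[l - 1][j] - b * pl[l - 2][j];
    }
}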
IPM Overview
Integrated Performance Monitoring
 portable, lightweight, scalable profiling

fast hash method

profiles MPI topology

profiles code regions

open source
MPI_Pcontrol(1, "W");
...code...
MPI_Pcontrol(-1, "W");
###########################################
# IPMv0.7 :: csnode041 256 tasks ES/ESOS
# madbench.x (completed) 10/27/04/14:45:56
#
#                <mpi>       <user>       <wall> (sec)
#               171.67       352.16       393.80
#...
###############################################
# W
#                <mpi>       <user>       <wall> (sec)
#                36.40       198.00       198.36
#
# call            [time]        %mpi     %wall
# MPI_Reduce      2.395e+01     65.8      6.1
# MPI_Recv        9.625e+00     26.4      2.4
# MPI_Send        2.708e+00      7.4      0.7
# MPI_Testall     7.310e-02      0.2      0.0
# MPI_Isend       2.597e-02      0.1      0.0
###############################################
...
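A minimal, self-contained sketch of how a code region (here the W step) is delimited for IPM using the standard MPI_Pcontrol hook, so no recompilation against IPM headers is required; the MPI_Reduce merely stands in for the real work.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Pcontrol(1, "W");               /* enter IPM region "W"           */
    double local = 1.0, global = 0.0;   /* placeholder for the real work  */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Pcontrol(-1, "W");              /* leave IPM region "W"           */

    MPI_Finalize();
    return 0;
}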
MADbench
Is a lightweight version of the MADCAP maximum likelihood CMB
power spectrum estimation code.
Retains the operational complexity & integrated system requirements
of the full science code.
Has three basic steps - dSdC, invD & W.
Out of core calculation: holds approx 3 of the 50 matrices in memory
Is used for
- computer & file-system procurements.
- realistic scientific code benchmarking and optimization.
- architectural comparisons.
dSdC
This step generates a set of Nb dense, symmetric Np x Np signal correlation derivative matrices dSdCb by Legendre polynomial recursion.
Each matrix is block-cyclic distributed over the 2D processor array with blocksize B.
As each matrix is calculated, each processor writes its subset of the matrix elements to a unique file.
No inter-processor communication is required.
Flops: O(Np^2)   Disk: 8 Nb Np^2 bytes (primarily writing)
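A hedged sketch (not the MADbench source) of the dSdC I/O pattern: each processor writes its block-cyclic share of one dSdCb matrix to its own file, so no inter-processor communication or shared file is needed. The file-naming scheme and the local/nlocal layout are illustrative assumptions.

#include <mpi.h>
#include <stdio.h>

/* Write this processor's nlocal elements of matrix dSdC_b to a unique file. */
void write_local_block(int b, const double *local, size_t nlocal)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char fname[64];
    snprintf(fname, sizeof fname, "dSdC_%03d_%05d.dat", b, rank);

    FILE *fp = fopen(fname, "wb");                 /* one file per processor */
    if (!fp) { perror(fname); MPI_Abort(MPI_COMM_WORLD, 1); }
    fwrite(local, sizeof(double), nlocal, fp);     /* 8 bytes per element    */
    fclose(fp);
}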
invD
This step generates the data correlation matrix D and inverts it.
The dSdCb matrices are read from disk one at a time and progressively accumulated to build the signal correlation matrix S.
A diagonal white noise correlation matrix N is added to S to give the data correlation matrix D, which is inverted using ScaLAPACK to give D^-1.
Each processor writes its subset of the D^-1 matrix elements to a unique file.
Flops: O(Np^3)   Disk: 8 Nb Np^2 bytes (primarily reading)
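A hedged sketch of the invD accumulation under the same assumed file layout as the dSdC sketch above: each processor re-reads its own share of the dSdCb matrices one at a time and sums them into its local piece of S; adding the diagonal noise term and the ScaLAPACK inversion of D are noted in a comment but not shown.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Accumulate this processor's nlocal elements of S = sum_b dSdC_b. */
void accumulate_S(int nbins, double *S_local, size_t nlocal)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(nlocal * sizeof *buf);
    if (!buf) MPI_Abort(MPI_COMM_WORLD, 1);
    for (size_t i = 0; i < nlocal; i++)
        S_local[i] = 0.0;

    for (int b = 0; b < nbins; b++) {              /* one matrix at a time   */
        char fname[64];
        snprintf(fname, sizeof fname, "dSdC_%03d_%05d.dat", b, rank);
        FILE *fp = fopen(fname, "rb");
        if (!fp) { perror(fname); MPI_Abort(MPI_COMM_WORLD, 1); }
        fread(buf, sizeof(double), nlocal, fp);
        fclose(fp);
        for (size_t i = 0; i < nlocal; i++)        /* progressive accumulation */
            S_local[i] += buf[i];
    }
    free(buf);
    /* Next: add the white-noise term N to the locally owned diagonal of S to
     * form D, invert D with ScaLAPACK, and write each processor's share of
     * D^-1 to its own file.                                                  */
}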
W
This step multiplies each dSdCb matrix by D^-1 to form Wb and derives a Newton-Raphson iterative step from this.
Since they are independent, these matrix multiplications can be carried out gang-parallel across Ng gangs of processors.
Each dSdCb matrix is read in by all processors and then redistributed to the target gang.
When all gangs have been given a matrix, they all perform their multiplications simultaneously.
Flops: O(Np^3)   Disk: 8 Nb Np^2 bytes (primarily reading)
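A hedged sketch of the gang-parallel setup for W: the P processors are split into Ng gangs with MPI_Comm_split, and each dSdCb matrix is assigned round-robin to a gang, which performs its multiply over its own communicator. Ng, Nb, and the contiguous-rank split are example choices (P is assumed divisible by Ng).

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int Ng = 4, Nb = 16;                     /* example gang and bin counts   */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int gang = rank / (nprocs / Ng);         /* contiguous blocks of ranks    */
    MPI_Comm gang_comm;
    MPI_Comm_split(MPI_COMM_WORLD, gang, rank, &gang_comm);

    for (int b = 0; b < Nb; b++) {
        if (b % Ng == gang) {
            /* This gang receives dSdC_b and multiplies it by D^-1 over
             * gang_comm to form W_b (the multiplication itself is omitted). */
        }
    }

    MPI_Comm_free(&gang_comm);
    MPI_Finalize();
    return 0;
}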
Parameters
Np - number of pixels (matrix size).
Nb - number of bins (matrix count).
Ng - number of gangs of processors.
B - ScaLAPACK blocksize.
MODIO - I/O concurrency control (only 1 in every MODIO processors does I/O simultaneously); see the sketch after this list.
Running on P processors requires:
- 3 x 8 x Np^2 bytes of memory per gang
- Nb x 8 x Np^2 bytes & Nb x P inodes of disk
- Nb a multiple of Ng to load-balance the gangs.
B & MODIO are architecture-specific optimizations.
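A hedged sketch of the MODIO concurrency control: ranks take turns in MODIO phases so that only one in every MODIO processors touches the filesystem at a time, with a barrier between phases. The staggered_io wrapper and do_io callback are illustrative; the real code interleaves this with its own reads and writes.

#include <mpi.h>

/* do_io is any per-processor read or write (e.g. the dSdC file writes). */
void staggered_io(int MODIO, void (*do_io)(void))
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int phase = 0; phase < MODIO; phase++) {
        if (rank % MODIO == phase)
            do_io();                        /* 1-in-MODIO ranks do I/O now   */
        MPI_Barrier(MPI_COMM_WORLD);        /* others wait between phases    */
    }
}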
dSdC performance
[Chart: dSdC time in seconds, broken into CALC, MPI, and I/O, for Power3 and ES at P = 16, 64, 256, and 1024]
 ES shows constant I/O performance (independent disks)
 Significantly faster computation (30X) due to high memory bandwidth
 Overall only 2.6X faster than Power3 due to I/O overhead
 Power3 has faster write I/O until GPFS contention at P=1024
invD performance
[Chart: invD time in seconds, broken into CALC, MPI, and I/O, for Power3 and ES at P = 16, 64, 256, and 1024]
 I/O remains relatively constant, while MPI overhead and computation grow
 Seaborg I/O reads faster than ES
 Overall ES is only 2.3X faster
W performance
[Chart: W time in seconds, broken into CALC, MPI, and I/O, for Power3 and ES at P = 16, 64, 256, and 1024, including single-gang (G1) and 16-gang (G16) configurations at P = 1024]
 Multi-gang runs significantly reduce MPI overhead (4.8X on ES, 3.3X on Seaborg)
 MPI and CALC grow with the number of processors
 I/O is a trivial part of the W calculation
 Overall ES is 7X faster
Performance overview
[Chart: ES runtimes normalized to Power3 for P = 256 and P = 1024, broken into CALC, MPI, and I/O, with % of peak shown both with and without I/O]
 Overall ES is 5.6X faster & achieves a slightly higher % of peak compared with Seaborg for P=1024
 For P=256 Seaborg shows a higher % of peak, due to the relative I/O vs. peak flop performance
 Although the I/O cost remains relatively high, both systems achieve over 50% of peak
Overview
 New version of MADbench successfully reduced I/O overhead and removed the global file system requirement
 Allowed ES runs up to 1024 processors, achieving over 50% of peak
• Compared with only 23% of peak on 64 processors from the first visit
 Results show that I/O has a larger effect on ES than on Seaborg, due to the ratio between I/O performance and peak ALU speed
 Demonstrated IPM's capability to measure MPI overhead on a variety of architectures without the need to recompile, at trivial runtime overhead (1-2%)
 Continue study of the complex interplay between architecture, interconnect, and I/O
 Currently performing experiments on Columbia and Phoenix
 MADbench and IPM are being prepared for public distribution
 Future CMB analysis will require sparse methods due to the size of the data sets - potentially at odds with vector architectures