UGC 2006 - Standard Performance Evaluation Corporation

Presentation Outline
• A word or two about our program
• Our HPC system acquisition process
• Program benchmark suite
• Evolution of benchmark-based performance metrics
• Where do we go from here?
HPC Modernization Program
HPC Modernization Program Goals
DoD HPC Modernization Program
HPCMP Serves a Large, Diverse DoD User Community
• 519 projects and 4,086 users at approximately 130 sites
• Requirements categorized in 10 Computational Technology Areas (CTAs)
• FY08 non-real-time requirements of 1,108 Habu-equivalents

Users by CTA (4,086 total):
– Computational Fluid Dynamics – 1,572
– Computational Structural Mechanics – 437
– Computational Chemistry, Biology & Materials Science – 408
– Signal/Image Processing – 353
– Computational Electromagnetics & Acoustics – 337
– Climate/Weather/Ocean Modeling & Simulation – 241
– Forces Modeling & Simulation – 182
– Environmental Quality Modeling & Simulation – 147
– Integrated Modeling & Test Environments – 139
– Electronics, Networking, and Systems/C4I – 114
– Other (self-characterized) – 156
High Performance Computing Centers
Strategic Consolidation of Resources
• 4 Major Shared Resource Centers (MSRCs)
• 4 Allocated Distributed Centers (ADCs)
HPCMP Center Resources
Total HPCMP End-of-Year Computational Capabilities, 1993–2007

Computational capability in Habus by fiscal year (TI-XX):

Fiscal Year    MSRCs    ADCs (DCs)
FY 01            4.2       1.1
FY 02            6.3       1.0
FY 03           14.8       3.0
FY 04           23.1       7.9
FY 05           39.2      10.3
FY 06          109.5      13.9
FY 07          247.0      69.6

Note: Computational capability reflects available GFLOPS during fiscal year

HPC Modernization Program (MSRCs)
Systems listed by acquisition year, FY03–FY07. As of: August 2007
HPC Center                            System                       Processors
Army Research Laboratory (ARL)        Linux Networx Cluster        256 PEs
                                      Linux Networx Cluster        2,100 PEs
                                      IBM Opteron Cluster (C)      2,372 PEs
                                      SGI Altix Cluster (C)        256 PEs
                                      Linux Networx Cluster        4,528 PEs
                                      Linux Networx Cluster (C)    3,464 PEs
Aeronautical Systems Center (ASC)     SGI Origin 3900              2,048 PEs
                                      SGI Origin 3900 (C)          128 PEs
                                      IBM P4 (C)                   32 PEs
                                      SGI Altix Cluster            2,048 PEs
                                      HP Opteron                   2,048 PEs
                                      SGI Altix                    9,216 PEs
Engineer Research and                 SGI Origin 3900              1,024 PEs
Development Center (ERDC)             Cray XT3 (FY 07 upgrade)     8,192 PEs
                                      Cray XT4                     8,848 PEs
Naval Oceanographic                   IBM P4+                      3,456 PEs
Office (NAVO)                         IBM 1600 P5 Cluster          3,072 PEs
                                      IBM 1600 P5 Cluster (C)      1,920 PEs
HPC Modernization Program (ADCs)
Systems listed by acquisition year, FY03–FY06. As of: August 2007

HPC Center                            System                       Processors
Army High Performance Computing       Cray X1E                     1,024 PEs
Research Center (AHPCRC)              Cray XT3                     1,128 PEs
Arctic Region Supercomputing          IBM Regatta P4               800 PEs
Center (ARSC)                         Sun x4600                    2,312 PEs
Maui High Performance Computing       Dell PowerEdge 1955          5,120 PEs
Center (MHPCC)
Space & Missile Defense               SGI Origin 3000              736 PEs
Command (SMDC)                        SGI Altix                    128 PEs
                                      West Scientific Cluster      64 PEs
                                      IBM e1300 Cluster            256 PEs
                                      IBM Regatta P4               32 PEs
                                      Cray X1E                     128 PEs
                                      Atipa Linux Cluster          256 PEs
                                      IBM Xeon Cluster             128 PEs
                                      Cray XD1                     288 PEs
Overview of TI-XX Acquisition Process
1. Determination of requirements, usage, and allocations
2. Choose application benchmarks, test cases, and weights
3. Measure benchmark times on the DoD standard system and on existing DoD systems; vendors provide measured and projected times on offered systems
4. Determine performance for each offered system and each existing system on each application test case
5. Combine with center facility requirements, vendor pricing, life-cycle costs for offered systems, and usability/past performance information on offered systems
6. Use optimizer to determine price/performance for each offered system and combination of systems (see the sketch below)
7. Collective acquisition decision
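The slides do not describe the optimizer itself; the following is a minimal Python sketch of the underlying idea, assuming each offered system carries a vendor price and a benchmark-derived performance score in DoD standard-system equivalents, and that the goal is the highest-performance combination within a fixed budget. All system names, prices, and scores here are illustrative, not program data.

    from itertools import combinations

    # Hypothetical offers: (name, price in $M, performance score in
    # DoD standard-system equivalents). Values are made up.
    OFFERS = [
        ("System A", 12.0, 3.1),
        ("System B", 20.0, 5.8),
        ("System C",  7.5, 1.9),
        ("System D", 15.0, 4.2),
    ]

    def best_combination(offers, budget):
        """Exhaustively search all subsets of offered systems and
        return the highest-performance subset that fits the budget."""
        best_perf, best_subset = 0.0, ()
        for r in range(1, len(offers) + 1):
            for subset in combinations(offers, r):
                price = sum(o[1] for o in subset)
                perf = sum(o[2] for o in subset)
                if price <= budget and perf > best_perf:
                    best_perf, best_subset = perf, subset
        return best_subset, best_perf

    subset, perf = best_combination(OFFERS, budget=30.0)
    print([o[0] for o in subset], perf)

A real acquisition optimizer would weigh more factors (facility constraints, life-cycle costs, past performance) and use a proper integer-programming solver rather than brute force.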
TI-08 Synthetic Test Suite
• CPUBench – Floating point execution rate
• ICBench – Interconnect bandwidth and latency
• LANBench – External network interface and connection bandwidth
• MEMBench – Memory bandwidth (MultiMAPS)
• OSBench – Operating system noise (PSNAP from LANL)
• SPIOBench – Streaming parallel I/O bandwidth
TI-08 Application Benchmark Codes
• ICEPIC – Particle-in-cell magnetohydrodynamics code (C, MPI, 60,000 SLOC)
• LAMMPS – Molecular dynamics code (C++, MPI, 45,400 SLOC)
• AMR – Gas dynamics code (C++/Fortran, MPI, 40,000 SLOC)
• AVUS (Cobalt-60) – Turbulent flow CFD code (Fortran, MPI, 19,000 SLOC)
• CTH – Shock physics code (~43% Fortran/~57% C, MPI, 436,000 SLOC)
• GAMESS – Quantum chemistry code (Fortran, MPI, 330,000 SLOC)
• HYCOM – Ocean circulation modeling code (Fortran, MPI, 31,000 SLOC)
• OOCore – Out-of-core solver mimicking electromagnetics code (Fortran, MPI, 39,000 SLOC)
• Overflow2 – CFD code originally developed by NASA (Fortran, MPI, 83,600 SLOC)
• WRF – Multi-agency mesoscale atmospheric modeling code (Fortran and C, MPI, 100,000 SLOC)
Application Benchmark History

Benchmarks by Computational Technology Area and fiscal year:

Computational Structural Mechanics
  FY 2003: CTH | FY 2004: CTH | FY 2005: RFCTH | FY 2006: RFCTH | FY 2007: CTH | FY 2008: CTH

Computational Fluid Dynamics
  FY 2003: Cobalt60, LESLIE3D | FY 2004: Aero, Cobalt60 | FY 2005: Aero, AVUS, Overflow2 | FY 2006: Aero, AVUS, Overflow2 | FY 2007: Aero, AVUS, Overflow2 | FY 2008: AVUS, Overflow2, AMR

Computational Chemistry, Biology, and Materials Science
  FY 2003: GAMESS, NAMD | FY 2004: GAMESS, NAMD | FY 2005: GAMESS | FY 2006: GAMESS, LAMMPS | FY 2007: GAMESS, LAMMPS | FY 2008: GAMESS, LAMMPS

Computational Electromagnetics and Acoustics
  FY 2004: OOCore | FY 2005: OOCore | FY 2006: OOCore | FY 2007: OOCore, ICEPIC | FY 2008: OOCore, ICEPIC

Climate/Weather/Ocean Modeling and Simulation
  FY 2003: NLOM | FY 2004: HYCOM | FY 2005: HYCOM, WRF | FY 2006: HYCOM, WRF | FY 2007: HYCOM, WRF | FY 2008: HYCOM, WRF
Determination of Performance
• Establish a DoD standard benchmark time for each application benchmark case
  – ERDC Cray dual-core XT3 (Sapphire) chosen as the standard DoD system
  – Standard benchmark times on the DoD standard system measured at 128 processors for standard test cases and 512 processors for large test cases
  – Split in weight between standard and large application test cases will be made at 256 processors
• Benchmark timings (at least four on each test case) are requested for systems that meet or beat the DoD standard benchmark times by at least a factor of two (preferably four)
• Benchmark timings may be extrapolated provided they are guaranteed, but at least two actual timings must be provided for each test case
Determination of Performance (cont.)
• Curve fit: Time = A/N + B + C*N
  – N = number of processing cores
  – A/N = time for parallel portion of code (parallel base)
  – B = time for serial portion of code
  – C*N = parallel penalty (parallel overhead)
• Constraints
  – A/N ≥ 0: parallel base time is non-negative
  – Tmin ≥ B ≥ 0: serial time is non-negative and is not greater than the minimum observed time
Determination of Performance (cont.)
• Curve fit approach
  – For each value of B (Tmin ≥ B ≥ 0)
     · Determine A: Time – B = A/N
     · Determine C: Time – (A/N + B) = C*N
     · Calculate fit quality, where (Ni, Ti) = time Ti observed at Ni cores and M = number of observed core counts:

       Fit Quality = 1.0 / [ (1/M) * Σ(i=1..M) (Ti – (A/Ni + B + C*Ni))² ]

  – Select the value of B with the largest fit quality
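A minimal NumPy sketch of this procedure, assuming a simple grid search over B with least-squares solutions for A and C at each step; the function name, the grid granularity, and the small epsilon in the fit-quality denominator are illustrative choices, not taken from the slides.

    import numpy as np

    def fit_time_curve(cores, times, b_steps=200):
        """Fit Time = A/N + B + C*N with A >= 0 and 0 <= B <= Tmin,
        grid-searching B and keeping the fit with the best quality."""
        N = np.asarray(cores, dtype=float)
        T = np.asarray(times, dtype=float)
        M = len(N)
        best = None
        for B in np.linspace(0.0, T.min(), b_steps + 1):
            # Determine A from Time - B = A/N (least squares, clipped so A >= 0)
            A = max(0.0, np.sum((T - B) / N) / np.sum(1.0 / N**2))
            # Determine C from the remaining residual: Time - (A/N + B) = C*N
            C = np.sum((T - A / N - B) * N) / np.sum(N**2)
            resid = T - (A / N + B + C * N)
            # Fit quality = reciprocal of mean squared residual (larger is better)
            quality = 1.0 / (np.sum(resid**2) / M + 1e-12)
            if best is None or quality > best[3]:
                best = (A, B, C, quality)
        return best  # (A, B, C, fit quality)

    # Example with made-up timings at four core counts:
    A, B, C, q = fit_time_curve([128, 256, 384, 512], [100.0, 55.0, 42.0, 36.0])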
Determination of Performance (cont.)
• Calculate score (in DoD standard system equivalents)
  – C = number of compute cores in target system
  – Cbase = number of compute cores in standard system
  – Sbase = number of compute cores in standard execution
  – STM = size-to-match = number of compute cores of target system required to match performance of Sbase cores of the standard system

    Score = (Sbase / Cbase) * (C / STM)
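To make the scoring concrete, here is a hypothetical continuation of the sketch above. The slides define STM only verbally; inverting the fitted curve A/N + B + C*N at the standard system's benchmark time, i.e. solving C*N² + (B – T)*N + A = 0, is one plausible way to compute it. Variable names are illustrative, and the curve-fit coefficient is called c_fit to avoid colliding with C, the target system's core count.

    import numpy as np

    def size_to_match(a_fit, b_fit, c_fit, t_standard):
        """Core count N at which the fitted curve a/N + b + c*N equals
        t_standard: the smallest positive root of
        c*N**2 + (b - t_standard)*N + a = 0."""
        roots = np.roots([c_fit, b_fit - t_standard, a_fit])
        real = roots[np.isreal(roots)].real
        real = real[real > 0]
        if real.size == 0:
            raise ValueError("fitted curve never reaches the standard time")
        return real.min()

    def score(c_target, stm, c_base, s_base):
        """Score in DoD standard-system equivalents:
        Score = (Sbase / Cbase) * (C / STM)."""
        return (s_base / c_base) * (c_target / stm)

    # Illustrative numbers only: a 4,096-core target system that matches
    # the 512-core standard run at STM = 430 cores, against an 8,192-core
    # standard system, scores (512 / 8192) * (4096 / 430) = 0.60.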
AMR Large Test Case on HP Opteron Cluster
[Plot: Relative Performance (Sapphire Eq.) vs. Cores, 256–832; series: Benchmark Data, Benchmark Curve, STM Range]
AMR Large Test Case on SGI Altix
[Plot: Relative Performance (Sapphire Eq.) vs. Cores, 192–832; series: Benchmark Data, Benchmark Curve, STM Range]
AMR Large Test Case on Dell Xeon Cluster
[Plot: Relative Performance (Sapphire Eq.) vs. Cores, 192–832; series: Benchmark Data, Benchmark Curve, STM Range]
Overflow-2 Standard Test Case on Dell Xeon Cluster
[Plot: Relative Performance (Sapphire Eq.) vs. Cores, 0–320; series: Benchmark Data, Benchmark Curve, STM Range]
Overflow-2 Large Test Case on IBM P5+
[Plot: Relative Performance (Sapphire Eq.) vs. Cores, 192–832; series: Benchmark Data, Benchmark Curve, STM Range]
ICEPIC Standard Test Case on SGI Altix
[Plot: Relative Performance (Sapphire Eq.) vs. Cores, 0–448; series: Benchmark Data, Benchmark Curve, STM Range]
ICEPIC Large Test Case on SGI Altix
[Plot: Relative Performance (Sapphire Eq.) vs. Cores, 192–1600; series: Benchmark Data, Benchmark Curve, STM Range, Pseudo Score]
Comparison of HPCMP System Capabilities: FY 2003 - FY 2008
[Bar chart: Habu-equivalents per Processor (0–16) by fiscal year, FY 2003 – FY 2008, for IBM P3, IBM P4, IBM P4+, IBM P5+, HP SC40, HP SC45, HP Opteron Cluster, SGI O3800, SGI O3900, SGI Altix, LNXI Xeon Cluster (3.6), LNXI Xeon Cluster (3.0), Dell Xeon Cluster, and Cray XT3]
What’s Next?
• Continue to evolve the application benchmarks so that they accurately represent the HPCMP computational workload
• Increase profiling and performance modeling to better understand application performance
• Use performance predictions to supplement application benchmark measurements and guide vendors in designing more efficient systems