
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 27, 2011
HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS
BENCHMARKING
CSC 7600 Lecture 4: Benchmarking, Spring 2011

Topics
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary

Basic Performance Metrics
• Time related:
  – Execution time [seconds]
    • wall clock time
    • system and user time
  – Latency
  – Response time
• Rate related:
  – Rate of computation
    • floating point operations per second [flops]
    • integer operations per second [ops]
  – Data transfer (I/O) rate [bytes/second]
• Effectiveness:
  – Efficiency [%]: sustained performance / peak performance
  – Memory consumption [bytes]
  – Productivity [utility/($·second)]
• Performance measures:
  – Sustained performance
  – Peak performance
  – Benchmark sustained performance (e.g., HPL Rmax)

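Many of these metrics can be measured directly in code. Below is a minimal C sketch (not taken from any benchmark suite; the 12000 Mflops peak is an assumed example value) that times a simple vector kernel and reports wall clock time, CPU time, sustained computation rate, and efficiency:

/* Minimal metric-measurement sketch: wall clock vs. CPU time, Mflops,
   and efficiency against an assumed peak. Not a real benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define N 10000000L

int main(void)
{
    double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
    for (long i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    struct timeval t0, t1;
    clock_t c0 = clock();                 /* CPU time                */
    gettimeofday(&t0, NULL);              /* wall clock time         */

    for (long i = 0; i < N; i++)          /* 2 flops per iteration   */
        y[i] = y[i] + 3.14 * x[i];

    gettimeofday(&t1, NULL);
    clock_t c1 = clock();

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    double mflops = 2.0 * N / wall / 1e6; /* rate of computation     */
    double peak = 12000.0;                /* assumed peak [Mflops]   */

    printf("wall %.3f s, cpu %.3f s, %.0f Mflops, efficiency %.1f%%\n",
           wall, cpu, mflops, 100.0 * mflops / peak);
    free(x); free(y);
    return 0;
}

Wall clock and CPU time diverge on shared or I/O-bound workloads, which is why benchmark reports state which of the two is being quoted.
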
What Is a Benchmark?
Benchmark: a standardized problem or test that serves as a basis for evaluation or comparison (as of computer system performance) [Merriam-Webster]
• The term "benchmark" also commonly applies to the specially-designed programs used in benchmarking
• A benchmark should:
  – be domain specific (the more general the benchmark, the less useful it is for anything in particular)
  – be a distillation of the essential attributes of a workload
  – avoid using a single metric to express the overall performance
• Kinds of computational benchmarks:
  – synthetic: specially-created programs that impose a load on a specific component of the system
  – application: derived from a real-world application program

Purpose of Benchmarking
• Provide a tool enabling quantitative comparisons
  – comparison of variations within the same system
  – comparison of distinct systems
• Drive progress
  – enable better engineering by defining measurable and repeatable objectives
• Establish a performance agenda
  – measure release-to-release or version-to-version progress
  – set goals to meet
  – be understandable and useful also to people without expertise in the field (managers, etc.)

Properties of a Good Benchmark
• Relevance: meaningful within the target domain
• Understandability
• Good metric(s): linear, orthogonal, monotonic
• Scalability: applicable to a broad spectrum of hardware/architectures
• Coverage: does not over-constrain the typical environment (does not require any special conditions)
• Acceptance: embraced by users and vendors
• Has to enable comparative evaluation
Adapted from: Standard Benchmarks for Database Systems by Charles Levine, SIGMOD '97

Early Benchmarks
• Whetstone
  – Floating point intensive
• Dhrystone
  – Integer and character string oriented
• Livermore Fortran Kernels
  – "Livermore Loops"
  – Collection of short kernels
• NAS kernel
  – 7 Fortran test kernels for aerospace computation
The sources of the benchmarks listed above are available from: http://www.netlib.org/benchmark

Whetstone
• Originally written in Algol 60 in 1972 at the National Physical Laboratory (UK)
• Named after the Whetstone Algol translator-interpreter on the KDF9 computer
• Measures primarily floating point performance in WIPS: Whetstone Instructions Per Second
• Also raised the issue of the efficiency of different programming languages
• The original Algol code was translated to C and Fortran (single and double precision support), PL/I, APL, Pascal, Basic, Simula and others

Dhrystone
• Synthetic benchmark developed in 1984 by Reinhold Weicker
• The name is a pun on "Whetstone"
• Measures integer and string operation performance, expressed in number of iterations (Dhrystones) per second
• Alternative unit: D-MIPS, normalized to VAX 11/780 performance
• Latest version released: 2.1; includes implementations in C, Ada and Pascal
• Superseded by the SPECint suite
[Photo: Gordon Bell and the VAX 11/780]

Livermore Fortran Kernels (LFK)
• Developed at Lawrence Livermore National Laboratory in 1970
  – also known as Livermore Loops
• Consists of 24 separate kernels:
  – hydrodynamic codes, Cholesky conjugate gradient, linear algebra, equation of state, integration, predictors, first sum and difference, particle in cell, Monte Carlo, linear recurrence, discrete ordinate transport, Planckian distribution and others
  – includes both careful and careless coding practices
• Produces 72 timing results, using 3 different DO-loop lengths for each kernel
• Produces Megaflops values for each kernel and range statistics of the results
• Can be used as a performance test, a compiler accuracy test (checksums are stored in the code) or a hardware endurance test

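For flavor, here is a C rendition in the style of the first Livermore loop (the "hydro fragment"). The loop length and coefficient values are illustrative stand-ins; the official kernels are distributed in Fortran with prescribed data and checksums.

/* Kernel in the style of LFK loop 1 (hydro fragment): 5 flops per
   iteration. Illustrative values, not the official benchmark data. */
#include <stdio.h>

#define N 1001

int main(void)
{
    static double x[N], y[N], z[N + 11];
    double q = 0.5, r = 2.0, t = 3.0;
    for (int k = 0; k < N + 11; k++) z[k] = 2.0;
    for (int k = 0; k < N; k++) y[k] = 1.0;

    for (int k = 0; k < N; k++)            /* the timed loop */
        x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);

    printf("x[0] = %g\n", x[0]);           /* checksum-style sanity check */
    return 0;
}
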
NAS Kernel
• Developed at the Numerical Aerodynamic Simulation Projects Office at NASA Ames
• Focuses on vector floating point performance
• Consists of 7 test kernels in Fortran (approx. 1000 lines of code):
  – matrix multiply
  – complex 2-D FFT
  – Cholesky decomposition
  – block tri-diagonal matrix solver
  – vortex method setup with Gaussian elimination
  – vortex creation with boundary conditions
  – parallel inverse of three matrix pentadiagonals
• Reports performance in Mflops (64-bit precision)

Linpack Overview
• Introduced by Jack Dongarra in 1979
• Based on the LINPACK linear algebra package developed by J. Dongarra, J. Bunch, C. Moler and G.W. Stewart (now superseded by the LAPACK library)
• Solves a dense, regular system of linear equations, using matrices initialized with pseudo-random numbers
• Provides an estimate of the system's effective floating-point performance
• Does not reflect the overall performance of the machine!

Linpack Benchmark Variants
• Linpack Fortran (single processor)
  – N=100
  – N=1000, TPP, best effort
• Linpack's Highly Parallel Computing benchmark (HPL)
• Java Linpack

Fortran Linpack (I)
N=100 case
• Provides results listed in Table 1 of the "Linpack Benchmark Report"
• Absolutely no changes to the code may be made (not even in comments!)
• The matrix generated by the program must be used to run this case
• An external timing function (SECOND) has to be supplied
• Only compiler-induced optimizations are allowed
• Measures the performance of two routines:
  – DGEFA: LU decomposition with partial pivoting
  – DGESL: solves a system of linear equations using the result from DGEFA
• Complexity: O(n³) for DGEFA, O(n²) for DGESL

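A sketch of what the two routines do, written as plain C (column-major storage, unblocked): the actual DGEFA/DGESL are Fortran and call BLAS-1 routines internally, but the loop structure below is the same and makes the O(n³) and O(n²) complexities visible.

/* Simplified sketch of DGEFA/DGESL: unblocked LU factorization with
   partial pivoting, then forward and backward substitution.
   Illustrative only, not the benchmark source. */
#include <math.h>
#include <stdio.h>

/* Factor A (n x n, column-major) in place: O(n^3). */
void dgefa_sketch(double *a, int n, int *ipvt)
{
    for (int k = 0; k < n - 1; k++) {
        int p = k;                            /* find pivot in column k */
        for (int i = k + 1; i < n; i++)
            if (fabs(a[k*n + i]) > fabs(a[k*n + p])) p = i;
        ipvt[k] = p;
        double t = a[k*n + p]; a[k*n + p] = a[k*n + k]; a[k*n + k] = t;
        for (int i = k + 1; i < n; i++)       /* compute multipliers */
            a[k*n + i] /= a[k*n + k];
        for (int j = k + 1; j < n; j++) {     /* update trailing submatrix */
            t = a[j*n + p]; a[j*n + p] = a[j*n + k]; a[j*n + k] = t;
            for (int i = k + 1; i < n; i++)
                a[j*n + i] -= a[k*n + i] * a[j*n + k];
        }
    }
    ipvt[n - 1] = n - 1;
}

/* Solve A x = b using the stored factors: O(n^2). b is overwritten by x. */
void dgesl_sketch(const double *a, int n, const int *ipvt, double *b)
{
    for (int k = 0; k < n - 1; k++) {         /* forward: L y = P b */
        double t = b[ipvt[k]]; b[ipvt[k]] = b[k]; b[k] = t;
        for (int i = k + 1; i < n; i++)
            b[i] -= a[k*n + i] * t;
    }
    for (int k = n - 1; k >= 0; k--) {        /* backward: U x = y */
        b[k] /= a[k*n + k];
        for (int i = 0; i < k; i++)
            b[i] -= a[k*n + i] * b[k];
    }
}

int main(void)
{
    double a[9] = { 4, 2, 1,  2, 5, 2,  1, 2, 6 };  /* columns of A */
    double b[3] = { 7, 9, 9 };
    int ipvt[3];
    dgefa_sketch(a, 3, ipvt);
    dgesl_sketch(a, 3, ipvt, b);
    printf("x = %g %g %g\n", b[0], b[1], b[2]);     /* expect 1 1 1 */
    return 0;
}
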
Fortran Linpack (II)
N=1000 case, Toward Peak Performance (TPP), Best Effort
• Provides results listed in Table 1 of the "Linpack Benchmark Report"
• Any method of solving the linear system may be used (a complete replacement of the factorization/solver code by the user is allowed)
• No restriction on the implementation language of the solver
• The solution must conform to prescribed accuracy, and the matrix used must be the same as the matrix used by the netlib driver

Linpack Fortran Performance on Different Platforms

Computer                                               N=100      N=1000, TPP  Theoretical
                                                       [Mflops]   [Mflops]     Peak [Mflops]
Intel Pentium Woodcrest (1 core, 3 GHz)                3018        6542         12000
NEC SX-8/8 (8 proc., 2 GHz)                            -          75140        128000
NEC SX-8/8 (1 proc., 2 GHz)                            2177       14960         16000
HP ProLiant BL20p G3 (4 cores, 3.8 GHz Intel Xeon)     -           8185         14800
HP ProLiant BL20p G3 (1 core, 3.8 GHz Intel Xeon)      1852        4851          7400
IBM eServer p5-575 (8 POWER5 proc., 1.9 GHz)           -          34570         60800
IBM eServer p5-575 (1 POWER5 proc., 1.9 GHz)           1776        5872          7600
SGI Altix 3700 Bx2 (1 Itanium2 proc., 1.6 GHz)         1765        5953          6400
HP ProLiant BL45p (4 cores, AMD Opteron 854, 2.8 GHz)  -          12860         22400
HP ProLiant BL45p (1 core, AMD Opteron 854, 2.8 GHz)   1717        4191          5600
Fujitsu VPP5000/1 (1 proc., 3.33 ns)                   1156        8784          9600
Cray T932 (32 proc., 2.2 ns)                           1129 (1 proc.)  29360    57600
HP AlphaServer GS1280 7/1300 (8 Alpha proc., 1.3 GHz)  -          14260         20800
HP AlphaServer GS1280 7/1300 (1 Alpha proc., 1.3 GHz)  1122        2132          2600
HP 9000 rp8420-32 (8 PA-8800 proc., 1000 MHz)          -          14150         32000
HP 9000 rp8420-32 (1 PA-8800 proc., 1000 MHz)          843         2905          4000

Data excerpted from the 11-30-2006 LINPACK Benchmark Report at http://www.netlib.org/benchmark/performance.ps

Fortran Linpack Demo

> ./linpack

 Please send the results of this run to:

 Jack J. Dongarra
 Computer Science Department
 University of Tennessee
 Knoxville, Tennessee 37996-1300

 Fax: 865-974-8296
 Internet: [email protected]

 This is version 29.5.04.

     norm. resid      resid           machep          x(1)            x(n)
  1.25501937E+00  1.39332990E-14  2.22044605E-16  1.00000000E+00  1.00000000E+00

    times are reported for matrices of order   100
      dgefa      dgesl      total     mflops       unit      ratio       b(1)
 times for array with leading dimension of 201
  4.890E-04  2.003E-05  5.090E-04  1.349E+03  1.483E-03  9.090E-03 -9.159E-15
  4.860E-04  1.895E-05  5.050E-04  1.360E+03  1.471E-03  9.017E-03  1.000E+00
  4.850E-04  2.003E-05  5.050E-04  1.360E+03  1.471E-03  9.018E-03  1.000E+00
  4.856E-04  1.730E-05  5.029E-04  1.365E+03  1.465E-03  8.981E-03  5.298E+02
 times for array with leading dimension of 200
  4.210E-04  1.800E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03  1.000E+00
  4.200E-04  1.901E-05  4.390E-04  1.564E+03  1.279E-03  7.840E-03  1.000E+00
  4.200E-04  1.699E-05  4.370E-04  1.571E+03  1.273E-03  7.804E-03  1.000E+00
  4.288E-04  1.640E-05  4.452E-04  1.542E+03  1.297E-03  7.950E-03  5.298E+02
 end of tests -- this version dated 05/29/04

Column legend (from the slide's callouts):
• dgefa: time spent in the matrix factorization routine
• dgesl: time spent in the solver
• total: total time (dgefa + dgesl)
• mflops: sustained floating point rate
• unit: "timing" unit (obsolete)
• ratio: fraction of Cray-1S execution time (obsolete)
• b(1): first element of the right hand side vector
• Two different leading dimensions (201 and 200) are used to test the effect of array placement in memory

Reference: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html

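The derived columns follow from the measured times. The C sketch below uses the operation count and constants of the netlib C driver (0.056 s is the Cray-1S reference time for order 100); plugging in the first row's times from the output above reproduces its mflops, unit and ratio values.

/* Sketch of how the Linpack-100 driver derives its reported metrics
   from the measured dgefa/dgesl times. */
#include <stdio.h>

int main(void)
{
    int    n      = 100;
    double tfa    = 4.890e-04;                  /* measured dgefa time [s]  */
    double tsl    = 2.003e-05;                  /* measured dgesl time [s]  */
    double total  = tfa + tsl;
    double ops    = (2.0*n*n*n)/3.0 + 2.0*n*n;  /* floating point op count  */
    double mflops = ops / (1.0e6 * total);      /* sustained rate           */
    double unit   = 2.0 / mflops;               /* "timing" unit (obsolete) */
    double ratio  = total / 0.056;              /* fraction of Cray-1S time */

    printf("total %.3e s  %.3e mflops  unit %.3e  ratio %.3e\n",
           total, mflops, unit, ratio);
    return 0;
}
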
Linpack's Highly Parallel Computing Benchmark (HPL)
• Measures the performance of distributed memory machines
• Used in the "Linpack Benchmark Report" (Table 3) and to determine the order of machines on the Top500 list
• Portable version, written in C
• External dependencies of an HPL installation:
  – MPI-1.1 functionality for inter-node communication
  – BLAS or VSIPL library for simple vector operations such as scaled vector addition (DAXPY: y = αx + y) and inner product (DDOT: a = Σ xᵢyᵢ)
• Ground rules:
  – allows a complete user replacement of the LU factorization and solver steps (the accuracy must satisfy a given bound)
  – same matrix as in the driver program
  – no restrictions on problem size

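For reference, the two BLAS level-1 operations named above reduce to simple loops; a plain C sketch follows (optimized BLAS implementations unroll and vectorize these, and HPL's performance in practice hinges mostly on the BLAS-3 matrix-matrix multiply used in the update step):

/* Plain-C sketch of the two BLAS-1 operations HPL requires. */
#include <stdio.h>

/* DDOT: a = sum_i x_i * y_i */
double ddot_sketch(int n, const double *x, const double *y)
{
    double a = 0.0;
    for (int i = 0; i < n; i++)
        a += x[i] * y[i];
    return a;
}

/* DAXPY: y = alpha*x + y */
void daxpy_sketch(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

int main(void)
{
    double x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
    daxpy_sketch(4, 2.0, x, y);                  /* y = 2x + y = {6,7,8,9} */
    printf("ddot = %g\n", ddot_sketch(4, x, y)); /* 6+14+24+36 = 80 */
    return 0;
}
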
HPL Algorithm
• Data distribution: 2-D block-cyclic
• Algorithm elements:
  – right-looking variant of LU factorization with row partial pivoting featuring multiple look-ahead depths
  – recursive panel factorization with pivot search and column broadcast combined
  – various virtual panel broadcast topologies
  – bandwidth reducing swap-broadcast algorithm
  – backward substitution with look-ahead depth of one
• Floating point operation count: 2/3·n³ + n²

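In the 2-D block-cyclic distribution, the matrix is tiled into NB×NB blocks and block (I, J) is owned by process (I mod P, J mod Q) of the P×Q process grid. A small C sketch of the ownership computation (the NB, P, Q values are illustrative):

/* Which process of a P x Q grid owns global matrix entry (i, j) under a
   2-D block-cyclic distribution with block size NB? Illustrative values. */
#include <stdio.h>

#define NB 32
#define P   2
#define Q   4

void owner(long i, long j, int *prow, int *pcol)
{
    *prow = (int)((i / NB) % P);   /* block row I = i/NB, cyclic over P */
    *pcol = (int)((j / NB) % Q);   /* block col J = j/NB, cyclic over Q */
}

int main(void)
{
    int pr, pc;
    owner(5000, 12345, &pr, &pc);
    printf("entry (5000, 12345) -> process (%d, %d)\n", pr, pc);
    return 0;
}

This mapping balances both the panel factorization and the trailing-matrix update across the grid as the factorization sweeps left to right.
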
HPL Algorithm Elements
Execution flow for a single parameter set:
  Matrix Generation → Panel Factorization → Panel Broadcast → Look-ahead → Update → (repeat until all columns of A are processed) → Backward Substitution → Solution Check
• A right-looking variant of LU factorization is used: in each iteration of the loop, a panel of NB columns is factorized and the trailing submatrix is updated
• The matrix is distributed over a P×Q grid of processors
• Six broadcast algorithms are available for the panel broadcast
Reference: http://www.netlib.org/benchmark/hpl/algorithm.html

HPL Linpack Metrics
• The HPL implementation of the benchmark is run for different problem sizes N on the entire machine
• For a certain problem size Nmax, the cumulative performance in Mflops (reflecting 64-bit addition and multiplication operations) reaches its maximum value, denoted Rmax
• Another metric obtainable from the benchmark is N1/2, the problem size for which half of the maximum performance (Rmax/2) is achieved
• The Rmax value is used to rank supercomputers on the Top500 list; listed along with it are the theoretical peak double precision floating point performance Rpeak of the machine and N1/2

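Since performance grows with problem size until the N×N double precision matrix fills memory, a common first step is to size N from available RAM. A back-of-the-envelope C sketch (the 64 GiB capacity, 80% fill factor and 500 Gflops sustained rate are assumed example numbers, not HPL rules):

/* Back-of-the-envelope HPL sizing: choose N so the 8-byte-per-element
   matrix fills a fraction of memory, then estimate the run time.
   All constants are assumed example values. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double mem_bytes   = 64.0 * 1024 * 1024 * 1024; /* assumed 64 GiB RAM */
    double fill        = 0.80;                      /* leave room for OS  */
    double n           = floor(sqrt(mem_bytes * fill / 8.0));
    double rmax_gflops = 500.0;                     /* assumed sustained  */
    double flops       = (2.0 / 3.0) * n * n * n;   /* dominant term      */

    printf("N ~ %.0f, estimated run time ~ %.0f s\n",
           n, flops / (rmax_gflops * 1e9));
    return 0;
}
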
Machine Parameters Influencing Linpack Performance

Parameter                   Linpack Fortran, N=100   Linpack Fortran, N=1000, TPP   HPL
Processor speed             Yes                      Yes                            Yes
Memory capacity             No                       No (on a modern system)        Yes (for Rmax)
Network latency/bandwidth   No                       No                             Yes
Compiler flags              Yes                      Yes                            Yes

Ten Fastest Supercomputers on the Current Top500 List
[Table not reproduced in transcript] Source: http://www.top500.org/sublist

Java Linpack
• Intended mostly to measure the efficiency of the Java implementation rather than hardware floating point performance
• Solves a dense 500×500 system of linear equations with one right-hand side, Ax = b
• Matrix A is generated randomly
• Vector b is constructed so that all components of the solution x are one
• Uses Gaussian elimination with partial pivoting
• Reports: Mflops, time to solution, norm res (solution accuracy), relative machine precision

HPL Output Example

> mpirun -np 4 xhpl
============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :    5000
NB     :      32
PMAP   : Row-major process mapping
P      :       2        1        4
Q      :       2        4        1
PFACT  :    Left
NBMIN  :       2
NDIV   :       2
RFACT  :    Left
BCAST  :  1ringM
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     2     2               7.14          1.168e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0400275 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0264242 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0051580 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     1     4               7.00          1.192e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0335428 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0221433 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0043224 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     4     1               7.00          1.191e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0426255 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0281393 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0054928 ...... PASSED
============================================================================

Finished      3 tests with the following results:
              3 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

For configuration issues, consult: http://www.netlib.org/benchmark/hpl/faqs.html

Other Parallel Benchmarks
• High Performance Computing Challenge (HPCC) benchmarks
  – devised and sponsored to enrich the benchmarking parameter set
• NAS Parallel Benchmarks (NPB)
  – powerful set of metrics
  – reflects computational fluid dynamics
• NPBIO-MPI
  – stresses the external I/O system

HPC Challenge Benchmark
Consists of 7 individual tests:
• HPL (Linpack TPP): floating point rate of execution of a solver of a linear system of equations
• DGEMM: floating point rate of execution of double precision matrix-matrix multiplication
• STREAM: sustainable memory bandwidth (GB/s) and the corresponding computation rate for a simple vector kernel
• PTRANS (parallel matrix transpose): total capacity of the network using pairwise communicating processes
• RandomAccess: the rate of random integer updates of memory (in GUPS: Giga-Updates Per Second)
• FFT: floating point rate of execution of a double precision complex 1-D Discrete Fourier Transform
• b_eff (effective bandwidth benchmark): latency and bandwidth of a number of simultaneous communication patterns

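To illustrate what the STREAM component measures, here is a minimal triad sketch in C (not the official STREAM code, which repeats each kernel, validates the results, and also runs the copy, scale and add kernels):

/* Minimal STREAM-like triad sketch: a(i) = b(i) + q*c(i). */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (1L << 24)   /* ~16.8M doubles per array */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timeval t0, t1;
    double q = 3.0;
    gettimeofday(&t0, NULL);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    /* 24 bytes touched per iteration: two 8-byte loads, one 8-byte store */
    printf("triad: %.2f GB/s\n", 24.0 * N / sec / 1e9);
    free(a); free(b); free(c);
    return 0;
}
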
Comparison of HPCC Results on Selected Supercomputers
[Bar chart not reproduced: for each metric, each system's result is shown as a percentage of the maximum value achieved across the systems.]
Systems compared:
• "Red Storm" Cray XT3, Sandia (Opteron/Cray custom 3D mesh)
• IBM Blue Gene/L, NNSA (PowerPC 440/IBM custom 3D torus & tree)
• HP XC, Government (Itanium2/Quadrics Elan4)
• NEC SX-8, HLRS (SX-8/IXS crossbar)
• IBM p5-575, LLNL (Power5/IBM HPS)
• Cray X1E, ORNL (X1E/Cray modified 2D torus)
• "Columbia" SGI, NASA (Itanium2/SGI NUMALINK)
• "Emerald" Rackable Systems, AMD (Opteron/Silverstorm Infiniband)
Metrics and maximum values: G-HPL (max=91 Tflops), G-PTRANS (max=4666 GB/s), G-RandomAccess (max=7.69 GUP/s), G-FFTE (max=1763 Gflops), EP-STREAM system (max=62890 GB/s), EP-DGEMM system (max=161885 Gflops), Random Ring Bandwidth (max=0.829 GB/s), Random Ring Latency (max=118.6 μs)
Notes:
• all metrics shown are "higher is better", except for the Random Ring Latency
• machine labels include: machine name (optional), manufacturer and system name, affiliation and (in parentheses) processor/network fabric type

NAS Parallel Benchmarks
• Derived from computational fluid dynamics (CFD) applications
• Consist of five kernels and three pseudo-applications
• Exist in several flavors:
  – NPB 1: original "paper-and-pencil" specification
    • generally proprietary implementations by hardware vendors
  – NPB 2: MPI-based sources distributed by NAS
    • supplements NPB 1
    • can be run with little or no tuning
  – NPB 3: implementations in OpenMP, HPF and Java
    • derived from the NPB-serial version with improved serial code
    • a set of multi-zone benchmarks was added
    • tests implementation efficiency of multi-level and hybrid parallelization methods and tools (e.g. OpenMP with MPI)
  – GridNPB 3: new suite of benchmarks, designed to rate the performance of computational grids
    • includes only four benchmarks, derived from the original NPB
    • written in Fortran and Java
    • uses Globus as grid middleware

NPB 2 Overview
• Multiple problem classes (S, W, A, B, C, D)
• Tests written mainly in Fortran (IS in C):
  – BT (block tri-diagonal solver with 5×5 block size)
  – CG (conjugate gradient approximation to compute the smallest eigenvalue of a sparse, symmetric positive definite matrix)
  – EP ("embarrassingly parallel"; evaluates an integral by means of pseudorandom trials)
  – FT (3-D PDE solver using Fast Fourier Transforms)
  – IS (large integer sort; tests both integer computation speed and network performance)
  – LU (a regular-sparse, 5×5 block lower and upper triangular system solver)
  – MG (simplified multigrid kernel; tests both short and long distance data communication)
  – SP (solves multiple independent systems of non-diagonally dominant, scalar, pentadiagonal equations)
• Sources and reports available from: http://www.nas.nasa.gov/Resources/Software/npb.html

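The structure of EP is worth a sketch: a long stream of independent pseudorandom trials whose results are reduced once at the end. The toy C version below estimates an integral (the area of a quarter circle) rather than generating the Gaussian pairs the real EP kernel specifies, and uses the library rand() instead of the benchmark's prescribed generator:

/* EP-style sketch: independent pseudorandom trials, reduced at the end.
   Toy example, not the actual EP kernel. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long trials = 1L << 24, hits = 0;
    srand(12345);                       /* fixed seed: repeatable workload */
    for (long i = 0; i < trials; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;
    }
    /* each trial is independent: the loop splits across processes with a
       single reduction at the end, hence "embarrassingly parallel" */
    printf("pi ~ %f\n", 4.0 * (double)hits / trials);
    return 0;
}
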
NPBIO-MPI
• Attempts to address the lack of I/O tests in NPB, focusing primarily on file output
• Based on the BTIO (Block Tri-diagonal Input/Output) effort, which extended the BT benchmark with routines writing five double precision numbers for every mesh point to storage
  – runs for 200 iterations, writing every five iterations
  – after all time steps are finished, all data belonging to a single time step must be stored in the same file, sorted by vector components
  – timing must include all data rearrangements required to achieve the specified data layout
• Supported access scenarios (see the sketch after this list):
  – simple: MPI-IO without collective buffering
  – full: MPI-IO with collective buffering
  – fortran: Fortran 77 file operations
  – epio: each process writes its part of the computational domain continuously to a separate file
• Number of processes must be a square
• Problem sizes: class A (64³), class B (102³), class C (162³)
• Several possible results, depending on the benchmarking goal: effective flops, effective output bandwidth or output overhead

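To make the "simple" versus "full" scenarios concrete, here is a hedged C sketch of a shared-file write with MPI-IO; the file name and buffer size are illustrative, and the real BTIO data layout is considerably more involved:

/* Sketch of the "full" access scenario: every rank writes its slice of a
   shared file with a collective MPI-IO call. Illustrative only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_DOUBLES (1L << 20)

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(LOCAL_DOUBLES * sizeof *buf);
    for (long i = 0; i < LOCAL_DOUBLES; i++) buf[i] = (double)rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "btio_sketch.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_Offset off = (MPI_Offset)rank * LOCAL_DOUBLES * sizeof(double);

    /* "full": collective write, eligible for collective buffering; the
       "simple" scenario would call the independent MPI_File_write_at here */
    MPI_File_write_at_all(fh, off, buf, (int)LOCAL_DOUBLES, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}

Collective buffering lets the MPI library aggregate many small, strided writes into large contiguous ones, which is exactly the behavior the "full" scenario is designed to exercise.
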
Sample NPB 2 Results
[Results tables/plots not reproduced in transcript]
Reference: W. Saphir, A. Woo, and M. Yarrow, "The NAS Parallel Benchmarks 2.1 Results", http://www.nas.nasa.gov/News/Techreports/1996/PDF/nas-96-010.pdf

Benchmarking Organizations
• SPEC (Standard Performance Evaluation Corporation)
  – created to satisfy the need for realistic, fair and standardized performance tests
  – motto: "An ounce of honest data is worth more than a pound of marketing hype"
• TPC (Transaction Processing Performance Council)
  – formed primarily due to a lack of reliable database benchmarks

SPEC Benchmark Suite Overview
• The Standard Performance Evaluation Corporation (SPEC) is a non-profit organization founded in 1988, financed by its members: over 60 leading computer and software manufacturers
• SPEC benchmarks are written in a platform-neutral language (typically C or Fortran)
• The code may be compiled using arbitrary compilers, but the sources may not be modified
  – many manufacturers are known to optimize their compilers and/or systems to improve SPEC results
• Benchmarks may be obtained by purchasing a license from SPEC; the results are published on the SPEC website
• Website: http://www.spec.org

SPEC Suite Components
• SPEC CPU2006: combined performance of CPU, memory and compiler
  – CINT2006 (a.k.a. SPECint): integer arithmetic test using compilers, interpreters, word processors, chess programs, etc.
  – CFP2006 (a.k.a. SPECfp): floating point test using physical simulations, 3D graphics, image processing, computational chemistry, etc.
• SPECweb2005: PHP/JSP performance
• SPECviewperf: OpenGL 3D graphics system performance
• SPECapc: several popular 3D-intensive applications
• SPEC HPC2002: high-end parallel computing tests using a quantum chemistry application, weather modeling, and an industrial oil deposit locator
• SPEC OMP2001: OpenMP application performance
• SPECjvm98: performance of a Java client on a Java VM
• SPECjAppServer2004: multi-tier benchmark measuring the performance of J2EE application servers
• SPECjbb2005: server-side Java performance
• SPEC MAIL2001: mail server performance (SMTP and POP)
• SPEC SFS97_R1: NFS server throughput and response time
• Planned: SPEC MPI2006, SPECimap, SPECpower, Virtualization

Sample Results: SPEC CPU2006

System                                                      CINT2006     CFP2006      CINT2006     CFP2006
                                                            speed        speed        rate         rate
                                                            base/peak    base/peak    base/peak    base/peak
Dell Precision 380 (Pentium EE 965 3.73 GHz, 2 cores)       11.6/12.4    -            23.1/-       21.7/-
HP ProLiant DL380 G4 (Xeon 3.8 GHz, 2 cores)                11.4/11.7    -            20.9/-       18.8/-
HP ProLiant DL585 (Opteron 854 2.8 GHz, 2 cores)            11.2/12.7    12.1/13.0    22.3/-       25.2/-
Sun Blade 2500 (1 UltraSPARC IIIi, 1280 MHz)                4.04/-       4.04/-       -            -
HP Integrity rx6600 (Itanium2 1.6 GHz/24 MB, 8 cores)       -            -            94.7/102     69.1/71.4
HP Integrity Superdome (Itanium2 1.6 GHz/24 MB, 128 cores)  -            -            1534/1648    1422/1479
Sun Fire E25K (UltraSPARC IV+ 1500 MHz, 144 cores)          -            -            759/-        904/-
HP Integrity rx6600 (Itanium2 1.6 GHz/24 MB, 2 cores)       14.5/15.7    17.3/18.1    24.1/25.9    -

Notes:
• the base metric requires that the same flags be used when compiling all instances of the benchmark (peak is less strict)
• the speed metrics measure how fast a computer executes a single task, while the rate metrics determine throughput with multiple tasks

TPC
• Governed by the Transaction Processing Performance Council (http://www.tpc.org), founded in 1988
  – members include leading system and microprocessor manufacturers and commercial database developers
  – the council appoints professional affiliates and auditors outside the member group to help fulfill the TPC's mission and validate benchmark results
• Current benchmark flavors:
  – TPC-C for transaction processing (the de-facto standard for On-Line Transaction Processing)
  – TPC-H for decision support systems
  – TPC-App for web services
• Obsolete benchmarks:
  – TPC-A (performance of update-intensive databases)
  – TPC-B (throughput of a system in transactions per second)
  – TPC-D (decision support applications with long running queries against complex data structures)
  – TPC-R (business reporting, decision support)
  – TPC-W (transactional web e-Commerce benchmark)

Top Ten TPC-C Results
[Table not reproduced in transcript; current results are published at http://www.tpc.org]

Presentation of the Results
• Tables
• Graphs:
  – Bar graphs (a)
  – Scatter plots (b)
  – Line plots (c)
  – Pie charts (d)
  – Gantt charts (e)
  – Kiviat graphs (f)
• Enhancements:
  – Error bars, boxes or confidence intervals
  – Broken or offset scales (be careful!)
  – Multiple curves per graph (but avoid overloading)
  – Data labels, colors, etc.
[Thumbnail examples (a)-(f) not reproduced; one plot's legend lists the HPCC metrics G-HPL, G-PTRANS, G-FFTE, G-RanAcc, G-Stream, EPStream]

Kiviat Graph Example
Source: http://www.cse.clrc.ac.uk/disco/DLAB_BENCH_WEB/hpcc/hpcc_kiviat.shtml
Mixed Graph Example
[Chart not reproduced. Applications characterized: WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC, PETSc_FUN3D. Quantities shown: computation fraction, communication fraction, floating point operations, load/store operations, other operations.]
Characterization of NSF/CCT parallel applications on the POWER5 architecture (using data collected by IPM)

Graph Do's and Don'ts
• Good graphs:
  – require minimum effort from the reader
  – maximize information
  – maximize the information-to-ink ratio
  – use commonly accepted practices
  – avoid ambiguity
• Poor graphs:
  – have too many alternatives on a single chart
  – display too many y-variables on a single chart
  – use vague symbols in place of text
  – show extraneous information
  – select scale ranges improperly
  – use a line chart in place of a bar graph
Reference: Raj Jain, The Art of Computer Systems Performance Analysis, Chapter 10

Common Mistakes in Benchmarking
From Chapter 9 of The Art of Computer Systems Performance Analysis by Raj Jain:
• Only average behavior represented in test workload
• Skewness of device demands ignored
• Loading level controlled inappropriately
• Caching effects ignored
• Buffering sizes not appropriate
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not validating measurements
• Not ensuring same initial conditions
• Not measuring transient performance
• Using device utilizations for performance comparisons
• Collecting too much data but doing very little analysis

Misrepresentation of Performance Results on Parallel Computers
1. Quote only 32-bit performance results, not 64-bit results
2. Present performance for an inner kernel, representing it as the performance of the entire application
3. Quietly employ assembly code and other low-level constructs
4. Scale problem size with the number of processors, but omit any mention of this fact
5. Quote performance results projected to the full system
6. Compare your results with scalar, unoptimized code run on another platform
7. When direct run time comparisons are required, compare with an old code on an obsolete system
8. If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation
9. Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar
10. Mutilate the algorithm used in the parallel implementation to match the architecture
11. Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment
12. If all else fails, show pretty pictures and animated videos, and don't talk about performance
Reference: David Bailey, "Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers", Supercomputing Review, Aug 1991, pp. 54-55, http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf

Material For Test
• Basic performance metrics
• Definition of a benchmark in your own words; purpose of benchmarking; properties of a good benchmark
• Linpack: what it is, what it measures, concepts and complexities
• HPL and its metrics
• Machine parameters influencing Linpack performance: compare and contrast
• General knowledge of the HPCC, SPEC and NPB suites
• Kiviat graphs
• Benchmark result interpretation: common mistakes and misrepresentations