Benchmarking - LSU Center for Computation & Technology


High Performance Computing: Concepts, Methods & Means
Performance I: Benchmarking
Prof. Thomas Sterling
Department of Computer Science
Louisiana State University
January 23rd, 2007
Topics
• Definitions, properties and applications
• Early benchmarks
• Everything you ever wanted to know about Linpack (but were afraid to ask)
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
2
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
3
Basic Performance Metrics
• Time related:
  – Execution time [seconds]
    • wall clock time
    • system and user time
  – Latency
  – Response time
• Rate related:
  – Rate of computation
    • floating point operations per second [flops]
    • integer operations per second [ops]
  – Data transfer (I/O) rate [bytes/second]
• Effectiveness:
  – Efficiency [%]
  – Memory consumption [bytes]
  – Productivity [utility/($*second)]
• Modifiers:
  – Sustained
  – Peak
  – Theoretical peak
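To make the time metrics concrete: wall clock time and CPU (system+user) time can diverge on a loaded machine, and a sustained rate is simply operations completed divided by wall clock time. A minimal C sketch (added illustration; the workload and its operation count are arbitrary assumptions):

#include <stdio.h>
#include <time.h>

/* Arbitrary workload: 1e8 iterations, 1 divide + 1 add each (~2e8 flops) */
double work(void) {
    double s = 0.0;
    for (long i = 1; i <= 100000000L; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(void) {
    struct timespec t0, t1;
    clock_t c0 = clock();                    /* CPU time used by this process */
    clock_gettime(CLOCK_MONOTONIC, &t0);     /* wall clock time */
    double s = work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    clock_t c1 = clock();

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    /* sustained rate = operations / wall clock time */
    printf("result=%g  wall=%.3f s  cpu=%.3f s  sustained=%.1f Mflops\n",
           s, wall, cpu, 2e8 / wall / 1e6);
    return 0;
}

On an idle machine wall and cpu agree closely; on a shared machine wall time grows while cpu time does not, which is exactly why benchmarks must state which they report.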
4
What Is a Benchmark?
Benchmark: a standardized problem or test that serves as
a basis for evaluation or comparison (as of computer
system performance) [Merriam-Webster]
• The term “benchmark” also commonly applies to
specially-designed programs used in benchmarking
• A benchmark should:
– be domain specific (the more general the benchmark, the
less useful it is for anything in particular)
– be a distillation of the essential attributes of a workload
– avoid using a single metric to express the overall performance
• Computational benchmark kinds
– synthetic: specially-created programs that impose a load
on a specific component of the system
– application: derived from a real-world application program
5
Purpose of Benchmarking
• To define the playing field
• To provide a tool enabling quantitative
comparisons
• Acceleration of progress
– enable better engineering by defining measurable
and repeatable objectives
• Establishing a performance agenda
– measure release-to-release or version-to-version
progress
– set goals to meet
– be understandable and useful also to people without
expertise in the field (managers, etc.)
6
Properties of a Good
Benchmark
• Relevance: meaningful within the target domain
• Understandability
• Good metric(s): linear, orthogonal, monotonic
• Scalability: applicable to a broad spectrum of hardware/architecture
• Coverage: does not over-constrain the typical environment
• Acceptance: embraced by users and vendors
• Has to enable comparative evaluation
• Limited lifetime: there is a point when additional code modifications or optimizations become counterproductive
Adapted from: Standard Benchmarks for Database Systems by Charles Levine, SIGMOD ‘97
7
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
8
Early Benchmarks
• Whetstone
– Floating point intensive
• Dhrystone
– Integer and character string oriented
• Livermore Fortran Kernels
– “Livermore Loops”
– Collection of short kernels
• NAS kernel
– 7 Fortran test kernels for aerospace computation
The sources of the benchmarks listed above are
available from: http://www.netlib.org/benchmark
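For a taste of what these early suites time, the first of the Livermore Loops ("hydro fragment") is rendered in C below; the loop body follows the published kernel, while the harness, array sizes and iteration count around it are simplifying assumptions:

#include <stdio.h>

#define N 1001

int main(void) {
    static double x[N], y[N], z[N + 11];
    double q = 0.5, r = 1.5, t = 0.25;

    /* arbitrary initialization of the operands */
    for (int k = 0; k < N + 11; k++) z[k] = 0.001 * k;
    for (int k = 0; k < N; k++)      y[k] = 0.002 * k;

    /* Kernel 1 -- hydro fragment, repeated to get a measurable run time */
    for (int iter = 0; iter < 1000; iter++)
        for (int k = 0; k < N; k++)
            x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);

    printf("x[0]=%g x[%d]=%g\n", x[0], N - 1, x[N - 1]);  /* keep result live */
    return 0;
}

Each such kernel stresses a narrow mix of floating point operations and memory accesses, which is why the suite reports a spread of rates rather than one number.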
9
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
14
Linpack Overview
• Introduced by Jack Dongarra in 1979
• Based on LINPACK linear algebra package
developed by J. Dongarra, J. Bunch, C. Moler
and P. Stewart (now superseded by the
LAPACK library)
• Solves a dense, regular system of linear
equations, using matrices initialized with
pseudo-random numbers
• Provides an estimate of the system’s effective
floating-point performance
• Does not reflect the overall performance of
the machine!
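The reported rate follows from the standard Linpack operation count for an LU factorization plus solve, 2n³/3 + 2n² floating point operations, divided by the measured time. A small C illustration (the helper name is an addition, not part of the benchmark):

#include <stdio.h>

/* Standard Linpack operation count: (2/3)n^3 for the factorization
   plus 2n^2 for the triangular solves */
double linpack_mflops(int n, double seconds) {
    double ops = 2.0 * n * n * n / 3.0 + 2.0 * (double)n * n;
    return ops / seconds / 1.0e6;
}

int main(void) {
    /* n = 100 solved in 5.090E-04 s, as in the demo on slide 20 */
    printf("%.0f Mflops\n", linpack_mflops(100, 5.090e-4));  /* prints 1349 */
    return 0;
}

Note the count is fixed by the algorithm, so the same formula is applied regardless of how the implementation actually computes the answer.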
15
Linpack Benchmark
Variants
• Linpack Fortran (single processor)
– N=100
– N=1000, TPP, best effort
• Linpack’s Highly Parallel Computing
benchmark (HPL)
• Java Linpack
16
Linpack Fortran Performance
on Different Platforms
Computer | N=100 [MFlops] | N=1000, TPP [MFlops] | Theoretical Peak [MFlops]
Intel Pentium Woodcrest (1 core, 3 GHz) | 3018 | 6542 | 12000
NEC SX-8/8 (8 proc., 2 GHz) | - | 75140 | 128000
NEC SX-8/8 (1 proc., 2 GHz) | 2177 | 14960 | 16000
HP ProLiant BL20p G3 (4 cores, 3.8 GHz Intel Xeon) | - | 8185 | 14800
HP ProLiant BL20p G3 (1 core, 3.8 GHz Intel Xeon) | 1852 | 4851 | 7400
IBM eServer p5-575 (8 POWER5 proc., 1.9 GHz) | - | 34570 | 60800
IBM eServer p5-575 (1 POWER5 proc., 1.9 GHz) | 1776 | 5872 | 7600
SGI Altix 3700 Bx2 (1 Itanium2 proc., 1.6 GHz) | 1765 | 5953 | 6400
HP ProLiant BL45p (4 cores, AMD Opteron 854, 2.8 GHz) | - | 12860 | 22400
HP ProLiant BL45p (1 core, AMD Opteron 854, 2.8 GHz) | 1717 | 4191 | 5600
Fujitsu VPP5000/1 (1 proc., 3.33ns) | 1156 | 8784 | 9600
Cray T932 (32 proc., 2.2ns) | 1129 (1 proc.) | 29360 | 57600
HP AlphaServer GS1280 7/1300 (8 Alpha proc., 1.3GHz) | - | 14260 | 20800
HP AlphaServer GS1280 7/1300 (1 Alpha proc., 1.3GHz) | 1122 | 2132 | 2600
HP 9000 rp8420-32 (8 PA-8800 proc., 1000MHz) | - | 14150 | 32000
HP 9000 rp8420-32 (1 PA-8800 proc., 1000MHz) | 843 | 2905 | 4000
Data excerpted from the 11-30-2006 LINPACK Benchmark Report at http://www.netlib.org/benchmark/performance.ps
19
Fortran Linpack Demo
> ./linpack

Please send the results of this run to:

Jack J. Dongarra
Computer Science Department
University of Tennessee
Knoxville, Tennessee 37996-1300
Fax: 865-974-8296
Internet: [email protected]

This is version 29.5.04.

   norm. resid      resid           machep           x(1)            x(n)
 1.25501937E+00   1.39332990E-14   2.22044605E-16   1.00000000E+00   1.00000000E+00

times are reported for matrices of order   100
    dgefa       dgesl       total      mflops        unit       ratio        b(1)
times for array with leading dimension of 201
 4.890E-04   2.003E-05   5.090E-04   1.349E+03   1.483E-03   9.090E-03  -9.159E-15
 4.860E-04   1.895E-05   5.050E-04   1.360E+03   1.471E-03   9.017E-03   1.000E+00
 4.850E-04   2.003E-05   5.050E-04   1.360E+03   1.471E-03   9.018E-03   1.000E+00
 4.856E-04   1.730E-05   5.029E-04   1.365E+03   1.465E-03   8.981E-03   5.298E+02
times for array with leading dimension of 200
 4.210E-04   1.800E-05   4.390E-04   1.564E+03   1.279E-03   7.840E-03   1.000E+00
 4.200E-04   1.901E-05   4.390E-04   1.564E+03   1.279E-03   7.840E-03   1.000E+00
 4.200E-04   1.699E-05   4.370E-04   1.571E+03   1.273E-03   7.804E-03   1.000E+00
 4.288E-04   1.640E-05   4.452E-04   1.542E+03   1.297E-03   7.950E-03   5.298E+02
end of tests -- this version dated 05/29/04

Slide callouts (annotating the output above):
• dgefa: time spent in the matrix factorization routine (dgefa)
• dgesl: time spent in the solver (dgesl)
• total: total time (dgefa+dgesl)
• mflops: sustained floating point rate
• unit: “timing” unit (obsolete)
• ratio: fraction of Cray-1S execution time (obsolete)
• b(1): first element of the right hand side vector
• Two different leading dimensions (201 and 200) are used to test the effect of array placement in memory
Reference: http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
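The driver behind this output essentially does: generate a pseudo-random system, time the factorization and the triangular solve, and report Mflops. A condensed C sketch of that logic, using the modern LAPACKE equivalents of dgefa/dgesl (an assumption for brevity; the original driver is Fortran, and lapack_int is taken to be int as in common LP64 builds):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int n = 100, lda = 201;              /* leading dimension 201, as in the demo */
    double *a = malloc(sizeof(double) * (size_t)lda * n);
    double *b = malloc(sizeof(double) * n);
    int *ipiv = malloc(sizeof(int) * n);

    /* matgen: pseudo-random matrix; b = row sums, so the solution x ~ (1,...,1) */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            a[(size_t)j * lda + i] = (double)rand() / RAND_MAX - 0.5;
    for (int i = 0; i < n; i++) {
        b[i] = 0.0;
        for (int j = 0; j < n; j++) b[i] += a[(size_t)j * lda + i];
    }

    double t0 = now();
    LAPACKE_dgetrf(LAPACK_COL_MAJOR, n, n, a, lda, ipiv);            /* cf. dgefa */
    double t1 = now();
    LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', n, 1, a, lda, ipiv, b, n); /* cf. dgesl */
    double t2 = now();

    double ops = 2.0 * n * n * n / 3.0 + 2.0 * (double)n * n;
    printf("factor %.3e s  solve %.3e s  total %.3e s  %.0f Mflops  x(1) = %g\n",
           t1 - t0, t2 - t1, t2 - t0, ops / (t2 - t0) / 1e6, b[0]);
    free(a); free(b); free(ipiv);
    return 0;
}

Printing x(1) (expected to be ~1.0) mirrors the sanity check in the real driver: a fast but wrong answer must not count.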
20
Linpack’s Highly Parallel
Computing Benchmark (HPL)
• Measures the performance of distributed memory
machines
• Used in the “Linpack Benchmark Report” (Table 3)
and to determine the order of machines on the Top500
list
• Portable implementation, written in C
• External dependencies:
– MPI-1.1 functionality for inter-node communication
– BLAS or VSIPL library for simple vector operations such as
scaled vector addition (DAXPY: y = αx + y) and inner dot
product (DDOT: a = Σᵢ xᵢyᵢ)
• Ground rules:
– allows a complete user replacement of the LU factorization
and solver steps (the accuracy must satisfy given bound)
– same matrix as in the driver program
– no restrictions on problem size
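For reference, the two BLAS operations named above are one-line loops; plain C versions follow (vendor BLAS libraries supply heavily tuned equivalents, which is precisely why HPL calls out to them):

#include <stdio.h>

/* DAXPY: y <- alpha*x + y */
void daxpy(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* DDOT: returns sum_i x_i * y_i */
double ddot(int n, const double *x, const double *y) {
    double a = 0.0;
    for (int i = 0; i < n; i++)
        a += x[i] * y[i];
    return a;
}

int main(void) {
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    daxpy(3, 2.0, x, y);                  /* y becomes (6, 9, 12) */
    printf("ddot = %g\n", ddot(3, x, y)); /* 1*6 + 2*9 + 3*12 = 60 */
    return 0;
}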
21
HPL Linpack Metrics
• The HPL implementation of the benchmark is run for
different problem sizes N on the entire machine
• For a certain problem size Nmax, the cumulative
performance in Mflops (reflecting 64-bit addition and
multiplication operations) reaches its maximum value
denoted as Rmax
• Another metric obtainable from the benchmark is N1/2,
the problem size for which half of the maximum
performance (Rmax/2) is achieved
• The Rmax value is used to rank supercomputers in the
Top500 list; listed along with this number are the
theoretical peak double precision floating point
performance Rpeak of the machine and N1/2
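Reading Rmax, Nmax and N1/2 off a set of HPL runs is mechanical; a small C sketch with made-up sample data (the helper name and the plain linear search, rather than interpolation, are assumptions):

#include <stdio.h>

/* Find Rmax (peak measured rate at Nmax) and N1/2 (smallest N reaching
   Rmax/2) from HPL samples ordered by increasing problem size N. */
void hpl_metrics(int count, const int n[], const double gflops[]) {
    double rmax = 0.0;
    int nmax = 0;
    for (int i = 0; i < count; i++)
        if (gflops[i] > rmax) { rmax = gflops[i]; nmax = n[i]; }
    int nhalf = nmax;
    for (int i = 0; i < count; i++)
        if (gflops[i] >= rmax / 2.0) { nhalf = n[i]; break; }
    printf("Rmax = %.1f Gflops at Nmax = %d, N1/2 = %d\n", rmax, nmax, nhalf);
}

int main(void) {  /* made-up sample data, for illustration only */
    int    n[] = {1000, 2000, 5000, 10000, 20000};
    double g[] = { 2.1,  4.8,  8.9,  10.8,  11.2};
    hpl_metrics(5, n, g);   /* Rmax = 11.2 at 20000, N1/2 = 5000 */
    return 0;
}

A small N1/2 relative to Nmax indicates a machine that reaches its asymptotic rate quickly.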
24
Machine Parameters Influencing
Linpack Performance
Parameter | Linpack Fortran, N=100 | Linpack Fortran, N=1000, TPP | HPL
Processor speed | Yes | Yes | Yes
Memory capacity | No | No (modern system) | Yes (for Rmax)
Network latency/bandwidth | No | No | Yes
Compiler flags | Yes | Yes | Yes
25
Ten Fastest Supercomputers
On Current Top500 List
# | Computer | Site | Processors | Rmax [Gflops] | Rpeak [Gflops]
1 | IBM Blue Gene/L | DoE/NNSA/LLNL (USA) | 131,072 | 280,600 | 367,000
2 | Cray Red Storm | Sandia (USA) | 26,544 | 101,400 | 127,411
3 | IBM BGW | IBM T. Watson Research Center (USA) | 40,960 | 91,290 | 114,688
4 | IBM ASC Purple | DoE/NNSA/LLNL (USA) | 12,208 | 75,760 | 92,781
5 | IBM Mare Nostrum | Barcelona Supercomputing Center (Spain) | 10,240 | 62,630 | 94,208
6 | Dell Thunderbird | NNSA/Sandia (USA) | 9,024 | 53,000 | 64,973
7 | Bull Tera-10 | Commissariat a l’Energie Atomique (France) | 9,968 | 52,840 | 63,795
8 | SGI Columbia | NASA/Ames Research Center (USA) | 10,160 | 51,870 | 60,960
9 | NEC/Sun Tsubame | GSIC Center, Tokyo Institute of Technology (Japan) | 11,088 | 47,380 | 82,125
10 | Cray Jaguar | Oak Ridge National Laboratory (USA) | 10,424 | 43,480 | 54,205
Source: http://www.top500.org/list/2006/11/100
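When interpreting the table it helps to compute the HPL efficiency Rmax/Rpeak, which the list itself does not print; a quick C illustration using two rows from the table above:

#include <stdio.h>

int main(void) {
    /* #1 IBM Blue Gene/L and #9 NEC/Sun Tsubame, Gflops (from the table) */
    const char *name[] = {"Blue Gene/L", "Tsubame"};
    double rmax[]  = {280600.0, 47380.0};
    double rpeak[] = {367000.0, 82125.0};
    for (int i = 0; i < 2; i++)
        printf("%-12s efficiency = %.1f%%\n",
               name[i], 100.0 * rmax[i] / rpeak[i]);
    /* prints ~76.5% and ~57.7%: ranking by Rmax alone hides such differences */
    return 0;
}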
26
HPL Demo
> mpirun -np 4 xhpl
============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      : 5000
NB     : 32
PMAP   : Row-major process mapping
P      : 2  1  4
Q      : 2  4  1
PFACT  : Left
NBMIN  : 2
NDIV   : 2
RFACT  : Left
BCAST  : 1ringM
DEPTH  : 0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     2     2               7.14          1.168e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0400275 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0264242 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0051580 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     1     4               7.00          1.192e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0335428 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0221433 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0043224 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR01L2L2        5000    32     4     1               7.00          1.191e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0426255 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0281393 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0054928 ...... PASSED
============================================================================
Finished      3 tests with the following results:
              3 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================

For configuration issues, consult:
http://www.netlib.org/benchmark/hpl/faqs.html
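The parameters echoed above come from the HPL.dat input file; a sketch consistent with this run (the field layout follows the file distributed with HPL, but the exact values here are reconstructed from the output and should be treated as an assumption):

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
5000         Ns
1            # of NBs
32           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
3            # of process grids (P x Q)
2 1 4        Ps
2 4 1        Qs
16.0         threshold
1            # of panel fact
0            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
0            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

Note how the three process grids (2x2, 1x4, 4x1) requested here produce the three test sections in the output above.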
28
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
29
Other Parallel Benchmarks
• High Performance Computing
Challenge (HPCC) benchmarks
– Devised and sponsored to enrich the
benchmarking parameter set
• NAS Parallel Benchmarks (NPB)
– Powerful set of metrics
– Reflects computational fluid dynamics
• NPBIO-MPI
– Stresses external I/O system
30
HPC Challenge
Benchmark
Consists of 7 individual tests:
• HPL (Linpack TPP): floating point rate of execution of a solver
of a linear system of equations
• DGEMM: floating point rate of execution of double precision
matrix-matrix multiplication
• STREAM: sustainable memory bandwidth (GB/s) and the
corresponding computation rate for a simple vector kernel
• PTRANS (parallel matrix transpose): total capacity of the
network using pairwise communicating processes
• RandomAccess: the rate of integer random updates of memory
(in GUPS: Giga-Updates Per Second)
• FFT: floating point rate of execution of double precision complex
1-D Discrete Fourier Transform
• b_eff (effective bandwidth benchmark): latency and bandwidth
of a number of simultaneous communication patterns
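Of these, STREAM is the simplest to picture: it times long vector loops and converts bytes moved into GB/s. A single-threaded C sketch of the "triad" kernel (a simplification; the official benchmark runs four kernels with repetitions and verification):

#include <stdio.h>
#include <time.h>

#define N 10000000   /* 3 arrays x 8 bytes x 10M elements = 240 MB per pass */

static double a[N], b[N], c[N];

int main(void) {
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    double q = 3.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];          /* STREAM triad */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* bandwidth: two streams read (b, c) plus one written (a) */
    printf("triad: %.3f s, %.2f GB/s\n", s, 3.0 * 8.0 * N / s / 1e9);
    return 0;
}

The arrays are deliberately far larger than any cache, so the loop measures memory bandwidth rather than arithmetic.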
31
Comparison of HPCC Results on
Selected Supercomputers
"Red Storm" Cray XT3, Sandia (Opteron/Cray custom 3D mesh)
IBM Blue Gene/L, NNSA (PowerPC 440/IBM custom 3D torus & tree)
HP XC, Government (Itanium2/Quadrics Elan4)
NEC SX-8, HLRS (SX-8/IXS crossbar)
IBM p5-575, LLNL (Power5/IBM HPS)
Cray X1E, ORNL (X1E/Cray modified 2D torus)
"Columbia" SGI, NASA (Itanium2/SGI NUMALINK)
"Emerald" Rackable Systems, AMD (Opteron/Silverstorm Infiniband)
Percentage of the maximum value
100
80
60
40
20
0
G-HPL (max=91
G-PTRANS
Tflops)
(max=4666GB/s)
G-Random
Access
(max=7.69
GUP/s)
G-FFTE
(max=1763
Gflops)
EP-STREAM
system
(max=62890
GB/s)
EP-DGEMM
system
(max=161885
Gflops)
Random Ring
Bandwidth
(max=0.829
GB/s)
Random Ring
Latency
(max=118.6 μs)
Notes:
• all metrics shown are “higher-better”, except for the Random Ring Latency
• machine labels include: machine name (optional), manufacturer and system name, affiliation and (in parentheses)
processor/network fabric type
32
NAS Parallel Benchmarks
• Derived from computational fluid dynamics (CFD) applications
• Consist of five kernels and three pseudo-applications
• Exist in several flavors:
– NPB 1: original paper-and-pencil specification
• generally proprietary implementations by hardware vendors
– NPB 2: MPI-based sources distributed by NAS
• supplements NPB 1
• can be run with little or no tuning
– NPB 3: implementations in OpenMP, HPF and Java
• derived from NPB-serial version with improved serial code
• a set of multi-zone benchmarks was added
• test implementation efficiency of multi-level and hybrid parallelization
methods and tools (e.g. OpenMP with MPI)
– GridNPB 3: new suite of benchmarks, designed to rate the
performance of computational grids
• includes only four benchmarks, derived from the original NPB
• written in Fortran and Java
• Globus as grid middleware
33
NPB 2 Overview
• Multiple problem classes (S, W, A, B, C, D)
• Tests written mainly in Fortran (IS in C):
– BT (block tri-diagonal solver with 5x5 block size)
– CG (conjugate gradient approximation to compute the smallest
eigenvalue of a sparse, symmetric positive definite matrix)
– EP (“embarrassingly parallel”; evaluates an integral by means of
pseudorandom trials)
– FT (3-D PDE solver using Fast Fourier Transforms)
– IS (large integer sort; tests both integer computation speed and
network performance)
– LU (a regular-sparse, 5x5 block lower and upper triangular system
solver)
– MG (simplified multigrid kernel; tests both short and long distance
data communication)
– SP (solves multiple independent systems of non-diagonally
dominant, scalar, pentadiagonal equations)
• Sources and reports available from:
http://www.nas.nasa.gov/Resources/Software/npb.html
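As a flavor of the EP kernel's structure: independent pseudorandom trials with no communication until the final tally, so it scales almost perfectly. A drastically simplified serial analogue (estimating π by Monte Carlo; the real EP generates Gaussian deviates with a prescribed generator and counts them in annuli):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    long trials = 10000000, hits = 0;
    srand(12345);                       /* fixed seed: reproducible workload */
    for (long i = 0; i < trials; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;   /* point inside quarter circle */
    }
    /* trials are independent: trivially split across processes, combine hits */
    printf("pi ~= %.6f\n", 4.0 * hits / trials);
    return 0;
}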
34
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
37
Benchmarking Organizations
• SPEC (Standard Performance Evaluation Corporation)
– Created to satisfy the need for realistic, fair
and standardized performance tests
– Motto: “An ounce of honest data is worth
more than a pound of marketing hype”
• TPC (Transaction Processing Performance Council)
– Formed primarily due to lack of reliable
database benchmarks
38
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
44
Presentation of the Results
• Tables
• Graphs
  – Bar graphs (a)
  – Scatter plots (b)
  – Line plots (c)
  – Pie charts (d)
  – Gantt charts (e)
  – Kiviat graphs (f)
• Enhancements
  – Error bars, boxes or confidence intervals
  – Broken or offset scales (be careful!)
  – Multiple curves per graph (but avoid overloading)
  – Data labels, colors, etc.
[Figure: thumbnail examples of graph types (a)-(f); the Kiviat graph example (f) plots HPCC metrics: G-HPL, G-PTRANS, G-FFTE, G-RanAcc, G-Stream, EP-Stream]
Source: http://www.cse.clrc.ac.uk/disco/DLAB_BENCH_WEB/hpcc/hpcc_kiviat.shtml
46
Mixed Graph Example
[Figure: per-application stacked bars showing the computation fraction vs. communication fraction, and the operation mix (floating point operations, load/store operations, other operations), for WRF, OOCORE, MILC, PARATEC, HOMME, BSSN_PUGH, Whisky_Carpet, ADCIRC and PETSc_FUN3D]
Characterization of NSF/CCT parallel applications on POWER5 architecture
(using data collected by IPM)
47
Graph Do’s and Don’ts
• Good graphs:
  – Require minimum effort from the reader
  – Maximize information
  – Maximize information-to-ink ratio
  – Use commonly accepted practices
  – Avoid ambiguity
• Poor graphs:
  – Have too many alternatives on a single chart
  – Display too many y-variables on a single chart
  – Use vague symbols in place of text
  – Show extraneous information
  – Select scale ranges improperly
  – Use a line chart instead of a bar graph
Reference: Raj Jain, The Art of Computer Systems Performance Analysis, Chapter 10
48
Common Mistakes in
Benchmarking
From Chapter 9 of The Art of Computer Systems
Performance Analysis by Raj Jain:
• Only average behavior represented in test workload
• Skewness of device demands ignored
• Loading level controlled inappropriately
• Caching effects ignored
• Buffering sizes not appropriate
• Inaccuracies due to sampling ignored
• Ignoring monitoring overhead
• Not validating measurements
• Not ensuring same initial conditions
• Not measuring transient performance
• Using device utilizations for performance comparisons
• Collecting too much data but doing very little analysis
49
Misrepresentation of Performance
Results on Parallel Computers
• Quote only 32-bit performance results, not 64-bit results
• Present performance for an inner kernel, representing it as the performance of the entire application
• Quietly employ assembly code and other low-level constructs
• Scale problem size with the number of processors, but omit any mention of this fact
• Quote performance results projected to the full system
• Compare your results with scalar, unoptimized code run on another platform
• When direct run time comparisons are required, compare with an old code on an obsolete system
• If MFLOPS rates must be quoted, base the operation count on the parallel implementation, not on the best sequential implementation
• Quote performance in terms of processor utilization, parallel speedups or MFLOPS per dollar
• Mutilate the algorithm used in the parallel implementation to match the architecture
• Measure parallel run times on a dedicated system, but measure conventional run times in a busy environment
• If all else fails, show pretty pictures and animated videos, and don't talk about performance

Reference: David Bailey, “Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers”, Supercomputing Review, Aug 1991, pp. 54-55, http://crd.lbl.gov/~dhbailey/dhbpapers/twelve-ways.pdf
50
• Definitions, properties and applications
• Early benchmarks
• Linpack
• Other parallel benchmarks
• Organized benchmarking
• Presentation and interpretation of results
• Summary
51
Knowledge Factors &
Skills
• Knowledge factors:
– benchmarking and metrics
– performance factors
– Top500 list
• Skill set:
– determine state of system resources and
manipulate them
– acquire, run and measure benchmark
performance
– launch user application codes
52
Material For Test
• Basic performance metrics (slide 4)
• Definition of benchmark in own words; purpose of
benchmarking; properties of good benchmark (slides
5, 6, 7)
• Linpack: what it is, what it measures, concepts
and complexities (slides 15, 17, 18)
• HPL: (slides 21 and 24)
• Linpack compare and contrast (slide 25)
• General knowledge about HPCC and NPB suites
(slides 31 and 34)
• Benchmark result interpretation (slides 49, 50)
53