
2007 SPEC Benchmark Workshop
January 21, 2007
Radisson Hotel Austin North
The HPC Challenge Benchmark:
A Candidate for Replacing
LINPACK in the TOP500?
Jack Dongarra
University of Tennessee
and
Oak Ridge National Laboratory
1
Outline - The HPC Challenge Benchmark:
A Candidate for Replacing Linpack in the TOP500?
 Look at LINPACK
 Brief discussion of DARPA HPCS Program
 HPC Challenge Benchmark
 Answer the Question
2
What Is LINPACK?
 Most people think LINPACK is a benchmark.
 LINPACK is a package of mathematical
software for solving problems in linear algebra,
mainly dense systems of linear equations.
 The project had its origins in 1974
 LINPACK: “LINear algebra PACKage”
 Written in Fortran 66
3
Computing in 1974
 High Performance Computers:
 IBM 370/195, CDC 7600, Univac 1110, DEC
PDP-10, Honeywell 6030
 Fortran 66
 Run efficiently
 BLAS (Level 1)
 Vector operations
 Trying to achieve software portability
 LINPACK package was released in 1979
 About the time of the Cray 1
4
The Accidental Benchmarker
 Appendix B of the Linpack Users’ Guide
 Designed to help users extrapolate execution
time for Linpack software package
 First benchmark report from 1977;
 Cray 1 to DEC PDP-10
Dense matrices
Linear systems
Least squares problems
Singular values
5
LINPACK Benchmark?
 The LINPACK Benchmark is a measure of a computer’s
floating-point rate of execution for solving Ax=b.
 It is determined by running a computer program that solves a
dense system of linear equations.
 Information is collected and available in the LINPACK
Benchmark Report.
 Over the years the characteristics of the benchmark have
changed a bit.
 In fact, there are three benchmarks included in the Linpack
Benchmark report.
 LINPACK Benchmark since 1977
 Dense linear system solve with LU factorization using partial
pivoting
 Operation count is: (2/3)n^3 + O(n^2)
 Benchmark Measure: MFlop/s
 Original benchmark measures the execution rate for a Fortran
program on a matrix of size 100x100.
6
For Linpack with n = 100
 Not allowed to touch the code.
 Only set the optimization in the compiler and run.
 Provide historical look at computing
 Table 1 of the report (52 pages of 95 page report)
 http://www.netlib.org/benchmark/performance.pdf
7
Linpack Benchmark Over Time
 In the beginning there was only the Linpack 100 Benchmark (1977)
 n=100 (80KB); size that would fit in all the machines
 Fortran; 64 bit floating point arithmetic
 No hand optimization (only compiler options); source code available
 Linpack 1000 (1986)
 n=1000 (8MB); wanted to see higher performance levels
 Any language; 64 bit floating point arithmetic
 Hand optimization OK
 Linpack Table 3 (Highly Parallel Computing - 1991) (Top500; 1993)
 Any size (n as large as you can; n=10^6; 8 TB; ~6 hours);
 Any language; 64 bit floating point arithmetic
 Hand optimization OK
 Strassen’s method not allowed (confuses the operation count and rate)
 Reference implementation available
 In all cases results are verified by looking at: ||Ax - b|| / (||A|| ||x|| n) = O(1)
 Operation count for factorization: (2/3)n^3 - (1/2)n^2; solve: 2n^2
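To make the verification step concrete, a minimal serial sketch of computing such a scaled residual in C is shown below; the routine name and the division by machine epsilon are choices made for this illustration, not the benchmark's own code.

#include <stddef.h>
#include <math.h>
#include <float.h>

/* Scaled residual ||Ax - b||_inf / (||A||_inf * ||x||_inf * n * eps).
   A is an n x n matrix in row-major order, x the computed solution, b the
   right-hand side.  A value of O(1) indicates a solution accurate to
   machine precision. */
double scaled_residual(int n, const double *A, const double *x, const double *b)
{
    double r_norm = 0.0, a_norm = 0.0, x_norm = 0.0;
    for (int i = 0; i < n; i++) {
        double ri = -b[i], row = 0.0;
        for (int j = 0; j < n; j++) {
            ri  += A[(size_t)i * n + j] * x[j];   /* accumulate (Ax - b)_i         */
            row += fabs(A[(size_t)i * n + j]);    /* row sum for the inf-norm of A */
        }
        if (fabs(ri) > r_norm)   r_norm = fabs(ri);
        if (row > a_norm)        a_norm = row;
        if (fabs(x[i]) > x_norm) x_norm = fabs(x[i]);
    }
    return r_norm / (a_norm * x_norm * n * DBL_EPSILON);
}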
8
Motivation for Additional Benchmarks
Linpack Benchmark
 Good
 One number
 Simple to define & easy to rank
 Allows problem size to change
with machine and over time
 Stresses the system with a run
of a few hours
 Bad
 Emphasizes only “peak” CPU
speed and number of CPUs
 Does not stress local bandwidth
 Does not stress the network
 Does not test gather/scatter
 Ignores Amdahl’s Law (Only
does weak scaling)
 Ugly
 MachoFlops
 Benchmarketeering hype
 From Linpack Benchmark and
Top500: “no single number can
reflect overall performance”
 Clearly need something more
than Linpack
 HPC Challenge Benchmark
 Test suite stresses not only
the processors, but the
memory system and the
interconnect.
 The real utility of the HPCC
benchmarks is that
architectures can be described
with a wider range of metrics
than just Flop/s from Linpack.
9
At The Time The Linpack Benchmark Was
Created …
 If we think about computing in late 70’s
 Perhaps the LINPACK benchmark was a
reasonable thing to use.
 Memory wall, not so much a wall but a step.
 In the 70’s, things were more in balance
 The memory kept pace with the CPU
 n cycles to execute an instruction, n
cycles to bring in a word from memory
 Showed compiler optimization
 Today provides a historical base of data
10
Many Changes
 Many changes in our hardware over the past 30 years
 Superscalar, Vector, Distributed Memory, Shared Memory, Multicore, …
 While there have been some changes to the Linpack Benchmark, not all of them reflect the advances made in the hardware.
 Today's memory hierarchy is much more complicated.
[Chart: Top500 systems by architecture, 1993-2006: Single Proc., SIMD, SMP, MPP, Cluster, Const., counts from 0 to 500.]
11
High Productivity Computing Systems
Goal:
Provide a generation of economically viable high productivity computing systems for the
national security and industrial user community (2010; started in 2002)
Focus on:
Real (not peak) performance of critical national security applications
 Intelligence/surveillance
 Reconnaissance
 Cryptanalysis
 Weapons analysis
 Airborne contaminant modeling
 Biotechnology
Programmability: reduce cost and time of developing applications
Software portability and system robustness

HPCS Program Focus Areas
[Diagram: Performance Characterization & Prediction, System Architecture, Programming Models, Hardware Technology, and Software Technology, framed by Industry R&D and Analysis & Assessment.]
Applications:
Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling and biotechnology
Fill the Critical Technology and Capability Gap
Today (late 80's HPC Technology) ... to ... Future (Quantum/Bio Computing)
HPCS Roadmap
 5 vendors in phase 1; 3 vendors in phase 2; 1+ vendors in phase 3
 MIT Lincoln Laboratory leading measurement and evaluation team
[Roadmap diagram: Phase 1 ($20M, 2002): Concept Study by vendor teams, with a Test Evaluation Framework; Phase 2 ($170M, 2003-2005): Advanced Design & Prototypes, with a Validated Procurement Evaluation Methodology; Phase 3 (2006-2010, ~$250M each): Full Scale Development of Petascale Systems (teams TBD), with a New Evaluation Framework. "Today" marks the start of Phase 3.]
13
Predicted Performance Levels for Top500
[Chart: measured and predicted Linpack performance (TFlop/s, log scale) from June 2003 through 2009 for the Top500 total and the #1, #10, and #500 systems. The predictions reach roughly 3,447 / 4,648 / 6,267 TFlop/s for the total, 294 / 405 / 557 for #1, 33 / 44 / 59 for #10, and 2.86 / 3.95 / 5.46 for #500 at successive dates.]
14
A PetaFlop Computer by the End of the
Decade
 At least 10 companies are developing a Petaflop system in the next decade:
 Cray, IBM, Sun: the HPCS vendors (targets of 2+ Pflop/s Linpack, 6.5 PB/s data streaming BW, 3.2 PB/s bisection BW, 64,000 GUPS)
 Dawning, Galactic, Lenovo: Chinese companies
 Hitachi, NEC, Fujitsu: Japanese "Life Simulator" (10 Pflop/s), Keisoku project, $1B over 7 years
 Bull
15
PetaFlop Computers in 2 Years!
 Oak Ridge National Lab
 Planned for 4th Quarter 2008 (1 Pflop/s peak)
 From Cray’s XT family
 Use quad core from AMD
 23,936 Chips
 Each chip is a quad-core processor (95,744 processors)
 Each processor does 4 flops/cycle
 Cycle time of 2.8 GHz
 Hypercube connectivity
 Interconnect based on Cray XT technology
 6MW, 136 cabinets
 Los Alamos National Lab
 Roadrunner (2.4 Pflop/s peak)
 Use IBM Cell and AMD processors
 75,000 cores
16
HPC Challenge Goals
 To examine the performance of HPC architectures
using kernels with more challenging memory access
patterns than the Linpack Benchmark
 The Linpack benchmark works well on all architectures ―
even cache-based, distributed memory multiprocessors due to
1. Extensive memory reuse
2. Scalable with respect to the amount of computation
3. Scalable with respect to the communication volume
4. Extensive optimization of the software
 To complement the Top500 list
 Stress CPU, memory system, interconnect
 Allow for optimizations
 Record effort needed for tuning
 Base run requires MPI and BLAS
 Provide verification & archiving of results
17
Tests on Single Processor and System
● Local - only a single processor (core) is performing computations.
● Embarrassingly Parallel - each processor (core) in the entire system is performing computations, but they do not communicate with each other explicitly.
● Global - all processors in the system are performing computations and they explicitly communicate with each other.
HPC Challenge Benchmark
Consists of basically 7 benchmarks; think of it as a framework or harness for adding benchmarks of interest.
1. LINPACK (HPL) ― MPI Global (Ax = b)
2. STREAM ― Local (single CPU) and *STREAM ― Embarrassingly parallel
3. PTRANS (A = A + B^T) ― MPI Global
4. RandomAccess ― Local (single CPU), *RandomAccess ― Embarrassingly parallel, and RandomAccess ― MPI Global (random integer read, update, and write)
5. BW and Latency ― MPI
6. FFT ― Global, single CPU, and EP
7. Matrix Multiply ― single CPU and EP
19
HPCS Performance Targets
● HPCC was developed by HPCS to assist in testing new HEC systems
● Each benchmark focuses on a different part of the memory hierarchy:
  Registers (operands, instructions) -> Cache(s) (lines, blocks) -> Local Memory (messages) -> Remote Memory (pages) -> Disk -> Tape
● HPC Challenge kernels and HPCS performance targets:
  LINPACK: linear system solve, Ax = b; target 2 Pflop/s (8x relative to current systems)
  STREAM: vector operations, A = B + s*C; target 6.5 Pbyte/s (40x)
  FFT: 1D Fast Fourier Transform, Z = fft(X); target 0.5 Pflop/s (200x)
  RandomAccess: integer update, T[i] = XOR( T[i], rand); target 64000 GUPS (2000x)
● HPCS performance targets attempt to:
  Flatten the memory hierarchy
  Improve real application performance
  Make programming easier
23
Computational Resources and HPC Challenge Benchmarks
 CPU computational speed: HPL, Matrix Multiply
 Memory bandwidth: STREAM
 Node interconnect bandwidth: PTRANS, FFT, RandomAccess, Random & Natural Ring Bandwidth & Latency
How Does The Benchmarking Work?
 Single program to download and run
 Simple input file similar to HPL input
 Base Run and Optimization Run
 Base run must be made
 User supplies MPI and the BLAS
 Optimized run allowed to replace certain routines
 User specifies what was done
 Results upload via website (monitored)
 html table and Excel spreadsheet generated with
performance results
 Intentionally we are not providing a single figure of merit
(no overall ranking)
 Each run generates a record which contains 188
pieces of information from the benchmark run.
 Goal: no more than 2 X the time to execute HPL.
26
http://icl.cs.utk.edu/hpcc/ web site
[Screenshots of the HPCC web pages and results tables]
HPCC Kiviat Chart
http://icl.cs.utk.edu/hpcc/
[Example Kiviat charts from the web site]
Different Computers are Better at Different
Things, No "Fastest" Computer for All Apps
33
HPCC Awards Info and Rules
Class 1 (Objective)
 Performance
1.G-HPL $500
2.G-RandomAccess $500
3.EP-STREAM system $500
4.G-FFT $500
 Must be full submissions
through the HPCC
database
Winners (in both classes) will be
announced at SC07 HPCC BOF
Class 2 (Subjective)
 Productivity (Elegant
Implementation)
 Implement at least two
tests from Class 1
 $1500 (may be split)
 Deadline:
 October 15, 2007
 Select 3 as finalists
 This award is weighted
 50% on performance and
 50% on code elegance,
clarity, and size.
 Submissions format
flexible
34
Class 2 Awards
 Subjective
 Productivity (Elegant Implementation)
 Implement at least two tests from Class 1
 $1500 (may be split)
 Deadline:
 October 15, 2007
 Select 5 as finalists
 Most "elegant" implementation with special
emphasis being placed on:
 Global HPL, Global RandomAccess, EP STREAM
(Triad) per system and Global FFT.
 This award is weighted
 50% on performance and
 50% on code elegance, clarity, and size.
35
5 Finalists for Class 2 – November 2005
 Cleve Moler, Mathworks
 Environment: Parallel
Matlab Prototype
 System: 4 Processor
Opteron
 Calin Cascaval, C. Barton, G.
Almasi, Y. Zheng, M.
Farreras, P. Luk, and R.
Mak, IBM
 Environment: UPC
 System: Blue Gene L
Winners!
 Bradley Kuszmaul, MIT
 Environment: Cilk
 System: 4-processor
1.4 GHz AMD Opteron 840
with 16 GiB of memory
 Nathan Wichman, Cray
 Environment: UPC
 System: Cray X1E (ORNL)
 Petr Konecny, Simon
Kahan, and John Feo, Cray
 Environment: C + MTA
pragmas
 System: Cray MTA2
36
2006 Competitors
 Some Notable Class 1 Competitors
 Cray XT3 Sapphire, 4096 CPUs (DOD ERDC)
 SGI Columbia, 10,000 CPUs (NASA)
 Cray X1, 1008 CPUs, and Cray XT3 Jaguar, 5200 CPUs (DOE ORNL)
 DELL LLGrid, 300 CPUs (MIT LL)
 Cray XT3 Red Storm, 11,648 CPUs (Sandia)
 NEC SX-8, 512 CPUs (HLRS)
 IBM BG/L, 131,072 CPUs, and Purple, 10,240 CPUs (DOE LLNL)
 Class 2: 6 Finalists
 Calin Cascaval (IBM) UPC on Blue Gene/L [Current Language]
 Bradley Kuszmaul (MIT CSAIL) Cilk on SGI Altix [Current Language]
 Cleve Moler (Mathworks) Parallel Matlab on a cluster [Current Language]
 Brad Chamberlain (Cray) Chapel [Research Language]
 Vivek Sarkar (IBM) X10 [Research Language]
 Vadim Gurev (St. Petersburg, Russia) MCSharp [Student Submission]
37
The Following are the Winners of the 2006
HPC Challenge Class 1 Awards
38
The Following are the Winners of the 2006
HPC Challenge Class 2 Awards
39
2006 Programmability
Speedup vs Relative Code Size
[Log-log scatter plot: speedup (relative to serial C on a workstation) versus relative code size (relative to serial C) for the 21 codes submitted by 6 teams; Class 2 Award weighting is 50% performance, 50% elegance. Marked regions include "Traditional HPC", the C+MPI reference ("Ref"), and "Java, Matlab, Python, etc." ("all too often").]
40
2006 Programming Results Summary
 21 of 21 smaller than C+MPI Ref; 20 smaller than serial
 15 of 21 faster than serial; 19 in HPCS quadrant
41
Top500 and HPC Challenge Rankings
 It should be clear that the HPL (Linpack
Benchmark - Top500) is a relatively poor
predictor of overall machine performance.
 For a given set of applications such as:
 Calculations on unstructured grids
 Effects of strong shock waves
 Ab-initio quantum chemistry
 Ocean general circulation model
 CFD calculations w/multi-resolution grids
 Weather forecasting
 There should be a different mix of components
used to help predict the system performance.
42
Will the Top500 List Go Away?
 The Top500 continues to serve a valuable role in high performance computing.
 Historical basis
 Presents statistics on deployment
 Projection of where things are going
 Impartial view
 It's simple to understand
 It's fun
 The Top500 will continue to play a role
43
No Single Number for HPCC?
 Of course everyone wants a single number.
 With HPCC Benchmark you get 188 numbers per system run!
 Many have suggested weighting the seven tests in HPCC to come up
with a single number.
 LINPACK, MatMul, FFT, Stream, RandomAccess,
Ptranspose, bandwidth & latency
 But your application is different than mine, so weights are
dependent on the application.
 Score = W1*LINPACK + W2*MM + W3*FFT+ W4*Stream +
W5*RA + W6*Ptrans + W7*BW/Lat
 The problem is that the weights depend on your job mix.
 So it makes sense to have a set of weights for each user or site, as sketched below.
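If a single site-specific number is wanted anyway, the weighted combination above is easy to compute once the weights for a given job mix are chosen; the following C sketch is purely illustrative (the struct fields, weights, and example values are invented for the example and are not part of HPCC).

#include <stdio.h>

/* Hypothetical container for one system's HPCC component results. */
struct hpcc_results {
    double hpl, mm, fft, stream, ra, ptrans, bw_lat;
};

/* Site-specific figure of merit: Score = sum_i W_i * metric_i.
   The weights encode one site's job mix and will differ between sites. */
double hpcc_score(const struct hpcc_results *r, const double w[7])
{
    return w[0] * r->hpl    + w[1] * r->mm + w[2] * r->fft    +
           w[3] * r->stream + w[4] * r->ra + w[5] * r->ptrans +
           w[6] * r->bw_lat;
}

int main(void)
{
    struct hpcc_results sys = { 10.0, 12.0, 0.5, 4.0, 0.2, 1.0, 0.8 };      /* made-up numbers */
    double weights[7]       = { 0.40, 0.10, 0.20, 0.10, 0.10, 0.05, 0.05 }; /* one job mix     */
    printf("score = %.3f\n", hpcc_score(&sys, weights));
    return 0;
}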
44
Tools Needed to Help With Performance
 A tool that analyzes an application, perhaps statically and/or dynamically
 Outputs a set of weights for various sections of the application
 [W1, W2, W3, W4, W5, W6, W7, W8]
 The tool would also point to places where we were
missing a benchmarking component for the mapping.
 Think of the benchmark components as a basis
set for scientific applications
 A specific application has a set of "coefficients"
of the basis set.
 Score = W1*HPL + W2*MM + W3*FFT+ W4*Stream +
W5*RA + W6*Ptrans + W7*BW/Lat + …
45
Future Directions
 Looking at reducing execution time
 Constructing a framework for benchmarks
 Developing machine signatures
 Plans are to expand the benchmark collection
 Sparse matrix operations
 I/O
 Smith-Waterman (sequence alignment)
 Port to new systems
 Provide more implementations
 Languages (Fortran, UPC, Co-Array)
 Environments
 Paradigms
46
Collaborators
• HPC Challenge
  - Piotr Łuszczek, U of Tennessee
  - David Bailey, NERSC/LBL
  - Jeremy Kepner, MIT Lincoln Lab
  - David Koester, MITRE
  - Bob Lucas, ISI/USC
  - Rusty Lusk, ANL
  - John McCalpin, IBM, Austin
  - Rolf Rabenseifner, HLRS Stuttgart
  - Daisuke Takahashi, Tsukuba, Japan
• Top500
  - Hans Meuer, Prometeus
  - Erich Strohmaier, LBNL/NERSC
  - Horst Simon, LBNL/NERSC
http://icl.cs.utk.edu/hpcc/
HPCC vs. 27th TOP500::top10
(Rmax and HPL in Tflop/s; PTRANS in GB/s; STREAM in TB/s; FFT in Gflop/s; GUPS; latency in microseconds; B/W in GB/s)

     Computer            Rmax     HPL    PTRANS   STREAM   FFT    GUPS    Latency   B/W
  1  BlueGene/L          280.6    259.2  4665.9   160      2311   35.47   5.92      0.159
  2  BGW (*)             91       83.9   171.55   50       1235   21.61   4.70      0.159
  3  ASC Purple          75.8     57.9   553      55       842    1.03    5.1       3.184
  4  Columbia (*)        51.87    46.78  91.31    20       229    0.25    4.23      0.896
  5  Tera-10             42.9
  6  Thunderbird         38.27
  7  Fire x4600          38.18
  8  BlueGene eServer    37.33
  9  Red Storm           36.19    32.99  1813.06  43.58    1118   1.02    7.97      1.149
 10  Earth Simulator     35.86

(*) scaled results
Most of the numbers are from multiple submissions (base or optimized) to arrive at the largest number.
Memory Access Patterns
November 2005 TOP500: 10 Fastest
(Rmax and HPL in Tflop/s; PTRANS in GB/s; STREAM in TB/s; FFT in Gflop/s; GUPS; latency in microseconds; b/w in GB/s)

 Rank  Name              Rmax   HPL   PTRANS  STREAM  FFT    GUPS   Lat.  b/w
  1    BG/L              281    259   374     160     2311   35.5   6     0.2
  2    BGW (*)           91     84    172     50      84     21.6   5     0.2
  3    ASC Purple        63     58    576     44      967    0.2    5     3.2
  4    Columbia (*)      52     47    91      21      230    0.2    4     1.4
  5    Thunderbird       38
  6    Red Storm         36     33    1813    44      1118   1.0    8     1.2
  7    Earth Simulator   36
  8    MareNostrum       28
  9    Stella            27
 10    Jaguar            21     20    944     29      855    0.7    7     1.2

(*) scaled results
HPCC vs. 27th TOP500::top10
(Table repeated from the earlier "HPCC vs. 27th TOP500::top10" slide; (*) marks the scaled results for BGW and Columbia. Most of the numbers are from multiple submissions, base or optimized, to arrive at the largest number.)
HPC Challenge Languages
(Base-run implementations of the four kernels HPL, RandomAccess, STREAM, and FFT, by providing institution)
  Specification, C, C & MPI, C & OpenMP:  UTK
  UPC:                                    ISI
  Matlab, pMatlab:                        MIT-LL
  StarP:                                  UCSB
  Octave:                                 OSC
  Python, Python & MPI, Matlab & MPI:     UTK
  Chapel:                                 Cray
  X10:                                    IBM
  Also listed: C & pthreads, C++, Fortran, Java, Fortress
HPCC Data Analysis: Normalize
● Example: divide by peak flop/s

 System       G-HPL   G-RandomAccess   EP-STREAM-sys   G-FFT
 Cray XT3     81.4%   0.031            1168.8          38.3
 Cray X1E     67.3%   0.422            696.1           13.4
 IBM Power5   53.5%   0.003            703.5           15.5
 IBM BG/L     70.6%   0.089            435.7           6.1
 SGI Altix    71.9%   0.003            308.7           3.5
 NEC SX-8     86.9%   0.002            2555.9          17.5
HPCC Data Analysis: Correlate Peak vs. HPL
[Scatter plot: HPL (Tflop/s) versus Theoretical Peak (Tflop/s) for submitted systems, with Cray XT3, NEC SX-8, and SGI Altix labeled.]
● Is HPL an effective peak or just a peak?
● Close, but not exactly
HPCC Data Analysis: Correlate DGEMM vs. HPL
[Scatter plot: DGEMM (Gflop/s) versus HPL (Tflop/s), with Cray XT3, NEC SX-8, and SGI Altix labeled.]
● Can I just run DGEMM (local matrix-matrix multiply) instead of HPL?
● DGEMM alone overestimates HPL performance
● Note the 1000x difference in scales: Tera vs. Giga
HPCC Data Analysis: Correlate RA vs. HPL
[Scatter plot: G-RandomAccess (GUPS) versus HPL (Tflop/s), with Cray X1E/opt, Cray XT3, Rackable, IBM BG/L, SGI Altix, and NEC SX-8 labeled.]
● As expected: no correlation
Committee Members
● Jack Dongarra (Co-Chair): University of Tenn/ORNL
● Jeremy Kepner (Co-Chair): MIT Lincoln Lab
● David Bailey: LBNL NERSC
● David Koester: MITRE
● Bob Lucas: ISI
● Rusty Lusk: Argonne National Lab
● Piotr Luszczek: University of Tennessee
● John McCalpin: IBM Austin
● Rolf Rabenseifner: HLRS, Stuttgart
● Daisuke Takahashi: University of Tsukuba
HPC Challenge Performance Results
[Chart: effective bandwidth (all results converted to words/second) for Top500 (HPL), STREAM, FFT, and RandomAccess, for systems in Top500 order: Dell GigE P1-P64 (MIT LL), Opteron (AMD), Cray X1E (AHPCRC), SGI Altix (NASA), NEC SX-8 (HLRS), Cray X1 (ORNL, base and optimized), Cray XT3 (ERDC and ORNL), IBM Power5 (LLNL), IBM BG/L (LLNL, base and optimized). The chart highlights the memory hierarchy: on clusters the hierarchy steepens (roughly a 10^6 spread between the Top500 and RandomAccess rates), on HPC systems it stays roughly constant (~10^4), and the DARPA HPCS goals (~10^2) flatten the hierarchy and make systems easier to program.]
60
Official HPCC Submission Process
Prerequisites: C compiler, BLAS, MPI
1. Download
2. Install
3. Run
4. Upload results (provide detailed installation and execution environment)
5. Confirm via @email@
Optional tuned run:
6. Tune (only some routines can be replaced; data layout needs to be preserved; multiple languages can be used)
7. Run
8. Upload results
9. Confirm via @email@
Results are immediately available on the web site:
● Interactive HTML
● XML
● MS Excel
● Kiviat charts (radar plots)
HPL Benchmark
 TPP Linpack Benchmark
 Used for the Top500 ratings
 Solve Ax=b, dense problem, matrix is random
 Uses LU decomposition with partial pivoting
 Based on the ScaLAPACK routines but optimized
 The algorithm is scalable in the sense that the parallel efficiency is maintained constant with respect to the per-processor memory usage
 In double precision (64-bit) arithmetic
 Run on all processors
 Problem size set by user; these settings are used for the other tests
 Requires
 An implementation of MPI
 An implementation of the Basic Linear Algebra Subprograms (BLAS)
 Reports total TFlop/s achieved for the set of processors
 Takes the most time
 Considering stopping the process after, say, 25%
 Still check to see if correct
[Figure: TPP performance, rate vs. problem size]
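To give a feel for the computation HPL performs (HPL itself is a distributed MPI program built on the BLAS), a serial C sketch that solves a random dense Ax = b by LU factorization with partial pivoting through LAPACK's dgesv driver might look like this; the LAPACKE interface and the problem size are assumptions made for the illustration.

#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>   /* assumes a LAPACK installation that provides LAPACKE */

int main(void)
{
    const lapack_int n = 1000;   /* illustrative size; the benchmark uses a much larger n */
    double *A = malloc((size_t)n * n * sizeof *A);
    double *b = malloc((size_t)n * sizeof *b);
    lapack_int *ipiv = malloc((size_t)n * sizeof *ipiv);

    /* Random dense matrix and right-hand side, as in the benchmark. */
    for (size_t i = 0; i < (size_t)n * n; i++) A[i] = rand() / (double)RAND_MAX - 0.5;
    for (lapack_int i = 0; i < n; i++)         b[i] = rand() / (double)RAND_MAX - 0.5;

    /* LU factorization with partial pivoting, then triangular solves: b <- A\b. */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, A, n, ipiv, b, 1);
    printf("dgesv returned %d (0 means success), x[0] = %g\n", (int)info, b[0]);

    free(A); free(b); free(ipiv);
    return 0;
}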
62
STREAM Benchmark
 The STREAM Benchmark is a standard benchmark for the measurement of computer memory bandwidth
 Measures bandwidth sustainable from standard operations -- not the theoretical "peak bandwidth" provided by most vendors
 Four operations: COPY, SCALE, ADD (SUM), TRIAD

  name     kernel                 bytes/iter   FLOPS/iter
  COPY:    a(i) = b(i)            16           0
  SCALE:   a(i) = q*b(i)          16           1
  SUM:     a(i) = b(i) + c(i)     24           1
  TRIAD:   a(i) = b(i) + q*c(i)   24           2

 Measures machine balance: the relative cost of memory accesses vs arithmetic
 Vector lengths chosen to fill local memory
 Tested on a single processor
 Tested on all processors in the set in an "embarrassingly parallel" fashion
 Reports total GB/s achieved per processor
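A minimal serial sketch of the four kernels is shown below; this is the shape of the loops rather than the official STREAM source, and the array size, timing, and repetition counts are simplified for the example.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* vector length; chosen large enough to exceed the caches */

int main(void)
{
    double *a = malloc(N * sizeof *a), *b = malloc(N * sizeof *b), *c = malloc(N * sizeof *c);
    const double q = 3.0;
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++) a[i] = b[i];            /* COPY:  16 bytes/iter, 0 flops */
    for (long i = 0; i < N; i++) a[i] = q * b[i];        /* SCALE: 16 bytes/iter, 1 flop  */
    for (long i = 0; i < N; i++) a[i] = b[i] + c[i];     /* SUM:   24 bytes/iter, 1 flop  */
    for (long i = 0; i < N; i++) a[i] = b[i] + q * c[i]; /* TRIAD: 24 bytes/iter, 2 flops */
    double s = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* Bytes moved by the four kernels together, reported as GB/s; printing a[N-1]
       keeps the compiler from discarding the loops as dead code. */
    printf("approx. bandwidth %.2f GB/s (a[N-1]=%g)\n", 80.0 * N / s / 1e9, a[N - 1]);
    free(a); free(b); free(c);
    return 0;
}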
63
PTRANS
 Implements parallel matrix transpose
 A = A + B^T
 The matrices A and B are distributed across the processors
 Two-dimensional block-cyclic storage
 Same storage as for HPL
 Exercises the communications pattern where pairs
of processors communicate with each other
simultaneously.
 Large (out-of-cache) data transfers across the network
 Stresses the global bisection bandwidth
 Reports total GB/s achieved for set of processors
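The underlying operation is shown below in a plain serial C sketch; the real PTRANS distributes A and B two-dimensionally block-cyclically over the process grid and exchanges blocks between processor pairs, which is omitted here.

#include <stddef.h>

/* Serial illustration of the PTRANS operation A = A + B^T for n x n
   row-major matrices. */
void ptrans_serial(int n, double *A, const double *B)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            A[(size_t)i * n + j] += B[(size_t)j * n + i];
}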
64
Random Access
 Integer Read-modify-write to random address
 No spatial or temporal locality
 Measures memory latency or the ability to hide memory
latency
 Architecture stresses
 Latency to cache and main memory
 Architectures that can generate enough outstanding memory operations to tolerate the latency turn this into a main-memory-bandwidth constrained benchmark
 Three forms
 Tested on a single processor
 Tested on all processors in the set in an “embarrassingly
parallel” fashion
 Tested with an MPI version across the set of processors
 Each processor caches updates then all processors perform MPI
all-to-all communication to perform updates across processors
 Reports Gup/s (Giga updates per second) per processor
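For the local case, the update loop has roughly the following shape; this is a sketch rather than the HPCC source, and the official benchmark defines a specific 64-bit pseudo-random stream and table size, which the stand-in rand() calls below do not reproduce.

#include <stdint.h>
#include <stdlib.h>

/* Read-modify-write to random table locations: T[i] = T[i] XOR r.
   TABLE_SIZE is a power of two so the index can be formed by masking. */
#define LOG2_TABLE_SIZE 24
#define TABLE_SIZE (1ULL << LOG2_TABLE_SIZE)

void random_access(uint64_t *T, uint64_t num_updates)
{
    for (uint64_t k = 0; k < num_updates; k++) {
        uint64_t r = ((uint64_t)rand() << 32) ^ (uint64_t)rand(); /* stand-in random value */
        T[r & (TABLE_SIZE - 1)] ^= r;   /* no spatial or temporal locality */
    }
}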
65
Bandwidth and Latency Tests
 Ping-Pong test between pairs of processors
 Send a message from proci to prock, then return the message from prock to proci
 proci MPI_Send() - prock MPI_Recv()
 proci MPI_Recv() - prock MPI_Send()
 Other processors doing MPI_Waitall()
 time += MPI_Wtime()
 time /= 2
 The test is performed between as many possible
distinct pairs of processors.
 There is an upper bound on the time for the test
 Tries to find the weakest link amongst all pairs
 Minimum bandwidth
 Maximum latency
 Not necessarily the same link will be the worst for
bandwidth and latency
 Message 8B used for latency test; take max time
 Message 2MB used for bandwidth test; take min GB/s
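A minimal MPI ping-pong between ranks 0 and 1 looks roughly like the sketch below; the HPCC code additionally cycles over many distinct processor pairs, runs both the 8-byte and 2 MB message sizes, and enforces the time bound described above.

#include <mpi.h>
#include <stdio.h>

#define MSG_BYTES (2 * 1024 * 1024)   /* 2 MB message, as in the bandwidth test */
static char msg[MSG_BYTES];

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t = -MPI_Wtime();
    if (rank == 0) {          /* proci: send, then receive the returned message */
        MPI_Send(msg, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(msg, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == 1) {   /* prock: receive, then send the message back */
        MPI_Recv(msg, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(msg, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
    t += MPI_Wtime();
    t /= 2.0;                 /* one-way time */

    if (rank == 0)
        printf("one-way time %.6f s, bandwidth %.3f GB/s\n", t, MSG_BYTES / t / 1e9);
    MPI_Finalize();
    return 0;
}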
66
Bandwidth/Latency
Ring Tests (All Procs)
 Two types of rings:
 Naturally ordered
 (use MPI_COMM_WORLD): 0,1,2, ... P-1.
 Randomly ordered (30 rings tested)
 e.g.: 7, 2, 5, 0, 3, 1, 4, 6
 Each node posts two sends (to its left and right
neighbor) and two receives (from its left and right
neighbor).
 Two types of communication routines are used: combined
send/receive and non-blocking send/receive.
 MPI_Sendrecv( TO: right_neighbor, FROM: left_neighbor )
 MPI_Irecv( left_neighbor ), MPI_Irecv( right_neighbor ) and MPI_Isend( right_neighbor ), MPI_Isend( left_neighbor )
 The smaller (better) time for each is taken (which one is
smaller depends on the MPI implementation).
 Message 8B used for latency test;
 Message 2MB used for bandwidth test;
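A sketch of one naturally ordered ring step using the combined send/receive variant is shown below; the HPCC code also runs 30 randomly ordered rings and a non-blocking MPI_Isend/MPI_Irecv variant, and keeps the smaller of the two times.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size, left = (rank + size - 1) % size;
    char sendbuf[8] = {0}, recvbuf[8];   /* 8-byte messages, as in the latency test */

    double t = -MPI_Wtime();
    /* Send to the right neighbor while receiving from the left, then the
       reverse, so each rank exchanges with both of its ring neighbors. */
    MPI_Sendrecv(sendbuf, 8, MPI_CHAR, right, 0,
                 recvbuf, 8, MPI_CHAR, left,  0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(sendbuf, 8, MPI_CHAR, left,  0,
                 recvbuf, 8, MPI_CHAR, right, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    t += MPI_Wtime();

    if (rank == 0)
        printf("ring step time %.6f s\n", t);
    MPI_Finalize();
    return 0;
}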
67
FFT
 Using FFTE software
 Daisuke Takahashi code from University of Tsukuba
 64 bit complex 1-D FFT
 Uses 64 bit addressing
 Global transpose with MPI_Alltoall()
 Three transposes (data is never scrambled)
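The data redistribution between local FFT stages is done with MPI_Alltoall; a sketch of one such global transpose step is below (the element reordering and the local FFT computation of the actual FFTE code are omitted, and the divisibility assumption is for the example only).

#include <mpi.h>
#include <complex.h>

/* Redistribute a block-distributed vector of complex values so each rank
   receives the pieces it needs for its next local FFT stage.  local_n is
   the number of complex elements held by each rank and must be divisible
   by the number of ranks. */
void global_transpose(double complex *in, double complex *out, int local_n, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);
    int chunk = local_n / size;              /* complex elements destined for each rank */
    MPI_Alltoall(in,  2 * chunk, MPI_DOUBLE, /* 2 doubles per complex value             */
                 out, 2 * chunk, MPI_DOUBLE, comm);
}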
68
Performance of #1 Machine over Time
[Four panels tracking the #1 machine's HPCC results from May 9, 2003 through Feb 2, 2006: HPL (Tflop/s), FFT (Gflop/s), STREAM (GB/s), and RandomAccess (GUPS).]
History
[Chart: total number of entries in the HPCC database over time, from 10/6/2003 through 7/2/2006.]
Class 2: Implementation Languages (Subjective)
● English (paper and pencil)
● C/C++: MPI-1, MPI-2, OpenMP, pthreads
● Fortran 90/95/03
● Java
● Matlab: MatlabMPI, StarP, pMatlab
● Python: MPI
● UPC, CAF
● Chapel, X10, Fortress
● HPCC tests: FFT, HPL, RandomAccess, STREAM
● Good if 2 of the 4 tests actually run
HPCC Awards Class 2 Detailed Results
Submissions by language, each implementing some subset (marked with check marks on the slide) of HPL, RandomAccess, STREAM, and FFT: Python MPI, pMatlab, MPT C, OpenMP C++, Cray MTA C, UPC (x 3), Cilk, StarP, Parallel Matlab, and HPF.
HPC Challenge Benchmarks
Normalized to System Peak Performance
 System       G-HPL       G-Random    EP-Stream   G-FFT
              % of Peak   GUPS/Peak   GW/s/Peak   GF/Peak
 Cray XT3     81.4%       0.031       146         38.3
 Cray X1E     67.3%       0.422       87          13.4
 IBM Power5   53.5%       0.003       88          15.5
 IBM BG/L     70.6%       0.089       54          6.1
 SGI Altix    71.9%       0.003       39          3.5
 NEC SX-8     86.9%       0.002       319         17.5
73
Scope and Naming Conventions
[Diagram: each kernel (HPL, STREAM, FFT, RandomAccess(1m), HPL(25%), ...) is run in one of three scopes: S (Single: one processor/core), EP (Embarrassingly Parallel: every processor computes, with no communication), and G (Global: all processors compute and communicate over the network). The hardware view shows nodes of memory (M) and processors (P) joined by a network; software modules include OpenMP within a node and MPI across the interconnect; the computational resources are CPU, memory, and interconnect.]
Kiviat Charts: Multi-network Example
● AMD Opteron clusters: 2.2 GHz, 64-processor cluster
● Interconnects: 1. GigE, 2. Commodity, 3. Vendor
● Cannot be differentiated based on: G-HPL, matrix-matrix multiply
● Available on HPCC website: http://icl.cs.utk.edu/hpcc/
[Kiviat chart (radar plot) comparing the three interconnects]
Details on Component Kernels
● FFT
  - Only power of 2 size/CPUs; will add 3, 5 (JPP-SGI)
  - Data always in-order
  - Three transposes in parallel
  - Each stage timed for base
● RandomAccess
  - Limited time run (currently 60 seconds)
  - Planned new HyperCube algorithm: generalization of BG/L's SC|05 submission
  - Always the same data layout
● HPL
  - Always the same code for base; identical to Antoine's version
  - Compatible with ScaLAPACK
  - Planned: reduced time (25%): factor a portion of the matrix, make the rest of the matrix trivial
● STREAM
  - No static array size (runtime variable)
  - No static storage of data (heap storage, no alignment control)
  - Limited aliasing information (unless specified as a compiler flag)
Olympic Style?
● It's been suggested that the top system be awarded a Gold medal, the 2nd Silver, and the 3rd Bronze for each benchmark group.
● The system with the largest number of Gold medals could then be declared the winner.
HPCC vs. 27th TOP500::top10
(Table repeated from the earlier "HPCC vs. 27th TOP500::top10" slide. Most of the numbers are from multiple submissions, base or optimized, to arrive at the largest number.)