Transcript www.pbm.com

Summarizing Performance is No Mean Feat
No matter how much people want performance to be a
single number, it is usually a distribution, not a mean alone.
John R. Mashey
For EE282, Stanford, Oct 12, 2004, Alpha Version
NoMeanFeat – Copyright 2004, John Mashey
0
Speaker – John Mashey
• Ancient UNIX software guy, Bell Labs 1973-1983, MTS…Manager
  – Programmer's Workbench, shell programming, text processing, workload measurement/tuning in first UNIX computer center, UNIX per-process accounting, UNIX+mainframe data mining apps, capacity planning/tuning
• Convergent Technologies 1983-1984, MTS…Director Software
  – Compiler & OS tuning, uniprocessor/multiprocessor servers
• MIPS Computer Systems 1985-1992, Mgr. OS…VP Systems Technology
  – {system coprocessor, TLB, interrupt-handling, etc}, {byte addressing(!), halfword instructions}, ISA evolution, SMP features, multi-page-sizes, 64-bit
  – MIPS Performance Brief editor; ex-long-time frequent poster on comp.arch
  – One of the SPEC founders, 1988; long-time Hot Chips committee
• Silicon Graphics 1992-2000, Dir. Systems Tech … VP & Chief Scientist
  – Fingers in many areas, R10000 & later architecture, including performance counters & software, Origin3000/Altix ccNUMA architecture, performance issues in HPC, DBMS
• Current: consult for VCs & high-tech co's, sit on technical advisory boards; Trustee at Computer History Museum
* Not a statistician, so statisticians in audience, please be nice.
NoMeanFeat – Copyright 2004, John Mashey
1
Overview
• Background
  – Who benchmarks and why
  – "Standard model" advice about use of means ((W)AM, (W)HM, WGM)
    » Good advice, but incomplete; contradictions remain; industry mismatch
  – "Alternate model"
    » Various means useful when applied appropriately
    » Requirements, assumptions, results differ
• Review of basic statistics
  – Populations, samples; parameters versus statistics
  – Distributions, especially normal, inverse normal, lognormal
• Alternate model
  – WCA, WAW (mean = WAM or WHM), SERPOP (mean = GM)
• Sample analyses using WAW and/or SERPOP
  – SPEC CINT2000, CFP2000
  – Livermore Fortran Kernels (LFK)
  – (Digital Review CPU2; not in this version)
• Conclusion
NoMeanFeat – Copyright 2004, John Mashey
2
Who benchmarks and why
• Computer designers
  – New H/W + S/W to attack a new problem domain; little existing data
    Example: Cray-1 vector systems … but many such have failed
  – New H/W + S/W to compete in a wide existing market; can gather related data
    Examples: RISC systems of 1980s … but most new ISAs have failed
  – New implementation of an established ISA; much data on workloads and programs
    Examples: IBM S/360, Digital VAX [reputed to have 500+ benchmarks]
  – System for a dedicated application; must understand the application, especially real-time
    Example: embedded systems, System-on-Chip, Tensilica/ARC/etc
• Software engineers
  – Understand where to focus efforts on improvements
• Owners/buyers of computers, sometimes in groups
  – Workload understanding, capacity planning; evaluate potential purchases
• Computer magazines
  – Sell magazines
• Industry consortia
  – Try to get meaningful benchmarks to avoid coercion/waste of silly ones
  – SPEC, TPC, EEMBC, etc
• Researchers
NoMeanFeat – Copyright 2004, John Mashey
3
Standard Model for Summarizing Benchmarks
• Summarize times
  – Arithmetic Mean (AM)
  – Weighted AM (WAM)
• Summarize rates or ratios
  – Harmonic Mean (HM)
  – Weighted HM (WHM)
• Do not ever use for anything!
  – Geometric Mean (GM) [1]
  – Weighted GM (WGM)
  – Do not predict workload run-time
  – I.e., SPEC and LFK wrong
• Do not use Means (But do)
References [2, 3, 4, 5, 6]
AM, HM, GM are the Power Means M1, M-1, M0, or the Pythagorean Means … these are old!
NoMeanFeat – Copyright 2004, John Mashey
4
Alternate Model of Summarizing Benchmarks
Standard model (left, for contrast):
• Summarize times – Arithmetic Mean (AM), Weighted AM (WAM)
• Summarize rates or ratios – Harmonic Mean (HM), Weighted HM (WHM)
• Do not ever use for anything! – Geometric Mean (GM), Weighted GM (WGM); do not predict workload run-time
• Do not use means (but do)

Alternate model:
• Measure workload (WCA)
• Workload-dependent (WAW)
  – Mean = WAM or WHM, the same if the weights are right
  – AM or HM alone only under rare assumptions
  – Performance of a workload
  – Population, algebra, definite
• Workload-neutral (SERPOP)
  – Mean = Geometric Mean (GM)
  – Weighted GM (WGM) only to fix a known-biased sample
  – Performance of programs, not a workload
  – Sample, statistical inference, probabilistic
• Really do not use means (alone)
  – WAW: really sum/n, not a distribution
  – SERPOP: means + other metrics
NoMeanFeat – Copyright 2004, John Mashey
5
Some Really Basic Statistics
• Populations and samples
• General distribution descriptions
• Normal distribution – x
• Handling non-normal distributions
• Inverse normal – 1/x
• Lognormal – ln x
• What do the Means mean
NoMeanFeat – Copyright 2004, John Mashey
6
Populations and Samples
• Population: set of observations measured across members of a group
  – Forms a distribution
  – Summarized by descriptive statistics, or better, parameters
  – Uncertainty: individual measurement errors
• Sample: subset of population
  – Compute statistics
  – Know population distribution a priori, or check sample versus assumption
  – Extra uncertainty: small samples or selection bias

[Diagram: Population (size N), described by parameters → Sample (size n), described by statistics; the link is annotated "sample size, representativeness"]
NoMeanFeat – Copyright 2004, John Mashey
7
General Distribution Descriptions
• Mean: measure of central tendency, 1st moment
• Variance: measure of dispersion, 2nd moment
• Standard deviation: measure of dispersion, same scale as Mean
• Coefficient of Variation: CoV, dimensionless measure of dispersion
• Excel function names shown with each formula, where they exist (OpenOffice.org Calc is mostly the same)

  AVERAGE:  AM = μ = (1/N) Σ_{i=1..N} x_i
  VARP:     σ² = (1/N) Σ_{i=1..N} (x_i − μ)²
  STDEVP:   σ = √σ²
            CoV = σ / μ
NoMeanFeat – Copyright 2004, John Mashey
8
Samples
• Sample Mean: used to estimate the population mean; AM OK for the 3 cases below
• Sample Variance, standard deviation, CoV: note the slight difference from the population forms
• Skewness or skew: degree of asymmetry, 3rd moment, Excel: SKEW
  – Zero for symmetric, negative for a long left tail, positive for a long right tail
• Kurtosis: concentration comparison with normal, 4th moment, Excel: KURT
  – Positive: more peaked, heavier tails than normal; negative: flatter, lighter tails
• NOTE: further discussion assumes all xi positive

  AVERAGE:  AM = x̄ = (1/n) Σ_{i=1..n} x_i
  VAR:      s² = (1/(n−1)) Σ_{i=1..n} (x_i − x̄)²
  STDEV:    s = √s²
            CoV = s / x̄

[Figure: example density shapes on a 0–4 scale: normal, negative-kurtosis, positive-kurtosis, and right-skewed distributions]
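A minimal sketch (not from the talk) of these descriptive statistics in Python, using the Excel-style SKEW/KURT definitions; the sample data is the set of SPECratio-style ratios analyzed later in the talk.

```python
import math

def sample_stats(x):
    n = len(x)
    am = sum(x) / n                                   # AVERAGE
    s2 = sum((v - am) ** 2 for v in x) / (n - 1)      # VAR (note n-1, not N)
    s = math.sqrt(s2)                                 # STDEV
    cov = s / am                                      # coefficient of variation
    z = [(v - am) / s for v in x]
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)           # Excel SKEW
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) *
            sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))                 # Excel KURT
    return am, s, cov, skew, kurt

r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
print(sample_stats(r))   # AM ~108, s ~45, CoV ~0.42, skew ~1.5 -- right-skewed
```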
NoMeanFeat – Copyright 2004, John Mashey
9
Normal (Gaussian) Distribution
• Arises from a large number of additive effects
• Familiar, useful properties … but cannot automatically assume normal
• 68% of data within mean ± s
• 95% within mean ± 2s
• 99.7% within mean ± 3s
• 3 of 1000 outside 3s, i.e., rare!

[Figure: normal density on a linear 0–4 scale, with x̄ ± s, x̄ ± 2s, x̄ ± 3s marked and the 68% / 95% / 99.7% bands indicated]
NoMeanFeat – Copyright 2004, John Mashey
10
Normal z-score Transformation
• Normals have one shape, symmetric on a linear scale, with the percentages as given
• Can be converted to standard normal, mean = 0, s = 1
• Excel: STANDARDIZE

  z_i = (x_i − x̄) / s

[Figure: standard normal density on a −4 to +4 scale, with ±1s, ±2s, ±3s marked and the 68% / 95% / 99.7% bands indicated]
NoMeanFeat – Copyright 2004, John Mashey
11
Confidence Intervals
• Confidence intervals can be computed if the population is normal
• Commonly described as the chance that the population mean lies within the interval
• Alpha = significance level, such as 0.05; 100(1 − Alpha) = confidence level, such as 95%
• Small samples (less than 30) need Student's t-distribution, like the normal but with wider tails. Excel: TINV
• Interval improves (gets smaller) with larger sample

  Conf. interval = x̄ ± TINV(0.05, n) · s / √n
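A small sketch (not in the slides) of the confidence-interval formula above, using SciPy's Student-t quantile in place of Excel's TINV; the standard n−1 degrees of freedom are assumed.

```python
import math
from scipy.stats import t

def conf_interval(x, alpha=0.05):
    n = len(x)
    am = sum(x) / n
    s = math.sqrt(sum((v - am) ** 2 for v in x) / (n - 1))
    tcrit = t.ppf(1 - alpha / 2, n - 1)      # two-sided critical value, n-1 df
    half = tcrit * s / math.sqrt(n)
    return am - half, am + half

r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
print(conf_interval(r))   # roughly (80, 137), as quoted for this example later
```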
NoMeanFeat – Copyright 2004, John Mashey
12
Is x (or x* ) Normal? Quick Tests
• For a normal: Skew and Kurtosis ≈ 0, and CoV < 0.3
• If the sample is much different, the population-normal assumption needs checking
  – Population may be distinctly non-normal, as in the heavily-skewed example
  – Population may include several distinct sub-populations [see LFK]
  – Sample may be small or biased
  – Sample may include unusual outliers
    AM especially sensitive to large outliers, right skew.
    Error? Odd, but legal? Illusory, due to small sample?
  – As CoV rises above 0.3, the normal increasingly predicts negative xi (Bad).
NoMeanFeat – Copyright 2004, John Mashey
13
Is it Normal? Quick Tests
• Coefficient of Determination: CoD = 1.0 for a perfect normal; others decrease
  Excel: INDEX(LINEST(sorted.zdata,,,TRUE), 3)
• Normal probability plot: perfect normal = straight line (a small computational sketch follows the table below)
• r: 1 big, 1 medium outlier; CoD = 0.77
  HM=93, GM=100, AM=108; STDEV=45; SKEW=1.48; CONF [80, 137]
• Trimmed: no outliers; CoD = 0.91
  HM=100, GM=101, AM=103; STDEV=22; SKEW=0.77; CONF [88, 119]

    r     z(r)    z(r), trimmed
   41    -1.49    -2.88
   79    -0.65    -1.13
   86    -0.50    -0.81
   87    -0.47    -0.75
   88    -0.44    -0.70
   89    -0.43    -0.67
  103    -0.11     0.00
  106    -0.04     0.14
  124     0.35     0.96
  127     0.42     1.09
  144     0.78     1.86
  225     2.57     5.58

[Figure: normal probability plots of the z-scores of r, full and trimmed, z-axis from -3.0 to 3.0]
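A hedged sketch of the normal-probability-plot check: regress the sorted sample against expected standard-normal quantiles and use R² as the CoD. The slide uses Excel LINEST; the plotting positions assumed here, (i − 0.5)/n, may differ slightly from the spreadsheet, so the CoD will be close but not identical.

```python
import math
from scipy.stats import norm, pearsonr

def cod_vs_normal(x):
    xs = sorted(x)
    n = len(xs)
    expected = [norm.ppf((i + 0.5) / n) for i in range(n)]   # theoretical z per order statistic
    r_coef, _ = pearsonr(expected, xs)
    return r_coef ** 2                                       # coefficient of determination

r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
print(cod_vs_normal(r))                          # ~0.8: outliers pull the fit down
print(cod_vs_normal([math.log(v) for v in r]))   # higher: the log transform copes better
```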
NoMeanFeat – Copyright 2004, John Mashey
14
Sample distribution metrics, summary
• Mean: measure of central tendency
• s, CoV: dispersion, relative dispersion
• Skew, Kurtosis: more on shape
• CoD: similarity to normal
• Confidence limits: goodness of sample
• (Should be more error analysis; later)
NoMeanFeat – Copyright 2004, John Mashey
15
Handling Non-Normal Distributions
• Example distributions: Jain [4], DeCoursey [7], NIST/SEMATECH [8]
  – Different processes → different distributions
  – Bernoulli, Beta, Binomial, Cauchy, Chi-Square, Double exponential, Erlang, Exponential, Extreme value, F, Gamma, Geometric, Lognormal, Inverse Normal, Negative Binomial, Normal, Pareto, Pascal, Poisson, Power lognormal, Student's t, Tukey-Lambda, Uniform, Weibull.
  – http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
• Normal is so useful … but nothing guarantees it, so must check
• If it isn't normal, try to transform to one that could be
  – xi* = f(xi) transform; use whatever works, f(x) = 1/x, f(x) = ln(x), etc
  – Compute mean, standard deviation, etc of the xi*
  – Check normality!
  – Back-transform the mean (and other metrics that can be) via f⁻¹
NoMeanFeat – Copyright 2004, John Mashey
16
Inverse Normal – 1/x
• If 1/x is normal, x has an inverse normal distribution
  – In computing, minimal direct use; typically used for converting rates → times
• Transform: xi* = 1/xi
• Mean:
  AVERAGE:  AM_x* = x̄* = (1/n) Σ_{i=1..n} xi*
• Compute the higher moments. The back-transformed Mean is the HM:
  back-transform:  HM = 1 / AM_x*
  direct:          HARMEAN: HM = n / Σ_{i=1..n} (1/xi)
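A tiny sketch showing that the two HM computations above agree:

```python
def hm_back_transform(x):
    x_star = [1.0 / v for v in x]          # transform
    am_star = sum(x_star) / len(x_star)    # mean on the transformed scale
    return 1.0 / am_star                   # back-transform

def hm_direct(x):
    return len(x) / sum(1.0 / v for v in x)

r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
print(hm_back_transform(r), hm_direct(r))   # both ~93
```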
NoMeanFeat – Copyright 2004, John Mashey
17
Linear and Logarithmic scales
• On the usual linear scale:
  1(a) Normal, s=0.2 — symmetric
  1(b) Normal, s=0.3 — symmetric, wider
  1(c) Lognormal, s=0.2 — slight right-skew
  1(d) Lognormal, s=0.5 — noticeable right-skew
  1(a) and 1(c) are very similar
• On a logarithmic scale:
  2(a) Normal, s=0.2 — slight left-skew
  2(b) Normal, s=0.3 — noticeable left-skew
  2(c) Lognormal, s=0.2 — symmetric
  2(d) Lognormal, s=0.5 — symmetric, wider

[Figure: the four densities plotted on a linear scale (0–4) and on a logarithmic scale (0.1–10)]
NoMeanFeat – Copyright 2004, John Mashey
18
Lognormal (or log-normal) – ln(x)
• If ln(x) is normal, x has a lognormal distribution
  – Nothing magic about ln; any base works
• Transform: xi* = ln(xi). Mean:
  AVERAGE:  AM_x* = x̄* = (1/n) Σ_{i=1..n} xi*
• Compute the higher moments on xi*. The back-transformed Mean is the GM:
  #1 – direct, back-transformed AM:      GEOMEAN: GM = exp(AM_x*) = exp((1/n) Σ_{i=1..n} ln xi)
  #2 – direct, easier to compute, non-obvious:  GEOMEAN: GM = (Π_{i=1..n} xi)^(1/n)
• Multiplicative standard deviation:  Sigma = exp(s_x*)
  Sigma can be used like s: 68% of the x data lies in [m/Sigma, m·Sigma]
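A tiny sketch of the lognormal summary above: GM via the back-transformed mean of ln(x), plus the multiplicative standard deviation Sigma = exp(s of ln(x)).

```python
import math

def gm_and_sigma(x):
    logs = [math.log(v) for v in x]
    n = len(logs)
    am_star = sum(logs) / n
    s_star = math.sqrt(sum((v - am_star) ** 2 for v in logs) / (n - 1))
    gm = math.exp(am_star)          # equals (prod x_i)**(1/n)
    sigma = math.exp(s_star)        # ~68% of data in [gm/sigma, gm*sigma]
    return gm, sigma

r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
print(gm_and_sigma(r))   # GM ~100, Sigma ~1.5 for this example
```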
NoMeanFeat – Copyright 2004, John Mashey
19
Lognormal in real world, more or less
•
“Examples of variates which have approximately log normal distributions
include the size of silver particles in a photographic emulsion,
the survival time of bacteria in disinfectants,
the weight* and blood pressure of humans,
and the number of words written in sentences by George Bernard Shaw.”
http://mathworld.wolfram.com/LogNormalDistribution.html
*Human heights are normal/lognormal, but weights need lognormal
•
Useful article by Stahel and Abbt:
“Log-Normal Distributions Across the Sciences: Keys and Clues.”
http://www.inf.ethz.ch/personal/gutc/lognormal/bioscience.pdf
•
Net-based graphical simulation:
Gut, Limpert, Hinterberger, “Modeling the Genesis of Normal and Log-Normal
Distributions – A Computer simulation on the web to visualize the genesis of
normal and log-normal distributions”
http://www.inf.ethz.ch/~gut/lognormal
NoMeanFeat – Copyright 2004, John Mashey
20
Inverse normal
• Coefficient of Determination: CoD = 0.68, or 0.95 trimmed (below)
• CONF [72, 130]; CONF [88, 115] trimmed
• Normal probability plot, perfect normal = straight line [note inverted scale]
• C/r – 1 big outlier; CoD = 0.68. Trimmed – no outliers; CoD = 0.95

   C/r    z(C/r)   z(C/r), trimmed
  2.46     2.79     7.52
  1.27     0.38     1.37
  1.17     0.18     0.84
  1.15     0.14     0.74
  1.14     0.11     0.68
  1.13     0.10     0.63
  0.97    -0.23    -0.19
  0.94    -0.28    -0.33
  0.81    -0.55    -1.03
  0.79    -0.59    -1.12
  0.70    -0.77    -1.59
  0.44    -1.28    -2.89

[Figure: normal probability plots of z(C/r), full and trimmed, on an inverted -3.0 to 3.0 scale]
NoMeanFeat – Copyright 2004, John Mashey
21
Lognormal
• Coefficient of Determination: CoD = 0.83, or 0.93 trimmed (below)
• CONF [72, 130]; CONF [88, 117] trimmed
• Normal probability plot, perfect normal = straight line
• ln(r) – 2 moderate outliers; CoD = 0.83. Trimmed – no outliers; CoD = 0.93

  ln(r)   z(ln(r))   z(ln(r)), trimmed
  3.70     -2.21      -4.51
  4.37     -0.59      -1.24
  4.45     -0.39      -0.83
  4.47     -0.35      -0.75
  4.48     -0.32      -0.69
  4.48     -0.30      -0.66
  4.64      0.07       0.10
  4.67      0.14       0.24
  4.82      0.52       1.00
  4.84      0.58       1.12
  4.97      0.88       1.72
  5.42      1.97       3.93

[Figure: normal probability plots of z(ln(r)), full and trimmed, z-axis from -3.0 to 3.0]
NoMeanFeat – Copyright 2004, John Mashey
22
What Do the Means Mean?
• Relationships:
  HM(x) ≤ GM(x) ≤ AM(x)
  HM(x) = 1 / AM(1/x)
  AM(x) = 1 / HM(1/x)
  GM(x) = 1 / GM(1/x)
• Normal → AM, inverse normal → HM, lognormal → GM
  (the converse — AM → normal, HM → inverse normal, GM → lognormal — is NOT implied)
• Must
  – Understand the physical meaning to avoid irrelevant math
  – Discover the appropriate distribution type for the population, check the sample
  – Know whether population or sample!
  – Quantify uncertainty from measurement; sample size, bias
• So far, the main example might be modeled by lognormal, normal, or inverse normal, although the untrimmed lognormal fits the original example better
  Trimming helps … but are the outliers good data or not?
NoMeanFeat – Copyright 2004, John Mashey
23
Entering Benchmark Territory
• Lies, Damn Lies, and Benchmarks [This talk is not that one.]
• "Danger, Will Robinson. Warning! Warning! Enemy benchmarketeers!"
• This talk is about trying to get the math right
• In science
  – Data usually is what it is, within measurement error
  – People make mistakes, but try to explain data, make good charts; liars get hurt
  – Example: human metrics
    Year 1: Measure people's heights (normal / lognormal), or weights (lognormal)
    Year 2: Do it again, probably get mostly similar numbers
• In benchmarking
  – Data changes inherently; business changes quickly
  – People are selective in presentation, can be tricky; Jain [4] "Ratio Games"
  – People do make mistakes, sometimes on purpose; rule-bending … cheating
  – If human metrics were benchmarks, assuming bigger is better:
    Year 2:
    Some will have had crash eating binges; a few have lead weights in pockets
    Some will have elevator shoes
    Somebody will say they'll be taller next month
    Somebody will say why heights don't really matter in their application
    One will have discovered stilts
NoMeanFeat – Copyright 2004, John Mashey
24
Benchmarking: What Does “R% as fast” Mean?
• Computers X and Y
• Assume n programs Pi, i=1..n, supposedly members of some related class
  – Many examples here: large, CPU-intense, user-level integer codes
• Run the programs on X & Y with the same inputs:
  – measure run-times xi and yi
  – compute ratios ri = 100 · xi / yi
• Compute R from these numbers, somehow, then claim: "Y is R% as fast as X"
• What could this possibly mean that is true and useful?
  – "Always" – simple, useful, but essentially never true
  – "For a given workload" – true and useful, sometimes
  – "For systems and programs" – true and useful, sometimes, different times
NoMeanFeat – Copyright 2004, John Mashey
25
“X R% as fast as Y on all such programs”
• Two very similar systems, same cache, memory system, software
  – SPEC CINT2000, 12 benchmarks
  – X: Dell PW350, 2266MHz
  – Y: Dell PW350, 3066MHz (135% of X clock-rate)
  – But the ri vary noticeably; HM=124; GM=125; AM=125; STDEV=9
  – Note: did not get 135% performance, for the usual memory reasons
  – Bad enough; the earlier example is worse (41%–225%); typical

    r     z(r)
  106    -2.12
  118    -0.83
  119    -0.68
  120    -0.56
  122    -0.31
  123    -0.18
  125    -0.01
  129     0.44
  133     0.91
  133     0.93
  133     0.94
  138     1.46

[Figure: normal probability plot of z(r), z-axis from -3.0 to 3.0]
Typical; others worse.
NoMeanFeat – Copyright 2004, John Mashey
26
“Two Useful Answers – WAW or SERPOP”
• Overall population: hopeless!
• Workload Characterization Analysis (WCA)
  – Gather data, generate Weights
  – Codify "local knowledge"
• External: published rates/ratios
  – Can sometimes be used to fill in missing data, increase sample size
• Workload Analysis with Weights (WAW) — workload-dependent
  – Needs goodness of the Weights
  – Can be "what if" analyses
  – Algebra on workload parameters
  – R% for a workload
• Sample Estimate of Relative Performance Of Programs (SERPOP) — workload-neutral
  – Representativeness, sample size
  – Statistical analysis on a sample
  – R% for programs on systems, plus distribution and confidence

[Flowchart: Population of programs Pi, i=1..N, with run-times for X & Y.
 Workload-dependent branch: WCA (system log, "experience": Txi = total run-time per Pi → compute weights Wi) and/or External (published metrics Pk: Mxk, Myk → compute rk(Y:X)) feed WAW: for each Pi select an input, run on X and Y to get xi, yi, ri = xi/yi (assume IA1, IA2); combine Wi with xi, yi (WAM, Ra: IA3) or with ri (WHM, Rh: IA4) → Rwa, Rwh, Rw via WCA or IA3, IA4, IA5.
 Workload-neutral branch: SERPOP: for each Pi select an input, run on X and Y, ri = xi/yi (assume IA1, IA2); GM → Rg (IA7, IA8), plus s, CoV, Skew, Kurtosis, CoD, confidence limits.]
NoMeanFeat – Copyright 2004, John Mashey
27
WCA
• Assume X is running the real workload: days, weeks, etc.
• Goal: estimate Y's R% without running the entire workload
  → Identify Pi, i=1..n, that account for most of the total run-time
  → Identify Wi, i=1..n: Weights, or fractions of total run-time
  Txi = sum of all run-times for Pi
  Wi = Txi / Σ_{j=1..n} Txj
• Can be easy, for example: on most UNIX systems, turn on accounting, use acctcom(1) & accumulate
• The Wi must be well-measured, or at the very least well-estimated by experience
  Later WAW analyses are no better than the Weights
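A minimal sketch (assumed example, not from the talk) of turning accumulated accounting totals into WCA weights: total run-time per program divided by the grand total.

```python
def wca_weights(total_runtime):            # e.g. {"P1": 41000.0, ...} seconds per program
    grand_total = sum(total_runtime.values())
    return {p: t / grand_total for p, t in total_runtime.items()}

# Hypothetical accumulated acctcom-style totals:
tx = {"P1": 41000.0, "P2": 23000.0, "P3": 9000.0, "P4": 7000.0}
print(wca_weights(tx))   # weights sum to 1.0; P1 dominates
```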
NoMeanFeat – Copyright 2004, John Mashey
28
WAW
• For each Pi, select a representative input
• Run the programs on X & Y with the same input to get:
  – run-times xi and yi
  – ratios ri = 100 · xi / yi
• Implicit Assumptions need to be made explicit, and always checked
• IA1: Repeatability
  – Multiple runs of Pi, same input, same system → low-variability run-times
  – Run-times long enough to avoid clock measurement problems
  – If true, tiny samples are plausibly representative
  – Recent SPEC rule: run 3 times, take the median time
  – Often more-or-less true … but not always
• IA2: Consistent Relative Performance
  – Each ri has small variability for different inputs.
  – If true, only one input is needed to estimate ri
  – Often more-or-less true … but not always
NoMeanFeat – Copyright 2004, John Mashey
29
Violating IA1: Repeatability
• IA1: an awful case in Summer 1986, on the first MIPSco systems (~M/400)
  Run-time distribution of an ECAD benchmark supposed to run overnight:

[Figure: run-times in hours (0–20). "Before": a cluster of OK runs near the expected time plus an unacceptable tail out to ~2X ("NOT OK!"). "After": a tight cluster ("We wish!").]

• Why: 4KB pages, 8KB I-cache, 8KB D-cache, direct-mapped.
  The OS randomly mapped pages (virtual → real)
  → about 20% of runs caused bad cache collisions: 2X time, unpredictable, unacceptable
• Fix: OS changed to map (virtual → real) in a consistent way per process.
  → the worst outliers disappeared [real-time: max more important than mean]
  → M/500 got a 16KB I-cache instead; big modern set-associative caches OK
  BUT: the issue still exists; it has moved to embedded designs, SoCs
NoMeanFeat – Copyright 2004, John Mashey
30
Run-times: one program, one system, different inputs
• Some programs run ~fixed time, not much dependent on input
• Some simulations run 1 hour, 2 hours, 3 hours … depending on iterations
• Some can run arbitrary time, dependent on input size, array sizes
• What is the overall distribution shape of run-times in general?
  A: unknown, perhaps unknowable
• What is the overall distribution of the Pi's run-times in a specific environment?
  A: WCA — measure it

[Figure: example run-time distributions (hours, 0–20): almost constant; discrete simulation steps; proportional to input or sizing; min, typical, long tail]

• xi, yi: ?? — but at least hope that the ri have small variability, maybe
NoMeanFeat – Copyright 2004, John Mashey
31
Violating IA2: Consistent Relative Performance
• IA2: Some programs execute radically different code for different inputs
  – Example: Spice in SPEC89, different circuit types
    Oops! We knew Spice was floating-point, but happened to pick one input that spent most of its time doing integer storage allocation
• Some programs stress the data cache differently according to a size parameter
  – Example: array size for matrix operations, easily changeable
• The following chart is known to benchmarketeers …
  – One of several profiles, mostly dealing with cache size, memory design
  – X: lower clock rate or equivalent, larger cache, same memory
  – Y: higher clock rate or equivalent, smaller cache, same memory

[Figure: relative performance of Y over X (higher = better for Y) versus increasing problem size, annotated with cache hit-rate regions (Y high hit rate / Y decreasing / Y very low; X high hit rate / X decreasing / X very low), with "Y (first choice)", "Y (second choice)", and "X happy" marking the problem sizes each side would pick.]
32
WAW using AM
• The simplest analysis adds the run-times and divides to get Ra:
  tx = Σ_{i=1..n} xi  and  ty = Σ_{i=1..n} yi
  Ra = 100 · AM(x) / AM(y) = 100 · tx / ty
• Ra = 100 · tx / ty = 100 · 1881 / 1789 = 105%
• Ra = 100 · AM(x) / AM(y) = 100 · 157 / 149 = 105%
• Sums give the same answer (a small computational sketch follows the table below)
• What do the AMs really mean? = sum / n
• Are AMs good central measures here? Unclear.
  – We have not characterized the distribution types.
  – No reason to believe any particular distribution
  – In any case, the run-times might be arbitrary, adjusted for convenient size, availability of inputs, etc
NoMeanFeat – Copyright 2004, John Mashey
  x (secs)   y (secs)
    100        246
    133        169
     97        113
    114        131
    213        242
    133        150
     96         93
     99         93
    175        141
    106         83
    217        151
    398        177
  Sums: tx = 1881, ty = 1789
  AMs:  157, 149
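Sketch of the Ra calculation above, using the x, y run-times from the table:

```python
x = [100, 133, 97, 114, 213, 133, 96, 99, 175, 106, 217, 398]
y = [246, 169, 113, 131, 242, 150, 93, 93, 141, 83, 151, 177]

tx, ty = sum(x), sum(y)        # 1881, 1789
Ra = 100.0 * tx / ty           # same as 100 * AM(x) / AM(y)
print(tx, ty, round(Ra))       # 1881 1789 105
```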
33
WAW using AM, Run-time Distributions?
• X distribution far from normal, CoD = 0.66 (low)
  – Right-skewed by 197.parser: skew > 2
  – AM (157) further from GM (141) than HM (131)
  – Large STDEV, high CoV (> 0.3)
  – {HM, GM, AM} > Median
• Y distribution closer to normal, CoD = 0.92
  – Somewhat right-skewed by 181.mcf & 300.twolf
• SPEC run-times are somewhat arbitrary, picked for convenience, to keep the fastest systems > 60 seconds

[Figure: normal probability plots of z(x) and z(y), z-axis from -3.0 to 3.0]
NoMeanFeat – Copyright 2004, John Mashey
  Benchmark     x (secs)   y (secs)
  181.mcf          100        246
  256.bzip2        133        169
  252.eon           97        113
  255.vortex       114        131
  300.twolf        213        242
  175.vpr          133        150
  176.gcc           96         93
  254.gap           99         93
  164.gzip         175        141
  186.crafty       106         83
  253.perlbmk      217        151
  197.parser       398        177
  MEDIAN           124        146
  HM               131        133
  GM               141        141
  AM               157        149
  STDEV             88         54
  SKEW            2.16       0.75
  KURTOSIS        5.19      -0.12
  CoV             0.56       0.36
34
WAW with AM, usually wrong
• The Ra calculations never used actual Wi
• AM makes a very strong, and in practice usually wrong, assumption:
• IA3: Benchmark Equals Workload
  – Each Pi uses the same fraction of time in both workload and benchmark:
    Wi = xi / tx — the weights are assumed; this makes the choice explicit
  – Useful only when benchmark = workload, perhaps with a multiplicative factor
  – This does happen, as when someone's workload is:
    – a small number of programs, often run in dependent order
    – with small run-time variance
    – run in equal numbers
    – Examples: some CAD workloads, where the nightly run = P1→P2→P3→P4, and the sequence must finish overnight
• Otherwise the simplicity of the AM can fool people into thinking that:
  – a real distribution is being measured, and AM is a good central measure
  – the original workload is being modeled
NoMeanFeat – Copyright 2004, John Mashey
35
WAW with Weighted Arithmetic Mean (WAM)
• First, calculate xi* proportional to the original Wi, to reflect their fractions of the original total run-time
• Then yi* is computed to maintain the same relative performance ri:
  xi* = Wi · tx   and   yi* = xi* / (ri / 100)
  Rwa = 100 · AM(x*) / AM(y*) = 100 · tx* / ty*
      = 100 · Σ_{i=1..n} Wi·tx / Σ_{i=1..n} (100·Wi·tx / ri)
      = Σ_{i=1..n} Wi·tx / Σ_{i=1..n} (Wi·tx / ri)
      = Σ_{i=1..n} Wi / Σ_{i=1..n} (Wi / ri)
• The results depend on the Wi, so a good WCA is needed
• Without a good WCA, people make various assumptions, implicit or explicit (next slide); a short computational sketch of Rwa follows below:
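Sketch of Rwa = Σ Wi / Σ (Wi/ri) for an assumed set of weights; with equal weights (IA4) this reduces to the plain HM of the ratios.

```python
r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]

def rwa(weights, ratios):
    return sum(weights) / sum(w / x for w, x in zip(weights, ratios))

equal = [1.0 / len(r)] * len(r)    # IA4: equal time per program on X
print(round(rwa(equal, r)))        # ~93, the Rwa/Rh quoted on the next slide
```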
NoMeanFeat – Copyright 2004, John Mashey
36
WAW with WAM, Assumptions without WCA
• IA4: Equal times on X
  – Set Wi = 1/n; assume each Pi consumes equal time on X.
  – All xi* = 157.
  – tx* = tx = 1881, but ty* = 2031
  – Rwa = 93% (Y slower than X)
• IA5: Equal times on Y
  – Is X somehow special? No — no WCA.
  – If X and Y swap roles: all yi* = 149.
  – ty* = ty = 1789, but tx* = 1936
  – Rwa = 108% (Y faster than X)

  Original benchmark                    IA4       IA5
  x (secs)   y (secs)       r           y*        x*
    100        246          41          386        60
    133        169          79          199       117
     97        113          86          183       128
    114        131          87          180       130
    213        242          88          178       131
    133        150          89          177       132
     96         93         103          152       154
     99         93         106          147       159
    175        141         124          126       185
    106         83         127          123       190
    217        151         144          109       214
    398        177         225           70       335
  Sums: tx=1881, ty=1789               ty*=2031  tx*=1936
  AMs:  157, 149                        169       161

• IA6: Extreme cases
  – By assuming a workload consisting only of the worst or best case, can get: Rmin = 41% to Rmax = 225%
  – Is it better to be the base (equal-weight) system, as in IA4 or IA5?
  – Do benchmarketeers know the answer? Jain [4], "Ratio Games"
NoMeanFeat – Copyright 2004, John Mashey
37
WAW with WAM Summary
• Y is R% as fast as X:
  – Rmin = 41%: worst case from this data*
  – Rwa = 93%: IA4, equal times on X
  – Ra = 105%: IA3, benchmark = workload
  – Rwa = 108%: IA5, equal times on Y
  – Rmax = 225%: best case from this data*
• In the absence of weights from a WCA, any of these are assumptions
• For any assumption, must ask:
  – Why was this assumption made?
  – What evidence is there for it?
• Assumptions are no substitute for knowledge
* Actually, the later SERPOP analysis would predict that about 5% of programs fall outside this range, so it could be even worse or better for the wrong/right single-program workload
NoMeanFeat – Copyright 2004, John Mashey
38
WAW with Weighted Harmonic Mean (WHM)
• HM or WHM is used when the xi and yi are really rates or performance ratios
• The usual general WHM formula is:
  Rwh = WHM(r) = Σ_{i=1..n} Wi / Σ_{i=1..n} (Wi / ri)
• Same formula as Rwa, so Rwh = Rwa
• If Wi = 1, it reduces to the usual HM calculation under IA4
  – HM assumes equal time per Pi on X
  – Each Pi's run-time is proportional to 1/ri, and Rh = 93%
  – If X and Y swap, and we assume equal time on Y under IA5, Rh = 108%.
• Defined this particular way:
  – IA4: WAM and WHM give the same answer (93%)
  – IA5: WAM and WHM give the same answer (108%)
  – WCA: WAM and WHM give the same answer (according to the WCA)
  – But of course, the different cases give different answers
NoMeanFeat – Copyright 2004, John Mashey
39
WAW Summary
• Appropriate when WCA weights Wi are known
• Appropriate for "what-ifs"
• Appropriate for design of dedicated systems
• Strong influence of the goodness and completeness of the Wi
• Makes no useful predictions about program n+1
• Sometimes difficult to use for general computer architectural design studies
  – Difficult to get the data in any reliable way
  – New design aimed at a target workload 3 years away
• Very difficult for industry-consortia benchmarking!
  – Agreement on weights is an "interesting" experience
• Population, not sample
• Algebra, not statistics
NoMeanFeat – Copyright 2004, John Mashey
40
External Data
• Published metrics can be a useful complement to analysis
• Times are useful only for replication testing, credibility
• Usually rates (like MFLOPS, Dhrystones) or ratios (SPECratios)
  – Common practice: convert the former to the latter
• 1980s: vendor performance documents often included some awful published metrics, for lack of anything better, i.e., Dhrystones, Whetstones
• Recognizability and understandability by the user are very important
  – Mysterious numbers are useless … especially from a vendor
• Can feed into the WAW analysis; example:
  – 80% of the workload is in 4 programs on X, 20% is "other"
  – Able to get x1, x2, x3, x4, but only y1, y2, y3 (P4 is 3rd-party code, no port yet), so can compute r1, r2, r3, but not r4
  – If experience has shown that P4 is "like" some published benchmark, then can estimate r4 and continue
  – Likewise, may be able to estimate an overall r for the missing 20%
NoMeanFeat – Copyright 2004, John Mashey
41
SERPOP
* Actually not. The rest is just a slight formalization of methods people have used for decades, but usually justified for
(true) reasons that (also) led to seeming contradictions. Some of us (author included) failed to dig deeper into the
math and explain it, or whole argument over Means would have been over years ago. There are no new statistical
methods here, and this is just the tip of the applicable statistics iceberg. Resampling techniques, jack-knife,
bootstrap, etc, or general Box-Cox transformations, and more complex distributions than lognormal are beyond the
scope here, but are worth studying, especially for small sample sizes commonly found in benchmarking.
NoMeanFeat – Copyright 2004, John Mashey
42
Distributions of Ratios
• The xi and yi fit no consistent, recognizable, useful distributions
  – No real-world reason for them to do so
  – They are also often arbitrarily adjusted for convenience.
• The Wi are fractions that must sum to 1.0
  – No real-world reason for them to follow any consistent distribution, although a given site might find that its workload fits something
• That leaves the ratios ri
  – If xi and yi were independent, from a standard normal (negative … 0 … positive), ri would follow a Cauchy distribution, among the most awkward known: no mean or variance; increasing the sample size is useless
• But fortunately, for benchmarks:
  – xi and yi positive → ri positive → not Cauchy, thank goodness!
  – xi and yi are not completely independent, sometimes extremely well-correlated
  – Real experience says Y is actually faster/slower than X, usually
  – Example: Correl = 0.45; Dell example = 0.99.
• One more important characteristic of benchmark ratio distributions …
NoMeanFeat – Copyright 2004, John Mashey
43
Benchmark Ratio Distributions need Log-scale Symmetry
• X and Y are arbitrary labels; results cannot depend on the choice of numerator
  – i.e., R(a/b) = 100² / R(b/a) (given the percentage notation used here)
  – → log-scale symmetry
• X is 50% of Y on P1, and 200% on P2; Y is just the opposite
  – Given only the r values, the two systems must be equal, as the GM shows
• If this were WAW with AM, X, Y, and Z would all be equal (100)
• If this is a sample from a log-symmetric distribution, Z is slower … (94)
• Note the effect of consistently doubling P1's run-time (second table)
• Tiny examples are mostly silly, but the math has to work for them also
• Fleming and Wallace [1]: reflexive, symmetric, multiplicative properties
  (a small numeric check of the symmetry property follows the tables below)
  First example:
         x      y      r     ln(r)     z      rz    ln(rz)
  P1     2      4     50     3.912     3      67    4.200
  P2     4      2    200     5.298     3     133    4.893
  HM    2.67   2.67   80              3.00    89
  GM    2.83   2.83  100              3.00    94
  AM    3.00   3.00  125     4.605    3.00   100    4.546
  Second example – P1 run-time doubled on both X and Y:
NoMeanFeat – Copyright 2004, John Mashey
         x      y      r     ln(r)     z      rz    ln(rz)
  P1     4      8     50     3.912     6      67    4.200
  P2     4      2    200     5.298     3     133    4.893
  HM    4.00   3.20   80              4.00    89
  GM    4.00   4.00  100              4.24    94
  AM    4.00   5.00  125     4.605    4.50   100    4.546
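A tiny sketch verifying the symmetry requirement R(X:Y) · R(Y:X) = 100², which the GM of ratios satisfies and the AM does not:

```python
import math

def gm(v):
    return math.exp(sum(math.log(x) for x in v) / len(v))

r_xy = [50, 200]                        # X vs Y on P1, P2
r_yx = [100 * 100 / x for x in r_xy]    # Y vs X: [200, 50]

print(gm(r_xy) * gm(r_yx))              # 10000 = 100^2: symmetric
am = lambda v: sum(v) / len(v)
print(am(r_xy) * am(r_yx))              # 15625: the AM fails the symmetry test
```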
44
Log-scale Symmetry and Lognormal Distribution
• AM(ln(x)) → GM(x): the GM is appropriate for log-symmetric distributions
• The lognormal is one of many log-symmetric distributions, the obvious first choice to try
• Lognormals are normally caused by (mostly) multiplicative effects
  – Clock rate: consider adding 100MHz to a 100MHz system, versus 100MHz to a 1000MHz system
    First: <200%, second: <110%; multiplicative, not additive
  – Micro-architecture
  – Memory system
  – Compiler code generation
• Early data: the lognormal fits, and if s is small, it fits normal as well, as it should
• Possible: a related log-symmetric distribution with an extra parameter to vary Kurtosis

[Figure: normal, negative-kurtosis, and positive-kurtosis shapes plotted on a log scale (0.1–100)]
NoMeanFeat – Copyright 2004, John Mashey
45
SERPOP
• For each Pi, select a representative input
• Run the programs on X & Y with the same input, get:
  – run-times xi and yi
  – ratios ri = 100 · xi / yi — and ignore xi and yi thereafter
• Assume IA1: Repeatability and IA2: Consistent Relative Performance, plus:
• IA7: Sufficient sample size
  – More is better, especially if CoV is large
  – → Compute Confidence Intervals to understand the goodness of the sample
  – So far: even small samples (CINT2000: 12, CFP2000: 14) look OK
• IA8: Representative sample
  – Experience and analysis are needed to know that the selected Pi are "representative"
  – People often know representative programs in the local environment, even if weights are not really known → LFK
  – Vendors often have detailed simulations, micro-architectural statistics
  – "This just does not look like anything real we've ever seen." — experience with synthetic benchmarks like Dhrystone
  – Wide use of a benchmark does not imply goodness of the benchmark
NoMeanFeat – Copyright 2004, John Mashey
46
SERPOP
• Just use GM(ri) and the other usual statistics for a lognormal
• WGM(ri) is possible, but usually reserved for samples known to be badly biased, while awaiting a better, larger sample.
• Detailed analysis on the next few pages
  – Shows normal, lognormal, inverse normal for illustration
  – In real usage, would likely use lognormal
  – The others are not log-symmetric
  – When s is small, they are similar anyway
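A hedged sketch of a SERPOP summary for a set of SPECratio-style ratios ri: GM, multiplicative Sigma, and a 95% confidence interval computed on the log scale and back-transformed (assumed implementation; the function name is illustrative).

```python
import math
from scipy.stats import t

def serpop_summary(r, alpha=0.05):
    logs = [math.log(v) for v in r]
    n = len(logs)
    m = sum(logs) / n
    s = math.sqrt(sum((v - m) ** 2 for v in logs) / (n - 1))
    gm = math.exp(m)                       # central measure
    sigma = math.exp(s)                    # multiplicative standard deviation
    half = t.ppf(1 - alpha / 2, n - 1) * s / math.sqrt(n)
    ci = (math.exp(m - half), math.exp(m + half))   # back-transformed confidence limits
    return gm, sigma, ci

# The CINT2000 Opteron-vs-POWER4+ ratios analyzed on the next slide:
r = [41, 79, 86, 87, 88, 89, 103, 106, 124, 127, 144, 225]
print(serpop_summary(r))   # GM ~100, Sigma ~1.5, CI roughly (78, 130)
```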
NoMeanFeat – Copyright 2004, John Mashey
47
SERPOP – CINT2000 Einux (Opteron) vs IBM (POWER4+)
• B1:C1 are unique identifiers from the result filenames
• 5.5X range of relative performance
• Unrelated systems; Correl = 0.45
• Outliers pull down the CoD
• Lognormal copes better with the outliers
• X: negative times are predicted by the normal assumption, as usual with strong skew

[Figure: normal probability plots of z(r), z(ln(r)), and z(C/r), z-axis from -3.0 to 3.0]
CINT2000 base run-times. y: Einux A4800, 1800MHz AMD Opteron (#02097, SPECint_base2000 = 1077), versus x: IBM eServer Turbo 690, 1700MHz POWER4+ (#02136, SPECint_base2000 = 1081). r = 100·x/y.

  Benchmark     x (secs)   y (secs)     r
  181.mcf          100        246      41
  256.bzip2        133        169      79
  252.eon           97        113      86
  255.vortex       114        131      87
  300.twolf        213        242      88
  175.vpr          133        150      89
  176.gcc           96         93     103
  254.gap           99         93     106
  164.gzip         175        141     124
  186.crafty       106         83     127
  253.perlbmk      217        151     144
  197.parser       398        177     225

  n = 12; Correl(x, y) = 0.45
  r: MEDIAN 96, HM 93, GM 100, AM 108, STDEV 45, SKEW 1.48, KURTOSIS 3.72, CoV 0.42; Sigma = 1.51
  Coefficient of Determination: r 0.77, ln(r) 0.83, C/r 0.68
  95% Confidence Limits (back-transformed): r [80, 137], ln(r) [78, 130], C/r [72, 130]
  (The full spreadsheet also tabulates ln(r), C/r, their z-scores, and the histogram bins behind the plots above.)
NoMeanFeat – Copyright 2004, John Mashey
48
SERPOP – CFP2000 Einux (Opteron) vs IBM (POWER4+)
• Well-behaved lognormal
• 4.7X range of relative performance
• Unrelated systems; Correl = 0.49
• Remember, a high r does not necessarily mean Y is great; it means Y is unusually better than X on that code — X might simply be bad on it. Need to look at other systems.

[Figure: normal probability plots of z(r), z(ln(r)), and z(C/r), z-axis from -3.0 to 3.0]
CFP2000 base run-times. y: Einux A4800, 1800MHz AMD Opteron (#02109, SPECfp_base2000 = 1122), versus x: IBM eServer Turbo 690, 1700MHz POWER4+ (#02137, SPECfp_base2000 = 1598). r = 100·x/y.

  Benchmark      x (secs)   y (secs)     r
  183.quake         44        142       31
  178.galgel        76        187       41
  168.wupwise       72        138       52
  179.art          112        187       60
  200.sixtrack     152        245       62
  187.facerec       99        149       66
  173.applu        151        228       66
  189.lucas        109        141       77
  301.apsi         193        246       78
  171.swim         145        183       79
  191.fma3d        163        190       86
  172.mgrid        173        170      102
  188.ammp         210        197      107
  177.mesa         164        112      146

  n = 14; Correl(x, y) = 0.49
  r: MEDIAN 72, HM 65, GM 70, AM 75, STDEV 29, SKEW 0.94, KURTOSIS 1.61, CoV 0.39; Sigma = 1.48
  Coefficient of Determination: r 0.88, ln(r) 0.92, C/r 0.81
  95% Confidence Limits (back-transformed): r [59, 92], ln(r) [56, 88], C/r [52, 86]
  (ln(r), C/r, z-score columns and histogram bins omitted here; they feed the plots above.)
NoMeanFeat – Copyright 2004, John Mashey
49
SERPOP – CINT2000, Dell PW350 3066 vs 2266 Mhz
• 135% clock rate difference; R = 125%
• 512KB cache; Correl = 0.99
• 181.mcf: likely low cache hit rate; it drags the faster system almost down to the slower one
• 254.gap, 253.perlbmk, 186.crafty, 164.gzip: likely high cache hit rate → near the clock-rate ratio
• 252.eon: ??

[Figure: normal probability plots of z(r), z(ln(r)), and z(C/r), z-axis from -3.0 to 3.0]
CINT2000 base run-times. Y: Dell Precision Workstation 350, 3066MHz P4, versus X: Dell Precision Workstation 350, 2266MHz P4 (SPEC results #01791, #01827). r = 100·x/y.

  Benchmark     x (secs)   y (secs)     r
  181.mcf          248        233      106
  175.vpr          272        231      118
  176.gcc          106         89      119
  300.twolf        400        333      120
  256.bzip2        214        175      122
  255.vortex       142        115      123
  197.parser       220        176      125
  254.gap          100         78      129
  253.perlbmk      173        130      133
  186.crafty       113         85      133
  164.gzip         164        123      133
  252.eon          134         97      138

  n = 12; Correl(x, y) = 0.99
  r: MEDIAN 124, HM 124, GM 125, AM 125, STDEV 9, SKEW -0.55, KURTOSIS 0.34, CoV 0.07; Sigma = 1.07
  Coefficient of Determination: r 0.93, ln(r) 0.91, C/r 0.89
  95% Confidence Limits (back-transformed): r [120, 131], ln(r) [119, 131], C/r [119, 130]
  (ln(r), C/r, z-score columns and histogram bins omitted here; they feed the plots above.)
NoMeanFeat – Copyright 2004, John Mashey
50
SERPOP – CFP2000, Dell PW350 3066 vs 2266 Mhz
• 135% clock rate difference; R = 117% — a noticeable cache-miss effect; CFP2000 has larger data than CINT
• Correl = 0.97; 512KB cache each
• Low STDEV; HM ≈ GM ≈ AM ≈ Median
• 177.mesa, 200.sixtrack get good cache hit rates

[Figure: normal probability plots of z(r), z(ln(r)), and z(C/r), z-axis from -3.0 to 3.0]
CFP2000 base run-times. Y: Dell Precision Workstation 350, 3066MHz P4, versus X: Dell Precision Workstation 350, 2266MHz P4. r = 100·x/y.

  Benchmark       x (secs)   y (secs)     r
  179.art            354        363       98
  171.swim           175        169      104
  183.quake          112        101      111
  188.ammp           382        342      112
  178.galgel         213        189      113
  189.lucas          151        131      115
  173.applu          208        180      116
  301.apsi           368        312      118
  187.facerec        172        144      119
  172.mgrid          208        172      121
  168.wupwise        140        114      123
  191.fma3d          239        193      124
  177.mesa           160        120      133
  200.sixtrack       263        195      135

  n = 14; Correl(x, y) = 0.97
  r: MEDIAN 117, HM 116, GM 117, AM 117, STDEV 10, SKEW -0.03, KURTOSIS 0.27, CoV 0.09; Sigma = 1.09
  Coefficient of Determination: r 0.93, ln(r) 0.92, C/r 0.91
  95% Confidence Limits (back-transformed): roughly [111, 123] for all three transforms
  (ln(r), C/r, z-score columns and histogram bins omitted here; they feed the plots above.)
NoMeanFeat – Copyright 2004, John Mashey
51
SERPOP – CINT2000 – Two Sun Blades, Caches
• Blade 2500: 1MB on-chip cache, 1280MHz; Blade 2000: 8MB off-chip cache, 1200MHz
• 107% higher clock rate, but 94% GM
• Correl = 0.74
• 186.crafty, 254.gap, 164.gzip exceed 107%: good cache hit rate in the on-chip cache
• 181.mcf likely fits in the 8MB cache, misses a lot in 1MB
• 256.bzip2 also shows the cache effect
• High KURTOSIS

[Figure: normal probability plots of z(r), z(ln(r)), and z(C/r), z-axis from -3.0 to 3.0]
CINT2000 base run-times. Y: Sun Blade 2500, UltraSPARC IIIi, 1280MHz, 1MB on-chip cache (#02435), versus X: Sun Blade 2000, 1200MHz, 8MB off-chip cache (#01999). r = 100·x/y.

  Benchmark     x (secs)   y (secs)     r
  181.mcf          263        546       48
  256.bzip2        215        269       80
  197.parser       315        337       93
  175.vpr          254        269       94
  255.vortex       191        196       97
  176.gcc          162        165       98
  300.twolf        456        459       99
  252.eon          169        162      104
  253.perlbmk      282        269      105
  186.crafty       150        138      109
  254.gap          236        210      112
  164.gzip         293        259      113

  n = 12; Correl(x, y) = 0.74
  r: MEDIAN 99, HM 92, GM 94, AM 96, STDEV 18, SKEW -2.03, KURTOSIS 4.93, CoV 0.18; Sigma = 1.26
  Coefficient of Determination: r 0.72, ln(r) 0.60, C/r 0.50
  95% Confidence Limits (back-transformed): r [85, 107], ln(r) [81, 109], C/r [77, 113]
  (ln(r), C/r, z-score columns and histogram bins omitted here; they feed the plots above.)
NoMeanFeat – Copyright 2004, John Mashey
52
SERPOP – CFP2000 – Two Sun Blades, Caches
• Blade 2500: 1MB on-chip cache, 1280MHz; Blade 2000: 8MB off-chip cache, 1200MHz
• 107% higher clock rate, but 92% GM
• Correl = 0.92
• 178.galgel, 189.lucas, 188.ammp, 179.art form a clump where the cache size makes the difference
• 200.sixtrack, 301.apsi, 191.fma3d exceed 107%: good cache hit rate in the on-chip cache
• Negative KURTOSIS

[Figure: normal probability plots of z(r), z(ln(r)), and z(C/r), z-axis from -3.0 to 3.0]
CFP2000 base run-times. Y: Sun Blade 2500, UltraSPARC IIIi, 1280MHz, 1MB on-chip cache (#02436), versus X: Sun Blade 2000, 1200MHz, 8MB off-chip cache (#02000). r = 100·x/y.

  Benchmark       x (secs)   y (secs)     r
  178.galgel         146        206       71
  189.lucas          378        523       72
  188.ammp           367        482       76
  179.art             26         34       77
  171.swim           299        337       89
  183.quake           84         93       89
  172.mgrid          255        271       94
  187.facerec        161        169       95
  177.mesa           201        199      101
  173.applu          282        279      101
  168.wupwise        179        169      106
  200.sixtrack       270        250      108
  301.apsi           402        365      110
  191.fma3d          431        377      114

  n = 14; Correl(x, y) = 0.92
  r: MEDIAN 95, HM 91, GM 92, AM 93, STDEV 15, SKEW -0.25, KURTOSIS -1.26, CoV 0.16; Sigma = 1.18
  Coefficient of Determination: r 0.98, ln(r) 0.96, C/r 0.94
  95% Confidence Limits (back-transformed): r [85, 102], ln(r) [84, 101], C/r [83, 100]
  (ln(r), C/r, z-score columns and histogram bins omitted here; they feed the plots above.)
NoMeanFeat – Copyright 2004, John Mashey
53
Livermore Fortran Kernels
• McMahon [9]: Good WCA of the codes in the local scientific environment, done to identify the Pi, not the Wi
  – Large environment, workloads varied strongly
• http://www.llnl.gov/asci_benchmarks/asci/limited/lfk/README.html
Kernel 1: an excerpt from a hydrodynamic code.
Kernel 2: an excerpt from an Incomplete Cholesky-Conjugate Gradient code.
Kernel 3: the standard Inner Product function of linear algebra.
Kernel 4: an excerpt from a Banded Linear Equations routine.
Kernel 5: an excerpt from a Tridiagonal Elimination routine.
Kernel 6: an example of a general linear recurrence equation.
Kernel 7: an Equation of State fragment.
Kernel 8: an excerpt of an Alternating Direction, Implicit Integration code.
Kernel 9: an Integrate Predictor code.
Kernel 10: a Difference Predictor code.
Kernel 11: a First Sum.
Kernel 12: a First Difference.
Kernel 13: an excerpt from a 2-D Particle-in-Cell code.
Kernel 14: an excerpt of a 1-D Particle-in-Cell code.
Kernel 15: a sample of how casually FORTRAN can be written.
Kernel 16: a search loop from a Monte Carlo code.
Kernel 17: an example of an implicit conditional computation.
Kernel 18: an excerpt from a 2-D Explicit Hydrodynamic code.
Kernel 19: a general Linear Recurrence Equation.
Kernel 20: an excerpt from a Discrete Ordinate Transport program.
Kernel 21: a matrix X matrix product calculation.
Kernel 22: a Planckian Distribution procedure.
Kernel 23: an excerpt from 2-D Implicit Hydrodynamics.
Kernel 24: finds the location of the first minimum in X.
NoMeanFeat – Copyright 2004, John Mashey
54
Livermore Fortran Kernels
• http://www.netlib.org/benchmark/livermore says:
  "The best central measure is the Geometric Mean (GM) of 72 rates because the GM is less biased by outliers than the Harmonic (HM) or Arithmetic (AM). CRAY hardware monitors have demonstrated that net Mflop rates for the LLNL and UCSD tuned workloads are closest to the 72 LFK test GM rate. [However, CRAY memories are "all cache". LLNL codes ported to smaller-cache microprocessors typically perform at only LFK Harmonic mean MFlop rates.]"
• It also associates:
  2*AM          Best applications
  AM            Optimized applications
  GM            Tuned workload
  HM            Untuned workload
  HM(scalar)    All-scalar applications
• Such advice seems to be a set of heuristics from people with long experience in a specific environment, as there is no obvious mathematical reason for all of these to be true, other than the usual HM ≤ GM ≤ AM
• The comment "less biased" is true, but more important, this collection of codes is a reasonable SERPOP sample.
• MFLOPS rates are really ratios versus a mythical system that does 1 MFLOPS on each loop
• In the following, MFLOPS rates are given as "r"
NoMeanFeat – Copyright 2004, John Mashey
55
LFK – MIPS M/1000, 15MHz R2000 uniprocessor, Oct 1987
• Scalar uniprocessor
• No cache-blocking
• 24 data points, fairly well behaved
• Good fit for normal and lognormal, as expected with small s
• No really weird outliers
• MIPS [14]
• #NUM!s: no worry — we do not care about the HM and GM of the logs; if these were needed, we would scale the values to avoid negative logs.

[Figure: normal probability plots of z(r), z(ln(r)), and z(1/r), z-axis from -3.0 to 3.0]
MIPS M/1000, 15MHz R2000, 64-bit Livermore Fortran Kernels (size = 471)

  Loop   r (MFLOPS)
   13      0.64
   14      0.67
   15      0.96
   23      1.04
   12      1.23
   11      1.23
   10      1.23
    6      1.42
   21      1.45
    5      1.48
   20      1.62
   16      1.63
    2      1.84
   24      1.85
   19      1.92
    4      1.96
    3      2.10
   17      2.27
    1      2.31
   18      2.33
   22      2.37
    8      2.57
    7      3.13
    9      3.23

  n = 24. r: MEDIAN 1.73, HM 1.49, GM 1.63, AM 1.77, STDEV 0.69, SKEW 0.37, KURTOSIS -0.21, CoV 0.39; multiplicative STDEV 1.53
  Coefficient of Determination: r 0.957, ln(r) 0.937, 1/r 0.788
  95% Confidence Limits (as tabulated, one per transform): lower 1.2 / 1.3 / 1.4, upper 1.8 / 1.9 / 2.0
  (ln(r), 1/r, and z-score columns omitted; they feed the probability plots above.)
NoMeanFeat – Copyright 2004, John Mashey
56
LFK – SGI 4D240GTX, 4 R3000-25, July 1989
• Vectorizing, parallelizing FORTRAN compiler
• Says: 11 of 24 kernels parallelizable, 15 of 24 vectorizable
• Much bigger range — an example of the rationale that led to use of the GM; a much better fit for lognormal when substantial variation exists
• Step-like groupings of programs with related performance
• Humphries [15]

[Figure: normal probability plots of z(r) (3 points off chart), z(ln(r)), and z(1/r), z-axis from -3.0 to 3.0]
SGI 4D240GTX, 64-bit Livermore Fortran Kernels (size = 471)

  Loop   r (MFLOPS)
   13       1.12
   24       2.21
   11       2.30
   16       2.42
    6       3.04
    5       3.71
    2       3.79
   14       3.79
   17       3.79
    4       4.40
   20       4.44
   15       4.70
   23       5.05
   19       5.07
    3       6.74
   12       8.32
   22       8.53
   10       8.83
   18      11.82
    1      16.13
    9      17.52
   21      19.68
    8      20.20
    7      23.58

  n = 24. r: MEDIAN 4.88, HM 4.37, GM 5.86, AM 7.97, STDEV 6.58, SKEW 1.21, KURTOSIS 0.22, CoV 0.83; multiplicative STDEV 2.24
  Coefficient of Determination: r 0.81, ln(r) 0.95, 1/r 0.74
  95% Confidence Limits (as tabulated, one per transform): lower 2.4 / 3.0 / 3.5, upper 4.6 / 5.6 / 7.2
  (ln(r), 1/r, and z-score columns omitted; they feed the probability plots above.)
NoMeanFeat – Copyright 2004, John Mashey
57
LFK – CRAY X-MP, uniprocessor, 1988
• Vectorized kernels are asterisked; all but one lie below the bar separating loop 14 and loop 2
• r: large SKEW, KURTOSIS, CoV; AM ≈ STDEV — awkward for a normal
• Tang & Davidson [16]
• s-sized interval boundaries for r, ln(r), and 1/r are tabulated below. The greyed boxes were made up, because the normal calculations yield impossible results; the -62 and 0 values are also impossible.

  Bin boundaries          <m-2s   <m-s    <m    <m+s   <m+2s
  normal (r)                -62      0     61    123     184
  lognormal (ln r)            2      9     31    114     416
  inverse normal (1/r)        5      7     16    114     416

[Figure: normal probability plots of z(r) (7 points off chart), z(ln(r)), and z(1/r), z-axis from -3.0 to 3.0]
64-bit LFK – CRAY X-MP, COS, CFT77.12 (size = 471); * = vectorized kernel

  Loop    r (MFLOPS)
   24        3.65
   15        5.18
   13        5.83
   16        6.15
   17       10.15
    6*      11.28
   11       12.68
   20       13.22
   19       13.36
   23       13.88
    5       14.36
   14       22.22
    2*      45.51
   10*      61.21
   22*      65.78
    4*      65.94
   12*      74.34
   21*     108.94
   18*     110.57
    8*     145.79
    3*     151.70
    9*     157.52
    1*     164.58
    7*     187.75

  n = 24. r: MEDIAN 33.9, HM 15.6, GM 31.5, AM 61.3, STDEV 61.5, SKEW 0.82, KURTOSIS -0.81, CoV 1.00; multiplicative STDEV 3.64
  Coefficient of Determination: r 0.87, ln(r) 0.97, 1/r 0.77
  95% Confidence Limits (as tabulated, one per transform): 15.4 / 9.0 / 5.4 to 38.3 / 24.4 / 15.5
  (ln(r), 1/r, and z-score columns omitted; they feed the probability plots above.)
NoMeanFeat – Copyright 2004, John Mashey
58
Optimization / Tuning
• The "social engineering" issue of creating good industry benchmarks and controlling their "gaming" is an entirely different talk, except for one bit of math:
• In competitive benchmarking, the tuning focus depends on the metric
  – HM: work to raise the smallest ri, especially low outliers
  – AM: work to reduce the largest run-times xi or yi, especially high outliers
  – GM: work to tune every program, since improving any ri by a factor F is as good as improving any other ri by that factor.
  – WAM/WHM: work on the programs with the largest weights from the WCA
NoMeanFeat – Copyright 2004, John Mashey
59
“Two Useful Answers – WAW or SERPOP”
• Overall population: hopeless!
• Workload Characterization Analysis (WCA)
  – Gather data, generate Weights
  – Codify "local knowledge"
• External: published rates/ratios
  – Can sometimes be used to fill in missing data, increase sample size
• Workload Analysis with Weights (WAW) — workload-dependent
  – Needs goodness of the Weights
  – Can be "what if" analyses
  – Algebra on workload parameters
  – R% for a workload
• Sample Estimate of Relative Performance Of Programs (SERPOP) — workload-neutral
  – Representativeness, sample size
  – Statistical analysis on a sample
  – R% for programs on systems, plus distribution and confidence

[Flowchart, repeated from slide 27: Population of programs Pi, i=1..N, with run-times for X & Y.
 Workload-dependent branch: WCA (system log, "experience": Txi = total run-time per Pi → weights Wi) and/or External (published metrics Pk: Mxk, Myk → rk(Y:X)) feed WAW: select an input per Pi, run on X and Y to get xi, yi, ri = xi/yi (assume IA1, IA2); combine Wi with xi, yi (WAM, Ra: IA3) or with ri (WHM, Rh: IA4) → Rwa, Rwh, Rw via WCA or IA3, IA4, IA5.
 Workload-neutral branch: SERPOP: select an input per Pi, run on X and Y, ri = xi/yi (assume IA1, IA2); GM → Rg (IA7, IA8), plus s, CoV, Skew, Kurtosis, CoD, confidence limits.]
NoMeanFeat – Copyright 2004, John Mashey
60
Assumptions Again
• IA1: Repeatability
• IA2: Consistent Relative Performance
• IA3: Benchmark Equals Workload
• IA4: Equal times on X
• IA5: Equal times on Y
• IA6: Extreme cases
• IA7: Sufficient sample size
• IA8: Representative sample
NoMeanFeat – Copyright 2004, John Mashey
61
Conclusion
• Do as much WCA as affordable
  – Competitively, better WCA = better products
  – Very difficult for general-purpose CPUs
  – Much more plausible for dedicated, embedded systems, System-on-Chip
    Really important for real-time or equivalent
• When weights are known, use WAW to estimate workload run-time
  – Algebra
• When they are not, use SERPOP to analyze the distribution
  – Statistics
  – Large CoV or equivalent, wide confidence intervals → get more/better data
• Performance is a distribution, not just a mean
NoMeanFeat – Copyright 2004, John Mashey
62
References
1.
P. J. Fleming, J. J. Wallace, “How Not to Lie With
Statistics: The Correct Way to Summarize
Benchmarks,” Comm ACM, Vol 29, No. 3, pp. 218-221,
March 1986.
2. J. E. Smith, “Characterizing Computer Performance with
a Single Number,” Comm ACM, Vol 31, No. 10, pp.
1202-1206, October 1988.
3. L. K. John, “More on finding a Single Number to indicate
Overall Performance of a Benchmark Suite,” Computer
Architecture News, Vol. 32, No 1, pp. 3-8, March 2004.
4. R. Jain, The Art of Computer Systems Performance
Analysis, Wiley, New York, 1991. See especially
Chapter on “Ratio Games.”
5. D. J. Lilja, Measuring Computer Performance – A
Practitioner's Guide, Cambridge University Press, 2000.
6. J. L. Hennessy, D. A. Patterson, Computer Architecture
– A Quantitative Approach, Third Edition, Morgan
Kaufmann Publishers, 2003.
7. W. J. DeCoursey, Statistics and Probability for
Engineering Applications, Newnes, Amsterdam, 2003.
8. NIST/SEMATECH e-Handbook of Statistical Methods,
http://www.itl.nist.gov/div898/handbook, 2004.
9. McMahon, F., “The Livermore Fortran kernels: A
Computer test of numerical performance range,” Tech.
Rep. UCRL-55745, Lawrence Livermore National
Laboratory, Univ. of California, Livermore, 1986.
10. SPEC, www.spec.org.
11. J. R. Mashey, “War of the Benchmark Means: Time for
a Truce,” Computer Architecture News, Vol 32, No 3,
Sept 2004. [TO BE PUBLISHED]
12. J. Tang, E. S. Davidson, “An Evaluation of Cray-1 and
Cray X-MP Performance on Vectorizable Livermore
Fortran Kernels,” International Conference on
Supercomputing, pp. 510-518, ACM, July 1988.
13. http://mathworld.wolfram.com, a good website on
mathematics. Good place to look up distributions.
14. MIPS Computer Systems, “Performance Brief Part 1:
CPU Benchmarks, Issue 3.0,” October 1987.
15. Ralph Humphries, “Performance Report, Revision 1.4,
July 1, 1989,” Silicon Graphics Computer Systems,
Mountain View, CA.
16. J.H. Tang, E. S. Davidson, “An evaluation of Cray-1 and
Cray X-MP Performance on Vectorizable Livermore
Fortran Kernels,” International Conf. on
Supercomputing, ACM, pp. 510-518, July 1988.
NoMeanFeat – Copyright 2004, John Mashey
63
Feedback Please
•
A. Working on:
1. BS 2. MS 3. PhD 4. Already have PhD or other
•
B. Statistics background
1. None 2. Some embedded in other course 3. High school course 4. Undergraduate course 6. Grad school
And if class taken, what department label? ______________________________
•
C. Any terminology that was insufficiently defined early enough?
•
D. Anything else you would have expected to be in this talk?
•
E. Anything too elementary?
•
F. Any other comments or suggestions?
NoMeanFeat – Copyright 2004, John Mashey
64