CS210_305_05


Transcript CS210_305_05

CS.210 Computer Systems and Architecture
<http://spider.science.strath.ac.uk/spider/spider/showClass.php?class=cs210>
and
CS.305 Computer Architecture
<local.cis.strath.ac.uk/teaching/ug/classes/CS.305>

Assessing and Understanding Performance
Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005,
and from slides kindly made available by Dr Mary Jane Irwin, Penn State University.
Performance Metrics
- Purchasing perspective
  - given a collection of machines, which has the
    • best performance?
    • least cost?
    • best cost/performance?
- Design perspective
  - faced with design options, which has the
    • best performance improvement?
    • least cost?
    • best cost/performance?
- Both require
  - basis for comparison
  - metric for evaluation
- Our goal is to understand what factors in the architecture contribute to overall system performance and the relative importance (and cost) of these factors
Defining (Speed) Performance
- Normally interested in reducing
  - Response time (aka execution time) – the time between the start and the completion of a task
    • Important to individual users
  - Thus, to maximize performance, need to minimize execution time

      Performance_X = 1 / Execution time_X

    If X is n times faster than Y, then

      Performance_X / Performance_Y = Execution time_Y / Execution time_X = n

  - Throughput – the total amount of work done in a given time
    • Important to data center managers
  - Decreasing response time almost always improves throughput
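As a quick check of the speedup relation above, here is a minimal Python sketch; the two execution times are hypothetical and not taken from the slides:

```python
def speedup(exec_time_y, exec_time_x):
    """Return n such that X is n times faster than Y.

    Performance_X / Performance_Y = Execution time_Y / Execution time_X = n
    """
    return exec_time_y / exec_time_x

# Hypothetical example: machine X runs a task in 10 s, machine Y in 15 s.
n = speedup(exec_time_y=15.0, exec_time_x=10.0)
print(f"X is {n:.2f} times faster than Y")  # -> X is 1.50 times faster than Y
```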
Performance Factors
- Want to distinguish elapsed time and the time spent on our task
  - CPU execution time (CPU time) – time the CPU spends working on a task
    • Does not include time waiting for I/O or running other programs

      CPU Time_program = CPU clock cycles_program × Clock cycle time
    or
      CPU Time_program = CPU clock cycles_program / Clock rate

- Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program
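The two equivalent forms of the CPU time relation can be sketched in Python as follows; the cycle count and clock values are made-up illustrations:

```python
def cpu_time_from_cycle_time(clock_cycles, clock_cycle_time_s):
    # CPU Time = CPU clock cycles x Clock cycle time
    return clock_cycles * clock_cycle_time_s

def cpu_time_from_clock_rate(clock_cycles, clock_rate_hz):
    # CPU Time = CPU clock cycles / Clock rate
    return clock_cycles / clock_rate_hz

# Hypothetical program: 10 billion clock cycles on a 2 GHz machine (0.5 ns cycle).
cycles = 10e9
print(cpu_time_from_cycle_time(cycles, 0.5e-9))  # -> 5.0 seconds
print(cpu_time_from_clock_rate(cycles, 2e9))     # -> 5.0 seconds
```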
Review: Machine Clock Rate
- Clock rate (MHz, GHz) is the inverse of clock cycle time (clock period):

      Clock rate = 1 / Clock cycle        Clock cycle = 1 / Clock rate

     10 nsec clock cycle  => 100 MHz clock rate
      5 nsec clock cycle  => 200 MHz clock rate
      2 nsec clock cycle  => 500 MHz clock rate
      1 nsec clock cycle  =>   1 GHz clock rate
    500 psec clock cycle  =>   2 GHz clock rate
    250 psec clock cycle  =>   4 GHz clock rate
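A tiny Python check of the period-to-rate conversions in the table above:

```python
def clock_rate_hz(clock_cycle_s):
    # Clock rate = 1 / Clock cycle
    return 1.0 / clock_cycle_s

for period_s in (10e-9, 5e-9, 2e-9, 1e-9, 500e-12, 250e-12):
    print(f"{period_s * 1e12:6.0f} ps clock cycle => {clock_rate_hz(period_s) / 1e6:6.0f} MHz")
```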
Clock Cycles per Instruction (CPI)
- Not all instructions take the same amount of time to execute
  - One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

      clock cycles per program = instructions per program × average clock cycles per instruction

- Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute
  - A way to compare two different implementations of the same ISA

      CPI for this instruction class:    A    B    C
                                         1    2    3
Effective CPI
- Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

      Overall effective CPI = Σ (CPI_i × IC_i)   summed over i = 1 to n

  - where IC_i is the count (percentage) of instructions of class i executed
  - CPI_i is the (average) number of clock cycles per instruction for that instruction class
  - n is the number of instruction classes
- The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs
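A short Python sketch of this weighted average, using as an illustration the instruction mix from the worked example later in these slides (ALU 50% at CPI 1, load 20% at CPI 5, store 10% at CPI 3, branch 20% at CPI 2):

```python
def effective_cpi(mix):
    """Overall effective CPI = sum over classes of CPI_i * IC_i.

    mix: list of (fraction_of_instructions, cpi_for_class) pairs.
    """
    return sum(fraction * cpi for fraction, cpi in mix)

mix = [(0.50, 1), (0.20, 5), (0.10, 3), (0.20, 2)]
print(effective_cpi(mix))  # -> 2.2
```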
THE Performance Equation
- Our basic performance equation is then

      CPU Time = Instruction_count × CPI × Clock_cycle
    or
      CPU Time = (Instruction_count × CPI) / Clock_rate

- These equations separate the three key factors that affect performance
  - Can measure the CPU execution time by running the program
  - The clock rate is usually given
  - Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
  - CPI varies by instruction type and ISA implementation, for which we must know the implementation details
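A minimal Python sketch of the basic performance equation; the instruction count, CPI, and clock rate below are hypothetical values chosen only to show the arithmetic:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    # CPU Time = (Instruction_count x CPI) / Clock_rate
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 2 billion instructions, effective CPI of 2.2, 1 GHz clock.
print(cpu_time(2e9, 2.2, 1e9))  # -> 4.4 seconds
```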
Determinants of CPU Performance

      CPU Time = Instruction_count × CPI × Clock_cycle

                             Instruction count    CPI    Clock cycle
    Algorithm
    Programming language
    Compiler
    ISA
    Processor organization
    Technology
Determinants of CPU Performance

      CPU Time = Instruction_count × CPI × Clock_cycle

                             Instruction count    CPI    Clock cycle
    Algorithm                        X             X
    Programming language             X             X
    Compiler                         X             X
    ISA                              X             X          X
    Processor organization                         X          X
    Technology                                                 X
A Simple Example

    Op        Freq    CPI_i    Freq x CPI_i
    ALU       50%     1           .
    Load      20%     5           .
    Store     10%     3           .
    Branch    20%     2           .
                             Σ =

- Q1: How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
- Q2: How does this compare with using branch prediction to shave a cycle off the branch time?
- Q3: What if two ALU instructions could be executed at once?
A Simple Example

    Op        Freq    CPI_i    Freq x CPI_i    Q1      Q2      Q3
    ALU       50%     1           .5           .5      .5      .25
    Load      20%     5          1.0           .4     1.0     1.0
    Store     10%     3           .3           .3      .3      .3
    Branch    20%     2           .4           .4      .2      .4
                             Σ = 2.2          1.6     2.0     1.95

- Q1: How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
    CPU time new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster
- Q2: How does this compare with using branch prediction to shave a cycle off the branch time?
    CPU time new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster
- Q3: What if two ALU instructions could be executed at once?
    CPU time new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster
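The answers above can be reproduced with a small Python sketch built directly from the table's instruction mix and CPIs:

```python
def effective_cpi(mix):
    # Overall effective CPI = sum of frequency * CPI over the instruction classes.
    return sum(freq * cpi for freq, cpi in mix.values())

base = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}
q1 = {**base, "Load": (0.20, 2)}    # better data cache: load CPI 5 -> 2
q2 = {**base, "Branch": (0.20, 1)}  # branch prediction: branch CPI 2 -> 1
q3 = {**base, "ALU": (0.50, 0.5)}   # two ALU ops at once: effective ALU CPI 1 -> 0.5

old_cpi = effective_cpi(base)       # 2.2
for name, mix in (("Q1", q1), ("Q2", q2), ("Q3", q3)):
    new_cpi = effective_cpi(mix)
    print(f"{name}: new CPI = {new_cpi:.2f}, "
          f"{(old_cpi / new_cpi - 1) * 100:.1f}% faster")
# -> Q1: new CPI = 1.60, 37.5% faster
# -> Q2: new CPI = 2.00, 10.0% faster
# -> Q3: new CPI = 1.95, 12.8% faster
```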
Comparing and Summarizing Performance
- How do we summarize the performance of a benchmark set with a single number?
  - The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)

      AM = (1/n) × Σ Time_i   summed over i = 1 to n

  - where Time_i is the execution time for the ith program of a total of n programs in the workload
  - A smaller mean indicates a smaller average execution time and thus improved performance
- The guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))
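A minimal Python sketch of the arithmetic mean over a workload; the execution times below are hypothetical:

```python
def arithmetic_mean(times):
    # AM = (1/n) * sum of Time_i over the n programs in the workload
    return sum(times) / len(times)

# Hypothetical execution times (seconds) for a four-program workload.
workload = [12.0, 7.5, 30.2, 4.3]
print(f"AM = {arithmetic_mean(workload):.2f} s")  # -> AM = 13.50 s
```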
SPEC Benchmarks www.spec.org
Integer benchmarks:
  gzip      compression
  vpr       FPGA place & route
  gcc       GNU C compiler
  mcf       Combinatorial optimization
  crafty    Chess program
  parser    Word processing program
  eon       Computer visualization
  perlbmk   perl application
  gap       Group theory interpreter
  vortex    Object oriented database
  bzip2     compression
  twolf     Circuit place & route

FP benchmarks:
  wupwise   Quantum chromodynamics
  swim      Shallow water model
  mgrid     Multigrid solver in 3D fields
  applu     Parabolic/elliptic pde
  mesa      3D graphics library
  galgel    Computational fluid dynamics
  art       Image recognition (NN)
  equake    Seismic wave propagation simulation
  facerec   Facial image recognition
  ammp      Computational chemistry
  lucas     Primality testing
  fma3d     Crash simulation (fem)
  sixtrack  Nuclear physics accel
  apsi      Pollutant distribution
Example SPEC Ratings
Other Performance Metrics
- Power consumption – especially in the embedded market, where battery life (and cooling) is important
  - For power-limited applications, the most important metric is energy efficiency
Other Performance Metrics - (Native) MIPS
- (Native) MIPS - and What Is Wrong with Them
  - The dangers of using metrics other than time in performance measurement can be shown by looking at several popular alternatives.
  - One such alternative is MIPS - Millions of Instructions Per Second

      MIPS = Instruction count / (Execution time × 10^6) = Clock rate / (CPI × 10^6)

  - The second form is sometimes convenient since the clock rate is fixed for a machine and CPI is usually a small number, unlike instruction count or execution time. Relating MIPS to time,

      Execution time = Instruction count / (MIPS × 10^6)
The Problem With MIPS
- The problem with using MIPS as a measure of comparison is threefold:
  - MIPS is dependent on the instruction set, making it difficult to compare the MIPS of computers with different instruction sets;
  - MIPS varies between programs on the same computer; and, most importantly,
  - MIPS can vary inversely to performance!
- The classic example of the last case is the MIPS rating of a machine with optional floating-point hardware. Machines with the option run programs faster yet have a lower MIPS rating, because slow floating-point instructions replace the many fast integer instructions that software emulation would otherwise execute.
Peak MIPS
- Beware of so-called peak MIPS. This type of rating is obtained by choosing an instruction mix that minimises the CPI, even if that instruction mix is totally impractical. For instance:
  - A program composed entirely of arithmetic and logic operations, with no jumps, branches, or loads/stores!
  - Or, as in a famous case, a program comprising only NOPs (No OPerations)
- In other words: peak MIPS is a level of performance that will never be attained ;-)
Relative MIPS
- MIPS can fail to give a true picture of performance in that it does not track execution time. An alternative type of MIPS rating is relative MIPS (as opposed to native MIPS), derived by using a particular machine as a reference point:

      Relative MIPS = (Time_reference / Time_unrated) × MIPS_reference

    where
      Time_reference  = execution time of a program on a reference machine
      Time_unrated    = execution time of the same program on a machine to be rated
      MIPS_reference  = agreed-upon MIPS rating of the reference machine

- The advantage of this form of MIPS is small, since the execution time, program, and program input still must be known to have meaningful information.
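A short Python sketch of the relative MIPS calculation; the reference rating and both timings below are hypothetical:

```python
def relative_mips(time_reference_s, time_unrated_s, mips_reference):
    # Relative MIPS = (Time_reference / Time_unrated) x MIPS_reference
    return (time_reference_s / time_unrated_s) * mips_reference

# Hypothetical: a reference machine rated at 1 MIPS runs the program in 120 s;
# the machine being rated runs the same program on the same input in 12 s.
print(relative_mips(120.0, 12.0, 1.0))  # -> 10.0 relative MIPS
```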
Summary: Evaluating ISAs
- Design-time metrics:
  - Can it be implemented, in how long, at what cost?
  - Can it be programmed? Ease of compilation?
- Static metrics:
  - How many bytes does the program occupy in memory?
- Dynamic metrics:
  - How many instructions are executed? How many bytes does the processor fetch to execute the program?
  - How many clocks are required per instruction?
  - How "lean" a clock is practical?
- Best metric: time to execute the program! It depends on the instruction set, the processor organization, and compilation techniques; that is, on instruction count, CPI, and cycle time.
Fallacies and Pitfalls
- Pitfall: Expecting the improvement of one aspect of a machine to increase performance by an amount proportional to the size of the improvement.
  - Example: Suppose a program takes 100 seconds to run and multiply operations account for 80 seconds of this time. How much do you need to improve the speed of multiplication to make the program run five times faster?
  - A: Using Amdahl's Law:

      Execution time after improvement =
          (Execution time affected by improvement / Amount of improvement) + Execution time unaffected
Fallacies and Pitfalls
- …A: Using Amdahl's Law (and the problem set):

      Execution time after improvement = 80 seconds / n + (100 - 80) seconds

  - To get 5 times faster the new execution time must be 20 seconds

      20 seconds = 80 seconds / n + 20 seconds
       0 seconds = 80 seconds / n

  - I.e. there is no amount by which we can enhance multiply to achieve a fivefold improvement in execution time!
- Making the common case fast…
  - …will tend to enhance performance better than optimising the rare case.
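A short Python sketch of Amdahl's Law applied to this example; it solves for the improvement factor n needed to reach a target overall speedup and reports when no finite n will do. The 2x target in the last line is an extra hypothetical case, not from the slides:

```python
def required_improvement(affected_s, unaffected_s, target_speedup):
    """Solve  affected/n + unaffected = (affected + unaffected)/target_speedup  for n."""
    total = affected_s + unaffected_s
    target_time = total / target_speedup
    if target_time <= unaffected_s:
        return None  # even an infinitely fast enhancement cannot reach the target
    return affected_s / (target_time - unaffected_s)

# The slide's example: 100 s total, 80 s of multiply, want a 5x overall speedup.
print(required_improvement(80.0, 20.0, 5.0))  # -> None (impossible)
print(required_improvement(80.0, 20.0, 2.0))  # -> 2.666..., i.e. multiply must be ~2.7x faster
```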
Fallacies and Pitfalls
- Pitfall: Comparing computers using only one or two of the three performance metrics: clock rate, CPI, and instruction count.
- Pitfall: Using peak performance to compare machines.
- Fallacy: Synthetic benchmarks predict performance.
  - Synthetic benchmarks are small artificial programs that attempt to represent the execution frequency of statements found in a larger set of benchmarks or in real-world programs. Whetstone and Dhrystone are examples. Since these are not natural programs, they can (be used to) distort performance statistics by, e.g.:
    • compilers discarding large sections of the code!
    • compilers targeting optimisation 'opportunities' specifically for a benchmark and hence artificially inflating the performance stats, e.g. a 20% to 30% improvement from a string copy 'optimisation' for Dhrystone that could not be applied in over 99% of normal programs!!
Concluding Remarks
- The task a computer designer faces is a complex one: determine what attributes are important for a new machine, then design a machine to maximise performance while staying within cost constraints.
- Performance can be measured as throughput or response time; which of the two matters depends on the environment/application and should be borne in mind.
- Amdahl's Law is a valuable tool for determining what performance improvement an architectural enhancement may give.
- Knowing which cases are the most frequent is critical to improving performance. Based on empirical studies of instruction sets, tradeoffs can be made by deciding which instructions are the most important and which cases to try to make fast.
- Computer designs will always be measured by cost and performance, and finding the best balance will always be the art of computer design, just as in any engineering task.