Let’s look at the single-cycle model analytically
Performance
– what is it: measures of performance
The CPU Performance Equation:
– Execution time as the measure
– what affects execution time
– examples
Choosing good benchmarks?
– choosing bad benchmarks?
Amdahl's Law
Datorteknik PerformanceAnalyse bild 1
Performance is Time
Time to do the task (Execution Time)
– execution time, response time, latency
Tasks per unit time (sec, minute, ...)
– throughput, bandwidth
Datorteknik PerformanceAnalyse bild 2
Performance as Response Time
Performance is most often measured as
response time or execution time for some
task.
“X is n times faster than Y” means
Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X) = n
Example
Execution time of program P:
on X it is 5 sec; on Y it is 10 sec.
10 / 5 = 2, so X is 2 times faster than Y.
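The ratio above can be checked with a few lines of Python. This is only a sketch: the function name `times_faster` is made up for illustration, and the times are the ones from the example.

```python
def times_faster(exec_time_y: float, exec_time_x: float) -> float:
    """Return n such that X is n times faster than Y.

    n = Performance(X) / Performance(Y) = ExecutionTime(Y) / ExecutionTime(X)
    """
    if exec_time_x <= 0 or exec_time_y <= 0:
        raise ValueError("execution times must be positive")
    return exec_time_y / exec_time_x


# Program P from the example: 5 sec on X, 10 sec on Y.
n = times_faster(exec_time_y=10.0, exec_time_x=5.0)
print(f"X is {n:g} times faster than Y")
```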
Datorteknik PerformanceAnalyse bild 3
What time to measure?
Elapsed time, wall-clock time:
– actual time from start to completion
– depends on CPU, system, I/O, etc.
– often used in real benchmarks
– only suitable choice when I/O is included
CPU Time:
– measure/analyze CPU performance only
– may be suitable when machine is timeshared
– possibly both user and system component
– User CPU time is our focus for first part of course
Elapsed time = CPU time + Idle time
– usually and assuming time is accurately accounted for
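As a rough illustration of the difference between elapsed time and CPU time, the sketch below times a made-up task with Python's standard `time` module. `time.perf_counter()` reads a wall-clock timer and `time.process_time()` reads user plus system CPU time of the current process; `measure` and `sample_task` are hypothetical names, and the sleep only stands in for I/O or idle time.

```python
import time


def measure(task):
    """Run task() and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) time
    cpu_start = time.process_time()    # user + system CPU time of this process
    task()
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu


def sample_task():
    # Hypothetical workload: some computation, then waiting (stands in for I/O).
    sum(i * i for i in range(200_000))
    time.sleep(0.2)


elapsed, cpu = measure(sample_task)
# Elapsed time = CPU time + idle time (approximately, on a real system).
print(f"elapsed = {elapsed:.3f} s, CPU = {cpu:.3f} s, idle ~ {elapsed - cpu:.3f} s")
```

The sleeping part shows up in the elapsed time but not in the CPU time, which is why CPU time alone is misleading when I/O matters.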
Datorteknik PerformanceAnalyse bild 4
Metrics of performance
Different performance metrics are appropriate
at different levels:
Levels, from top to bottom:
Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors
Metrics, from the application level downwards:
– Frames per second
– Operations per second
– (Millions of) instructions per second: MIPS
– (Millions of) floating-point operations per second: MFLOP/s
– Cycles per second (clock rate)
– Cycles per instruction
Datorteknik PerformanceAnalyse bild 5
Relating Processor Metrics
CPU execution time per program
= CPU clock cycles/program × Clock cycle time
= CPU clock cycles/program ÷ Clock rate (frequency)
CPU clock cycles/program
= Instructions/program × Clock cycles Per Instruction
Clock cycles Per Instruction (CPI) is an average measurement:
– it depends on the ISA, the implementation, and the program measured
– CPI = CPU clock cycles/program ÷ Instructions/program
– Also, Instructions per clock cycle, or IPC = 1 / CPI
CPU execution time = Instructions × CPI × Clock cycle time
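A minimal sketch of the CPU performance equation in Python. The function names and all numbers (instruction count, cycle count, clock rate) are made up for illustration and do not come from any measured machine.

```python
def cpi_from_counts(clock_cycles: float, instructions: float) -> float:
    """CPI = CPU clock cycles/program / Instructions/program."""
    return clock_cycles / instructions


def cpu_time(instructions: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU execution time = Instructions x CPI x Clock cycle time (seconds)."""
    clock_cycle_time = 1.0 / clock_rate_hz
    return instructions * cpi * clock_cycle_time


# Hypothetical program: 1e9 instructions taking 2e9 cycles on a 500 MHz clock.
instructions = 1_000_000_000
clock_cycles = 2_000_000_000
clock_rate_hz = 500e6

cpi = cpi_from_counts(clock_cycles, instructions)
ipc = 1.0 / cpi
t = cpu_time(instructions, cpi, clock_rate_hz)
print(f"CPI = {cpi:g}, IPC = {ipc:g}, CPU time = {t:g} s")
```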
Datorteknik PerformanceAnalyse bild 6
Let’s look at the single-cycle model
analytically
Datorteknik PerformanceAnalyse bild 7
Static timing analysis
Component delays:
– Memories: 10 ns
– Register: 5 ns
– Adders: 10 ns
– ALU: 10 ns
Use topological sort!
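One way to picture static timing analysis with a topological sort is the sketch below. It reads the slide's delays as memories 10 ns, register 5 ns, adders 10 ns and ALU 10 ns. The circuit graph is an assumption for illustration (a load path through instruction memory, register read, ALU and data memory), not the full datapath, and `arrival_times` is a hypothetical helper. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Component delays in ns, as read from the slide.
DELAY_NS = {"memory": 10, "register": 5, "adder": 10, "alu": 10}

# Assumed circuit: node -> (component kind, list of predecessor nodes).
LOAD_PATH = {
    "instr_mem": ("memory", []),
    "reg_read": ("register", ["instr_mem"]),
    "alu": ("alu", ["reg_read"]),
    "data_mem": ("memory", ["alu"]),
}


def arrival_times(circuit):
    """Latest arrival time (ns) at the output of every node."""
    order = TopologicalSorter(
        {node: preds for node, (_, preds) in circuit.items()}
    ).static_order()  # predecessors are always visited first
    arrival = {}
    for node in order:
        kind, preds = circuit[node]
        latest_input = max((arrival[p] for p in preds), default=0)
        arrival[node] = latest_input + DELAY_NS[kind]
    return arrival


arrival = arrival_times(LOAD_PATH)
critical_delay = max(arrival.values())
print(f"critical path delay = {critical_delay} ns")
```

Because every node is processed after all of its inputs, one pass over the sorted order is enough to find the longest (critical) path.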
Datorteknik PerformanceAnalyse bild 8
[Figure: single-cycle datapath with zero extend, sign/zero extend, branch logic, ALU (inputs A and B) and two adders; component delays of 5 ns and 10 ns are marked.]
Instruction: lw $2 const($3)
Path delay: 35 ns
Datorteknik PerformanceAnalyse bild 9
But that path goes through
the data memory!
What if this is not a load/store?
How about an instruction that does nothing?
“NOP”
Datorteknik PerformanceAnalyse bild 10
[Figure: single-cycle datapath with zero extend, sign/zero extend, branch logic, ALU (inputs A and B) and two adders; component delays of 5 ns and 10 ns are marked.]
Instruction: Nop
Path delay: 10 ns
Datorteknik PerformanceAnalyse bild 11
[Figure: single-cycle datapath with zero extend, sign/zero extend, branch logic, ALU (inputs A and B) and two adders; component delays of 5 ns and 10 ns are marked.]
Instruction: Add $ra $rb $rc
Path delay: 25 ns
Datorteknik PerformanceAnalyse bild 12
[Figure: single-cycle datapath with zero extend, sign/zero extend, branch logic, ALU (inputs A and B) and two adders; component delays of 5 ns and 10 ns are marked.]
Instruction: B label
Path delay: 20 ns
Datorteknik PerformanceAnalyse bild 13
35 ns for load/store,
but 10 ns for NOP!?
Datorteknik PerformanceAnalyse bild 14
Amdahl’s Law:
“Make the common case fast”
Datorteknik PerformanceAnalyse bild 15
Amdahl's Law
Handy for evaluating impact of a change not tied to
CPU performance equation
Insight: No improvement of a feature enhances
performance by more than the use of the feature.
Suppose that enhancement E accelerates fraction F
of a program by a factor S (remainder of the task is
unaffected):
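ExecTime(with E) = (1 – F × (1 – 1/S)) × ExecTime(without E)
= ((1 – F) + F/S) × ExecTime(without E)
[Figure: execution time bar without E, split into F and 1 – F; with E, split into F/S and 1 – F.]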
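A small numerical sketch of Amdahl's Law in Python. The function names are made up, and the values of F and S are hypothetical. Note that (1 – F(1 – 1/S)) is the same as (1 – F) + F/S.

```python
def exec_time_ratio(f: float, s: float) -> float:
    """ExecTime with E divided by ExecTime without E: (1 - F) + F / S."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("F must be a fraction between 0 and 1")
    if s <= 0.0:
        raise ValueError("S must be positive")
    return (1.0 - f) + f / s


def overall_speedup(f: float, s: float) -> float:
    """Speedup of the whole program when fraction F is accelerated by S."""
    return 1.0 / exec_time_ratio(f, s)


# Hypothetical enhancement: 40% of the program runs 10 times faster.
F, S = 0.4, 10.0
ratio = exec_time_ratio(F, S)
speedup = overall_speedup(F, S)
limit = 1.0 / (1.0 - F)  # upper bound, reached only as S grows without limit
print(f"time ratio = {ratio:.2f}, speedup = {speedup:.4f}, limit = {limit:.4f}")
```

Even an unlimited S cannot push the overall speedup past 1 / (1 – F), which is the insight stated above.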
Datorteknik PerformanceAnalyse bild 16
What if we don’t need the ALU?
A branch instruction?
Datorteknik PerformanceAnalyse bild 17
BUT!
The single cycle model has to accommodate
the slowest instruction
Even if it rarely occurs!
Datorteknik PerformanceAnalyse bild 18
How much work can our
structure perform?
For a program Q:
Time = Number of executed instructions *
Number of cycles per instruction *
Time per cycle
T = Nq * CPI * Tc
Datorteknik PerformanceAnalyse bild 19
For the single cycle model....
CPI = 1 for all instructions
Tc determined by the slowest instruction
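A sketch of what this means in numbers, using the path delays from the datapath slides (35 ns for lw, 25 ns for add, 20 ns for a branch, 10 ns for NOP). The program Q of 1000 NOPs and the helper name `single_cycle_time_ns` are hypothetical.

```python
# Path delay per instruction, in ns, as given on the datapath slides.
PATH_DELAY_NS = {"lw": 35, "add": 25, "branch": 20, "nop": 10}


def single_cycle_time_ns(n_instructions: int) -> int:
    """T = Nq * CPI * Tc for the single-cycle model."""
    cpi = 1                            # every instruction takes one cycle
    tc = max(PATH_DELAY_NS.values())   # cycle time set by the slowest instruction
    return n_instructions * cpi * tc


# Hypothetical program Q: 1000 NOPs.
n_q = 1000
t_q = single_cycle_time_ns(n_q)
work_needed = n_q * PATH_DELAY_NS["nop"]  # time the NOP paths alone would need
print(f"T = {t_q} ns, although the NOP paths only need {work_needed} ns")
```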
Datorteknik PerformanceAnalyse bild 20
How to reduce T?
T = Nq * CPI * Tc
Reduce Nq.
More powerful instructions!
More hardware, longer paths, cycle time
goes up (slower machine)
Datorteknik PerformanceAnalyse bild 21
“No free lunch”
Why designers are so well paid to optimize designs.
Datorteknik PerformanceAnalyse bild 22
How to reduce T?
T = Nq * CPI * Tc
Faster hardware
Technological limits
Cost increase not linearly related
Sales volume drops
Datorteknik PerformanceAnalyse bild 23
How to reduce T?
T = Nq * CPI * Tc
Make CPI a function of the instruction
For example:
NOP = 1 cycle
LW = 4 cycles
Chapter 5.4, the classical method
Datorteknik PerformanceAnalyse bild 24
How to reduce T?
T = Nq * CPI * Tc
Make CPI a function of the instruction
CPI goes up, but we can use an average,
not the worst case
Tc goes down: it is the time to do the longest step,
not the entire instruction
Datorteknik PerformanceAnalyse bild 25
Example
Branch:
Step 1: fetch
Step 2: New PC
Add:
Step 1: fetch
Step 2: decode/ register fetch
Step 3: Compute and write back
Datorteknik PerformanceAnalyse bild 26
Example
LW = 4 steps
Cycle time = 1/4 old time
T(LW) = 4 * 1/4 old time, where 4 is the CPI of lw
Just as slow for the lw instruction, our worst case!
Datorteknik PerformanceAnalyse bild 27
But that’s not important if
LW is not common!
T = Nq * CPI * 1/4 old time
CPI is averaged over this many (Nq) instructions
CPI = 1.3? 1.7?
Never 4.0!
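To make the averaging concrete, here is a sketch that computes an average CPI from an instruction mix. The cycle counts follow the slides (NOP 1, branch 2, add 3, lw 4) and the old cycle time is the 35 ns of the single-cycle model. The instruction mix is purely hypothetical, as are the names used.

```python
# Cycles per instruction in the multi-cycle design, following the slides.
CYCLES = {"nop": 1, "branch": 2, "add": 3, "lw": 4}

# Hypothetical instruction mix (fractions of executed instructions).
MIX = {"nop": 0.10, "branch": 0.20, "add": 0.50, "lw": 0.20}

OLD_TC_NS = 35.0            # single-cycle model: slowest instruction (lw)
NEW_TC_NS = OLD_TC_NS / 4   # multi-cycle model: 1/4 of the old cycle time


def average_cpi(cycles, mix):
    """Weighted average of the cycle counts over the instruction mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return sum(fraction * cycles[instr] for instr, fraction in mix.items())


n_q = 1_000_000
cpi = average_cpi(CYCLES, MIX)
t_single = n_q * 1 * OLD_TC_NS    # CPI = 1, long cycle
t_multi = n_q * cpi * NEW_TC_NS   # higher CPI, short cycle
speedup = t_single / t_multi
print(f"average CPI = {cpi:.2f}, speedup over single cycle = {speedup:.3f}")
```

The multi-cycle design wins whenever the average CPI stays below 4, that is, whenever the program is not made up of lw instructions only.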
Datorteknik PerformanceAnalyse bild 28
We win because of quantitative statistical
properties of our programs!
Datorteknik PerformanceAnalyse bild 29
What value of CPI do we use?
1.3?
1.5?
1.7?
Easy: use an average program!?
Datorteknik PerformanceAnalyse bild 30
There is no such thing!
Datorteknik PerformanceAnalyse bild 31
Artificial “average programs”
called “benchmarks”
Are they something to trust?
What about “peak performance values”?
MIPS?
MFLOPS?
We have a peak at CPI = 1...
...for a program of only NOPs!
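A sketch of why a peak value says little. It uses MIPS = instructions per second / 10^6 = 1 / (CPI × cycle time × 10^6). The cycle time of 8.75 ns (a quarter of 35 ns) and the average CPI of 1.5 are assumptions for illustration, and `mips` is a made-up helper.

```python
def mips(clock_cycle_ns: float, cpi: float) -> float:
    """Millions of instructions per second for a given cycle time and CPI."""
    instructions_per_second = 1e9 / (clock_cycle_ns * cpi)
    return instructions_per_second / 1e6


TC_NS = 8.75        # assumed cycle time: a quarter of the 35 ns single cycle
AVERAGE_CPI = 1.5   # assumed average CPI of some real program

peak = mips(TC_NS, cpi=1.0)            # only reached by a program of NOPs
delivered = mips(TC_NS, cpi=AVERAGE_CPI)
print(f"peak = {peak:.1f} MIPS, delivered = {delivered:.1f} MIPS")
```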
Datorteknik PerformanceAnalyse bild 32
Why Do Benchmarks?
How we evaluate performance differences
– Across and within a single system (design & variations)
What should benchmarks do?
– Represent a large class of important programs
– Behave like typical programs:
improved benchmark performance => improved performance broadly
For better or worse, benchmarks shape a field
Good ones accelerate progress
Bad benchmarks hurt progress
– help real programs vs. sell machines/papers?
– Enhancements that help benchmarks may not help most
programs, and vice versa
Datorteknik PerformanceAnalyse bild 33
Classes of Benchmarks
(Toy) Benchmarks
– 10-100 lines, e.g., sieve, puzzle, quicksort
– good first programming assignments
Synthetic Benchmarks
– attempt to match average frequencies of real workloads
– e.g., Whetstone, Dhrystone
– mostly good for nothing: too artificial
Kernels
– Time critical excerpts of real programs
– e.g., Livermore loops, Linpack
– good for micro-performance studies
Real programs
– e.g., gcc, spice, Verilog, Database, stock trading
Datorteknik PerformanceAnalyse bild 34
Successful Benchmark: SPEC Collection
1987: RISC industry (workstations) mired in
“bench marketing”:
– (“That is an 8 MIPS machine, but they claim 10 MIPS!”)
EE Times + 5 companies band together to
form the Systems Performance Evaluation
Committee (SPEC) in 1988:
– Sun, MIPS, HP, Apollo, DEC
Create standard list of programs, inputs,
reporting rules:
– several real programs, including OS calls
– some I/O
– rules for running and reporting
Datorteknik PerformanceAnalyse bild 35
Multiple clock cycle designs:
State machines
Micro programming
chapter 5.4
“Computer Organization & Design”
Datorteknik PerformanceAnalyse bild 36
How to reduce T?
T = Nq * CPI * Tc
Reduce the quotient cycles / instruction:
– Reduce “cycles”: multiple clock cycle design
– Increase “instruction”: execute more than one instruction per cycle!
Datorteknik PerformanceAnalyse bild 37
More than one
instruction per cycle?
Parallelism
– Div/mult + floating point + integer
Superscalarity
– Multiple issue etc.
Pipelining
– Of general importance
Datorteknik PerformanceAnalyse bild 38