Let’s look at the single-cycle model analytically

Performance

Performance
– what is it: measures of performance

The CPU Performance Equation:
– Execution time as the measure
– what affects execution time
– examples

Choosing good benchmarks?
– choosing bad benchmarks?

Amdahl's Law
Datorteknik PerformanceAnalyse bild 1
Performance is Time

Time to do the task (Execution Time)
– execution time, response time, latency

Tasks per unit time (sec, minute, ...)
– throughput, bandwidth
Datorteknik PerformanceAnalyse bild 2
Performance as Response Time


Performance is most often measured as
response time or execution time for some
task.
“X is n times faster than Y” means
Performance(X) / Performance(Y) = Execution Time(Y) / Execution Time(X) = n
Example
Execution time of program P: 5 sec on X, 10 sec on Y.

X is 10 / 5 = 2 times faster than Y.
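The ratio above can be checked with a few lines of Python. This is only a sketch; the function name times_faster is made up for illustration.

```python
def times_faster(exec_time_x: float, exec_time_y: float) -> float:
    """Return n such that "X is n times faster than Y"."""
    # Performance is the inverse of execution time, so the ratio flips.
    return exec_time_y / exec_time_x


# Numbers from the example: program P takes 5 sec on X and 10 sec on Y.
n = times_faster(exec_time_x=5.0, exec_time_y=10.0)
print(n)  # 2.0
```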
Datorteknik PerformanceAnalyse bild 3
What time to measure?

Elapsed time, wall-clock time:
– actual time from start to completion
– depends on CPU, system, I/O, etc.
– often used in real benchmarks
– only suitable choice when I/O is included

CPU Time:
– measure/analyze CPU performance only
– may be suitable when machine is timeshared
– possibly both user and system component
– User CPU time is our focus for first part of course

Elapsed time = CPU time + Idle time
– usually and assuming time is accurately accounted for
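As a rough illustration of the two kinds of time, the sketch below measures both for one task using Python's standard time module. The helper names measure and mostly_idle are hypothetical, and the 0.05 s sleep merely stands in for idle or I/O time.

```python
import time


def measure(task):
    """Run task() once and return (elapsed_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock (elapsed) timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    task()
    cpu = time.process_time() - cpu_start
    elapsed = time.perf_counter() - wall_start
    return elapsed, cpu


def mostly_idle():
    # Sleeping stands in for waiting on I/O: time passes, the CPU is not used.
    time.sleep(0.05)


elapsed, cpu = measure(mostly_idle)
print(f"elapsed={elapsed:.3f}s cpu={cpu:.3f}s idle~{elapsed - cpu:.3f}s")
```

For a task that mostly waits, elapsed time is much larger than CPU time; the difference is the idle time in the relation above.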
Datorteknik PerformanceAnalyse bild 4
Metrics of performance

Different performance metrics are appropriate at different levels.

Levels:
– Application
– Programming Language
– Compiler
– ISA
– Datapath
– Control
– Function Units
– Transistors

Metrics:
– Frames per second
– Operations per second
– (millions) of Instructions per second – MIPS
– (millions) of (F.P.) operations per second – MFLOP/s
– Cycles per second (clock rate)
– Cycles per Instruction
Datorteknik PerformanceAnalyse bild 5
Relating Processor Metrics

CPU execution time per program
= CPU clock cycles/program × Clock cycle time
= CPU clock cycles/program ÷ Clock rate (frequency)

CPU clock cycles/program
= Instructions/program × Clock cycles Per Instruction

Clock cycles Per Instruction (CPI) is an average measurement; it depends on:
– the ISA, the implementation, and the program measured
– CPI = CPU clock cycles/program ÷ Instructions/program
– Also, Instructions per clock cycle, or IPC = 1 / CPI

CPU execution time = Instructions × CPI × Clock cycle time
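The relations on this slide translate directly into code. The sketch below is a minimal version; the function names and the sample numbers (one million instructions, CPI of 2, a 100 MHz clock) are assumptions chosen only for illustration.

```python
def cpu_time_seconds(instructions: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU execution time = Instructions * CPI * Clock cycle time."""
    # Clock cycle time is 1 / clock rate, so we divide by the rate.
    return instructions * cpi / clock_rate_hz


def cpi_from_counts(cpu_clock_cycles: float, instructions: float) -> float:
    """CPI = CPU clock cycles/program divided by Instructions/program."""
    return cpu_clock_cycles / instructions


def ipc(cpi: float) -> float:
    """Instructions per clock cycle is the inverse of CPI."""
    return 1.0 / cpi


# Hypothetical program: 1,000,000 instructions, CPI = 2, 100 MHz clock.
t = cpu_time_seconds(instructions=1_000_000, cpi=2.0, clock_rate_hz=100e6)
print(f"{t * 1e3:.1f} ms")  # 20.0 ms
```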
Datorteknik PerformanceAnalyse bild 6
Let’s look at the single-cycle model
analytically
Datorteknik PerformanceAnalyse bild 7
Static timing analysis

Component delays:
– Memories: 10 ns
– Register: 5 ns
– Adders: 10 ns
– ALU: 10 ns

Use topological sort!
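A sketch of static timing analysis by topological sort, assuming the component delays on this slide (memories 10 ns, register read 5 ns, adders 10 ns, ALU 10 ns). The node names and the wiring in FEEDS are a simplified, hypothetical model of the datapath, not the exact circuit from the course. Each node's arrival time is the latest arrival among its inputs plus its own delay.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Delay of each component in ns; "pc" is a zero-delay starting point.
DELAY = {"pc": 0, "instr_mem": 10, "reg_read": 5,
         "alu": 10, "data_mem": 10, "pc_adder": 10}

# Hypothetical wiring: each node maps to the nodes that feed it.
FEEDS = {
    "pc": [],
    "instr_mem": ["pc"],
    "pc_adder": ["pc"],
    "reg_read": ["instr_mem"],
    "alu": ["reg_read"],
    "data_mem": ["alu"],
}


def arrival_times(feeds, delay):
    """Latest time (ns) at which each node's output is stable."""
    arrival = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(feeds).static_order():
        latest_input = max((arrival[p] for p in feeds[node]), default=0)
        arrival[node] = latest_input + delay[node]
    return arrival


arrival = arrival_times(FEEDS, DELAY)
critical_path_ns = max(arrival.values())
print(critical_path_ns)  # 35
```

In this model the longest path ends at the data memory, which matches the load case.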
Datorteknik PerformanceAnalyse bild 8
[Figure: single-cycle datapath (sign/zero extend, branch logic, ALU, adders) with the critical path for lw $2 const($3) marked: 35 ns delay]
Datorteknik PerformanceAnalyse bild 9
But that path goes through
the data memory!

What if this is not a load/store?

How about an instruction that does nothing?
“NOP”
Datorteknik PerformanceAnalyse bild 10
[Figure: single-cycle datapath with the critical path for Nop marked: 10 ns delay]
Datorteknik PerformanceAnalyse bild 11
[Figure: single-cycle datapath with the critical path for Add $ra $rb $rc marked: 25 ns delay]
Datorteknik PerformanceAnalyse bild 12
[Figure: single-cycle datapath with the critical path for B label marked: 20 ns delay]
Datorteknik PerformanceAnalyse bild 13
35 ns for load/store
but
10 ns for NOP !?
Datorteknik PerformanceAnalyse bild 14
Amdahl’s Law:
“Make the common case fast”
Datorteknik PerformanceAnalyse bild 15
Amdahl's Law




Handy for evaluating impact of a change not tied to
CPU performance equation
Insight: No improvement of a feature enhances
performance by more than the use of the feature.
Suppose that enhancement E accelerates fraction F
of a program by a factor S (remainder of the task is
unaffected):
ExecTime_E = (1 – F(1 – 1/S)) × ExecTime_withoutE
[Figure: execution time before the enhancement (F and 1 – F) and after it (F/S and 1 – F)]
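The formula can be tried out numerically. In the sketch below, overall_speedup is not on the slide; it is simply ExecTime_withoutE divided by ExecTime_E, which gives 1 / ((1 – F) + F/S). The function names and the sample values of F and S are assumptions for illustration.

```python
def exec_time_with_enhancement(exec_time_without: float, f: float, s: float) -> float:
    """Execution time when fraction f of the program is accelerated by factor s."""
    return (1 - f * (1 - 1 / s)) * exec_time_without


def overall_speedup(f: float, s: float) -> float:
    """Old time divided by new time: 1 / ((1 - f) + f / s)."""
    return 1 / ((1 - f) + f / s)


# Hypothetical case: half of a 100 s program runs twice as fast.
print(exec_time_with_enhancement(100.0, f=0.5, s=2.0))  # 75.0

# Even an enormous S cannot beat 1 / (1 - F), here a factor of 2.
print(overall_speedup(f=0.5, s=1e12) < 2.0)  # True
```

The second call illustrates the insight on the slide: the gain is bounded by how much the feature is used.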
Datorteknik PerformanceAnalyse bild 16
What if we don’t need the ALU?
A branch instruction?
Datorteknik PerformanceAnalyse bild 17
BUT!

The single cycle model has to accommodate
the slowest instruction

Even if it rarely occurs!
Datorteknik PerformanceAnalyse bild 18
How much work can our
structure perform?

For a program Q:

Time = Number of executed instructions *
Number of cycles per instruction *
Time per cycle

T = Nq * CPI * Tc
Datorteknik PerformanceAnalyse bild 19
For the single cycle model....

CPI = 1 for all instructions

Tc determined by the slowest instruction
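A small sketch of what this means in numbers, assuming the path delays found on the earlier slides (lw 35 ns, add 25 ns, branch 20 ns, nop 10 ns). The dictionary and function names are made up for illustration.

```python
# Critical-path delay per instruction in ns, as read off the earlier slides.
PATH_DELAY_NS = {"lw": 35, "add": 25, "branch": 20, "nop": 10}


def single_cycle_time_ns(n_instructions: int) -> int:
    """T = Nq * CPI * Tc for the single-cycle model."""
    tc = max(PATH_DELAY_NS.values())  # cycle must fit the slowest instruction
    cpi = 1                           # every instruction takes exactly one cycle
    return n_instructions * cpi * tc


# 1000 instructions cost 35 us, even if every one of them is a nop.
print(single_cycle_time_ns(1000))  # 35000
```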
Datorteknik PerformanceAnalyse bild 20
How to reduce T?

T = Nq * CPI * Tc
Reduce Nq.
More powerful instructions!
More hardware, longer paths, cycle time
goes up (slower machine)
Datorteknik PerformanceAnalyse bild 21
“No free lunch”
Why designers are so well paid to optimize designs.
Datorteknik PerformanceAnalyse bild 22
How to reduce T?

T = Nq * CPI * Tc
Reduce Tc: faster hardware
– Technological limits
– Cost increase is not linear in the speed gained
– Sales volume drops
Datorteknik PerformanceAnalyse bild 23
How to reduce T?

T = Nq * CPI * Tc
Make CPI a function of the instruction
For example:
NOP = 1 cycle
LW = 4 cycles
Chapter 5.4, the classical method
Datorteknik PerformanceAnalyse bild 24
How to reduce T?

T = Nq * CPI * Tc
Make CPI a function of the instruction
CPI goes up, but we can use an average,
not the worst case
Tc goes down: the time to do the longest step,
not the entire instruction
Datorteknik PerformanceAnalyse bild 25
Example

Branch:
Step 1: fetch
Step 2: New PC

Add:
Step 1: fetch
Step 2: decode/ register fetch
Step 3: Compute and write back
Datorteknik PerformanceAnalyse bild 26
Example

LW = 4 steps

Cycle time = 1/4 old time

T_LW = 4 * 1/4 old time (CPI = 4),
just as slow for the lw instruction,
our worst case!
Datorteknik PerformanceAnalyse bild 27
But that’s not important if
LW is not common!
T = Nq * CPI * 1/4 old time
– CPI is averaged over this many (Nq) instructions
– CPI = 1.3? 1.7? Never 4.0!
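To see how the average works out, the sketch below uses the cycle counts from the earlier slides (NOP = 1, LW = 4, branch = 2 steps, add = 3 steps). The instruction mix is purely hypothetical; a real mix has to be measured on real programs.

```python
# Cycles per instruction class, taken from the earlier slides.
CYCLES = {"nop": 1, "branch": 2, "add": 3, "lw": 4}

# Hypothetical instruction mix, in percent of executed instructions.
MIX_PERCENT = {"nop": 10, "branch": 20, "add": 50, "lw": 20}


def average_cpi(cycles, mix_percent):
    """Weighted average of cycles per instruction over the given mix."""
    total = sum(mix_percent.values())
    return sum(cycles[i] * mix_percent[i] for i in mix_percent) / total


cpi = average_cpi(CYCLES, MIX_PERCENT)

# Execution time relative to the single-cycle design (CPI = 1, old cycle time),
# with the new cycle time at 1/4 of the old one.
relative_time = cpi * (1 / 4)
print(f"average CPI = {cpi:.2f}, relative time = {relative_time:.2f}")
```

With this mix the average CPI stays below 4, so the multi-cycle design finishes sooner. Only a program consisting entirely of lw would reach CPI = 4 and break even.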
Datorteknik PerformanceAnalyse bild 28
We win because of quantitative statistical
properties of our programs!
Datorteknik PerformanceAnalyse bild 29
What value of CPI do we use?
1.3?
1.5?
1.7?
Easy: use an average program!
?
Datorteknik PerformanceAnalyse bild 30
There is no such thing!
Datorteknik PerformanceAnalyse bild 31
Artificial “average programs”, called “benchmarks”
Are they something to trust?
What about “peak performance values”: MIPS? MFLOPS?
We have a peak at CPI = 1...
...a program of only NOPs!
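Why a peak value says little can be shown with the usual definition MIPS = clock rate / (CPI × 10^6). In the sketch below, the 100 MHz clock is a hypothetical figure, and CPI = 1.7 is just one of the candidate averages mentioned earlier.

```python
def mips(clock_rate_hz: float, cpi: float) -> float:
    """Millions of instructions per second for a given clock rate and CPI."""
    return clock_rate_hz / (cpi * 1e6)


CLOCK_HZ = 100e6  # hypothetical 100 MHz clock

peak = mips(CLOCK_HZ, cpi=1.0)     # reached only by a program of NOPs
typical = mips(CLOCK_HZ, cpi=1.7)  # with an assumed average CPI of 1.7
print(f"peak = {peak:.1f} MIPS, typical = {typical:.1f} MIPS")
```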
Datorteknik PerformanceAnalyse bild 32
Why Do Benchmarks?

How we evaluate performance differences
– Across and within a single system (design & variations)

What should benchmarks do?
– Represent a large class of important programs
– Behave like typical programs:
  improved benchmark performance => improved performance broadly



For better or worse, benchmarks shape a field
Good ones accelerate progress
Bad benchmarks hurt progress
– help real programs vs. sell machines/papers?
– Enhancements that help benchmarks may not help most
programs, and vice versa
Datorteknik PerformanceAnalyse bild 33
Classes of Benchmarks

(Toy) Benchmarks
– 10-100 lines, e.g., sieve, puzzle, quicksort
– good first programming assignments

Synthetic Benchmarks
– attempt to match average frequencies of real workloads
– e.g., Whetstone, Dhrystone
– mostly good for nothing: too artificial

Kernels
– Time critical excerpts of real programs
– e.g., Livermore loops, Linpack
– good for micro-performance studies

Real programs
– e.g., gcc, spice, Verilog, Database, stock trading
Datorteknik PerformanceAnalyse bild 34
Successful Benchmark: SPEC Collection

1987 RISC industry (workstations) mired in
“bench marketing”:
– (“That is an 8 MIPS machine, but they claim 10 MIPS!”)

EE Times + 5 companies band together to
form the Systems Performance Evaluation
Committee (SPEC) in 1988:
– Sun, MIPS, HP, Apollo, DEC

Create standard list of programs, inputs,
reporting rules:
– several real programs, including OS calls
– some I/O
– rules for running and reporting
Datorteknik PerformanceAnalyse bild 35
Multiple clock cycle designs:
State machines
Micro programming
chapter 5.4
“Computer Organization & Design”
Datorteknik PerformanceAnalyse bild 36
How to reduce T?
T = Nq * CPI * Tc
Reduce the quotient cycles / instruction:
– reduce “cycles”: multiple clock cycle design
– increase “instruction”: execute more than one instr. per cycle!
Datorteknik PerformanceAnalyse bild 37
More than one
instruction per cycle?

Parallelism
– Div/mult + floating point + integer

Superscalarity
– Multiple issue etc.

Pipelining
– Of general importance
Datorteknik PerformanceAnalyse bild 38
