
FAMU-FSU College of Engineering
Computer Architecture
EEL 4713/5764, Spring 2006
Dr. Michael Frank
Module #5 – Computer Performance
Part I
Background and Motivation
Part I: Background and Motivation
Provide motivation, paint the big picture, introduce tools:
• Review components used in building digital circuits
• Present an overview of computer technology
• Understand the meaning of computer performance
(or why a 2 GHz processor isn’t 2× as fast as a 1 GHz model)
Topics in This Part
Chapter 1 Combinational Digital Circuits
Chapter 2 Digital Circuits with Memory
Chapter 3 Computer System Technology
Chapter 4 Computer Performance
4 Computer Performance
Performance is key in design decisions; also cost and power
• It has been a driving force for innovation
• Isn’t quite the same as speed (higher clock rate)
Topics in This Chapter
4.1 Cost, Performance, and Cost/Performance
4.2 Defining Computer Performance
4.3 Performance Enhancement and Amdahl’s Law
4.4 Performance Measurement vs Modeling
4.5 Reporting Computer Performance
4.6 The Quest for Higher Performance
Course Instructional Objective #1
As the syllabus says:

At the completion of this course, students should be able to…

CIO #1. (Metrics) Calculate and interpret different performance and cost
metrics of computer systems.

This CIO should also support the following Program Outcomes:

Students graduating from the BSEE and BSCpE degree programs will have:

PO (a). (Apply) An ability to apply knowledge of mathematics, science and
engineering;
PO (e). (Solve) An ability to identify, formulate, and solve engineering
problems;
PO (o). (Topics) EE: A knowledge of electrical engineering applications selected
from the …digital systems… areas; CpE: A knowledge of computer science and
computer engineering topics including …computer architecture.

Under “assessment instruments,” the syllabus says:

1. Metrics. Students will solve exam problems in which they must analyze
descriptions of hypothetical processors to determine their performance,
cost-performance, and power-performance.
Module Instructional Objectives
I break down the CIO as follows:
CIO #1. Metrics (aeo). Calculate and interpret different performance and cost metrics
of computer systems.
1.1. Know & apply (a) the definitions of clock frequency, MIPS, execution time,
performance, throughput, cost-performance, and power-performance.
1.2. Explain why a given metric is or is not appropriate to use in a given situation.
1.3. Identify (e.i) the specific figure(s) of merit that are most appropriate for choosing
between alternative computer designs in a given scenario.
1.4. Formulate (e.ii) appropriate symbolic equations for calculating a desired figure of
merit from the provided information about an architectural scenario.
1.5. Solve (e.iii) problems involving the determination of which of several computer
designs would be preferable in a given scenario.
1.6. Apply Amdahl’s Law (and generalizations thereof) in characterizing the relationship
between an improvement to a particular component of a system and the overall
improvement of the whole system.
1.7. Apply (a) the CPU Performance Equation that relates performance and execution time
to instruction count, CPI, and clock frequency.
FAMU-FSU College of Engineering
Topic #1
Overview of Some Important Metrics for
Computer Systems:
Performance, Cost, and Power Consumption
Important Performance Metrics
• Some metrics that are often used, but that do not always
  accurately reflect true performance:
  – CPU clock frequency = number of CPU clock cycles per unit time
  – MIPS rating = how many Millions of Instructions Per Second
  – Benchmark ratings (e.g., SPECmarks) – more on this later
• Metrics that are “true” measures of performance:
  – Total execution time of a work unit (on real applications)
    Wall-clock time from beginning to end of the execution process
  – Performance = 1/(execution time)
    For a single work unit
  – Throughput = (# work units)/(execution time)
    A generalized kind of performance
Cost and Cost-Related Metrics
• In the real world, the performance of a system is not the only
  thing that is important…
  – For example, its cost may also matter a lot!
    E.g., the IBM Blue Gene/L has really high performance, but you’re not
    likely to buy it as your next computer…
  – We almost always have budgetary constraints.
• The usual goal: Maximize the cost-performance
  (i.e., cost-efficiency) of the systems that you buy.
  – Cost-performance = (performance) / (cost).
  – In other words, you want to get the best value for your dollar.
• This strategy roughly maximizes total throughput within a
  fixed budget,
  – whenever you can have many systems gathered together working in
    parallel.
Throughput and Cost-Performance
• When there is a fixed budget, the maximum throughput of a
  parallel system is (roughly) proportional to the
  cost-performance of the individual serial units:

    throughput ≤ perf_unit × n_units
               ≤ perf_unit × (budget / cost_unit)     (can't exceed budget)
               = (perf_unit / cost_unit) × budget     (rearranging)
               = (cost-performance) × (budget)
The Vanishing Computer Cost
[Plot: computer cost, from roughly $1 G down toward $1 (log scale), vs. calendar year, 1960–2020.]
Cost/Performance
Figure 4.1  Performance improvement as a function of cost.
[Plot: performance vs. cost, showing three regimes – superlinear (economy of scale),
linear (ideal?), and sublinear (diminishing returns).]
Importance of Power Consumption
• In the real world, a computer’s performance and manufacturing cost are not the
  only important concerns…
  – Operating costs, usability, and other factors may also be important!
• Today, power consumption is an increasingly important factor that impacts all of
  the following: manufacturing cost, operating cost, performance, and usability!
• In general, higher power consumption means…
  – More manufacturing cost
    for more aggressive power-delivery & cooling systems:
    power supplies, heat sinks, fans, etc.
  – Higher operating cost
    More electricity consumed, frequent changing/recharging of batteries, inconvenience to user
  – Lower performance
    Higher performance would exceed the limits of the cooling system
  – Poor usability / poorer overall quality of product:
    Annoyingly noisy cooling fans or data center A/C units, laptops that burn up your lap
• So in many design scenarios, we may wish to maximize performance within a
  fixed power budget, or minimize power consumption to reach a desired
  performance.
Throughput and Power-Performance
• When there is a fixed power budget, the maximum throughput
  of a parallel system is (roughly) proportional to the
  power-performance of the individual serial units.
  – This is exactly analogous to the earlier cost-performance analysis.

    throughput ≤ perf_unit × n_units
               ≤ perf_unit × (power budget / power_unit)    (within power budget)
               = (perf_unit / power_unit) × (power budget)
               = (power-performance) × (power budget)
Performance Maximization
within Cost and Power Constraints
• Suppose we have both a cost budget and a power budget,
  – and we want to maximize system throughput.
    (Notation: C = cost, P = power, T = throughput.)
  – With a given unit design, we must maximize the number of parallel units.
• Then we have the following constraints on n_units:
  – n_units × C_unit ≤ C_max, so n_units ≤ ⌊C_max/C_unit⌋
  – n_units × P_unit ≤ P_max, so n_units ≤ ⌊P_max/P_unit⌋
• The largest value of n_units within these constraints is:

    n_units = min( ⌊C_max/C_unit⌋, ⌊P_max/P_unit⌋ )
            = ⌊ min(C_max/C_unit, P_max/P_unit) ⌋

• and so the maximum feasible throughput is:

    T_tot = T_unit × n_units
          = T_unit × ⌊ min(C_max/C_unit, P_max/P_unit) ⌋
Power-Performance and
Energy Efficiency
• Power-performance means performance (i.e., throughput) per
  unit of power consumption:
  – power-performance = (throughput)/(power).
• Of course, since
  – throughput = (work units)/(time) and
  – power = (energy consumed)/(time),
  the times cancel, and so power-performance is equal to:
  – (work units)/(energy consumed)
• In other words, system power-performance is the same thing
  as the energy efficiency of the underlying computing process.
  – To maximize power-performance, minimize the amount of energy that
    is consumed per unit of work that is performed.
System Optimization Example
• Suppose you have a budget of $1M to set up a new corporate data center
  that should have a total power consumption of no more than 100 kW while
  serving web transactions in a simple database application. If your goal is
  to maximize total performance (in transactions/second) while staying
  within your budget and meeting the power constraint, which of the
  following types of machines would be preferable as a basis for the design?
  – Sun servers, each $15,000, burning 100 W, processing 100
    transactions/second
  – Playstation 2s, each $100 from the flea market, 30 W, processing 30
    transactions/second
• Solution (see the sketch below):

    T_Suns = 100 × ⌊min($10^6/$15,000, 10^5 W/100 W)⌋ = 100 × ⌊min(66.7, 1000)⌋ = 6,600 trans/sec
    T_PS2s = 30 × ⌊min($10^6/$100, 10^5 W/30 W)⌋ = 30 × ⌊min(10,000, 3,333.3)⌋ = 99,990 trans/sec

• A PS2-based design could attain roughly 15× higher throughput and use only
  about 1/3 of the budget while still meeting the power constraint!
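The arithmetic above can be reproduced in a few lines of Python; this is a minimal sketch (the helper name max_throughput is illustrative, not from the slides):

    import math

    def max_throughput(cost_budget, power_budget, unit_cost, unit_power, unit_tps):
        # Number of identical units that fits both the cost cap and the power cap,
        # and the resulting aggregate transactions/second.
        n_units = min(math.floor(cost_budget / unit_cost),
                      math.floor(power_budget / unit_power))
        return n_units, n_units * unit_tps

    budget, power_cap = 1_000_000, 100_000                       # $1M budget, 100 kW cap
    print(max_throughput(budget, power_cap, 15_000, 100, 100))   # Suns: (66, 6600), cost-limited
    print(max_throughput(budget, power_cap, 100, 30, 30))        # PS2s: (3333, 99990), power-limited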
FAMU-FSU College of Engineering
Topic #2
Measuring Computer Performance
4.2 Defining Computer Performance
Figure 4.2  Pipeline analogy shows that imbalance between processing
power and I/O capabilities leads to a performance bottleneck.
[Diagram: Input → Processing → Output pipelines for a CPU-bound task vs. an I/O-bound task.]
Concepts of Performance and Speedup
Performance = 1 / Execution time
is simplified to
Performance = 1 / CPU execution time
(Performance of M1) / (Performance of M2) = Speedup of M1 over M2
= (Execution time on M2) / (Execution time on M1)
Terminology:
M1 is x times as fast as M2 (e.g., 1.5 times as fast)
M1 is 100(x – 1)% faster than M2 (e.g., 50% faster)
CPU performance equation:
CPU time = (Clock cycles executed) × (Time per cycle)
         = Instructions × (Cycles per instruction) × (Time per cycle)
         = Instructions × CPI / (Clock frequency)
Instruction count, CPI, and clock rate are not completely independent, so
improving one by a given factor may not lead to overall execution time
improvement by the same factor.
Faster Clock ≠ Shorter Running Time
Figure 4.3  Faster steps do not necessarily mean shorter travel time.
[Diagram: two solutions to the same problem – a 1 GHz machine taking 4 steps vs. a 2 GHz machine taking 20 steps.]
4.3 Performance Enhancement: Amdahl’s Law
Amdahl’s formula, where f = fraction of the task unaffected and
p = speedup of the rest:

    s = 1 / [ f + (1 – f)/p ]  ≈  min(p, 1/f)

Figure 4.4  Amdahl’s law: speedup achieved if a fraction f of a
task is unaffected and the remaining (1–f) part runs p times as fast.
[Plot: speedup s (0–50) vs. enhancement factor p (0–50) for f = 0, 0.01, 0.02, 0.05, 0.1.]
Amdahl’s Law Used in Design
Example 4.1
A processor spends 30% of its time on flp addition, 25% on flp mult,
and 10% on flp division. Evaluate the following enhancements, each
costing the same to implement:
a. Redesign of the flp adder to make it twice as fast.
b. Redesign of the flp multiplier to make it three times as fast.
c. Redesign the flp divider to make it 10 times as fast.
Solution
a. Adder redesign speedup = 1 / [0.7 + 0.3 / 2] = 1.18
b. Multiplier redesign speedup = 1 / [0.75 + 0.25 / 3] = 1.20
c. Divider redesign speedup = 1 / [0.9 + 0.1 / 10] = 1.10
What if both the adder and the multiplier are redesigned?
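Plugging the fractions into Amdahl’s formula, including the combined adder-plus-multiplier redesign asked about above, can be sketched in a few lines of Python (the function amdahl is just an illustrative helper):

    def amdahl(improved):
        # improved: dict mapping the affected fraction of time -> its speedup factor
        unaffected = 1.0 - sum(improved.keys())
        affected = sum(frac / p for frac, p in improved.items())
        return 1.0 / (unaffected + affected)

    print(round(amdahl({0.30: 2}), 2))              # adder only          -> 1.18
    print(round(amdahl({0.25: 3}), 2))              # multiplier only     -> 1.20
    print(round(amdahl({0.10: 10}), 2))             # divider only        -> 1.10
    print(round(amdahl({0.30: 2, 0.25: 3}), 2))     # adder + multiplier  -> about 1.46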
4.4 Performance Measurement vs. Modeling
Figure 4.5  Running times of six programs on three machines.
[Plot: execution time of programs A–F on Machines 1, 2, and 3.]
Performance Benchmarks
Example 4.3
You are an engineer at Outtel, a start-up aspiring to compete with Intel
via its new processor design that outperforms the latest Intel processor
by a factor of 2.5 on floating-point instructions. This level of performance
was achieved by design compromises that led to a 20% increase in the
execution time of all other instructions. You are in charge of choosing
benchmarks that would showcase Outtel’s performance edge.
a. What is the minimum required fraction f of time spent on floating-point
instructions in a program on the Intel processor to show a speedup of
2 or better for Outtel?
Solution
a. We use a generalized form of Amdahl’s formula in which a fraction f
is speeded up by a given factor (2.5) and the rest is slowed down by
another factor (1.2):  1 / [1.2(1 – f) + f / 2.5] ≥ 2  ⇒  f ≥ 0.875
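A quick numeric check of the f ≥ 0.875 threshold, sweeping the floating-point fraction (a sketch; the function name is illustrative):

    def outtel_speedup(f, fp_gain=2.5, other_slowdown=1.2):
        # Generalized Amdahl: fraction f sped up by fp_gain, the rest slowed by 1.2x.
        return 1.0 / (other_slowdown * (1 - f) + f / fp_gain)

    for f in (0.85, 0.875, 0.90):
        print(f, round(outtel_speedup(f), 3))   # ~1.923, ~2.0, ~2.083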
Performance Estimation
Average CPI = Σ over all instruction classes of (Class-i fraction) × (Class-i CPI)
Machine cycle time = 1 / Clock rate
CPU execution time = Instructions × (Average CPI) / (Clock rate)

Table 4.3  Usage frequency, in percentage, for various
instruction classes in four representative applications.

Instr’n class     Data          C language   Reactor      Atomic motion
                  compression   compiler     simulation   modeling
A: Load/Store         25            37           32            37
B: Integer            32            28           17             5
C: Shift/Logic        16            13            2             1
D: Float               0             0           34            42
E: Branch             19            13            9            10
F: All others          8             9            6             4
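Table 4.3 gives only usage frequencies; to estimate CPU time you also need a CPI for each class. A minimal sketch using the C-compiler column, with made-up CPI values, instruction count, and clock rate purely for illustration:

    freq = {"Load/Store": 0.37, "Integer": 0.28, "Shift/Logic": 0.13,
            "Float": 0.00, "Branch": 0.13, "Other": 0.09}   # Table 4.3, compiler column
    cpi  = {"Load/Store": 2.0, "Integer": 1.0, "Shift/Logic": 1.0,
            "Float": 4.0, "Branch": 2.0, "Other": 1.5}      # assumed values, not from the table

    avg_cpi = sum(freq[c] * cpi[c] for c in freq)
    instructions, clock_rate = 500e6, 1e9                   # also assumed
    cpu_time = instructions * avg_cpi / clock_rate
    print(round(avg_cpi, 3), round(cpu_time, 2))            # about 1.545 CPI, about 0.77 s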
MIPS Rating Can Be Misleading
Example 4.5
Two compilers produce machine code for a program on a machine
with two classes of instructions. Here are the instruction counts:

Class   CPI   Compiler 1   Compiler 2
  A      1       600M         400M
  B      2       400M         400M

a. What are the run times of the two programs with a 1 GHz clock?
b. Which compiler produces faster code and by what factor?
c. Which compiler’s output runs at a higher MIPS rate?
Solution
a. Running time 1 (2) = (600M × 1 + 400M × 2) / 10^9 = 1.4 s (1.2 s)
b. Compiler 2’s output runs 1.4 / 1.2 = 1.17 times as fast
c. MIPS rating 1 = 1000 / 1.4 = 714 (CPI = 1.4); MIPS rating 2 = 800 / 1.2 = 667 (CPI = 1.5)
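The same example in code, computing run time and MIPS for each compiler’s output:

    clock_hz = 1e9
    code = {"Compiler 1": {1: 600e6, 2: 400e6},    # CPI -> instruction count
            "Compiler 2": {1: 400e6, 2: 400e6}}

    for name, mix in code.items():
        instrs = sum(mix.values())
        cycles = sum(cpi * count for cpi, count in mix.items())
        time_s = cycles / clock_hz
        print(name, time_s, "s,", round(instrs / time_s / 1e6), "MIPS")
    # Compiler 1: 1.4 s, 714 MIPS   Compiler 2: 1.2 s, 667 MIPS
    # Compiler 2 is faster, yet has the lower MIPS rating.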
4.5 Reporting Computer Performance
Table 4.4  Measured or estimated execution times for three programs.

               Time on      Time on      Speedup of
               machine X    machine Y    Y over X
Program A           20          200          0.1
Program B         1000          100         10.0
Program C         1500          150         10.0
All 3 prog’s      2520          450          5.6

Analogy: If a car is driven to a city 100 km away at 100 km/hr
and returns at 50 km/hr, the average speed is not (100 + 50) / 2
but is obtained from the fact that it travels 200 km in 3 hours.
Comparing the Overall Performance
Table 4.4  Measured or estimated execution times for three programs.

                  Time on     Time on     Speedup of   Speedup of
                  machine X   machine Y   Y over X     X over Y
Program A              20         200         0.1         10
Program B            1000         100        10.0          0.1
Program C            1500         150        10.0          0.1
Arithmetic mean                               6.7          3.4
Geometric mean                                2.15         0.46

Geometric mean does not yield a measure of overall speedup,
but provides an indicator that at least moves in the right direction
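The two means can be checked directly (a small sketch):

    from math import prod

    speedup_y_over_x = [0.1, 10.0, 10.0]
    speedup_x_over_y = [1 / s for s in speedup_y_over_x]

    def amean(xs): return sum(xs) / len(xs)
    def gmean(xs): return prod(xs) ** (1 / len(xs))

    print(round(amean(speedup_y_over_x), 1), round(gmean(speedup_y_over_x), 2))  # 6.7  2.15
    print(round(amean(speedup_x_over_y), 1), round(gmean(speedup_x_over_y), 2))  # 3.4  0.46

Note that the two geometric means are reciprocals of each other (2.15 ≈ 1/0.46), which is why the geometric mean at least ranks machines consistently even though it is not an overall speedup.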
4.6 The Quest for Higher Performance
State of available computing power ca. the early 2000s:
Gigaflops on the desktop
Teraflops in the supercomputer center
Petaflops on the drawing board
Note on terminology (see Table 3.1)
Prefixes for large units:
Kilo = 10^3, Mega = 10^6, Giga = 10^9, Tera = 10^12, Peta = 10^15
For memory:
K = 2^10 = 1024, M = 2^20, G = 2^30, T = 2^40, P = 2^50
Prefixes for small units:
micro = 10^–6, nano = 10^–9, pico = 10^–12, femto = 10^–15
Supercomputer performance
Figure 4.7  Exponential growth of supercomputer performance.
[Plot: supercomputer performance (MFLOPS to PFLOPS) vs. calendar year, 1980–2010, showing
vector supercomputers (Cray X-MP, Y-MP), massively parallel processors (CM-2, CM-5),
and projected $30M and $240M MPPs.]
The Most Powerful Computers
Figure 4.8  Milestones in the DOE’s Accelerated Strategic Computing
Initiative (ASCI) program with extrapolation up to the PFLOPS level.
[Plot: performance (1–1000 TFLOPS) vs. calendar year, 1995–2010, with plan/develop/use
phases for ASCI Red (1+ TFLOPS, 0.5 TB), ASCI Blue (3+ TFLOPS, 1.5 TB), ASCI White
(10+ TFLOPS, 5 TB), ASCI Q (30+ TFLOPS, 10 TB), and ASCI Purple (100+ TFLOPS, 20 TB).]
Performance is Important, But It Isn’t Everything
Figure 25.1  Trend in energy consumption per MIPS of computational
power in general-purpose processors and DSPs.
[Plot: performance (kIPS to TIPS) vs. calendar year, 1980–2010, comparing absolute
processor performance, GP processor performance per watt, and DSP performance per watt.]
Computer Architecture
Lecture Notes
Spring 2005
Dr. Michael P. Frank
Competency Area 2:
Performance Metrics
Lecture 1
Performance Metrics
• Why is it necessary for us to study
performance?
— Performance is usually the key to the effectiveness
of a system (hardware + software).
— Performance is critical to customers (purchasers);
thus, we as designers and architects must also make
it a priority.
— Performance must be assessed and understood in
order for a system to communicate efficiently with
peripheral devices.
Topic: Computer Performance
Sub-Topic: Airplane Analogy
Performance Metrics
• How can we determine performance?
  Consider this example from the transportation industry:

Aircraft          Passenger   Fuel       Cruising   Cruising   Throughput   Cost
                  Capacity    Capacity   Range      Speed
Boeing 747-400       421      216,847     10,734       920       387,320    0.048
Boeing 767-300       270       91,380     10,548       853       230,310    0.032
Airbus 340-300       284      139,681     12,493       869       246,796    0.039
Airbus 340-300       120       23,859      4,442       837       100,440    0.045
BAE-146-200           77       11,750      2,406       708        54,516    0.063
Concorde             132      119,501      6,230     2,180       287,760    0.145
Dash-8                50        3,202      1,389       531        26,550    0.046
Car                    5           60        700       100           500    0.017
Performance Example
• Fuel capacity in liters
• Range in kilometers
• Speed in kilometers/hour
• Throughput is defined as
  (# of passengers) × (cruising speed)
• Cost is given as
  (fuel capacity) / (passengers × range)
Which mode of transportation has the “best”
performance? (See the sketch below.)
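A few rows of the table recomputed in Python, just to make the definitions concrete (only three vehicles shown):

    fleet = {   # name: (passengers, fuel in liters, range in km, speed in km/h)
        "Boeing 747-400": (421, 216_847, 10_734, 920),
        "Concorde":       (132, 119_501,  6_230, 2_180),
        "Car":            (  5,      60,    700,   100),
    }
    for name, (pax, fuel, rng, speed) in fleet.items():
        throughput = pax * speed                  # passengers x cruising speed
        cost = fuel / (pax * rng)                 # fuel per passenger-km
        print(f"{name:15s} throughput={throughput:7d} cost={cost:.3f}")
    # 747-400: 387320 / 0.048   Concorde: 287760 / 0.145   Car: 500 / 0.017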
Performance Example
• It depends on how we define performance.
• Consider raw speed:
  — Getting from one place to another quickly
  [Ranking by cruising speed: the Concorde is best, the car is worst.]
Performance Example
• What if we’re interested in the rate at which
  people are carried, i.e., throughput?
  [Ranking by throughput: the Boeing 747-400 is best, the car is worst.]
Performance Example
• Often we relate performance and cost. Thus we
  can consider the amount of fuel used per passenger-km:
  [Ranking by cost: the Boeing 767-300 is the best plane; the car is best overall.]
Topic: Computer Performance
Sub-Topic: Basic Concepts:
Performance, Throughput, and
Execution Time
Performance Metrics
• Similar measures of performance are used for computers.
  — Number of computations done per unit of time
  — Cost of computations
  — Possibly several aspects of cost can be considered, including
    initial purchase price, operating cost, cost of training users of
    the system, etc.
• Common performance measures are
  1. RESPONSE TIME – the amount of time it takes a program to
     complete (a.k.a. execution time)
  2. THROUGHPUT – the total amount of work done in a given
     amount of time
Performance Metrics
Example:
Given the following actions:
1. Replacing the processor with a faster version
2. Adding additional processors to perform separate
   tasks in a multiprocessor system
do they (a) increase throughput, (b) decrease
response time, or (c) both?
Defining Performance
• Our focus will be primarily on execution time.
• To maximize performance implies a minimization of
  execution time:

    Performance_X = 1 / ExecutionTime_X

• For two machines, if Performance_Y > Performance_X, then

    1 / ExecutionTime_Y > 1 / ExecutionTime_X
    ExecutionTime_X > ExecutionTime_Y

• We say that machine Y is faster than machine X.
Performance Metrics
Notes:
(1) If X is n times faster than Y, then

    Performance_X / Performance_Y = n

    Also,

    Performance_X / Performance_Y = ExecutionTime_Y / ExecutionTime_X = n

(2) To avoid confusion, we’ll use the following terminology:

    We say                        We mean
    “improve performance”         increase performance
    “improve execution time”      decrease execution time
Performance Example
If machine A runs a program in 10 seconds
and machine B runs the same program in 15
seconds, how much faster is A than B?

Solution:
    ExecutionTime_A = 10 sec
    ExecutionTime_B = 15 sec
    Since ET_B > ET_A, Perf_B < Perf_A, so
    Perf_A / Perf_B = ET_B / ET_A = 15 sec / 10 sec = 1.5
    ⇒ Machine A is 1.5 times faster than B.
Topic: Computer Performance
Sub-Topic:
Measuring Performance
Measuring Performance
• Quite simply, TIME is the measure of computer
  performance!
• The most straightforward definition of time is
  wall-clock time = elapsed time = response time:
  the total time to complete a task, including system overhead
  activities such as input/output tasks, disk and memory
  accesses, etc.
Measuring Performance
• CPU time is the time it takes to complete a task,
  excluding the time it takes for I/O waits.
  — USER CPU TIME: the time the CPU is busy executing the
    user’s code.
  — SYSTEM CPU TIME: the time the CPU spends performing
    operating system tasks.
• Note: Sometimes system and user CPU times are difficult to
  distinguish, since it is hard to assign responsibility for
  OS activities.
Measuring Performance
Example:
To understand the concept of CPU time, consider the UNIX
command ‘time’. Once typed, it may return a response similar to

    90.7u 12.9s 2:39 65%

where 90.7u is the user CPU time, 12.9s is the system CPU time,
2:39 is the elapsed time (159 seconds), and 65% is the percentage
of elapsed time that is CPU time.

a. What is the total CPU time?

    CPUTime = 90.7 + 12.9 = 103.6 sec

b. What percentage of time is spent on I/O and other programs?

    (159 – 103.6) / 159 × 100 ≈ 35%
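The same arithmetic as a small sketch (the field layout assumed is the one shown above):

    report = "90.7u 12.9s 2:39 65%"
    user_s, sys_s, elapsed, _ = report.split()
    user = float(user_s.rstrip("u"))
    system = float(sys_s.rstrip("s"))
    minutes, seconds = elapsed.split(":")
    elapsed_sec = 60 * int(minutes) + int(seconds)     # 2:39 -> 159 seconds

    cpu_time = user + system
    other_pct = (elapsed_sec - cpu_time) / elapsed_sec * 100
    print(round(cpu_time, 1), "s CPU,", round(other_pct), "% on I/O and other programs")
    # 103.6 s CPU, 35 % on I/O and other programs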
Measuring Performance
• Other notes:
  1. SYSTEM PERFORMANCE – reciprocal of elapsed
     time on an unloaded system (e.g., no user
     applications)
  2. CPU PERFORMANCE – reciprocal of user CPU time
  3. CLOCK CYCLES (CC) – discrete time intervals
     measured by the processor clock running at a
     constant rate
  4. CLOCK PERIOD – the time it takes to complete a clock
     cycle
  5. CLOCK RATE – the inverse of the clock period
Measuring Performance
• Consider CPU performance:

    CPUTime = (CPU clock cycles for a program) × (clock cycle time, t_CC)

  Also,

    CPUTime = (CPU clock cycles for a program) / (clock rate, f_CC)
Measuring Performance
• Since the execution time clearly depends on
  the number of instructions for a program, we
  must also define another performance metric:
  CPI = average number of clock cycles per instruction

    CPI = (CPU clock cycles for a program) / (instruction count)
Measuring Performance
• Now we have two more equations that we can
  define for CPUTime:

    CPUTime = IC × CPI × t_CC
    CPUTime = (IC × CPI) / f_CC
Measuring Performance
• In summary, performance metrics include:

    Components of Performance    Units of Measure
    CPUTime                      Seconds for the program
    IC                           # of instructions for the program
    CPI                          Average # of clock cycles per instruction
    t_CC                         Seconds per clock cycle
Measuring Performance
Example:
Suppose Machine A implements the same ISA as Machine B.
For some program, Machine A has t_CC = 1 ns and CPI = 2.0, while
Machine B has t_CC = 2 ns and CPI = 1.2 for the same program.
Determine which machine is faster and by how much.
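Since both machines execute the same instruction count for the program, comparing time per instruction settles it; a minimal sketch:

    t_cc_a, cpi_a = 1e-9, 2.0     # Machine A: 1 ns cycle time, CPI = 2.0
    t_cc_b, cpi_b = 2e-9, 1.2     # Machine B: 2 ns cycle time, CPI = 1.2

    time_per_instr_a = cpi_a * t_cc_a   # 2.0 ns per instruction
    time_per_instr_b = cpi_b * t_cc_b   # 2.4 ns per instruction
    print(round(time_per_instr_b / time_per_instr_a, 2))   # 1.2 -> A is 1.2 times faster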
Breakdown by Instruction Category
• Recall CPI = clock cycles (CC) per instruction
• But CPI depends on many factors, including:
  — Memory system behavior
  — Processor structure
  — Availability of special processor features
    – e.g., floating point, graphics, etc.
• To characterize the effect of changing specific
  aspects of the architecture, we find it helpful to
  break down CC into components due to different
  classes (categories) of instructions:

    CC = Σ_{i=1..n} (CPI_i × IC_i)

  — where:
    – IC_i = instruction count for class i
    – CPI_i = avg. cycles for instructions in class i
    – n = the number of instruction classes
Example
• Suppose a processor has 3 categories of instructions
  A, B, C with the following CPIs:

    Instr. class:   A   B   C
    CPI_i:          1   2   3

• And suppose a compiler designer is comparing two
  code sequences for a given program that have the
  following instruction counts:

    Code seq.   IC_A   IC_B   IC_C
        1         2      1      2
        2         4      1      1

• Determine:
  (i) Which code sequence executes the most instructions?
  (ii) Which will be faster?
  (iii) What is the average CPI for each code sequence?
Solution to Example
• Part (i):
  — IC_seq1 = 2 + 1 + 2 = 5 instructions
  — IC_seq2 = 4 + 1 + 1 = 6 instructions
  — Code sequence 2 executes more instructions
• Part (ii):
  — CC_seq1 = Σ_i (CPI_i × IC_i) = 1×2 + 2×1 + 3×2 = 10 cycles
  — CC_seq2 = Σ_i (CPI_i × IC_i) = 1×4 + 2×1 + 3×1 = 9 cycles
  — Code sequence 2 takes fewer cycles ⇒ it is faster!
• Part (iii):
  — CPI_seq1 = {CC/IC}_seq1 = 10 cyc. / 5 inst. = 2
  — CPI_seq2 = {CC/IC}_seq2 = 9 cyc. / 6 inst. = 1.5
• Which part should we consult to tell us which
  code sequence has better performance? (See the sketch below.)
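The same bookkeeping in code; only the cycle count (i.e., time) decides which sequence is faster:

    cpi = {"A": 1, "B": 2, "C": 3}
    sequences = {"seq1": {"A": 2, "B": 1, "C": 2},
                 "seq2": {"A": 4, "B": 1, "C": 1}}

    for name, counts in sequences.items():
        ic = sum(counts.values())
        cc = sum(cpi[cls] * n for cls, n in counts.items())
        print(name, "IC =", ic, " CC =", cc, " avg CPI =", cc / ic)
    # seq1: IC = 5, CC = 10, avg CPI = 2.0
    # seq2: IC = 6, CC = 9,  avg CPI = 1.5  -> seq2 is faster (fewer cycles)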
Topic: Computer Performance
Subtopic:
Benchmarks & Performance
Summaries
Importance of Benchmarks
• How do we evaluate and compare the
  performance of different architectures?
  — We use benchmarks:
    programs that are specifically chosen to measure
    performance.
  — A workload is a set of programs.
  — Benchmarks consist of workloads that (the user hopes)
    will predict the performance of the actual workload.
  — It is important that benchmarks consist of realistic
    workloads,
    not simple toy programs or code fragments.
  — Manufacturers often try to fine-tune their machines
    to do well on popular benchmarks that were too
    simple.
    This does not always mean the machine will do well on real
    programs!
SPEC benchmark
• A popular source of benchmarks is SPEC
—Standard Performance Evaluation Corporation
• General CPU benchmarks: CPU2000.
—Includes programs such as:
– gzip (compression), vpr (FPGA place & route), gcc
(compiler), crafty (chess), vortex (database)
• SPEC also offers specialized benchmarks for:
—Graphics, Parallel computing, Java, mail servers,
network fileservers, web servers
• They publish reports on benchmark results for
various systems.
—Main metric: SPECRatio – Proportional to average
inverse execution time. The bigger, the better!
• Reproducibility of results is very important!
Summarizing Performance
• How do we summarize performance in a way
  that accurately compares different machines?
  — One common approach: Total Execution Time (TET)
    – Based on:

        Perf_B / Perf_A = ET_A / ET_B

  — Or, if the workload includes n different programs, we
    can calculate the average or Arithmetic Mean (AM):

        AM = (1/n) Σ_{i=1..n} time_i

    – Smaller AM ⇒ improved performance
  — Other methods are also used:
    – Weighted arithmetic mean, geometric mean of ratios.
Topic: Computer Performance
Subtopic:
Performance Improvement
and Amdahl’s Law
Performance Improvement
• Recall the formula: CPUTime = IC × CPI / f_CC.
  — Thus, CPU performance is Perf = f_CC / (IC × CPI).
• Thus we can see 3 basic ways to improve CPU
  performance on a given task:
  — Increase the clock frequency
  — Decrease CPI
    – by improved processor organization
  — Decrease the instruction count
    – by compiler enhancement,
    – a change in ISA design (new instructions), or
    – a more efficient application algorithm.
• However, we have to be careful!
  — Sometimes, improving one of these can hurt the others!
Generalized Cost Measures
• In this course, we will often be focusing on ways to
minimize execution time of programs.
— Either CPU time, or number of clock cycles.
• Execution time is one example of what we may call a
generalized cost measure (GCM).
— A GCM is any property of a HW/SW design that tells us how
much of some valued resource is used up when the system is
manufactured or used.
• Other examples of important GCMs include:
— Energy consumed by a computation
— Silicon chip area used up by a circuit design
— Dollar cost to manufacture a computer component
• We will study some general engineering principles that
apply to the minimization of any GCM in any system.
Additive Cost Measures
• Let us suppose we have a GCM C for a system.
• Many times, the total cost C can be represented
  as a sum of independent cost components:

    C = C_1 + C_2 + … + C_n,  i.e.,  C = Σ_{i=1..n} C_i.

• These could correspond to the resources used
  by individual subsystems of the whole system,
  — or used in doing particular categories of tasks.
• For example, execution time T can be broken
  down as the sum of the time T_fp taken by floating-point
  instructions and the time T_oth for all others:
  — that is, T = T_fp + T_oth.
Improving Part of a System
• Suppose a GCM is broken down as C = A + B.
  — The total cost is the sum of two components A & B.
• Now suppose you are considering making an
  improvement to the system design that affects
  only cost component B.
  — Suppose you reduce it by a factor f, to B′ = B/f.
• The new total cost is then C′ = A + B′.
  — The cost of component A is unaffected.
• Overall (total) cost has therefore been reduced
  by the factor:

    f_overall = C / C′ = (A + B) / (A + B′) = (A + B) / (A + B/f).
Diminishing Returns
• Suppose we continue improving (reducing) a
  cost component by larger and larger factors.
  — Does this mean the system’s total cost will be
    reduced by correspondingly large factors?  NO!
• Even if we improved one cost component (B in
  our example) by a factor of f = ∞, note that:

    f_overall,max = lim_{f→∞} (A + B) / (A + B/f)
                  = (A + B) / (A + 0)
                  = (A + B) / A
                  = 1 + B/A.

• Even here, the overall cost reduction factor
  f_overall would still be only the finite value 1 + B/A!
  — The system can only be improved by at most this
    factor, if we improve just the one component B.
Diminishing Returns Example
• Suppose a particular chip contains B = 1 cm² of
  logic circuits, and A = 2 cm² of cache memory.
  — The total cost (in terms of area) is C = A + B = 3 cm².
• Now, let’s go crazy trying to simplify and shrink
  the design of just the logic circuit…
  — What is the maximum factor by which
    this tactic can reduce the area cost of
    the whole design (logic + memory)?
  [Diagram: chip with 1 cm² of logic and 2 cm² of memory.]
• Obviously, this can reduce the total area from 3
  (cm²) to no less than 2 (the area of the memory alone),
  — or shrink it by a factor of f_overall = 3/2 = 1.5.
• Note we could have obtained this same answer
  using the equation f_overall,max = 1 + B/A as well.
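How f_overall behaves for this chip as the logic shrink factor grows, as a quick sketch:

    A, B = 2.0, 1.0     # cm^2: memory (unaffected) and logic (being shrunk)

    def f_overall(f):
        return (A + B) / (A + B / f)

    for f in (2, 10, 100, 1e6):
        print(f, round(f_overall(f), 3))
    # 2 -> 1.2, 10 -> 1.429, 100 -> 1.493, 1e6 -> ~1.5 (the limit 1 + B/A)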
Graph Showing Diminishing Returns
[Plot (Generalized Amdahl’s Law): factor reduction in whole cost vs. factor reduction
in part cost f (1 to 10,000, log scale), for initial part/rest ratios B/A of 0.1, 1,
10, 100, and 1000. Each curve saturates at 1 + B/A.]
Important Lessons to Take from This
• It’s probably not worth spending significant
design time extensively improving just a single
component of a system,
—Unless that component accounts for a dominant part
of the total cost (by some measure) to begin with.
(B/A >> 1).
• It’s only worth improving a given component up
to the point where it is no longer dominant.
—Reducing it further won’t make a lot of difference.
• Therefore, all components with significant costs
must be improved together in order to
significantly improve an entire design.
—Well-engineered systems will tend to have roughly
comparable costs in all of their major components.
Other Ways to Calculate f_overall
• Earlier, we saw this formula for the overall improvement
  factor f_overall resulting from improving component B by the factor f:

    f_overall = (A + B) / (A + B/f).

• But what if we don’t know the values of A and B?
  — What if we only know their relative sizes?
    – Fortunately, it turns out that we can still calculate f_overall.
• Let us define frac_enh = B/C = B/(A + B) to be the fraction
  of the original total system cost that is accounted for by
  the particular part B that is going to be enhanced.
  — Then, the fraction of cost accounted for by A (the rest of the
    system) is

    1 – frac_enh = 1 – B/(A + B) = (A + B)/(A + B) – B/(A + B) = A/(A + B).

• Our equation for f_overall can then be re-expressed in terms
  of the quantities frac_enh and 1 – frac_enh, as follows…
Calculating f_overall in terms of frac_enh
• Let’s re-express f_overall in terms of frac_enh:

    f_overall = (A + B) / (A + B/f)
              = 1 / [ A/(A + B) + (B/f)/(A + B) ]
              = 1 / [ (1 – frac_enh) + frac_enh/f ]

• We will call this form for f_overall the Generalized
  Amdahl’s Law. (We’ll see why in a moment.)
Amdahl’s Law Proper
• We saw that execution time is one valid cost measure.
  — In such a case, note that the factor by which a cost is reduced
    is the speedup, or the factor by which performance is improved.
• We thus rename the improvement factor f of B (the
  enhanced part) to speedup_enh, and the overall
  improvement factor f_overall becomes speedup_overall, and
  we get:

    speedup_overall = 1 / [ (1 – frac_enh) + frac_enh/speedup_enh ]

• This is called Amdahl’s Law, and it is one of the most
  widely hyped quantitative principles of processor design.
  — But as we can see, it is not a special law of CPU architecture,
    but just an application of the universal engineering principle of
    diminishing returns which we discussed earlier.
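The law drops directly into code; the sample fractions and speedups below are illustrative only:

    def amdahl_speedup(frac_enh, speedup_enh):
        # Overall speedup when a fraction frac_enh of the time is sped up
        # by speedup_enh and the remaining (1 - frac_enh) is unchanged.
        return 1.0 / ((1.0 - frac_enh) + frac_enh / speedup_enh)

    print(round(amdahl_speedup(0.30, 2), 3))      # 30% made 2x faster   -> ~1.176
    print(round(amdahl_speedup(0.90, 10), 3))     # 90% made 10x faster  -> ~5.263
    print(round(amdahl_speedup(0.90, 1e9), 3))    # caps near 1/(1 - 0.9) = 10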
Key Points from This Module
• Throughput vs. Response Time
• Performance as Inverse Execution Time
• Speedup Factors
• Averaging Benchmark Results
• CPU Performance Equation:
  — Execution time = IC × CPI × t_CC
  — Performance = f_CC / (IC × CPI)
• Amdahl’s Law:
  — C′ = A + B/f
    where
    C′ = execution time after improvement
    A  = part of execution time unaffected by improvement
    B  = part of execution time affected by improvement
    f  = factor of improvement (speedup of the enhanced part)
  — which implies:

    speedup_overall = 1 / [ (1 – frac_enh) + frac_enh/speedup_enh ]
Example Performance Calculation
• Suppose a program takes 10 seconds on computer A,
  — and suppose computer A has a 4 GHz clock.
• We want a new computer B to run the program in 6 seconds.
  — Suppose that increasing the clock speed is only
    possible with a substantial processor redesign,
    – which will result in 1.2× as many clock cycles being needed
      to execute the program.
• What clock rate is needed?
  — Answer: 4 GHz × (10/6) × 1.2 = 8 GHz
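The same answer worked out step by step in code:

    time_a, clock_a = 10.0, 4e9          # program: 10 s on computer A at 4 GHz
    cycles_a = time_a * clock_a          # 40e9 cycles on A
    cycles_b = 1.2 * cycles_a            # redesign needs 1.2x as many cycles
    clock_b = cycles_b / 6.0             # must finish in 6 seconds
    print(clock_b / 1e9, "GHz")          # 8.0 GHz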
Another Example
• Consider two different implementations of a
given ISA, running a given benchmark:
— Processor A has a cycle time of 250 ps
– And a CPI of 2.0
— Processor B has a cycle time of 500 ps
– And a CPI of 1.2
• Which computer is faster on this benchmark,
and by what factor?
— Processor A takes 250 ps × 2.0 = 500 ps / instr.
— Processor B takes 500 ps × 1.2 = 600 ps / instr.
— Thus, A is faster by a factor of 6/5 = 1.2×.
Another example
• Suppose some Java application takes 15
seconds on a certain machine.
• A new Java compiler is released that requires
only 0.6 as many dynamic instructions to run
the application.
— Unfortunately, it also increases the CPI by 1.1×
– Presumably, it uses more multi-cycle instructions.
• How fast will the application run when compiled
using the new compiler?
—It will take 15 × 0.6 × 1.1 = 9.9 seconds to run
—It will be 15/9.9 = 50/33 = 1.515…× faster
– Only slightly more than 50% faster than before.
Another Example
• Consider the following measurements of execution time:

    Program     Computer A    Computer B
       1           2 sec.        4 sec.
       2           5 sec.        2 sec.

• Which of the following statements are true? (See the sketch below.)
  — A is faster than B for program 1.
  — A is faster than B for program 2.
  — A is faster than B for a workload with equal numbers
    of executions of programs 1 and 2.
  — A is faster than B for a workload with twice as many
    executions of program 1 as of program 2.
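One way to check the four statements is to total up workload times on each machine (a sketch):

    times = {"A": {1: 2, 2: 5}, "B": {1: 4, 2: 2}}    # seconds per run of each program

    def workload_time(machine, runs):                 # runs: program -> number of executions
        return sum(times[machine][p] * n for p, n in runs.items())

    for runs in ({1: 1}, {2: 1}, {1: 1, 2: 1}, {1: 2, 2: 1}):
        ta, tb = workload_time("A", runs), workload_time("B", runs)
        print(runs, "A:", ta, "B:", tb, "->", "A faster" if ta < tb else "B faster")
    # Program 1 only: A faster.  Program 2 only: B faster.
    # Equal mix (A=7, B=6): B faster.  Twice as many program-1 runs (A=9, B=10): A faster.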