CS520S99 Introduction (Lecture Transcript)

Why study computer architecture?
• To learn the principles for designing processors and systems
• To learn system configuration trade-offs:
  – what size of caches/memory is enough
  – what kind of buses to connect system components
  – what size (speed) of disks to use
• To choose a computer for a set of applications in a project
• To interpret the benchmark figures given by salespersons
• To decide which processor chips to use in a system
• To design the system software (compiler, OS) for a new processor
• To be the leader of a processor design team
• To learn several machines' assembly languages
The Basic Structure of a Computer
Control and Data Flow in Processor
A processor is made up of:
– Data operator (Arithmetic and Logic Unit, ALU)—D: consumes and combines information into a new meaning
– Control—K: evokes operations of other components
Control is often distributed
Instruction Execution at
Register Transfer Level (RTL)
• Consider the detailed execution of the instruction "move &100, %d0" (moving constant 100 into register d0).
• Assume the instruction was loaded into memory location 1000.
• The opcode of the move instruction and the register address d0 are encoded in bytes 1000 and 1001.
• The constant 100 is encoded in bytes 1002 and 1003.
RTL Instruction Execution
• Mpc is set to 1000, pointing at the instruction in memory.
• Step 1: Mmar = Mpc; // put PC into MAR; prepare to fetch the instruction.
[Figure: Mpc (1000) copied into Mmar.]
Update Program Counter
• Step 2: Mpc = Mpc + 4; // update the program counter: move the Mpc value to D, D performs the add, and the result moves back to Mpc.
[Figure: Mpc routed through D; the incremented value written back to Mpc.]
Instruction Fetch
• Step 3: Mir = Mp[Mmar]; // fetch the instruction: send the Mmar value to Mp; Mp retrieves move|d0 and sends it back to Mir.
Steps 2 and 3 can be done in parallel.
[Figure: Mmar (1000) addresses Mp; the instruction word (Move|d0, 100) is loaded into Mir.]
Instruction Decoding
• Step 4: Decode Instruction in Mir
[Figure: Mir contains Move|d0 and the constant 100; the instruction in Mir is decoded.]
RTL Instruction Execution
• Step 5: Mgeneral[0] = Mir⟨16–31⟩; // execute the move of the constant into the general register named d0.
The subscript 16–31 denotes bits 16 through 31 of Mir, which contain the constant 100.
[Figure: the value 100 is written into register d0.]
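For concreteness, here is a minimal C sketch of steps 1 through 5, assuming a byte-addressed toy memory and a big-endian 4-byte instruction layout. The opcode value and all names here are made up for illustration; they are not from the slides.

#include <stdint.h>
#include <stdio.h>

/* Toy machine state mirroring the slides' Mpc, Mmar, Mir, Mgeneral. */
static uint8_t  mp[65536];   /* primary memory Mp, byte addressed */
static uint32_t pc, mar, ir; /* Mpc, Mmar, Mir                    */
static uint16_t d[8];        /* Mgeneral: registers d0..d7        */

static uint32_t fetch32(uint32_t a) {      /* big-endian 4-byte read */
    return (uint32_t)mp[a] << 24 | (uint32_t)mp[a+1] << 16
         | (uint32_t)mp[a+2] << 8 | mp[a+3];
}

int main(void) {
    /* "move &100, %d0" assembled at 1000: opcode|reg in bytes 1000-1001,
       the constant 100 in bytes 1002-1003 (opcode value invented here). */
    mp[1000] = 0x30; mp[1001] = 0x00; mp[1002] = 0; mp[1003] = 100;
    pc = 1000;
    mar = pc;              /* Step 1: Mmar = Mpc      */
    pc  = pc + 4;          /* Step 2: Mpc  = Mpc + 4  */
    ir  = fetch32(mar);    /* Step 3: Mir  = Mp[Mmar] */
    /* Step 4: decode (trivially assumed to be a move-immediate). */
    d[0] = ir & 0xFFFF;    /* Step 5: Mgeneral[0] = Mir<16:31> */
    printf("d0 = %u\n", (unsigned)d[0]);   /* prints d0 = 100 */
    return 0;
}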
Computer Architecture
The term "computer architecture" was coined by IBM in 1964 for the IBM 360. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the instruction set. They believed that a family of machines of the same architecture should be able to run the same software.
Benefits:
• With a precisely defined architecture, we can have many compatible implementations.
• Programs written for the same instruction set can run on all the compatible implementations.
Architecture & Implementation
• Single architecture, multiple implementations: a computer family
• Multiple architectures, single implementation: a microcode emulator
Computer Architecture Topics
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.
[Figure: layered map of architecture topics.
Input/Output and Storage: disks, WORM, tape; RAID; emerging technologies; interleaving; bus protocols.
Memory Hierarchy: DRAM, L2 cache, L1 cache; coherence, bandwidth, latency.
Instruction Set Architecture (VLSI implementation): addressing, protection, exception handling.
Pipelining and Instruction Level Parallelism: pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, DSP.]
Computer Architecture Topics
[Figure: Processor-Memory-Switch (P-M-S) diagram: processor/memory pairs connected through an interconnection network.
Multiprocessors: shared memory, message passing, data parallelism.
Networks and Interconnections: network interfaces; topologies, routing, bandwidth, latency, reliability.]
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.
CS 520 Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century.
[Figure: computer architecture (instruction set design, organization, hardware) sits at the interface design (ISA) between applications, programming languages, operating systems, technology, parallelism, history, and measurement & evaluation.]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Functional Requirements Faced by a Computer Designer
• Applications
  – General purpose: balanced performance for a range of tasks
  – Scientific: high-performance floating point
  – Commercial: support for COBOL (decimal arithmetic), database/transaction processing
• Level of software compatibility
  – Object code/binary level: no software porting, but more hardware design cost
  – Programming language level: avoids the old architecture's burden, but requires software porting
Functional Requirements Faced by a Computer Designer (continued)
• Operating system requirements
  – Size of address space
  – Memory management/protection (e.g., garbage collection vs. real-time scheduling)
  – Interrupts/traps
• Standards
  – Floating point (IEEE 754)
  – I/O bus
  – OS
  – Networks
  – Programming languages
1988 Computer Food Chain
[Figure: 1988 computer classes drawn as a food chain: Supercomputer, Mainframe, Minisupercomputer, Minicomputer, Massively Parallel Processors, Workstation, PC.]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
1998 Computer Food Chain
[Figure: 1998 food chain: Supercomputer, Mainframe, Server, Workstation, PC; the Minisupercomputer, Minicomputer, and Massively Parallel Processors are gone from the chain.]
Now who is eating whom?
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Why Such Change in 10 years?
• Performance
  – Technology advances
    • CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
  – Computer architecture advances improve the low end
    • RISC, superscalar, RAID, …
• Price: lower costs due to …
  – Simpler development
    • CMOS VLSI: smaller systems, fewer components
  – Higher volumes
    • CMOS VLSI: same development cost spread over 10,000 vs. 10,000,000 units
  – Lower margins by class of computer, due to fewer services
• Function
  – Rise of networking/local interconnection technology
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Technology Trends: Microprocessor
Capacity
[Figure: microprocessor transistor count vs. year, following Moore's Law (the "graduation window"). Labeled points: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million.]
CMOS improvements:
• Die size: 2X every 3 years
• Line width: halves every 7 years
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Memory Capacity
(Single Chip DRAM)
Year    Size (Mb)   Cycle time
1980    0.0625      250 ns
1983    0.25        220 ns
1986    1           190 ns
1989    4           165 ns
1992    16          145 ns
1996    64          120 ns
2000    256         100 ns
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Technology Trends
(Summary)
         Capacity        Speed (latency)
Logic    2x in 3 years   2x in 3 years
DRAM     4x in 3 years   2x in 10 years
Disk     4x in 3 years   2x in 10 years
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Processor Performance Trends
[Figure: relative performance (log scale, 0.1 to 1000) vs. year (1965 to 2000) for supercomputers, mainframes, minicomputers, and microprocessors; the microprocessor curve climbs fastest.]
Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.
Processor Performance
(1.35X before, 1.55X now)
[Figure: performance (0 to 1200) vs. year (1987 to 1997), growing about 1.54X per year. Labeled machines include Sun-4/260, MIPS M/120, MIPS M2000, IBM RS/6000, HP 9000/750, IBM POWER 100, DEC AXP/500, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, and DEC Alpha 21264/600.]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Performance Trends
(Summary)
• Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
• Improvement in cost/performance estimated at 70% per year
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Computer Engineering Methodology
[Figure: iterative design loop: Technology Trends and Benchmarks feed "Evaluate Existing Systems for Bottlenecks"; Workloads feed "Simulate New Designs and Organizations"; Implementation Complexity feeds "Implement Next Generation System"; the loop repeats.]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Measurement and Evaluation
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: design cycle: Creativity feeds Design; Cost/Performance Analysis separates Good Ideas from Mediocre Ideas and Bad Ideas, which loop back through Analysis.]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Measurement Tools
• Benchmarks, Traces, Mixes
• Hardware: Cost, delay, area, power estimation
• Simulation (many levels)
– ISA, RT, Gate, Circuit
• Queuing Theory
• Rules of Thumb
• Fundamental “Laws”/Principles
Metrics of Computer Architecture
• Space: measured in bits of representation
• Time: measured in bit traffic (memory bandwidth)
Many old frequency and benchmark studies focused on:
• dynamic opcode frequencies (memory size concern)
• exponent differences of floating-point operands (precision)
• length of decimal numbers in business files (memory size)
Trend: space is not much of a concern; speed/time is everything. Here we focus on the following two performance metrics:
• Response time = time between start and finish of an event (execution time, latency)
• Throughput = total amount of work done in a given time (bandwidth: number of bits or bytes moved per second)
Metrics of Performance at Different
Levels
[Figure: metrics of performance at each level.
Application: answers per month, operations per second.
Programming language / Compiler / ISA: (millions of) instructions per second (MIPS); (millions of) FP operations per second (MFLOP/s).
Datapath / Control: megabytes per second.
Function units / Transistors, wires, pins: cycles per second (clock rate).]
Adapted from (Prof. Patterson’s CS252S98 viewgraph). Copyright 1998 UCB
Quantitative principles
"Improve" means:
• increase performance
• decrease execution time
"X is n% faster than Y" means:

\frac{ExecutionTime_Y}{ExecutionTime_X} = 1 + \frac{n}{100}

For example, if Y takes 15 sec and X takes 10 sec, the ratio is 1.5, so X is 50% faster than Y.
Quantitative principles:
• Make the common case fast — Amdahl's Law
• Locality of reference — 90% of execution time is spent in 10% of the code
Amdahl’s Law
Law of diminishing returns.
[Figure: Time_old = 50 + 50; the enhanced half shrinks to 25, so Time_new = 50 + 25. FractionInEnhancedMode = 0.5 (measured on the old system); SpeedupOfEnhancedMode = 2.]

Speedup = \frac{ExecutionTimeWithoutEnhancement}{ExecutionTimeWithEnhancement}

Time_{new} = Time_{old} \times \left( (1 - FractionInEnhancedMode) + \frac{FractionInEnhancedMode}{SpeedupOfEnhancedMode} \right)

Speedup = \frac{Time_{old}}{Time_{new}} = \frac{1}{(1 - FractionInEnhancedMode) + \frac{FractionInEnhancedMode}{SpeedupOfEnhancedMode}}
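The formula drops straight into code. A minimal C helper (function and variable names are mine, not from the slides):

#include <stdio.h>

/* Overall speedup by Amdahl's Law: fraction is the portion of the old
   execution time that can use the enhancement; speedup is how much
   faster that portion runs. */
static double amdahl(double fraction, double speedup) {
    return 1.0 / ((1.0 - fraction) + fraction / speedup);
}

int main(void) {
    /* The slide's example: half the time enhanced, 2X faster -> 1.33 */
    printf("%.2f\n", amdahl(0.5, 2.0));
    return 0;
}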
Amdahl’s Law Result
FractionInEnhancedMode   OverallSpeedup when          OverallSpeedup when
                         SpeedupOfEnhancedMode = 2    SpeedupOfEnhancedMode = ∞
0.1                      1.05                         1.1
0.3                      1.18                         1.4
0.5                      1.33                         2
0.7                      1.54                         3.33
0.9                      1.82                         10
0.99                     1.98                         100
Apply Amdahl’s Law: Example 1
Example 1: Assume that memory access accounts for 90% of the execution time. What is the speedup from replacing a 100 ns memory with a 10 ns memory? How much faster is the new system?
Answer:
FractionInEnhancedMode = 90% = 0.9
SpeedupOfEnhancedMode = 100 ns / 10 ns = 10

Speedup_{overall} = \frac{1}{0.1 + \frac{0.9}{10}} = \frac{1}{0.1 + 0.09} = \frac{1}{0.19} = 5.26 = 1 + \frac{426}{100}

The new system is 426% faster than the old one.
Is it worthwhile if the high-speed memory costs 10 times more?
Apply Amdahl’s Law: Example 2
Example 2: Assume that 40% of the time is spent on CPU tasks; the rest is spent on I/O. Assume we improve the CPU and keep the I/O speed unchanged.
a) How much faster should the new CPU be to get an overall speedup of 1.5?
b) Is it possible to get an overall speedup of 2? Why?
Solution:
a) 1.5 = \frac{1}{(1 - 0.4) + \frac{0.4}{x}} \Rightarrow x = 6, so the new CPU must be 500% faster.
b) The maximum overall speedup that can be achieved, even with an infinitely fast CPU, is \frac{1}{1 - 0.4} = 1.67. Therefore, it is not possible to achieve an overall speedup of 2.
Apply Amdahl’s Law: Example 3
Example 3: A recent study of the bottleneck of a 10 Mbps Ethernet network system showed that only 10% of the execution time of a distributed application was spent transmitting messages, and 90% of the time was spent on application/protocol software execution on the host computers. If we replace Ethernet with 100 Mbps FDDI, 900% faster than Ethernet, what will be the speedup of this improvement? What if we use 900% faster hosts?
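A worked answer (mine, applying Amdahl's Law from the previous slides):

Speedup_{FDDI} = \frac{1}{0.9 + \frac{0.1}{10}} = \frac{1}{0.91} \approx 1.10
\qquad
Speedup_{hosts} = \frac{1}{0.1 + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.26

Speeding up the hosts, where 90% of the time goes, pays off far more than speeding up the network.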
Execution Time
The first performance metric, and the best one. Measure the time it takes to execute the intended application(s) or the typical workload. The Unix time command can measure an application:
vlsia[93]: time ts9
217.1u 27.2s 8:16 49% 0+27552k 6+3io 26pf+0w
Here is an example showing how the OS and I/O impact the execution time. For program 1 (the time intervals t1 through t11 refer to a timeline figure not reproduced here):
Elapsed time = sum(t1:t11) - t6 - t8
System CPU time = t1 + t3 + t5 + t9 + t11
CPU time = t1 + t3 + t4 + t5 + t9 + t10
User CPU time = t4 + t10
CPU Time
CPI = Clock cycles Per Instruction; I_i is the frequency of instruction i in a program; IC = Instruction Count; ClockCycleTime = 1/ClockRate.
The CPI figure gives insight into different styles of instruction sets and implementations.
Interdependence among instruction count, CPI, and clock rate:
• Clock rate—hardware technology and organization
• CPI—organization and instruction set architecture
• Instruction count—instruction set architecture and compiler technology
We cannot measure the performance of a computer by any single factor above alone.
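In equation form (the same relation the "Estimate CPU time" slides below use):

CPUTime = InstructionCount \times CPI \times ClockCycleTime,
\qquad
CPI = \frac{\sum_i CPI_i \times I_i}{InstructionCount}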
Evaluating Instruction Set Design
Example (page 39): 1/4 of the ALU and load instructions are replaced by a new register-memory (r->m) instruction. Assume that the clock cycle time is not changed. Is this a good idea?

Instruction   Frequency Before   Clock cycles   Frequency After   Clock cycles
ALU ops       43%                1              36.1%             1
Loads         21%                2              11.4%             2
Stores        12%                2              13.5%             2
Branches      24%                2              26.9%             3
New r->m      —                  —              12.1%             2
Evaluate Instruction Design
CPI_{old} = 0.43 \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2 = 1.57

CPUTime_{old} = InstructionCount_{old} \times 1.57 \times ClockCycleTime_{old}

CPI_{new} = \frac{(0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + (0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3}{1 - 0.25 \times 0.43} = 1.908

CPUTime_{new} = (0.8925 \times InstructionCount_{old}) \times 1.908 \times ClockCycleTime_{old} = 1.703 \times InstructionCount_{old} \times ClockCycleTime_{old}

With these assumptions, it is a bad idea to add register-memory instructions.
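A quick C check of the arithmetic above (a sketch; the variable names are mine):

#include <stdio.h>

int main(void) {
    /* Frequencies and cycle counts from the table above. */
    double replaced = 0.25 * 0.43;     /* ALU/load pairs replaced by r->m */
    double cpi_old  = 0.43*1 + 0.21*2 + 0.12*2 + 0.24*2;
    double cpi_new  = ((0.43 - replaced)*1 + (0.21 - replaced)*2
                     + replaced*2 + 0.12*2 + 0.24*3) / (1.0 - replaced);
    /* Relative CPU time: instruction count shrinks, but CPI grows. */
    double rel_time = (1.0 - replaced) * cpi_new / cpi_old;
    printf("CPI_old=%.2f CPI_new=%.3f new/old time=%.3f\n",
           cpi_old, cpi_new, rel_time);   /* 1.57, 1.908, ~1.084 */
    return 0;
}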
Estimate CPU Time by (Σ CPI_i × InstructionCount_i) × ClockCycleTime
Program: f = (a-b)/(c-d*e); IC = InstructionCount = 10; MIPS R2000 at 25 MHz.
CPI = clock cycles per instruction; CPI_i = clock cycles of instruction type i; I_i = number of instructions of type i in a program; ClockCycleTime = 1/ClockRate = 1/(25 × 10^6) = 40 × 10^-9 sec = 40 nsec.
CPI_i can be obtained from the processor handbook. Here we assume no cache misses.
Instructions (op dst, src1, src2):
lw   $14, 20($sp)
lw   $15, 16($sp)
subu $24, $14, $15
lw   $25, 8($sp)
lw   $8, 4($sp)
mul  $9, $25, $8
lw   $10, 12($sp)
subu $11, $10, $9
div  $12, $24, $11
sw   $12, 0($sp)
Estimate CPU Time by ClockCycleTime × Σ_i (CPI_i × InstructionCount_i)

    Instruction Type   I_i (Count)   CPI_i   CPI_i × IC_i
1   lw                 5             2       10
2   subu               2             1       2
3   mul                1             1       1
4   div                1             1       1
5   sw                 1             2       2
    Total                                    16

CPU Time = 16 × 40 nsec = 640 nsec
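A minimal C sketch of this bookkeeping (the mix and cycle counts are from the table above; the struct layout is mine):

#include <stdio.h>

struct itype { const char *name; int count; int cpi; };

int main(void) {
    /* Instruction mix and cycle counts from the table above. */
    struct itype prog[] = {
        {"lw", 5, 2}, {"subu", 2, 1}, {"mul", 1, 1},
        {"div", 1, 1}, {"sw", 1, 2},
    };
    double cycle_time_ns = 40.0;   /* 25 MHz clock -> 40 ns */
    int cycles = 0;
    for (int i = 0; i < 5; i++)
        cycles += prog[i].count * prog[i].cpi;   /* sum of CPI_i * IC_i */
    printf("CPU time = %d cycles * %.0f ns = %.0f ns\n",
           cycles, cycle_time_ns, cycles * cycle_time_ns);  /* 640 ns */
    return 0;
}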
Other Performance Measures
The only reliable measure of performance is the execution time of real programs. Other attempts:

1. MIPS = \frac{InstructionCount}{ExecutionTime \times 10^6} = \frac{ClockRate}{CPI \times 10^6}

• MIPS depends on the instruction set, making it hard to compare machines.
• MIPS varies with programs on the same computer.
Example 1: the impact of using floating-point hardware on MIPS.
Example 2: the impact of optimizing compiler usage on MIPS.
What affects performance?
• input
• version of programs, compiler, OS, CPU
• optimizing level of compiler
• machine configurations
  — amount of cache, main memory, disks
  — the speed of cache, main memory, disks, and bus
Myth of MIPS
Example: the effect of an optimizing compiler on the MIPS number (page 45).
A machine with a 500 MHz clock rate has the following clock cycles per instruction type. For a program, the instruction counts before and after using an optimizing compiler are shown in the table.

Instruction Type   IC Before Optimization   CPI_i   IC After Optimization
ALU ops            86                       1       43
Loads              42                       2       42
Stores             24                       2       24
Branches           48                       2       48

CPI_unoptimized = (86/200)×1 + (42/200)×2 + (24/200)×2 + (48/200)×2 = 1.57
MIPS_unoptimized = (500×10^6)/(1.57×10^6) = 318.5
CPI_optimized = (43/157)×1 + (42/157)×2 + (24/157)×2 + (48/157)×2 = 1.73
MIPS_optimized = (500×10^6)/(1.73×10^6) = 289.0
CPU time_unoptimized = 200×1.57×(2×10^-9) = 6.28×10^-7 sec
CPU time_optimized = 157×1.73×(2×10^-9) = 5.43×10^-7 sec
The optimized program runs faster, yet its MIPS rating is lower.
MFLOPS
For scientific computing, MFLOPS is used as a metric:

MFLOPS = \frac{NumberOfFPOperationsInAProgram}{ExecutionTime \times 10^6}

Here the emphasis is on operations instead of instructions.
• Unfortunately, the set of floating-point operations is not consistent across machines.
• The rating changes with a different mix of integer-floating or floating-floating instructions.
The solution is to use a canonical number of floating-point operations for each type of FP operation, e.g., 1 for (add, sub, compare, mul), 4 for (fdiv, fsqrt), 8 for (arctan, sin, exp).
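As a made-up illustration of the canonical counts (the numbers here are mine, not the slide's): a program executing 40 adds, 10 divides, and 5 exp calls in 10^-3 sec is credited with 40×1 + 10×4 + 5×8 = 120 normalized FP operations, giving normalized MFLOPS = 120/(10^-3 × 10^6) = 0.12.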
Programs to Evaluate
Performance
Real programs — the set of programs to be run forms the workload.
Kernels — key pieces of real programs; isolate features of a machine; Livermore Loops (weighted ops); Linpack.
Toy benchmarks — 10 to 100 lines of code: e.g., Quicksort, Sieve, Puzzle.
Synthetic benchmarks — artificially created to match an average execution profile: e.g., Whetstone, Dhrystone.
SPEC (System Performance Evaluation Cooperative) benchmarks 89, 92, 95.
Perfect Club benchmarks for parallel computations.
SPEC: System Performance Evaluation
Cooperative Benchmark
• First round, 1989: 10 programs yielding a single number ("SPECmarks")
• Second round, 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  – Compiler flags unlimited. March 93 flags of the DEC 4000 Model 610:
    spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third round, 1995
  – New set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating-point)
  – "Benchmarks useful for 3 years"
  – Single flag setting for all programs: SPECint_base95, SPECfp_base95
Comparison of Machine Performance
Single program—compare execution time.
Collection of n programs:
1. Total execution time.
2. Normalized to a reference machine: compute the TimeRatio of the i-th program, TimeRatio_i = Time_i / Time_i(ReferenceMachine), then summarize with a mean:

ArithmeticMean = \frac{1}{n} \sum_{i=1}^{n} TimeRatio_i

GeometricMean = \sqrt[n]{\prod_{i=1}^{n} TimeRatio_i}

HarmonicMean = \frac{n}{\sum_{i=1}^{n} \frac{1}{TimeRatio_i}}

The geometric mean is consistent regardless of the reference machine. The harmonic mean decreases the impact of outliers.
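A small C sketch of the three means (function names mine):

#include <math.h>
#include <stdio.h>

/* Three ways to summarize n time ratios, as defined above. */
static double arith_mean(const double *r, int n) {
    double s = 0; for (int i = 0; i < n; i++) s += r[i];
    return s / n;
}
static double geom_mean(const double *r, int n) {
    double s = 0; for (int i = 0; i < n; i++) s += log(r[i]);
    return exp(s / n);            /* nth root of the product */
}
static double harm_mean(const double *r, int n) {
    double s = 0; for (int i = 0; i < n; i++) s += 1.0 / r[i];
    return n / s;
}

int main(void) {
    double ratios[] = {10.0, 0.5};   /* e.g., the ratios 10 and 0.5 below */
    printf("AM=%.2f GM=%.2f HM=%.2f\n",
           arith_mean(ratios, 2), geom_mean(ratios, 2), harm_mean(ratios, 2));
    /* prints AM=5.25 GM=2.24 HM=0.95 */
    return 0;
}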
Summarize Performance Results
Example: execution of two programs on three machines. Assume Program 1 has 10M floating-point operations and Program 2 has 50M floating-point operations.

                             Computer A          Computer B     Computer C
Program 1 (sec)              1                   10             20
Program 2 (sec)              100                 50             20
Total time (sec)             101                 60             40
Native MFLOPS on Program 1   10/1 = 10           10/10 = 1      10/20 = 0.5
Native MFLOPS on Program 2   50/100 = 0.5        50/50 = 1      50/20 = 2.5
Arithmetic mean              (10+0.5)/2 = 5.25   (1+1)/2 = 1    (0.5+2.5)/2 = 1.5
Geometric mean               √(10×0.5) = 2.24    √(1×1) = 1     √(0.5×2.5) = 1.12
Weighted Arithmetic Means
• For a set of n programs, each taking Time_i on one machine, the "equal-time" weights on that machine are:

w_i = \frac{1/Time_i}{\sum_{j=1}^{n} (1/Time_j)}

Figure 1.12:

            A       B       C       w(1)   w(2)    w(3)
P1 (sec)    1       10      20      0.5    0.909   0.999
P2 (sec)    1000    100     20      0.5    0.091   0.001
AM:W(1)     500.5   55      20
AM:W(2)     91.82   18.18   20
AM:W(3)     1.998   10.09   20

W(3) [W(2)] are equal-time weights based on machine A [B]. This is used in Exercise 1.11.
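A short C check of the equal-time weights and weighted means in Figure 1.12 (the array layout is mine):

#include <stdio.h>

int main(void) {
    /* Times (sec) for P1 and P2 on machines A, B, C (Figure 1.12). */
    double t[3][2] = {{1, 1000}, {10, 100}, {20, 20}};
    const char *name = "ABC";
    for (int m = 0; m < 3; m++) {
        /* Equal-time weights based on machine m: w_i proportional to 1/Time_i. */
        double inv = 1.0/t[m][0] + 1.0/t[m][1];
        double w0 = (1.0/t[m][0]) / inv, w1 = (1.0/t[m][1]) / inv;
        printf("weights on %c: %.3f %.3f | weighted AMs:", name[m], w0, w1);
        for (int k = 0; k < 3; k++)       /* apply the weights to each machine */
            printf(" %.3f", w0*t[k][0] + w1*t[k][1]);
        printf("\n");    /* machine A's weights reproduce 1.998, 10.09, 20 */
    }
    return 0;
}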
Hints for Homework # 1
Exercise 1.7:
1. Whetstone consists of integer operations besides the floating-point operations.
2. When a floating-point processor is not used, all floating-point operations must be emulated by integer operations (e.g., shift, and, add, sub, multiply, div, ...).
3. For different FP coprocessors, we will have the same number of integer ops but different numbers of FP ops.
Exercise 1.11:
a. Use the equal-time weighting formula on page 26.
b. DEC3000 execution time(ora) = VAX11-780 time(ora) / DEC3000 SPECRatio = 7421/165
FP Compilation Results depend on
existence of FP coprocessor
Exercise 1.7. Whetstone is a benchmark with both Integer and
Floating Point (FP) operations.
Compiling floating-point statement
Here are the generated assembly instructions for a floating-point statement in C on a DEC 3100 (with an R2010 floating-point unit), using the command cc -S. Note that since the R2010 only implements simple floating-point add, sub, mult, and div operations, sqrt, exp, and alog are translated into subroutine calls using the jal instruction. The floating-point division is translated into div.d and is executed by the R2010.

# 7  x = sqrt(exp(alog(x)/t1));
s.d     $f4, 48($sp)    # load x to fp register f4
l.d     $f12, 56($sp)   # load t1 to fp register f12
jal     alog            # call subroutine alog
move    $16, $2
mtc1    $16, $f6
cvt.d.w $f8, $f6        # f8 contains alog(x)
l.d     $f10, 48($sp)
div.d   $f12, $f8, $f10
jal     exp
mov.d   $f20, $f0
mov.d   $f12, $f20
jal     sqrt
s.d     $f0, 56($sp)
Homework #1
Problems 1.7 and 1.11
Problem A. The program segment f = (a-b)/(a*b) is compiled into the following MIPS R2000 code.

Instructions (op dst, src1, src2):
lw   $14, 20($sp)    # a is allocated at M[sp+20]
lw   $15, 16($sp)    # b is allocated at M[sp+16]
subu $24, $14, $15
mul  $9, $14, $15
div  $12, $24, $9
sw   $12, 0($sp)     # f is allocated at M[sp+0]
Homework #1 (continued)
Assume all the variables are already in the cache (i.e., we do not have to go to main memory for data), and Table 1 contains the clock cycles for each type of instruction when data is in the cache.
What is the execution time (in seconds) of the above segment using an R2000 chip with a 25 MHz clock?
Problem B. Assume the CPU operation accounts for 70% of the time in a system.
a) What is the overall speedup if we improve the CPU speed by 100%?
b) How much faster should the new CPU be in order to have an overall speedup of 1.7?
c) Is it possible to have an overall speedup of 3 by just improving the CPU?