Why Study Computer Architecture?
• To learn the principles for designing processors and systems
• To learn the system-configuration trade-offs: what size of caches/memory is enough, what kind of buses to connect system components, what size (speed) of disks to use
• To choose a computer for a set of applications in a project
• To interpret the benchmark figures given by salespersons
• To decide which processor chips to use in a system
• To design the system software (compiler, OS) for a new processor
• To lead a processor design team
• To learn several machines' assembly languages

1/16/99 CS520S99 Introduction C. Edward Chow Page 1

The Basic Structure of a Computer

Control and Data Flow in the Processor
A processor is made up of:
• a data operator (Arithmetic and Logic Unit, ALU)—D—which consumes and combines information into a new meaning, and
• control—K—which evokes the operations of the other components.
Control is often distributed.

Instruction Execution at the Register Transfer Level (RTL)
• Consider the detailed execution of the instruction "move &100, %d0" (move the constant 100 into register d0).
• Assume the instruction was loaded into memory location 1000.
• The opcode of the move instruction and the register address d0 are encoded in bytes 1000 and 1001; the constant 100 is in bytes 1002 and 1003.

RTL Instruction Execution
Mpc is set to 1000, pointing at the instruction in memory.
• Step 1: Mmar = Mpc; // put the pc into the mar, preparing to fetch the instruction

Update Program Counter
• Step 2: Mpc = Mpc + 4; // update the program counter: move the Mpc value to D, D performs the +4, and the result moves back to Mpc
Instruction Fetch
• Step 3: Mir = Mp[Mmar]; // fetch the instruction: send the Mmar value to Mp; Mp retrieves move|d0 and sends it back to Mir. Steps 3 and 2 can be done in parallel.

Instruction Decoding
• Step 4: Decode the instruction in Mir (move|d0, 100).

RTL Instruction Execution
• Step 5: Mgeneral[0] = Mir16-31; // execute the move of the constant into the general register named d0. The subscript 16-31 denotes bits 16 through 31 of Mir, which contain the constant 100.

Computer Architecture
The term "computer architecture" was coined by IBM in 1964 for use with the IBM 360. Amdahl, Blaauw, and Brooks [1964] used the term to refer to the programmer-visible portion of the instruction set. They believed that a family of machines of the same architecture should be able to run the same software. Benefits:
• With a precisely defined architecture, we can have many compatible implementations.
• A program written in the instruction set runs on all of the compatible implementations.

Architecture & Implementation
• Single architecture, multiple implementations—a computer family
• Multiple architectures, single implementation—a microcode emulator

Computer Architecture Topics
• Input/Output and Storage: disks, WORM, tape; RAID; emerging technologies
• Memory Hierarchy: DRAM interleaving; L2 cache, L1 cache; bus protocols; coherence, bandwidth, latency
• Instruction Set Architecture: addressing, protection, exception handling
• Pipelining and Instruction-Level Parallelism: pipelining, hazard resolution; superscalar, reordering; prediction, speculation; vector, DSP
• VLSI
(Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.)
Computer Architecture Topics (continued)
• Processor-Memory-Switch: processor-memory (P-M) pairs joined through an interconnection network
• Multiprocessors: shared memory, message passing, data parallelism
• Networks and Interconnections: network interfaces; topologies, routing, bandwidth, latency, reliability
(Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.)

CS 520 Course Focus
Understanding the design techniques, machine structures, technology factors, and evaluation methods that will determine the form of computers in the 21st century. Computer architecture—instruction set design, organization, hardware—sits at the interface among technology, parallelism, programming languages, applications, operating systems, measurement & evaluation, interface design (ISA), and history.

Functional Requirements Faced by a Computer Designer
• Applications
  – General purpose: balanced performance for a range of tasks
  – Scientific: high-performance floating point
  – Commercial: support for COBOL (decimal arithmetic), database/transaction processing
• Level of software compatibility
  – Object-code/binary level: no software porting, but more hardware design cost
  – Programming-language level: avoids the burden of an old architecture, but requires software porting

Functional Requirements Faced by a Computer Designer (continued)
• Operating system requirements
  – Size of the address space
  – Memory management/protection (e.g., garbage collection vs. real-time scheduling)
  – Interrupts/traps
• Standards
  – Floating point (IEEE 754)
  – I/O bus
  – OS
  – Networks
  – Programming languages

1988 Computer Food Chain
Mainframe, Supercomputer, Minisupercomputer, Minicomputer, Workstation, PC; Massively Parallel Processors.
(Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.)
1998 Computer Food Chain
Mainframe, Server, Supercomputer, Workstation, PC; Massively Parallel Processors, Minisupercomputer, Minicomputer. Now who is eating whom?

Why Such Change in 10 Years?
• Performance
  – Technology advances: CMOS VLSI dominates older technologies (TTL, ECL) in cost AND performance
  – Computer architecture advances improve the low end: RISC, superscalar, RAID, ...
• Price: lower costs due to ...
  – Simpler development: CMOS VLSI yields smaller systems with fewer components
  – Higher volumes: with CMOS VLSI the same development cost is spread over 10,000,000 rather than 10,000 units
  – Lower margins by class of computer, due to fewer services
• Function
  – Rise of networking/local interconnection technology

Technology Trends: Microprocessor Capacity
(Moore's Law chart; the "graduation window".) Transistor counts: Alpha 21264: 15 million; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million. CMOS improvements: die size 2X every 3 years; line width halves every 7 years.

Memory Capacity (Single-Chip DRAM)

Year   Size (Mb)   Cycle time
1980   0.0625      250 ns
1983   0.25        220 ns
1986   1           190 ns
1989   4           165 ns
1992   16          145 ns
1996   64          120 ns
2000   256         100 ns

Technology Trends (Summary)

        Capacity        Speed (latency)
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years

(Adapted from Prof. Patterson's CS252S98 viewgraphs. Copyright 1998 UCB.)
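The summary table's DRAM row can be sanity-checked against the single-chip DRAM capacities listed above. The sketch below is a minimal check (the helper name `growth_per_period` is mine, not from the slides): it computes the average capacity growth factor per 3-year period from the table data.

```python
# Single-chip DRAM capacities (Mb) from the table above.
dram_mb = {1980: 0.0625, 1983: 0.25, 1986: 1, 1989: 4, 1992: 16, 1996: 64, 2000: 256}

def growth_per_period(table, start, end, period_years):
    """Average growth factor per `period_years`, from the endpoints of the data."""
    total = table[end] / table[start]
    return total ** (period_years / (end - start))

factor = growth_per_period(dram_mb, 1980, 1996, 3)
print(f"DRAM grew about {factor:.2f}x every 3 years")  # about 3.67x, close to the quoted 4x
```

The measured factor comes out a little under 4x per 3 years, consistent with the rounded "4x in 3 years" rule of thumb in the summary table.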
Processor Performance Trends
(Chart, 1965–2000, relative performance on a log scale from 0.1 to 1000: supercomputers, then mainframes, then minicomputers, with microprocessors rising fastest and closing the gap.)

Processor Performance (1.35X before, 1.55X now)
(Chart, 1987–1997; the fitted line since the mid-1980s is 1.54X/yr. Machines plotted include Sun-4/260, MIPS M/120, MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, and DEC Alpha 21264/600.)

Performance Trends (Summary)
• Workstation performance (measured in SPECmarks) improves roughly 50% per year (2X every 18 months)
• Improvement in cost-performance is estimated at 70% per year

Computer Engineering Methodology
(Built up over four slides into a single cycle.) Technology trends feed the evaluation of existing systems for bottlenecks, guided by benchmarks; that evaluation drives the simulation of new designs and organizations, driven by workloads; the simulations lead to implementing the next-generation system, constrained by implementation complexity; and the new system is evaluated for bottlenecks in turn.
(Adapted from Prof. Patterson's CS252S98 viewgraphs.
Copyright 1998 UCB.)

Measurement and Evaluation
Architecture is an iterative process of design, measurement, and analysis:
• searching the space of possible designs
• at all levels of computer systems
Creativity plus cost/performance analysis sorts the good ideas from the mediocre and bad ones.

Measurement Tools
• Benchmarks, traces, mixes
• Hardware: cost, delay, area, power estimation
• Simulation (at many levels): ISA, RT, gate, circuit
• Queueing theory
• Rules of thumb
• Fundamental "laws"/principles

Metrics of Computer Architecture
• Space, measured in bits of representation
• Time, measured in bit traffic (memory bandwidth)
Many old frequency and benchmark studies focused on:
• dynamic opcode frequencies (a memory-size concern)
• exponent differences of floating-point operands (precision)
• the length of decimal numbers in business files (memory size)
Trend: space is not much of a concern; speed/time is everything. Here we focus on the following two performance metrics:
• Response time = the time between the start and finish of an event (execution time, latency)
• Throughput = the total amount of work done in a given time (bandwidth: number of bits or bytes moved per second)

Metrics of Performance at Different Levels
• Application: answers per month, operations per second
• Programming language / compiler
• ISA: (millions of) instructions per second (MIPS); (millions of) floating-point operations per second (MFLOP/s)
• Datapath, control, function units: megabytes per second; cycles per second (clock rate)
• Transistors, wires, pins
(Adapted from Prof. Patterson's CS252S98 viewgraph. Copyright 1998 UCB.)
Quantitative Principles
"Improve" means:
• increase performance, i.e.,
• decrease execution time
"X is n% faster than Y" means ExecutionTimeY / ExecutionTimeX = 1 + n/100.
Quantitative principles:
• Make the common case fast — Amdahl's Law
• Locality of reference — 90% of execution time is spent in 10% of the code

Amdahl's Law
The law of diminishing returns. Example: an old execution time of 50 + 50 becomes 50 + 25 when the second half runs in an enhanced mode that is twice as fast, so FractionInEnhancedMode = 0.5 (measured on the old system) and SpeedupOfEnhancedMode = 2.

Speedup = ExecutionTimeWithoutEnhancement / ExecutionTimeWithEnhancement

Timenew = Timeold × ((1 − FractionInEnhancedMode) + FractionInEnhancedMode / SpeedupOfEnhancedMode)

Speedup = Timeold / Timenew = 1 / ((1 − FractionInEnhancedMode) + FractionInEnhancedMode / SpeedupOfEnhancedMode)

Amdahl's Law Result

FractionInEnhancedMode   OverallSpeedup when          OverallSpeedup when
                         SpeedupOfEnhancedMode = 2    SpeedupOfEnhancedMode → ∞
0.1                      1.05                         1.1
0.3                      1.15                         1.4
0.5                      1.33                         2
0.7                      1.5                          3.33
0.9                      1.9                          10
0.99                     2                            100

Apply Amdahl's Law: Example 1
Assume that memory access accounts for 90% of the execution time. What is the speedup from replacing a 100 ns memory with a 10 ns memory? How much faster is the new system?
Answer: FractionInEnhancedMode = 90% = 0.9; SpeedupOfEnhancedMode = 100 ns / 10 ns = 10.
SpeedupOverall = 1 / ((1 − 0.9) + 0.9/10) = 1 / (0.1 + 0.09) = 1/0.19 = 5.26
The new system is 426% faster than the old one. Is it worthwhile if the high-speed memory costs 10 times more?

Apply Amdahl's Law: Example 2
Assume that 40% of the time is spent on CPU tasks and the rest on I/O, and that we improve the CPU while keeping the I/O speed unchanged.
a) How much faster must the new CPU be to achieve an overall speedup of 1.5?
b) Is it possible to achieve an overall speedup of 2? Why?
Solution:
a) 1.5 = 1 / ((1 − 0.4) + 0.4/x), which gives x = 6: the new CPU must be 500% faster.
b) The maximum overall speedup that can be achieved is 1 / (1 − 0.4) ≈ 1.67. Therefore it is not possible to achieve an overall speedup of 2.

Apply Amdahl's Law: Example 3
Example: recent research on the bottleneck of a 10 Mbps Ethernet network system showed that only 10% of the execution time of a distributed application was spent transmitting messages, while 90% was spent on application/protocol software execution at the host computers. If we replace Ethernet with 100 Mbps FDDI, which is 900% faster than Ethernet, what will be the speedup of this improvement? What if we instead use 900% faster hosts?

Execution Time
The first performance metric, and the best one. Measure the time it takes to execute the intended application(s) or the typical workload. The time command can measure an application:
vlsia[93]: time ts9
217.1u 27.2s 8:16 49% 0+27552k 6+3io 26pf+0w
Here is an example showing how the OS and I/O impact execution time. For program 1:
Elapsed time = sum(t1..t11) − t6 − t8
System CPU time = t1 + t3 + t5 + t9 + t11
CPU time = t1 + t3 + t4 + t5 + t9 + t10
User CPU time = t4 + t10

CPU Time
CPI = clock cycles per instruction; Ii is the frequency of instruction i in a program; IC = instruction count; ClockCycleTime = 1/ClockRate.
CPUTime = IC × CPI × ClockCycleTime
The CPI figure gives insight into different styles of instruction sets and implementations. Note the interdependence among instruction count, CPI, and clock rate:
• Clock rate — hardware technology and organization
• CPI — organization and instruction set architecture
• Instruction count — instruction set architecture and compiler technology
We cannot measure the performance of a computer by any single factor above alone.

Evaluating Instruction Set Design
Example (page 39): 1/4 of the ALU and load instructions are replaced by a new register-memory (r->m) instruction.
Assume that the clock cycle time is not changed. Is this a good idea?

Instruction   Frequency before   Clock cycles   Frequency after   Clock cycles
ALU ops       43%                1              36.1%             1
Loads         21%                2              11.4%             2
Stores        12%                2              13.5%             2
Branches      24%                2              26.9%             3
New r->m      —                  —              12.1%             2

Evaluate Instruction Design
CPIold = 0.43×1 + 0.21×2 + 0.12×2 + 0.24×2 = 1.57
CPUTimeold = InstructionCountold × 1.57 × ClockCycleTimeold
CPInew = [(0.43 − 0.25×0.43)×1 + (0.21 − 0.25×0.43)×2 + (0.25×0.43)×2 + 0.12×2 + 0.24×3] / (1 − 0.25×0.43) = 1.908
CPUTimenew = (0.893 × InstructionCountold) × 1.908 × ClockCycleTimeold = 1.703 × InstructionCountold × ClockCycleTimeold
With these assumptions, it is a bad idea to add register-memory instructions.

Estimate CPU Time by Σi(CPIi × InstructionCounti) × ClockCycleTime
Program: f=(a-b)/(c-d*e); IC = InstructionCount = 10; MIPS R2000 at 25 MHz.
CPI = clock cycles per instruction; CPIi = clock cycles of instruction type i; Ii = number of instructions of type i in the program.
ClockCycleTime = 1/ClockRate = 1/(25×10^6) = 40×10^-9 sec = 40 nsec
Instructions (op dst, src1, src2):
lw   $14, 20($sp)
lw   $15, 16($sp)
subu $24, $14, $15
lw   $25, 8($sp)
lw   $8, 4($sp)
mul  $9, $25, $8
lw   $10, 12($sp)
subu $11, $10, $9
div  $12, $24, $11
sw   $12, 0($sp)
CPIi can be obtained from the processor handbook. Here we assume no cache misses.

i   Instruction type   Count Ii   CPIi   CPIi×ICi
1   lw                 5          2      10
2   subu               2          1      2
3   mul                1          1      1
4   div                1          1      1
5   sw                 1          2      2
Total clock cycles: 16
CPU time = 16 × 40 nsec = 640 nsec

Other Performance Measures
The only reliable measure of performance is the execution time of real programs. Other attempts:
1. MIPS = ClockRate / (CPI × 10^6) = InstructionCount / (ExecutionTime × 10^6)
• MIPS depends on the instruction set, making comparisons hard.
• MIPS varies with programs on the same computer.
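The CPU-time estimate above can be reproduced mechanically. The sketch below is a minimal illustration (the function name `cpu_time_and_mips` is mine, not from the slides): it tallies Σ CPIi×ICi for the R2000 instruction mix, converts cycles to seconds, and also reports the native MIPS rating for the same program.

```python
# Instruction mix for f=(a-b)/(c-d*e) on a 25 MHz MIPS R2000, with the
# counts and cycle figures taken from the table above (no cache misses).
mix = {"lw": (5, 2), "subu": (2, 1), "mul": (1, 1), "div": (1, 1), "sw": (1, 2)}
clock_rate = 25e6  # Hz

def cpu_time_and_mips(mix, clock_rate):
    """Return (total cycles, CPU time in seconds, native MIPS).

    mix maps instruction type -> (count Ii, cycles per instruction CPIi)."""
    cycles = sum(count * cpi for count, cpi in mix.values())
    ic = sum(count for count, _ in mix.values())
    seconds = cycles / clock_rate
    mips = ic / (seconds * 1e6)
    return cycles, seconds, mips

cycles, seconds, mips = cpu_time_and_mips(mix, clock_rate)
print(cycles, seconds)  # 16 cycles and 6.4e-07 s (640 nsec), as computed above
```

Note how the same routine yields a MIPS number as a by-product: with 10 instructions in 640 nsec this program rates 15.625 native MIPS, a figure that would change with any other instruction mix on the same machine, which is exactly the weakness the slides point out.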
Example 1: the impact of using floating-point hardware on MIPS. Example 2: the impact of using an optimizing compiler on MIPS.
What affects performance?
• the input, the version of the programs, the compiler, the OS, the CPU
• the optimizing level of the compiler
• the machine configuration: the amount of cache, main memory, and disk, and the speed of the cache, main memory, disks, and bus

Myth of MIPS
Example: the effect of an optimizing compiler on the MIPS number (page 45). A machine has a 500 MHz clock rate and the clock cycles per instruction shown below. For one program, the relative frequencies of instructions before and after using an optimizing compiler are:

Instruction type   IC before optimization   CPIi   IC after optimization
ALU ops            86                       1      43
Loads              42                       2      42
Stores             24                       2      24
Branches           48                       2      48

CPIunoptimized = 86/200×1 + 42/200×2 + 24/200×2 + 48/200×2 = 1.57
MIPSunoptimized = (500×10^6) / (1.57×10^6) = 318.5
CPIoptimized = 43/157×1 + 42/157×2 + 24/157×2 + 48/157×2 = 1.73
MIPSoptimized = (500×10^6) / (1.73×10^6) = 289.0
CPUTimeunoptimized = 200 × 1.57 × (2×10^-9) = 6.28×10^-7 sec
CPUTimeoptimized = 157 × 1.73 × (2×10^-9) = 5.43×10^-7 sec
The optimized program runs faster, yet it has the lower MIPS rating.

MFLOPS
For scientific computing, MFLOPS is used as a metric:
MFLOPS = NumberOfFPOperationsInProgram / (ExecutionTime × 10^6)
It emphasizes operations instead of instructions.
• Unfortunately, the set of floating-point operations is not consistent across machines.
• The rating changes with the mix of integer-floating and floating-floating instructions.
The solution is to use a canonical number of floating-point operations for each type of FP operation, e.g., 1 for add, sub, compare, mul; 4 for fdiv, fsqrt; 8 for arctan, sin, exp.

Programs to Evaluate Performance
Real programs — the set of programs to be run forms the workload.
Kernels — key pieces of real programs that isolate features of a machine: Livermore Loops (weighted ops), Linpack.
Toy benchmarks — 10 to 100 lines of code, e.g., Quicksort, Sieve, Puzzle.
Synthetic benchmarks — artificially created to match an average execution profile, e.g., Whetstone, Dhrystone.
SPEC (System Performance Evaluation Cooperative) benchmarks: 89, 92, 95.
Perfect Club benchmarks for parallel computations.

SPEC: System Performance Evaluation Cooperative Benchmark
• First round, 1989: 10 programs yielding a single number ("SPECmarks")
• Second round, 1992: SPECint92 (6 integer programs) and SPECfp92 (14 floating-point programs)
  – Compiler flags unlimited. March 1993 flags for the DEC 4000 Model 610:
    spice: unix.c:/def=(sysv,has_bcopy,"bcopy(a,b,c)=memcpy(b,a,c)"
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
• Third round, 1995:
  – a new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point)
  – "benchmarks useful for 3 years"
  – a single flag setting for all programs: SPECint_base95, SPECfp_base95

Comparison of Machine Performance
Single program — execution time.
Collection of n programs:
1. Total execution time
2. Normalized to a reference machine; compute the time ratio of the ith program, TimeRatioi = Timei / Timei(ReferenceMachine)
Arithmetic mean = (1/n) × Σi TimeRatioi
Geometric mean = (Πi TimeRatioi)^(1/n)
Harmonic mean = n / Σi (1/TimeRatioi)
The geometric mean is consistent regardless of the reference machine. The harmonic mean decreases the impact of outliers.

Summarize Performance Results
Example: execution of two programs on three machines.
Assume program 1 has 10M floating-point operations and program 2 has 50M floating-point operations.

                              Computer A          Computer B      Computer C
Program 1 (sec)               1                   10              20
Program 2 (sec)               100                 50              20
Total time (sec)              101                 60              40
Native MFLOPS on program 1    10/1 = 10           10/10 = 1       10/20 = 0.5
Native MFLOPS on program 2    50/100 = 0.5        50/50 = 1       50/20 = 2.5
Arithmetic mean               (10+0.5)/2 = 5.25   (1+1)/2 = 1     (0.5+2.5)/2 = 1.5
Geometric mean                √(10×0.5) = 2.24    √(1×1) = 1      √(0.5×2.5) = 1.12

Weighted Arithmetic Means
For a set of n programs, each taking Timei on one machine, the "equal-time" weights on that machine are
wi = (1/Timei) / Σj (1/Timej)

          A       B       C      w(1)   w(2)    w(3)
P1 (sec)  1       10      20     0.5    0.909   0.999
P2 (sec)  1000    100     20     0.5    0.091   0.001
AM:W(1)   500.5   55      20
AM:W(2)   91.82   18.18   20
AM:W(3)   1.998   10.09   20

Figure 1.12. W(3) [W(2)] are the equal-time weights based on machine A [B]. This is used in Exercise 1.11.

Hints for Homework #1
Exercise 1.7:
1. Whetstone consists of integer operations besides the floating-point operations.
2. When a floating-point processor is not used, all floating-point operations must be emulated by integer operations (e.g., shift, and, add, sub, multiply, div, ...).
3. With different FP coprocessors, we will have the same number of integer ops but different numbers of FP ops.
Exercise 1.11:
a. Use the equal-time weighting formula on page 26.
b. DEC3000 execution time(ora) = VAX-11/780 time(ora) / DEC3000 SPECRatio = 7421/165.

FP Compilation
Compilation results depend on the existence of an FP coprocessor (Exercise 1.7). Whetstone is a benchmark with both integer and floating-point (FP) operations.
Compiling a Floating-Point Statement
Here are the assembly instructions generated for a floating-point statement in C on a DEC3100 (with the R2010 floating-point unit) using the command cc -S. Note that since the R2010 implements only the simple floating-point add, sub, mult, and div operations, sqrt, exp, and alog are translated into subroutine calls using the jal instruction. The floating-point division is translated into div.d and executed by the R2010.

# 7 x=sqrt(exp(alog(x)/t1));
s.d     $f4, 48($sp)    # x is kept at 48($sp)
l.d     $f12, 56($sp)   # load t1 into fp register f12
jal     alog            # call subroutine alog
move    $16, $2
mtc1    $16, $f6
cvt.d.w $f8, $f6        # f8 contains alog(x)
l.d     $f10, 48($sp)
div.d   $f12, $f8, $f10
jal     exp
mov.d   $f20, $f0
mov.d   $f12, $f20
jal     sqrt
s.d     $f0, 56($sp)

Homework #1
Problems 1.7 and 1.11, plus:
Problem A. The program segment f=(a-b)/(a*b) is compiled into the following MIPS R2000 code.
Instructions (op dst, src1, src2):
lw   $14, 20($sp)   # a is allocated at M[sp+20]
lw   $15, 16($sp)   # b is allocated at M[sp+16]
subu $24, $14, $15
mul  $9, $14, $15
div  $12, $24, $9
sw   $12, 0($sp)    # f is allocated at M[sp+0]

Homework #1 (continued)
Assume all the variables are already in the cache (i.e., the program does not have to go to main memory for data) and that Table 1 contains the clock cycles for each type of instruction when data is in the cache. What is the execution time (in seconds) of the above segment on an R2000 chip with a 25 MHz clock?
Problem B. Assume CPU operation accounts for 70% of the time in a system.
a) What is the overall speedup if we improve the CPU speed by 100%?
b) How much faster must the new CPU be in order to achieve an overall speedup of 1.7?
c) Is it possible to achieve an overall speedup of 3 by improving only the CPU?
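Amdahl's-Law questions in the style of Problem B follow the same pattern as Examples 1 and 2 earlier in these notes. The sketch below is a minimal encoding of the overall-speedup formula (the helper names `overall_speedup` and `max_overall_speedup` are mine, not from the slides), checked against Example 1 from the text (fraction 0.9, enhancement speedup 10, overall speedup 5.26).

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's Law: 1 / ((1 - F) + F / S), with F measured on the old system."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

def max_overall_speedup(fraction_enhanced):
    """Limit as the enhancement speedup goes to infinity: 1 / (1 - F)."""
    return 1.0 / (1.0 - fraction_enhanced)

# Example 1 from the text: memory access is 90% of the time and gets 10x faster.
print(round(overall_speedup(0.9, 10), 2))   # 5.26
# Example 2b from the text: with only 40% of the time enhanceable,
# the ceiling is 1/(1 - 0.4), no matter how fast the CPU becomes.
print(round(max_overall_speedup(0.4), 2))   # 1.67
```

The second function makes the "is it possible?" parts of such problems immediate: if the requested overall speedup exceeds 1/(1 - F), no enhancement of that fraction alone can achieve it.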