Chapter 1: Introduction to Computer Architecture
Computer Architecture, COE 501

Course Objective
The course objective is to gain the knowledge required to design and analyze high-performance computer systems.

[Figure: computer architecture (instruction set design, machine organization, hardware) is shaped by technology, parallelism, programming languages, applications, operating systems, interface design, and history]

Topic Coverage
Textbook: Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 3rd Ed., 2004.
• Fundamentals of Computer Architecture (Chapter 1)
• Instruction Set Architecture (Chapter 2; Appendix D, reading)
• ILP and Dynamic Exploitation (Chapter 3)
• Exploiting ILP with Software Approaches (Chapter 4)
• Memory Hierarchy (Chapter 5)
• Multiprocessors (Chapter 6)
• Networks and Interconnection Technology (overview, Chapter 7)
• Clusters (overview, Chapter 8)
Plus several computer architecture research papers.

Computer Architecture Topics
[Figure: the storage hierarchy, from disks up to L1 cache, with the issues studied at each level]
• Input/output and storage: disks and tape, RAID, emerging technologies
• Memory hierarchy (DRAM, L2 cache, L1 cache; VLSI): interleaving; coherence, bandwidth, latency; cache design (block size, associativity)
• Instruction set architecture: addressing modes, formats
• Processor design: pipelining, hazard resolution, superscalar, reordering, ILP, branch prediction, speculation

Computer Architecture Topics (continued)
Networks, interconnections, and multiprocessors:
[Figure: processor-memory (P-M) pairs connected by an interconnection network]
• Shared memory or message passing
• Network interfaces
• Topologies, routing, bandwidth, latency, reliability

Context for Designing New Architectures
• Application area
  – Special purpose / general purpose
  – Scientific (floating-point intensive) / commercial (integer intensive)
• Level of software compatibility required
  – Object code / binary compatible
  – Assembly language
  – Programming language
• Operating system requirements
  – Size of address space
  – Memory management / protection
  – Context switch capability
  – Interrupts and traps
• Standards
  – IEEE 754 floating point
  – I/O bus
  – Networks
• Technology improvements
  – Increased processor performance
  – Larger memory and I/O devices
  – Software/compiler innovations

Technology Trends
• Functionality enhancements
  – Improvements in networking technology
  – Increase in communication and multimedia support
  – Support for larger programs
• Performance
  – Technology advances: decreases in feature size, increases in wafer size, lower voltages
  – Computer architecture advances improve the low end: RISC, superscalar, pipelining, …
• Price: lower costs due to
  – Simpler architectures
  – Higher volumes
  – More reliable systems
  – Improved technology

Technology Trends: Microprocessor Capacity
[Figure: transistors per microprocessor vs. year, 1970-2000, log scale, from the i4004 through the i8086, i80286, i80386, i80486, and Pentium]
• Alpha 21264: 15 million transistors; Alpha 21164: 9.3 million; PowerPC 620: 6.9 million; Pentium Pro: 5.5 million; Sparc Ultra: 5.2 million
• CMOS improvements: die size doubles every 3 years; line width halves every 7 years

Memory Capacity (Single-Chip DRAM)
[Figure: bits per DRAM chip vs. year, 1970-2000, log scale]
Moore's Law for memory: memory capacity increases by 4x every 3 years.

Year   Size     Cycle time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1996   64 Mb    125 ns
2000   256 Mb   100 ns

Technology Improvements
        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
DRAM    4x in 3 years   1.4x in 10 years
Disk    2x in 3 years   1.4x in 10 years

Speed increases of memory and I/O have not kept pace with processor speed increases. Why does DRAM capacity increase faster than logic capacity?
Processor Performance (1.35x per year before, 1.55x per year now)
[Figure: integer performance vs. year, 1987-1997, for Sun, MIPS, HP, IBM, and DEC systems, from the MIPS M/120 up through the DEC Alpha 21264/600]

Processor Frequency Trend
[Figure: clock frequency (MHz) and gate delays per clock vs. year, 1987-2005, for Intel (386, 486, Pentium, Pentium Pro, Pentium II), IBM PowerPC (601-750), and DEC Alpha (21064A-21264S) processors]
• Frequency doubles each generation.
• The number of gate delays per clock is reduced by 25% per generation.

Processor Performance Trends
[Figure: performance vs. year, 1965-2000, for supercomputers, mainframes, minicomputers, and microprocessors; microprocessors have been closing the gap on every other class]

1988 Computer Food Chain
Mainframe, supercomputer, minisupercomputer, massively parallel processors, minicomputer, workstation, PC

Current Computer Food Chain
Mainframe, supercomputer, server, workstation, PC, PDA
(Minisupercomputers, massively parallel processors, and minicomputers have largely died out.)

Processor Perspective
Putting performance growth in perspective:

         Pentium 3 (Coppermine)   Cray YMP
Type     Desktop                  Supercomputer
Year     2000                     1988
Clock    1130 MHz                 167 MHz
MIPS     > 1000 MIPS              < 50 MIPS
Cost     $2,000                   $1,000,000
Cache    256 KB                   0.25 KB
Memory   512 MB                   256 MB

Where Has This Performance Improvement Come From?
• Technology
  – More transistors per chip
  – Faster logic
• Machine organization/implementation
  – Deeper pipelines
  – More instructions executed in parallel
• Instruction set architecture
  – Reduced instruction set computers (RISC)
  – Multimedia extensions
  – Explicit parallelism
• Compiler technology
  – Finding more parallelism in code
  – Greater levels of optimization

What is Ahead?
• Greater instruction-level parallelism
• Bigger caches, and more levels of cache
• Multiple processors per chip
• Complete systems on a chip
• High-performance interconnect

Current Computers
• Technology
  – Very large dynamic RAM: 10 GB and beyond
  – Large, fast static RAM: 500 MB, 1-2 ns
  – Very large disks: approaching 300 GB
• Complete systems on a chip
  – 50+ million transistors
• Parallelism
  – Superscalar, VLIW
  – Superpipelined
  – Explicitly parallel
  – Multiprocessors
  – Distributed systems

Current Computers (continued)
• Low power
  – 60% of PCs are portable
  – Performance per watt is now of interest
• Parallel I/O
  – Many applications are I/O limited, not computation limited
  – Processor speeds increase faster than memory and I/O speeds
• Multimedia
  – New interface technologies
  – Video, speech, handwriting, virtual reality, …
• Embedded systems are extremely important
  – 90% of computers manufactured, and 50% of processor revenue, are in the embedded market (e.g., microcontrollers, DSPs, graphics processors)

Hardware Technology
               1980       1990        Current
Memory chips   64 KB      4 MB        2 GB
Clock rate     1-2 MHz    20-40 MHz   3-6 GHz
Hard disks     40 MB      1 GB        300 GB
Floppies       0.256 MB   1.5 MB      1-4 GB

Computing in the 21st Century
• Memory capacity continues to quadruple about every 3 years.
• Single-chip multiprocessor systems.
• High-speed communication networks.
• These improvements will create the need for new and innovative computer systems.

Measurement and Evaluation
Architecture is an iterative process:
• Search the possible design space
• Make selections
• Evaluate the selections made
Good measurement tools are required to evaluate the selections accurately.

Cost/Performance Analysis
[Figure: the iterative design loop sorts ideas into good, mediocre, and bad ones]

Measurement Tools
• Benchmarks
• Simulation at many levels: ISA, RTL, gate, transistor
• Cost, delay, and area estimates
• Queuing theory
• Rules of thumb
• Fundamental laws

The Bottom Line: Performance (and Cost)

Plane              DC to Paris   Speed      Passengers   Throughput (pmph)
Boeing 747         6.5 hours     610 mph    470          286,700
BAC/Sud Concorde   3 hours       1350 mph   132          178,200

• Time to run the task (execution time): execution time, response time, latency
• Tasks per day, hour, week, sec, ns, … (performance): throughput, bandwidth

Performance and Execution Time
Execution time and performance are reciprocals:

ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)

• Speed of the Concorde vs. the Boeing 747: the Concorde is 6.5/3 = 2.2x faster (lower latency).
• Throughput of the Boeing 747 vs. the Concorde: the 747 delivers 286,700/178,200 = 1.6x the passenger-miles per hour.

Performance Terminology
"X is n% faster than Y" means:

ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y) = 1 + n/100

Equivalently:

n = 100 x (Performance(X) - Performance(Y)) / Performance(Y)
n = 100 x (ExTime(Y) - ExTime(X)) / ExTime(X)

Example: Y takes 15 seconds to complete a task and X takes 10 seconds. What percent faster is X?

n = 100 x (15 - 10) / 10 = 50%

X is 50% faster than Y.
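The "n% faster" definition is easy to check numerically. Here is a minimal Python sketch (the helper names speedup and percent_faster are ours, not from the slides) that reproduces the example above:

```python
def speedup(extime_y: float, extime_x: float) -> float:
    """Speedup of X over Y: ExTime(Y) / ExTime(X) = Performance(X) / Performance(Y)."""
    return extime_y / extime_x

def percent_faster(extime_y: float, extime_x: float) -> float:
    """n such that 'X is n% faster than Y': 100 x (ExTime(Y) - ExTime(X)) / ExTime(X)."""
    return 100.0 * (extime_y - extime_x) / extime_x

# Example above: Y takes 15 s, X takes 10 s.
print(speedup(15, 10))         # 1.5
print(percent_faster(15, 10))  # 50.0, so X is 50% faster than Y
```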
Amdahl's Law
Speedup due to an enhancement E:

Speedup(E) = ExTime without E / ExTime with E = Performance with E / Performance without E

Suppose that enhancement E accelerates a fraction Fraction_enhanced of the task by a factor Speedup_enhanced, and the remainder of the task is unaffected. What are the new execution time and the overall speedup due to the enhancement?

ExTime_new = ExTime_old x ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Example of Amdahl's Law
Floating-point instructions are improved to run 2x faster, but only 10% of execution time is spent on these instructions.

ExTime_new = ExTime_old x (0.9 + 0.1/2) = 0.95 x ExTime_old

Speedup_overall = ExTime_old / ExTime_new = 1 / 0.95 = 1.053

The new machine is 5.3% faster for this mix of instructions.
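Amdahl's Law is just as easy to evaluate directly. A minimal Python sketch (amdahl_speedup is a hypothetical helper name) reproduces the floating-point example and also shows the law's ceiling: since 90% of the time is unaffected, even an infinite floating-point speedup could not make this workload more than about 11% faster.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when fraction_enhanced of execution time is accelerated
    by speedup_enhanced and the remainder of the task is unaffected."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# FP instructions run 2x faster but account for only 10% of execution time.
print(amdahl_speedup(0.10, 2.0))   # ~1.053, i.e., 5.3% faster overall

# Ceiling: the overall speedup can never exceed 1 / (1 - 0.10).
print(amdahl_speedup(0.10, 1e12))  # ~1.111
```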
Make the Common Case Fast
• All instructions require an instruction fetch; only a fraction require a data fetch/store.
  – Optimize instruction access over data access.
• Programs exhibit locality: 90% of execution time is spent in 10% of the code.
  – Temporal locality: items referenced recently are likely to be referenced again soon.
  – Spatial locality: items near recently referenced items are likely to be referenced.
• Access to small memories is faster.
  – Provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories: registers, cache, main memory, disk/tape.

Hardware/Software Partitioning
• The simple case is usually the most frequent and the easiest to optimize!
• Do simple, fast things in hardware, and be sure the rest can be handled correctly in software.
Would you handle these in hardware or software?
• Integer addition?
• Accessing data from disk?
• Floating-point square root?

Performance Factors

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

• The number of instructions per program is called the instruction count (IC).
• The average number of cycles per instruction is called the CPI.
• The number of seconds per cycle is the clock period.
• The clock rate is the multiplicative inverse of the clock period and is given in cycles per second. For example, if a processor has a clock period of 5 ns, its clock rate is 1 / (5 x 10^-9 s) = 200 MHz.

Aspects of CPU Performance

CPU time = Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

Each design layer affects different terms of the equation:

                  Instr. count   CPI   Clock rate
Program           X
Compiler          X              X
Instruction set   X              X
Organization                     X     X
Technology                             X

Marketing Metrics

MIPS = Instruction count / (Execution time x 10^6) = Clock rate / (CPI x 10^6)

• Not effective for comparing machines with different instruction sets
• Not effective for programs with different instruction mixes
• Uncorrelated with performance

MFLOPS = FP operations / (Execution time x 10^6)

• Machine dependent
• Often not where time is spent
• Peak: the maximum the machine is able to achieve
• Native: the average for a set of benchmarks
• Relative: compared to another platform

Normalized MFLOPS weights each operation by its cost:
  add, sub, compare, mult   1
  divide, sqrt              4
  exp, sin, …               8
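To tie the pieces together, here is a minimal Python sketch (the workload numbers are hypothetical, and cpu_time and mips are our helper names) relating instruction count, CPI, and clock rate to execution time and to the MIPS rating defined above:

```python
def cpu_time(instr_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = IC x CPI x clock period = IC x CPI / clock rate."""
    return instr_count * cpi / clock_rate_hz

def mips(instr_count: float, exec_time_s: float) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return instr_count / (exec_time_s * 1e6)

# Hypothetical workload: 200 million instructions, CPI = 2.5, 200 MHz clock.
t = cpu_time(200e6, 2.5, 200e6)
print(t)               # 2.5 (seconds)
print(mips(200e6, t))  # 80.0, i.e., clock rate / (CPI x 10^6)
```

Note that MIPS reduces to clock rate / (CPI x 10^6), which says nothing about how much work each instruction accomplishes; this is why it cannot compare machines with different instruction sets.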
Programs to Evaluate Processor Performance
• Toy benchmarks
  – 10-100 line programs
  – e.g., sieve, puzzle, quicksort
• Synthetic benchmarks
  – Attempt to match the average frequencies of real workloads
  – e.g., Whetstone, Dhrystone
• Kernels
  – Time-critical excerpts of real programs
• Real benchmarks

Benchmarks
• Benchmark mistakes
  – Only average behavior represented in the test workload
  – Loading level controlled inappropriately
  – Caching effects ignored
  – Buffer sizes not appropriate
  – Ignoring monitoring overhead
  – Not ensuring the same initial conditions
  – Collecting too much data but doing too little analysis
• Benchmark tricks
  – Compiler wired to optimize the workload
  – Very small benchmarks used
  – Benchmarks manually translated to optimize performance

SPEC: System Performance Evaluation Cooperative
• First round, SPEC CPU89
  – 10 programs yielding a single number
• Second round, SPEC CPU92
  – SPEC CINT92 (6 integer programs) and SPEC CFP92 (14 floating-point programs)
  – Compiler flags can be set differently for different programs
• Third round, SPEC CPU95
  – New set of programs: SPEC CINT95 (8 integer programs) and SPEC CFP95 (10 floating point)
  – Single flag setting for all programs
• Fourth round, SPEC CPU2000
  – New set of programs: SPEC CINT2000 (12 integer programs) and SPEC CFP2000 (14 floating point)
  – Single flag setting for all programs
  – Programs in C, C++, Fortran 77, and Fortran 90

SPEC 2000
• 12 integer programs, written in C (11) and C++ (1):
  – Compression (2 programs)
  – Circuit placement and routing (2 programs)
  – C programming language compiler
  – Combinatorial optimization
  – Chess
  – Word processing
  – Computer visualization
  – PERL programming language interpreter
  – Group theory interpreter
  – Object-oriented database
• 14 floating-point programs, written in Fortran (10) and C (4):
  – Quantum physics
  – Shallow water modeling
  – Multi-grid solver (3D potential field)
  – Parabolic/elliptic PDEs
  – 3D graphics library
  – Computational fluid dynamics
  – Image recognition
  – Seismic wave simulation
  – Image processing
  – Computational chemistry
  – Number theory / primality testing
  – Finite-element crash simulation
  – High-energy nuclear physics
  – Pollutant distribution

Other SPEC Benchmarks
• JVM98: measures the performance of Java virtual machines
• SFS97: measures the performance of network file server (NFS) protocols
• Web99: measures the performance of World Wide Web applications
• HPC96: measures the performance of large, industrial applications
• APC, MEDIA, OPC: measure the performance of graphics applications
For more information about the SPEC benchmarks, see http://www.spec.org.

Conclusions on Performance
• A fundamental rule in computer architecture is to make the common case fast.
• The most accurate measure of performance is the execution time of representative real programs (benchmarks).
• Execution time depends on the number of instructions per program, the number of cycles per instruction, and the clock rate.
• When designing computer systems, both cost and performance must be taken into account.