UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Computer Architecture ECE 668
Part 1: Introduction
Csaba Andras Moritz
Coping with ECE 668
- Students with varied backgrounds
- Prerequisites: basic computer architecture, VLSI
- 2 projects to choose from, with some flexibility beyond that; you need software and/or Verilog/HSPICE skills to complete them
- 2 exams: midterm and final
- Class participation; attend office hours
- About the instructor
- First lectures: review of performance and pipelining (Chapter 1 + Appendix A)
- Many lectures will use the whiteboard as well as slides
- Lectures related to the textbook and beyond; many lectures are outside the textbook
- Web: www.ecs.umass.edu/ece/andras/courses/ECE668/

What you should know
- Basic machine structure: processor (data path, control, arithmetic), memory, I/O
- Read and write in an assembly language, C, C++, ...; MIPS/ARM ISA preferred
- Understand the concepts of pipelining and virtual memory
- Basic VLSI: HSPICE and/or Verilog

Textbook and references
- Textbook: D.A. Patterson and J.L. Hennessy, Computer Architecture: A Quantitative Approach, 4th edition (or later), Morgan Kaufmann.
- Recommended reading: J.P. Shen and M.H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.
- Chandrakasan et al., Design of High-Performance Microprocessor Circuits
- NASIC research papers and Nanoelectronics textbook chapter; SKYBRIDGE, N3ASIC, CMOL, FPNI, SPWF papers if interested
- Other research papers we bring up in class

Course Outline
I. Introduction (Ch 1)
II. Pipeline Design (App A)
III. Instruction-level Parallelism, Pipelining (App A, Ch 2)
IV. Memory Design: Memory Hierarchy, Cache Memory, Secondary Memory (Ch 4)
V. Multiprocessors (Ch 3)
VI. Deep Submicron Implementation: Process Variation, Power-aware Architectures, the Compiler's Role
VII. Nanoscale Architectures

Administrative Details
- Instructor: Prof. Csaba Andras Moritz, KEB 309H, Email: [email protected]
- Office Hours: 2:30-3:30 pm Tue. & 2:30-3:00 pm Thu.
- TA: pending
- Course web page: details available at http://www.ecs.umass.edu/ece/andras/courses/ECE668

Grading
- Midterm I: 35%
- Project: 30% (two projects to choose from)
- Class participation: 5%
- Final exam: 30%
- Homework: exam-style questions

What is "Computer Architecture"?
Computer Architecture = Instruction Set Architecture + Machine Organization (e.g., pipelining, memory hierarchy, storage systems, etc.), or Unconventional Organization
- IBM 360: one ISA across minicomputer, mainframe, and supercomputer implementations
- Intel x86 vs. ARM vs. Nanoprocessors

Computer Architecture Topics - Processors
- Input/Output and Storage: disks, tape, RAID (performance, reliability)
- Memory Hierarchy: DRAM (interleaving, bus protocols), L2 cache, L1 cache, VLSI; addressing, bandwidth, latency
- Instruction Set Architecture: pipelining, hazard resolution, superscalar, reordering, branch prediction, VLIW, vector; instruction-level parallelism

Advanced CMOS multi-cores & nano processors? (2013)

Scaling
[Figure courtesy of Intel; Copyright - Baskaran Ganesan, Intel Higher Education Program]

Shrinking geometry
[Figure courtesy of Intel; Copyright - Baskaran Ganesan, Intel Higher Education Program]

Die
[Figure]

Wafer
[Figure]

CPUs: Archaic (Nostalgic) vs. Semi-Modern vs. Modern?
1982 Intel 80286
- 12.5 MHz
- 2 MIPS (peak)
- Latency 320 ns
- 134,000 xtors, 47 mm²
- 16-bit data bus, 68 pins
- Microcode interpreter, separate FPU chip (no caches)

2001 Intel Pentium 4
- 1500 MHz (120X)
- 4500 MIPS (peak) (2250X)
- Latency 15 ns (20X)
- 42,000,000 xtors, 217 mm²
- 64-bit data bus, 423 pins
- 3-way superscalar; dynamic translation to RISC-style micro-ops; superpipelined (22 stages); out-of-order execution
- On-chip 8KB data cache, 96KB instruction trace cache, 256KB L2 cache

2015? Multi-core = Network on a chip
Everything you learn as CSE students applied/integrated in a chip!

Intel Polaris with 80 cores
[Figure courtesy of Intel; Copyright - Baskaran Ganesan, Intel Higher Education Program]

Tilera processor with 64 cores
MIT startup from the Raw project (I used to be involved in this)

What is next: Nanoprocessors?
Molecular memory, NASIC processors, 3D? Crossed-nanowire devices
[Figure: crossed nanowire devices, courtesy of Prof. Chui's group at UCLA]
[Figure: NASIC ALU datapath with 2-4 decoders, register file rf3~0, opcode/operanda/operandb/dest/result signals, and adder/multiplier. Copyright: NASIC group, UMASS]

From Nanodevices to Nanocomputing
- Crossed nanowire array (n+ gate, p-channel, n+ source & drain)
- Array-based circuits with built-in fault tolerance (NASICs)
- Evaluation/cascading: streaming control with surrounding microwires
[Figure: crossed-nanowire adder tile with inputs a0-a3 and b0-b3, carries c0-c4, sums s0-s3, and clk/Up/Down control signals]
- Nanoprocessor

NASICs Fabric-Based Architectures
Wire Streaming Processor (WISP), a general-purpose stream processor:
- 5-stage pipeline with minimal feedback
- Built-in fault tolerance: up to 10% device-level defect rates
- 33X density advantage vs. 16nm scaled CMOS
- Simpler manufacturing
- ~9X improved power-per-performance efficiency (rough estimate)
Cellular Architecture, special purpose for image and signal processing:
- Massively parallel array of identical, interacting, simple functional cells
- Fully programmable from external template signals
- 22X denser than 16nm scaled CMOS

N3ASIC - 3D Nanowire Technology

N3P - Hybrid Spin-Charge Platform

Skybridge 3D Circuits - Vertically Integrated
- 3D circuit concept and 1-bit full adder designed in my group
- FETs are gate-all-around on vertical nanowires

Example ISAs in Processors (Instruction Set Architectures)
ARM            1985  (32, 64-bit, v8)
Digital Alpha  1992  (v1, v3)
HP PA-RISC     1986  (v1.1, v2.0)
Sun Sparc      1987  (v8, v9)
MIPS           1986  (MIPS I, II, III, IV, V)
Intel x86      1978  (8086, 80286, 80386, 80486, Pentium, MMX, ...)
RISC vs. CISC

Basics
Let us review some basics.

RISC ISA Encoding Example
[Figure: instruction encoding formats]

Virtualized ISAs
BlueRISC TrustGUARD
- ISA is randomly created internally
- Fluid: more than one ISA possible

Characteristics of RISC
- Only load/store instructions access memory
- A relatively large number of registers

Goals of new computer designs
- Higher performance
- More functionality (e.g., MMX)
- Other design objectives? (examples)
How to measure performance?
- Time to run the task: execution time, response time, latency
- Performance may be defined as 1 / Execution_time
- Throughput, bandwidth

Speedup
performance(x) = 1 / execution_time(x)
"Y is n times faster than X" means
n = speedup = Execution_time(old, brand X) / Execution_time(new, brand Y)
Speedup must be greater than 1: Tx/Ty = 3/2 = 1.5, but not Ty/Tx = 2/3 = 0.67

MIPS and MFLOPS
- MIPS (Million Instructions Per Second). Can we compare two different CPUs using MIPS?
- MFLOPS (Million Floating-point Operations Per Second)
- Both are application dependent (e.g., on the compiler), but still useful for benchmarks
- Benchmarks, e.g., SPEC CPU 2000: 26 applications (with inputs)
  - SPECint2000: twelve integer, e.g., gcc, gzip, perl
  - SPECfp2000: fourteen floating-point intensive, e.g., equake
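To make these definitions concrete, here is a minimal sketch of speedup and MIPS exactly as defined above; all the numbers are illustrative placeholders, not measurements:

```python
# Minimal sketch of the speedup and MIPS definitions above.
# All numbers here are illustrative placeholders, not measurements.

def speedup(time_old: float, time_new: float) -> float:
    """Speedup n = Execution_time(old) / Execution_time(new); must be > 1."""
    return time_old / time_new

def mips(instruction_count: float, exec_time_s: float) -> float:
    """Million Instructions Per Second = IC / (execution time * 10^6)."""
    return instruction_count / (exec_time_s * 1e6)

# "Y is 1.5 times faster than X": Tx = 3 s, Ty = 2 s
print(speedup(3.0, 2.0))                # 1.5 (not 2/3 = 0.67)

# Same hypothetical program, 9e9 instructions: MIPS follows execution
# time, but says nothing about how much work each instruction does,
# which is why comparing different ISAs by MIPS is misleading.
print(mips(9e9, 3.0), mips(9e9, 2.0))   # 3000.0, 4500.0
```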
SPEC CPU 2000: SPECint2000
Benchmark     Language   Category
164.gzip      C          Compression
175.vpr       C          FPGA Circuit Place & Route
176.gcc       C          C Compiler
181.mcf       C          Combinatorial Optimization
186.crafty    C          Game Playing: Chess
197.parser    C          Word Processing
252.eon       C++        Computer Visualization
253.perlbmk   C          PERL Programming Language
254.gap       C          Group Theory, Interpreter
255.vortex    C          Object-oriented Database
256.bzip2     C          Compression
300.twolf     C          Place and Route Simulator
(www.specbench.org/cpu2000)

SPECfp2000
Benchmark     Language   Category
168.wupwise   Fortran77  Quantum Chromodynamics
171.swim      Fortran77  Shallow Water Modeling
172.mgrid     Fortran77  Multi-grid Solver
173.applu     Fortran77  Partial Differential Equations
177.mesa      C          3-D Graphics Library
178.galgel    Fortran90  Fluid Dynamics
179.art       C          Image Recognition / Neural Nets
183.equake    C          Seismic Wave Propagation
187.facerec   Fortran90  Face Recognition
188.ammp      C          Computational Chemistry
189.lucas     Fortran90  Primality Testing
191.fma3d     Fortran90  Finite-element Crash Simulation
200.sixtrack  Fortran77  Nuclear Physics Accelerator Design
301.apsi      Fortran77  Meteorology: Pollutant Distribution
SPEC CPU 2006 (still current)

Other Benchmarks (www.spec.org)
Workload Category                Example Benchmark Suite
CPU - Uniprocessor               SPEC CPU 2006
CPU - Parallel Processor         ASCI, SPLASH, NASPAR
Multimedia                       MediaBench
Embedded                         EEMBC benchmarks
Digital Signal Processing        BDTI benchmarks
Java - Client side               SPECjvm98, CaffeineMark
Java - Server side               SPECjBB2000, VolanoMark
Java - Scientific                Java Grande Forum Benchmarks, SciMark
On-Line Transaction Processing   TPC-C, TPC-W
Decision Support Systems         TPC-H, TPC-R
Web Server                       SPECweb99, TPC-W, VolanoMark
Electronic Commerce              TPC-W, SPECjBB2000
Mail Server                      SPECmail2000
Network File System              SPEC SFS 2.0
Personal Computer                SYSMARK, WinBench, 3DMarkMAX99
Handheld                         SPEC committee

Synthetic Benchmarks
Whetstone Benchmark (www.cse.clrc.ac.uk/disco/Benchmarks/whetstone.shtml)
[Table: top-10 Whetstone rankings, from Pentium 4/3066 (ifc) through HP Superdome Itanium2/1500, HP RX5670 Itanium2/1500, Pentium 4/2666 (ifc), IBM pSeries 690Turbo/1.7, Compaq Alpha ES45/1250, HP RX4640 Itanium2/1300, IBM Regatta-HPC/1300, IBM pSeries 690Turbo/1.3, and AMD Opteron848/2200, with MWIPS ratings from about 4071 down to 3158 and per-loop Mflop ratings (Vl=1024)]

Dhrystone Benchmark (MIPS cores)
Core   DMIPS/MHz  Freq (MHz)  DMIPS  Inline DMIPS/MHz  Inline DMIPS
4Kc    1.3        300         390    1.6               480
4KEc   1.35       300         405    1.8               540
5Kc    1.4        350         490    2.0               700
5Kf    1.4        320         448    2.0               640
20Kc   1.7        600         1020   2.2               1320

How do we design faster CPUs?
- Faster technology: used to be the main approach, but (a) it is getting more expensive, (b) reliability & yield suffer, and (c) the speed of light (3×10^8 m/sec) is a hard limit
- Larger dies (SOC - System On a Chip): fewer wires between ICs, but low yield (next slide)
- Parallel processing: use n independent processors - limited success so far; n-issue superscalar microprocessors (currently n = 4). Can we expect a speedup of n?
- Pipelining
- Multi-threading

Power consumption
- Dynamic: P_dynamic = α × C_L × V_dd² × f (activity factor, load capacitance, supply voltage, clock frequency); see the sketch after these power slides
- Leakage: mainly subthreshold (the FETs leak current); significant at small feature sizes (lower Ion/Ioff)
Power-aware architectures
- The objective is often to minimize activity
- Role of compilers: control
- Circuit-level optimizations: make the same work more efficient
- CAD tools (e.g., clock gating): make it easy to add

Define and quantify power
- Power_idle = Current_idle × Voltage
- Leakage current increases in processors with smaller transistor sizes
- Increasing the number of transistors increases power even if they are turned off
- Leakage is dominant below 90 nm
- Very low power systems even gate the voltage to inactive modules to control loss due to leakage
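To see why the V_dd² term dominates, here is a minimal sketch of the dynamic-power equation above; the device values are made-up assumptions, not process data:

```python
# Sketch of dynamic power P = alpha * C_L * Vdd^2 * f and why voltage
# scaling is so effective. All device values below are illustrative
# assumptions, not real process numbers.

def dynamic_power(alpha: float, c_load: float, vdd: float, freq: float) -> float:
    """Switching power in watts: activity * capacitance * Vdd^2 * frequency."""
    return alpha * c_load * vdd**2 * freq

base = dynamic_power(alpha=0.1, c_load=1e-9, vdd=1.2, freq=2e9)    # ~0.29 W

# Dropping Vdd by 20% cuts dynamic power by ~36% (the quadratic term),
# even before any frequency reduction:
scaled = dynamic_power(alpha=0.1, c_load=1e-9, vdd=0.96, freq=2e9)
print(base, scaled, scaled / base)   # ratio = (0.96 / 1.2)^2 = 0.64
```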
Define and quantify dependability (2/3)
Module reliability = a measure of continuous service accomplishment (or time to failure). Two metrics:
- Mean Time To Failure (MTTF) measures reliability
- Mean Time To Repair (MTTR) measures service interruption
- Failure rate = 1/MTTF; Failures In Time (FIT) is this rate traditionally reported as failures per billion hours of operation
- Mean Time Between Failures (MTBF) = MTTF + MTTR
Module availability (MA) measures service as it alternates between the two states of accomplishment and interruption (a number between 0 and 1, e.g., 0.9)

Module availability
MA = MTTF / (MTTF + MTTR)

Example: calculating reliability
If modules have exponentially distributed lifetimes (the age of a module does not affect its probability of failure), the overall failure rate is the sum of the failure rates of the modules.
Calculate FIT and MTTF for 10 disks (1M-hour MTTF each), 1 disk controller (0.5M-hour MTTF), and 1 power supply (0.2M-hour MTTF):
FailureRate = 10 × (1/1,000,000) + 1/500,000 + 1/200,000
            = (10 + 2 + 5) / 1,000,000 failures per hour
            = 17,000 FIT (failures per billion hours)
MTTF = 1 / FailureRate = 10^9 / 17,000 ≈ 59,000 hours

Integrated Circuits: Yield
Die_yield = Wafer_yield × (1 + Defect_density × Die_area / α)^(−α)

Integrated Circuits: Costs
IC_cost = (Die_cost + Testing_cost + Packaging_cost) / Final_test_yield
Die_cost = Wafer_cost / (Dies_per_wafer × Die_yield)
Dies_per_wafer = π × (Wafer_diam/2)² / Die_area − π × Wafer_diam / √(2 × Die_area) − Test_dies
Die cost goes up roughly with (Die_area)²; the sketch below plays this out.
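Here is a small sketch of the die-cost chain above, assuming illustrative wafer cost, defect density, and alpha values (none of these numbers come from the slides):

```python
import math

# Sketch of the die-cost formulas above. Wafer price, defect density,
# and alpha below are illustrative assumptions, not process data.

def dies_per_wafer(wafer_diam_cm: float, die_area_cm2: float) -> float:
    """Gross dies: wafer area over die area, minus the edge-loss term."""
    return ((math.pi * (wafer_diam_cm / 2) ** 2) / die_area_cm2
            - (math.pi * wafer_diam_cm) / math.sqrt(2 * die_area_cm2))

def die_yield(wafer_yield: float, defects_per_cm2: float,
              die_area_cm2: float, alpha: float) -> float:
    """Die_yield = Wafer_yield * (1 + D * A / alpha)^(-alpha)."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

def die_cost(wafer_cost: float, wafer_diam_cm: float,
             die_area_cm2: float, defects_per_cm2: float, alpha: float) -> float:
    n = dies_per_wafer(wafer_diam_cm, die_area_cm2)
    y = die_yield(1.0, defects_per_cm2, die_area_cm2, alpha)
    return wafer_cost / (n * y)

# Doubling die area more than doubles cost (fewer dies AND lower yield),
# which is the "cost rises roughly with (die area)^2" rule of thumb:
for area in (0.5, 1.0, 2.0):   # cm^2
    print(area, round(die_cost(5000.0, 30.0, area, 0.4, 4.0), 2))
```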
Amdahl's Law - Basics
Example: executing a program on n independent processors.
Fraction_enhanced = the parallelizable part of the program; Speedup_enhanced = n
ExTime_new = ExTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / n]
Speedup_overall = ExTime_old / ExTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / n]
lim (n→∞) Speedup_overall = 1 / (1 − Fraction_enhanced)

Amdahl's Law - Graph
Law of diminishing returns: speedup saturates at 1 / (1 − f_enh).
[Figure: speedup vs. number of processors for several values of f_enh]

Amdahl's Law - Extension
Example: improving part of a processor (e.g., multiplier, floating-point unit).
Fraction_enhanced = the part of the program to be enhanced
Speedup_overall = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced] < 1 / (1 − Fraction_enhanced)
A given signal processing application consists of 40% multiplications; an enhanced multiplier will execute 5 times faster:
Speedup_overall = 1 / (0.6 + 0.4/5) = 1 / 0.68 = 1.47 < 1/0.6 = 1.66

Amdahl's Law - Another Example
Floating-point instructions are improved to run 2X faster, but only 10% of actual run time is spent in FP instructions:
ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
Speedup_overall = 1 / 0.95 = 1.053

Instruction execution
Components of average execution time (CPI Law); average CPU time per program:
CPU time = Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
where Cycles/Instruction = CPI and Seconds/Cycle = 1/clock_rate.
The "End to End Argument" is what RISC was ultimately about: it is the performance of the complete system that matters, not of individual components!

Cycles Per Instruction - Another Performance Metric
"Average cycles per instruction": CPI = Total_No_of_Cycles / Instruction_Count
"CPI of individual instructions": CPI_j = CPI for instruction j (j = 1, ..., n); I_j = number of times instruction j is executed
CPU_time = Cycle_Time × Σ_{j=1..n} (CPI_j × I_j)
"Instruction frequency": CPI = Σ_{j=1..n} (CPI_j × F_j), where F_j = I_j / Instruction_Count

Example: Calculating CPI
Base machine, typical mix of instruction types in a program:
Op              Freq  Cycles  CPI_j × F_j  (% Time)
ALU (Reg/Reg)   50%   1       0.5          (33%)
Load            20%   2       0.4          (27%)
Store           10%   2       0.2          (13%)
Branch          20%   2       0.4          (27%)
Total: CPI = 1.5

Pipelining - Basics
Four consecutive operations Z = F(X,Y) = SqRoot(X² + Y²)
Stages: Stage 1: squaring (X², Y²); Stage 2: addition; Stage 3: square root
If each step takes 1T, then one calculation takes 3T and four consecutive calculations take 12T.
Assuming ideally that each pipeline stage takes 1T:
- What will be the latency (time to produce the first result)?
- What will be the throughput (pipeline rate in the steady state)?

Pipelining - Timing
With the 3-stage pipeline, four operations complete in a total of 6T: 3T of latency for the first result, then one result every T.
For n operations: total time = 3T + (n−1)T, i.e., latency + (n−1) throughput periods
Speedup = 3nT / (3T + (n−1)T) = 3n / (n + 2) → 3 (the number of stages) as n grows

Pipelining - Non-ideal
1. Stages take T1, T2, T3: Rate = 1 / max(T_i); the slowest unit determines the throughput
2. To allow independent operation of the stages, latches must be added: t = t_latch + max(T_i)
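The timing arithmetic above can be checked with a minimal sketch (the 3-stage, 1T setup mirrors the example; the non-ideal stage times at the end are made-up values):

```python
# Pipeline timing sketch for a k-stage pipeline with stage time T:
# first result after k*T (latency), then one result per T (throughput),
# so n operations take (k + n - 1) * T versus k * n * T unpipelined.

def pipelined_time(n_ops: int, stages: int, t: float) -> float:
    return (stages + n_ops - 1) * t

def speedup(n_ops: int, stages: int, t: float = 1.0) -> float:
    return (stages * n_ops * t) / pipelined_time(n_ops, stages, t)

print(pipelined_time(4, 3, 1.0))        # 6T for the 4-operation example
for n in (4, 10, 100, 10_000):
    print(n, round(speedup(n, 3), 3))   # 3n/(n+2): approaches 3 stages

# Non-ideal case: steady-state throughput is set by the slowest stage
# plus latch overhead (illustrative stage times, in units of T):
stage_times, t_latch = [1.0, 1.4, 0.9], 0.1
cycle = t_latch + max(stage_times)      # 1.5 per result in steady state
print(cycle)
```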
Rule of Thumb for Latency Lagging BW
In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4 (and capacity improves faster than bandwidth).
Stated alternatively: bandwidth improves by more than the square of the improvement in latency.

Latency Lags Bandwidth (last ~20 years)
Performance milestones (latency improvement, bandwidth improvement):
- Processor: '286, '386, '486, Pentium, Pentium Pro, Pentium 4 (21X, 2250X)
- Ethernet: 10 Mb, 100 Mb, 1000 Mb, 10000 Mb/s (16X, 1000X)
- Memory module: 16-bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4X, 120X)
- Disk: 3600, 5400, 7200, 10000, 15000 RPM (8X, 143X)
[Figure: log-log plot of relative bandwidth improvement vs. relative latency improvement for processor, network, memory, and disk, against the reference line "latency improvement = bandwidth improvement"; CPU is high and memory low (the "Memory Wall")]

Summary of Technology Trends
- For disk, LAN, memory, and microprocessor, bandwidth improves by more than the square of the latency improvement
- In the time that bandwidth doubles, latency improves by no more than 1.2X to 1.4X
- The lag is probably even larger in real systems, as bandwidth gains are multiplied by replicated components: multiple processors in a cluster or even in a chip; multiple disks in a disk array; multiple memory modules in a large memory; simultaneous communication in a switched LAN
- HW and SW developers should innovate assuming Latency Lags Bandwidth: if everything improves at the same rate, then nothing really changes; when rates vary, real innovation is required

Summary of Architecture Trends
- CMOS microprocessors focus on computing bandwidth with multiple cores
- Accelerators for specialized support
- Software to take advantage - Von Neumann design
- As nanoscale technologies emerge, new architectural areas are created: unconventional architectures (not programmed; they would operate more like the brain, through learning and inference), as well as new opportunities for microprocessor design

Backup slides for students

6 Reasons Latency Lags Bandwidth
1. Moore's Law helps BW more than latency
- Faster transistors, more transistors, and more pins help bandwidth:
  MPU transistors: 0.130 vs. 42M xtors (300X); DRAM transistors: 0.064 vs. 256M xtors (4000X); MPU pins: 68 vs. 423 (6X); DRAM pins: 16 vs. 66 (4X)
- Smaller, faster transistors, but they communicate over (relatively) longer lines, which limits latency:
  feature size: 1.5 to 3 vs. 0.18 micron (8X, 17X); MPU die size: 35 vs. 204 mm² (ratio sqrt ≈ 2X); DRAM die size: 47 vs. 217 mm² (ratio sqrt ≈ 2X)

6 Reasons Latency Lags Bandwidth (cont'd)
2. Distance limits latency
- Size of DRAM block: long bit and word lines account for most of DRAM access time
- Speed of light between computers on a network
3. Bandwidth is easier to sell ("bigger = better")
- E.g., 10 Gbits/s Ethernet ("10 Gig") vs. 10 msec latency Ethernet; 4400 MB/s DIMM ("PC4400") vs. 50 ns latency
- Even if it is just marketing, customers are now trained
- Since bandwidth sells, more resources are thrown at bandwidth, which further tips the balance

6 Reasons Latency Lags Bandwidth (cont'd)
4. Latency helps BW, but not vice versa
- Spinning a disk faster improves both bandwidth and rotational latency: 3600 RPM → 15000 RPM is 4.2X; average rotational latency drops 8.3 ms → 2.0 ms; other things being equal, it also helps BW by 4.2X
- Lower DRAM latency → more accesses per second (higher bandwidth)
- Higher linear density helps disk BW (and capacity), but not disk latency: 9,550 BPI → 533,000 BPI is ~60X in BW

6 Reasons Latency Lags Bandwidth (cont'd)
5. Bandwidth hurts latency
- Queues help bandwidth but hurt latency (queuing theory)
- Adding chips to widen a memory module increases bandwidth, but higher fan-out on the address lines may increase latency
6. Operating system overhead hurts latency more than bandwidth
- Long messages amortize overhead; overhead is a bigger part of short messages
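As a quick sanity check of the rule of thumb, the sketch below tests the four milestone (latency, bandwidth) pairs quoted earlier; the pairs come from the slides, while the code itself is purely illustrative:

```python
# Check "BW improvement > (latency improvement)^2" for the four
# milestone pairs quoted in the slides above.

milestones = {
    "Processor": (21, 2250),   # (latency improvement X, bandwidth improvement X)
    "Ethernet":  (16, 1000),
    "Memory":    (4, 120),
    "Disk":      (8, 143),
}

for name, (lat_x, bw_x) in milestones.items():
    holds = bw_x > lat_x ** 2
    print(f"{name:9s}: BW {bw_x}X vs latency^2 = {lat_x ** 2} -> rule holds: {holds}")
# All four print True: bandwidth outpaced the square of latency improvement.
```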