CS 258: Parallel Computer Architecture
CS 258, Spring 99
David E. Culler
Computer Science Division, U.C. Berkeley

Today’s Goal:
• Introduce you to Parallel Computer Architecture
• Answer your questions about CS 258
• Provide you with a sense of the trends that shape the field

What will you get out of CS258?
• In-depth understanding of the design and engineering of modern parallel computers
  – technology forces
  – fundamental architectural issues
    » naming, replication, communication, synchronization
  – basic design techniques
    » cache coherence, protocols, networks, pipelining, …
  – methods of evaluation
  – underlying engineering trade-offs
• from moderate to very large scale
• across the hardware/software boundary

Will it be worthwhile?
• Absolutely!
  – even though few of you will become PP designers
• The fundamental issues and solutions translate across a wide spectrum of systems
  – Crisp solutions in the context of parallel machines
• Pioneered at the thin end of the platform pyramid on the most demanding applications
  – migrate downward with time
• Understand the implications for software
[Figure: the platform pyramid: SuperServers, Departmental Servers, Workstations, Personal Computers]

Am I going to read my book to you?
• NO!
• The book provides a framework and complete background, so lectures can be more interactive
  – You do the reading
  – We’ll discuss it
• Projects will go “beyond”

What is Parallel Architecture?
• A parallel computer is a collection of processing elements that cooperate to solve large problems fast
• Some broad issues:
  – Resource Allocation:
    » how large a collection?
    » how powerful are the elements?
    » how much memory?
  – Data access, Communication and Synchronization
    » how do the elements cooperate and communicate?
    » how are data transmitted between processors?
    » what are the abstractions and primitives for cooperation?
  – Performance and Scalability
    » how does it all translate into performance?
    » how does it scale?

Why Study Parallel Architecture?
Role of a computer architect: to design and engineer the various levels of a computer system to maximize performance and programmability within the limits of technology and cost.
Parallelism:
• Provides an alternative to a faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view architecture
• Is increasingly central in information processing

Why Study it Today?
• History: diverse and innovative organizational structures, often tied to novel programming models
• Rapidly maturing under strong technological constraints
  – The “killer micro” is ubiquitous
  – Laptops and supercomputers are fundamentally similar!
  – Technological trends cause diverse approaches to converge
• Technological trends make parallel computing inevitable
• Need to understand fundamental principles and design tradeoffs, not just taxonomies
  – Naming, Ordering, Replication, Communication performance

Is Parallel Computing Inevitable?
• Application demands: our insatiable need for computing cycles
• Technology trends
• Architecture trends
• Economics
• Current trends:
  – Today’s microprocessors have multiprocessor support
  – Servers and workstations are becoming MP: Sun, SGI, DEC, COMPAQ!...
  – Tomorrow’s microprocessors are multiprocessors

Application Trends
• Application demand for performance fuels advances in hardware, which enable new applications, which...
  – The cycle drives an exponential increase in microprocessor performance
  – Drives parallel architecture harder
    » most demanding applications
[Figure: feedback cycle between New Applications and More Performance]
• Range of performance demands
  – Need a range of system performance with progressively increasing cost

Speedup
• Speedup(p processors) = Performance(p processors) / Performance(1 processor)
• For a fixed problem size (input data set), performance = 1/time
• Speedup-fixed-problem(p processors) = Time(1 processor) / Time(p processors)
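To make the definition concrete, here is a minimal C sketch (not from the slides; the timing values are hypothetical) that derives fixed-problem speedup and parallel efficiency from measured wall-clock times:

```c
#include <stdio.h>

/* Fixed-problem speedup: time on 1 processor / time on p processors. */
double speedup(double t1, double tp) { return t1 / tp; }

/* Efficiency: speedup per processor (1.0 = perfect linear scaling). */
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

int main(void) {
    double t1 = 120.0;  /* hypothetical: seconds on 1 processor   */
    double tp = 2.5;    /* hypothetical: seconds on 64 processors */
    int p = 64;
    printf("speedup    = %.1f\n", speedup(t1, tp));       /* 48.0 */
    printf("efficiency = %.2f\n", efficiency(t1, tp, p)); /* 0.75 */
    return 0;
}
```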
Commercial Computing
• Relies on parallelism for the high end
  – Computational power determines the scale of business that can be handled
• Databases, online transaction processing, decision support, data mining, data warehousing ...
• TPC benchmarks (TPC-C order entry, TPC-D decision support)
  – Explicit scaling criteria provided
  – Size of enterprise scales with size of system
  – Problem size not fixed as p increases
  – Throughput is the performance measure (transactions per minute, or tpm)

TPC-C Results for March 1996
[Chart: throughput (tpmC, up to 25,000) vs. number of processors (up to 120) for Tandem Himalaya, DEC Alpha, SGI PowerChallenge, HP PA, IBM PowerPC, and others]
• Parallelism is pervasive
• Small to moderate scale parallelism is very important
• Difficult to obtain a snapshot to compare across vendor platforms

Scientific Computing Demand
[Figure not reproduced in transcript]

Engineering Computing Demand
• Large parallel machines are a mainstay in many industries
  – Petroleum (reservoir analysis)
  – Automotive (crash simulation, drag analysis, combustion efficiency)
  – Aeronautics (airflow analysis, engine efficiency, structural mechanics, electromagnetism)
  – Computer-aided design
  – Pharmaceuticals (molecular modeling)
  – Visualization
    » in all of the above
    » entertainment (films like Toy Story)
    » architecture (walk-throughs and rendering)
  – Financial modeling (yield and derivative analysis)
  – etc.

Applications: Speech and Image Processing
[Chart: processing requirements (1 MIPS to 10 GIPS) vs. year (1980-1995): sub-band speech coding, 200-word isolated speech recognition, telephone number recognition, CELP speech coding, speaker verification, 1,000-word and 5,000-word continuous speech recognition, ISDN-CD stereo receiver, CIF video, HDTV receiver]
• Also CAD, databases, . . .
• 100 processors gets you 10 years; 1000 gets you 20!

Is better parallel arch enough?
• AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on the Cray C90; 406 MFLOPS for the final version on a 128-processor Intel Paragon; 891 MFLOPS on a 128-processor Cray T3D

Summary of Application Trends
• The transition to parallel computing has occurred for scientific and engineering computing
• Rapid progress in commercial computing
  – Database and transactions as well as financial
  – Usually smaller scale, but large-scale systems are also used
• Desktop also uses multithreaded programs, which are a lot like parallel programs
• Demand for improving throughput on sequential workloads
  – Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase

- - - Little break - - -

Technology Trends
[Chart: performance vs. year (1965-1995) for supercomputers, mainframes, minicomputers, and microprocessors; the microprocessor curve overtakes the rest]
• Today the natural building block is also the fastest!

Can’t we just wait for it to get faster?
• Microprocessor performance increases 50%-100% per year
• Transistor count doubles every 3 years
• DRAM size quadruples every 3 years
• The huge investment per generation is carried by a huge commodity market
[Chart: integer and FP performance, 1987-1992: Sun 4/260, MIPS M/120, MIPS M2000, IBM RS6000 540, HP 9000 750, DEC Alpha]
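To put these rates in perspective, here is a small back-of-the-envelope sketch (my own illustration; only the growth rates come from the slide above) that converts an annual improvement rate into a doubling time and a per-decade factor:

```c
#include <stdio.h>
#include <math.h>

/* Doubling time in years for annual growth rate r, from
 * (1 + r)^t = 2  =>  t = ln(2) / ln(1 + r). */
double doubling_time(double r) { return log(2.0) / log(1.0 + r); }

int main(void) {
    double rates[] = { 0.50, 1.00 };  /* 50% and 100% per year */
    for (int i = 0; i < 2; i++) {
        double r = rates[i];
        printf("%3.0f%%/yr: doubles every %.2f years, %5.0fx per decade\n",
               100.0 * r, doubling_time(r), pow(1.0 + r, 10.0));
    }
    return 0;  /* prints ~1.71 yr / 58x, then 1.00 yr / 1024x */
}
```

At 50% per year, performance roughly doubles every 1.7 years and grows about 58x per decade; at 100% per year it reaches 1024x, which brackets the "> 100x per decade" figure on the next slide.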
Technology: A Closer Look
• Basic advance is decreasing feature size (λ)
  – Circuits become either faster or lower in power
• Die size is growing too
  – Clock rate improves roughly in proportion to the improvement in λ
  – Number of transistors improves like λ² (or faster)
• Performance > 100x per decade
  – clock rate < 10x; the rest is transistor count
• How to use more transistors?
  – Parallelism in processing
    » multiple operations per cycle reduces CPI
  – Locality in data access (see the blocked matrix-multiply sketch after this slide)
    » avoids latency and reduces CPI
    » also improves processor utilization
  – Both need resources, so there is a tradeoff
[Figure: processor, cache ($), and interconnect sharing the chip]
• The fundamental issue is resource distribution, as in uniprocessors
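The locality point deserves a concrete example. The following blocked (tiled) matrix multiply is a minimal C sketch of a standard locality transformation, under the assumption that a few B x B tiles fit in cache; it is an illustration, not course code:

```c
#include <stdio.h>

#define N 256
#define B 32  /* tile size; assumed small enough that tiles stay in cache */

static double a[N][N], b[N][N], c[N][N];

/* Blocked matrix multiply, c += a * b. Each (ii, kk, jj) tile reuses a
 * B x B block of a and b many times while it is cache-resident, so far
 * fewer misses occur than with the naive triple loop over the full N. */
void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += a[i][k] * b[k][j];
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 2.0; }
    matmul_blocked();
    printf("c[0][0] = %.1f\n", c[0][0]);  /* N * 1.0 * 2.0 = 512.0 */
    return 0;
}
```

This is the same idea behind the "automatic blocking" mentioned later in the lecture.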
Growth Rates
[Charts: clock rate (MHz) and transistor count vs. year (1970-2005) for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000: clock rate grows about 30% per year, transistor count about 40% per year]

Architectural Trends
• Architecture translates technology’s gifts into performance and capability
• Resolves the tradeoff between parallelism and locality
  – Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
  – Tradeoffs may change with scale and technology advances
• Understanding microprocessor architectural trends
  => Helps build intuition about design issues of parallel machines
  => Shows the fundamental role of parallelism even in “sequential” computers

Phases in “VLSI” Generation
[Chart: transistor count vs. year (1970-2005), annotated with three phases: bit-level parallelism, instruction-level parallelism, and thread-level parallelism (?)]

Architectural Trends
• The greatest trend in the VLSI generation is the increase in parallelism
  – Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
    » slows after 32 bit
    » adoption of 64-bit now under way, 128-bit far off (not a performance issue)
    » great inflection point when a 32-bit micro and cache fit on a chip
  – Mid 80s to mid 90s: instruction-level parallelism
    » pipelining and simple instruction sets, plus compiler advances (RISC)
    » on-chip caches and functional units => superscalar execution
    » greater sophistication: out-of-order execution, speculation, prediction
      • to deal with control transfer and latency problems
  – Next step: thread-level parallelism

How far will ILP go?
[Charts: fraction of total cycles (%) vs. number of instructions issued (0 to 6+), and speedup vs. instructions issued per cycle (up to 15)]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – real caches and non-zero miss latencies

Thread-Level Parallelism “on board”
[Figure: four processors sharing a memory (MEM) over a bus]
• A micro on a chip makes it natural to connect many to shared memory (see the pthreads sketch at the end of this part)
  – dominates the server and enterprise market, moving down to the desktop
• Faster processors began to saturate the bus, then bus technology advanced
  – today there is a range of sizes for bus-based systems, from desktop to large servers

What about Multiprocessor Trends?
[Chart: number of processors (up to about 70) in fully configured commercial shared-memory systems, 1984-1998: from the Sequent B8000/B2100 and Symmetry through the SGI PowerSeries, Challenge, and PowerChallenge/XL, Sun SS690MP, SC2000, SS1000, E6000, CRAY CS6400, AS8400, HP K400, and P-Pro to the Sun E10000]

Bus Bandwidth
[Chart: shared bus bandwidth (MB/s, log scale from 10 to 100,000) vs. year (1984-1998) for the same systems, from the Sequent B8000 up to the Sun E10000]

What about Storage Trends?
• Divergence between memory capacity and speed is even more pronounced
  – Capacity increased 1000x from 1980-95; speed only 2x
  – Gigabit DRAM by c. 2000, but the gap with processor speed is much greater
• Larger memories are slower, while processors get faster
  – Need to transfer more data in parallel
  – Need deeper cache hierarchies
  – How to organize caches?
• Parallelism increases the effective size of each level of the hierarchy, without increasing access time
• Parallelism and locality within memory systems too
  – New designs fetch many bits within the memory chip, then follow with a fast pipelined transfer across a narrower interface
  – Buffer caches hold the most recently accessed data
• Disks too: parallel disks plus caching

Economics
• Commodity microprocessors are not only fast but CHEAP
  – Development costs tens of millions of dollars
  – BUT many more are sold compared to supercomputers
  – Crucial to take advantage of the investment and use the commodity building block
• Multiprocessors are being pushed by software vendors (e.g., database) as well as hardware vendors
• Standardization makes small, bus-based SMPs a commodity
• Desktop: a few smaller processors versus one larger one?
• Multiprocessor on a chip?
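To ground the shared-memory model that these bus-based SMPs present, here is a minimal POSIX-threads sketch of a parallel reduction; the decomposition and the names are illustrative, not from the slides:

```c
#include <pthread.h>
#include <stdio.h>

/* Shared-memory parallelism in the style of the bus-based SMPs above:
 * threads share one address space, so they communicate through ordinary
 * loads and stores; a mutex synchronizes the one shared update. */
#define NTHREADS 4
#define NELEMS 1000000

static double data[NELEMS];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *partial_sum(void *arg) {
    long id = (long)arg;
    long chunk = NELEMS / NTHREADS;
    double local = 0.0;
    for (long i = id * chunk; i < (id + 1) * chunk; i++)
        local += data[i];          /* read shared data without locking */
    pthread_mutex_lock(&lock);     /* serialize only the final update  */
    total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NELEMS; i++) data[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, partial_sum, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %.0f\n", total);  /* 1000000 */
    return 0;
}
```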
Can we see some hard evidence?

Consider Scientific Supercomputing
• Proving ground and driver for innovative architecture and techniques
  – Market smaller relative to commercial as MPs become mainstream
  – Dominated by vector machines starting in the 70s
  – Microprocessors have made huge gains in floating-point performance
    » high clock rates
    » pipelined floating-point units (e.g., multiply-add every cycle)
    » instruction-level parallelism
    » effective use of caches (e.g., automatic blocking)
  – Plus economics
• Large-scale multiprocessors replace vector supercomputers

Raw Uniprocessor Performance: LINPACK
[Chart: LINPACK MFLOPS (1 to 10,000) vs. year (1975-2000), CRAY vs. microprocessors at n = 100 and n = 1,000: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94 versus Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, HP 9000/735, DEC Alpha AXP, DEC Alpha, MIPS R4400, IBM Power2/990, DEC 8200]

Raw Parallel Performance: LINPACK
[Chart: LINPACK GFLOPS (0.1 to 10,000) vs. year (1985-1996), MPP peak vs. CRAY peak: Xmp/416 (4), Ymp/832 (8), nCUBE/2 (1024), iPSC/860, CM-2, CM-200, Delta, C90 (16), CM-5, Paragon XP/S, T932 (32), Paragon XP/S MP (1024), Paragon XP/S MP (6768), T3D, ASCI Red]
• Even the vector Crays became parallel
  – X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
• Since 1993, Cray has produced MPPs too (T3D, T3E)

500 Fastest Computers
[Chart: number of TOP500 systems by type at 11/93, 11/94, 11/95, and 11/96: MPP 187, 239, 284, 319; PVP 313, 198, 110, 73; SMP 63, 106, 106 from 11/94 on]

Summary: Why Parallel Architecture?
• Increasingly attractive
  – Economics, technology, architecture, application demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
  – Instruction-level parallelism
  – Multiprocessor servers
  – Large-scale multiprocessors (“MPPs”)
• Focus of this class: the multiprocessor level of parallelism
• The same story holds from the memory-system perspective
  – Increase bandwidth and reduce average latency with many local memories
• A spectrum of parallel architectures makes sense
  – Different cost, performance, and scalability

Where is Parallel Arch Going?
Old view: divergent architectures, no predictable pattern of growth.
[Figure: application software and system software sitting atop divergent architectures: systolic arrays, SIMD, message passing, dataflow, shared memory]
• Uncertainty of direction paralyzed parallel software development!

Today
• Extension of “computer architecture” to support communication and cooperation
  – Instruction Set Architecture plus Communication Architecture
• Defines
  – Critical abstractions, boundaries, and primitives (interfaces)
  – Organizational structures that implement the interfaces (hw or sw)
• Compilers, libraries, and the OS are important bridges today

Modern Layered Framework
[Figure: parallel applications (CAD, database, scientific modeling) over programming models (multiprogramming, shared address, message passing, data parallel) over the communication abstraction (the user/system boundary, realized by compilation or library and operating systems support) over communication hardware (the hardware/software boundary) over the physical communication medium]
(A minimal message-passing sketch appears after the final slide.)

How will we spend our time?
http://www.cs.berkeley.edu/~culler/cs258-s99/schedule.html

How will grading work?
• 30% homeworks (6)
• 30% exam
• 30% project (teams of 2)
• 10% participation

Any other questions?
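As a companion to the shared-memory sketch above, here is the same reduction in the message-passing model from the layered framework. It assumes MPI, and it illustrates the model rather than reproducing course code:

```c
#include <mpi.h>
#include <stdio.h>

/* Message-passing version of the earlier reduction: each process owns a
 * private slice of the data, and results are combined through explicit
 * communication (MPI_Reduce) rather than shared loads and stores. */
#define N_PER_PROC 250000

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0;
    for (long i = 0; i < N_PER_PROC; i++)
        local += 1.0;                /* each process sums its own slice */

    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %.0f over %d processes\n", total, nprocs);

    MPI_Finalize();
    return 0;
}
```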