Transcript Document
Topic 3 – II: System Software Fundamentals: Multithreaded Execution Models, Virtual Machines and Memory Models

Guang R. Gao
ACM Fellow and IEEE Fellow
Endowed Distinguished Professor
Electrical & Computer Engineering, University of Delaware
[email protected]

CPEG421-2001-F-Topic-3-II 1

Outline
• An introduction to parallel program execution models
• Coarse-grain vs. fine-grain multithreading
• Evolution of fine-grain multithreaded program execution models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual machine models for peta-scale computing: a case study on HTMT/EARTH

Terminology Clarification
• Parallel Model of Computation
  – Parallel Models for Algorithm Designers
  – Parallel Models for System Designers
• Parallel Programming Models
• Parallel Execution Models
• Parallel Architecture Models

System Characterization
Questions:
Q1: What characteristics of a computational system are required …
Q2: The diversity of existing and potential multi-core architectures …
Response:
R1: An important characteristic of such a compiler is that it should include, at both the chip level and the system level, a program execution model, which should at least include the specification and API.
[Gao, ECCD Workshop, Washington D.C., Nov. 2007]

What Does Program Execution Model (PXM) Mean?
• The notion of PXM: the program execution model (PXM) is the basic low-level abstraction of the underlying system architecture upon which our programming model, compilation strategy, runtime system, and other software components are developed.
• The PXM (and its API) serves as an interface between the architecture and the software.
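The idea of the PXM as an interface between architecture and software can be made concrete with a small sketch. The class below is a hypothetical, minimal thread-virtual-machine API (all names – ToyPXM, spawn, put, get – are my own illustrations, not from the slides): it bundles a thread model (spawn/join), a synchronization model (a lock), and a memory model (put/get against one shared store), the three components a shared-memory PXM must specify.

```python
import threading

class ToyPXM:
    """A hypothetical, minimal program-execution-model (PXM) API sketch.

    Bundles the three components a shared-memory PXM must specify:
    a thread model (spawn/join_all), a synchronization model (a lock),
    and a memory model (put/get against a single shared store).
    """

    def __init__(self):
        self._store = {}               # shared address space
        self._lock = threading.Lock()  # synchronization primitive
        self._threads = []

    def spawn(self, fn, *args):
        t = threading.Thread(target=fn, args=args)
        t.start()
        self._threads.append(t)

    def join_all(self):
        for t in self._threads:
            t.join()

    def put(self, addr, value):
        with self._lock:               # writes to the store are serialized
            self._store[addr] = value

    def get(self, addr):
        with self._lock:
            return self._store.get(addr)

# Usage: two threads communicate only through the shared store.
vm = ToyPXM()
vm.spawn(lambda: vm.put("x", 21))
vm.join_all()
vm.spawn(lambda: vm.put("y", vm.get("x") * 2))
vm.join_all()
print(vm.get("y"))  # 42
```

Software above this interface (compiler, runtime) would target spawn/put/get rather than any particular processor, which is the sense in which the PXM decouples software from the machine.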
Program Execution Model (PXM) – Cont'd
Unlike an instruction set architecture (ISA) specification, which usually focuses on lower-level details (such as instruction encoding and the organization of registers for a specific processor), the PXM refers to machine organization at a higher level, for a whole class of high-end machines, as viewed by the users. [Gao et al., 2000]

What is your "Favorite" Program Execution Model?

A Generic MIMD Architecture
Node: processor(s) and memory system, plus a communication assist (network interface & communication controller), connected by a scalable interconnection network (full-feature interconnect, packet-switching fabric).
Objective: make efficient use of scarce communication resources – providing high-bandwidth, low-latency communication between nodes at minimum cost and energy.

Programming Models for Multiprocessor Systems
• Message Passing Model
  – Multiple address spaces
  – Communication can only be achieved through "messages"
• Shared Memory Model
  – Memory address space is accessible to all processors
  – Communication is achieved through memory

Comparison
Message Passing:
+ Less contention
+ Highly scalable
+ Simplified synchronization – message passing combines synchronization with communication
+ But this does not mean it is highly programmable
– Load balancing
– Deadlock prone
– Overhead of small messages
Shared Memory:
+ Global shared address space
+ Easy to program (?)
+ No (explicit) message passing (communication through memory put/get operations)
– Synchronization (memory consistency models, cache models)
– Scalability

What is a Shared Memory Execution Model?
The Thread Virtual Machine
• Thread Model: a set of rules for creating, destroying, and managing threads
• Synchronization Model: provides a set of mechanisms to protect from data races
• Memory Model: dictates the ordering of memory operations

Essential Aspects in User-Level Shared Memory Support
• Shared address space support and management
• Access control and management
  – Memory consistency model (MCM)
  – Cache management mechanism

Grand Challenge Problems
• How to build a shared-memory multiprocessor that is scalable both within a multi-core/many-core chip and across a system with many chips?
• How to program and optimize application programs?
Our view: one major obstacle in solving these problems is the memory coherence assumption in today's hardware-centric memory consistency models.

A Parallel Execution Model
• Execution / Architecture Model
• Application Programming Interface (API)
• Thread Model
• Synchronization Model
• Memory Model

A Parallel Execution Model – Our Model
• Execution / Architecture Model
• Application Programming Interface (API)
• Fine-grained multithreaded model
• Fine-grained synchronization model
• Memory adaptive/aware model – with dataflow origins

Comment on OS Impact
• Should the compiler be OS-aware too? If so, how?
• Or other alternatives? Compiler-controlled runtimes, or compiler-aware kernels, etc.
• Example: software pipelining …
[Gao, ECCD Workshop, Washington D.C., Nov. 2007]

Outline
• An introduction to multithreaded program execution models
• Coarse-grain vs. fine-grain parallel execution models – a historical overview
• Fine-grain multithreaded program execution models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual machine models for extreme-scale machines: a case study on HTMT/EARTH

Coarse-Grain Execution Models
• The Single Instruction Multiple Data (SIMD) Model: a pipelined vector unit or an array of processors
• The Single Program Multiple Data (SPMD) Model: each processor runs a copy of the same program
• The Data Parallel Model: parallel tasks operate over a shared data structure

Data Parallel Model Limitations
• Difficult to write unstructured programs
• Convenient only for problems with regular, structured parallelism
• Execution alternates between compute and communication phases – limited composability!
• An inherent limitation of coarse-grain multithreading

Dataflow Model of Computation
[Animation over slides 20–24: a dataflow graph with inputs a, b, c, d and output e, in which two + nodes feed a * node. Tokens a = 1, b = 3 and c = 4, d = 3 arrive; the + nodes fire, producing 4 and 7; the * node fires, producing 28. Repeated firing of the graph over successive input sets is dataflow software pipelining.]

Outline
• An introduction to multithreaded program execution models
• Coarse-grain vs. fine-grain parallel execution models – a historical overview
• Fine-grain multithreaded program execution models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual machine models for peta-scale machines: a case study on HTMT/EARTH

Coarse-Grain vs. Fine-Grain Multithreading
[Figure: two sketches of CPU, memory, and thread unit. Coarse-grain: a single thread occupies the executor locus – "the family home model". Fine-grain, non-preemptive: a pool of threads shares the executor locus – "the hotel model".]
[Gao: invited talk at Fran Allen's Retirement Workshop, 07/2002]

Evolution of Multithreaded Execution and Architecture Models
Non-dataflow based:
• CDC 6600 (1964); Flynn's processor (1969); CHoPP'77; CHoPP'87
• HEP (B. Smith, 1978); Tera (B. Smith, 1990–); Eldorado; CASCADE
• MASA (Halstead, 1986); Alewife (Agarwal, 1989–96)
• Cosmic Cube (Seitz, 1985)
• J-Machine (Dally, 1988–93); M-Machine (Dally, 1994–98)
• Others: Multiscalar (1994), SMT (1995), etc.
Dataflow-model inspired:
• Static Dataflow (Dennis, 1972); LAU (Syre, 1976); MIT TTDA (Arvind, 1980)
• Manchester (Gurd & Watson, 1982); SIGMA-I (Shimada, 1988)
• Monsoon (Papadopoulos & Culler, 1988); P-RISC (Nikhil & Arvind, 1989); *T/Start-NG (MIT/Motorola, 1991–)
• Iannucci's (1988–92); TAM (Culler, 1990); Cilk (Leiserson); EM-5/4/X, RWC-1 (1992–97)
• MIT Argument-Fetching Dataflow (Dennis & Gao, 1987–88); MDFA (Gao, 1989–93); MTA (Hum, Theobald & Gao, 1994); EARTH (PACT'95, ISCA'96, Theobald '99); CARE (Marquez, 2004)

The Von Neumann-type Processing
[Figure: a sequential source program (begin … for i = 1 … endfor … end) is compiled into a sequential machine representation and loaded into the CPU of a von Neumann processor.]

A Multithreaded Architecture
[Figure: one processing element (PE), connected to the other PEs.]

McGill Dataflow Architecture Model (MDFA)

Argument-Flow vs. Argument-Fetching
[Figure: under the argument-flow principle, node n1 stores its result into the operand slots of successors n2 and n3; under the argument-fetching principle, n2 and n3 fetch their operands themselves when they execute.]

A Dataflow Program Tuple
Program Tuple = { P-Code, S-Code }
P-Code:
N1: x = a + b;
N2: y = c – d;
N3: z = x * y;
[Figure: the P-code resides in the instruction processing unit (IPU); the S-code – the signal arcs among n1, n2, n3 together with their signal counts – resides in the instruction scheduling unit (ISU).]

The McGill Dataflow Architecture Model
• Pipelined Instruction Processing Unit (PIPU)
• Dataflow Instruction Scheduling Unit (DISU), with an enable memory & controller and signal processing
• PIPU and DISU are coupled by "fire" and "done" signals
Important feature: the pipeline can be kept fully utilized provided that the program has sufficient parallelism. The pool of enabled instructions (as opposed to the waiting instructions) plays the role of the program counter.

The Scheduling (Enable) Memory
[Figure: inside the DISU, the enable memory holds a signal count per instruction; the controller processes "done" signals, decrements counts, and moves instructions from waiting to enabled when their counts reach zero.]

Advantages of the McGill Dataflow Architecture Model
• Eliminates unnecessary token copying and transmission overhead
• Instruction scheduling is separated from the main datapath of the processor (i.e., asynchronous and decoupled)

Von Neumann Threads as Macro Dataflow Nodes
• A sequence of instructions (1, 2, 3, …, k) is "packed" into a macro dataflow node
• Synchronization is done at the macro-node level

Hybrid Evaluation: "Von Neumann Style Instruction Execution" on the McGill Dataflow Architecture
• Group a "sequence" of dataflow instructions into a "thread", i.e. a macro dataflow node
• Data-driven synchronization among threads
• "Von Neumann style" sequencing within a thread
Advantage: preserves the parallelism among threads but avoids unnecessary fine-grain synchronization between instructions within a sequential thread.

What Do We Get?
• A hybrid architecture model without sacrificing the advantages of fine-grain parallelism!
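The signal-count scheduling that the DISU performs can be sketched in a few lines of code. The simulator below is a rough, illustrative model (the data structures and function name are my own, not the MDFA hardware): each instruction carries a signal count; "done" decrements the counts of its successors; an instruction becomes enabled and may fire when its count reaches zero. It runs the three-node P-code/S-code example from slide 32 (x = a + b, y = c – d, z = x * y).

```python
# Sketch of DISU-style signal-count scheduling (illustrative model,
# not the actual MDFA microarchitecture). An instruction waits for a
# fixed number of signals; firing it runs its P-code; its "done"
# signals decrement the counts of its S-code successors.

def run_dataflow(instrs, signals, counts, env):
    enabled = [i for i, c in counts.items() if c == 0]
    while enabled:
        node = enabled.pop()           # fire any enabled instruction
        instrs[node](env)              # the PIPU executes its P-code
        for succ in signals[node]:     # "done": signal S-code successors
            counts[succ] -= 1
            if counts[succ] == 0:      # count exhausted => enabled
                enabled.append(succ)
    return env

# P-code: the three instructions from slide 32.
instrs = {
    "N1": lambda e: e.update(x=e["a"] + e["b"]),
    "N2": lambda e: e.update(y=e["c"] - e["d"]),
    "N3": lambda e: e.update(z=e["x"] * e["y"]),
}
# S-code: N3 waits for two signals, one each from N1 and N2.
signals = {"N1": ["N3"], "N2": ["N3"], "N3": []}
counts = {"N1": 0, "N2": 0, "N3": 2}

env = run_dataflow(instrs, signals, counts, {"a": 2, "b": 3, "c": 9, "d": 4})
print(env["z"])  # (2+3) * (9-4) = 25
```

Note that N1 and N2 are both enabled at the start and may fire in either order; N3 fires only after receiving both signals – exactly the decoupling of scheduling from the datapath that the model advertises.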
(latency hiding, pipelining support)

A Realization of the Hybrid Evaluation
[Figure: the PIPU executes instructions 1, 2, …, k of a thread; a "Von Neumann bit" on each instruction lets the PIPU take a shortcut to the next instruction in sequence, bypassing the DISU; "fire" and "done" connect the PIPU to the DISU only at thread boundaries.]
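The hybrid scheme can likewise be sketched in code. In the toy scheduler below (again an illustrative model, not the actual hardware), a macro node is a straight-line list of instructions: executing the list in order models the Von Neumann-bit shortcut that keeps control inside the PIPU, and only the end of the node goes back through DISU-style signaling to enable successor macro nodes.

```python
# Sketch of hybrid evaluation (illustrative). Within a macro node,
# sequencing is Von Neumann style: the next instruction is taken
# directly (the shortcut), bypassing the scheduler. Data-driven
# signal-count synchronization happens only between macro nodes.

def run_hybrid(threads, signals, counts, env):
    enabled = [t for t, c in counts.items() if c == 0]
    while enabled:
        tid = enabled.pop()            # fire an enabled macro node
        for instr in threads[tid]:     # shortcut: sequential execution
            instr(env)                 # inside the node, no DISU trips
        for succ in signals[tid]:      # only the node's end signals
            counts[succ] -= 1
            if counts[succ] == 0:
                enabled.append(succ)
    return env

# Three macro nodes; T1 and T2 are two-instruction sequential threads.
threads = {
    "T1": [lambda e: e.update(x=e["a"] + e["b"]),
           lambda e: e.update(x=e["x"] * 2)],
    "T2": [lambda e: e.update(y=e["c"] - e["d"]),
           lambda e: e.update(y=e["y"] + 1)],
    "T3": [lambda e: e.update(z=e["x"] * e["y"])],
}
signals = {"T1": ["T3"], "T2": ["T3"], "T3": []}
counts = {"T1": 0, "T2": 0, "T3": 2}   # T3 waits on T1 and T2

env = run_hybrid(threads, signals, counts, {"a": 1, "b": 3, "c": 10, "d": 3})
print(env["z"])  # ((1+3)*2) * ((10-3)+1) = 64
```

Compared with the pure dataflow sketch, the two instructions inside T1 (and inside T2) need no signal counts at all; only three inter-thread synchronizations remain – the saving the hybrid model claims.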