Transcript Document

Topic 3 -- II:
System Software Fundamentals:
Multithreaded Execution Models, Virtual Machines
and Memory Models
Guang R. Gao
ACM Fellow and IEEE Fellow
Endowed Distinguished Professor
Electrical & Computer Engineering
University of Delaware
[email protected]
Outline
• An introduction to parallel program execution
models
• Coarse-grain vs. fine-grain multithreading
• Evolution of fine-grain multithreaded program
execution models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual
machine models for peta-scale computing: a case
study on HTMT/EARTH
Terminology Clarification
• Parallel Model of Computation
– Parallel Models for Algorithm Designers
– Parallel Models for System Designers
• Parallel Programming Models
• Parallel Execution Models
• Parallel Architecture Models
System Characterization
Questions:
Q1: What characteristics of a computational system are
required …
Q2: The diversity of existing and potential multi-core
architectures…
Response:
R1: An important characteristic of such a compiler is that it
should include, at both the chip level and the system level, a
program execution model that at least includes the specification
and API
Gao, ECCD Workshop, Washington D.C., Nov. 2007
What Does Program Execution
Model (PXM) Mean?
• The notion of PXM
The program execution model (PXM) is the basic
low-level abstraction of the underlying system
architecture upon which our programming model,
compilation strategy, runtime system, and other
software components are developed.
• The PXM (and its API) serves as an interface
between the architecture and the software.
Program Execution Model (PXM)
– Cont’d
Unlike an instruction set architecture (ISA)
specification, which usually focuses on
lower level details (such as instruction
encoding and organization of registers for a
specific processor), the PXM refers to
machine organization at a higher level for a
whole class of high-end machines, as viewed
by the users.
Gao et al., 2000
What is your
“Favorite”
Program Execution Model?
A Generic MIMD Architecture
Node: Processor(s), Memory System
plus Communication assist (Network
Interface & Communication Controller)
[Diagram: each node contains processors (P) with caches ($), memory, and a communication assist (network interface, NIC), attached to a full-feature, packet-switching interconnection network.]
Key: a scalable network.
Objective: make efficient use of scarce communication resources, providing high-bandwidth, low-latency communication between nodes at minimum cost and energy.
Programming Models for Multi-Processor Systems
• Message Passing Model
– Multiple address spaces
– Communication can only be achieved through "messages"
• Shared Memory Model
– Memory address space is accessible to all
– Communication is achieved through memory
[Diagram: message-passing processors each have a local memory and exchange messages; shared-memory processors all access one global memory.]
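To make the contrast concrete, here is a minimal sketch (not from the slides) showing both styles on a single POSIX machine: shared-memory communication through an ordinary variable guarded by a mutex and a join, and message passing through a pipe between two processes with separate address spaces. The names and structure are illustrative only.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* --- Shared Memory Model: threads communicate through memory --- */
static int shared_value;                           /* visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);                     /* explicit synchronization */
    shared_value = 42;                             /* "communication" is just a store */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* --- Message Passing Model: processes communicate through messages --- */
static void message_passing_demo(void)
{
    int fd[2];
    int msg;

    if (pipe(fd) != 0)
        return;
    if (fork() == 0) {                             /* child: its own address space */
        msg = 42;
        write(fd[1], &msg, sizeof msg);            /* send */
        _exit(0);
    }
    read(fd[0], &msg, sizeof msg);                 /* receive (also synchronizes) */
    printf("message passing: received %d\n", msg);
}

int main(void)
{
    pthread_t t;

    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);                         /* join makes the store visible */
    printf("shared memory: read %d\n", shared_value);

    message_passing_demo();
    return 0;
}

In both cases the same value is transferred; what changes is whether the address space is shared (a store plus explicit synchronization) or disjoint (an explicit send and receive, which also synchronizes).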
Comparison
Message Passing
+ Less contention
+ Highly scalable
+ Simplified synchronization (message passing = synchronization + communication), but this does not mean it is highly programmable
- Load balancing
- Deadlock prone
- Overhead of small messages

Shared Memory
+ Global shared address space
+ Easy to program (?)
+ No (explicit) message passing (e.g., communication through memory put/get operations)
- Synchronization (memory consistency models, cache models)
- Scalability
What is A Shared Memory
Execution Model?
The Thread Virtual Machine (the execution model) consists of:
• Thread Model: a set of rules for creating, destroying and managing threads
• Synchronization Model: provides a set of mechanisms to protect against data races
• Memory Model: dictates the ordering of memory operations
Essential Aspects in User-Level
Shared Memory Support?
• Shared address space support and
management
• Access control and management
- Memory consistency model (MCM)
- Cache management mechanism
Grand Challenge Problems
• How to build a shared-memory multiprocessor that is
scalable both within a multi-core/many-core chip and
across a system with many chips?
• How to program and optimize application programs?
Our view: one major obstacle in solving these problems is
the memory coherence assumption in today's hardware-centric memory consistency models.
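As a concrete illustration of why the memory consistency model matters, here is a minimal litmus-test sketch (not taken from the slides) using C11 atomics; the relaxed orderings are the assumption being demonstrated. Each thread stores a flag and then loads the other thread's flag; on hardware with a weak, hardware-centric memory model both loads may return 0, an outcome impossible under sequential consistency.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Classic "store buffering" litmus test. Under sequential consistency at
 * least one thread must observe the other's store; with relaxed ordering
 * (and store buffers in hardware) r0 == 0 && r1 == 0 is also allowed. */
static atomic_int x, y;
static int r0, r1;

static void *t0_body(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *t1_body(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, t0_body, NULL);
    pthread_create(&b, NULL, t1_body, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);

    /* Run repeatedly to observe the weak-ordering outcome. */
    printf("r0 = %d, r1 = %d\n", r0, r1);
    return 0;
}

Changing all four accesses to memory_order_seq_cst forbids the (0, 0) outcome; that stronger ordering is exactly the kind of guarantee a hardware-centric consistency model has to pay for.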
A Parallel Execution Model
[Diagram: the Execution/Architecture Model and its Application Programming Interface (API) are built from a Thread Model, a Synchronization Model, and a Memory Model.]
A Parallel Execution Model
[Diagram, "Our Model", with dataflow origins: the Execution/Architecture Model and its Application Programming Interface (API) are built from a Fine-Grained Multithreaded Model, a Fine-Grained Synchronization Model, and a Memory Adaptive/Aware Model.]
Comment on OS impact?
• Should the compiler be OS-aware too? If so,
how?
• Or are there other alternatives? Compiler-controlled
runtimes, compiler-aware kernels, etc.
• Example: software pipelining …
Gao, ECCD Workshop, Washington D.C., Nov. 2007
Outline
• An introduction to multithreaded program
execution models
• Coarse-grain vs. fine-grain parallel execution
models – a historical overview
• Fine-grain multithreaded program execution
models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual
machine models for extreme-scale machines: a
case study on HTMT/EARTH
Coarse-Grain Execution Models
• The Single Instruction Multiple Data (SIMD) Model: a pipelined vector unit or an array of processors
• The Single Program Multiple Data (SPMD) Model
[Diagram: the same program is loaded onto every processor.]
• The Data Parallel Model
[Diagram: parallel tasks each operate on a section of a shared data structure.]
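As an illustration of the SPMD/data-parallel style, here is a minimal sketch (not from the slides; the chunking scheme and names are assumptions): every worker runs the same program text, parameterized only by its rank, over its own section of a shared data structure.

#include <pthread.h>
#include <stdio.h>

#define N        1000
#define WORKERS  4

static int  data[N];
static long partial[WORKERS];

/* The same program runs on every "processor"; only the rank differs. */
static void *worker(void *arg)
{
    long rank  = (long)arg;
    long chunk = N / WORKERS;
    long lo    = rank * chunk;
    long hi    = (rank == WORKERS - 1) ? N : lo + chunk;

    for (long i = lo; i < hi; i++)
        partial[rank] += data[i];        /* each task owns one section */
    return NULL;
}

int main(void)
{
    pthread_t t[WORKERS];
    long total = 0;

    for (long i = 0; i < N; i++)
        data[i] = 1;

    for (long r = 0; r < WORKERS; r++)   /* compute phase */
        pthread_create(&t[r], NULL, worker, (void *)r);
    for (long r = 0; r < WORKERS; r++)   /* communication/join phase */
        pthread_join(t[r], NULL);

    for (long r = 0; r < WORKERS; r++)
        total += partial[r];
    printf("sum = %ld\n", total);        /* prints 1000 */
    return 0;
}

The rigid, identical structure of every worker is what makes the model convenient, and it is also what the next slide identifies as its limitation.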
Data Parallel Model
Limitations
• Difficult to write unstructured programs
• Convenient only for problems with regular, structured parallelism
[Diagram: execution alternates between compute phases and communication phases.]
Limited composability! This is an inherent limitation of coarse-grain multithreading.
Dataflow Model of Computation
[Animation: a dataflow graph with two + nodes feeding a * node computes e = (a + b) * (c + d) from input tokens a = 1, b = 3, c = 4, d = 3.]
• The input tokens arrive on the arcs of the two + nodes.
• The first + node fires, producing a token carrying 4.
• The second + node fires, producing a token carrying 7; the * node now holds tokens 4 and 7.
• The * node fires, producing the result token 28 on the output arc.
• A new wave of input tokens (1, 3 and 4, 3) enters the graph while the previous result is still in flight: Dataflow Software Pipelining.
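The firing rule illustrated above can be written down directly. The sketch below is an assumed, illustrative structure; the graph e = (a + b) * (c + d) and the token values 1, 3, 4, 3 are read off the snapshots. A node fires as soon as both of its operand tokens are present, consumes the tokens, and forwards the result token to its consumer.

#include <stdbool.h>
#include <stdio.h>

typedef struct {
    char op;            /* '+' or '*'                        */
    int  operand[2];    /* operand slots                     */
    bool present[2];    /* token-present flags               */
    int  dest;          /* index of consumer node, -1 = out  */
    int  dest_port;     /* which operand slot of the consumer */
} Node;

static Node graph[3] = {
    { '+', {0, 0}, {false, false},  2, 0 },   /* n1: a + b -> n3, left  */
    { '+', {0, 0}, {false, false},  2, 1 },   /* n2: c + d -> n3, right */
    { '*', {0, 0}, {false, false}, -1, 0 },   /* n3: produces e         */
};

static void send_token(int node, int port, int value)
{
    graph[node].operand[port] = value;
    graph[node].present[port] = true;
}

int main(void)
{
    /* input tokens arrive on the arcs */
    send_token(0, 0, 1);  send_token(0, 1, 3);   /* a = 1, b = 3 */
    send_token(1, 0, 4);  send_token(1, 1, 3);   /* c = 4, d = 3 */

    bool fired = true;
    while (fired) {                              /* keep firing enabled nodes */
        fired = false;
        for (int i = 0; i < 3; i++) {
            Node *n = &graph[i];
            if (!n->present[0] || !n->present[1])
                continue;                        /* not enabled yet */
            int v = (n->op == '+') ? n->operand[0] + n->operand[1]
                                   : n->operand[0] * n->operand[1];
            n->present[0] = n->present[1] = false;   /* consume the tokens */
            if (n->dest < 0)
                printf("e = %d\n", v);               /* prints e = 28 */
            else
                send_token(n->dest, n->dest_port, v);
            fired = true;
        }
    }
    return 0;
}

Running it prints e = 28, matching the last snapshot before the new wave of tokens enters the graph.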
Outline
• An introduction to multithreaded program
execution models
• Coarse-grain vs. fine-grain parallel execution
models – A Historical Overview
• Fine-grain multithreaded program execution
models
• Memory and synchronization models
• Fine-grain multithreaded execution and virtual
machine models for peta-scale machines: a case
study on HTMT/EARTH
Coarse-Grain vs. Fine-Grain Multithreading
[Diagram: two CPU-memory nodes compared. Coarse-grain thread, the "family home" model: the thread unit is the executor locus of a single thread. Fine-grain non-preemptive thread, the "hotel" model: the thread unit is the executor locus of a pool of threads.]
[Gao: invited talk at Fran Allen's Retirement Workshop, 07/2002]
Evolution of Multithreaded
Execution and Architecture Models
Non-dataflow based:
• CDC 6600 (1964); Flynn's Processor (1969)
• CHoPP '77; CHoPP '87
• HEP (B. Smith, 1978)
• Cosmic Cube (Seitz, 1985)
• MASA (Halstead, 1986)
• Alewife (Agarwal, 1989-96)
• J-Machine (Dally, 1988-93); M-Machine (Dally, 1994-98)
• Tera (B. Smith, 1990-); Eldorado; CASCADE
• Others: Multiscalar (1994), SMT (1995), etc.

Dataflow model inspired:
• Static Dataflow (Dennis, 1972)
• LAU (Syre, 1976)
• MIT TTDA (Arvind, 1980)
• Manchester (Gurd & Watson, 1982)
• SIGMA-1 (Shimada, 1988)
• Monsoon (Papadopoulos & Culler, 1988)
• Iannucci's (1988-92)
• P-RISC (Nikhil & Arvind, 1989)
• TAM (Culler, 1990)
• *T/Start-NG (MIT/Motorola, 1991-)
• EM-5/4/X; RWC-1 (1992-97)
• Cilk (Leiserson)
• MIT Arg-Fetching Dataflow (Dennis & Gao, 1987-88)
• MDFA (Gao, 1989-93)
• MTA (Hum, Theobald & Gao, 1994)
• EARTH (PACT '95, ISCA '96, Theobald '99)
• CARE (Marquez, 2004)
The Von Neumann-type
Processing
[Diagram: source code (begin ... for i = 1 ... endfor ... end) is compiled into a sequential machine representation and loaded onto the CPU of a single processor.]
A Multithreaded Architecture
[Diagram: one processing element (PE), with links to other PEs.]
McGill Data Flow
Architecture Model
(MDFA)
[Diagram: under the argument-flow principle, node n1's result flows as tokens directly to its successors n2 and n3; under the argument-fetching principle, n1 stores its result and n2 and n3 fetch it when they execute.]
A Dataflow Program Tuple
Program Tuple = { P-Code, S-Code }
P-Code (the instructions):
N1: x = a + b;
N2: y = c - d;
N3: z = x * y;
S-Code (the signal graph):
[Diagram: the signal graph for nodes n1, n2, n3 with their signal counts; instructions are executed by the instruction processing unit (IPU) and scheduled by the instruction scheduling unit (ISU).]
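A minimal sketch of how such a program tuple can be scheduled (illustrative data structures, not the actual MDFA encoding): the P-code is the list of instructions, and the S-code records, for each instruction, how many enabling signals it needs and which instructions it signals when done. An instruction fires only when its signal count is satisfied. The input values are assumed for the example.

#include <stdio.h>

/* P-code: the three instructions of the example
 *   N1: x = a + b;   N2: y = c - d;   N3: z = x * y;
 * S-code: per instruction, the required signal count and successor list. */
static int a = 1, b = 3, c = 4, d = 3;           /* example inputs (assumed) */
static int x, y, z;

enum { N1, N2, N3, NUM_NODES };

static int needed[NUM_NODES]   = { 0, 0, 2 };    /* N3 waits for N1 and N2 */
static int received[NUM_NODES] = { 0, 0, 0 };
static int done[NUM_NODES];

static void execute(int n)                       /* instruction execution */
{
    switch (n) {
    case N1: x = a + b; break;
    case N2: y = c - d; break;
    case N3: z = x * y; break;
    }
    done[n] = 1;
}

static void send_signal(int n)                   /* signal counting */
{
    received[n]++;
}

int main(void)
{
    int fired = 1;

    while (fired) {
        fired = 0;
        for (int n = 0; n < NUM_NODES; n++) {
            if (done[n] || received[n] < needed[n])
                continue;                        /* still waiting */
            execute(n);                          /* "fire" */
            if (n == N1 || n == N2)
                send_signal(N3);                 /* "done" signals per S-code */
            fired = 1;
        }
    }
    printf("x = %d, y = %d, z = %d\n", x, y, z); /* x = 4, y = 1, z = 4 */
    return 0;
}

This is the same count-and-signal mechanism that the DISU slides below realize in hardware: the enable memory holds the counts, and "done" signals drive them.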
The McGill Dataflow Architecture
Model
[Diagram: the Pipelined Instruction Processing Unit (PIPU) and the Dataflow Instruction Scheduling Unit (DISU), which contains the enable memory and controller for signal processing, exchange "fire" and "done" signals.]
The McGill Dataflow Architecture
Model
[Diagram: the PIPU and the DISU exchange "fire" and "done" signals; the DISU separates enabled instructions from waiting instructions, and the pool of enabled instructions plays the role of the program counter (PC).]
Important feature: the pipeline can be kept fully utilized provided that the program has sufficient parallelism.
The Scheduling Memory (Enable)
[Diagram: inside the Dataflow Instruction Scheduling Unit (DISU), the scheduling (enable) memory keeps a status bit and a signal count per instruction. The controller performs signal processing: "done" count signals from completed instructions update the counts, instructions whose counts are satisfied move from the waiting set to the enabled set, and enabled instructions are fired into the PIPU.]
Advantages of the McGill
Dataflow Architecture Model
• Eliminate unnecessary token copying
and transmission overhead
• Instruction scheduling is separated
from the main datapath of the
processor (e.g. asynchronous,
decoupled)
Von Neumann Threads as Macro
Dataflow Nodes
A sequence of instructions (1, 2, 3, ..., k) is "packed" into a macro-dataflow node. Synchronization is done at the macro-node level.
Hybrid Evaluation: "Von Neumann Style Instruction Execution" on the McGill Dataflow Architecture
• Group a "sequence" of dataflow instructions into a
"thread" or a macro dataflow node.
• Data-driven synchronization among threads.
• “Von Neumann style sequencing” within a thread.
Advantage:
Preserves the parallelism among threads but
avoids unnecessary fine-grain synchronization
between instructions within a sequential thread.
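A minimal sketch of the hybrid idea (assumed, EARTH-like in spirit but not the EARTH API): each thread body is ordinary sequential von Neumann code, and the only fine-grain synchronization left is a per-node join counter that producers decrement when they finish; the consumer node runs once its counter reaches zero. The node names reuse the N1/N2/N3 example with assumed input values.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Macro-dataflow node: a join (sync) count plus a sequential body. */
typedef struct node {
    atomic_int  remaining;          /* producer signals still outstanding */
    void      (*body)(void);        /* von Neumann code inside the node   */
} node;

static int x, y, z;
static void n3_body(void) { z = x * y; }       /* runs only after N1 and N2 */
static node n3 = { 2, n3_body };

/* A producer finishes and sends a "done" signal to its consumer node. */
static void signal_node(node *n)
{
    if (atomic_fetch_sub(&n->remaining, 1) == 1)
        n->body();                  /* count reached zero: node is enabled */
}

static void *n1_thread(void *arg) { (void)arg; x = 1 + 3; signal_node(&n3); return NULL; }
static void *n2_thread(void *arg) { (void)arg; y = 4 - 3; signal_node(&n3); return NULL; }

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, n1_thread, NULL);
    pthread_create(&t2, NULL, n2_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("z = %d\n", z);          /* z = 4 * 1 = 4 */
    return 0;
}

Inside n1_thread and n2_thread the code is ordinary sequential C; the single counter update is the only cross-thread synchronization, which is exactly the saving that macro-node granularity buys.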
What Do We Get?
• A hybrid architecture model
without sacrificing the advantage
of fine-grain parallelism!
(latency-hiding, pipelining support)
A Realization of the Hybrid
Evaluation
[Diagram: a thread of instructions 1, 2, ..., k executes in the PIPU; a "von Neumann bit" on an instruction lets its sequential successor take a shortcut directly back into the PIPU, bypassing the "done"/"fire" round trip through the DISU.]