Exploit Fine-Grain Parallelism in the Grack Propagation

From EARTH to HTMT:
The Evolution of a Multithreaded
Architecture Model
Guang R. Gao
Computer Architecture & Parallel Systems
Laboratory
(CAPSL)
University of Delaware
7/17/2015
\Seminar\Spain-00-01
1
Outline
• Introduction
• The EARTH Execution and Architecture
Model
• The EARTH Programming Model and
Threaded-C
• Application Studies and Performance
Evaluation
• Related Work and Conclusions
Main Challenges:
Scalable
High-Performance
Parallel Systems
for
both Class A and Class B
Applications
Challenges: The “Killer Latency Problem”
[Figure: two nodes, each labeled P, NI, C, and M, connected through a network]
Latency due to:
- Communication
- Synchronization
- Task spawning
- Load balancing
SP2 is hard enough; PC clusters are much worse!
Meeting High-End Application
Challenges:
• Observation I: Many such Applications
have “Bad Latencies” demanding good
support of adaptive fine-grain parallelism
[Petaflop-2 Conference, 99-2]
Here Comes the Surprise!
[Theobald’s Ph.D. thesis, May, 1999]
Observation II: It is not necessarily too
hard to “generate” and “program” fine-grain threads!
However, it may be hard to statically group
them into coarse-grain threads!
A Base Adaptive Fine-Grain
Multithreaded Execution Model
C1 (Abundance) : a very large pool of
threads
C2 (ultra-light weight): can be spawned
as easily and as quickly as possible
C3 (Mobility): Adaptively migratable as
easily and as quickly as possible
Motivation of The EARTH
Project
How to exploit fine-grain multithreading
on a parallel system given
off-the-shelf microprocessors
Two Types of Fine-Grain Threads
• A parallel function invocation
• Strand/Fiber - a function body can be
divided into several “strands/fibers”
Threads and Fibers
• A fiber becomes enabled if it has received all input signals
• An enabled fiber may be selected for
execution when the required hardware
resource has been allocated
• After a fiber finishes execution,
a signal is sent to all
destination fibers to
update the corresponding
sync slots
[Figure labels: fiber within a frame; parallel function invocation; call a procedure; SYNC ops]
Note: The role of strand!
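The firing rule on this slide can be illustrated with a small single-threaded simulation. This is only a sketch in Python, not Threaded-C and not the EARTH runtime; the names `Fiber`, `Scheduler`, `signal`, and `demo` are hypothetical, and a sync slot is assumed to be a simple countdown counter.

```python
from collections import deque

class Fiber:
    """A run-to-completion unit of work guarded by a sync slot (a counter)."""
    def __init__(self, name, sync_count, body, targets=()):
        self.name = name
        self.sync_count = sync_count   # signals still needed before enabling
        self.body = body               # plain function, executed without preemption
        self.targets = list(targets)   # fibers signaled after this one finishes

class Scheduler:
    def __init__(self):
        self.ready = deque()           # enabled fibers waiting for a processor
        self.trace = []                # order in which fibers actually ran

    def signal(self, fiber):
        """Deliver one signal; the fiber is enabled when its count reaches zero."""
        fiber.sync_count -= 1
        if fiber.sync_count == 0:
            self.ready.append(fiber)

    def run(self):
        while self.ready:
            fiber = self.ready.popleft()
            fiber.body()
            self.trace.append(fiber.name)
            for target in fiber.targets:   # update the destination sync slots
                self.signal(target)

def demo():
    results = {}
    sched = Scheduler()
    # "join" needs two signals, one from each producer fiber.
    join = Fiber("join", 2, lambda: results.update(sum=results["a"] + results["b"]))
    left = Fiber("left", 1, lambda: results.update(a=1), [join])
    right = Fiber("right", 1, lambda: results.update(b=2), [join])
    sched.signal(left)
    sched.signal(right)
    sched.run()
    return sched.trace, results["sum"]
```

Calling `demo()` runs the two producer fibers first and the consumer only after both signals have arrived, which is the dependence-driven behaviour the bullets describe.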
The Execution Model of Fibers
• Dependence-driven firing rule for fibers
• A fiber is atomic and ultra-lightweight
• Relation to the dataflow model (Dennis72)
[Figure: fibers, signals, and tokens, with numeric labels 2 2, 1 2, 0 1, 0 2, 2 4]
The Threaded C Language
• Threaded C = ANSI C + extensions for multithreading
• Extensions include:
– Threaded functions
– Threaded synchronization
– Support for global addresses
– Data transfer primitives
• Threaded C is:
– The “instruction set” of the
EARTH processor
– A target language for
high-level compilers
[Figure labels: Users; C, FORTRAN; High-Level Language Translation; Threaded C; Threaded C Compiler; EARTH Platforms]
An Evolutionary Path for EARTH
[Figure: evolutionary path for EARTH. Labels: SU-int; SU-ext; CPU/SU; MANNA-dual/spn; CPU, SU, and LINK units; parallel machines; PC-clusters; SEMi Simulation Platform (Theobald99)]
Platforms for EARTH
• MANNA:
– MANNA is an architecture testbed from GMD
– benchmarking platform for fine-grain
multithreading
• EARTH-SP2
• EARTH-Beowulf (Linux based)
• EARTH-SUN/SMP/Cluster
Unique Advantages of the EARTH-MANNA Platform
• We can push OS completely out of the way!
• We can design the EARTH runtime system
from very low level up
• The invaluable experience/lessons learned
from EARTH-MANNA are essential for the
successful migration of the EARTH model
to other platforms (e.g. the IBM SP-2 story,
etc.)
Summary of Recent
Experimental Results (Kevin99)
• Impressive speedup and scalability (scalable
even with high overhead fine-grain parallel
programs: e.g. fib)
• Enhanced Programmability (N-queen-p
example)
• Broad applicability
Experiments
• Example 1 (assorted benchmarks): fib,
nqueen, paraffin, tomcatv, matrix multiply, etc.
• Example 2: Adaptive unstructured grids
• Example 3: Wavelet computation
Performance of N-Queens(12)
[Theobald99]
• 117.8-fold speedup on a 120-node
simulation!
• 1,637,099 tokens are generated!
• On average, 30+ tokens are maintained per
processor
• N-Queens is a useful HTMT benchmark after
all! (Phil Murkey)
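The figures on this slide can be turned into two derived quantities with simple arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and dividing the token total evenly across nodes is an assumption made for the calculation, not something the slide states. Note that the per-node total of generated tokens is a different quantity from the 30+ tokens resident per processor at any one time.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nodes

def tokens_per_node(total_tokens, nodes):
    """Average tokens generated per node, assuming an even spread."""
    return total_tokens / nodes

efficiency = parallel_efficiency(117.8, 120)
generated_per_node = tokens_per_node(1_637_099, 120)

print(f"efficiency: {efficiency:.1%}")                          # efficiency: 98.2%
print(f"tokens generated per node: {generated_per_node:.0f}")   # 13642
```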
Coarse-Grain Applications
• A 116-fold speedup on a 120-node machine
is achieved for Cannon’s matrix
multiply algorithm!
• Deep software systolic-style
implementation to exploit parallelism
• Fine-grain mechanisms
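For readers unfamiliar with Cannon’s algorithm, the systolic shift pattern it relies on can be sketched sequentially. This is a minimal Python simulation, not the Threaded-C implementation measured on the slide; the function name `cannon_multiply` is hypothetical, and each matrix entry stands in for the sub-matrix block that one node of a q x q grid would hold.

```python
def cannon_multiply(a, b):
    """Multiply two q x q matrices using Cannon's skew-and-shift pattern.

    Entry (i, j) plays the role of the block owned by node (i, j); a real
    implementation would hold sub-matrices and shift them between nodes.
    """
    q = len(a)
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    a_loc = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b_loc = [[b[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Local multiply-accumulate on every node.
        for i in range(q):
            for j in range(q):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Systolic step: A moves one node to the left, B moves one node up.
        a_loc = [[a_loc[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b_loc = [[b_loc[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c
```

Each node only ever exchanges data with its grid neighbours, which is why the pattern maps naturally onto fine-grain communication and synchronization mechanisms.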
Example 2 --- Adaptive
Unstructured Mesh Computation
[Flowchart labels: Initialization; Partitioning; Mapping; Execution (Solution); Adapt? (Y/N); Repartitioning; Balanced? (Y/N); Expensive? (Y/N); Remapping; Finalization]
Observation
• The critical part of the
framework is mesh
adaptation and load
balancing
• The partitioning problem is in
better shape; the
remapping problem remains
open
The Initial Picture
[Figure: Node 0, Node 1, ..., Node N]
The Mapping After a Few Iterations
[Figure: Node 0, Node 1, ..., Node N]
Initial Results
• About 3000 lines of Threaded-C code
• migration >= 70% (good)
• Unbiased variance = 3 - 5% (very good)
• A good speedup on EARTH-MANNA has
been observed
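The slide does not say how the “unbiased variance” is computed. One plausible reading, offered here purely as an assumption, is the unbiased (n - 1 denominator) sample variance of per-node load after normalizing by the mean load. The sketch below shows that reading; the function name and the sample loads are hypothetical and are not data from the experiments.

```python
import statistics

def normalized_load_variance(loads):
    """Unbiased sample variance of per-node load relative to the mean load.

    Assumed metric for illustration only; the slide gives no definition.
    """
    mean = statistics.fmean(loads)
    normalized = [load / mean for load in loads]
    # statistics.variance uses the n - 1 (unbiased) denominator.
    return statistics.variance(normalized)

# Hypothetical per-node work counts, not measured values.
sample_loads = [100, 110, 90, 105, 95]
imbalance = normalized_load_variance(sample_loads)
print(f"normalized load variance: {imbalance:.4f}")
```

A value near zero means the nodes carry nearly equal work, so a lower number indicates better load balance under this reading.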
Example 3 --- Adaptive Wavelet
Transformation
• Load evolution pattern is dynamically
changing, but is statically predictable
• Need adaptive load redistribution/grouping
• Mapping onto EARTH [IPPS99]
HTMT Facility (Perspective)
HTMT Architecture
[Figure: HTMT architecture showing SPELLs, SPIM, and DPIM components]
Extensions to Current
EARTH Model
• Percolation Model
• Memory Model: Location
Consistency
• Load balancing and percolation
HTMT Percolation Model
[Figure: HTMT percolation model. Labels: SCP; Execution Unit; Cryogenic Area; CRAM; DMA (start, done); SRAM-PIM; Run Time System; Parcel Dispatcher & Dispenser; Parcel Assembly & Disassembly; Parcel Invocation & Termination; A-Pool; I-Pool; T-Pool; D-Pool; Split-Phase Synchronization to SRAM]
The System Software Architecture
[Figure: layered software stack. Labels: Applications; high-level languages (e.g. parallel C, etc.); high-level language compiler; HTMT-C/Threaded-C; Threaded-C Compiler and Tool Set; PXM Interface; RTS; OS; Hardware Architectures; Performance Models]
Note:
• The Threaded-C compiler has
part of its functions
embedded in the RTS
• The RTS will work with the
architecture and OS layers to
provide the PXM interface
• The performance models are
defined across all layers
Interfaces shown: Threaded-C compiler - RTS interface; RTS-OS interface; RTS-hardware architecture interface
Evolution of Multithreaded Architecture Models
[Figure: genealogy chart of multithreaded architecture models. Category labels: non-dataflow based; dataflow-model inspired. Entries: CDC 6600 (1964); Flynn’s Processor (1969); CHoPP’77; CHoPP’87; HEP (B. Smith, 1978); Cosmic Cube (Seitz, 1985); MASA (Halstead, 1986); J-Machine (Dally, 1988-93); M-Machine (Dally, 1994-98); Alewife (Agarwal, 1989-96); Tera (B. Smith, 1990-); MTA; XMT (Vishkin); Static Dataflow (Dennis, 1972, MIT); MIT TTDA (Arvind, 1980); Manchester (Gurd & Watson, 1982); Arg-Fetching Dataflow (Dennis, Gao, 1987-88); Monsoon (Papadopoulos & Culler, 1988); P-RISC (Nikhil & Arvind, 1989); Iannucci’s (1988-92); MDFA (Gao, 1989-93); SIGMA-I (Shimada, 1988); EM-5/4/X; RWC-1 (1992-97); TAM (Culler, 1990); *T/Start-NG (MIT/Motorola, 1991-); Cilk (Leiserson); EARTH (Hum, Theobald, Gao 94, PACT95, ISCA96, Theobald99). Others: Multiscalar (1994), SMT (1995), etc.]
Acknowledgement
NSERC, FCAR, DARPA, NSA, NSF, NASA
(Incomplete List)
• Erik Altman
• Haiying Cai
• Nasser Elmasri
• Gerd Heber
• Laurie J. Hendren
• Herbert Hum
• Alberto Jimenez
• Prasad Kakulavarapu
• Cheng Li
• Olivier Maquelin
• Andres Marquez
• Shashank Nemawarkar
• Zach Ruiz
• Sean Ryan
• V.C. Sreedhar
• Xinan Tang
• Kevin Theobald
• Ruppa Thulasiram
• Parimala Thulasiraman
• Xinmin Tian
• Yingchun Zhu
• J. Nelson Amaral
