Transcript Document

Topic 2 -- II:
Compilers and Runtime Technology:
Optimization Under Fine-Grain Multithreading
- The EARTH Model (in more detail)
Guang R. Gao
ACM Fellow and IEEE Fellow
Endowed Distinguished Professor
Electrical & Computer Engineering
University of Delaware
[email protected]
cpeg421-10-F/Topic-3-II-EARTH
Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization: SSB
• The percolation model and its applications
• Summary
The EARTH Multithreaded Execution Model
Two levels of fine-grain threads:
- threaded procedures
- fibers
[Figure: threaded procedure invocation frames with fibers inside each frame. Labels: fiber within a frame, async function invocation, a sync operation, invoke a threaded function. Each fiber's signal token records the total # of signals it expects and the # of signals that have arrived.]
EARTH vs. CILK
[Figure: side-by-side diagrams of the CILK model and the EARTH model. Labels: parallel function invocation frames, fork a procedure, SYNC ops, and (EARTH) fibers within a frame.]
Note: EARTH has its origin in the static dataflow model.
The “Fiber” Execution Model
[Figure: animation over twelve slides of a fiber dependence graph. Each fiber carries a signal token recording the total # of signals it expects and the # of signals that have arrived. As producer fibers complete, they send signals to their consumers; a fiber becomes enabled and may fire only when its arrived count equals its total count (e.g., a fiber expecting 2 signals fires once both have arrived, and the final fiber expecting 4 signals fires last). A minimal C sketch of this firing rule follows.]
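To make the firing rule concrete, here is a minimal single-threaded C sketch (the names sync_slot_t and send_signal are invented for illustration; this is not the EARTH runtime API): a fiber's sync slot records the total number of signals it expects and the number that have arrived, and the fiber is enabled exactly when the two counts match.

/* Minimal sketch of the signal-counting firing rule; not the EARTH
 * implementation, only an illustration of "fire when arrived == total". */
#include <stdio.h>

typedef struct {
    int total;             /* total # of signals the fiber waits for */
    int arrived;           /* # of signals received so far           */
    void (*body)(void);    /* code to run once the fiber is enabled  */
} sync_slot_t;

/* Deliver one signal to a fiber's sync slot; fire the fiber when the
 * slot is saturated. A real runtime would update the count atomically
 * and enqueue the fiber on a ready queue rather than calling it here. */
static void send_signal(sync_slot_t *slot) {
    if (++slot->arrived == slot->total)
        slot->body();
}

static void consumer(void) { puts("fiber enabled: both inputs arrived"); }

int main(void) {
    sync_slot_t f = { .total = 2, .arrived = 0, .body = consumer };
    send_signal(&f);   /* first producer done: 1 of 2              */
    send_signal(&f);   /* second producer done: the fiber fires    */
    return 0;
}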
A Loop Example
for (i = 1; i <= N; ++i) {
  S1: ...
  S2: x[i] = ...
  S3: y[i] = ... + x[i-1] ...
  ...
  Sk: ...
}
[Figure: iterations i = 1, 2, 3, ..., N are assigned to threads T1, T2, T3, ..., each executing the body S1 ... Sk.]
Note: how are loop-carried dependencies handled (here, S3 of iteration i reads x[i-1] written by S2 of iteration i-1), and what does that imply for cross-core software pipelining? A sketch of one conventional way to enforce this dependence follows.
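For the distance-1 dependence above, one conventional (non-EARTH) way to realize the cross-iteration signal is a semaphore per produced element, sketched below with POSIX threads. The one-thread-per-iteration framing and all names are illustrative only, but the sketch shows how iterations on different cores can overlap (a software pipeline) while the dependence is still respected.

/* Thread for iteration i posts sem[i] after writing x[i] and waits on
 * sem[i-1] before reading x[i-1]. Compile with -pthread.             */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 8
static double x[N + 1], y[N + 1];
static sem_t  sem[N + 1];          /* sem[i] posted once x[i] is written */

static void *iteration(void *arg) {
    int i = (int)(long)arg;
    /* S1: ... */
    x[i] = 2.0 * i;                /* S2: produce x[i]                   */
    sem_post(&sem[i]);             /* signal consumers of x[i]           */
    sem_wait(&sem[i - 1]);         /* wait until x[i-1] is ready         */
    y[i] = 1.0 + x[i - 1];         /* S3: consume x[i-1]                 */
    /* Sk: ... */
    return NULL;
}

int main(void) {
    pthread_t t[N + 1];
    for (int i = 0; i <= N; ++i) sem_init(&sem[i], 0, 0);
    x[0] = 0.0; sem_post(&sem[0]);                 /* value for i = 0 ready */
    for (int i = 1; i <= N; ++i)
        pthread_create(&t[i], NULL, iteration, (void *)(long)i);
    for (int i = 1; i <= N; ++i) pthread_join(t[i], NULL);
    printf("y[N] = %g\n", y[N]);
    return 0;
}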
Main Features of EARTH
* Fast thread (fiber) context switching
• Efficient parallel function invocation
• Good support for fine-grain dynamic load balancing
* Efficient support for split-phase transactions and fibers
* Features unique to the EARTH model in comparison to the CILK model
Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization: SSB
• The percolation model and its applications
• Summary
Compiling C for EARTH: Objectives
• Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multithreaded architectures. (EARTH-C)
• Develop compiler techniques to automatically translate programs written in EARTH-C to multithreaded programs. (EARTH-C, Threaded-C)
• Determine if EARTH-C + compiler can compete with hand-coded Threaded-C programs.
Summary of EARTH-C Extensions
• Explicit parallelism
  – Parallel versus sequential statement sequences
  – Forall loops
• Locality annotation
  – Local versus remote memory references (global, local, replicate, ...)
• Dynamic load balancing
  – Basic versus remote functions and invocation sites
EARTH-C Compiler Environment
[Figure: the EARTH compilation environment and the EARTH compiler. The environment: EARTH-C is translated by the McCAT-based EARTH-C compiler through EARTH-SIMPLE C into Threaded-C, which the Threaded-C compiler then compiles. The EARTH compiler passes: program dependence analysis, split-phase analysis, build DDG, compute remote level, thread partitioning, merge statements, thread generation, thread synchronization, thread scheduling, and thread code generation.]
The McCAT/EARTH Compiler
EARTH-C
PHASE I (standard McCAT analyses & transformations): simplification / goto elimination, local function inlining, points-to analysis, heap analysis, R/W set analysis, array dependence tester
EARTH-SIMPLE-C
PHASE II (parallelization): forall loop detection, loop partitioning
EARTH-SIMPLE-C
PHASE III (code generation): build hierarchical DDG, thread generation
THREADED-C
The Fibonacci Example
[Figure: threaded function fib (parameters: n, result, done). The first fiber needs no signals (0/0); the second fiber's sync slot expects 2 signals (2/2), one per recursive result. A plain-C sketch of the same join structure follows the code.]
  if (n < 2)
    DATA_RSYNC (1, result, done);
  else {
    TOKEN (fib, n-1, &sum1, slot_1);
    TOKEN (fib, n-2, &sum2, slot_2);
  }
  END_THREAD ( );
THREAD-1:
  DATA_RSYNC (sum1 + sum2, result, done);
  END_THREAD ( );
END_FUNCTION
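For comparison, here is a plain-C sketch of the same control structure (hypothetical names, not Threaded-C, and with the recursive calls made synchronously rather than spawned with TOKEN): each recursive result is delivered into its slot, and a two-signal join plays the role of THREAD-1, adding sum1 and sum2 once both have arrived.

#include <stdio.h>

typedef struct frame {
    int  sum1, sum2;      /* slots filled by the two "TOKEN" results    */
    int  arrived;         /* how many results have arrived (0..2)       */
    int *result;          /* where to deliver fib(n), like DATA_RSYNC   */
} frame_t;

/* Continuation: runs once both recursive results have arrived. */
static void thread_1(frame_t *f) { *f->result = f->sum1 + f->sum2; }

static void deliver(frame_t *f, int *slot, int value) {
    *slot = value;
    if (++f->arrived == 2) thread_1(f);          /* 2-signal sync slot  */
}

static void fib(int n, int *result) {
    if (n < 2) { *result = 1; return; }          /* DATA_RSYNC(1, result, done) */
    frame_t f = { .arrived = 0, .result = result };
    int r1, r2;
    fib(n - 1, &r1); deliver(&f, &f.sum1, r1);   /* TOKEN(fib, n-1, &sum1, ...) */
    fib(n - 2, &r2); deliver(&f, &f.sum2, r2);   /* TOKEN(fib, n-2, &sum2, ...) */
}

int main(void) { int r; fib(10, &r); printf("fib(10) = %d\n", r); return 0; }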
Matrix Multiplication
Sequential version:
void main ( )
{
  int i, j, k;
  float sum;
  /* a, b, c: N x N global matrices */
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
      sum = 0;
      for (k = 0; k < N; k++)
        sum = sum + a[i][k] * b[k][j];
      c[i][j] = sum;
    }
}
The Inner Product Example
[Figure: threaded function inner (parameters: a, b, result, done). The first fiber needs no signals (0/0); the second fiber's sync slot expects 2 signals (2/2), one per completed block move.]
  BLKMOV_SYNC (a, row_a, N, slot_1);       /* fetch a row of a    */
  BLKMOV_SYNC (b, column_b, N, slot_1);    /* fetch a column of b */
  sum = 0;
  END_THREAD ( );
THREAD-1:
  for (i = 0; i < N; i++)
    sum = sum + (row_a[i] * column_b[i]);
  DATA_RSYNC (sum, result, done);
  END_THREAD ( );
END_FUNCTION
Summary of EARTH-C Extensions
• Explicit parallelism
  – Parallel versus sequential statement sequences
  – Forall loops
• Locality annotation
  – Local versus remote memory references (global, local, replicate, ...)
• Dynamic load balancing
  – Basic versus remote functions and invocation sites
EARTH-C → Threaded-C (Thread Generation)
Given a sequence of statements s1, s2, ..., sn, we wish to create threads such that we:
– Maximize thread length (minimize thread-switching overhead)
– Retain sufficient parallelism
– Issue remote memory requests as early as possible (prefetching)
– Compile split-phase remote memory operations and remote function calls correctly
An Example
int f(int *x, int i, int j) {
  int a, b, sum, prod, fact;
  int r1, r2, r3;
  a = x[i];
  fact = 1;
  b = x[j];
  fact = fact * a;
  sum = a + b;
  prod = a * b;
  r1 = g(sum);
  r2 = g(prod);
  r3 = g(fact);
  return (r1 + r2 + r3);
}
Example Partitioned into Four Fibers
Fiber-0:  a = x[i];            /* remote read issued                       */
          fact = 1;
Fiber-1:  fact = fact * a;     /* waits for 1 signal (a has arrived)       */
          b = x[j];            /* remote read issued                       */
Fiber-2:  sum = a + b;         /* waits for 1 signal (b has arrived)       */
          prod = a * b;
          r1 = g(sum);
          r2 = g(prod);
          r3 = g(fact);
Fiber-3:  return (r1 + r2 + r3);   /* waits for 3 signals (r1, r2, r3)     */
Better Strategy Using List Scheduling
• Put each instruction in the earliest possible thread.
• Within a thread, execute the remote operations as early as possible.
Build a Data Dependence Graph (DDG) and use a list scheduling strategy, where the selection of instructions is guided by earliest thread number and statement type. (A sketch of one plausible formalization follows.)
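A minimal sketch of one plausible formalization of "earliest possible thread" that reproduces the numbers on the next two slides (the rule, the enum names, and the data layout are assumptions for illustration, not the McCAT implementation): a statement that consumes the result of a split-phase operation (remote read/write or remote function call) must be placed at least one thread later than that operation; every other dependence can be satisfied within the same thread. Within a thread, statement type (next slide) then decides the issue order, remote operations first.

#include <stdio.h>

enum stmt_type { RR, RW, RF, LS, LC, BF };   /* remote read/write, remote fn call,
                                                local simple/compound, basic fn call */
#define MAX_PRED 4

typedef struct {
    const char    *text;
    enum stmt_type type;
    int            npred;
    int            pred[MAX_PRED];           /* indices of DDG predecessors */
} stmt_t;

static int is_split_phase(enum stmt_type t) { return t == RR || t == RW || t == RF; }

/* Earliest thread number: max over predecessors p of
 * thread(p) + (is_split_phase(p) ? 1 : 0).                             */
static void earliest_threads(const stmt_t *s, int n, int *thr) {
    for (int i = 0; i < n; ++i) {            /* statements listed in dependence order */
        int t = 0;
        for (int k = 0; k < s[i].npred; ++k) {
            int p    = s[i].pred[k];
            int cand = thr[p] + (is_split_phase(s[p].type) ? 1 : 0);
            if (cand > t) t = cand;
        }
        thr[i] = t;
    }
}

int main(void) {
    /* DDG for the example int f(int *x, int i, int j) from the previous slides. */
    stmt_t s[] = {
        { "a = x[i]",            RR, 0, {0} },          /* 0 */
        { "b = x[j]",            RR, 0, {0} },          /* 1 */
        { "fact = 1",            LS, 0, {0} },          /* 2 */
        { "sum = a + b",         LS, 2, {0, 1} },       /* 3 */
        { "prod = a * b",        LS, 2, {0, 1} },       /* 4 */
        { "fact = fact * a",     LC, 2, {0, 2} },       /* 5 */
        { "r1 = g(sum)",         RF, 1, {3} },          /* 6 */
        { "r2 = g(prod)",        RF, 1, {4} },          /* 7 */
        { "r3 = g(fact)",        RF, 1, {5} },          /* 8 */
        { "return r1 + r2 + r3", LS, 3, {6, 7, 8} },    /* 9 */
    };
    int n = (int)(sizeof s / sizeof s[0]), thr[16];
    earliest_threads(s, n, thr);
    for (int i = 0; i < n; ++i)
        printf("thread %d: %s\n", thr[i], s[i].text);   /* reproduces (0,RR) ... (2,LS) */
    return 0;
}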
Instruction Types (scheduling priority within a thread, from “schedule first” to “schedule last”):
– remote_read, remote_write
– remote_fn_call
– local_simple
– remote_compound
– local_compound
– basic_fn_call
List Scheduling the Previous Example
(0,RR)  a = x[i];         (0,RR)  b = x[j];         (0,LS)  fact = 1;
(1,LS)  sum = a + b;      (1,LS)  prod = a * b;     (1,LC)  fact = fact * a;
(1,RF)  r1 = g(sum);      (1,RF)  r2 = g(prod);     (1,RF)  r3 = g(fact);
(2,LS)  return (r1 + r2 + r3);
Resulting List-Scheduled Threads
Thread 0:  a = x[i];  b = x[j];  fact = 1;
           (sync slot: 2 signals — a and b have arrived)
Thread 1:  sum = a + b;  r1 = g(sum);  prod = a * b;  r2 = g(prod);  fact = fact * a;  r3 = g(fact);
           (sync slot: 3 signals — r1, r2, r3 have arrived)
Thread 2:  return (r1 + r2 + r3);
Generating Threaded-C Code
THREADED f (int *ret_parm, SLOT *rsync_parm, int *x, int i, int j)
{
  SLOTS SYNC_SLOTS[2];
  int a, b, sum, prod, fact, r1, r2, r3;
  /* THREAD_0: */
  INIT_SYNC (0, 2, 2, 1);            /* slot 0: 2 signals enable THREAD_1 */
  INIT_SYNC (1, 3, 3, 2);            /* slot 1: 3 signals enable THREAD_2 */
  GET_SYNC_L (&x[i], &a, 0);         /* split-phase remote reads signal slot 0 */
  GET_SYNC_L (&x[j], &b, 0);
  fact = 1;
  END_THREAD ( );
THREAD_1:;
  sum = a + b;
  TOKEN (g, &r1, SLOT_ADR(1), sum);
  prod = a * b;
  TOKEN (g, &r2, SLOT_ADR(1), prod);
  fact = fact * a;
  TOKEN (g, &r3, SLOT_ADR(1), fact);
  END_THREAD ( );
THREAD_2:;
  DATA_RSYNC_L (r1 + r2 + r3, ret_parm, rsync_parm);
  END_FUNCTION ( );
}
Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization: SSB
• The percolation model and its applications
• Summary
Fine-Grain Synchronization: Two Types
Sync type: enforce mutual exclusion
  – Order: no specific order required
  – Fine-grain sync. solutions: software fine-grained locks; lock-free concurrent data structures; full/empty bits
Sync type: enforce data dependencies
  – Order: uni-directional
  – Fine-grain sync. solutions: I-structures; full/empty bits
(A sketch of element-level locking for the mutual-exclusion case follows.)
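As a concrete instance of the "software fine-grained locks" entry (mutual exclusion with no ordering requirement), the sketch below guards each bucket of a hash table with its own pthread mutex, so operations on different buckets never contend. The table layout and sizes are illustrative and are not taken from the SSB benchmarks.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS 1024

struct node { int key; struct node *next; };

static struct node     *bucket[NBUCKETS];
static pthread_mutex_t  lock[NBUCKETS];      /* one lock per bucket, not per table */

static void table_init(void) {
    for (int i = 0; i < NBUCKETS; ++i)
        pthread_mutex_init(&lock[i], NULL);
}

/* Insert key at the head of its bucket; only that bucket is locked,
 * so inserts into different buckets proceed fully in parallel.        */
static void insert(int key) {
    unsigned b = (unsigned)key % NBUCKETS;
    struct node *n = malloc(sizeof *n);
    n->key = key;
    pthread_mutex_lock(&lock[b]);
    n->next = bucket[b];
    bucket[b] = n;
    pthread_mutex_unlock(&lock[b]);
}

int main(void) {
    table_init();
    for (int k = 0; k < 10; ++k) insert(k);
    printf("bucket[3] head key = %d\n", bucket[3]->key);
    return 0;
}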
Enforce Data Dependencies
• A DoAcross loop with a positive, constant dependence distance D; iterations are assigned to different threads and run in parallel.
for (i = D; i < N; ++i) {
  A[i] = ...
  ...
  ... = A[i-D];
}
T0 (i = 2) {
  A[2] = ...
  ...
  ... = A[2-D]
}
T1 (i = 2 + D) {
  A[2+D] = ...
  ...
  ... = A[2]        /* reads the value written by T0 */
}
The data dependence (T1 reads A[2] written by T0) must be enforced by synchronization.
Memory-Based Fine-Grain Synchronization
• Full/empty bits (HEP, Tera MTA, etc.) and I-structures (dataflow-based machines)
• Associate a “state” with a memory location (fine granularity); fine-grain synchronization for that location is realized through state transitions on that state.
[Figure: I-structure state transition diagram [ArvindEtAl89 @ TOPLAS] — states Empty, Full, and Deferred-read. A write on an Empty location moves it to Full; a read on an Empty location defers the reader (Deferred-read) until the write arrives; reads on a Full location proceed; reset returns the location to Empty.]
With Memory-Based Fine-Grain Sync
Original loop:
for (i = D; i < N; ++i) {
  A[i] = ...
  ...
  ... = A[i-D];
}
With memory-based synchronization:
for (i = D; i < N; ++i) {
  write_sync(&(A[i]), ...);
  ...
  ... = read_sync(&(A[i-D]));
}
• A single atomic operation completes a synchronized write or read directly in memory.
• No need to implement the synchronization with other resources, e.g., shared memory.
• Low overhead: just one memory transaction.
With Memory-Based Fine-Grain Sync
T0 (i = 2) {
  write_sync(&(A[2]), ...);
  ...
  ... = read_sync(&(A[2-D]));
}
T1 (i = 2 + D) {
  write_sync(&(A[2 + D]), ...);
  ...
  ... = read_sync(&(A[2]));
}
• A single atomic operation completes a synchronized write or read directly in memory.
• No need to implement the synchronization with other resources, e.g., shared memory.
• Low overhead: just one memory transaction. (A software emulation sketch follows.)
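A software emulation of what write_sync/read_sync provide, using C11 atomics; this is only a sketch of the full/empty idea, not the Cyclops-64 instructions, and unlike the hardware scheme it spends an extra flag word and extra memory traffic per element instead of a single memory transaction.

#include <stdatomic.h>
#include <stdio.h>

#define N 16
#define D 4

typedef struct {
    double      value;
    atomic_bool full;      /* the per-word "state" (empty/full)        */
} sync_word_t;

static sync_word_t A[N];   /* zero-initialized: every word starts empty */

static void write_sync(sync_word_t *w, double v) {
    w->value = v;
    atomic_store_explicit(&w->full, true, memory_order_release);
}

static double read_sync(sync_word_t *w) {
    while (!atomic_load_explicit(&w->full, memory_order_acquire))
        ;                  /* spin until the producer's write_sync     */
    return w->value;
}

int main(void) {
    /* Sequential demo of the DoAcross loop; with one thread per
     * iteration the same calls would enforce the A[i-D] dependence.   */
    for (int i = 0; i < D; ++i) write_sync(&A[i], (double)i);
    for (int i = D; i < N; ++i) {
        write_sync(&A[i], (double)i);
        double prev = read_sync(&A[i - D]);
        printf("A[%d] depends on A[%d] = %g\n", i, i - D, prev);
    }
    return 0;
}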
An Alternative: Control-Flow Based Synchronizations
for (i = D; i < N; ++i) {
  A[i] = ...
  post(i);        /* no data dependency between post(i) and A[i] = ...     */
  ...
  wait(i-D);      /* no data dependency between wait(i-D) and ... = A[i-D] */
  ... = A[i-D];
}
• The post/wait operations need to be implemented in shared memory, in coordination with the underlying memory (consistency) model.
• Because the post/wait pair carries no data dependence on the array accesses, you may need to worry about this:
  A[i] = ...;
  fence;
  post(i);
  wait(i-D);
  fence;
  ... = A[i-D];
For computations with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. [ArvindEtAl89 @ TOPLAS] (A C11 sketch of the fence pattern follows.)
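A C11 sketch of the fence pattern above (post, wait_for, and the posted flag array are illustrative names, not a real runtime API): the release fence keeps the write of A[i] from being reordered after the post, and the acquire fence keeps the read of A[i-D] from being hoisted above the wait, which is exactly the coordination with the memory consistency model the slide warns about.

#include <stdatomic.h>

#define N 64
#define D 4

static double      A[N];
static atomic_bool posted[N];     /* posted[i] set once A[i] is written */

static void post(int i) {
    atomic_thread_fence(memory_order_release);            /* "fence"   */
    atomic_store_explicit(&posted[i], true, memory_order_relaxed);
}

static void wait_for(int i) {
    while (!atomic_load_explicit(&posted[i], memory_order_relaxed))
        ;                                                 /* spin      */
    atomic_thread_fence(memory_order_acquire);            /* "fence"   */
}

/* Body of one DoAcross iteration, run by some thread: */
static void iteration(int i) {
    A[i] = 2.0 * i;       /* producer side                             */
    post(i);
    wait_for(i - D);
    double x = A[i - D];  /* consumer side: safe only after the fences */
    (void)x;
}

int main(void) {
    for (int i = 0; i < D; ++i) { A[i] = 2.0 * i; post(i); }
    for (int i = D; i < N; ++i) iteration(i);   /* sequential demo */
    return 0;
}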
A Question!
Is it really necessary to tag every word in the entire memory to support memory-based fine-grain synchronization?
Key Observation
At any instant of a “reasonable” parallel execution, only a small fraction of memory locations are actively participating in synchronization.
Solution:
Synchronization State Buffer (SSB): record and manage the states of only the actively synchronized data units to support fine-grain synchronization.
What is SSB?
• A small hardware buffer attached to the memory controller of each memory bank.
• Records and manages the states of actively synchronized data units.
• Hardware cost
  – Each SSB is a small look-up table: easy to implement
  – Each SSB is independent: hardware cost grows only linearly with the # of memory banks
(A loose software model of such a lookup table follows.)
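A loose software model of the idea of "record only the active synchronized locations" (the entry fields, the 16-entry size, and the full-buffer fallback shown here are assumptions for illustration; they are not the published SSB hardware design): a small per-bank table is probed by address, and when no entry is free the operation would fall back to a slower software synchronization path.

#include <stdint.h>

#define SSB_ENTRIES 16            /* small, fixed-size buffer per bank  */

typedef enum { SSB_FREE, SSB_EMPTY, SSB_FULL, SSB_LOCKED } ssb_state_t;

typedef struct {
    uintptr_t   addr;             /* which word this entry tracks       */
    ssb_state_t state;            /* its current synchronization state  */
    int         count;            /* e.g., reader count / deferred ops  */
} ssb_entry_t;

typedef struct { ssb_entry_t e[SSB_ENTRIES]; } ssb_t;

/* Look up (or allocate) the entry for an address; returns NULL when the
 * buffer is full, in which case a real system would fall back to a
 * software synchronization path.                                       */
static ssb_entry_t *ssb_lookup(ssb_t *ssb, uintptr_t addr) {
    ssb_entry_t *free_slot = 0;
    for (int i = 0; i < SSB_ENTRIES; ++i) {
        if (ssb->e[i].state != SSB_FREE && ssb->e[i].addr == addr)
            return &ssb->e[i];                    /* address is active  */
        if (ssb->e[i].state == SSB_FREE && !free_slot)
            free_slot = &ssb->e[i];
    }
    if (free_slot) { free_slot->addr = addr; free_slot->state = SSB_EMPTY; free_slot->count = 0; }
    return free_slot;             /* NULL => buffer full, use fallback  */
}

int main(void) {
    static ssb_t bank;            /* zero-initialized: all entries free */
    ssb_entry_t *e = ssb_lookup(&bank, (uintptr_t)0x1000);
    return e ? 0 : 1;
}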
SSB on Many-Core (IBM C64)
IBM Cyclops-64, Designed by Monty Denneau.
SSB Synchronization Functionalities
Data synchronization: enforce RAW data dependencies
• Supported at word level
  – Two single-writer-single-reader (SWSR) modes
  – One single-writer-multiple-reader (SWMR) mode
Fine-grain locking: enforce mutual exclusion
• Supported at word level
  – Write lock (exclusive lock)
  – Read lock (shared lock)
  – Recursive lock
SSB is capable of supporting more functionality.
Experimental Infrastructure
[Figure: toolchain and simulation testbed — OpenMP compiler and C compiler (GCC/Open64); binutils (assembler, linker); libraries (TiNy Threads library/RTS, OpenMP RTS, standard C/math libraries); the Cyclops-64 microkernel; all running on the FAST simulator (software) and the Ms. Clops hardware emulator.]
IBM Cyclops-64 chip architecture:
• 160 thread units (500 MHz)
• Three-level, explicitly addressable memory hierarchy
• Efficient thread-level execution support
• SSB per on-chip SRAM bank: 16-entry, 8-way associative
SSB Fine-Grain Sync. is Efficient
• For all the benchmarks, the SSB-based version shows significant performance improvement over the versions based on other synchronization mechanisms.
• For example, with up to 128 threads:
  – Livermore loop 6 (linear recurrence): a 312% improvement over the barrier-based version
  – Ordered integer set (hash table): outperforms the software-based fine-grain methods by up to 84%
Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization: SSB
• The percolation model and its applications
• Summary
Research Layout
[Figure: research roadmap. Advanced execution/programming model: percolation. Base execution model: fine-grain multithreading (e.g., EARTH, CARE), with location consistency. Target architectures: HTMT-like architectures, high-end PIM architectures, cellular multithreaded architectures (e.g., BG/C). Application drivers: scientific computation kernels, high-performance bio-computing kernels, other high-end applications. Infrastructure & tools: system software, simulation/emulation, analytical modeling. Also shown: future programming models.]
Percolation Model: A User's Perspective
[Figure: an HTMT-style hierarchy of high-speed CPUs with CRAM (the primary execution engine), an SRAM-level PIM (S-PIM engine) that prepares and percolates “parceled threads” and performs intelligent memory operations, and a DRAM-level PIM (D-PIM engine) responsible for global memory management.]
The Percolation Model
• What is percolation? Dynamic, adaptive computation/data movement, migration, and transformation, in place or on the fly, to keep system resources usefully busy.
• Features of percolation:
  – Both data and threads may percolate
  – Computation reorganization and data layout reorganization
  – Asynchronous invocation
[Figure: percolation across the levels of an HTMT-like architecture (level 0: fast CPUs; levels 1-3: PIM memory levels), with data layout reorganized during percolation. Example of percolation: Cannon's algorithm with its nearest-neighbor data transfers.]
Performance of SCCA2 Kernel 4
Metric: TEPS — traversed edges per second

#threads        C64         SMPs       MTA-2
    4        2,917,082    5,369,740    752,256
    8        5,513,257    2,141,457    619,357
   16        9,799,661      915,617    488,894
   32       17,349,325      362,390    482,681

• Reasonable scalability: scales well with the # of threads; linear speedup for #threads < 32
• Commodity SMPs have poor performance (SMP: 4-way dual-core Xeon, 2 MB L2 cache)
• Competitive vs. the MTA-2
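Reading the TEPS off the table: C64 grows from about 2.92 M TEPS at 4 threads to about 17.3 M TEPS at 32 threads, roughly a 5.9x gain on 8x the threads, while the SMP figure drops from 5.37 M to 0.36 M TEPS as threads are added and the MTA-2 stays in the 0.48-0.75 M TEPS range.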
Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization: SSB
• The percolation model and its applications
• Summary