
"Foolproof systems don't take into account the ingenuity of fools."
Gene Brown.
Topic 6
Advanced Topics
“But I don’t want to go among mad people,” Alice remarked.
“Oh, you can’t help that,” said the Cat. “We’re all mad here. I’m mad. You’re mad.”
“How do you know I’m mad?” said Alice.
“You must be,” said the Cat, “or you wouldn’t have come here.”
Alice's Adventures in Wonderland
10/30/2006
ELEG652-06F
1
Reading List
• Slides: Topic6x
• Other papers as assigned in class or
homework
10/30/2006
ELEG652-06F
2
Outline
• A review of Parallel Architecture topics
• Synchronization and Parallelism
– Methods to Alleviate and exploit
• Dataflow Model
– Program Graphs
– Static, dynamic and recursive models
• From Pure Dataflow to multithreading
• Transactional Memory
– Lock Free Data Structures
– Types
10/30/2006
ELEG652-06F
3
What have we learned?
• Terminology and interconnect networks
– Their effects on communication
– The different types of architectures
– Classes of Applications
• Exploiting ILP
– Methods of exploiting parallelism in different architectures
• Memory Models
– Their impact on programming and hardware design
– Different implementations of such models in the memory
hierarchy
• Synchronization
– Its costs and types
10/30/2006
ELEG652-06F
4
Exploiting Parallelism
What is the factor that determines the parallelism of
an application?
The dependencies
How do we extract the maximum possible parallelism?
Respect the dependencies that must hold and resolve
the ones that can be resolved
Dataflow
A model in which data “fires” operations
Think of the re-order
buffer
10/30/2006
ELEG652-06F
5
Synchronization
• Cost
– Lock acquisition and operations
• On the order of a thousand cycles
– Barriers
• On the order of ten thousand cycles
• Problem
– Lock access
• Network and memory bandwidth and latency overhead
• Solution
– Get rid of the locks
– “Optimistic execution”
– Lock-free data structures
– Transactional memory
10/30/2006
ELEG652-06F
6
Topic 6a
Dataflow Model
An Execution Model for Parallel
Computation
10/30/2006
ELEG652-06F
7
A Short Story
A rough timeline (1960s–1990s):
• Carl Adam Petri defines Petri nets (early 1960s)
• Estrin and Turn propose an early dataflow model (1960s)
• Karp and Miller analyze computation graphs without branches or merges (1960s)
• Rodriguez proposes dataflow graphs (late 1960s)
• Chamberlin proposes a single-assignment language for dataflow (early 1970s)
• Dennis proposes a dataflow language; pure dataflow is born (early 1970s)
• Kahn proposes a simple parallel processing language with vertices as queues; static dataflow is born (1970s)
• Dennis designs a dataflow architecture (1970s)
• Arvind & Gostelow, and separately Gurd and Watson, create a tagged-token dataflow model; dynamic dataflow is born (late 1970s)
• Arvind, Nikhil, et al. design the Monsoon dataflow machine (late 1980s)
10/30/2006
ELEG652-06F
8
Important Concepts / Properties
• Determinate
– A concurrent program is determinate if the
order in which its operations are performed
does not affect the outcome of the
computation.
• Non-determinate
– A concurrent program is non-determinate if the
order in which its operations are performed
does affect the outcome of the computation.
10/30/2006
ELEG652-06F
9
Important Concepts / Properties
• Deterministic
– The execution of a concurrent program is deterministic if the order
in which the operations are performed remains the
same each time the program is executed
• Non-deterministic
– The execution of a concurrent program is non-deterministic if the order
in which the operations are performed may vary each
time the program is executed
• Determinate == Deterministic (?)
10/30/2006
ELEG652-06F
10
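As a concrete illustration (a minimal C/pthreads sketch written for these notes, not taken from the slides), the program below is non-determinate: the final value of x depends on which of the two unsynchronized writes happens to be performed last. It is also non-deterministic, since that order may change from run to run; if both threads wrote the same value, the program would become determinate even though its schedule would remain non-deterministic.

#include <pthread.h>
#include <stdio.h>

int x = 0;                                /* shared variable */

void *writer1(void *arg) { x = 1; return NULL; }
void *writer2(void *arg) { x = 2; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer1, NULL);
    pthread_create(&t2, NULL, writer2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("x = %d\n", x);                /* prints 1 or 2, depending on the interleaving */
    return 0;
}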
Deadlock
• A set of processes is “deadlocked” if each process in the
set is waiting on events that only another process in the
set can cause.
• Necessary conditions
– Mutual Exclusion
– Circular Wait
– No pre-emption
• Difficulty
– Programmability
– Correctness
– Avoidance, prevention and detection
A Deadlock Example
Thread 1:        Thread 2:
Lock A           Lock B
Lock B           Lock A
…                …
Unlock B         Unlock A
Unlock A         Unlock B
• The case of LL and SC
10/30/2006
ELEG652-06F
11
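The lock-ordering example above can be rendered directly with POSIX threads. The sketch below (an illustration written for these notes, not course code) exhibits the necessary conditions at once: mutual exclusion (mutexes), no preemption (a held mutex cannot be taken away), and circular wait (the two threads acquire A and B in opposite order).

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *arg) {                /* Lock A, then Lock B */
    pthread_mutex_lock(&lock_a);
    /* ... work ... */
    pthread_mutex_lock(&lock_b);          /* blocks forever if thread2 already holds B */
    pthread_mutex_unlock(&lock_b);
    pthread_mutex_unlock(&lock_a);
    return NULL;
}

void *thread2(void *arg) {                /* Lock B, then Lock A: opposite order */
    pthread_mutex_lock(&lock_b);
    /* ... work ... */
    pthread_mutex_lock(&lock_a);          /* blocks forever if thread1 already holds A */
    pthread_mutex_unlock(&lock_a);
    pthread_mutex_unlock(&lock_b);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);               /* with unlucky timing, this never returns */
    pthread_join(t2, NULL);
    puts("no deadlock this run");
    return 0;
}

Acquiring the locks in one global order (always A before B) removes the circular wait and hence the deadlock.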
The Dataflow Model
• Can we come up with a parallel program execution
model and a base language such that parallelism is fully
exposed, while the determinacy and deadlock-freedom
properties are ensured if the user is guided to write “well-structured” programs?
• Goal: expose the maximum parallelism in a given piece of code.
• Motivation
– Parallelism by explicit data dependency
– Determinacy
– Deadlock freedom
– Support for high-performance architectures
10/30/2006
ELEG652-06F
12
Dataflow vs. Control Flow
• Dataflow
– Program → graph of operators
– Operator → consumes / produces tokens
– All enabled operators can run concurrently
• Control Flow
– Program → sequence of ops
– Operator → reads and writes data from storage
– Only one operation at a time
• Define: successor
10/30/2006
ELEG652-06F
13
Dataflow Concepts
• Tokens
– Data value with “presence” indication
• Actor
– Takes a set of n inputs and produces a set of m
outputs
– Only “enabled” when all n inputs are available
• Dataflow Graph
– A group of operators / actors that represents a
computational section
– The relationship between each actor
– Controlled by data presence
[Figure: a small dataflow graph built from + actors.]
10/30/2006
ELEG652-06F
14
Dataflow
A Base Language
• More on Dataflow Graphs
– To serve as an intermediate-level language for high-level
languages (Jack B. Dennis)
– To serve as a machine language for parallel machines
(Jack B. Dennis)
– G = (A, E) is a directed graph, where A is a set of
actors and E is a set of directed arcs
• A Proper Graph
– All actors must have arcs of required types
– All arcs must be connected at both ends
10/30/2006
ELEG652-06F
15
Dataflow Model
• Similarities to the DAG
• Dataflow graph can be constructed from
the DAG in a systematic and concise
manner
• Exploit dynamic ordering of data arrival
• Seen in aggressive control flow
implementations
– ROB and Tomasulo
• Add some other actors
10/30/2006
ELEG652-06F
16
Actors
[Figure: the four basic actor types — 1) links, 2) operators, 3) switch and control actors (with T/F control inputs), 4) merge.]
10/30/2006
ELEG652-06F
17
Dataflow Model of Computation
ADD R0, R1, R2
SUB R3, R4, R5
MULT R6, R0, R3
[Figure, initial snapshot: tokens R1 = 1 and R2 = 3 wait on the inputs of the + actor; R4 = 6 and R5 = 4 wait on the inputs of the − actor; the * actor has no input tokens yet.]
10/30/2006
ELEG652-06F
18
Dataflow Model of Computation
ADD R0, R1, R2
SUB R3, R4, R5
MULT R6, R0, R3
[Figure: the − actor fires, 6 − 4 = 2; a token with value 2 now waits at one input of the * actor, while tokens 1 and 3 still wait at the + actor.]
10/30/2006
ELEG652-06F
19
Dataflow Model of Computation
ADD R0, R1, R2
SUB R3, R4, R5
MULT R6, R0, R3
[Figure: the + actor fires, 1 + 3 = 4; tokens 4 and 2 now wait at the inputs of the * actor.]
10/30/2006
ELEG652-06F
20
Dataflow Model of Computation
ADD R0, R1, R2
SUB R3, R4, R5
MULT R6, R0, R3
[Figure: the * actor fires, 4 × 2 = 8; a single result token with value 8 (destined for R6) remains.]
10/30/2006
ELEG652-06F
21
Operational Semantics
Firing Rule
• Tokens → data
• Assignment → placing a token on the
output arc
• Snapshot / configuration: the state
• Computation
– The intermediate step between snapshots /
configurations
• An actor of a dataflow graph is enabled if
there is a token on each of its input arcs
10/30/2006
ELEG652-06F
22
Operational Semantics
Firing Rule
• Any enabled actor may be fired to define the
“next state” of the computation
• An actor is fired by removing a token from each
of its input arcs and placing tokens on each of its
output arcs.
• Computation → a sequence of snapshots
– Many possible sequences as long as firing rules are
obeyed
– Determinacy
– “Locality of effect”
10/30/2006
ELEG652-06F
23
Firing Rules
[Figure: firing rules for links and operators — a link consumes its input token v and copies it onto each of its output arcs; an operator consumes one token from each input arc (v1 … vn) and places result tokens (u1 … un) on its output arcs.]
10/30/2006
ELEG652-06F
24
Firing Rules
[Figure: firing rules for switch and merge actors — a switch carrying data token v forwards it to the T output when its boolean control token is true and to the F output when it is false; a controlled merge with control token T (respectively F) forwards the token on its T (respectively F) data input, leaving the other input token in place.]
10/30/2006
ELEG652-06F
25
General Firing Rules
• A switch actor is enabled if a token is available
on its control input arc, as well as the
corresponding data input arc.
– The firing of a switch actor will remove the input
tokens and deliver the input data value as an output
token on the output arc.
• An (unconditional) merge actor is enabled if there
is a token available on any of its input arcs.
– An enabled (unconditional) merge actor may be fired
and will (non-deterministically) put one of the input
tokens on the output arc.
10/30/2006
ELEG652-06F
26
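To make the firing rule concrete, here is a minimal C sketch (written for these notes, not part of the course material) of a scheduler that repeatedly fires any enabled two-input operator. It runs the ADD/SUB/MULT graph used in this topic; the Actor structure and its fields are illustrative assumptions.

#include <stdio.h>

#define MAX_IN 2

typedef struct Actor Actor;
struct Actor {
    const char *name;
    int n_in;                    /* number of input arcs (fixed at two here)   */
    int present[MAX_IN];         /* token present on each input arc?           */
    double value[MAX_IN];        /* token value on each input arc              */
    double (*op)(double, double);
    Actor *dest;                 /* destination actor of the single output arc */
    int dest_slot;               /* operand slot at the destination            */
};

static double add_op(double a, double b) { return a + b; }
static double sub_op(double a, double b) { return a - b; }
static double mul_op(double a, double b) { return a * b; }

/* Place a token on one input arc of an actor. */
static void send_token(Actor *a, int slot, double v) {
    a->value[slot] = v;
    a->present[slot] = 1;
}

/* An actor is enabled when every input arc holds a token. */
static int enabled(const Actor *a) {
    for (int i = 0; i < a->n_in; i++)
        if (!a->present[i]) return 0;
    return 1;
}

/* Fire: consume the input tokens and produce one output token. */
static void fire(Actor *a) {
    double r = a->op(a->value[0], a->value[1]);
    for (int i = 0; i < a->n_in; i++) a->present[i] = 0;
    if (a->dest) send_token(a->dest, a->dest_slot, r);
    else printf("%s produced %g\n", a->name, r);
}

int main(void) {
    /* The slide example: R0 = R1 + R2, R3 = R4 - R5, R6 = R0 * R3 */
    Actor mult  = { "MULT", 2, {0, 0}, {0, 0}, mul_op, NULL,  0 };
    Actor plus  = { "ADD",  2, {0, 0}, {0, 0}, add_op, &mult, 0 };
    Actor minus = { "SUB",  2, {0, 0}, {0, 0}, sub_op, &mult, 1 };
    Actor *actors[] = { &plus, &minus, &mult };

    send_token(&plus, 0, 1);  send_token(&plus, 1, 3);    /* R1 = 1, R2 = 3 */
    send_token(&minus, 0, 6); send_token(&minus, 1, 4);   /* R4 = 6, R5 = 4 */

    int fired;
    do {                      /* fire enabled actors until the graph is quiescent */
        fired = 0;
        for (int i = 0; i < 3; i++)
            if (enabled(actors[i])) { fire(actors[i]); fired = 1; }
    } while (fired);
    return 0;                 /* prints: MULT produced 8 */
}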
Conditional Expression
if (p(y)){
    f(x,y);
}
else{
    g(y);
}
[Figure: the corresponding dataflow schema — the predicate p(y) produces a boolean that controls switch actors routing x and y either to the f actor (T) or to the g actor (F), with a merge collecting the result.]
10/30/2006
ELEG652-06F
27
A Conditional Schema
[Figure: a general conditional schema with m data inputs and n outputs — a decider D of arity (k, 1) produces the boolean that controls switch actors routing the m inputs into either subschema P(m, n) or Q(m, n); controlled merges produce the n outputs.]
10/30/2006
ELEG652-06F
28
A Loop Schema
[Figure: a loop schema — merge actors (control initially F) admit the initial loop value, the COND actor tests the loop condition, and switch actors either route the value through the loop operator and back to the merges (T) or deliver it to the output (F).]
10/30/2006
ELEG652-06F
29
Snapshots
[Figure: initial snapshot — tokens V1 … Vm on the input arcs of an (m, n) schema that contains no enabled actors; final snapshot — tokens U1 … Un on the output arcs, again with no enabled actors.]
10/30/2006
ELEG652-06F
30
Dataflow Graphs
Well Behaved Graphs
• Dataflow graphs that produce exactly one
set of result values on each output arc for
each set of values presented at the input
arcs
• Self Resetting Graphs
• Determinacy
10/30/2006
ELEG652-06F
31
Dataflow Graph
Well Formed Schemas
• Well-Formed Dataflow Schemas (WFDS):
– An operator is a WFDS
– A conditional schema is a WFDS
– An iterative schema is a WFDS
– An acyclic composition of WFDS is itself a WFDS
• Proposed by Jack B. Dennis and Fossen in 1973
• Theorem: “A well-formed dataflow graph is
well-behaved”
10/30/2006
ELEG652-06F
32
Ill-Formed (“Sick”) Dataflow Graphs
[Figure: example graphs (actors A–N) that are not well behaved, illustrating a hangup, an unclean graph (leftover tokens), a deadlock, and a conflict.]
10/30/2006
ELEG652-06F
33
Well Behaved Program
• Always determinate in the sense that a
unique set of output values is determined
by a set of input values
• References:
Rodriguez, J.E., “A Graph Model of Parallel Computation”, MIT, TR-64, 1966
Patil, S., “Closure Properties of Interconnections of Determinate Systems”, Record of the Project MAC Conf. on Concurrent Systems and Parallel Computation, ACM, 1970, pp. 107–116
Denning, P.J., “On the Determinacy of Schemata”, pp. 143–147
Karp, R.M. & Miller, R.E., “Properties of a Model for Parallel Computations: Determinacy, Termination, Queueing”, SIAM J. Appl. Math., 14(6), Nov. 1966
10/30/2006
ELEG652-06F
34
Topic 6b
Types of Dataflow
10/30/2006
ELEG652-06F
35
Dataflow Models
• Static Dataflow Model
• Tagged Token Dataflow Model
– Also known as dynamic
• Recursive Program Graphs
10/30/2006
ELEG652-06F
36
Static Dataflow Model
• “...for any actor to be enabled, there
must be no tokens on any of its output
arcs...”
10/30/2006
ELEG652-06F
37
Conditional Expression
if (p(y)){
    f(x,y);
}
else{
    g(y);
}
[Figure: the same conditional schema under the static model — with one token allowed per arc, the arcs act as single-slot FIFO buffers feeding p, the switches, f / g, and the merge.]
10/30/2006
ELEG652-06F
38
Example
Power Function
long power(int x, int n){
int y = 1;
for(int i = n; i > 0; --i)
y *= x;
return y;
}
10/30/2006
ELEG652-06F
y = x^n
39
Power Function
[Figure: the dataflow loop schema for power(x, n) — merge actors (control initially F) admit x, i = n, and y = 1; the predicate i > 0 controls switches that either route the values through the loop body (y ← y * x via the * actor, i ← i − 1 via the −1 actor) and back through the merges, or deliver y to the return arc.]
10/30/2006
ELEG652-06F
y = x^n
40
Power Function
[Hand simulation of power(2, 3) on the loop schema (slides 41–56). The loop variables evolve as: i = 3, y = 1 → i = 2, y = 2 → i = 1, y = 4 → i = 0, y = 8. While i > 0, the switches feed y and i through the * and −1 actors and back through the merges; when i > 0 becomes false, y is routed to the return arc and the result token y = 2^3 = 8 is produced.]
10/30/2006
ELEG652-06F
41–56
DFG
Vector Addition
for(i = 0; i < N; ++i)
    c[i] = a[i] + b[i];
[Figure: a dataflow graph for element-wise vector addition — an index token starting at 0 circulates through a loop schema; while the comparison against N keeps the loop going, SELECT actors fetch a[i] and b[i], a + actor adds them, an ASSIGN actor stores the sum into c[i], and a +1 actor increments the index for the next iteration.]
10/30/2006
ELEG652-06F
57
Static Dataflow Model
Features
• One token per arc
• Deterministic merge
• Conditional / iteration construction
• Consecutive iterations of a loop can only be
pipelined
• A dataflow graph → activity templates
– Opcode of the represented instruction
– Operand slots for holding operand values
– Destination address fields
• Token → value + destination
10/30/2006
ELEG652-06F
58
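As a sketch of what an activity template might look like in memory (a minimal illustration for these notes — the field names and sizes are assumptions, not the actual MIT static-dataflow encoding):

typedef struct {
    int dest_template;       /* which activity template receives the value   */
    int dest_slot;           /* which operand slot inside that template      */
} dest_t;

typedef struct {
    int    opcode;           /* operation this template represents           */
    double operand[2];       /* operand slots for incoming values            */
    int    operand_present;  /* bitmask: which slots already hold data       */
    dest_t dest[2];          /* destination address fields for the result    */
    dest_t signal;           /* acknowledgment (signal-back) destination     */
} activity_template_t;

typedef struct {
    double value;            /* the datum being carried                      */
    dest_t destination;      /* token = value + destination                  */
} static_token_t;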
Static Dataflow Model
Features
• Deficiencies:
– Due to acknowledgment tokens, the token traffic is
doubled.
– Lack of support for programming constructs that are
essential to modern programming languages:
– no procedure calls,
– no recursion.
• Advantage:
– simple model
10/30/2006
ELEG652-06F
59
Activity Template
[Figure: activity templates — each template holds an opcode, operand/value slots, and destination fields for its result and for signal-back acknowledgments. The example shows a * template with operands x = 2 and y = 3 whose result is sent to a sqrt template (“next” = sqrt); the sqrt template in turn names its own result destination and signals back to the operations (Opx, Opy) that produced x and y. Token arcs carry data values; communication arcs carry the acknowledgment signals.]
10/30/2006
ELEG652-06F
60
A more Complicated Example
[Figure: activity templates for complex multiplication — four * templates compute a*c, b*d, a*d, and b*c; a − template combines the first pair into a*c − b*d and a + template combines the second pair into a*d + b*c. Each template carries destination fields (M1, M2, N1, N2, next) and signal-back arcs acknowledging the producers of its operands.]
10/30/2006
ELEG652-06F
61
Recursive Program Graphs
• Iteration is outlawed:
– The graph must be acyclic
• One token per arc per invocation
• Iteration is expressed in terms of tail
recursion
10/30/2006
ELEG652-06F
62
Tail Function Application
• Tail-procedure application
– A procedure application that occurs as
the last statement in another procedure
• A tail-function application is a function
application (appearing in the body
expression) whose result value is also
returned as the value of the entire
function
• Consider the role of the stack
10/30/2006
ELEG652-06F
63
Factorial
Normal recursive:
long fact(long n){
    if(n == 0) return 1;
    else return n * fact(n-1);
}
Tail recursive:
long fact_1(long n, long p){
    if(n == 0) return p;
    else return fact_1(n-1, n*p);
}
10/30/2006
ELEG652-06F
64
Factorial
The Normal Version
[Hand simulation of fact(3) on the recursive program graph (slides 65–98). Each “Apply fact” actor copies the graph for the next invocation, leaving a pending multiplication behind:
fact(3) → 3 * fact(2) → 3 * (2 * fact(1)) → 3 * (2 * (1 * fact(0))) → 3 * (2 * (1 * 1)) → 3 * (2 * 1) → 3 * 2 → 6.
While n ≠ 0 the F switch routes n to the −1 actor and a new application of fact; when n reaches 0 the T switch returns the constant 1, and the pending * actors unwind to produce the result token 6.]
10/30/2006
ELEG652-06F
65–98
Factorial
The Tail Recursion Version
[Hand simulation of fact_1(3, 1) on the recursive program graph (slides 99–125). Because the recursive application is the last operation, each invocation simply forwards updated arguments instead of leaving a pending multiplication:
fact_1(3, 1) → fact_1(2, 3) → fact_1(1, 6) → fact_1(0, 6) → 6.
While n ≠ 0 the F switches compute n − 1 and n * p and re-apply fact_1; when n == 0 the T switch returns p directly, so the accumulated product 6 flows back through the chain of applications unchanged.]
10/30/2006
ELEG652-06F
99–125
Recursive Program Graph
Features
• Acyclic
• One token per link per lifetime
• Tags
• No deterministic merge needed
• Recursion is expressed by run-time
copying
• No matching is needed (why?)
10/30/2006
ELEG652-06F
126
Power function as
a Recursive Graph
int rec(int x, int y, int n){
    if(n > 0) return rec(x, x*y, n-1);
    else return y;
}
[Figure: the tail-recursive power function as a recursive program graph — the n > 0 test controls switches on x, y, and n; the T branch computes x * y and n − 1 and feeds them to an apply(Rec) actor, the F branch returns y.]
10/30/2006
ELEG652-06F
127
Power function as
a Recursive Graph
int rec(int x, int n){
    if(n > 0) return x * rec(x, n-1);
    else return 1;
}
[Figure: the non-tail-recursive power function as a recursive graph — the n > 0 test controls switches on x and n; the T branch applies rec to (x, n − 1) and multiplies the returned value by x, the F branch returns the constant 1.]
Note: tail recursion = iteration; i.e., the states of the computation
are captured explicitly by the set of iteration variables.
10/30/2006
ELEG652-06F
128
Dynamic Dataflow
• Static Dataflow
– Only one token per arc
– Problems with Function calls, nested loops and data
structures
– A signal is needed to allow the parent’s operator to
fire
• Dynamic Dataflow
– You can see them as replicating static dataflow
machines
– The MIT tagged token model
10/30/2006
ELEG652-06F
129
The Token in Dynamic Dataflow
Token → tag + value
[v, <u, s>, d]
v: value
u: activation instance
s: destination actor
d: operand slot
Unlike in static dataflow, the token needs the tag.
10/30/2006
ELEG652-06F
130
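A minimal C sketch of a tagged token and of the matching test a tagged-token machine performs (types and field names are illustrative assumptions made for these notes, not the MIT TTDA encoding):

typedef struct {
    int u;                   /* activation (invocation/iteration) instance */
    int s;                   /* destination actor                          */
} tag_t;

typedef struct {
    double value;            /* v                                          */
    tag_t  tag;              /* <u, s>                                     */
    int    d;                /* operand slot at the destination actor      */
} tagged_token_t;

/* Two tokens enable a two-input actor when they carry the same tag
 * but arrive on different operand slots. */
static int tokens_match(const tagged_token_t *a, const tagged_token_t *b)
{
    return a->tag.u == b->tag.u &&
           a->tag.s == b->tag.s &&
           a->d != b->d;
}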
Dynamic Dataflow
• Loops and function calls
– Should be executed in parallel as instances of
the same graph
• Abstract the replication
• Arc → a container holding tokens that
carry different tags
• A node can fire as soon as all tokens with
identical tags are present on its input
arcs
10/30/2006
ELEG652-06F
131
Dynamic Dataflow
• Advantages
– Better performance
– More parallelism
• Disadvantages
– Implementation of the matching unit
– Associative memory would be ideal
• Not cost effective
• Hashing is used instead
10/30/2006
ELEG652-06F
132
Dataflow Memory
The I Structures
• Single Assignment Rule and Complex Data
structures
– Consume the entire data structure after each access
• The Concept of the I Structure
– Only consume the entry on a write
– A data repository that obeys the Single Assignment
Rule
– Written only once, read many times
• Elements are associated with status bits and a
queue of deferred reads
10/30/2006
ELEG652-06F
133
Dataflow
The I Structures
• The structure becomes defined on a write
and it only happens once
– At this moment all deferred reads will be
satisfied
• Use a data structure before it is completely
defined
• Incremental creation and reading of data
structures
10/30/2006
ELEG652-06F
134
Dataflow
The I Structures
• Status
– Present: the element can
be read but not written
– Absent: the element has
not been written yet
(initial state)
– Waiting: at least one read
request has been deferred
while the element is still
unwritten
[Figure: state transitions — a read (r) of an Absent element moves it to Waiting; a write (w) of an Absent or Waiting element moves it to Present (satisfying the deferred reads); reads of a Present element leave it Present; a second write of a Present element is an Error.]
10/30/2006
ELEG652-06F
135
Dataflow
The I Structures
• Elementary
– Allocate: reserves space for a new I-structure
– I-fetch: gets the value of an I-structure element (the read is
deferred if the element has not yet been written)
– I-store: writes a value into the specified I-structure
element
• Used to create construct nodes:
– SELECT
– ASSIGN
10/30/2006
ELEG652-06F
136
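The following is a minimal C sketch (written for these notes) of a single I-structure element with a deferred-read queue; the continuation type and helper names are assumptions made for illustration.

#include <stdio.h>
#include <stdlib.h>

typedef void (*reader_k)(int value);      /* continuation for a deferred read */

typedef enum { ABSENT, WAITING, PRESENT } istate_t;

typedef struct deferred {
    reader_k k;
    struct deferred *next;
} deferred_t;

typedef struct {
    istate_t    state;                    /* status bits              */
    int         value;
    deferred_t *readers;                  /* queue of deferred reads  */
} ielem_t;

/* I-fetch: return the value if present, otherwise defer the read. */
void i_fetch(ielem_t *e, reader_k k) {
    if (e->state == PRESENT) { k(e->value); return; }
    deferred_t *d = malloc(sizeof *d);
    d->k = k;
    d->next = e->readers;
    e->readers = d;
    e->state = WAITING;
}

/* I-store: write once; satisfy every deferred read; a second write is an error. */
void i_store(ielem_t *e, int v) {
    if (e->state == PRESENT) { fprintf(stderr, "error: double write\n"); exit(1); }
    e->value = v;
    e->state = PRESENT;
    while (e->readers) {
        deferred_t *d = e->readers;
        e->readers = d->next;
        d->k(e->value);
        free(d);
    }
}

static void print_reader(int v) { printf("deferred read satisfied: %d\n", v); }

int main(void) {
    ielem_t e = { ABSENT, 0, NULL };
    i_fetch(&e, print_reader);            /* read before write: deferred        */
    i_store(&e, 42);                      /* the write satisfies the deferred read */
    return 0;
}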
Dataflow
The I Structures
[Figure: SELECT and ASSIGN construct nodes — SELECT takes an I-structure A and an index j, computes the address, issues an I-fetch, and delivers the element value x; ASSIGN takes A, j, and a value x, computes the address, and issues an I-store of x into that element.]
10/30/2006
ELEG652-06F
137
Topic 6b
Evolution from “Pure Dataflow”
to
“Hybrid” and “Multithreading”
10/30/2006
ELEG652-06F
138
Evolution of Multithreaded
Execution and Architecture Models
Non-dataflow based:
• CDC 6600 (1964); Flynn’s processor (1969); CHoPP ’77 and CHoPP ’87
• HEP (B. Smith, 1978); Tera (B. Smith, 1990–); Eldorado; CASCADE
• MASA (Halstead, 1986); Alewife (Agarwal, 1989–96)
• Cosmic Cube (Seitz, 1985); J-Machine (Dally, 1988–93); M-Machine (Dally, 1994–98)
• Others: Multiscalar (1994), SMT (1995), etc.
Dataflow model inspired:
• Static Dataflow (Dennis, 1972, MIT); LAU (Syre, 1976)
• MIT TTDA (Arvind, 1980); Manchester (Gurd & Watson, 1982); SIGMA-1 (Shimada, 1988)
• Monsoon (Papadopoulos & Culler, 1988); P-RISC (Nikhil & Arvind, 1989); *T/Start-NG (MIT/Motorola, 1991–)
• Iannucci’s hybrid (1988–92); TAM (Culler, 1990); Cilk (Leiserson); EM-5/4/X (RWC-1, 1992–97)
• Arg-Fetching Dataflow (Dennis & Gao, 1987–88); MDFA (Gao, 1989–93); MTA (Hum, Theobald & Gao, 1994); EARTH (PACT ’95, ISCA ’96, Theobald ’99); CARE (Marquez, 2004)
10/30/2006
ELEG652-06F
139
Von Neumann-type Processing
[Figure: a sequential program (begin … for i = 1 … endfor … end) and its sequential machine representation — a single CPU processor executing it.]
10/30/2006
ELEG652-06F
140
A Multi-Threaded Architecture
[Figure: one multithreaded PE holding many thread contexts, connected to the other PEs.]
10/30/2006
ELEG652-06F
141
McGill Data Flow Architecture Model
(MDFA)
[Figure: the argument-flow principle versus the argument-fetching principle — under argument flow, node n1 stores its result directly into the operand slots of its successors n2 and n3; under argument fetching, n1 stores its result once and n2 and n3 fetch it when they execute.]
10/30/2006
ELEG652-06F
142
A Dataflow Program Tuple
Program Tuple ::= {P-Code, S-Code}
P-Code:
n1: x = a + b
n2: y = c - d
n3: z = x * y
S-Code:
[Figure: the signal graph handled by the ISU while the IPU executes the P-code — nodes n1 and n2 signal node n3, each node carrying its enable/reset counts.]
10/30/2006
ELEG652-06F
143
The McGill Dataflow Architecture Model
[Figure: the McGill Dataflow Architecture — a PIPU (pipelined instruction processing unit) coupled to a DISU (dataflow instruction scheduling unit) containing the enable memory and controller plus signal processing; the DISU issues “fire” signals to the PIPU and receives “done” signals back.]
10/30/2006
ELEG652-06F
144
The McGill Dataflow Processor
[Figure: the McGill dataflow processor — the DISU holds waiting instructions and marks enabled ones (each acting like a PC); enabled instructions are fired into the PIPU pipeline and report “done” when finished.]
Important feature: the pipeline can be kept fully utilized provided that the program has sufficient parallelism.
10/30/2006
ELEG652-06F
145
The Scheduling Memory (enable)
[Figure: the DISU scheduling (enable) memory — one bit per instruction (1 = the instruction is enabled, 0 = not enabled) plus signal counts; the controller processes incoming “done”/signal messages, updates the counts, and issues “fire” for instructions that become enabled.]
10/30/2006
ELEG652-06F
146
Advantages of the McGill Dataflow
Architecture Model
• Eliminate unnecessary token copying and
transmission overhead
• Instruction scheduling is separated from
the main datapath of the processor
10/30/2006
ELEG652-06F
147
Von Neumann Threads as
Macro Dataflow Nodes
• A sequence of instructions (1, 2, …, k)
is “packed” into a macro-dataflow node
• Synchronization is done at
the macro-node level
[Figure: k sequential instructions grouped into a single macro node.]
10/30/2006
ELEG652-06F
148
Hybrid Evaluation Von Neumann style Instruction
Execution on the McGill Dataflow Architecture
• Group a “sequence” of dataflow instruction into a
“thread” or a macro dataflow node.
• Data-driven synchronization among threads.
• “Von Neumann style sequencing” within a thread.
Advantage:
Preserves the parallelism among threads but
avoids unnecessary fine-grain synchronization
between instructions within a sequential thread.
10/30/2006
ELEG652-06F
149
What Do We Get?
• A hybrid architecture model
without sacrificing the advantage
of fine-grain parallelism!
(latency-hiding, pipelining support)
10/30/2006
ELEG652-06F
150
A Realization of the Hybrid Evaluation
[Figure: a realization of hybrid evaluation — a thread of instructions 1 … k, each carrying a “von Neumann bit”; within the thread the PIPU follows the short-cut path from one instruction to the next, while fire/done signals are exchanged with the DISU only at thread boundaries.]
10/30/2006
ELEG652-06F
151
Topic 6c
Multithreaded Execution Model,
Architecture and System
10/30/2006
ELEG652-06F
152
Challenges: The “Killer Latency
Problem”
[Figure: two nodes, each with a processor (P), cache (C), memory (M), and network interface (NI), connected through a network. Latency is due to communication and synchronization.]
10/30/2006
ELEG652-06F
153
Low “Round-trip” Latency
• Very important to many parallel applications
• Solutions?
– Minimize communication and synchronization cost
– Fully utilize available communication bandwidth to
hide latency
• A good multithreaded execution and architecture
model help both
10/30/2006
ELEG652-06F
154
Data Parallel Models
• Difficult to write
unstructured
programs
– Convenient only for
problems with regular,
structured parallelism
• Limited composability!
– An inherent limitation of
single threading
[Figure: the single-threaded data-parallel pattern of alternating compute and communicate phases, and the difficulty of composing two such programs.]
10/30/2006
ELEG652-06F
155
Coarse-Grain vs.
Fine-Grain Multithreading
[Figure: coarse-grain versus fine-grain multithreading — in both, a CPU with memory contains a thread unit; under coarse-grain multithreading the executor locus stays within a single thread, while under fine-grain multithreading it moves across a pool of threads.]
10/30/2006
ELEG652-06F
156
Evolution of Multithreaded Execution &
Architecture Models Based on Dataflow
MIT dataflow (1970s) →
• Arg-Fetching Dataflow (Dennis & Gao, McGill, 1987–88) →
• MDFA / Super Actor (Gao, McGill; Hum ’93) →
• MTA (Hum, Theobald & Gao, McGill, 1994) →
• EARTH (Theobald, McGill/Delaware, 1999) →
• CARE (Andres, Delaware, 2004) and HTVM (Gao et al., Delaware, 2005)
10/30/2006
ELEG652-06F
157
Topic 6c
EARTH:
An Efficient Architecture
for Running THreads
10/30/2006
ELEG652-06F
158
Open Issues
• Can a multithreaded program execution
model support high scalability for large-scale parallel computing while maintaining
uniformly high processing efficiency?
• If so, can this be achieved without exotic
hardware support?
10/30/2006
ELEG652-06F
159
The EARTH Program Execution Model
• What is a thread?
• How is the state of a thread represented?
• How is a thread enabled?
10/30/2006
ELEG652-06F
160
What is a Thread?
• A parallel function invocation
(threaded function invocation)
• A code sequence defined (by a user or a compiler)
to be a thread
• Usually, a function body may be partitioned into
several threads
10/30/2006
ELEG652-06F
161
The Fibonacci Example
int fib (long n){
    int sum1;
    int sum2;
    if (n < 2) {
        return (1) ;
    }
    else {
        sum1 = fib(n-1);
        sum2 = fib(n-2);
        return (sum1 + sum2);
    }
}
[Figure: the sequential call stack for fib(4) — frames for fib(4), fib(3), and fib(2), each holding n, sum1, sum2, and a saved PC.]
The state of a function invocation is <fp, ip>:
fp: a frame pointer to its own frame
ip: an instruction pointer to its own PC
10/30/2006
ELEG652-06F
162
Execution of Fibonacci
Exploitation of Parallelism
[Figure: the recursive call tree of fib(6) — fib(5) and fib(4) at the first level, their fib(4)/fib(3)/fib(2) children below, and so on down to fib(1) and fib(0) leaves; calls in different subtrees can be invoked in parallel.]
10/30/2006
ELEG652-06F
163
Parallel Function Invocation
[Figure: a tree of “activation frames” — the frame for fib n holds SYNC slots, local variables, and its caller’s <fp, ip>, and links to child frames for fib n-1 and fib n-2, which in turn link to their own children (fib n-2, fib n-3, …).]
10/30/2006
ELEG652-06F
164
Stack and Activation Frames
[Figure: sequential stack frames (local variables only) versus EARTH activation frames (synchronization slots plus local variables), organized as a tree of frames rather than a stack.]
10/30/2006
ELEG652-06F
165
An Example
int f(int *x, int i, int j)
{
int a, b, sum, prod, fact;
int r1, r2, r3;
a = x[i];
fact = 1;
fact = fact * a;
b = x[j];
sum = a + b;
prod = a * b;
r1 = g(sum);
r2 = g(prod);
r3 = g(fact);
return(r1 + r2 + r3);
}
10/30/2006
ELEG652-06F
166
The Example
Four Fibers
Fiber-0:
a = x[i];
fact = 1;
Fiber-1:
fact = fact * a;
b = x[j];
Fiber-2:
sum = a + b;
prod = a * b;
r1 = g(sum);
r2 = g(prod);
r3 = g(fact);
Fiber-3:
return (r1 + r2 + r3);
[Figure: the dependence graph among the four fibers, annotated with synchronization counts.]
10/30/2006
ELEG652-06F
167
Fiber
States
• A fiber shares its “enclosing frame” with the other
fibers within the same function invocation
• The state of a fiber includes
– its instruction pointer
– its “temporary register set”
• A fiber is “ultra-lightweight”: it does not need
dynamic storage (frame) allocation.
10/30/2006
ELEG652-06F
168
The Fiber Execution Model
[Figure: a Multithreaded Program Graph (MPG) — “fiber” actors annotated with synchronization-slot counts (e.g., 2 2, 1 2, 2 4), connected by signal tokens, data tokens, and locality tokens.]
10/30/2006
ELEG652-06F
169
EARTH Fiber Firing Rule
• A fiber in an MPG becomes enabled if it has received all of its
input signals;
• An enabled fiber may be selected for execution when
the required hardware resources are allocated;
• When a fiber finishes its execution, signals are sent to its
destination fibers in the MPG, updating the
corresponding synchronization slots.
10/30/2006
ELEG652-06F
170
Fiber States
[Figure: fiber state diagram — a newly created thread starts DORMANT; once all synchronization signals are received it becomes ENABLED; when a CPU is ready it becomes ACTIVE; on completion it either terminates or returns to DORMANT.]
10/30/2006
ELEG652-06F
171
The EARTH Model of Computation
[Figure: the EARTH model of computation — fibers within frames, parallel function invocations, sync operations between fibers, and invocations of threaded functions.]
10/30/2006
ELEG652-06F
172
EARTH Multithreaded
Architecture Model
[Figure: the EARTH multithreaded architecture model — each PE contains an execution unit (EU), a synchronization unit (SU), and local memory; the PEs are connected by a network.]
10/30/2006
ELEG652-06F
173
The EARTH Operation Set
• The base operation
• Thread synchronization and scheduling ops
SPAWN, SYNC
• Split-phase data & sync ops
GET-SYNC, DATA_SYNC
• Threaded function invocation and load balancing ops
INVOKE, TOKEN
10/30/2006
ELEG652-06F
174
Topic 6d
Programming Models
for
Multithreaded Architectures:
The EARTH Threaded-C Experience
10/30/2006
ELEG652-06F
175
EARTH-MANNA Testbed
[Figure: the EARTH-MANNA testbed — the same organization (EU, SU, and local memory per PE) with the PEs connected by a network.]
10/30/2006
ELEG652-06F
176
Latency Tolerance and Management
Features of Threaded Programming
• Thread partition
– Thread length vs. useful parallelism
– Where to “cut”?
• Split-phase synchronization and communication
• Parallel threaded function invocation
• Dynamic load balancing
10/30/2006
ELEG652-06F
177
Table 1
EARTH Instruction Set
• Basic instructions:
Arithmetic, Logic and Branching
typical RISC instructions, e.g., those from the i860
• Thread Switching
FETCH_NEXT
• Synchronization
SPAWN fp, ip
SYNC fp, ss_off
INIT_SYNC ss_off, sync_cnt, reset_cnt, ip
INCR_SYNC fp, ss_off, value
10/30/2006
ELEG652-06F
178
Table 1
EARTH Instruction Set
• Data Transfer & Synchronization
DATA_SPAWN value, dest_addr, fp, ip
DATA_SYNC value, dest_addr, fp, ss_off
BLOCKDATA_SPAWN src_addr, dest_addr, size, fp, ip
BLOCKDATA_SYNC src_addr, dest_addr, size, fp, ss_off
• Split_phase Data Requests
GET_SPAWN src_addr, dest_addr, fp, ip
GET_SYNC src_addr, dest_addr, fp, ss_off
GET_BLOCK_SPAWN src_addr, dest_addr, size, fp, ip
GET_BLOCK_SYNC src_addr, dest_addr, size, fp, ip
• Function Invocation
INVOKE dest_PE, f_name, no_params, params
TOKEN f_name, no_params, params
END_FUNCTION
10/30/2006
ELEG652-06F
179
Threaded-C
A Base-Language
• To serve as a target language for high-level
language compilers
• To serve as a machine language for the
EARTH architecture
10/30/2006
ELEG652-06F
180
The Role of Threaded-C
[Figure: the role of Threaded-C — users write C or Fortran; high-level language translation produces Threaded-C, and the Threaded-C compiler maps it onto the EARTH platforms.]
10/30/2006
ELEG652-06F
181
Parallel Function Invocation
[Figure: the tree of activation frames for fib again, this time showing the links between frames — each frame holds SYNC slots, local variables, and its caller’s <fp, ip>.]
10/30/2006
ELEG652-06F
182
The fib Example
[Threaded function fib(n, result, done):]
0 0  if( n < 2 )
         DATA_RSYNC(1, result, done);
     else{
         TOKEN(fib, n-1, &sum1, slot_1);
         TOKEN(fib, n-2, &sum2, slot_1);
     }
     END_THREAD();
2 2  THREAD_1:
         DATA_RSYNC(sum1 + sum2, result, done);
     END_THREAD();
END_FUNCTION
10/30/2006
ELEG652-06F
183
The inner product Example
[Threaded function inner(a, b, result, done):]
0 0  BLKMOV_SYNC(a, row_a, N, slot_1);
     BLKMOV_SYNC(b, column_b, N, slot_1);
     sum = 0;
     END_THREAD();
2 2  THREAD_1:
         for(i = 0; i < N; ++i)
             sum = sum + (row_a[i] * column_b[i]);
         DATA_RSYNC(sum, result, done);
     END_THREAD();
END_FUNCTION
10/30/2006
ELEG652-06F
184
Matrix Multiply
void main ( )
{
    int i, j, k;
    float sum;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            sum = 0;
            for (k = 0; k < N; k++)
                sum = sum + a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
Sequential Version
10/30/2006
ELEG652-06F
185
The Matrix Multiply Example
[Calls the threaded function inner:]
0 0    for(i = 0; i < N; ++i)
           for(j = 0; j < N; ++j){
               row_a = &a[i][0];
               column_b = &b[0][j];
               TOKEN(inner, &c[i][j], row_a, column_b,
                     slot_1);
           }
N2 N2  THREAD_1:
       RETURN ( );
       END_THREAD();
END_FUNCTION
10/30/2006
ELEG652-06F
186
Topic 6e
Transactional Memory
An Overview
10/30/2006
ELEG652-06F
187
Transactional Memory
• Coming from the database world
• An all-or-none scheme
• A group of operations (of arbitrary size) is
considered a transaction
– A transaction is atomic
– Get data, operate, commit
• At commit: if the memory cell(s) have not
been modified, write your results to memory
– If a modification has taken place, discard your results
and try again
10/30/2006
ELEG652-06F
188
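To show the “get data, operate, commit” pattern in code, here is a sketch written against a hypothetical word-based transactional-memory API (tm_begin, tm_read, tm_write, and tm_commit are assumed names for illustration, not a real library):

/* Hypothetical TM runtime API (assumed, for illustration only). */
void tm_begin(void);
long tm_read(long *addr);
void tm_write(long *addr, long value);
int  tm_commit(void);                        /* nonzero on successful commit */

typedef struct account { long balance; } account_t;

void transfer(account_t *from, account_t *to, long amount)
{
    for (;;) {                               /* retry until the transaction commits   */
        tm_begin();                          /* start a transaction                   */
        long f = tm_read(&from->balance);    /* read set: from->balance, to->balance  */
        long t = tm_read(&to->balance);
        tm_write(&from->balance, f - amount);/* write set: from->balance, to->balance */
        tm_write(&to->balance,   t + amount);
        if (tm_commit())                     /* all or none                           */
            return;
        /* commit failed: another transaction touched our read/write sets;
           the buffered results are discarded and the transaction retries. */
    }
}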
Final Side Note
A Review of LL and SC
• Instructions provided by PowerPC and many other
architectures
• Provide a way to optimistically execute a
piece of code
• In case a “violation” has taken place,
discard your results
• Many implementations
– PowerPC: lwarx and stwcx.
10/30/2006
ELEG652-06F
189
Final Side Note
The LL and SC behavior
• The lwarx instruction
– Loads a word-aligned
location
– Side effects:
• A reservation is created
• The storage coherence
mechanism is notified
that a reservation exists
• The stwcx. instruction
– Conditionally stores a
word to a given
memory location
• Conditionally →
depends on the
reservation
– On success, all changes
are committed to
memory
– On failure, the changes are
discarded.
10/30/2006
ELEG652-06F
190
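As a sketch of how lwarx/stwcx. are used in practice, here is an atomic-add loop in C with GCC-style PowerPC inline assembly (a minimal illustration for these notes; check the exact constraints against your toolchain before relying on it):

static inline int atomic_add(int *addr, int delta)
{
    int old, tmp;
    __asm__ __volatile__(
        "1: lwarx   %0,0,%2\n\t"   /* LL: load *addr and create a reservation     */
        "   add     %1,%0,%3\n\t"  /* tmp = old + delta                           */
        "   stwcx.  %1,0,%2\n\t"   /* SC: store tmp only if the reservation holds */
        "   bne-    1b"            /* reservation lost -> branch back and retry   */
        : "=&r" (old), "=&r" (tmp)
        : "r" (addr), "r" (delta)
        : "cc", "memory");
    return old;                    /* value observed before the update            */
}

This is exactly the optimistic pattern of the slides: the reservation plays the role of a one-word read set, and the conditional store either commits the update or discards it and retries.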
Final Side Note
Reservations
• At most one per processor
• A reservation is lost when
– The processor holding the reservation executes
• A lwarx or ldarx
• A stwcx. or stdcx. (no matter whether the reservation matches or not)
– Another processor executes
• A store or a dcbz to the granule
– Some other mechanism modifies a storage location in the same
reservation granule
• Interrupts do not clear reservations
– But interrupt handlers might
• Granularity
– The length of the memory block kept under surveillance
10/30/2006
ELEG652-06F
191
Final Side Note
Examples
[Hand simulation of two processors updating a shared variable a with LL/SC (slides 192–201):
• Both processors LL a and obtain reservations; one intends a *= 100, the other a += 100.
• A plain store (a = 100) hits the reservation granule, so both reservations are lost and both SC attempts fail; the brnz (branch on SC failure) instructions send both processors back to retry.
• Both re-load a = 100. The a += 100 processor’s SC succeeds first, committing a = 200 and invalidating the other reservation; the a *= 100 processor’s SC fails again.
• It re-loads a = 200, computes 200 * 100, and its SC now succeeds: memory ends up with a = 20000.]
10/30/2006
ELEG652-06F
192–201
Final Side Note
LL / SC Disadvantages
• Only works on memory cells within a single reservation granule
• Cannot target different memory cells at the
same time
• Since at most one reservation can be held
by a processor, nesting is out of the
question
10/30/2006
ELEG652-06F
202
Transactional Memory
• A similar concept to that of LL and SC
• Based on the concept of a transaction:
– A group of instructions is executed atomically
with respect to other transactions
– The memory affected might be of different
sizes or distributed across the system
– A transaction will commit or abort depending
on the memory state
[Figure: transaction life cycle — get the read/write sets, do the operations, validate; on success perform an atomic write-back, otherwise abort and retry.]
10/30/2006
ELEG652-06F
203
Transactional Memory
• More on the concept of a transaction:
– Transactions run in isolation
• No side effects are visible to the outside world
– Transaction properties
• Atomicity: all or none
• Serializability: transactions appear to execute one after the
other, in the same order for all who observe them
(can be weakened)
10/30/2006
ELEG652-06F
204
Transactions
Scalability
• Multiple readers
– Not allowed by “normal” locks
• Exception: reader–writer locks
– Transactions “naturally” allow multiple readers
• Concurrent access to disjoint data
– Normally: the programmer’s responsibility, via fine-grain locks
– Transactions allow (given enough hardware
resources) concurrent access to disjoint data
10/30/2006
ELEG652-06F
205
Transactional Memory
• Atomicity and isolation
– Two basic properties of any implementation
– Data versioning
• Memory ops:
– Unprotected reads and writes
– Transactions
• Strong atomicity
– Any non-transactional write op will produce a violation
– Any non-transactional read will see the whole transaction or none of it
• Weak atomicity
– Only a transaction’s writes are considered when producing violations
– A non-transactional memory read may see a partial set of the
uncommitted transaction’s writes
– Conflict detection
• Detect read–write and write–write conflicts
10/30/2006
ELEG652-06F
206
Transactional Memory
• Strongest transactional model: found in DBMS
databases
– Called ACID
• Atomicity
• Consistency
• Isolation
• Durability
• Implicit versus explicit
– A programming-language-centric distinction
• Providing a collection of low-level constructs or function calls
→ explicit
• Providing a general “abstraction” for transactions → implicit
10/30/2006
ELEG652-06F
207
Transactional Memory
Data Versioning
• Management of data from new and old
transactions
• Eager
– Memory Rollback
– Adv: Faster Commits and direct reads
– Cons: Slower aborts, no fault tolerance
• Lazy
– Buffer Rollback
– Adv: Faster Abort, fault tolerance
– Cons: Slow commits, indirect reads
10/30/2006
ELEG652-06F
208
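A compact C sketch of the two policies (data structures and names are assumptions made for illustration): eager versioning updates memory in place and keeps an undo log, so commits are cheap and aborts must roll memory back; lazy versioning buffers the new values, so aborts are cheap while commits must write the buffer back (and reads must check the buffer first).

#define MAX_ENTRIES 64

/* Eager versioning: write memory in place, log the old value for rollback. */
typedef struct { long *addr; long old; } undo_entry_t;
typedef struct { undo_entry_t e[MAX_ENTRIES]; int n; } undo_log_t;

void eager_write(undo_log_t *log, long *addr, long v) {
    log->e[log->n].addr = addr;
    log->e[log->n].old  = *addr;        /* remember the old value            */
    log->n++;
    *addr = v;                          /* memory already holds the new one  */
}
void eager_commit(undo_log_t *log) { log->n = 0; }            /* fast: drop the log       */
void eager_abort(undo_log_t *log) {                           /* slower: roll memory back */
    while (log->n > 0) { log->n--; *log->e[log->n].addr = log->e[log->n].old; }
}

/* Lazy versioning: buffer the writes; memory is untouched until commit. */
typedef struct { long *addr; long val; } buf_entry_t;
typedef struct { buf_entry_t e[MAX_ENTRIES]; int n; } write_buf_t;

void lazy_write(write_buf_t *b, long *addr, long v) {
    b->e[b->n].addr = addr;
    b->e[b->n].val  = v;                /* memory still holds the old value  */
    b->n++;
}
void lazy_commit(write_buf_t *b) {                            /* slower: copy the buffer out */
    for (int i = 0; i < b->n; i++) *b->e[i].addr = b->e[i].val;
    b->n = 0;
}
void lazy_abort(write_buf_t *b) { b->n = 0; }                 /* fast: just discard */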
Transactional Memory
Data Versioning
Eager Versioning Example
[Figure: begin — memory holds a = 10 and the transaction logs the old value 10; write — memory is updated in place to a = 15; commit — the log is discarded and memory keeps a = 15; abort — the logged value is copied back and memory returns to a = 10.]
10/30/2006
ELEG652-06F
209
Transactional Memory
Data Versioning
Lazy Versioning Example
[Figure: begin — memory holds a = 10; write — the new value 15 goes into the transaction’s buffer while memory still holds a = 10; commit — the buffer is written back and memory becomes a = 15; abort — the buffer is discarded and memory keeps a = 10.]
10/30/2006
ELEG652-06F
210
Transactional Memory
Conflict Detection
• Read and Write Sets
– The read set contains all the variables that are
only read throughout the transaction
– The write set contains all the variables that are
written throughout the transaction
10/30/2006
ELEG652-06F
211
Transactional Memory
Conflict Detection
• A conflict occurs when
– The read set of one transaction has a non-empty
intersection with the write set of another concurrent
transaction, or
– The write set of one transaction has a non-empty
intersection with the write set of another concurrent
transaction
10/30/2006
ELEG652-06F
212
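A minimal C sketch of the conflict test between two concurrent transactions, using plain address arrays as the read/write sets (names and sizes are illustrative assumptions):

#include <stdbool.h>

typedef struct {
    const void *addr[64];   /* addresses touched by the transaction */
    int n;
} addr_set_t;

static bool intersects(const addr_set_t *a, const addr_set_t *b) {
    for (int i = 0; i < a->n; i++)
        for (int j = 0; j < b->n; j++)
            if (a->addr[i] == b->addr[j]) return true;
    return false;
}

/* Two concurrent transactions conflict if one's write set touches the
 * other's read set (read–write conflict) or their write sets overlap
 * (write–write conflict). */
bool conflict(const addr_set_t *rd0, const addr_set_t *wr0,
              const addr_set_t *rd1, const addr_set_t *wr1) {
    return intersects(wr0, rd1) || intersects(wr1, rd0) ||
           intersects(wr0, wr1);
}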
Transactional Memory
Conflict Detection
• Pessimistic detection
– Conflicts are detected and resolved during reads and writes
• Through coherence checks, or locks and version numbers
– A contention manager resolves conflicts
• Stall or abort
– Pros
• Detects conflicts early
– Stalls instead of aborts (in some cases)
– Cons
• No guarantee of forward progress
• Issues with locks and fine-grain communication
10/30/2006
ELEG652-06F
213
Transactional Memory
Conflict Detection
Pessimistic Conflict Detection
[Figure: timelines for two transactions T0 and T1, with a check on every read and write — a success case (disjoint accesses, both commit), an early-detect case (the conflict is caught at the access and one transaction stalls until the other commits), an abort case (a conflicting access forces an abort and restart), and a no-progress case (the transactions repeatedly conflict and abort each other).]
10/30/2006
ELEG652-06F
214
Transactional Memory
Conflict Detection
• Optimistic Detection
– Detect conflicts at commit
• Compare the to-be-committed write set against other read
sets
– To-be-committed write will always succeed, but may cause
others to fails
• Validate write and read sets using locks or version numbers
• Pros
– Forward progress ensured
– Potentially less conflicts
• Cons
– Conflicts are detected late
10/30/2006
ELEG652-06F
215
Transactional Memory
Conflict Detection
Optimistic Conflict Detection
[Figure: timelines for two transactions T0 and T1 with all checks performed at commit — a success case (disjoint accesses, both commit), an abort case (the committing transaction invalidates the other’s read set, which aborts, re-executes, and then commits), and a forward-progress case (even under repeated conflicts the committing transaction always wins, so progress is guaranteed).]
10/30/2006
ELEG652-06F
216
Transactional Memory
Conflict Detection
• Granularity
– Object
• Pro: reduces overhead; closer to the
programming model
• Con: false sharing
– Word
• Pro: no false sharing
• Con: more overhead
– Cache line
• Pro: a compromise between word and object
10/30/2006
ELEG652-06F
217
Transactional Memory
Nested Transactions
• Transactions inside transactions
– Allow composability with transactions
running in library calls or function calls
– Allow multiple transactions to run inside a
given transaction
• Remember that the transaction should appear
atomic only to other transactions
• Do not impose restrictions on the operations inside
the transactions
– DMA transfers, thread creation and operations, etc.
10/30/2006
ELEG652-06F
218
Transactional Memory
Nested Transactions
• Closed nested transactions
– The inner transaction’s commit stage:
• On success → merge with the parent and let the parent commit
the changes
• On failure → roll back inside the parent (or abort)
– Read and write sets may be disjoint from the parent’s
– Only the outermost transaction actually commits
– A child transaction may fail while the outer one still
succeeds
– Alternative execution paths!!!!
10/30/2006
ELEG652-06F
219
Transactional Memory
Nested Transactions
• Open nested transactions
– The inner transaction’s commit:
• On success → update memory AND merge with the parent
• On failure → abort and roll back inside the parent
– SHOCK!!!! HORROR!!!! ATOMICITY IS BROKEN!!!!
• If the write sets are not disjoint
– Moreover, if the parent fails, a rollback
mechanism must be provided to roll back the
children transactions’ (already committed) updates
10/30/2006
ELEG652-06F
220
Transactional Memory
Nested Transactions
[Figure: open vs. closed nesting — in both cases the parent transaction has write set {A, B, C} and the nested child reads A and has write set {A, D}. With open nesting the child commits {A, D} to memory itself and also merges with the parent, which later commits; with closed nesting the child only merges {A, D} into the parent, and the final commit makes {A, B, C, D} visible at once.]
10/30/2006
ELEG652-06F
221
Transactional Memory
Types
• Hardware, software and hybrid
– An implementation-dependent classification
– Conflict detection and data versioning differ
across implementations
• Hardware
– Conflict detection → through the cache
coherence protocol
– Data versioning → in cache lines
– High performance plus binary compatibility
10/30/2006
ELEG652-06F
222
Transactional Memory
Types
• Software
– Translation of programming constructs
– Runtime and compiler support
– Lower performance
– A better abstraction than locks for fine-grain constructs
– Data versioning: object granularity
– Conflict detection: locks and/or version numbers;
runtime data structures
• Hybrid
– A combination of the above approaches
10/30/2006
ELEG652-06F
223
Transactional Memory
[Table omitted: a summary of transactional memory systems and their main characteristics, courtesy of Carlstrom et al., “The ATOMOS Transactional Programming Language”, PLDI 2006.]
Legend:
“Programming Language” → the system provides programming constructs that support transactions, rather than library calls.
“Multiprocessor” → the main programming model is a multithreaded environment, rather than a uniprocessor one.
10/30/2006
ELEG652-06F
224
Bibliography
• Theobald, Kevin. “EARTH: An Efficient Architecture for Running Threads.” PhD thesis, McGill University, Quebec, Canada, May 1999.
• Carlstrom, Brian; McDonald, Austen; Chafi, Hassan; Chung, JaeWoong; Minh, Chi Cao; Kozyrakis, Christos; Olukotun, Kunle. “The ATOMOS Transactional Programming Language.” Computer Systems Laboratory, Stanford University, PLDI 2006.
• Herlihy, Maurice; Moss, J. E. B. “Transactional Memory: Architectural Support for Lock-Free Data Structures.” Proceedings of the Twentieth Annual International Symposium on Computer Architecture, 1993.
• Kozyrakis, Christos. “Transactional Memory Tutorial.” PACT 2006.
10/30/2006
ELEG652-06F
225