The Implementation of the
Cilk-5 Multithreaded Language
(Frigo, Leiserson, and Randall)
Alistair Dundas
Department of Computer Science
University of Massachusetts
UNIVERSITY OF MASSACHUSETTS, AMHERST – Department of Computer Science
Outline







What is Cilk?
Cilk example: the Fibonacci algorithm.
The work-first principle.
Work Stealing.
The T.H.E. Protocol.
Empirical results.
Summary and questions.
What Is Cilk?



Extension of C for parallel programming.
Designed for SMP machines with support for shared
memory.
Benefits:




Provably efficient work stealing scheduler.
Clean programming model.
Benefits over normal thread programming: discussion
topic!
Specifically: Source to source compiler generating C.
Example: Fibonacci Algorithm
int fib (int n)
{
    if (n<2) return n;
    else {
        int x, y;
        x = fib (n-1);
        y = fib (n-2);
        return (x+y);
    }
}

int main (int argc, char *argv[])
{
    int n, result;
    n = atoi(argv[1]);
    result = fib(n);
    printf("Result:%d\n", result);
    return 0;
}
Example: Fibonacci In Parallel
cilk int fib (int n)
{
    if (n<2) return n;
    else {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}

cilk int main (int argc, char *argv[])
{
    int n, result;
    n = atoi(argv[1]);
    result = spawn fib(n);
    sync;
    printf("Result:%d\n", result);
    return 0;
}
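The parallel version differs from the serial one only in the three keywords cilk, spawn, and sync. A minimal sketch of that relationship, assuming the keywords are simply defined away as empty macros (the macros are an illustration device, not part of Cilk itself), gives a plain C program that computes the same result:

```c
#include <assert.h>

/* Assumption for illustration: erase the Cilk keywords so that an
   ordinary C compiler sees the serial form of the program. */
#define cilk
#define spawn
#define sync

cilk int fib(int n)
{
    assert(n >= 0);
    if (n < 2) return n;
    else {
        int x, y;
        x = spawn fib(n - 1);   /* becomes an ordinary C call */
        y = spawn fib(n - 2);   /* becomes an ordinary C call */
        sync;                   /* expands to an empty statement */
        return (x + y);
    }
}
```

Calling fib(10) on this erased version returns 55, the same value the serial slide computes.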
Source to Source Compiler
The Work First Principle



Work is the amount of time needed to
execute the computation serially.
Critical path length is the execution time
on an infinite number of processors.
The Work-First Principle: Minimize
scheduling overhead borne by work at the
expense of increasing the critical path.
Theory: The Work First Principle

Where $T_P$ is the time on $P$ processors, $T_1$ is the work, and $T_\infty$ is the critical path length:
$T_P = T_1/P + O(T_\infty)$ (1)

Making critical path overhead explicit:
$T_P \le T_1/P + c_\infty T_\infty$ (2)

Define average parallelism (max speedup):
$P_{AVERAGE} = T_1/T_\infty$

Define parallel slackness:
$P_{AVERAGE}/P$
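To make work, critical path, and average parallelism concrete, here is a small sketch for the Fibonacci example. It assumes a hypothetical cost model in which every invocation of fib costs one time unit; the function names are invented for this illustration and do not come from the paper.

```c
#include <assert.h>

/* Hypothetical cost model: each invocation of fib costs one unit. */

/* Work (T1): total number of invocations made by fib(n). */
long fib_work(int n)
{
    assert(n >= 0);
    if (n < 2) return 1;
    return fib_work(n - 1) + fib_work(n - 2) + 1;
}

/* Critical path (T_inf): longest chain of dependent invocations. */
long fib_span(int n)
{
    assert(n >= 0);
    if (n < 2) return 1;
    long a = fib_span(n - 1);
    long b = fib_span(n - 2);
    return (a > b ? a : b) + 1;
}

/* Average parallelism: T1 / T_inf. */
double fib_parallelism(int n)
{
    return (double)fib_work(n) / (double)fib_span(n);
}
```

Under this model fib(10) has work 177 and critical path 10, so its average parallelism is 17.7. Work grows exponentially with n while the critical path grows linearly, which is why the slackness assumption is plausible for this program.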
The Work First Principle (cont)

Assumption of parallel slackness:
$P_{AVERAGE}/P \gg c_\infty$

Combining these with the inequality, we get:
$T_P \approx T_1/P$

Define work overhead, where $T_S$ is the running time of the serial C version:
$c_1 = T_1/T_S$
$T_P \approx c_1 T_S/P$

Conclusion: Minimize work overhead.
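The steps on this slide can be written out as one chain. This is a sketch of the reasoning only, writing $T_\infty$ for the critical path length, $c_\infty$ for the constant in inequality (2), and $T_S$ for the serial running time:

```latex
\begin{align*}
T_P &\le \frac{T_1}{P} + c_\infty T_\infty
    && \text{inequality (2)} \\
\frac{T_1/T_\infty}{P} \gg c_\infty
    &\;\Longrightarrow\; \frac{T_1}{P} \gg c_\infty T_\infty
    && \text{parallel slackness} \\
T_P &\approx \frac{T_1}{P} = \frac{c_1 T_S}{P}
    && \text{since } c_1 = T_1/T_S
\end{align*}
```

Once the critical path term is negligible, the only factor left that the implementation can influence is $c_1$, hence the conclusion to minimize work overhead.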
Work Stealing Algorithm



Each worker keeps a ready deque (double ended
queue) of procedure instances waiting to run.
Workers treat the deque as a stack, pushing and
popping procedure calls on to the end.
When workers have no more work, they steal from
the front of another worker's deque.


Parents are stolen before children.
This is implemented using two versions of each
procedure: a fast clone, and a slow clone.
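The deque discipline described above can be sketched as follows. This is a single-threaded illustration with no synchronization at all; the type name, function names, fixed capacity, and the use of integer ids for procedure instances are assumptions made for the example.

```c
#include <assert.h>

#define DEQUE_CAP 64   /* hypothetical fixed capacity for the sketch */

/* Ready deque of procedure-instance ids. */
typedef struct {
    int items[DEQUE_CAP];
    int head;   /* index of the oldest entry (thieves steal here) */
    int tail;   /* one past the newest entry (worker pushes/pops here) */
} ready_deque;

void deque_init(ready_deque *d)
{
    d->head = 0;
    d->tail = 0;
}

/* Worker: push a procedure instance on to the end. */
void deque_push(ready_deque *d, int id)
{
    assert(d->tail < DEQUE_CAP);
    d->items[d->tail++] = id;
}

/* Worker: pop the newest instance from the end; -1 if empty. */
int deque_pop(ready_deque *d)
{
    if (d->tail == d->head) return -1;
    return d->items[--d->tail];
}

/* Thief: take the oldest instance from the front; -1 if empty. */
int deque_steal(ready_deque *d)
{
    if (d->tail == d->head) return -1;
    return d->items[d->head++];
}
```

If a worker pushes instances 1, 2, 3 in that order (1 being the outermost parent), a thief receives 1 while the worker itself continues with 3, which is the "parents are stolen before children" behaviour.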
Fast Clone




Run fast clone when a procedure is spawned.
Little support for parallelism.
Whenever a call is made, save complete state, and
push on to end of deque.
When call returns, check to see if procedure was
stolen.



If stolen, return immediately.
If not stolen, carry on execution.
Since children are never stolen, sync is a no-op.
Fast Clone Example
cilk int fib (int n)
{
    if (n<2) return n;
    else {
        int x, y;
        x = spawn fib (n-1);
        y = spawn fib (n-2);
        sync;
        return (x+y);
    }
}
Fast Clone Example
int fib (int n)
{
    fib_frame *f;                   /* frame pointer */
    f = alloc(sizeof(*f));          /* allocate frame */
    f->sig = fib_sig;               /* initialize frame */
    if (n<2) {
        free(f, sizeof(*f));        /* free frame */
        return n;
    }
    else { … }
Fast Clone Example
        int x, y;
        f->entry = 1;               /* save PC */
        f->n = n;                   /* save live vars */
        *T = f;                     /* store frame pointer */
        push();                     /* push frame */
        x = fib (n-1);              /* do C call */
        if (pop(x) == FAILURE)      /* pop frame */
            return 0;               /* procedure stolen */
        < … >                       /* second spawn */
        ;                           /* sync is free! */
        free(f, sizeof(*f));        /* free frame */
        return (x+y);
    }
}
Slow Clone




Slow clone used when a procedure is stolen.
Similar to fast clone, but also supports
concurrent execution.
It restores program counter and procedure
state using copy stored on deque.
Calling sync makes a call to the runtime system to
check on the children's status.
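A rough sketch of how a slow clone might resume from a saved frame is shown below. It is not the compiler's actual output: the frame layout, the function names, and the stand-in runtime_sync are all hypothetical. It runs in a single thread and assumes that the result of any child already in flight has been stored in the frame before the slow clone starts.

```c
#include <assert.h>

/* Hypothetical frame layout, modelled on the fast clone example. */
typedef struct {
    int entry;   /* saved "program counter": which spawn was reached */
    int n;       /* saved live variables */
    int x;
    int y;
} fib_frame;

/* Stand-in for the fast clone: plain serial C. */
int fib_fast(int n)
{
    return n < 2 ? n : fib_fast(n - 1) + fib_fast(n - 2);
}

/* Stand-in for the runtime's sync; a real one would wait for children. */
void runtime_sync(fib_frame *f)
{
    (void)f;
}

/* Slow clone: resume a stolen procedure from its saved frame. */
int fib_slow(fib_frame *f)
{
    int n = f->n, x = f->x, y = f->y;    /* restore live variables */

    switch (f->entry) {                  /* restore program counter */
    case 1: goto after_first_spawn;
    case 2: goto after_second_spawn;
    default: break;                      /* entry 0: start from the top */
    }

    if (n < 2) return n;
    x = fib_fast(n - 1);
after_first_spawn:
    y = fib_fast(n - 2);
after_second_spawn:
    f->x = x;
    f->y = y;
    runtime_sync(f);                     /* not a no-op in the slow clone */
    return x + y;
}
```

For example, a frame with entry = 1, n = 10, and x = 34 (the value of fib(9)) resumes just after the first spawn, computes y = 21, and returns 55.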
The T.H.E. Protocol

Deques held in shared memory.




Workers operate at the end, thieves at the front.
We must prevent race conditions where a thief and
victim try to access the same procedure frame.
Locking deques would be expensive for workers.
The T.H.E. protocol removes the worker's locking overhead in the
common case, where there is no conflict.
The T.H.E. Protocol


Assumes only reads and writes are atomic.
Head of the deque is H, tail is T, and (T ≥ H).
Only thief can change H.
Only worker can change T.
To steal, thieves must get the lock L.
At most two processors operating on deque.
Three cases of interaction:



Two or more items on deque – each gets one.
One item on deque – either worker or thief gets frame,
but not both.
No items on deque – both worker and thief fail.
One item on deque case




Both thief and worker assume they can get a
procedure frame and change H or T.
If both thief and worker try to take the frame, one or
both of them will discover (H > T), depending on
instruction order.
If thief discovers (H > T) it backs off and restores
H.
If worker discovers (H > T) it restores T, and then
tries for the lock. Inside lock, procedure can be
safely popped if still there.
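The two outcomes of the one-item case can be replayed step by step. The sketch below is a single-threaded replay of two possible instruction orders, not a concurrent implementation; the struct, the locked flag standing in for the lock L, and the function names are assumptions made for the illustration.

```c
#include <assert.h>

enum { FAILURE = 0, SUCCESS = 1 };

typedef struct {
    int H;        /* head: changed only by the thief */
    int T;        /* tail: changed only by the worker */
    int locked;   /* stand-in for the lock L in this one-thread replay */
} the_deque;

typedef struct { int worker; int thief; } outcome;

/* Order 1: the thief notices H > T first and backs off. */
outcome replay_thief_backs_off(void)
{
    the_deque d = { 0, 1, 0 };              /* exactly one frame */
    outcome r;

    d.locked = 1; d.H++;                    /* thief: lock(L); H++ */
    d.T--;                                  /* worker: T-- */
    assert(d.H > d.T);                      /* thief sees the conflict */
    d.H--; d.locked = 0;                    /* thief: H--; unlock(L) */
    r.thief = FAILURE;
    r.worker = (d.H > d.T) ? FAILURE : SUCCESS;   /* worker's check passes */
    return r;
}

/* Order 2: the worker notices H > T first, restores T, and waits. */
outcome replay_worker_backs_off(void)
{
    the_deque d = { 0, 1, 0 };              /* exactly one frame */
    outcome r;

    d.locked = 1; d.H++;                    /* thief: lock(L); H++ */
    d.T--;                                  /* worker: T-- */
    assert(d.H > d.T);                      /* worker sees the conflict */
    d.T++;                                  /* worker: restore T, wait for L */
    r.thief = (d.H > d.T) ? FAILURE : SUCCESS;    /* thief's check passes */
    d.locked = 0;                           /* thief: unlock(L) */

    assert(!d.locked);
    d.locked = 1; d.T--;                    /* worker: lock(L); T-- */
    if (d.H > d.T) { d.T++; r.worker = FAILURE; } /* frame is gone */
    else r.worker = SUCCESS;
    d.locked = 0;                           /* worker: unlock(L) */
    return r;
}
```

In both replays exactly one side ends up with the frame: the worker in the first order, the thief in the second.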
Empirical Results


On an eight-processor Sun SMP, achieved an
average speedup of 6.2 over the C elision (serial,
non-threaded C versions).
Assumptions of work-first seem sound:


Applications tested all showed high amounts of
“average parallelism”.
Work overhead small for most programs. Least
speed up is where overhead is greatest.
Summary




Extension of C for parallel programming.
Aims to simplify parallelization.
Main idea is to minimize overhead for workers
rather than to shorten the critical path.
Empirical results show speed up average of
6.2 on an 8 processor machine.
My Questions




A cilk spawn is always just a C call. Who
starts the threads, and how many are there?
Why use Cilk rather than use threads
directly?
What about using Cilk on a Beowulf cluster?
Are their test programs representative of
SMP applications?
Other Extensions



Inlets – a wrapper around spawned
procedure returns.
Abort – terminates work no longer needed
(e.g. in parallel search).
Locking facilities for access to shared data.
T.H.E. Protocol: The Worker/Victim
/* worker/victim */
push() {
    T++;
}

pop() {
    T--;
    if (H > T) {
        T++;
        lock(L);
        T--;
        if (H > T) {
            T++;
            unlock(L);
            return FAILURE;
        }
        unlock(L);
    }
    return SUCCESS;
}

/* thief */
steal() {
    lock(L);
    H++;
    if (H > T) {
        H--;
        unlock(L);
        return FAILURE;
    }
    unlock(L);
    return SUCCESS;
}
Fibonacci Illustration