Multithreaded Programming in Cilk


Multithreaded Programming in Cilk
LECTURE 2
Charles E. Leiserson
Supercomputing Technologies Research Group
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Minicourse Outline
● LECTURE 1
Basic Cilk programming: Cilk keywords,
performance measures, scheduling.
● LECTURE 2
Analysis of Cilk algorithms: matrix
multiplication, sorting, tableau construction.
● LABORATORY
Programming matrix multiplication in Cilk
— Dr. Bradley C. Kuszmaul
● LECTURE 3
Advanced Cilk programming: inlets, abort,
speculation, data synchronization, & more.
© 2006 by Charles E. Leiserson
Multithreaded Programming in Cilk —LECTURE 2
July 14, 2006
2
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
The Master Method
The Master Method for solving recurrences applies to recurrences of the form
T(n) = a T(n/b) + f(n),*
where a ≥ 1, b > 1, and f is asymptotically positive.
IDEA: Compare n^(log_b a) with f(n).
*The unstated base case is T(n) = Θ(1) for sufficiently small n.
Master Method — CASE 1
T(n) = a T(n/b) + f(n)
n^(log_b a) ≫ f(n)
Specifically, f(n) = O(n^(log_b a – ε)) for some constant ε > 0.
Solution: T(n) = Θ(n^(log_b a)).
Master Method — CASE 2
T(n) = a T(n/b) + f(n)
n^(log_b a) ≈ f(n)
Specifically, f(n) = Θ(n^(log_b a) lg^k n) for some constant k ≥ 0.
Solution: T(n) = Θ(n^(log_b a) lg^(k+1) n).
Master Method — CASE 3
T(n) = a T(n/b) + f(n)
n^(log_b a) ≪ f(n)
Specifically, f(n) = Ω(n^(log_b a + ε)) for some constant ε > 0, and f(n) satisfies the regularity condition that a f(n/b) ≤ c f(n) for some constant c < 1.
Solution: T(n) = Θ(f(n)).
Master Method Summary
T(n) = a T(n/b) + f(n)
CASE 1: f(n) = O(n^(log_b a – ε)), constant ε > 0
  ⇒ T(n) = Θ(n^(log_b a)).
CASE 2: f(n) = Θ(n^(log_b a) lg^k n), constant k ≥ 0
  ⇒ T(n) = Θ(n^(log_b a) lg^(k+1) n).
CASE 3: f(n) = Ω(n^(log_b a + ε)), constant ε > 0, and regularity condition
  ⇒ T(n) = Θ(f(n)).
Master Method Quiz
• T(n) = 4 T(n/2) + n
  n^(log_b a) = n² ≫ n ⇒ CASE 1: T(n) = Θ(n²).
• T(n) = 4 T(n/2) + n²
  n^(log_b a) = n² = n² lg⁰ n ⇒ CASE 2: T(n) = Θ(n² lg n).
• T(n) = 4 T(n/2) + n³
  n^(log_b a) = n² ≪ n³ ⇒ CASE 3: T(n) = Θ(n³).
• T(n) = 4 T(n/2) + n²/lg n
  Master method does not apply!
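These answers can be sanity-checked numerically. The Python sketch below (hypothetical, not part of the lecture) unrolls T(n) = a·T(n/b) + f(n) level by level and compares the totals with the predicted growth rates:

```python
import math

# Unroll T(n) = a*T(n/b) + f(n) with base case T(1) = 1,
# assuming n is a power of b. Level i contributes a^i * f(n/b^i).
def solve(n, a, b, f):
    total, weight = 0.0, 1.0
    while n > 1:
        total += weight * f(n)
        weight *= a
        n //= b
    return total + weight  # plus the a^(log_b n) leaves, each costing 1

n = 1 << 12  # 4096

# CASE 1: T(n) = 4 T(n/2) + n is Theta(n^2).
assert 1.0 < solve(n, 4, 2, lambda m: m) / n**2 < 4.0

# CASE 2: T(n) = 4 T(n/2) + n^2 is Theta(n^2 lg n).
assert 0.5 < solve(n, 4, 2, lambda m: m**2) / (n**2 * math.log2(n)) < 2.0

# CASE 3: T(n) = 4 T(n/2) + n^3 is Theta(n^3).
assert 1.0 < solve(n, 4, 2, lambda m: m**3) / n**3 < 4.0
```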
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Square-Matrix Multiplication
C = A × B, where C = [c_ij], A = [a_ij], and B = [b_ij] are n × n matrices with
c_ij = Σ_{k=1}^{n} a_ik · b_kj .
Assume for simplicity that n = 2^k.
Recursive Matrix Multiplication
Divide and conquer —
[C11 C12]   [A11 A12]   [B11 B12]
[C21 C22] = [A21 A22] × [B21 B22]

            [A11·B11 A11·B12]   [A12·B21 A12·B22]
          = [A21·B11 A21·B12] + [A22·B21 A22·B22]

8 multiplications of (n/2) × (n/2) matrices.
1 addition of n × n matrices.
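The decomposition can be sketched serially in Python (a hypothetical illustration, not the lecture's code; the Cilk version spawns the eight products in parallel and addresses submatrices by pointer arithmetic instead of copying):

```python
# Serial divide-and-conquer matrix multiply on lists of lists.
# Assumes n is a power of 2. Quadrants are copied here for clarity.
def mat_add(C, T):
    return [[c + t for c, t in zip(cr, tr)] for cr, tr in zip(C, T)]

def quad(M, i, j, h):
    # Extract the h-by-h quadrant at block position (i, j).
    return [row[j*h:(j+1)*h] for row in M[i*h:(i+1)*h]]

def join(q11, q12, q21, q22):
    top = [a + b for a, b in zip(q11, q12)]
    bot = [a + b for a, b in zip(q21, q22)]
    return top + bot

def mult(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    A11, A12, A21, A22 = quad(A,0,0,h), quad(A,0,1,h), quad(A,1,0,h), quad(A,1,1,h)
    B11, B12, B21, B22 = quad(B,0,0,h), quad(B,0,1,h), quad(B,1,0,h), quad(B,1,1,h)
    # Eight recursive multiplications (spawned in parallel in Cilk) ...
    C = join(mult(A11,B11), mult(A11,B12), mult(A21,B11), mult(A21,B12))
    T = join(mult(A12,B21), mult(A12,B22), mult(A22,B21), mult(A22,B22))
    # ... followed by one matrix addition.
    return mat_add(C, T)
```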
Matrix Multiply in Pseudo-Cilk
cilk void Mult(*C, *A, *B, n) {      // C = A·B
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  spawn Mult(C12,A11,B12,n/2);
  spawn Mult(C22,A21,B12,n/2);
  spawn Mult(C21,A21,B11,n/2);
  spawn Mult(T11,A12,B21,n/2);
  spawn Mult(T12,A12,B22,n/2);
  spawn Mult(T22,A22,B22,n/2);
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}
Note the absence of type declarations.
Matrix Multiply in Pseudo-Cilk
cilk void Mult(*C, *A, *B, n) {      // C = A·B
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  ⋮
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}
Coarsen base cases for efficiency.
Matrix Multiply in Pseudo-Cilk
cilk void Mult(*C, *A, *B, n) {      // C = A·B
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  ⋮
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}
Submatrices are produced by pointer calculation, not by copying of elements. The real code also needs a rowsize argument for array indexing.
Matrix Multiply in Pseudo-Cilk
cilk void Mult(*C, *A, *B, n) {      // C = A·B
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  ⋮
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}

cilk void Add(*C, *T, n) {           // C = C + T
  ⟨base case & partition matrices⟩
  spawn Add(C11,T11,n/2);
  spawn Add(C12,T12,n/2);
  spawn Add(C21,T21,n/2);
  spawn Add(C22,T22,n/2);
  sync;
  return;
}
Work of Matrix Addition
cilk void Add(*C, *T, n) {           // C = C + T
  ⟨base case & partition matrices⟩
  spawn Add(C11,T11,n/2);
  spawn Add(C12,T12,n/2);
  spawn Add(C21,T21,n/2);
  spawn Add(C22,T22,n/2);
  sync;
  return;
}
Work: A₁(n) = 4 A₁(n/2) + Θ(1)
            = Θ(n²)  — CASE 1
n^(log_b a) = n^(log₂ 4) = n² ≫ Θ(1).
Span of Matrix Addition
cilk void Add(*C, *T, n) {
  ⟨base case & partition matrices⟩
  spawn Add(C11,T11,n/2);   // the four parallel spawns contribute
  spawn Add(C12,T12,n/2);   // only the maximum of their spans
  spawn Add(C21,T21,n/2);
  spawn Add(C22,T22,n/2);
  sync;
  return;
}
Span: A∞(n) = A∞(n/2) + Θ(1)
            = Θ(lg n)  — CASE 2
n^(log_b a) = n^(log₂ 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg⁰ n).
Work of Matrix Multiplication
cilk void Mult(*C, *A, *B, n) {
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);   // 8 recursive
  spawn Mult(C12,A11,B12,n/2);   // multiplications
  ⋮
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}
Work: M₁(n) = 8 M₁(n/2) + A₁(n) + Θ(1)
            = 8 M₁(n/2) + Θ(n²)
            = Θ(n³)  — CASE 1
n^(log_b a) = n^(log₂ 8) = n³ ≫ Θ(n²).
Span of Matrix Multiplication
cilk void Mult(*C, *A, *B, n) {
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  ⋮
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}
Span: M∞(n) = M∞(n/2) + A∞(n) + Θ(1)
            = M∞(n/2) + Θ(lg n)
            = Θ(lg² n)  — CASE 2
n^(log_b a) = n^(log₂ 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg¹ n).
Parallelism of Matrix Multiply
Work: M₁(n) = Θ(n³)
Span: M∞(n) = Θ(lg² n)
Parallelism: M₁(n)/M∞(n) = Θ(n³/lg² n)
For 1000 × 1000 matrices, parallelism ≈ (10³)³/10² = 10⁷.
Stack Temporaries
cilk void Mult(*C, *A, *B, n) {
  float *T = Cilk_alloca(n*n*sizeof(float));
  ⟨base case & partition matrices⟩
  spawn Mult(C11,A11,B11,n/2);
  ⋮
  spawn Mult(T21,A22,B21,n/2);
  sync;
  spawn Add(C,T,n);
  sync;
  return;
}
In hierarchical-memory machines (especially chip multiprocessors), memory accesses are so expensive that minimizing storage often yields higher performance.
IDEA: Trade off parallelism for less storage.
No-Temp Matrix Multiplication
cilk void MultA(*C, *A, *B, n) {
  // C = C + A * B
  ⟨base case & partition matrices⟩
  spawn MultA(C11,A11,B11,n/2);
  spawn MultA(C12,A11,B12,n/2);
  spawn MultA(C22,A21,B12,n/2);
  spawn MultA(C21,A21,B11,n/2);
  sync;
  spawn MultA(C21,A22,B21,n/2);
  spawn MultA(C22,A22,B22,n/2);
  spawn MultA(C12,A12,B22,n/2);
  spawn MultA(C11,A12,B21,n/2);
  sync;
  return;
}
Saves space, but at what expense?
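As a hypothetical serial illustration of the no-temp scheme (not the lecture's code), the Python sketch below performs C += A·B using block offsets instead of a temporary; the comment marks where the Cilk version must sync, because the second round of updates writes the same C quadrants as the first:

```python
# In-place blocked multiply-add: the n-by-n block of C at (ci, cj) is
# updated with the product of A's block at (ai, aj) and B's block at
# (bi, bj). No temporary matrix is allocated.
def mult_add(C, A, B, ci, cj, ai, aj, bi, bj, n):
    if n == 1:
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = n // 2
    # Round 1: C_xy += A_x1 * B_1y (spawned in parallel in Cilk).
    mult_add(C, A, B, ci,   cj,   ai,   aj,   bi,   bj,   h)  # C11 += A11*B11
    mult_add(C, A, B, ci,   cj+h, ai,   aj,   bi,   bj+h, h)  # C12 += A11*B12
    mult_add(C, A, B, ci+h, cj,   ai+h, aj,   bi,   bj,   h)  # C21 += A21*B11
    mult_add(C, A, B, ci+h, cj+h, ai+h, aj,   bi,   bj+h, h)  # C22 += A21*B12
    # -- sync here in Cilk: round 2 writes the same C quadrants --
    mult_add(C, A, B, ci,   cj,   ai,   aj+h, bi+h, bj,   h)  # C11 += A12*B21
    mult_add(C, A, B, ci,   cj+h, ai,   aj+h, bi+h, bj+h, h)  # C12 += A12*B22
    mult_add(C, A, B, ci+h, cj,   ai+h, aj+h, bi+h, bj,   h)  # C21 += A22*B21
    mult_add(C, A, B, ci+h, cj+h, ai+h, aj+h, bi+h, bj+h, h)  # C22 += A22*B22
```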
Work of No-Temp Multiply
cilk void MultA(*C, *A, *B, n) {
  // C = C + A * B
  ⟨base case & partition matrices⟩
  spawn MultA(C11,A11,B11,n/2);
  ⋮
  spawn MultA(C11,A12,B21,n/2);
  sync;
  return;
}
Work: M₁(n) = 8 M₁(n/2) + Θ(1)
            = Θ(n³)  — CASE 1
Span of No-Temp Multiply
cilk void MultA(*C, *A, *B, n) {
  // C = C + A * B
  ⟨base case & partition matrices⟩
  spawn MultA(C11,A11,B11,n/2);  // each group of four spawns
  spawn MultA(C12,A11,B12,n/2);  // contributes the maximum of
  spawn MultA(C22,A21,B12,n/2);  // its children's spans
  spawn MultA(C21,A21,B11,n/2);
  sync;                          // the two groups run serially
  spawn MultA(C21,A22,B21,n/2);
  spawn MultA(C22,A22,B22,n/2);
  spawn MultA(C12,A12,B22,n/2);
  spawn MultA(C11,A12,B21,n/2);
  sync;
  return;
}
Span: M∞(n) = 2 M∞(n/2) + Θ(1)
            = Θ(n)  — CASE 1
Parallelism of No-Temp Multiply
Work: M₁(n) = Θ(n³)
Span: M∞(n) = Θ(n)
Parallelism: M₁(n)/M∞(n) = Θ(n²)
For 1000 × 1000 matrices, parallelism ≈ (10³)³/10³ = 10⁶.
Faster in practice!
Testing Synchronization
Cilk language feature: A programmer can
check whether a Cilk procedure is “synched”
(without actually performing a sync) by
testing the pseudovariable SYNCHED:
• SYNCHED = 0 ⇒ some spawned children might not have returned.
• SYNCHED = 1 ⇒ all spawned children have definitely returned.
Best of Both Worlds
cilk void Mult1(*C, *A, *B, n) {      // multiply & store
  ⟨base case & partition matrices⟩
  spawn Mult1(C11,A11,B11,n/2);       // multiply & store
  spawn Mult1(C12,A11,B12,n/2);
  spawn Mult1(C22,A21,B12,n/2);
  spawn Mult1(C21,A21,B11,n/2);
  if (SYNCHED) {
    spawn MultA1(C11,A12,B21,n/2);    // multiply & add
    spawn MultA1(C12,A12,B22,n/2);
    spawn MultA1(C22,A22,B22,n/2);
    spawn MultA1(C21,A22,B21,n/2);
  } else {
    float *T = Cilk_alloca(n*n*sizeof(float));
    spawn Mult1(T11,A12,B21,n/2);     // multiply & store
    spawn Mult1(T12,A12,B22,n/2);
    spawn Mult1(T22,A22,B22,n/2);
    spawn Mult1(T21,A22,B21,n/2);
    sync;
    spawn Add(C,T,n);                 // C = C + T
  }
  sync;
  return;
}
This code is just as parallel as the original, but it uses extra storage only when runtime parallelism actually exists.
Ordinary Matrix Multiplication
c_ij = Σ_{k=1}^{n} a_ik · b_kj
IDEA: Spawn n² inner products in parallel. Compute each inner product in parallel.
Work: Θ(n³)
Span: Θ(lg n)
Parallelism: Θ(n³/lg n)
BUT, this algorithm exhibits poor locality and does not exploit the cache hierarchy of modern microprocessors, especially CMPs.
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Merging Two Sorted Arrays
void Merge(int *C, int *A, int *B, int na, int nb) {
  while (na>0 && nb>0) {
    if (*A <= *B) {
      *C++ = *A++; na--;
    } else {
      *C++ = *B++; nb--;
    }
  }
  while (na>0) {
    *C++ = *A++; na--;
  }
  while (nb>0) {
    *C++ = *B++; nb--;
  }
}
Time to merge n elements = Θ(n).
[Figure: merging the sorted arrays {3, 12, 19, 46} and {4, 14, 21, 23}.]
Merge Sort
cilk void MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn MergeSort(C, A, n/2);
    spawn MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    Merge(B, C, C+n/2, n/2, n-n/2);
  }
}
[Figure: merge tree sorting 19 3 12 46 33 4 21 14 — pairs are merged level by level into 3 4 12 14 19 21 33 46.]
Work of Merge Sort
cilk void MergeSort(int *B, int *A, int n) {
  ⋮
  spawn MergeSort(C, A, n/2);
  spawn MergeSort(C+n/2, A+n/2, n-n/2);
  sync;
  Merge(B, C, C+n/2, n/2, n-n/2);
  ⋮
}
Work: T₁(n) = 2 T₁(n/2) + Θ(n)
            = Θ(n lg n)  — CASE 2
n^(log_b a) = n^(log₂ 2) = n ⇒ f(n) = Θ(n^(log_b a) lg⁰ n).
Span of Merge Sort
cilk void MergeSort(int *B, int *A, int n) {
  ⋮
  spawn MergeSort(C, A, n/2);
  spawn MergeSort(C+n/2, A+n/2, n-n/2);
  sync;
  Merge(B, C, C+n/2, n/2, n-n/2);
  ⋮
}
Span: T∞(n) = T∞(n/2) + Θ(n)
            = Θ(n)  — CASE 3
n^(log_b a) = n^(log₂ 1) = 1 ≪ Θ(n).
Parallelism of Merge Sort
Work: T₁(n) = Θ(n lg n)
Span: T∞(n) = Θ(n)
Parallelism: T₁(n)/T∞(n) = Θ(lg n)
We need to parallelize the merge!
Parallel Merge (with na ≥ nb)
[Figure: array A (length na) is split at its median A[na/2]; a binary search in B (length nb) finds the point j such that B[0..j] ≤ A[na/2] ≤ B[j+1..nb–1]. The elements ≤ A[na/2] form one recursive merge, and the elements ≥ A[na/2] form the other.]
KEY IDEA: If the total number of elements to be merged in the two arrays is n = na + nb, the total number of elements in the larger of the two recursive merges is at most (3/4)n.
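The bound is easy to verify exhaustively for small sizes. The Python sketch below (hypothetical, not from the lecture) allows an O(1) slack for the integer split ma = na/2:

```python
# With na >= nb, A is split at ma = na//2 and B at the binary-search
# point mb (0 <= mb <= nb). The larger of the two recursive merges
# then has at most 3n/4 elements, up to an O(1) rounding term.
def larger_merge(na, nb, mb):
    ma = na // 2
    return max(ma + mb, (na - ma) + (nb - mb))

for na in range(2, 60):
    for nb in range(1, na + 1):          # na >= nb
        n = na + nb
        for mb in range(nb + 1):         # every possible search outcome
            assert 4 * larger_merge(na, nb, mb) <= 3 * n + 2
```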
Parallel Merge
cilk void P_Merge(int *C, int *A, int *B,
                  int na, int nb) {
  if (na < nb) {
    spawn P_Merge(C, B, A, nb, na);
  } else if (na==1) {
    if (nb == 0) {
      C[0] = A[0];
    } else {
      C[0] = (A[0]<B[0]) ? A[0] : B[0]; /* minimum */
      C[1] = (A[0]<B[0]) ? B[0] : A[0]; /* maximum */
    }
  } else {
    int ma = na/2;
    int mb = BinarySearch(A[ma], B, nb);
    spawn P_Merge(C, A, B, ma, mb);
    spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
    sync;
  }
}
Coarsen base cases for efficiency.
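Serialized into Python (a hypothetical sketch: the spawns become ordinary recursive calls, and the standard-library bisect_left stands in for BinarySearch, counting the elements of B strictly less than A[ma]), the algorithm reads:

```python
from bisect import bisect_left

# Serial rendering of P_Merge on Python lists.
def p_merge(A, B):
    if len(A) < len(B):
        A, B = B, A                  # ensure na >= nb
    if not A:
        return []
    if len(A) == 1:                  # then len(B) <= 1: trivial base case
        return sorted(A + B)
    ma = len(A) // 2
    mb = bisect_left(B, A[ma])
    # The two submerges are independent; in Cilk they are spawned.
    return p_merge(A[:ma], B[:mb]) + p_merge(A[ma:], B[mb:])
```

In the Cilk version the two submerges write disjoint regions of the output array C, which is why no copying back is needed.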
Span of P_Merge
cilk void P_Merge(int *C, int *A, int *B,
                  int na, int nb) {
  ⋮
  int ma = na/2;
  int mb = BinarySearch(A[ma], B, nb);
  spawn P_Merge(C, A, B, ma, mb);
  spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
  sync;
  ⋮
}
Span: T∞(n) = T∞(3n/4) + Θ(lg n)
            = Θ(lg² n)  — CASE 2
n^(log_b a) = n^(log_{4/3} 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg¹ n).
Work of P_Merge
cilk void P_Merge(int *C, int *A, int *B,
                  int na, int nb) {
  ⋮
  int ma = na/2;
  int mb = BinarySearch(A[ma], B, nb);
  spawn P_Merge(C, A, B, ma, mb);
  spawn P_Merge(C+ma+mb, A+ma, B+mb, na-ma, nb-mb);
  sync;
  ⋮
}
Work: T₁(n) = T₁(αn) + T₁((1–α)n) + Θ(lg n),
where 1/4 ≤ α ≤ 3/4.
CLAIM: T₁(n) = Θ(n).
Analysis of Work Recurrence
T₁(n) = T₁(αn) + T₁((1–α)n) + Θ(lg n),
where 1/4 ≤ α ≤ 3/4.
Substitution method: The inductive hypothesis is T₁(k) ≤ c₁k – c₂ lg k, where c₁, c₂ > 0. Prove that the relation holds, and solve for c₁ and c₂:
T₁(n) = T₁(αn) + T₁((1–α)n) + Θ(lg n)
      ≤ c₁(αn) – c₂ lg(αn) + c₁((1–α)n) – c₂ lg((1–α)n) + Θ(lg n)
Analysis of Work Recurrence
T₁(n) = T₁(αn) + T₁((1–α)n) + Θ(lg n),
where 1/4 ≤ α ≤ 3/4.
T₁(n) = T₁(αn) + T₁((1–α)n) + Θ(lg n)
      ≤ c₁(αn) – c₂ lg(αn) + c₁((1–α)n) – c₂ lg((1–α)n) + Θ(lg n)
      = c₁n – c₂( lg(αn) + lg((1–α)n) ) + Θ(lg n)
      = c₁n – c₂( lg(α(1–α)) + 2 lg n ) + Θ(lg n)
      = c₁n – c₂ lg n – ( c₂( lg n + lg(α(1–α)) ) – Θ(lg n) )
      ≤ c₁n – c₂ lg n
by choosing c₁ and c₂ large enough.
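The claim can also be sanity-checked numerically. The Python sketch below (hypothetical, not part of the lecture) iterates the recurrence with the worst-case split α = 3/4 at every level and confirms that T₁(n)/n stays bounded:

```python
import math

# Work recurrence with the worst-case split alpha = 3/4:
#   T(n) = T(3n/4) + T(n/4) + lg n,  with T(n) = 1 for n <= 2.
def T(n):
    if n <= 2:
        return 1.0
    return T(3 * n / 4) + T(n / 4) + math.log2(n)

for n in [2**10, 2**13, 2**16]:
    # The ratio T(n)/n stays bounded, consistent with T(n) = Theta(n).
    assert 0.4 < T(n) / n < 10.0
```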
Parallelism of P_Merge
Work: T₁(n) = Θ(n)
Span: T∞(n) = Θ(lg² n)
Parallelism: T₁(n)/T∞(n) = Θ(n/lg² n)
Parallel Merge Sort
cilk void P_MergeSort(int *B, int *A, int n) {
  if (n==1) {
    B[0] = A[0];
  } else {
    int *C;
    C = (int*) Cilk_alloca(n*sizeof(int));
    spawn P_MergeSort(C, A, n/2);
    spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
    sync;
    spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  }
}
Work of Parallel Merge Sort
cilk void P_MergeSort(int *B, int *A, int n) {
  ⋮
  spawn P_MergeSort(C, A, n/2);
  spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
  sync;
  spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  ⋮
}
Work: T₁(n) = 2 T₁(n/2) + Θ(n)
            = Θ(n lg n)  — CASE 2
Span of Parallel Merge Sort
cilk void P_MergeSort(int *B, int *A, int n) {
  ⋮
  spawn P_MergeSort(C, A, n/2);
  spawn P_MergeSort(C+n/2, A+n/2, n-n/2);
  sync;
  spawn P_Merge(B, C, C+n/2, n/2, n-n/2);
  ⋮
}
Span: T∞(n) = T∞(n/2) + Θ(lg² n)
            = Θ(lg³ n)  — CASE 2
n^(log_b a) = n^(log₂ 1) = 1 ⇒ f(n) = Θ(n^(log_b a) lg² n).
Parallelism of Merge Sort
Work: T₁(n) = Θ(n lg n)
Span: T∞(n) = Θ(lg³ n)
Parallelism: T₁(n)/T∞(n) = Θ(n/lg² n)
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Tableau Construction
Problem: Fill in an n × n tableau A, where
A[i, j] = f( A[i, j–1], A[i–1, j], A[i–1, j–1] ).
Dynamic programming:
• Longest common subsequence
• Edit distance
• Time warping
Work: Θ(n²).
[Figure: an 8 × 8 grid of cells (i, j); each cell depends on its left, upper, and upper-left neighbors.]
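As a concrete instance of the stencil (a hypothetical worked example, not from the slides), edit distance fills exactly such a tableau:

```python
# Entry A[i][j] depends only on A[i][j-1], A[i-1][j], and A[i-1][j-1],
# i.e., the stencil f(A[i, j-1], A[i-1, j], A[i-1, j-1]).
def edit_distance(x, y):
    m, n = len(x), len(y)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        A[i][0] = i                  # delete all of x[:i]
    for j in range(n + 1):
        A[0][j] = j                  # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            A[i][j] = min(A[i][j - 1] + 1,                        # insert
                          A[i - 1][j] + 1,                        # delete
                          A[i - 1][j - 1] + (x[i-1] != y[j-1]))   # substitute
    return A[m][n]
```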
Recursive Construction
[Figure: the n × n tableau is split into four (n/2) × (n/2) quadrants: I upper-left, II upper-right, III lower-left, IV lower-right.]
Cilk code:
spawn I;
sync;
spawn II;
spawn III;
sync;
spawn IV;
sync;
Recursive Construction
(Same four-quadrant decomposition and code as on the previous slide.)
Work: T₁(n) = 4 T₁(n/2) + Θ(1)
            = Θ(n²)  — CASE 1
Recursive Construction
(Same four-quadrant decomposition and code as on the previous slide.)
Span: T∞(n) = 3 T∞(n/2) + Θ(1)
            = Θ(n^(lg 3))  — CASE 1
(II and III run in parallel, so the critical path passes through three quadrants: I, then II or III, then IV.)
Analysis of Tableau Construction
Work: T₁(n) = Θ(n²)
Span: T∞(n) = Θ(n^(lg 3)) ≈ Θ(n^1.58)
Parallelism: T₁(n)/T∞(n) ≈ Θ(n^0.42)
A More-Parallel Construction
[Figure: the n × n tableau is split into nine (n/3) × (n/3) blocks, numbered in anti-diagonal order: row 1 is I, II, IV; row 2 is III, V, VII; row 3 is VI, VIII, IX.]
Cilk code:
spawn I;
sync;
spawn II;
spawn III;
sync;
spawn IV;
spawn V;
spawn VI;
sync;
spawn VII;
spawn VIII;
sync;
spawn IX;
sync;
A More-Parallel Construction
(Same nine-block decomposition and code as on the previous slide.)
Work: T₁(n) = 9 T₁(n/3) + Θ(1)
            = Θ(n²)  — CASE 1
A More-Parallel Construction
(Same nine-block decomposition and code as on the previous slide.)
Span: T∞(n) = 5 T∞(n/3) + Θ(1)
            = Θ(n^(log₃ 5))  — CASE 1
(The five anti-diagonals of blocks execute serially.)
Analysis of Revised Construction
Work: T₁(n) = Θ(n²)
Span: T∞(n) = Θ(n^(log₃ 5)) ≈ Θ(n^1.46)
Parallelism: T₁(n)/T∞(n) ≈ Θ(n^0.54)
More parallel than the four-way version by a factor of Θ(n^0.54)/Θ(n^0.42) = Θ(n^0.12).
Puzzle
What is the largest parallelism that can be obtained for the tableau-construction problem using Cilk?
• You may only use basic Cilk control constructs (spawn, sync) for synchronization.
• No locks, synchronizing through memory, etc.
LECTURE 2
• Recurrences (Review)
• Matrix Multiplication
• Merge Sort
• Tableau Construction
• Conclusion
Key Ideas
• Cilk is simple: cilk, spawn, sync, SYNCHED
• Recurrences, recurrences, recurrences, …
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
• Work & span
Minicourse Outline
● LECTURE 1
Basic Cilk programming: Cilk keywords,
performance measures, scheduling.
● LECTURE 2
Analysis of Cilk algorithms: matrix
multiplication, sorting, tableau construction.
● LABORATORY
Programming matrix multiplication in Cilk
— Dr. Bradley C. Kuszmaul
● LECTURE 3
Advanced Cilk programming: inlets, abort,
speculation, data synchronization, & more.