CS294, Lecture #10 Fall, 2011 Communication-Avoiding Algorithms www.cs.berkeley.edu/~odedsc/CS294
How to Compute and Prove
Lower and Upper Bounds
on the
Communication Costs
of Your Algorithm Part III: Graph analysis
Oded Schwartz Based on:
G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Graph expansion and communication costs of fast matrix multiplication.
Previous talk on lower bounds
Communication Lower Bounds:
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Previous talk on lower bounds: algorithms with "flavor" of 3 nested loops
[Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04]

Sequential: BW = Ω( (n/M^(1/2))^3 · M ) = Ω( n^3 / M^(1/2) )
Parallel: BW = Ω( (n/M^(1/2))^3 · M/P )

• BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalues and singular values, i.e., essentially all direct methods of linear algebra.
• Dense or sparse matrices. In sparse cases: bandwidth is a function of NNZ.
• Bandwidth and latency.
• Sequential, hierarchical, and parallel – distributed and shared memory models.
• Compositions of linear algebra operations.
• Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan, 11]
• Tensor contractions
Geometric Embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a], follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]

(1) Generalized form: for all (i,j) ∈ S,
C(i,j) = f_ij( g_(i,j,k1)(A(i,k1), B(k1,j)), g_(i,j,k2)(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij

But many algorithms just don't fit the generalized form!
For example: Strassen's fast matrix multiplication.
Beyond 3-nested loops
How about the communication costs of algorithms that have a more complex structure?
Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Recall: Strassen's Fast Matrix Multiplication
[Strassen 69]
• Compute 2 x 2 matrix multiplication using only 7 multiplications (instead of 8).
• Apply recursively (block-wise) on n/2 x n/2 blocks:

M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

T(n) = 7·T(n/2) + O(n^2)  ⇒  T(n) = Θ(n^(log2 7))
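The seven products and four block sums above can be sketched directly in code. The following is a minimal illustrative implementation on plain Python lists (the names `add`, `sub`, and `strassen` are ours, not from the lecture); it assumes n is a power of 2 and recurses down to 1 x 1 blocks:

```python
# Entrywise helpers for square matrices stored as lists of lists.
def add(X, Y): return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]
def sub(X, Y): return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def strassen(A, B):
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    # Split both operands into quadrants.
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    # The seven recursive products.
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # Recombine into the four quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot

print(strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

This is only a sketch of the recursion's structure; a practical implementation would stop the recursion at a larger base-case size and avoid copying the quadrants.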
Strassen-like algorithms
• Compute n0 x n0 matrix multiplication using only n0^ω0 multiplications (instead of n0^3).
• Apply recursively (block-wise) on n/n0 x n/n0 blocks.

T(n) = n0^ω0 · T(n/n0) + O(n^2)  ⇒  T(n) = Θ(n^ω0)

ω0 = 2.81 [Strassen 69] (works fast in practice)
ω0 = 2.79 [Pan 78]
ω0 = 2.78 [Bini 79]
ω0 = 2.55 [Schönhage 81]
ω0 = 2.50 [Pan; Romani; Coppersmith Winograd 84]
ω0 = 2.48 [Strassen 87]
ω0 = 2.38 [Coppersmith Winograd 90]
ω0 = 2.38 [Cohn Kleinberg Szegedy Umans 05] (group-theoretic approach)
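The exponent follows from the recurrence by the master theorem: t multiplications on n0 x n0 blocks give T(n) = Θ(n^(log_n0 t)). A quick numeric sanity check (the helper `omega` is ours, for illustration):

```python
import math

# Exponent of a Strassen-like scheme: t multiplications of n0 x n0 blocks
# yield running time Theta(n ** omega(n0, t)) by the master theorem.
def omega(n0, t):
    return math.log(t, n0)

print(round(omega(2, 7), 3))        # Strassen [69]: log2 7 ~ 2.807
print(round(omega(2, 8), 3))        # classical cubic algorithm: log2 8 = 3.0
print(round(omega(70, 143640), 3))  # Pan [78]: 70 x 70 with 143640 products ~ 2.795
```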
New lower bound for Strassen's fast matrix multiplication
[Ballard, Demmel, Holtz, S. 2011b]: The communication bandwidth lower bound is:

For Strassen's:   BW = Ω( (n/M^(1/2))^(log2 7) · M )    parallel: BW = Ω( (n/M^(1/2))^(log2 7) · M/P )
Strassen-like:    BW = Ω( (n/M^(1/2))^ω0 · M )          parallel: BW = Ω( (n/M^(1/2))^ω0 · M/P )
Recall for cubic: BW = Ω( (n/M^(1/2))^(log2 8) · M )    parallel: BW = Ω( (n/M^(1/2))^(log2 8) · M/P )

The parallel lower bound applies to 2D: M = Θ(n^2/P), and to 2.5D: M = Θ(c·n^2/P).
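Up to constant factors, all three bounds share the form (n/M^(1/2))^e · M, with only the exponent e changing. A small numeric comparison (function name and sample sizes are ours):

```python
import math

# Sequential bandwidth lower bound, up to constant factors:
# (n / sqrt(M)) ** e * M, with e = 3 for classical matrix multiplication
# and e = log2 7 for Strassen.
def bw_bound(n, M, e):
    return (n / math.sqrt(M)) ** e * M

n, M = 2 ** 20, 2 ** 16
classic = bw_bound(n, M, 3)
strassen = bw_bound(n, M, math.log2(7))
# Strassen does fewer flops, so its lower bound is smaller:
assert strassen < classic
print(classic / strassen)
```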
For sequential? Hierarchy?
Yes, existing implementations do!
For parallel 2D? Parallel 2.5D?
Yes: new algorithms.
Sequential and new 2D and 2.5D parallel Strassen-like algorithms
Sequential and hierarchy cases: attained by the natural recursive implementation.
Also: LU, QR, … (black-box use of fast matrix multiplication) [Ballard, Demmel, Holtz, S., Rom 2011]

New 2D parallel Strassen-like algorithm: attains the lower bound.
New 2.5D parallel Strassen-like algorithm: c^(ω0/2 - 1) parallel communication speedup over the 2D implementation (where 3c·n^2 = M·P).

[Ballard, Demmel, Holtz, S. 2011b]: This is as good as it gets.
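For Strassen itself (ω0 = log2 7), the claimed c^(ω0/2 - 1) speedup can be evaluated directly; with c = 4 it comes out to exactly 2^(log2 7 - 2) = 7/4. A quick check (helper name is ours):

```python
import math

# Communication speedup of a 2.5D Strassen-like algorithm over its 2D
# counterpart: c ** (omega0 / 2 - 1), with omega0 = log2 7 for Strassen.
def speedup(c, omega0=math.log2(7)):
    return c ** (omega0 / 2 - 1)

# With c = 4 replicas: 4 ** (log2(7)/2 - 1) = 2 ** (log2(7) - 2) = 7/4.
assert math.isclose(speedup(4), 7 / 4)
print(speedup(4))
```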
Implications for sequential architectural scaling
• Requirements so that "most" time is spent doing arithmetic on n x n dense matrices, n^2 > M:
  – Time to add two rows of the largest locally storable square matrix exceeds reciprocal bandwidth.
  – Time to multiply the 2 largest locally storable square matrices exceeds latency.

CA matrix multiplication algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic                            | M^(1/2)                       | M^(3/2)
Strassen-like                      | M^(ω0/2 - 1)                  | M^(ω0/2)

Strassen-like algorithms do fewer flops & less communication, but are more demanding on the hardware.
If ω0 → 2, it is all about communication.
Expansion (3rd approach)
[Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]

Let G = (V, E) be a d-regular graph. Its edge expansion is
h(G) = min over S ⊆ V, |S| ≤ |V|/2, of |E(S, V \ S)| / (d·|S|)

A is the normalized adjacency matrix of G, with eigenvalues 1 = λ1 ≥ λ2 ≥ … ≥ λn; let λ = 1 - max{ λ2, |λn| }.

Thm [Alon-Milman 84, Dodziuk 84, Alon 86]: λ/2 ≤ h(G) ≤ (2λ)^(1/2)
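These definitions are easy to exercise on a toy instance. Below is a brute-force computation of h(G) for the complete graph K4 (3-regular), whose normalized adjacency matrix is known to have eigenvalues 1, -1/3, -1/3, -1/3, so λ = 2/3; the helper function is ours, for illustration only:

```python
from itertools import combinations
from fractions import Fraction

# Brute-force edge expansion h(G) of a d-regular graph given as an edge list:
# minimize |E(S, V \ S)| / (d * |S|) over all S with |S| <= |V| / 2.
def edge_expansion(vertices, edges, d):
    best = None
    for size in range(1, len(vertices) // 2 + 1):
        for subset in combinations(vertices, size):
            S = set(subset)
            cut = sum(1 for u, v in edges if (u in S) != (v in S))
            val = Fraction(cut, d * len(S))
            if best is None or val < best:
                best = val
    return best

V = [0, 1, 2, 3]
E = list(combinations(V, 2))  # K4: all 6 edges, 3-regular
h = edge_expansion(V, E, d=3)
print(h)  # 2/3

# K4: lambda = 1 - max{lambda2, |lambda_n|} = 1 - 1/3 = 2/3;
# check the theorem's lambda/2 <= h(G) <= sqrt(2*lambda).
lam = Fraction(2, 3)
assert lam / 2 <= h <= (2 * lam) ** 0.5
```

Brute force is exponential in |V|, of course; it is only meant to make the definition and the two-sided Cheeger-type bound concrete.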
Expansion (3rd approach)
The computation directed acyclic graph (CDAG): vertices are inputs / outputs and intermediate values, edges are dependencies. Communication-cost is graph-expansion.

[Figure: a run trace – reads, FLOPs, and writes – cut into segments S1, S2, S3.]

For a given run (Algorithm, Machine, Input):
1. Consider the computation DAG G = (V, E): V = set of computations and inputs, E = dependencies.
2. Partition G into segments S of Θ(M^(ω0/2)) vertices (corresponding to adjacency in time / location).
3. Show that every segment S has ≥ 3M vertices with incoming / outgoing edges, so it performs ≥ M reads / writes.
4. The total communication is BW = (BW of one segment) · (#segments) = Ω(M) · Θ(n^ω0) / Θ(M^(ω0/2)) = Ω( n^ω0 / M^(ω0/2 - 1) )
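The final step of this counting argument is pure arithmetic: Ω(M) words per segment times Θ(n^ω0 / M^(ω0/2)) segments. Checking the algebra numerically (the concrete values of n and M are arbitrary, chosen by us):

```python
import math

# Segment-counting arithmetic, up to constants: each segment of
# Theta(M ** (w/2)) vertices forces Omega(M) words moved, and there are
# Theta(n ** w / M ** (w/2)) segments in a run with Theta(n ** w) flops.
n, M = 2 ** 12, 2 ** 8
w = math.log2(7)  # omega0 for Strassen
flops = n ** w
num_segments = flops / M ** (w / 2)
total_bw = M * num_segments
# The product simplifies to n ** w / M ** (w/2 - 1):
assert math.isclose(total_bw, n ** w / M ** (w / 2 - 1))
print(total_bw)
```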
Is it a Good Expander?
Break G into edge-disjoint graphs, corresponding to the algorithm on M^(1/2) x M^(1/2) matrices. Consider the expansions of S in each part (they sum up).

[Figure: the CDAG – Enc_lg n A and Enc_lg n B (n^2 inputs each) feeding Dec_lg n C – covered by copies of G(M^(1/2)), with segments S1, …, S5 marked.]

BW = Ω( T(n) ) · h( G(M^(1/2)) )

We need to show that a segment of M^(ω0/2) vertices expands to Ω(M):
h(G(n)) = Ω( M / M^(ω0/2) ) for n = Θ(M^(1/2)).
Namely, for every n: h(G(n)) = Ω( n^2 / n^(lg 7) ) = Ω( (4/7)^(lg n) )
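The last equality is the identity n^2 / n^(lg 7) = n^(lg 4 - lg 7) = (4/7)^(lg n), which can be verified numerically:

```python
import math

# Verify n**2 / n**(log2 7) == (4/7) ** (log2 n) for several n:
# both sides equal n ** (2 - log2 7) = n ** log2(4/7).
for n in (4, 16, 256, 4096):
    lhs = n ** 2 / n ** math.log2(7)
    rhs = (4 / 7) ** math.log2(n)
    assert math.isclose(lhs, rhs)
    print(n, lhs)
```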
What is the CDAG of Strassen's algorithm?
The DAG of Strassen, n = 2

[Figure: Enc1 A and Enc1 B each encode the four entries (1,1), (1,2), (2,1), (2,2); the seven product vertices M1, …, M7 feed Dec1 C, which decodes the four entries of C.]

M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
The DAG of Strassen, n = 4
One recursive level:
• Each vertex splits into four.
• Multiply blocks.

[Figure: the n = 2 CDAG with each vertex split into four, connected through Dec1 C, Enc1 A, and Enc1 B gadgets.]
The DAG of Strassen: further recursive steps

[Figure: Dec_lg n C, of depth lg n, decoding the n^2 entries of C.]

Recursive construction: given Dec_i C, construct Dec_(i+1) C:
1. Duplicate it 4 times.
2. Connect with a cross-layer of Dec1 C.
The DAG of Strassen

[Figure: Enc_lg n A and Enc_lg n B (n^2 inputs each) feed the n^(lg 7) multiplication vertices, which Dec_lg n C decodes into the n^2 entries of C.]

1. Compute weighted sums of A's elements.
2. Compute weighted sums of B's elements.
3. Compute the multiplications m1, m2, …, m_(n^(lg 7)).
4. Compute weighted sums of m1, m2, …, m_(n^(lg 7)) to obtain C.
Expansion of a Segment
Two methods to compute the expansion of the recursively constructed graph:
• Combinatorial – estimate the edge / vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
• Spectral – compute the edge expansion via the spectral gap (in the spirit of the Zig-Zag analysis [Reingold, Vadhan, Wigderson 00])
Expansion of a Segment
Main technical challenges:
• Two types of vertices: with / without recursion.
• The graph is not regular.

[Figure: the n = 2 CDAG again – Enc1 A, Enc1 B, the seven products, Dec1 C.]
Estimating the edge expansion – combinatorially

[Figure: a segment S intersecting parts S1, S2, S3, …, Sk; each vertex is classified as in S, not in S, or mixed.]

• Dec1 C is a consistency gadget: a mixed part pays 1/12 of its edges.
• The fraction of S vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).
• k = lg M
Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Open Problems
Find algorithms that attain the lower bounds:
• Sparse matrix algorithms
• for sequential and parallel models
• that auto-tune or are cache oblivious

Address complex heterogeneous hardware:
• Lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11]

Extend the techniques to other algorithms and algorithmic tools:
• Non-uniform recursive structure

Characterize a communication lower bound for a problem rather than for an algorithm.