
CS294, Lecture #10, Fall 2011: Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294

How to Compute and Prove Lower and Upper Bounds on the Communication Costs of Your Algorithm

Part III: Graph Analysis

Oded Schwartz

Based on:

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Graph expansion and communication costs of fast matrix multiplication.

Previous talk on lower bounds

Communication Lower Bounds: Proving that your algorithm/implementation is as good as it gets.

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Previous talk on lower bounds: algorithms with “flavor” of 3 nested loops

[Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04]:

$$\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{3} M\right) = \Omega\!\left(\frac{n^3}{\sqrt{M}}\right)$$

• BLAS, LU, Cholesky, LDLᵀ, and QR factorizations, eigenvalues and singular values; i.e., essentially all direct methods of linear algebra.

• Dense or sparse matrices. In sparse cases, the bandwidth cost is a function of the number of nonzeros (NNZ).

• Bandwidth and latency.

• Sequential, hierarchical, and parallel – distributed and shared memory models.

• Compositions of linear algebra operations.

• Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan, 11]

• Tensor contractions

In the parallel (distributed-memory) model:

$$\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{3} \frac{M}{P}\right)$$

Geometric Embedding (2nd approach)

[Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49].

(1) Generalized form: for all $(i,j) \in S$,

$$C(i,j) = f_{ij}\big(g_{i,j,k_1}(A(i,k_1), B(k_1,j)),\; g_{i,j,k_2}(A(i,k_2), B(k_2,j)),\; \ldots,\; \text{other arguments}\big), \quad k_1, k_2, \ldots \in S_{ij}.$$

But many algorithms just don't fit the generalized form!

For example: Strassen's fast matrix multiplication.

Beyond 3-nested loops

How about the communication costs of algorithms that have a more complex structure?


Communication Lower Bounds

Proving that your algorithm/implementation is as good as it gets.

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Recall: Strassen’s Fast Matrix Multiplication

[Strassen 69] • Compute 2 x 2 matrix multiplication using only 7 multiplications (instead of 8).

• Apply recursively (block-wise)

The blocks are of size $n/2 \times n/2$:

$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

$$\begin{aligned}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &= (A_{21} + A_{22})\,B_{11} \\
M_3 &= A_{11}\,(B_{12} - B_{22}) \\
M_4 &= A_{22}\,(B_{21} - B_{11}) \\
M_5 &= (A_{11} + A_{12})\,B_{22} \\
M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22})
\end{aligned}
\qquad
\begin{aligned}
C_{11} &= M_1 + M_4 - M_5 + M_7 \\
C_{12} &= M_3 + M_5 \\
C_{21} &= M_2 + M_4 \\
C_{22} &= M_1 - M_2 + M_3 + M_6
\end{aligned}$$

$$T(n) = 7\,T(n/2) + O(n^2) \;\Rightarrow\; T(n) = \Theta\!\left(n^{\log_2 7}\right)$$
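To make the recursion concrete, here is a minimal Python sketch of the block-wise scheme above. It assumes square matrices whose dimension is a power of two; the `cutoff` fallback to the classical product is an illustrative tuning knob, not part of Strassen's formulation.

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursive multiply for square matrices whose dimension
    is a power of two. Below `cutoff` we fall back to the classical
    product; cutoff=64 is an arbitrary illustrative choice."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # The 7 recursive products of Strassen's identity.
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # Reassemble the four quadrants of C from the 7 products.
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```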

Strassen-like algorithms

• Compute $n_0 \times n_0$ matrix multiplication using only $n_0^{\omega_0}$ multiplications (instead of $n_0^3$).

• Apply recursively (block-wise):

$$T(n) = n_0^{\omega_0}\, T(n/n_0) + O(n^2) \;\Rightarrow\; T(n) = \Theta\!\left(n^{\omega_0}\right)$$

Milestones in the exponent $\omega_0$:

• 2.81 [Strassen 69] (works fast in practice)
• 2.79 [Pan 78]
• 2.78 [Bini 79]
• 2.55 [Schönhage 81]
• 2.50 [Pan, Romani, Coppersmith Winograd 84]
• 2.48 [Strassen 87]
• 2.38 [Coppersmith Winograd 90]
• 2.38 [Cohn Kleinberg Szegedy Umans 05] (group-theoretic approach)
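The recurrence above pins down the exponent directly: a base case that multiplies $n_0 \times n_0$ blocks with $m$ scalar multiplications yields $\omega_0 = \log_{n_0} m$. A one-line check in Python (the function name is ours):

```python
import math

# From T(n) = m * T(n/n0) + O(n^2): the exponent is log base n0 of m.
def omega0(n0, m):
    return math.log(m, n0)

print(omega0(2, 7))   # Strassen: log2(7) ~ 2.807
print(omega0(2, 8))   # classical: log2(8) = 3.0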

New lower bound for Strassen’s fast matrix multiplication

[Ballard, Demmel, Holtz, S. 2011b]: The communication bandwidth lower bounds are as follows.

Sequential:

• For Strassen's: $\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\log_2 7} M\right)$
• Strassen-like: $\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} M\right)$
• Recall, for cubic (classical): $\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\log_2 8} M\right)$

Parallel:

• For Strassen's: $\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\log_2 7} \frac{M}{P}\right)$
• Strassen-like: $\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \frac{M}{P}\right)$
• For cubic: $\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\log_2 8} \frac{M}{P}\right)$

The parallel lower bounds apply to

• 2D: $M = \Theta(n^2/P)$
• 2.5D: $M = \Theta(c \cdot n^2/P)$
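As a sanity check (our substitution, not on the original slide), plugging the 2D memory size into the Strassen-like parallel bound makes the bandwidth cost explicit:

$$\mathrm{BW} = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \frac{M}{P}\right) \;=\; \Omega\!\left(\frac{n^2}{P^{\,2 - \omega_0/2}}\right) \quad\text{for } M = \Theta\!\left(\frac{n^2}{P}\right),$$

which recovers the familiar $\Omega(n^2/\sqrt{P})$ for classical $\omega_0 = 3$, and gives $\Omega(n^2/P^{\approx 0.60})$ for Strassen's $\omega_0 = \log_2 7$.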

For sequential? Hierarchy? Yes, existing implementations do!

For parallel 2D? Parallel 2.5D? Yes: new algorithms.

Sequential and new 2D and 2.5D parallel Strassen-like algorithms

Sequential and hierarchy cases: attained by the natural recursive implementation. Also: LU, QR, … (black-box use of fast matrix multiplication) [Ballard, Demmel, Holtz, S., Rom 2011].

New 2D parallel Strassen-like algorithm: attains the lower bound.

New 2.5D parallel Strassen-like algorithm: a $c^{\omega_0/2 - 1}$ parallel communication speedup over the 2D implementation (where $c \cdot n^2 = M \cdot P$).

[Ballard, Demmel, Holtz, S. 2011b]: This is as good as it gets.
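To put numbers on the $c^{\omega_0/2-1}$ speedup, a quick hedged computation (the function name is ours):

```python
import math

# Communication speedup of a 2.5D run over 2D, per the c**(omega0/2 - 1)
# factor above: M grows by a factor of c, and BW ~ M**-(omega0/2 - 1).
def speedup(c, omega0):
    return c ** (omega0 / 2 - 1)

print(speedup(4, 3.0))            # classical: sqrt(c) = 2.0
print(speedup(4, math.log2(7)))   # Strassen: 4**0.40 ~ 1.75
```

Note that the gain per unit of extra memory is smaller for Strassen-like algorithms than for the classical one, since their exponent $\omega_0/2 - 1$ is smaller.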

Implications for sequential architectural scaling

• Requirements so that "most" time is spent doing arithmetic on n × n dense matrices, $n^2 > M$:
  • Time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.
  • Time to multiply the two largest locally storable square matrices exceeds the latency.

(γ = time per flop, β = time per word, α = time per message.)

CA matrix multiplication algorithm    Scaling bandwidth requirement        Scaling latency requirement
Classic                               $M^{1/2} \cdot \gamma \ge \beta$     $M^{3/2} \cdot \gamma \ge \alpha$
Strassen-like                         $M^{\omega_0/2 - 1} \cdot \gamma \ge \beta$    $M^{\omega_0/2} \cdot \gamma \ge \alpha$

Strassen-like algorithms do fewer flops and communicate less, but are more demanding on the hardware.

If $\omega_0 \to 2$, it is all about communication.

Expansion (3rd approach)

[Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81].

Let $G = (V, E)$ be a d-regular graph. Its edge expansion is

$$h_G = \min_{S \subseteq V,\; |S| \le |V|/2} \frac{|E(S, V \setminus S)|}{d \cdot |S|}.$$

Let A be the normalized adjacency matrix of G, with eigenvalues $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and let $\lambda = 1 - \max\{\lambda_2, |\lambda_n|\}$.

Thm [Alon-Milman 84, Dodziuk 84, Alon 86]:

$$\frac{\lambda}{2} \;\le\; h_G \;\le\; \sqrt{2\lambda}.$$
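The theorem is easy to exercise numerically. Below is a minimal Python sketch (function and variable names are ours) that bounds $h_G$ from the spectrum of a small d-regular graph; the 15-cycle is used because its odd length keeps $|\lambda_n| < 1$, so the two-sided gap is nonzero.

```python
import numpy as np

def expansion_bounds(adj, d):
    """Bound the edge expansion h_G of a d-regular graph via its spectral
    gap, per the Cheeger-type inequality above. `adj` is a dense
    symmetric 0/1 adjacency matrix (numpy array)."""
    eigs = np.sort(np.linalg.eigvalsh(adj / d))[::-1]  # 1 = l1 >= l2 >= ...
    lam = 1 - max(eigs[1], abs(eigs[-1]))              # spectral gap
    return lam / 2, np.sqrt(2 * lam)   # lam/2 <= h_G <= sqrt(2*lam)

# Example: the 15-cycle (2-regular). Its true expansion is h = 1/7 ~ 0.143,
# which falls between the two bounds printed here.
n = 15
adj = np.zeros((n, n))
for i in range(n):
    adj[i, (i + 1) % n] = adj[i, (i - 1) % n] = 1.0
print(expansion_bounds(adj, d=2))  # ~ (0.011, 0.209)
```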

Expansion (3rd approach)

The computation directed acyclic graph (CDAG): vertices are the inputs/outputs and intermediate values; edges are the dependencies.

[Figure: a segment S of the CDAG, with read set R_S (values read into S) and write set W_S (values written out of S).]

Communication cost is graph expansion.

Expansion (3rd approach)

For a given run (Algorithm, Machine, Input):

[Figure: a run decomposed into segments S₁, S₂, S₃ of reads, FLOPs, and writes, each using fast memory of size M.]

1. Consider the computation DAG G = (V, E): V = the set of computations and inputs, E = the dependencies.
2. Partition G into segments S of $\Theta(M^{\omega_0/2})$ vertices (corresponding to adjacency in time / location).
3. Show that every segment S has $\ge 3M$ vertices with incoming / outgoing edges, hence performs $\ge M$ reads/writes.
4. The total communication bandwidth is

$$\mathrm{BW} = (\text{BW of one segment}) \times (\#\text{segments}) = \Omega(M) \cdot \frac{O(n^{\omega_0})}{\Theta(M^{\omega_0/2})} = \Omega\!\left(\frac{n^{\omega_0}}{M^{\omega_0/2 - 1}}\right).$$
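Note that this matches the form of the sequential bound stated earlier, since

$$\frac{n^{\omega_0}}{M^{\omega_0/2 - 1}} = \left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M.$$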

Is it a Good Expander?

Break G into edge-disjoint subgraphs, each corresponding to the algorithm run on $M^{1/2} \times M^{1/2}$ matrices. Consider the expansion of S in each part (they sum up).

[Figure: the CDAG as Enc_{lg n} A and Enc_{lg n} B (n² inputs each) feeding $n^{\lg 7}$ products, decoded by Dec_{lg n} C (n² outputs); a segment spread across parts S₁, …, S₅.]

$$\mathrm{BW} = \Omega(T(n)) \cdot h\!\left(G\!\left(M^{1/2}\right)\right)$$

We need to show that a segment of $M^{\omega_0/2}$ vertices expands to $\Omega(M)$:

$$h(G(n)) = \Omega\!\left(\frac{M}{M^{\omega_0/2}}\right) \quad \text{for } n = \Theta(M^{1/2}).$$

Namely, for every n,

$$h(G(n)) = \Omega\!\left(\frac{n^2}{n^{\lg 7}}\right) = \Omega\!\left(\left(\tfrac{4}{7}\right)^{\lg n}\right).$$
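The last equality is just a change of base, a step worth spelling out:

$$\frac{n^2}{n^{\lg 7}} = n^{2 - \lg 7} = \left(2^{\lg n}\right)^{2 - \lg 7} = \left(2^{2 - \lg 7}\right)^{\lg n} = \left(\frac{4}{7}\right)^{\lg n},$$

since $2^{2 - \lg 7} = 4/7$.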

What is the CDAG of Strassen's algorithm?

The DAG of Strassen, n = 2

[Figure: the n = 2 CDAG. Enc₁A combines the four blocks of A and Enc₁B the four blocks of B into the seven products M₁, …, M₇; Dec₁C combines the products into the four blocks of C. The edges encode exactly Strassen's formulas above.]
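As a concrete artifact, here is a minimal Python sketch of the n = 2 CDAG's dependency edges, read directly off Strassen's formulas; the names `enc_A`, `enc_B`, and `dec_C` are ours, mirroring the Enc₁A / Enc₁B / Dec₁C structure in the figure.

```python
# One level of Strassen's CDAG (n = 2): enc_A / enc_B list the A- and
# B-operands feeding each product M_k; dec_C lists the products combined
# into each output block. A plain adjacency-list sketch.
enc_A = {1: ["A11", "A22"], 2: ["A21", "A22"], 3: ["A11"], 4: ["A22"],
         5: ["A11", "A12"], 6: ["A21", "A11"], 7: ["A12", "A22"]}
enc_B = {1: ["B11", "B22"], 2: ["B11"], 3: ["B12", "B22"], 4: ["B21", "B11"],
         5: ["B22"], 6: ["B11", "B12"], 7: ["B21", "B22"]}
dec_C = {"C11": [1, 4, 5, 7], "C12": [3, 5], "C21": [2, 4], "C22": [1, 2, 3, 6]}

edges = [(a, f"M{k}") for k, ops in enc_A.items() for a in ops]
edges += [(b, f"M{k}") for k, ops in enc_B.items() for b in ops]
edges += [(f"M{k}", c) for c, ms in dec_C.items() for k in ms]
print(len(edges))  # 36 dependency edges in the n = 2 DAG
```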

The DAG of Strassen, n = 4

One recursive level:
• Each vertex splits into four.
• The blocks are multiplied recursively.

[Figure: Dec₁C with each vertex split into four, fed by two copies each of Enc₁A and Enc₁B.]

The DAG of Strassen: further recursive steps

[Figure: after lg n recursive steps, n² inputs feed Enc_{lg n}A and Enc_{lg n}B, whose $n^{\lg 7}$ outputs feed Dec_{lg n}C, producing n² outputs.]

Recursive construction: given Dec_i C, construct Dec_{i+1} C:
1. Duplicate it 4 times.
2. Connect the copies with a cross-layer of Dec₁C.

The DAG of Strassen

[Figure: the full CDAG. A (n² elements) enters Enc_{lg n}A and B (n² elements) enters Enc_{lg n}B; their $n^{\lg 7}$ outputs are multiplied pairwise and decoded by Dec_{lg n}C into C (n² elements).]

1. Compute weighted sums of A's elements.
2. Compute weighted sums of B's elements.
3. Compute the multiplications $m_1, m_2, \ldots, m_{n^{\lg 7}}$.
4. Compute weighted sums of $m_1, m_2, \ldots, m_{n^{\lg 7}}$ to obtain C.

Expansion of a Segment

Two methods to compute the expansion of the recursively constructed graph:
• Combinatorial: estimate the edge / vertex expansion directly (in the spirit of [Alon, S., Shapira 08]).
• Spectral: compute the edge expansion via the spectral gap (in the spirit of the zig-zag analysis [Reingold, Vadhan, Wigderson 00]).

Expansion of a Segment

Main technical challenges:
• Two types of vertices: with / without recursion.
• The graph is not regular.

[Figure: the n = 2 CDAG again (Enc₁A, Enc₁B, Dec₁C), illustrating the two vertex types.]

Estimating the edge expansion, combinatorially

[Figure: a segment S split across the parts S₁, S₂, …, S_k, where k = lg M; each vertex is classified as in S, not in S, or mixed.]

• Dec₁C is a consistency gadget: a mixed part pays ≥ 1/12 of its edges.
• The fraction of S between the 1st level and the four 2nd levels is consistent (deviations pay linearly).

Communication Lower Bounds

Proving that your algorithm/implementation is as good as it gets.

Approaches:

1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Open Problems

Find algorithms that attain the lower bounds:
• Sparse matrix algorithms
• for sequential and parallel models
• that auto-tune or are cache oblivious

Address complex heterogeneous hardware:
• Lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11]

Extend the techniques to other algorithms and algorithmic tools:
• Non-uniform recursive structure

Characterize a communication lower bound for a problem rather than for an algorithm.

CS294, Lecture #10, Fall 2011: Communication-Avoiding Algorithms

How to Compute and Prove Lower Bounds on the Communication Costs of Your Algorithm

Part III: Graph Analysis

Oded Schwartz

Based on:

G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Graph expansion and communication costs of fast matrix multiplication.

Thank you!