Optimizing Matrix Multiply


Minimizing Communication in
Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Communication Lower Bounds for
Direct Linear Algebra
Jim Demmel
EECS & Math Departments, UC Berkeley
[email protected]
Outline
• Recall motivation
• Lower bound proof for Matmul, by Irony/Toledo/Tiskin
• How to generalize it to linear algebra without orthogonal xforms
• How to generalize it to linear algebra with orthogonal xforms
• Summary of linear algebra problems for which the lower bound is attainable
• Summary of extensions to Strassen
Collaborators
• Grey Ballard, UCB EECS
• Ioana Dumitriu, U. Washington
• Laura Grigori, INRIA
• Ming Gu, UCB Math
• Mark Hoemmen, UCB EECS
• Olga Holtz, UCB Math & TU Berlin
• Julien Langou, U. Colorado Denver
• Marghoob Mohiyuddin, UCB EECS
• Oded Schwartz, TU Berlin
• Hua Xiang, INRIA
• Kathy Yelick, UCB EECS & NERSC
• BeBOP group, bebop.cs.berkeley.edu
Motivation (1/2)
Two kinds of costs:
• Arithmetic (FLOPS)
• Communication: moving data between
  • levels of a memory hierarchy (sequential case)
  • processors over a network connecting them (parallel case)
[Figure: a sequential machine with CPU, cache and DRAM, and a parallel machine with several CPU + DRAM nodes connected by a network.]
Motivation (2/2)
• Running time of an algorithm is the sum of 3 terms (see the sketch below):
  • #flops * time_per_flop
  • #words moved / bandwidth      } communication
  • #messages * latency           }
• Exponentially growing gaps between
  • Time_per_flop << 1/Network BW << Network Latency
    • Improving 59%/year vs 26%/year vs 15%/year
  • Time_per_flop << 1/Memory BW << Memory Latency
    • Improving 59%/year vs 23%/year vs 5.5%/year
• Goal: reorganize linear algebra to avoid communication
  • Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
  • Not just hiding communication (speedup ≤ 2x)
  • Arbitrary speedups possible
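A minimal Python sketch of the three-term running-time model above; the machine parameters and operand counts are illustrative assumptions, not measurements from the lecture.

```python
# A minimal sketch of the three-term running-time model above. The machine
# parameters and operand counts below are illustrative assumptions only.
def run_time(flops, words_moved, messages,
             time_per_flop=1e-9,   # seconds per flop (assumed)
             bandwidth=1e9,        # words per second (assumed)
             latency=1e-6):        # seconds per message (assumed)
    arithmetic = flops * time_per_flop
    communication = words_moved / bandwidth + messages * latency
    return arithmetic + communication

# Hypothetical n x n matmul with little data reuse: every operand of every
# multiply-add comes from slow memory, fetched 8 words per message.
n = 1000
print(run_time(flops=2 * n**3, words_moved=2 * n**3, messages=2 * n**3 // 8))
```

With numbers like these the latency and bandwidth terms dwarf the arithmetic term, which is why the rest of the lecture focuses on reorganizing algorithms to reduce communication rather than flops.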
Direct linear algebra: Prior Work on Matmul
• Assume n^3 algorithm (i.e. not Strassen-like)
• Sequential case, with fast memory of size M
  • Lower bound on #words moved to/from slow memory = Ω(n^3 / M^(1/2))  [Hong & Kung (1981)]
  • Attained using blocked or recursive ("cache-oblivious") algorithm
• Parallel case on P processors:
  • Let NNZ be the total memory needed; assume load balanced
  • Lower bound on #words communicated = Ω(n^3 / (P·NNZ)^(1/2))  [Irony, Tiskin & Toledo (2004)]
  • When NNZ = 3n^2 ("2D alg"), get Ω(n^2 / P^(1/2))
    • Attained by Cannon's Algorithm
  • For "3D alg" (NNZ = O(n^2·P^(1/3))), get Ω(n^2 / P^(2/3))
    • Attainable too (details later; see the sketch below)
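A small sketch evaluating the Irony/Tiskin/Toledo bound in the two memory regimes quoted above; constants are dropped and the parameter values are arbitrary.

```python
from math import sqrt

# Sketch: evaluate the matmul communication lower bound Omega(n^3 / (P*NNZ)^(1/2))
# in the "2D" and "3D" memory regimes quoted on this slide (constants dropped).
def words_lower_bound(n, P, NNZ):
    return n**3 / sqrt(P * NNZ)

n, P = 8192, 64
print(words_lower_bound(n, P, NNZ=3 * n**2))           # "2D": ~ n^2 / P^(1/2)
print(words_lower_bound(n, P, NNZ=n**2 * P**(1/3)))    # "3D": ~ n^2 / P^(2/3)
```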
Direct linear algebra: Generalizing Prior Work
• Sequential case: #words moved =
  Ω(n^3 / M^(1/2)) = Ω(#flops / (fast_memory_size)^(1/2))
• Parallel case: #words moved =
  Ω(n^3 / (P·NNZ)^(1/2)) = Ω((n^3/P) / (NNZ/P)^(1/2))
                         = Ω(#flops_per_proc / (memory_per_processor)^(1/2))
  (true for at least one processor, assuming "balance" in either flops or in memory)
• In both cases, we let M = memory size, and write
  #words_moved by at least one processor = Ω(#flops / M^(1/2))
Lower bound for all "direct" linear algebra
• #words_moved by at least one processor = Ω(#flops / M^(1/2))
• #messages_sent by at least one processor = Ω(#flops / M^(3/2))
• Holds for
  • Matmul, BLAS, LU, QR, eig, SVD
    • Need to explain model of computation
  • Some whole programs (sequences of these operations, no matter how they are interleaved)
  • Dense and sparse matrices (where #flops << n^3)
  • Sequential and parallel algorithms
  • Some graph-theoretic algorithms
Proof of Communication Lower Bound on C = A·B (1/5)
• Proof from Irony/Toledo/Tiskin (2004)
• Original proof, then generalization
• Think of the instruction stream being executed
  • Looks like "… add, load, multiply, store, load, add, …"
  • Each load/store moves a word between fast and slow memory, or between local memory and remote memory (another processor)
• We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered assuming addition is commutative/associative
• Assume that at most M words can be stored in fast memory
• Outline (a numeric sketch of this counting argument follows below):
  • Break the instruction stream into segments, each with M loads and stores
  • Somehow bound the maximum number of flops that can be done in each segment; call it F
  • So F · #segments ≥ T = total flops = 2·n^3, so #segments ≥ T / F
  • So #loads & stores = M · #segments ≥ M · T / F
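A numeric sketch of the counting argument in the outline; F = (2M)^(3/2) is the per-segment flop bound proved on the following slides, and constants and lower-order terms are ignored.

```python
# Sketch of the counting argument: if a segment of M loads/stores can perform
# at most F flops, then T total flops force at least T/F segments and hence
# at least M*T/F loads and stores.
def loads_and_stores_lower_bound(n, M):
    T = 2 * n**3              # flops of classical matmul
    F = (2 * M) ** 1.5        # max flops per segment (proved on the next slides)
    segments = T / F
    return M * segments       # ~ n^3 / (2M)^(1/2)

print(loads_and_stores_lower_bound(n=4096, M=2**20))
```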
Illustrating Segments, for M = 3
[Figure: an instruction stream of Loads, FLOPs and Stores laid out along a time axis, broken into Segment 1, Segment 2, Segment 3, …, each segment containing M = 3 loads and stores.]
Proof of Communication Lower Bound on C = A·B (2/5)
• Given a segment of the instruction stream with M loads & stores, how many adds & multiplies (F) can we do?
  • At most 2M entries of C, 2M entries of A and/or 2M entries of B can be accessed
• Use geometry:
  • Represent the n^3 multiplications by an n x n x n cube
  • One n x n face represents A
    • each 1 x 1 subsquare represents one A(i,k)
  • One n x n face represents B
    • each 1 x 1 subsquare represents one B(k,j)
  • One n x n face represents C
    • each 1 x 1 subsquare represents one C(i,j)
  • Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
    • May be added directly to C(i,j), or to a temporary accumulator
Proof of Communication Lower Bound on C = A·B (3/5)
[Figure: the n x n x n cube of multiplications with axes i, j, k; the "A face", "B face" and "C face" are labeled with entries such as A(1,1), A(1,2), A(1,3), A(2,1), B(1,1), B(2,1), B(3,1), B(1,3), C(1,1), C(2,3); one highlighted subcube represents C(1,1) += A(1,3)·B(3,1).]
• If we have at most 2M "A squares", 2M "B squares", and 2M "C squares" on the faces, how many cubes can we have?
Proof of Communication Lower Bound on C = A·B (4/5)
[Figure: a 3D set of unit cubes inside the i-j-k cube, with its "A shadow" and "C shadow" projections shown, and a black box with side lengths x, y and z.]
• #cubes in the black box with side lengths x, y and z
  = Volume of the black box = x·y·z = (xz · zy · yx)^(1/2) = (#A□s · #B□s · #C□s)^(1/2)
• (i,k) is in the "A shadow" if (i,j,k) is in the 3D set
  (j,k) is in the "B shadow" if (i,j,k) is in the 3D set
  (i,j) is in the "C shadow" if (i,j,k) is in the 3D set
• Thm (Loomis & Whitney, 1949):
  #cubes in 3D set = Volume of 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
Proof of Communication Lower Bound on C = A·B (5/5)
• Consider one "segment" of instructions with M loads and stores
• At most 2M entries of A, B, C are available in one segment
• Volume of the set of cubes representing possible multiply/adds in one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
• #Segments ≥ 2n^3 / F
• #Loads & Stores = M · #Segments ≥ M · 2n^3 / F ≥ n^3 / (2M)^(1/2) – M = Ω(n^3 / M^(1/2))
• Parallel case: apply the same reasoning to one processor out of P
  • #Adds and Muls ≥ 2n^3 / P (at least one processor does this)
  • M = n^2 / P (each processor gets an equal fraction of the matrix)
  • #"Loads & Stores" = #words moved from or to other processors
    ≥ M · (2n^3/P) / F = M · (2n^3/P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
Proof of the Loomis-Whitney inequality
[Figure: T drawn as a 3D set of 1x1x1 cubes on a lattice, with the slice T(x=i) and its projections N(x=i|y) and N(x=i|z), illustrating N(x=i) ≤ N(x=i|y) · N(x=i|z).]
• T = 3D set of 1x1x1 cubes on a lattice; N = |T| = #cubes
• Tx = projection of T onto the x=0 plane; Nx = |Tx| = #squares in Tx, and similarly Ty, Ny, Tz, Nz
• Goal: N ≤ (Nx · Ny · Nz)^(1/2)
• T(x=i) = subset of T with x = i
• T(x=i | y) = projection of T(x=i) onto the y=0 plane
• N(x=i) = |T(x=i)|, etc.
• N = Σ_i N(x=i) = Σ_i (N(x=i))^(1/2) · (N(x=i))^(1/2)
    ≤ Σ_i (Nx)^(1/2) · (N(x=i))^(1/2)                              … since N(x=i) ≤ Nx
    ≤ (Nx)^(1/2) · Σ_i (N(x=i | y) · N(x=i | z))^(1/2)             … since N(x=i) ≤ N(x=i|y) · N(x=i|z)
    = (Nx)^(1/2) · Σ_i (N(x=i | y))^(1/2) · (N(x=i | z))^(1/2)
    ≤ (Nx)^(1/2) · (Σ_i N(x=i | y))^(1/2) · (Σ_i N(x=i | z))^(1/2)  … Cauchy-Schwarz
    = (Nx)^(1/2) · (Ny)^(1/2) · (Nz)^(1/2)
(A brute-force numerical check of the inequality follows below.)
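A brute-force numerical check of the inequality on a random 3D set of lattice cubes; the set size and lattice dimensions are arbitrary.

```python
import random

# Sketch: numerically check the Loomis-Whitney inequality |T| <= sqrt(Nx*Ny*Nz)
# on a random 3D set of unit lattice cubes. Purely illustrative.
random.seed(0)
T = {(random.randrange(8), random.randrange(8), random.randrange(8))
     for _ in range(200)}

Nx = len({(j, k) for (i, j, k) in T})   # shadow on the x = 0 plane
Ny = len({(i, k) for (i, j, k) in T})   # shadow on the y = 0 plane
Nz = len({(i, j) for (i, j, k) in T})   # shadow on the z = 0 plane

assert len(T) <= (Nx * Ny * Nz) ** 0.5
print(len(T), (Nx * Ny * Nz) ** 0.5)
```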
Homework
• Prove a more general statement of Loomis-Whitney:
  • Suppose T is d-dimensional
  • N = |T| = # d-dimensional cubes in T
  • T(i) is the projection of T onto the hyperplane x(i) = 0
  • N(i) = (d-1)-dimensional volume of T(i)
  • Show N ≤ Π_{i=1 to d} (N(i))^(1/(d-1))
How to generalize this lower bound (1/4)
• It doesn't depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j))
• It doesn't depend on C and A not overlapping (or C and B, or A and B)
  • Some reorderings may change the answer; we still get a lower bound for all reorderings
• It doesn't depend on doing n^3 multiply/adds
• It doesn't depend on C(i,j) just being a sum of products C(i,j) = Σ_k A(i,k)*B(k,j)

  For all (i,j) ∈ S
    Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
How to generalize this lower bound (2/4)
(P)  For all (i,j) ∈ S
     Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
• (P) does assume the presence of an operand generating a load and/or store; how could this not happen?
  • Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
    • Ex: Compute C(m) = A * (B.^m) (Matlab notation) for m = 1 to t
    • Can move many fewer words than Ω(#flops / M^(1/2))
  • We might generate a result during a segment, use it, and discard it, without generating any memory traffic
    • Turns out QR, eig, SVD all may do this
    • Need a different analysis for them later…
How to generalize this lower bound (3/4)
(P)  For all (i,j) ∈ S
     Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
• Need to distinguish Sources and Destinations of each operand in fast memory during a segment:
  • Possible Sources:
    • S1: Already in fast memory at start of segment, or read; at most 2M
    • S2: Created during segment; no bound without more information
  • Possible Destinations:
    • D1: Left in fast memory at end of segment, or written; at most 2M
    • D2: Discarded; no bound without more information
  • Need to assume no S2/D2 arguments; at most 4M of the others
How to generalize this lower bound (4/4)
(P)  For all (i,j) ∈ S
     Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
• Theorem: To evaluate (P) with memory of size M, where
  • f_ij and g_ijk are "nontrivial" functions of their arguments,
  • G is the total number of g_ijk's, and
  • there are no S2/D2 arguments,
  requires at least G / (8M)^(1/2) – M slow memory references
  • Simpler: #words_moved = Ω(#flops / M^(1/2))
• Corollary: To evaluate (P) requires at least G / (8^(1/2) · M^(3/2)) – 1 messages (simpler: Ω(#flops / M^(3/2)))
  • Proof: the maximum message size is M
Some Corollaries of this lower bound (1/4)
(P)  For all (i,j) ∈ S
     Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
     #words_moved = Ω(#flops / M^(1/2))
• Theorem applies to dense or sparse, parallel or sequential:
  • Matmul, including A^T·A or A^2
  • Triangular solve C = A^(-1)·X
    • C(i,j) = (X(i,j) - Σ_{k<i} A(i,k)*C(k,j)) / A(i,i)   … A lower triangular
    • C plays the double role of b and c in Model (P)
  • LU factorization (any pivoting, LU or "ILU")
    • L(i,j) = (A(i,j) - Σ_{k<j} L(i,k)*U(k,j)) / U(j,j),  U(i,j) = A(i,j) - Σ_{k<i} L(i,k)*U(k,j)  (see the sketch below)
    • L (and U) play the double role of c and a (c and b) in Model (P)
  • Cholesky (any diagonal pivoting, C or "IC")
    • L(i,j) = (A(i,j) - Σ_{k<j} L(i,k)*L^T(k,j)) / L(j,j),  L(j,j) = (A(j,j) - Σ_{k<j} L(j,k)*L^T(k,j))^(1/2)
    • L (and L^T) play a triple role as a, b and c
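The LU bullet above can be made concrete with a minimal unpivoted Doolittle sketch; it is only meant to show how L and U appear on both sides of the recurrence (the double roles in model (P)), not to be a robust factorization.

```python
import numpy as np

# Minimal sketch of the (unpivoted) LU recurrence quoted above.
def lu_doolittle(A):
    n = A.shape[0]
    L, U = np.eye(n), np.zeros_like(A)
    for j in range(n):
        for i in range(j + 1):                       # U(i,j) = A(i,j) - sum_{k<i} L(i,k) U(k,j)
            U[i, j] = A[i, j] - L[i, :i] @ U[:i, j]
        for i in range(j + 1, n):                    # L(i,j) = (A(i,j) - sum_{k<j} L(i,k) U(k,j)) / U(j,j)
            L[i, j] = (A[i, j] - L[i, :j] @ U[:j, j]) / U[j, j]
    return L, U

A = np.random.rand(5, 5) + 5 * np.eye(5)             # diagonally dominant, so no pivoting needed
L, U = lu_doolittle(A)
print(np.allclose(L @ U, A))
```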
Some Corollaries of this lower bound (2/4)
(P)  For all (i,j) ∈ S
     Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
     #words_moved = Ω(#flops / M^(1/2))
• Applies to "simplified" operations
  • Ex 1: Compute ||A·B||_F where A(i,k) = f(i,k) and B(k,j) = g(k,j), so there are no inputs and 1 output; assume each f(i,k) and g(k,j) is evaluated once (a blocked sketch of this example follows below)
  • Ex 2: Compute the determinant of A(i,j) = f(i,j) using LU
• Apply the lower bound by "imposing reads and writes"
  • Ex 1: Every time a final value of (A·B)(i,j) is computed, write it; every time f(i,k) or g(k,j) is evaluated, insert a read
  • Ex 2: Every time a final value of L(i,j) or U(i,j) is computed, write it; every time f(i,j) is evaluated, insert a read
• Still get Ω(#flops / M^(1/2) – 3n^2) words_moved, by subtracting the "imposed" reads/writes (sparse case analogous)
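A possible blocked sketch of Ex 1: A and B are never formed, blocks are generated from f and g on the fly, and only a running sum of squares is kept. Note this naive version re-evaluates blocks of f and g across panels, whereas the count on the slide assumes each entry is evaluated once; f, g and the block size are illustrative choices.

```python
import numpy as np

# Sketch of "Ex 1": compute ||A*B||_F where A(i,k) = f(i,k) and B(k,j) = g(k,j)
# are generated on the fly, one block at a time, so no full input is stored.
f = lambda i, k: np.sin(i + 2 * k)
g = lambda k, j: np.cos(3 * k - j)

def frobenius_of_product(n, b=64):
    total = 0.0
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            Cblk = np.zeros((min(b, n - i0), min(b, n - j0)))
            for k0 in range(0, n, b):
                i = np.arange(i0, min(i0 + b, n))
                j = np.arange(j0, min(j0 + b, n))
                k = np.arange(k0, min(k0 + b, n))
                Ablk = f(i[:, None], k[None, :])   # evaluate f on the block
                Bblk = g(k[:, None], j[None, :])   # evaluate g on the block
                Cblk += Ablk @ Bblk
            total += np.sum(Cblk ** 2)             # keep only the squared norm
    return np.sqrt(total)

print(frobenius_of_product(200))
```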
Some Corollaries of this lower bound (3/4)
(P)  For all (i,j) ∈ S
     Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
     #words_moved = Ω(#flops / M^(1/2))
• Applies to compositions of operations
  • Ex: Compute A^k by repeated multiplication; the only input is A, the only output is A^k
• Reorganize the composition to match (P):
  A → A^2 → A^3 = A^2 · A → … → A^(t-1) → A^t
  • Impose writes of the intermediate results A^2, …, A^(t-1)
  • Still get #words_moved = Ω(#flops / M^(1/2) – (t-2)·n^2)
  • Holds for any interleaving of the operations
• Homework: apply to repeated squaring
Some Corollaries of this lower bound (4/4)
(P)  For all (i,j) ∈ S
     Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
     #words_moved = Ω(#flops / M^(1/2))
• Applies to some graph algorithms
• Ex: All-Pairs-Shortest-Path by Floyd-Warshall (a runnable version follows below):
    Initialize all path(i, j) = length of edge from node i to j (or ∞ if none)
    for k := 1 to n
      for i := 1 to n
        for j := 1 to n
          path(i, j) = min( path(i, j), path(i, k) + path(k, j) )
  • Matches (P) with g_ijk = "+" and f_ij = "min"
  • Get #words_moved = Ω(n^3 / M^(1/2)) for dense graphs
  • Homework: state the result for sparse graphs; how does n^3 change?
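A minimal runnable version of the Floyd-Warshall pseudocode above, assuming the graph is given as an adjacency matrix with math.inf for missing edges.

```python
import math

# Minimal runnable version of the Floyd-Warshall pseudocode above.
def floyd_warshall(path):
    n = len(path)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # The update fits model (P): g_ijk is "+", f_ij is "min".
                path[i][j] = min(path[i][j], path[i][k] + path[k][j])
    return path

INF = math.inf
dist = floyd_warshall([[0, 3, INF], [INF, 0, 1], [2, INF, 0]])
print(dist)   # shortest path lengths between all pairs of nodes
```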
"3D" parallel matrix multiplication C = C + A·B
[Figure: p processors arranged in a p^(1/3) x p^(1/3) x p^(1/3) grid along axes i, j, k, with blocks such as A(1,1), A(1,2), A(1,3), A(2,1), B(1,1), B(2,1), B(3,1), B(1,2), C(1,1), C(2,2) shown on the faces.]
• p procs arranged in a p^(1/3) x p^(1/3) x p^(1/3) grid, with tree networks along the "fibers"
• Each A(i,k), B(k,j), C(i,j) is n/p^(1/3) x n/p^(1/3)
• Initially
  • Proc (i,j,0) owns C(i,j)
  • Proc (i,0,k) owns A(i,k)
  • Proc (0,j,k) owns B(k,j)
• Algorithm (α = latency per message, β = time per word):
  • For all (i,k), broadcast A(i,k) to procs (i,j,k) for all j
    … comm. cost = log(p^(1/3)) · α + log(p^(1/3)) · (n^2 / p^(2/3)) · β
  • For all (j,k), broadcast B(k,j) to procs (i,j,k) for all i … same comm. cost
  • For all (i,j,k): Tmp(i,j,k) = A(i,k) · B(k,j) … cost = 2(n/p^(1/3))^3 = 2n^3/p flops
  • For all (i,j), reduce C(i,j) = Σ_k Tmp(i,j,k) … same comm. cost
• Total comm. cost = O(log(p) · (n^2 / p^(2/3)) · β + log(p) · α)
• Lower bound = Ω((n^3/p) / (n^2/p^(2/3))^(1/2) · β + (n^3/p) / (n^2/p^(2/3))^(3/2) · α) = Ω((n^2/p^(2/3)) · β + α)
• Lower bound also ~attainable for all n^2/p^(2/3) ≤ M = n^2/p^x ≤ n^2/p [Toledo et al]
(A small cost-model comparison with the 2D algorithm follows below.)
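A small cost-model sketch comparing the per-processor communication of a 2D (Cannon-style) algorithm with the 3D algorithm above; α, β and the problem sizes are illustrative assumptions, and constants and lower-order terms are ignored.

```python
from math import log2

# Sketch: compare the per-processor communication cost of a "2D" algorithm
# (Cannon-style, memory ~3n^2/p) with the "3D" algorithm above (memory ~n^2/p^(2/3)).
# alpha (latency) and beta (inverse bandwidth) are illustrative assumptions.
def comm_2d(n, p, alpha=1e-6, beta=1e-9):
    words = n**2 / p**0.5          # Omega(n^2 / p^(1/2)), attained by Cannon
    msgs = p**0.5
    return words * beta + msgs * alpha

def comm_3d(n, p, alpha=1e-6, beta=1e-9):
    words = log2(p) * n**2 / p**(2/3)
    msgs = log2(p)
    return words * beta + msgs * alpha

n, p = 1 << 14, 1 << 9
print(comm_2d(n, p), comm_3d(n, p))
```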
So when doesn't Model (P) apply?
(P)  For all (i,j) ∈ S
     Mem(c(i,j)) = f_ij( g_ijk( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )
     #words_moved = Ω(#flops / M^(1/2))
• Ex: for r = 1 to t, C(r) = A * (B.^(1/r))   … Matlab notation
• (P) does not apply:
  • With A(i,k) and B(k,j) in memory, we can do t operations, not just 1 g_ijk
  • Can't apply (P) with arguments b(k,j) indexed in a one-to-one fashion
• Can still analyze using segments and Loomis-Whitney
  • #flops per segment ≤ 8(t·M^3)^(1/2)
  • #segments ≥ (t·n^3) / (8(t·M^3)^(1/2))
  • #words_moved = Ω(t^(1/2)·n^3 / M^(1/2)), not Ω(t·n^3 / M^(1/2))
• Attainable, using a variation of the usual blocked matmul algorithm
  • Homework! (Hint: need to recompute (B(j,k))^(1/r) when needed)
Lower bounds for Orthogonal Transformations (1/4)
• Needed for QR, eig, SVD, …
• Analyze blocked Householder transformations
  • Π_{j=1 to b} (I – 2·u_j·u_j^T) = I – U·T·U^T where U = [u_1, …, u_b]
  • Treat Givens rotations as 2x2 Householder transformations
• Model details and assumptions
  • Write (I – U·T·U^T)·A = A – U·(T·U^T·A) = A – U·Z where Z = T·(U^T·A)
  • Only count operations in all the A – U·Z updates
    • "Generically" a large fraction of the work
  • Assume "Forward Progress": each successive Householder transformation leaves previously created zeros zero; OK for
    • QR decomposition
    • Reductions to condensed forms (Hessenberg, tri/bidiagonal)
      – Possible exception: bulge chasing in the banded case
    • One sweep of (block) Hessenberg QR iteration
Lower bounds for Orthogonal Transformations (2/4)
• Perform many updates A – U·Z where Z = T·(U^T·A)
• First challenge to applying the theorem: need to collect all the A – U·Z updates into one big set to which model (P) applies
  • Write them all as { A(i,j) = A(i,j) - Σ_k U(i,k)·Z(k,j) } where k = index of the k-th transformation
  • k is not necessarily the index of the column of A it comes from
• Second challenge: all the Z(k,j) may be S2/D2
  • Recall: S2/D2 means computed on the fly and discarded
  • Ex: A_{n x 2n} = Q_{n x n} · R_{n x 2n} where 2n^2 = M, so A fits in cache
    • Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms
    • There are n^2(n-1)/2 = Ω(M^(3/2)) nonzero Z(k,j), not O(M)
    • Still only do O(M^(3/2)) flops during a segment
    • But we can't use Loomis-Whitney to prove it!
    • Need a new idea…
Lower bounds for Orthogonal Transformations (3/4)
• Dealing with Z(k,j) being S2/D2
• How to bound #flops in A(i,j) = A(i,j) - Σ_k U(i,k)·Z(k,j)?
• Neither A nor U is S2/D2
  • A is eventually turned into R or U, which are output
  • So at most 2M of each during a segment
• #flops ≤ (#U(i,k)) · (#columns of A(:,j) present)
         ≤ (#U(i,k)) · (#A(i,j) / min #nonzeros per column of A)
         ≤ h · O(M) / r   where h ≡ #U(i,k) = O(M) and r = min #nonzeros per column … how small can r be?
• To store h ≡ #U(i,k) Householder vector entries in r rows, there can be at most r in the first column, r-1 in the second, etc. (to maintain "Forward Progress"), so r(r-1)/2 ≥ h, so r ≥ h^(1/2)
• #flops ≤ h · O(M) / r ≤ O(M) · h^(1/2) = O(M^(3/2)) as desired
Lower bounds for Orthogonal Transformations (4/4)
• Theorem: #words_moved by QR decomposition using (blocked) Householder transformations = Ω(#flops / M^(1/2))
• Theorem: #words_moved by reduction to Hessenberg form, tridiagonal form, bidiagonal form, or one sweep of QR iteration (or block versions of any of these) = Ω(#flops / M^(1/2))
  • Assuming Forward Progress (created zeros remain zero)
  • Model: merge left and right orthogonal transformations:
    A(i,j) = A(i,j) - Σ_{kL} UL(i,kL)·ZL(kL,j) - Σ_{kR} ZR(i,kR)·UR(j,kR)
Robust PCA (Candes, 2009)
Decompose a matrix into a low rank component and a sparse component:
M=L+S
Homework: Apply lower bound result
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  • Mostly not
• If not, are there other algorithms that do?
  • Yes
• Several goals for algorithms:
  • Minimize bandwidth (Ω(#flops / M^(1/2))) and latency (Ω(#flops / M^(3/2)))
  • Multiple memory hierarchy levels (attain the bound for each level?)
  • Explicit dependence on (multiple) M(s), vs "cache oblivious"
    • Fewest flops when the matrix fits in the smallest memory
• What about sparse algorithms?
Recall: Minimizing latency requires new data structures
• To minimize latency, need to load/store whole rectangular
subblock of matrix with one “message”
• Incompatible with conventional columnwise (rowwise) storage
• Ex: Rows (columns) not in contiguous memory locations
• Blocked storage: store as a matrix of bxb blocks, each block stored contiguously (see the repacking sketch below)
• Ok for one level of memory hierarchy, what if more?
• Recursive blocked storage: store each block using subblocks
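A minimal sketch of repacking a row-major matrix into contiguous b x b blocks, so that a whole block can be moved with one "message"; the block size is an illustrative choice and n is assumed to be a multiple of b.

```python
import numpy as np

# Sketch: repack a matrix into b-by-b blocks, each stored contiguously.
def to_blocked(A, b):
    n = A.shape[0]
    # blocks[i, j] is the contiguous b x b block in block-row i, block-column j
    return A.reshape(n // b, b, n // b, b).swapaxes(1, 2).copy()

def from_blocked(blocks):
    nb, _, b, _ = blocks.shape
    return blocks.swapaxes(1, 2).reshape(nb * b, nb * b)

A = np.arange(64.0).reshape(8, 8)
B = to_blocked(A, b=4)
print(np.allclose(from_blocked(B), A))
```

Recursive blocked storage would apply the same repacking to each block in turn, one level per level of the memory hierarchy.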
Recall: Blocked vs Cache-Oblivious Algorithms
• Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C whose dimensions depend on the cache size:
    … Break A_{nxn}, B_{nxn}, C_{nxn} into bxb blocks labeled A(i,j), etc.
    … b chosen so 3 bxb blocks fit in cache
    for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
        C(i,j) = C(i,j) + A(i,k)·B(k,j)       … b x b matmul
    … another level of memory would need 3 more loops
• Cache-oblivious Matmul C = A·B is independent of the cache size:
    Function C = MM(A,B)
    If A and B are 1x1
        C = A · B
    else
        … Break A_{nxn}, B_{nxn}, C_{nxn} into (n/2)x(n/2) blocks labeled A(i,j), etc.
        for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
            C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )    … n/2 x n/2 matmul
(Runnable versions of both follow below.)
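Runnable Python sketches of both pseudocodes above; the block size stands in for "three b x b blocks fit in cache", and n is assumed to be a power of 2 (and a multiple of b) to keep the sketches short.

```python
import numpy as np

# Sketch of the blocked matmul pseudocode above (n assumed a multiple of b).
def blocked_matmul(A, B, b):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # b x b matmul on one block triple
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

# Sketch of the cache-oblivious (recursive) matmul (n assumed a power of 2).
def cache_oblivious_matmul(A, B):
    n = A.shape[0]
    if n == 1:
        return A * B
    h = n // 2
    C = np.zeros((n, n))
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                C[i:i+h, j:j+h] += cache_oblivious_matmul(A[i:i+h, k:k+h],
                                                          B[k:k+h, j:j+h])
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
print(np.allclose(blocked_matmul(A, B, 4), A @ B),
      np.allclose(cache_oblivious_matmul(A, B), A @ B))
```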
Summary of dense sequential algorithms attaining communication lower bounds
• Algorithms shown minimizing #messages assume a (recursive) block layout
• Many references (see reports), only some shown, plus ours
• Older references may or may not include analysis
• Cache-oblivious algorithms are underlined, green entries are ours, ? is unknown/future work

• BLAS-3
  • 2 levels of memory (#words moved and #messages): usual blocked or recursive algorithms
  • Multiple levels of memory: usual blocked algorithms (nested), or recursive [Gustavson,97]
• Cholesky
  • 2 levels: LAPACK (with b = M^(1/2)), [Gustavson 97], [BDHS09]; same references for #messages
  • Multiple levels: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]; same for #messages
• LU with pivoting
  • 2 levels: #words moved: LAPACK (rarely), [Toledo,97], [GDX 08]; #messages: [GDX 08]
  • Multiple levels: #words moved: [Toledo,97], [GDX 08]?; #messages: [GDX 08]?
• QR
  • 2 levels: #words moved: LAPACK (rarely), [Elmroth,Gustavson,98], [DGHL08]; #messages: [Frens,Wise,03], [DGHL08]
  • Multiple levels: #words moved: [Elmroth,Gustavson,98], [DGHL08]?; #messages: [Frens,Wise,03], [DGHL08]?
• Eig, SVD
  • 2 levels: not LAPACK; [BDD10]
  • Multiple levels: [BDD10]
Summary of dense 2D parallel algorithms attaining communication lower bounds
• Assume nxn matrices on P processors, memory per processor = O(n^2 / P)
• Many references (see reports); green entries are ours
• ScaLAPACK assumes the best block size b is chosen
• Recall the lower bounds: #words_moved = Ω(n^2 / P^(1/2)) and #messages = Ω(P^(1/2))

Each entry below gives the factor by which the algorithm exceeds the lower bound, first for #words_moved, then for #messages:
• Matrix multiply: [Cannon, 69]: 1, 1
• Cholesky: ScaLAPACK: log P, log P
• LU: [GDX08]: log P, log P; ScaLAPACK: log P, (N / P^(1/2)) · log P
• QR: [DGHL08]: log P, log^3 P; ScaLAPACK: log P, (N / P^(1/2)) · log P
• Sym Eig, SVD: [BDD10]: log P, log^3 P; ScaLAPACK: log P, N / P^(1/2)
• Nonsym Eig: [BDD10]: log P, log^3 P; ScaLAPACK: P^(1/2) · log P, N · log P
Recent Communication Optimal Algorithms
• QR with column pivoting
• Cholesky with diagonal pivoting
• LU with "complete" pivoting
• LDL' with "complete" pivoting
• Sparse Cholesky
  • For matrices with "good separators"
  • For "most sparse matrices" (as hard as the dense case)
Homework: Extend the lower bound
• To algorithms using (block) Givens rotations instead of (block) Householder transformations
• To QR done with Gram-Schmidt orthogonalization
  • CGS and MGS
• To heterogeneous collections of processors
  • Suppose processor k
    • does flops at rate γ(k),
    • has reciprocal bandwidth β(k), and
    • has latency α(k)
  • What is a lower bound on the solution time, for any fractions of the work assigned to each processor (i.e. not all equal)?
• To a homogeneous parallel shared memory machine
  • All data initially resides in a large shared memory
  • Processors communicate by data written to/read from slow memory
  • Each processor has a local fast memory of size M