CS294, Lecture #5
Fall, 2011
Communication-Avoiding Algorithms
www.cs.berkeley.edu/~odedsc/CS294
Communication Avoiding
Algorithms
for
Dense Linear Algebra:
LU, QR, and Cholesky decompositions
and
Sparse Cholesky decompositions
Jim Demmel and Oded Schwartz
Based on:
Bootcamp 2011, CS267, Summer School 2010, Laura Grigori's lectures, and many papers
Outline for today
• Recall communication lower bounds
• What a “matching upper bound”, i.e. CA-algorithm,
might have to do
• Summary of what is known so far
• Case Studies of CA dense algorithms
• Last time: case study I: Matrix multiplication
• Case study II: LU, QR
• Case study III: Cholesky decomposition
• Case Studies of CA sparse algorithms
• Sparse Cholesky decomposition
2
Communication lower bounds
• Applies to algorithms satisfying technical conditions
discussed last time
• “smells like 3-nested loops”
• For each processor:
• M = size of fast memory of that processor
• G = #operations (multiplies) performed by that processor
#words_moved to/from fast memory = Ω( max( G / M^(1/2), #inputs + #outputs ) )

#messages to/from fast memory ≥ #words_moved / max_message_size = Ω( G / M^(3/2) )
3
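
To make the two bounds concrete, here is a small Python sketch (not from the lecture) that simply evaluates them for a given operation count G, fast-memory size M, and maximum message size; the function name and example numbers are illustrative only.

```python
def communication_lower_bounds(G, M, n_inputs, n_outputs, max_message_size):
    """Evaluate the per-processor lower bounds stated above."""
    words_lb = max(G / M ** 0.5, n_inputs + n_outputs)  # Omega( max(G/M^(1/2), #in + #out) )
    messages_lb = words_lb / max_message_size           # with max_message_size = M: Omega(G/M^(3/2))
    return words_lb, messages_lb

# Example: sequential n x n matrix multiply, G = n^3 multiplies, M = 10^6 words.
n = 4096
print(communication_lower_bounds(G=n**3, M=1e6,
                                 n_inputs=2 * n * n, n_outputs=n * n,
                                 max_message_size=1e6))
```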
Summary of dense sequential algorithms
attaining communication lower bounds
• Algorithms shown minimizing # Messages use (recursive) block layout
• Not possible with columnwise or rowwise layouts
• Many references (see reports), only some shown, plus ours
• Cache-oblivious are underlined, Green are ours, ? is unknown/future work
For each algorithm, the attaining algorithms for 2 levels of memory (#words moved / #messages) and for multiple levels of memory (#words moved / #messages):

• BLAS-3:
  – 2 levels: usual blocked or recursive algorithms (#words moved and #messages)
  – Multiple levels: usual blocked algorithms (nested), or recursive [Gustavson,97] (←same for #messages)
• Cholesky:
  – 2 levels: LAPACK (with b = M^(1/2)), [Gustavson 97], [BDHS09]
  – Multiple levels: [Gustavson,97], [Ahmed,Pingali,00], [BDHS09]
• LU with pivoting:
  – 2 levels: #words moved: LAPACK (sometimes), [Toledo,97], [GDX 08] (not partial pivoting); #messages: [GDX 08]
  – Multiple levels: #words moved: [Toledo, 97], [GDX 08]?; #messages: [GDX 08]?
• QR (rank-revealing):
  – 2 levels: #words moved: LAPACK (sometimes), [Elmroth,Gustavson,98], [DGHL08]; #messages: [Frens,Wise,03] (but 3x flops?), [DGHL08]
  – Multiple levels: #words moved: [Elmroth,Gustavson,98], [DGHL08]?; #messages: [Frens,Wise,03], [DGHL08]?
• Eig, SVD:
  – 2 levels: Not LAPACK; [BDD11] (randomized, but more flops), [BDK11]
  – Multiple levels: [BDD11, BDK11?] for both #words moved and #messages
Summary of dense parallel algorithms
attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Table (filled in over the next slides): for each algorithm (Matrix Multiply, Cholesky, LU, QR, Sym Eig/SVD, Nonsym Eig), a reference and the factors by which it exceeds the lower bounds on #words_moved and #messages.
Summary of dense parallel algorithms
attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower    Factor exceeding lower
                                bound for #words_moved    bound for #messages
Matrix Multiply  [Cannon, 69]   1
Cholesky         ScaLAPACK      log P
LU               ScaLAPACK      log P
QR               ScaLAPACK      log P
Sym Eig, SVD     ScaLAPACK      log P
Nonsym Eig       ScaLAPACK      P^(1/2) log P
Summary of dense parallel algorithms
attaining communication lower bounds
• Assume n x n matrices on P processors (conventional approach)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower    Factor exceeding lower
                                bound for #words_moved    bound for #messages
Matrix Multiply  [Cannon, 69]   1                         1
Cholesky         ScaLAPACK      log P                     log P
LU               ScaLAPACK      log P                     n log P / P^(1/2)
QR               ScaLAPACK      log P                     n log P / P^(1/2)
Sym Eig, SVD     ScaLAPACK      log P                     n / P^(1/2)
Nonsym Eig       ScaLAPACK      P^(1/2) log P             n log P
Summary of dense parallel algorithms
attaining communication lower bounds
• Assume n x n matrices on P processors (better)
• Minimum memory per processor = M = O(n^2 / P)
• Recall lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower    Factor exceeding lower
                                bound for #words_moved    bound for #messages
Matrix Multiply  [Cannon, 69]   1                         1
Cholesky         ScaLAPACK      log P                     log P
LU               [GDX10]        log P                     log P
QR               [DGHL08]       log P                     log^3 P
Sym Eig, SVD     [BDD11]        log P                     log^3 P
Nonsym Eig       [BDD11]        log P                     log^3 P
Can we do even better?
• Assume n x n matrices on P processors
• Why just one copy of data: M = O(n^2 / P) per processor?
• Increase M to reduce the lower bounds:
  #words_moved = Ω( (n^3/P) / M^(1/2) ) = Ω( n^2 / P^(1/2) )
  #messages    = Ω( (n^3/P) / M^(3/2) ) = Ω( P^(1/2) )

Algorithm        Reference      Factor exceeding lower    Factor exceeding lower
                                bound for #words_moved    bound for #messages
Matrix Multiply  [Cannon, 69]   1                         1
Cholesky         ScaLAPACK      log P                     log P
LU               [GDX10]        log P                     log P
QR               [DGHL08]       log P                     log^3 P
Sym Eig, SVD     [BDD11]        log P                     log^3 P
Nonsym Eig       [BDD11]        log P                     log^3 P
Recall Sequential One-sided Factorizations (LU, QR)
• Classical Approach
for i=1 to n
update column i
update trailing matrix
• #words_moved = O(n^3)
• Blocked Approach (LAPACK)
for i=1 to n/b
update block i of b columns
update trailing matrix
• #words moved = O(n^3 / M^(1/3))
• Recursive Approach
func factor(A)
if A has 1 column, update it
else
factor(left half of A)
update right half of A
factor(right half of A)
• #words moved = O(n^3 / M^(1/2))
• None of these approaches
minimizes #messages or
addresses parallelism
• Need another idea
10
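
As a concrete rendering of the recursive approach above, here is a minimal numpy sketch of recursive LU without pivoting (illustrative only; it assumes no pivoting is needed and is not the LAPACK or CA code). The in-place layout (unit L below the diagonal, U on and above it) mirrors the usual LAPACK convention.

```python
import numpy as np

def recursive_lu(A):
    """Recursive LU without pivoting, in place: factor the left half,
    update the right half, then factor the trailing block (slide's scheme)."""
    m, n = A.shape
    if n == 1:
        A[1:, 0] /= A[0, 0]                      # one column: scale to get L
        return A
    k = n // 2
    recursive_lu(A[:, :k])                       # factor left half (m x k panel)
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)     # unit lower triangular part
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])  # U12 = L11^{-1} * A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]           # Schur complement update
    recursive_lu(A[k:, k:])                      # factor right (trailing) block
    return A

# Factors are stored in place: U in the upper triangle, unit L below it.
A = np.random.rand(6, 6) + 6 * np.eye(6)         # diagonally dominant: no pivoting needed
LU = recursive_lu(A.copy())
L = np.tril(LU, -1) + np.eye(6)
U = np.triu(LU)
assert np.allclose(L @ U, A)
```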
Minimizing latency in sequential case
requires new data structures
• To minimize latency, need to load/store whole rectangular
subblock of matrix with one “message”
• Incompatible with conventional columnwise (rowwise) storage
• Ex: Rows (columns) not in contiguous memory locations
• Blocked storage: store as matrix of bxb blocks, each block
stored contiguously
• Ok for one level of memory hierarchy, what if more?
• Recursive blocked storage: store each block using subblocks
• Also known as “space filling curves”, “Morton ordering”
11
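
For illustration, a small Python sketch (not from the lecture) of the two storage ideas: copying a matrix into contiguous b x b blocks, and ordering those blocks by a Morton (Z-order) index. The helper names `to_blocked` and `morton_index` are hypothetical.

```python
import numpy as np

def to_blocked(A, b):
    """Copy an n x n matrix (n divisible by b) into b x b blocks, each block
    contiguous in memory, blocks ordered row-major (one level of blocking)."""
    n = A.shape[0]
    nb = n // b
    return (A.reshape(nb, b, nb, b)   # (block-row, row-in-block, block-col, col-in-block)
             .transpose(0, 2, 1, 3)   # (block-row, block-col, row, col)
             .copy())                 # force each b x b block to be contiguous

def morton_index(i, j, nbits=16):
    """Morton (Z-order) index of block coordinates (i, j): interleave their bits.
    Ordering blocks this way approximates recursive blocked storage."""
    z = 0
    for bit in range(nbits):
        z |= ((j >> bit) & 1) << (2 * bit)
        z |= ((i >> bit) & 1) << (2 * bit + 1)
    return z

# Example: lay out the 4 x 4 grid of blocks of a 16 x 16 matrix in Morton order.
A = np.arange(256, dtype=float).reshape(16, 16)
blocks = to_blocked(A, 4)                            # shape (4, 4, 4, 4)
order = sorted(((i, j) for i in range(4) for j in range(4)),
               key=lambda ij: morton_index(*ij))
print(order[:8])   # (0,0), (0,1), (1,0), (1,1), (0,2), ...
```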
Different Data Layouts for Parallel GE
(Figure: six ways to distribute a matrix over four processors, labeled 0-3.)

1) 1D Column Blocked Layout: bad load balance, P0 idle after the first n/4 steps.
2) 1D Column Cyclic Layout: load balanced, but can't easily use BLAS2 or BLAS3.
3) 1D Column Block Cyclic Layout: can trade load balance and BLAS2/3 performance by choosing the block size b, but factorization of a block column is a bottleneck.
4) Block Skewed Layout: complicated addressing; may not want full parallelism in each column, row.
5) 2D Row and Column Blocked Layout: bad load balance, P0 idle after the first n/2 steps.
6) 2D Row and Column Block Cyclic Layout: the winner!
12
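
A minimal sketch of the winning layout's index arithmetic, assuming a pr x pc process grid and ScaLAPACK-style block numbering (the helper names are illustrative, not ScaLAPACK routines):

```python
def owner(ib, jb, pr, pc):
    """Process-grid coordinates owning global block (ib, jb) in a 2D
    block-cyclic layout on a pr x pc grid."""
    return ib % pr, jb % pc

def local_block(ib, jb, pr, pc):
    """Where block (ib, jb) lives inside its owner's local array of blocks."""
    return ib // pr, jb // pc

# Example: an 4 x 4 grid of blocks on a 2 x 2 process grid.
print([[owner(i, j, 2, 2) for j in range(4)] for i in range(4)])
```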
Distributed GE with a 2D Block Cyclic Layout
(Figures, slides 13-14: snapshots of distributed Gaussian elimination with a 2D block cyclic layout. The trailing-matrix update is a matrix multiply: green = green - blue * pink.)
Where is the communication bottleneck?
• In both the sequential and parallel case: panel factorization
• Can we retain partial pivoting? No
• "Proof" for the parallel case:
  • Each choice of pivot requires a reduction (to find the maximum entry) per column
  • #messages = Ω(n) or Ω(n log P), versus the goal of O(P^(1/2))
• First the solution for QR, then LU
15
TSQR: QR of a Tall, Skinny matrix
W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
  = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

[R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

[R01; R11] = Q02·R02

Output = { Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02 }
16
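
A minimal numpy sketch of the binary-tree reduction above (illustrative only; it returns just the final R and keeps the intermediate Q factors implicit):

```python
import numpy as np

def tsqr(W, nblocks=4):
    """Binary-tree TSQR sketch: factor the row blocks independently, then
    repeatedly stack pairs of R factors and re-factor until one R remains."""
    Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, nblocks, axis=0)]
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]))[1] for i in range(0, len(Rs), 2)]
    return Rs[0]

# R agrees with a direct QR of W up to the signs of its rows.
W = np.random.rand(1000, 4)
assert np.allclose(np.abs(tsqr(W)), np.abs(np.linalg.qr(W)[1]))
```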
TSQR: An Architecture-Dependent Algorithm
(Figure: choices of reduction tree for TSQR on W = [W0; W1; W2; W3].)

• Parallel (binary tree): {W0, W1, W2, W3} → {R00, R10, R20, R30} → {R01, R11} → R02
• Sequential (flat tree): fold the blocks into a running R factor: R00 → R01 → R02 → R03
• Dual Core: each core runs a flat tree on its half of the blocks, then the two R factors are combined
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
Can choose reduction tree dynamically
17
Back to LU: use a similar idea for TSLU as for TSQR:
use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

W1 = P1·L1·U1   → choose b pivot rows of W1, call them W1'
W2 = P2·L2·U2   → choose b pivot rows of W2, call them W2'
W3 = P3·L3·U3   → choose b pivot rows of W3, call them W3'
W4 = P4·L4·U4   → choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12   → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34   → choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234   → choose b pivot rows

Go back to W and use these b pivot rows
(move them to the top, then do LU without pivoting)
18
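
A small Python sketch of the pivot-row selection only (a binary-tree tournament using scipy's GEPP on each set of candidate rows; names and block counts are illustrative, and the real TSLU runs the leaf factorizations in parallel):

```python
import numpy as np
from scipy.linalg import lu

def gepp_top_rows(block, idx, b):
    """Indices (into the original panel) of the b rows that GEPP on `block`
    would use as its first b pivot rows."""
    P, L, U = lu(block)            # block = P @ L @ U
    perm = P.argmax(axis=0)        # row perm[j] of block is moved to position j
    return idx[perm[:b]]

def tournament_pivot_rows(W, b, nblocks=4):
    """Tournament pivoting sketch (the pivot-row selection of TSLU):
    pick b winners per block with GEPP, then merge winner sets pairwise."""
    groups = [gepp_top_rows(W[g], g, b)
              for g in np.array_split(np.arange(W.shape[0]), nblocks)]
    while len(groups) > 1:
        merged = []
        for i in range(0, len(groups), 2):
            idx = np.concatenate(groups[i:i + 2])
            merged.append(gepp_top_rows(W[idx], idx, b))
        groups = merged
    return groups[0]

# The selected rows are moved to the top of W, and the whole panel is then
# factored by LU without pivoting (as described on the slide).
W = np.random.rand(64, 4)
print(tournament_pivot_rows(W, b=4))
```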
Minimizing Communication in TSLU
Parallel (binary tree):
  W = [W1; W2; W3; W4]: local LU on each Wi, then pairwise LUs on the selected pivot rows, then one final LU.

Sequential (flat tree):
  W = [W1; W2; W3; W4]: LU on W1, then fold in W2, W3, W4 one block at a time, doing an LU at each step.

Dual Core:
  each core reduces its half of the blocks with a flat tree of LUs, then one LU combines the two results.
Multicore / Multisocket / Multirack / Multisite / Out-of-core:
Can Choose reduction tree dynamically, as before
19
Making TSLU Numerically Stable
• Stability Goal: Make ||A – PLU|| very small: O(machine_epsilon · ||A||)
• Details matter
• Going up the tree, we could do LU either on original rows of W
(tournament pivoting), or computed rows of U
• Only tournament pivoting stable
• Thm: New scheme as stable as Partial Pivoting (PP) in following sense: get
same results as PP applied to different input matrix whose entries are blocks
taken from input A
• CALU – Communication Avoiding LU for general A
– Use TSLU for panel factorizations
– Apply to rest of matrix
– Cost: redundant panel factorizations:
  – Extra flops for one reduction step: n/b panels × (n·b^2) flops per panel
  – Sequential: this is n^2·b; with b = M^(1/2), the extra flops are n^2·M^(1/2) (< n^3, since M < n^2)
  – Parallel 2D: log P reduction steps, n/b panels, n·b^2/P^(1/2) flops per panel, and b = n/(log^2(P)·P^(1/2)),
    so the extra flops are (n/b)·(n·b^2)·(log P)/P^(1/2) = n^2·b·log P/P^(1/2) = n^3/(P·log P) < n^3/P
• Benefit:
– Stable in practice, but not same pivot choice as GEPP
– One reduction operation per panel: reduces latency to minimum
20
Sketch of TSLU Stability Proof
• Consider A = [A11 A12; A21 A22; A31 A32], with TSLU of the first block column done by applying the permutation
    Π = [Π11 Π12 0; Π21 Π22 0; 0 0 I]
  (assume wlog that A21 already contains the desired pivot rows)
• The resulting LU decomposition (first n/2 steps) is
    Π·A = [A'11 A'12; A'21 A'22; A31 A32]
        = [L'11 0 0; L'21 I 0; L'31 0 I] · [U'11 U'12; 0 As22; 0 As32]
  where As22, As32 are the Schur complement blocks
• Claim: As32 can be obtained by GEPP on a larger matrix G formed from blocks of A:
    G = [ A'11   0     A'12 ;
          A21    A21   0    ;
          0     -A31   A32  ]
      = [ L'11            0     0 ;
          A21·U'11^(-1)   L21   0 ;
          0              -L31   I ]
      · [ U'11   0     U'12                          ;
          0      U21   -L21^(-1)·A21·U'11^(-1)·U'12  ;
          0      0     As32                          ]
• GEPP applied to G pivots no rows, and we need to show that its final Schur complement equals As32, i.e. that
    L31·L21^(-1)·A21·U'11^(-1)·U'12 + As32
      = L31·U21·U'11^(-1)·U'12 + As32     (since L21^(-1)·A21 = U21)
      = A31·U'11^(-1)·U'12 + As32         (since L31·U21 = A31)
      = L'31·U'12 + As32                  (same as above, L'31 = A31·U'11^(-1))
      = A32
21
Stability of LU using TSLU: CALU (1/2)
• Worst case analysis
• Pivot growth factor from one panel factorization with b columns
  • TSLU:
    – Proven: 2^(bH), where H = 1 + height of the reduction tree
    – Attained: 2^b, same as GEPP
• Pivot growth factor on an m x n matrix
  • TSLU:
    – Proven: 2^(nH-1)
    – Attained: 2^(n-1), same as GEPP
• |L| can be as large as 2^((b-2)H-b+1), but CALU is still stable
• There are examples where GEPP is exponentially less
  stable than CALU, and vice versa, so neither is always
  better than the other
• Discovered sparse matrices with O(2^n) pivot growth
22
Stability of LU using TSLU: CALU (2/2)
• Empirical testing
  • Both random matrices and "special ones"
  • Both binary tree (BCALU) and flat tree (FCALU)
  • 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  • See [D., Grigori, Xiang, 2010] for details
23
Performance vs ScaLAPACK
• TSLU
– IBM Power 5
• Up to 4.37x faster (16 procs, 1M x 150)
– Cray XT4
• Up to 5.52x faster (8 procs, 1M x 150)
• CALU
– IBM Power 5
• Up to 2.29x faster (64 procs, 1000 x 1000)
– Cray XT4
• Up to 1.81x faster (64 procs, 1000 x 1000)
• See INRIA Tech Report 6523 (2008), paper at SC08
25
Exascale predicted speedups for Gaussian Elimination:
CA-LU vs ScaLAPACK-LU
(Figure: predicted speedup as a function of log2(p) on the x-axis and log2(n^2/p) = log2(memory_per_proc) on the y-axis.)
27
Case Study III:
Cholesky Decomposition
29
LU and Cholesky decompositions
LU decomposition of A
A = P·L·U
U is upper triangular
L is lower triangular
P is a permutation matrix
Cholesky decomposition of A
If A is a real symmetric positive definite matrix,
then there exists a real lower triangular L such that
A = L·L^T
(L is unique if we require its diagonal elements to be positive)
Efficient parallel and sequential algorithms?
30
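
For reference, both factorizations in a few lines of scipy (illustrative usage of library routines, not the communication-avoiding algorithms discussed here):

```python
import numpy as np
from scipy.linalg import lu, cholesky

A = np.random.rand(5, 5)
P, L, U = lu(A)                      # LU with partial pivoting: A = P @ L @ U
assert np.allclose(A, P @ L @ U)

S = A @ A.T + 5 * np.eye(5)          # symmetric positive definite
Lc = cholesky(S, lower=True)         # S = Lc @ Lc.T, with positive diagonal
assert np.allclose(S, Lc @ Lc.T)
```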
Application of the generalized form
to Cholesky decomposition
(1) Generalized form:
    C(i,j) = f_ij( g_(i,j,k1)( A(i,k1), B(k1,j) ),
                   g_(i,j,k2)( A(i,k2), B(k2,j) ),
                   …,
                   other arguments ),
    for (i,j) ∈ S, with k1, k2, … ∈ S_ij

(2) Cholesky:
    L(i,i) = ( A(i,i) − Σ_(k<i) L(i,k)^2 )^(1/2)
    L(i,j) = ( A(i,j) − Σ_(k<j) L(i,k)·L(j,k) ) / L(j,j),   for i > j
31
By the Geometric Embedding theorem
[Ballard, Demmel, Holtz, S. 2011a]
Follows [Irony,Toledo,Tiskin 04], based on [Loomis & Whitney 49]
The communication cost (bandwidth) of any classic algorithm for
Cholesky decomposition (for dense matrices) is:
  Sequential:  Ω( n^3 / M^(1/2) )
  Parallel:    Ω( n^3 / (P·M^(1/2)) )
32
Some Sequential Classical Cholesky Algorithms
• Lower bound: Bandwidth Ω(n^3/M^(1/2)), Latency Ω(n^3/M^(3/2))
• Naive (column-major): Bandwidth Θ(n^3), Latency Θ(n^3/M)
• LAPACK (with the right block size!), not cache-oblivious:
  – Column-major: Bandwidth O(n^3/M^(1/2)), Latency O(n^3/M)
  – Contiguous blocks: Bandwidth O(n^3/M^(1/2)), Latency O(n^3/M^(3/2))
• Rectangular Recursive [Toledo 97], cache-oblivious:
  – Column-major: Bandwidth O(n^3/M^(1/2) + n^2 log n), Latency Θ(n^3/M)
  – Contiguous blocks: Bandwidth O(n^3/M^(1/2) + n^2 log n), Latency Θ(n^2)
• Square Recursive [Ahmed, Pingali 00], cache-oblivious:
  – Column-major: Bandwidth O(n^3/M^(1/2)), Latency O(n^3/M)
  – Contiguous blocks: Bandwidth O(n^3/M^(1/2)), Latency O(n^3/M^(3/2))
33
Some Parallel Classical Cholesky Algorithms
• Lower bound:
  – General: Bandwidth Ω(n^3/(P·M^(1/2))), Latency Ω(n^3/(P·M^(3/2)))
  – 2D layout (M = O(n^2/P)): Bandwidth Ω(n^2/P^(1/2)), Latency Ω(P^(1/2))
• ScaLAPACK:
  – General: Bandwidth O((n^2/P^(1/2) + n·b)·log P), Latency O((n/b)·log P)
  – Choosing b = n/P^(1/2): Bandwidth O((n^2/P^(1/2))·log P), Latency O(P^(1/2)·log P)
34
Upper bounds – Supporting Data Structures
Top (column-major): Full, Old Packed, Rectangular Full Packed.
Bottom (block-contiguous): Blocked, Recursive Format, Recursive Full Packed
[Frigo, Leiserson, Prokop, Ramachandran 99, Ahmed, Pingali 00, Elmroth, Gustavson, Jonsson,
Kagstrom 04].
35
Upper bounds – Supporting Data Structures
Rules of thumb:
• Don’t mix column/row major with blocked:
decide which one fits your algorithm.
• Converting between the two once in your algorithm
is OK for dense instances.
E.g., changing the DS at the beginning
of your algorithm.
36
The Naïve Algorithms
L(i,i) = ( A(i,i) − Σ_(k<i) L(i,k)^2 )^(1/2)

L(i,j) = ( A(i,j) − Σ_(k<j) L(i,k)·L(j,k) ) / L(j,j),   for i > j

(Figure: access patterns, over indices i, j, k, of the left-looking and right-looking variants.)

Naïve left-looking and right-looking:
Bandwidth: Θ(n^3)
Latency: Θ(n^3/M)
Assuming a column-major DS.
37
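
A direct transcription of the two formulas above into Python (a naive left-looking sketch; fine for checking correctness, and communication-wise it corresponds to the naive algorithm analyzed above):

```python
import numpy as np

def naive_cholesky(A):
    """Entrywise left-looking Cholesky, a direct transcription of
    L(i,i) = sqrt(A(i,i) - sum_{k<i} L(i,k)^2) and
    L(i,j) = (A(i,j) - sum_{k<j} L(i,k)*L(j,k)) / L(j,j)."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for j in range(n):
        L[j, j] = np.sqrt(A[j, j] - np.dot(L[j, :j], L[j, :j]))
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
    return L

S = np.random.rand(6, 6)
S = S @ S.T + 6 * np.eye(6)          # symmetric positive definite test matrix
L = naive_cholesky(S)
assert np.allclose(L @ L.T, S)
```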
Rectangular Recursive [Toledo 97]
(Figure: an m x n matrix split into block columns of width n/2, with rows blocked as A11/A21/A31 on the left and A12/A22/A32 on the right; the left half is factored into [L11; L21; L31], the trailing block is updated, then the right half is factored.)

• Recurse on the left rectangular half: [A11; A21; A31] → [L11; L21; L31]
• Update: [A22; A32] := [A22; A32] − [L21; L31]·L21^T
• Recurse on the right rectangular half (the updated [A22; A32]); set L12 := 0
• Base case (a single column): L := A / (A11)^(1/2)
38
Rectangular Recursive [Toledo 97]
(Same picture as the previous slide.)

Toledo's algorithm guarantees stability for LU.

Bandwidth recurrence:
BW(m,n) = 2m if n = 1, otherwise
BW(m,n) = BW(m, n/2) + BW_mm(m − n/2, n/2, n/2) + BW(m − n/2, n/2)

Bandwidth: O(n^3/M^(1/2) + n^2 log n)
Optimal bandwidth (for almost all M)
39
Rectangular Recursive [Toledo 97]
(Same picture as the previous slide.)

Is it latency optimal?
If we use a column-major DS, then
L = Θ(n^3/M), which is a factor of M^(1/2) away from the optimum Θ(n^3/M^(3/2)).
This is due to the matrix multiply.

Open problem:
Can we prove a latency lower bound for every classic matrix-multiply
algorithm that uses a column-major DS?
Note: we don't need such a lower bound here, as the matrix-multiply algorithm is known.
40
Rectangular Recursive [Toledo 97]
(Same picture as the previous slide.)

Is it latency optimal?
If we use a contiguous-blocks DS, then
L = Θ(n^2).
This is due to the forward column updates.
The optimum is Θ(n^3/M^(3/2)), and n^3/M^(3/2) ≤ n^2 whenever n^(2/3) ≤ M,
so in that range the algorithm is not latency optimal.
41
Square Recursive algorithm
[Ahmed, Pingali 00], First analyzed in [Ballard, Demmel, Holtz, S. 09]
(Figure: an n x n matrix split into n/2 x n/2 blocks A11, A12, A21, A22.)

• Recurse on A11  (→ L11)
• L21 = RTRSM(A21, L11^T)
• A22 := A22 − L21·L21^T
• Recurse on A22  (→ L22);  A12 := 0

RTRSM: Recursive Triangular Solver
42
Square Recursive algorithm
[Ahmed, Pingali 00], First analyzed in [Ballard, Demmel, Holtz, S. 09]
(Same picture and steps as the previous slide.)

Bandwidth recurrence:
BW(n) = 3n^2 if 3n^2 ≤ M, otherwise
BW(n) = 2·BW(n/2) + O(n^3/M^(1/2) + n^2)
43
Square Recursive algorithm
[Ahmed, Pingali 00], First analyzed in [Ballard, Demmel, Holtz, S. 09]
(Same picture and steps as before.)

Uses
• recursive matrix multiplication [Frigo, Leiserson, Prokop, Ramachandran 99]
• and recursive RTRSM

Bandwidth: O(n^3/M^(1/2))   Latency: O(n^3/M^(3/2))   (assuming a recursive-blocks DS)

• Optimal bandwidth
• Optimal latency
• Cache oblivious
• Not stable for LU
44
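
A sequential numpy sketch of the square recursive scheme (illustrative only; it uses a dense triangular solve where the slide uses a recursive RTRSM, so it is not the analyzed cache-oblivious implementation):

```python
import numpy as np
from scipy.linalg import solve_triangular

def recursive_cholesky(A):
    """Square recursive Cholesky: factor A11, solve for L21, update A22, recurse."""
    n = A.shape[0]
    if n == 1:
        return np.sqrt(A)
    k = n // 2
    L11 = recursive_cholesky(A[:k, :k])
    # L21 * L11^T = A21  (the RTRSM step; here a dense triangular solve)
    L21 = solve_triangular(L11, A[k:, :k].T, lower=True).T
    A22 = A[k:, k:] - L21 @ L21.T            # symmetric update of the trailing block
    L22 = recursive_cholesky(A22)
    L = np.zeros_like(A, dtype=float)
    L[:k, :k], L[k:, :k], L[k:, k:] = L11, L21, L22   # upper-right block of L stays 0
    return L

S = np.random.rand(8, 8)
S = S @ S.T + 8 * np.eye(8)
L = recursive_cholesky(S)
assert np.allclose(L @ L.T, S)
```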
Parallel Algorithm
(Figure: the matrix distributed block-cyclically over a Pr x Pc process grid, processes numbered 1-9.)

1. Find Cholesky locally on a diagonal block. Broadcast downwards.
2. Parallel triangular solve. Broadcast to the right.
3. Parallel symmetric rank-b update.
45
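
The serial analogue of these three steps is the right-looking blocked Cholesky; a short Python sketch follows (illustrative; the parallel version distributes the blocks over the Pr x Pc grid and replaces the loops with broadcasts):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def blocked_cholesky(A, b):
    """Right-looking blocked Cholesky, the serial analogue of the parallel steps:
    1) factor the diagonal block, 2) triangular solve for the panel below it,
    3) symmetric rank-b update of the trailing matrix."""
    n = A.shape[0]
    L = np.array(A, dtype=float)
    for k in range(0, n, b):
        e = min(k + b, n)
        L[k:e, k:e] = cholesky(L[k:e, k:e], lower=True)                    # step 1
        if e < n:
            L[e:, k:e] = solve_triangular(L[k:e, k:e], L[e:, k:e].T,
                                          lower=True).T                    # step 2
            L[e:, e:] -= L[e:, k:e] @ L[e:, k:e].T                         # step 3
    return np.tril(L)

S = np.random.rand(9, 9)
S = S @ S.T + 9 * np.eye(9)
L = blocked_cholesky(S, b=3)
assert np.allclose(L @ L.T, S)
```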
Parallel Algorithm
(Same figure as the previous slide.)

• Low-order term, as b ≤ n/P^(1/2)
• Optimal (up to a log P factor) only for 2D.
  The new 2.5D algorithm extends to any memory size M.

• Lower bound:
  – General: Bandwidth Ω(n^3/(P·M^(1/2))), Latency Ω(n^3/(P·M^(3/2)))
  – 2D layout (M = O(n^2/P)): Bandwidth Ω(n^2/P^(1/2)), Latency Ω(P^(1/2))
• ScaLAPACK:
  – General: Bandwidth O((n^2/P^(1/2) + n·b)·log P), Latency O((n/b)·log P)
  – Choosing b = n/P^(1/2): Bandwidth O((n^2/P^(1/2))·log P), Latency O(P^(1/2)·log P)
46
Summary of algorithms for
dense Cholesky decomposition
Tight lower bound for Cholesky decomposition.
Sequential:
  Bandwidth: Θ(n^3/M^(1/2))   Latency: Θ(n^3/M^(3/2))
Parallel:
  Bandwidth and latency lower bounds
  tight up to a Θ(log P) factor.
47
Sparse Algorithms:
Cholesky Decomposition
Based on [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet, '09]
48
Lower bounds for sparse direct methods
• Geometric embedding for sparse 3-nested-loops-like
algorithms may be loose:
BW = Ω( #flops / M^(1/2) ),
L  = Ω( #flops / M^(3/2) )
These may be smaller than the NNZ (number of nonzeros) in the input/output.
• For example computing the product of two (block) diagonal
matrices may involve no communication in parallel
49
Lower bounds for sparse matrix multiplication
• Compute C = A·B, with A, B, C of size n x n, each having d·n nonzeros
  uniformly distributed; #flops is at most d^2·n
• Lower bound on communication in the sequential case:
  BW = Ω(d·n),  L = Ω(d·n/M)
• Two types of algorithms, depending on how their communication pattern is designed:
  – Taking into account the nonzero structure of the matrices
    (example: Buluc, Gilbert): BW = L = Ω(d^2·n)
  – Oblivious to the nonzero structure of the matrices
    (example: sparse Cannon, Buluc, Gilbert, for the parallel case):
    BW = Ω(d·n/P^(1/2)),  #messages = Ω(P^(1/2))
• [L. Grigori, P.-Y. David, J. Demmel, S. Peyronnet]
50
Lower bounds for sparse Cholesky factorization
• Consider A of size k^s x k^s resulting from a finite-difference operator
  on a regular grid of dimension s ≥ 2 with k^s nodes.
• Its Cholesky factor L contains a dense lower triangular matrix of size k^(s−1) x k^(s−1).
  BW = Ω( W / M^(1/2) )
  L  = Ω( W / M^(3/2) )
  where W = k^(3(s−1))/2 for a sequential algorithm,
  and   W = k^(3(s−1))/(2P) for a parallel algorithm.
• This result applies more generally to a matrix A
  whose graph G = (V,E) has the following property for some s:
  • every set of vertices W ⊂ V with n/3 ≤ |W| ≤ 2n/3 is adjacent to at least s
    vertices in V − W
  • the Cholesky factor of A then contains a dense s x s submatrix
51
Optimal algorithms for sparse Cholesky factorization
• PSPASES with an optimal layout could attain the lower
bound in parallel for 2D/3D regular grids:
• Uses nested dissection to reorder the matrix
• Distributes the matrix using the subtree-to-subcube algorithm
• The factorization of every dense multifrontal matrix is performed using an
optimal parallel dense Cholesky factorization.
• Sequential multifrontal algorithm attains the lower bound
• The factorization of every dense multifrontal matrix is performed using an
optimal dense Cholesky factorization
52
Open Problems / Work in Progress (1/2)
• Rank Revealing CAQR (RRQR)
• AP = QR, P a perm. chosen to put “most independent” columns first
• Does LAPACK geqp3 move O(n^3) words, or fewer?
• Better: Tournament Pivoting (TP) on blocks of columns
• CALU with more reliable pivoting (Gu, Grigori, Khabou et al)
• Replace tournament pivoting of panel (TSLU) with Rank-Revealing QR of Panel
• LU with Complete Pivoting (GECP)
  • A = Pr·L·U·Pc^T, where Pr and Pc are permutations putting the "most independent" rows & columns first
  • Idea: TP on column blocks to choose columns, then TSLU on those columns to choose rows
• LU of A = A^T, with symmetric pivoting
  • A = P·L·L^T·P^T, P a permutation, L lower triangular
  • Bunch-Kaufman, rook pivoting: O(n^3) words moved?
  • CA approach: choose a block with one step of CA-GECP, permute rows & columns
    to the top left, pick a symmetric subset with local Bunch-Kaufman
• A = P·L·B·L^T·P^T, with symmetric pivoting (Toledo et al)
  • L triangular, P a permutation, B banded
  • B tridiagonal: due to Aasen
  • CA approach: B block tridiagonal, stability still in question
53
Open Problems / Work in Progress (2/2)
• Heterogeneity (Ballard, Gearhart)
• How to assign work to attain lower bounds when flop rates / bandwidths / latencies vary?
• Using extra memory in distributed memory case (Solomonik)
• Using all available memory to attain improved lower bounds
• Just matmul and CALU so far
• Using extra memory in shared memory case
• Sparse matrices …
• Combining sparse and dense algorithms, for better communication performance,
e.g.,
• The above [Grigori, David, Peleg, Peyronnet ‘09]
• Our retuned version of [Yuster Zwick ‘04]
• Eliminate the log P factor?
• Prove latency lower bounds for specific DS use, e.g., can we show that any classic
  matrix-multiply algorithm that uses a column-major DS has to be a factor of M^(1/2) away from
  optimality?
54