Transcript Document
Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/~demmel
Optimizing Krylov Subspace Methods
Jim Demmel EECS & Math Departments, UC Berkeley [email protected]
Outline of Lecture 8: Optimizing Krylov Subspace Methods
• k steps of a typical iterative solver for Ax=b or Ax=λx
  – Does k SpMVs with a starting vector (e.g., with b, if solving Ax=b) – a code sketch of this conventional kernel follows the outline
  – Finds the "best" solution among all linear combinations of these k+1 vectors
  – Many such "Krylov Subspace Methods": Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
  – Assume the matrix is "well-partitioned," with modest surface-to-volume ratio
  – Parallel implementation
    • Conventional: O(k log p) messages, because of k calls to SpMV
    • New: O(log p) messages – optimal
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
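For reference, here is a minimal sketch (not from the lecture; the function name and the SciPy test matrix are illustrative) of the conventional kernel: k successive SpMVs producing the k+1 Krylov vectors, with one communication round per SpMV in a parallel setting.

```python
import numpy as np
import scipy.sparse as sp

def krylov_vectors_naive(A, b, k):
    """Conventional building block: k successive SpMVs producing the k+1
    Krylov vectors [b, Ab, ..., A^k b]. In the conventional parallel
    implementation each SpMV triggers its own round of messages, giving
    the O(k log p) message count quoted in the outline."""
    V = [b]
    for _ in range(k):
        V.append(A @ V[-1])   # one SpMV (and one communication round) per step
    return V

# Toy usage: 1D Poisson (tridiagonal) matrix, k = 8
n, k = 100, 8
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
V = krylov_vectors_naive(A, b, k)   # unorthogonalized basis of the Krylov subspace
```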
Locally Dependent Entries for [x,Ax], A tridiagonal, 2 processors
[Figure: entries of x, Ax, …, A⁸x on the 1D mesh, partitioned between Proc 1 and Proc 2; the entries of x and Ax computable from locally owned data are highlighted]
Can be computed without communication
Locally Dependent Entries for [x,Ax,A²x], A tridiagonal, 2 processors
[Figure: the locally computable entries of x, Ax, A²x]
Can be computed without communication
Locally Dependent Entries for [x,Ax,…,A³x], A tridiagonal, 2 processors
[Figure: the locally computable entries of x, Ax, …, A³x]
Can be computed without communication
Locally Dependent Entries for [x,Ax,…,A⁴x], A tridiagonal, 2 processors
[Figure: the locally computable entries of x, Ax, …, A⁴x]
Can be computed without communication
Locally Dependent Entries for [x,Ax,…,A⁸x], A tridiagonal, 2 processors
[Figure: the locally computable entries of x, Ax, …, A⁸x]
Can be computed without communication – k=8-fold reuse of A
Remotely Dependent Entries for [x,Ax,…,A⁸x], A tridiagonal, 2 processors
[Figure: the remotely dependent entries of x, Ax, …, A⁸x near the boundary between Proc 1 and Proc 2]
One message to get the data needed to compute remotely dependent entries, not k=8
Minimizes number of messages = latency cost
Price: redundant work, proportional to the "surface/volume ratio"
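A minimal sketch of this idea for the tridiagonal case, assuming the constant stencil A = tridiag(-1, 2, -1), an interior processor, and exactly k ghost values received from each neighbor in one message (all names here are illustrative, not the paper's code):

```python
import numpy as np

def akx_tridiag_local(x_owned, x_left_ghost, x_right_ghost, k):
    """Compute this processor's rows of [x, Ax, ..., A^k x] for
    A = tridiag(-1, 2, -1) after ONE message per neighbor carrying the k
    ghost entries of x on each side. Entries in the ghost region are
    recomputed redundantly; the computable range shrinks by one entry per
    side per power, so after k steps exactly the owned block remains valid."""
    assert len(x_left_ghost) == k and len(x_right_ghost) == k
    xs = np.concatenate([x_left_ghost, x_owned, x_right_ghost]).astype(float)
    n_loc = len(x_owned)
    V = [xs.copy()]
    for j in range(k):
        prev, cur = V[-1], np.zeros_like(xs)
        for i in range(j + 1, len(xs) - j - 1):   # still-valid interior
            cur[i] = -prev[i - 1] + 2.0 * prev[i] - prev[i + 1]
        V.append(cur)
    return [v[k:k + n_loc] for v in V]   # owned slice of each of the k+1 vectors
```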
Spyplot of A with the sparsity pattern of a 5-point stencil, natural order
Spyplot of A with the sparsity pattern of a 5-point stencil, natural order, assigned to p=9 processors
For an n x n mesh assigned to p processors, each processor needs 2n words of data from its neighbors
Spyplot of A with the sparsity pattern of a 5-point stencil, nested dissection order
For an n x n mesh assigned to p processors, each processor needs 4n/p^(1/2) words of data from its neighbors
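A quick back-of-the-envelope comparison of the two orderings (illustrative numbers, ignoring constants in the model):

```python
# Ghost data per processor for an n x n mesh on p processors
n, p = 1024, 64
ghost_natural = 2 * n              # natural (strip) order: two boundary rows of length n
ghost_nested = 4 * n / p ** 0.5    # nested dissection (block) order: four block edges
print(ghost_natural, ghost_nested) # 2048 vs 512.0
```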
Remotely Dependent Entries for [x,Ax,…,A³x], A with the sparsity pattern of a 5-point stencil
Remotely Dependent Entries for [x,Ax,A²x,A³x], A irregular, multiple processors
Reducing redundant work for a tridiagonal matrix
Summary of Parallel Optimizations so far, for the "Akx kernel": (A,x,k) → [Ax,A²x,…,Aᵏx]
• Possible to reduce #messages from O(k) to O(1)
• Depends on matrix being "well-partitioned"
  – Each processor only needs data from a few neighbors
• Price: redundant computation of some entries of Aʲx
  – Amount of redundant work depends on the "surface-to-volume ratio" of the partition: ideally each processor needs only a little data from neighbors, compared to what it owns itself (see the numeric check after this list)
• Best case: tridiagonal (1D mesh): need O(1) data from neighbors, versus O(n/p) data locally
  – #flops grows by factor 1 + O(k/(n/p)) = 1 + O(k/data_per_proc)
• Pretty good: 2D mesh (n x n): need O(n/p^(1/2)) data from neighbors, versus O(n²/p) locally
  – #flops grows by 1 + O(k/(data_per_proc)^(1/2))
• Less good: 3D mesh (n x n x n): need O(n²/p^(2/3)) data from neighbors, versus O(n³/p) locally
  – #flops grows by 1 + O(k/(data_per_proc)^(1/3))
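To make the flop-growth factors above concrete, a small numeric check (assumed values: k = 8 and 10⁶ owned entries per processor; constants hidden in the O(·) are ignored):

```python
k, data_per_proc = 8, 10**6
factor_1d = 1 + k / data_per_proc             # tridiagonal (1D mesh)
factor_2d = 1 + k / data_per_proc ** 0.5      # 2D mesh
factor_3d = 1 + k / data_per_proc ** (1 / 3)  # 3D mesh
print(factor_1d, factor_2d, factor_3d)        # ~1.000008, ~1.008, ~1.08
```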
Predicted speedup of [Ax,A²x,…,Aᵏx] on a model Petascale machine, for a 2D (n x n) mesh
[Plot: speedup as a function of k and log₂ n]
Predicted fraction of time spent doing arithmetic for [Ax,A²x,…,Aᵏx] on a model Petascale machine, for a 2D (n x n) mesh
[Plot: fraction of time in arithmetic as a function of k and log₂ n]
Predicted ratio of extra arithmetic for [Ax,A²x,…,Aᵏx] on a model Petascale machine, for a 2D (n x n) mesh
[Plot: extra-arithmetic ratio as a function of k and log₂ n]
Predicted speedup of [Ax,A²x,…,Aᵏx] on a model Petascale machine, for a 3D (n x n x n) mesh
[Plot: speedup as a function of k and log₂ n]
Predicted fraction of time spent doing arithmetic for [Ax,A²x,…,Aᵏx] on a model Petascale machine, for a 3D (n x n x n) mesh
[Plot: fraction of time in arithmetic as a function of k and log₂ n]
Predicted ratio of extra arithmetic for [Ax,A²x,…,Aᵏx] on a model Petascale machine, for a 3D (n x n x n) mesh
[Plot: extra-arithmetic ratio as a function of k and log₂ n]
Minimizing #messages versus #words_moved
• Parallel case
  – Can't reduce the #words needed from other processors
  – They are required to get the right answer
• Sequential case
  – Can reduce the #words needed from slow memory
  – Assume A and x are too large to fit in fast memory
  – Naïve implementation of [Ax,A²x,…,Aᵏx] by k calls to SpMV needs to move A and x between fast and slow memory k times
  – We will move them one time – clearly optimal
Sequential [x,Ax,…,A⁴x], with memory hierarchy
One read of the matrix from slow memory, not k=4
Minimizes number of words moved = bandwidth cost
No redundant work
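A sketch of this sequential strategy for a symmetric tridiagonal A (illustrative code, not the implementation behind the figure): the matrix is streamed through once in blocks, and a per-level "frontier" records how far each power has been computed; every entry whose dependencies have already been read is filled in before the next block is loaded, so in an out-of-core implementation each block of A would be read from slow memory exactly once.

```python
import numpy as np

def akx_streaming(diag, off, x, k, block):
    """Sequential communication-avoiding Akx sketch for a symmetric
    tridiagonal A with main diagonal `diag` (length n) and off-diagonal
    `off` (length n-1). Stream left to right in blocks of `block` entries;
    after each block is "read", advance per-level frontiers and compute
    every entry of A^j x whose dependencies are already available."""
    n = len(x)
    V = [np.asarray(x, dtype=float)] + [np.zeros(n) for _ in range(k)]
    frontier = [0] * (k + 1)              # next uncomputed index per level
    for end in range(block, n + block, block):
        frontier[0] = min(end, n)         # entries of x "read" so far
        for j in range(1, k + 1):
            # entry i of A^j x needs entries i-1, i, i+1 of A^(j-1) x
            limit = n if frontier[j - 1] == n else max(frontier[j - 1] - 1, 0)
            for i in range(frontier[j], limit):
                left = off[i - 1] * V[j - 1][i - 1] if i > 0 else 0.0
                right = off[i] * V[j - 1][i + 1] if i < n - 1 else 0.0
                V[j][i] = left + diag[i] * V[j - 1][i] + right
            frontier[j] = limit
    return V   # [x, Ax, ..., A^k x]
```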
Remotely Dependent Entries for [x,Ax,…,A³x], A 100x100 with bandwidth 2
Only ~25% of A and the vectors fit in fast memory
In what order should the sequential algorithm process a general sparse matrix?
• For a band matrix, we obviously want to process the matrix "from left to right", since the data we need for the next partition is already in fast memory
• Not obvious what the best order is for a general matrix
• We can formulate the question of finding the best order as a Traveling Salesman Problem (TSP) – see the sketch after this list
  – One vertex per matrix partition
  – Weight of edge (j, k) is the memory cost of processing partition k right after partition j
  – TSP: find the lowest-cost "tour" visiting all vertices
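A toy sketch of that formulation (the function name, the cost matrix, and the greedy heuristic are assumptions, not the lecture's method): given a pairwise reload-cost matrix for the partitions, any TSP heuristic yields a processing order; nearest-neighbor is shown for brevity.

```python
def greedy_partition_order(C, start=0):
    """C[j][k] = words that must be (re)loaded if partition k is processed
    right after partition j. Exact TSP is NP-hard, so use a simple
    nearest-neighbor heuristic to pick a processing order."""
    n = len(C)
    order, visited = [start], {start}
    while len(order) < n:
        j = order[-1]
        k = min((c for c in range(n) if c not in visited), key=lambda c: C[j][c])
        order.append(k)
        visited.add(k)
    return order

# Example: 4 partitions of a band-like matrix; adjacent partitions are cheap to follow
C = [[0, 1, 5, 9],
     [1, 0, 1, 5],
     [5, 1, 0, 1],
     [9, 5, 1, 0]]
print(greedy_partition_order(C))   # [0, 1, 2, 3] -- "left to right"
```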
What about multicore?
• Two kinds of communication to minimize
  – Between processors on the chip
  – Between on-chip cache and off-chip DRAM
• Use a hybrid of both techniques described so far
  – Use the parallel optimization so each core can work independently
  – Use the sequential optimization to minimize the off-chip DRAM traffic of each core
Speedups on Intel Clovertown (8 core)
Test matrices include stencils and practical matrices
See the SC09 paper on bebop.cs.berkeley.edu for details
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b,Ab,…,Aᵏb} minimizing ||Ax-b||₂
• Cost of k steps of standard GMRES vs. new GMRES

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)
    MGS(w, v(0), …, v(i-1))
    update v(i), H
  endfor
  solve LSQ problem with H
Sequential: #words_moved = O(k·nnz) from SpMV + O(k²·n) from MGS
Parallel: #messages = O(k) from SpMV + O(k²·log p) from MGS

Communication-Avoiding GMRES (CA-GMRES):
  W = [v, Av, A²v, …, Aᵏv]
  [Q,R] = TSQR(W)   … "Tall Skinny QR"
  Build H from R, solve LSQ problem
Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
Parallel: #messages = O(1) from computing W + O(log p) from TSQR
• Oops – W is generated like the power method, so precision is lost!
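A toy illustration of the CA-GMRES ingredients and of why the monomial basis is dangerous. This is a sketch under simplifying assumptions (dense test matrix, a flat rather than tree-shaped TSQR, and no Hessenberg recovery or least-squares solve); the function names are made up for illustration.

```python
import numpy as np

def tsqr_R(W, nblocks=4):
    """Toy TSQR: factor row blocks of a tall-skinny W independently, then
    factor the stacked R's. In parallel this needs only O(log p) messages
    via a tree reduction; here the reduction is flat for simplicity."""
    Rs = [np.linalg.qr(blk, mode="r") for blk in np.array_split(W, nblocks)]
    return np.linalg.qr(np.vstack(Rs), mode="r")

def monomial_basis(A, v, k):
    """W = [v, Av, ..., A^k v]: in CA-GMRES this is the Akx kernel computed
    with O(1) messages, but numerically it behaves like the power method."""
    W = [v]
    for _ in range(k):
        W.append(A @ W[-1])
    return np.column_stack(W)

# Toy usage (dense A only for illustration)
n, k = 200, 8
A = np.diag(np.linspace(1, 2, n)) + 0.01 * np.random.randn(n, n)
v = np.random.randn(n); v /= np.linalg.norm(v)
W = monomial_basis(A, v, k)
R = tsqr_R(W)
print(np.linalg.cond(R))   # typically large and growing with k: the "precision lost" issue
```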
How to make CA-GMRES Stable?
• Use a different polynomial basis for the Krylov subspace
  – Not the monomial basis W = [v, Av, A²v, …]; instead [v, p₁(A)v, p₂(A)v, …]
• Possible choices:
  – Newton basis W_N = [v, (A – θ₁I)v, (A – θ₂I)(A – θ₁I)v, …], where the shifts θᵢ are chosen as approximate eigenvalues from Arnoldi (using the same Krylov subspace, so "free") – a minimal sketch follows
  – Chebyshev basis W_C = [v, T₁(A)v, T₂(A)v, …], where T₁(z), T₂(z), … are chosen to be small in a region of the complex plane containing the large eigenvalues
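A minimal sketch of the Newton basis (the shifts are simply passed in; in CA-GMRES they would be Ritz values from a preliminary Arnoldi run):

```python
import numpy as np

def newton_basis(A, v, shifts):
    """W_N = [v, (A - θ1 I)v, (A - θ2 I)(A - θ1 I)v, ...]: multiply by one
    shifted factor per step instead of by A itself. In exact arithmetic this
    spans the same Krylov subspace as the monomial basis, but it is typically
    far better conditioned when the shifts approximate eigenvalues of A."""
    W = [np.asarray(v, dtype=float)]
    for theta in shifts:
        W.append(A @ W[-1] - theta * W[-1])   # apply (A - theta*I)
    return np.column_stack(W)
```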
"Monomial" basis [Ax,…,Aᵏx] fails to converge; the Newton polynomial basis does converge
Speedups on 8-core Clovertown
Summary of what is known (1/3), what is open
• GMRES
  – Can independently choose k to optimize speed and restart length r to optimize convergence
  – Need to "co-tune" the Akx kernel and TSQR
  – Know how to use more stable polynomial bases
  – Proven speedups
• Can similarly reorganize other Krylov methods
  – Arnoldi and Lanczos, for Ax = λx and for Ax = λMx
  – Conjugate Gradients, for Ax = b
  – Biconjugate Gradients, for Ax = b
  – BiCGStab?
  – Other Krylov methods?
Summary of what is known (2/3), what is open
• Preconditioning: MAx = Mb
  – Need a different communication-avoiding kernel: [x,Ax,MAx,AMAx,MAMAx,AMAMAx,…]
  – Can think of this as the union of two of the earlier kernels: [x,(MA)x,(MA)²x,…,(MA)ᵏx] and [x,Ax,(AM)Ax,(AM)²Ax,…,(AM)ᵏAx]
  – For which preconditioners M can we minimize communication?
    • Easy: diagonal M
    • How about block diagonal M? (See the sketch below.)
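A dense, purely illustrative sketch of the interleaved kernel (a communication-avoiding version would compute it with ghost data exactly like the Akx kernel, which is straightforward when M is diagonal or block diagonal):

```python
import numpy as np

def preconditioned_kernel(A, M, x, k):
    """[x, Ax, MAx, AMAx, MAMAx, ...]: alternately apply A and M,
    keeping all 2k+1 intermediate vectors."""
    V = [np.asarray(x, dtype=float)]
    for j in range(2 * k):
        op = A if j % 2 == 0 else M   # A, M, A, M, ...
        V.append(op @ V[-1])
    return V
```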
Examine [x,Ax,MAx,AMAx,MAMAx,…] for A tridiagonal, M block-diagonal
[Sequence of figure slides: the sparsity patterns of successive kernel vectors build up step by step]
Summary of what is known (3/3), what is open
• Preconditioning: MAx = Mb
  – Need a different communication-avoiding kernel: [x,Ax,MAx,AMAx,MAMAx,AMAMAx,…]
  – For block diagonal M, the matrix powers rapidly become dense, but the ranks of the off-diagonal blocks grow slowly
  – Can take advantage of the low rank to minimize communication:
    yᵢ = ((AM)ᵏ)ᵢⱼ · xⱼ = U·V · xⱼ = U·(V · xⱼ)
    Compute V·xⱼ on proc j, send it to proc i, then reconstruct yᵢ on proc i
  – Works (in principle) for Hierarchically Semi-Separable M
  – How does it work in practice? (A small sketch of the low-rank trick follows.)
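A small sketch of the low-rank trick (all names are illustrative; in practice the factorization would be precomputed once, not obtained from an SVD of an explicitly formed block as done here for simplicity):

```python
import numpy as np

def low_rank_apply(B, x_j, r):
    """If the off-diagonal block B = ((A*M)^k)[i, j] has (numerical) rank r,
    factor it as B ≈ U @ V. Then proc j sends only the short vector V @ x_j
    (r words) instead of all of x_j, and proc i reconstructs U @ (V @ x_j)."""
    U_svd, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = U_svd[:, :r] * s[:r]          # tall factor, kept on proc i
    V = Vt[:r, :]                     # wide factor, kept on proc j
    msg = V @ x_j                     # the r words "sent" from proc j to proc i
    return U @ msg                    # proc i reconstructs its contribution to y_i

# Quick check against the dense product: B[i, j] = i + j has rank 2
B = np.add.outer(np.arange(6.0), np.arange(8.0))
x_j = np.random.randn(8)
print(np.allclose(low_rank_apply(B, x_j, r=2), B @ x_j))   # True
```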
For details (and open problems) see M. Hoemmen's PhD thesis
Of things not said (much about) …
• Other Communication-Avoiding (CA) dense factorizations
  – QR with column pivoting (tournament pivoting)
  – Cholesky with diagonal pivoting
  – GE with "complete pivoting"
  – LDLᵀ? Maybe with complete pivoting?
• CA-Sparse factorizations
  – Cholesky, assuming good separators
  – Lower bounds from Lipton/Rose/Tarjan
  – Matrix multiplication, anything else?
• Strassen, and all linear algebra with #words_moved = O(n^ω / M^(ω/2 - 1))
  – Parallel?
• Extending CA-Krylov methods to the "bottom solver" in multigrid with Adaptive Mesh Refinement (AMR)
Summary
Time to redesign all dense and sparse linear algebra
Don't Communic…
EXTRA SLIDES