Communication-avoiding Krylov subspace methods in finite precision
Erin Carson, UC Berkeley
Advisor: James Demmel
ICME LA/Opt Seminar, Stanford University
December 4, 2014
Model problem: 2D Poisson on a 256² grid, κ(A) = 3.9E4, ‖A‖_2 = 1.99. Equilibration (diagonal scaling) used. RHS set s.t. elements of the true solution are x_i = 1/256.

• Roundoff error can cause a decrease of attainable accuracy for "communication-avoiding" (s-step) Krylov subspace methods. This can be worse than for classical Krylov subspace methods, which limits applicability in practice!
• We can combine the residual replacement strategy of van der Vorst and Ye (1999) with inexpensive communication-avoiding techniques, and extend this strategy to s-step variants.
• Accuracy can be improved for minimal performance cost: attainable accuracy and convergence rate closer to that of the classical method!
What is communication?
โข Algorithms have two costs: communication and computation
• Communication: moving data between levels of the memory hierarchy (sequential), or between processors (parallel)
โข On modern computers, communication is expensive, computation is cheap
โ Flop time << 1/bandwidth << latency
โ Communication bottleneck a barrier to achieving scalability
โ Also expensive in terms of energy cost!
โข Towards exascale: we must redesign algorithms to avoid communication
Communication bottleneck in KSMs
โข A Krylov Subspace Method is a projection process onto the Krylov subspace
K_m(A, r_0) = span{r_0, A r_0, A^2 r_0, …, A^{m−1} r_0}
โข Linear systems, eigenvalue problems, singular value problems, least squares, etc.
โข Best for: ๐ด large & very sparse, stored implicitly, or only approximation needed
Projection process in each iteration:
1. Add a dimension to K_m
โข Sparse Matrix-Vector Multiplication (SpMV)
โข Parallel: comm. vector entries w/ neighbors
โข Sequential: read ๐ด/vectors from slow memory
2. Orthogonalize (with respect to some L_m)
โข Inner products
โข Parallel: global reduction
โข Sequential: multiple reads/writes to slow memory
Dependencies between the communication-bound kernels (SpMV, orthogonalize) in each iteration limit performance!
Example: Classical conjugate gradient (CG)
Given: initial approximation x_0 for solving Ax = b
Let r_0 = p_0 = b − A x_0
for m = 1, 2, …, until convergence do
    α_{m−1} = (r_{m−1}^T r_{m−1}) / (p_{m−1}^T A p_{m−1})
    x_m = x_{m−1} + α_{m−1} p_{m−1}
    r_m = r_{m−1} − α_{m−1} A p_{m−1}
    β_m = (r_m^T r_m) / (r_{m−1}^T r_{m−1})
    p_m = r_m + β_m p_{m−1}
end for
SpMVs and inner products require communication in each iteration!
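As a concrete illustration (not from the talk), the recurrences above map directly onto a few lines of NumPy; a dense toy problem stands in for the large sparse systems discussed here:

```python
import numpy as np

def cg(A, b, x0, max_iter, tol=1e-12):
    """Classical CG, following the slide's recurrences.

    Each iteration does one SpMV (A @ p) and two inner products --
    exactly the communication-bound kernels discussed above.
    """
    x = x0.copy()
    r = b - A @ x0            # r_0 = p_0 = b - A x_0
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        Ap = A @ p                        # SpMV
        alpha = rr / (p @ Ap)             # inner product
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r                    # inner product
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
    return x

# Small SPD test problem: 1D Poisson (tridiagonal)
n = 32
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(A, b, np.zeros(n), max_iter=n)
```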
Communication-avoiding KSMs (CA-KSMs)
• Krylov methods can be reorganized to reduce communication cost by a factor of O(s)
• Many CA-KSMs (s-step KSMs) in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, BICGSTAB, QMR
Related references: (Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel, 1991), (Erhel, 1995), (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kincaid, 2001), (Hoemmen, 2010), (Philippe and Reichel, 2012), (C., Knight, Demmel, 2013), (Feuerriegel and Bücker, 2013).
• Reduction in communication can translate to speedups on practical problems
• Recent results: 4.2x speedup with s = 4 for CA-BICGSTAB in GMG bottom-solve; 2.5x in overall GMG solve (24,576 cores, n = 1024³ on Cray XE6) (Williams, et al., 2014).
Benchmark timing breakdown
• Plot: net time spent on different operations over one MG solve using 24,576 cores
• Hopper at NERSC (Cray XE6), four 6-core Opteron chips per node, Gemini network, 3D torus
• 3D Helmholtz equation
• CA-BICGSTAB with s = 4
CA-CG overview
Starting at iteration sk for k ≥ 0, s > 0, it can be shown that to compute the next s steps (iterations sk + j, where j ≤ s),
    p_{sk+j}, r_{sk+j}, x_{sk+j} − x_{sk} ∈ K_{s+1}(A, p_{sk}) + K_s(A, r_{sk}).
1. Compute basis matrix
    Y_k = [ρ_0(A) p_{sk}, …, ρ_s(A) p_{sk}, ρ_0(A) r_{sk}, …, ρ_{s−1}(A) r_{sk}]
of dimension n-by-(2s+1), where ρ_j is a polynomial of degree j. This gives the recurrence relation
    A Ŷ_k = Y_k B_k,
where B_k is (2s+1)-by-(2s+1) and is 2x2 block diagonal with upper Hessenberg blocks, and Ŷ_k is Y_k with columns s+1 and 2s+1 set to 0.
• Communication cost: same latency cost as one SpMV using the matrix powers kernel (e.g., Hoemmen et al., 2007), assuming sufficient sparsity structure (low diameter)
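For the simplest choice, the monomial basis ρ_j(z) = z^j, both Y_k and B_k can be written down explicitly. A small NumPy sketch (toy sizes, assumed monomial basis) that verifies the recurrence A Ŷ_k = Y_k B_k; the polynomial structure, not the communication pattern, is the point here:

```python
import numpy as np

def monomial_basis(A, p, r, s):
    """Y_k = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r], an n x (2s+1) matrix."""
    P = [p]
    for _ in range(s):
        P.append(A @ P[-1])
    R = [r]
    for _ in range(s - 1):
        R.append(A @ R[-1])
    return np.column_stack(P + R)

def basis_recurrence_matrix(s):
    """B_k for the monomial basis: two shift blocks (upper Hessenberg)."""
    B = np.zeros((2 * s + 1, 2 * s + 1))
    for j in range(s):                     # A * (A^j p) = A^{j+1} p
        B[j + 1, j] = 1.0
    for j in range(s - 1):                 # A * (A^j r) = A^{j+1} r
        B[s + 2 + j, s + 1 + j] = 1.0
    return B

n, s = 20, 4
rng = np.random.default_rng(0)
M = rng.standard_normal((n, n))
A = M + M.T                                # any symmetric matrix works here
p, r = rng.standard_normal(n), rng.standard_normal(n)
Y = monomial_basis(A, p, r, s)
Yhat = Y.copy()
Yhat[:, [s, 2 * s]] = 0.0                  # zero columns s+1 and 2s+1 (1-based)
B = basis_recurrence_matrix(s)
assert np.allclose(A @ Yhat, Y @ B)
```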
The matrix powers kernel
Avoids communication:
• In serial, by exploiting temporal locality:
  • Reading A, reading V = [x, Ax, A²x, …, A^s x]
• In parallel, by doing only 1 "expand" phase (instead of s).
• Requires sufficiently low "surface-to-volume" ratio
Tridiagonal example (also works for general graphs!): black = local elements, red = 1-level dependencies, green = 2-level dependencies, blue = 3-level dependencies.
(Figures: sequential and parallel computation of v, Av, A²v, A³v.)
CA-CG overview
2. Orthogonalize: encode dot products between basis vectors by computing the Gram matrix G_k = Y_k^T Y_k of dimension (2s+1)-by-(2s+1) (or compute Tall-Skinny QR)
• Communication cost: one global reduction
3. Perform s iterations of updates
• Using Y_k and G_k, this requires no communication
• Represent n-vectors by their length-(2s+1) coordinates in Y_k:
    x_{sk+j} − x_{sk} = Y_k x'_{k,j},   r_{sk+j} = Y_k r'_{k,j},   p_{sk+j} = Y_k p'_{k,j}
• Perform s iterations of updates on the coordinate vectors
4. Basis change to recover CG vectors
• Compute [x_{sk+j} − x_{sk}, r_{sk+j}, p_{sk+j}] = Y_k [x'_{k,j}, r'_{k,j}, p'_{k,j}] locally, no communication
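The point of step 2 is that once G_k is formed (one global reduction), every inner product of vectors represented in the basis collapses to a small (2s+1)-dimensional computation. A sketch (random stand-in for a computed basis) of the identity ‖Y r'‖² = r'^T G r':

```python
import numpy as np

# Once G = Y^T Y is known (the one reduction), any inner product of
# vectors represented in the basis, e.g. r = Y r', reduces to a small
# (2s+1)-dimensional product -- no communication in the inner loop.

rng = np.random.default_rng(1)
n, m = 1000, 9                      # m = 2s+1 with s = 4
Y = rng.standard_normal((n, m))     # stand-in for a computed s-step basis
G = Y.T @ Y                         # Gram matrix

r_coord = rng.standard_normal(m)    # coordinates r' of r = Y r'
r_vec = Y @ r_coord
assert np.isclose(r_vec @ r_vec, r_coord @ (G @ r_coord))
```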
CG vs. CA-CG: no communication in the inner loop!

CG:
for j = 1, …, s do
    α_{sk+j−1} = (r_{sk+j−1}^T r_{sk+j−1}) / (p_{sk+j−1}^T A p_{sk+j−1})
    x_{sk+j} = x_{sk+j−1} + α_{sk+j−1} p_{sk+j−1}
    r_{sk+j} = r_{sk+j−1} − α_{sk+j−1} A p_{sk+j−1}
    β_{sk+j} = (r_{sk+j}^T r_{sk+j}) / (r_{sk+j−1}^T r_{sk+j−1})
    p_{sk+j} = r_{sk+j} + β_{sk+j} p_{sk+j−1}
end for

CA-CG:
Compute Y_k such that A Ŷ_k = Y_k B_k
Compute G_k = Y_k^T Y_k
for j = 1, …, s do
    α_{sk+j−1} = (r'^T_{k,j−1} G_k r'_{k,j−1}) / (p'^T_{k,j−1} G_k B_k p'_{k,j−1})
    x'_{k,j} = x'_{k,j−1} + α_{sk+j−1} p'_{k,j−1}
    r'_{k,j} = r'_{k,j−1} − α_{sk+j−1} B_k p'_{k,j−1}
    β_{sk+j} = (r'^T_{k,j} G_k r'_{k,j}) / (r'^T_{k,j−1} G_k r'_{k,j−1})
    p'_{k,j} = r'_{k,j} + β_{sk+j} p'_{k,j−1}
end for
[x_{sk+s} − x_{sk}, r_{sk+s}, p_{sk+s}] = Y_k [x'_{k,s}, r'_{k,s}, p'_{k,s}]
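A runnable sketch of the full loop structure above (monomial basis, toy dense SPD matrix, small s; these choices are illustrative assumptions). In this well-conditioned regime CA-CG should agree with classical CG to rounding error:

```python
import numpy as np

def cg_steps(A, b, x, r, p, steps):
    """Classical CG recurrences (reference)."""
    for _ in range(steps):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

def ca_cg(A, b, x, s, outer):
    """CA-CG sketch with a monomial basis (only stable for small s)."""
    r = b - A @ x
    p = r.copy()
    for _ in range(outer):
        # 1. s-step basis Y = [p, Ap, ..., A^s p, r, Ar, ..., A^{s-1} r]
        cols = [p]
        for _ in range(s):
            cols.append(A @ cols[-1])
        cols.append(r)
        for _ in range(s - 1):
            cols.append(A @ cols[-1])
        Y = np.column_stack(cols)
        B = np.zeros((2 * s + 1, 2 * s + 1))
        for j in range(s):
            B[j + 1, j] = 1.0
        for j in range(s - 1):
            B[s + 2 + j, s + 1 + j] = 1.0
        # 2. Gram matrix (the one global reduction)
        G = Y.T @ Y
        # 3. s iterations on short coordinate vectors (no communication)
        xc = np.zeros(2 * s + 1)
        pc = np.zeros(2 * s + 1); pc[0] = 1.0        # p is the first basis column
        rc = np.zeros(2 * s + 1); rc[s + 1] = 1.0    # r is column s+2 (1-based)
        for _ in range(s):
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        # 4. basis change back to CG vectors (local)
        x = Y @ xc + x
        r = Y @ rc
        p = Y @ pc
    return x

n = 50
A = np.diag(np.linspace(1.0, 4.0, n)) + 0.05 * (np.eye(n, k=1) + np.eye(n, k=-1))
b = np.ones(n)
s, outer = 2, 3
x_ca = ca_cg(A, b, np.zeros(n), s, outer)
r0 = b.copy()
x_cg = cg_steps(A, b, np.zeros(n), r0, r0.copy(), s * outer)
assert np.allclose(x_ca, x_cg, rtol=1e-4, atol=1e-6)
```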
Krylov methods in finite precision
• Finite precision errors affect algorithm behavior (known to Lanczos (1950); see, e.g., Meurant and Strakoš (2006) for an in-depth survey)
• Lanczos:
  • Basis vectors lose orthogonality
  • Appearance of multiple Ritz approximations to some eigenvalues
  • Delays convergence of Ritz values to other eigenvalues
• CG:
  • Residual vectors (proportional to Lanczos basis vectors) lose orthogonality
  • Appearance of duplicate Ritz values delays convergence of the CG approximate solution
  • Residual vectors r_m and true residuals b − A x_m deviate
  • Limits the maximum attainable accuracy of the approximate solution
CA-Krylov methods in finite precision
• CA-KSMs are mathematically equivalent to classical KSMs
• But convergence delay and loss of accuracy get worse with increasing s!
• Obstacle to solving practical problems:
  • Decrease in attainable accuracy → some problems that the classical KSM can solve can't be solved with the CA variant
  • Delay of convergence → if the number of iterations increases more than the time per iteration decreases due to CA techniques, no speedup expected!
This raises the questions…
For CA-KSMs, for solving linear systems:
• How bad can the effects of roundoff error be?
  • Bound on maximum attainable accuracy for CA-CG and CA-BICG
• And what can we do about it?
  • Residual replacement strategy: uses the bound to improve accuracy to O(ε)‖A‖‖x‖ in CA-CG and CA-BICG
This raises the questions…
For CA-KSMs, for eigenvalue problems:
• How bad can the effects of roundoff error be?
  • Extension of Chris Paige's analysis of Lanczos to CA variants:
    • Bounds on local rounding errors in CA-Lanczos
    • Bounds on accuracy and convergence of Ritz values
    • Loss of orthogonality → convergence of Ritz values
• And what can we do about it?
  • Some ideas on how to use this new analysis…
    • Basis orthogonalization
    • Dynamically select/change basis size (s parameter)
    • Mixed precision variants
Maximum attainable accuracy of CG
• In classical CG, iterates are updated by
    x_m = x_{m−1} + α_{m−1} p_{m−1}   and   r_m = r_{m−1} − α_{m−1} A p_{m−1}
• Formulas for x_m and r_m do not depend on each other; rounding errors cause the true residual, b − A x_m, and the updated residual, r_m, to deviate
• The size of the true residual is bounded by
    ‖b − A x_m‖ ≤ ‖r_m‖ + ‖b − A x_m − r_m‖
• When ‖r_m‖ ≫ ‖b − A x_m − r_m‖, ‖r_m‖ and ‖b − A x_m‖ have similar magnitude
• When ‖r_m‖ → 0, ‖b − A x_m‖ depends on ‖b − A x_m − r_m‖
• Many results on attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Björck, Elfving and Strakoš (1998), and Gutknecht and Strakoš (2000).
(Plots: comparison of convergence of true and updated residuals for CG vs. CA-CG using a monomial basis, for various s values. Model problem: 2D Poisson on a 256² grid.)
• Better conditioned polynomial bases can be used instead of monomial.
• Two common choices: Newton and Chebyshev; see, e.g., (Philippe and Reichel, 2012).
• Behavior closer to that of classical CG
• But can still see some loss of attainable accuracy compared to CG
• Clearer for higher s values
Residual replacement strategy for CG
• van der Vorst and Ye (1999): improve accuracy by replacing the updated residual r_m by the true residual b − A x_m in certain iterations, combined with a group update.
• Choose when to replace r_m with b − A x_m to meet two constraints:
  1. Replace often enough so that at termination, ‖b − A x_m − r_m‖ is small relative to εN‖A‖‖x_m‖
  2. Don't replace so often that the original convergence mechanism of the updated residuals is destroyed (avoid large perturbations to the finite precision CG recurrence)
Residual replacement condition for CG
Tong and Ye (2000): in finite precision classical CG, in iteration m, the computed residual and search direction vectors satisfy
    r_m = r_{m−1} − α_{m−1} A p_{m−1} + η_m,
    p_m = r_m + β_m p_{m−1} + τ_m.
Then, in matrix form,
    A Z_m = Z_m T_m − (1/α'_m) (r_{m+1}/‖r_m‖) e_m^T + F_m
with Z_m = [r_0/‖r_0‖, …, r_m/‖r_m‖], where T_m is invertible and tridiagonal, α'_m = e_m^T T_m^{−1} e_1, and F_m = [f_0, …, f_m], with
    f_m = (A τ_m)/‖r_m‖ + (1/α_m)(η_{m+1}/‖r_m‖) − (β_m/α_{m−1})(η_m/‖r_m‖)
Residual replacement condition for CG
Tong and Ye (2000): if the sequence r_m satisfies
    A Z_m = Z_m T_m − (1/α'_m) (r_{m+1}/‖r_m‖) e_m^T + F_m,
then
    ‖r_{m+1}‖ ≤ (1 + 𝒦_m) min_{ρ ∈ 𝒫_m, ρ(0)=1} ‖ρ(A + ΔA_m) r_1‖,
where 𝒦_m depends on ‖A Z_m − F_m‖, ‖T_m^{−1}‖, and ‖r_{m+1}‖, and ΔA_m = −F_m Z_m⁺.
• As long as r_m satisfies the recurrence, the bound on its norm holds regardless of how r_m is generated
• Can replace r_m by fl(b − A x_m) and still expect convergence whenever
    η_m = fl(b − A x_m) − (r_{m−1} − α_{m−1} A p_{m−1})
is not too large relative to r_m and r_{m−1} (van der Vorst & Ye, 1999).
Residual replacement strategy for CG
• (van der Vorst and Ye, 1999): use a computable bound for ‖b − A x_m − r_m‖
Pseudo-code for residual replacement with group update for CG:
    if d_{m−1} ≤ ε‖r_{m−1}‖ and d_m > ε‖r_m‖ and d_m > 1.1 d_init
        z = z + x_m                       (group update of approximate solution)
        x_m = 0
        r_m = b − Az                      (set residual to true residual)
        d_init = d_m = ε(N‖A‖‖z‖ + ‖r_m‖)
    end
• Assuming a bound on the condition number of A, if the updated residual converges to O(ε)‖A‖‖x‖, the true residual is reduced to O(ε)‖A‖‖x‖
Sources of roundoff error in CA-CG
Computing the s-step Krylov basis (error in computing the s-step basis):
    A Ŷ_k = Y_k B_k + ΔY_k
Updating coordinate vectors in the inner loop (error in updating coefficient vectors):
    x'_{k,j} = x'_{k,j−1} + q'_{k,j−1} + ξ_{k,j}
    r'_{k,j} = r'_{k,j−1} − B_k q'_{k,j−1} + η_{k,j}
with q'_{k,j−1} = fl(α_{sk+j−1} p'_{k,j−1})
Recovering CG vectors for use in the next outer loop (error in basis change):
    x_{sk+j} = Y_k x'_{k,j} + x_{sk} + φ_{sk+j}
    r_{sk+j} = Y_k r'_{k,j} + ψ_{sk+j}
Maximum attainable accuracy of CA-CG
• We can write the deviation of the true and updated residuals in terms of these errors:
    b − A x_{sk+j} − r_{sk+j} = b − A x_0 − r_0
        − Σ_{ℓ=0}^{k−1} [ A φ_{sℓ+s} + ψ_{sℓ+s} + Σ_{i=1}^{s} ( A Y_ℓ ξ_{ℓ,i} + Y_ℓ η_{ℓ,i} − ΔY_ℓ q'_{ℓ,i−1} ) ]
        − A φ_{sk+j} − ψ_{sk+j} − Σ_{i=1}^{j} ( A Y_k ξ_{k,i} + Y_k η_{k,i} − ΔY_k q'_{k,i−1} )
• Using standard rounding error results, this allows us to obtain an upper bound on ‖b − A x_{sk+j} − r_{sk+j}‖.
A computable bound
• We extend van der Vorst and Ye's residual replacement strategy to CA-CG
• Making use of the bound for ‖b − A x_{sk+j} − r_{sk+j}‖ in CA-CG, update the error estimate d_{sk+j} by:
    d_{sk+j} ≡ d_{sk+j−1}
        + ε ( (4+N') ‖A‖ ‖ |Y_k| |x'_{k,j}| ‖ + ‖A‖ ‖x_{sk+j}‖ + (2+2N') ‖ |Y_k| |B_k| |x'_{k,j}| ‖ + ‖ |Y_k| |r'_{k,j}| ‖ )
        + ε ( ‖ |Y_k| |x'_{k,j}| ‖ + N' ‖ |Y_k| |r'_{k,j}| ‖ ) if j = s, and 0 otherwise,
where N' = max(N, 2s+1).
• All lower order terms; ‖A‖ estimated only once
• Extra computation: O(s²) flops per s iterations with at most 1 extra reduction per s iterations, plus O(n + s²) flops per s iterations requiring no communication
• Communication increased by a factor of at most 2
Residual replacement for CA-CG
• Use the same replacement condition as van der Vorst and Ye (1999):
    d_{sk+j−1} ≤ ε‖r_{sk+j−1}‖ and d_{sk+j} > ε‖r_{sk+j}‖ and d_{sk+j} > 1.1 d_init
• (It is possible we could develop a better replacement condition based on the perturbation in the finite precision CA-CG recurrence.)
Pseudo-code for residual replacement with group update for CA-CG:
    if d_{sk+j−1} ≤ ε‖r_{sk+j−1}‖ and d_{sk+j} > ε‖r_{sk+j}‖ and d_{sk+j} > 1.1 d_init
        z = z + Y_k x'_{k,j} + x_{sk}
        x_{sk+j} = 0
        r_{sk+j} = b − Az
        d_init = d_{sk+j} = ε((1 + 2N')‖A‖‖z‖ + ‖r_{sk+j}‖)
        p_{sk+j} = Y_k p'_{k,j}
        break from inner loop and begin new outer loop
    end
Model problem: residual replacement
(Table: residual replacement iteration indices and total number of reductions for classical CG and for CA-CG with monomial, Newton, and Chebyshev bases, s = 4, 8, 12.)
• # replacements small compared to total iterations → RR doesn't significantly affect communication savings!
• In addition to attainable accuracy, convergence rate is incredibly important in practical implementations
• Convergence rate depends on basis
• Can have both speed and accuracy!
consph8, FEM/Spheres (from UFSMC)
n = 8.3⋅10⁴, nnz = 6.0⋅10⁶, κ(A) = 9.7⋅10³, ‖A‖ = 9.7
(Plots: true and updated residuals before and after residual replacement.)
# Replacements, s = 8: Classical 2, Monomial 12, Newton 2, Chebyshev 2
# Replacements, s = 12: Classical 2, Monomial 0, Newton 4, Chebyshev 3
xenon1, materials (from UFSMC)
n = 4.9⋅10⁴, nnz = 1.2⋅10⁶, κ(A) = 3.3⋅10⁴, ‖A‖ = 3.2
(Plots: true and updated residuals before and after residual replacement.)
# Replacements, s = 4: Classical 2, Monomial 1, Newton 1, Chebyshev 1
# Replacements, s = 8: Classical 2, Monomial 5, Newton 2, Chebyshev 1
Preconditioning for CA-KSMs
• Tradeoff: preconditioning speeds up convergence, but increases time per iteration due to communication!
• For each specific application, must evaluate the tradeoff between preconditioner quality and sparsity of the system
• Good news: many preconditioners allow a communication-avoiding approach
  • Block Jacobi: block diagonals
  • Sparse approximate inverse (SAI): same sparsity as A; recent work for CA-BICGSTAB by Mehri (2014)
  • Polynomial preconditioning (Saad, 1985)
  • HSS preconditioning for banded matrices (Hoemmen, 2010), (Knight, C., Demmel, 2014)
  • CA-ILU(0): recent work by Moufawad and Grigori (2013)
  • Deflation for CA-CG (C., Knight, Demmel, 2014), based on Deflated CG of (Saad et al., 2000); for CA-GMRES (Yamazaki et al., 2014)
Deflated CA-CG, model problem
(Plots: monomial basis, s = 4, 8, 10.)
Matrix: 2D Laplacian(512), n = 262,144. Right-hand side set such that the true solution has entries x_i = 1/√n. Deflated CG algorithm (DCG) from (Saad et al., 2000).
Eigenvalue problems: CA-Lanczos
Problem: 2D Poisson, n = 256, ‖A‖ = 7.93, with a random starting vector
As s increases, both the convergence rate and the accuracy to which we can find approximate eigenvalues within m iterations decrease.
Eigenvalue problems: CA-Lanczos
Problem: 2D Poisson, n = 256, ‖A‖ = 7.93, with a random starting vector
Improved by better-conditioned bases, but we still see decreased convergence rate and accuracy as s grows.
Paige's Lanczos convergence analysis
Finite precision Lanczos recurrence:
    A V_m = V_m T_m + β_{m+1} v_{m+1} e_m^T + δV_m
where V_m = [v_1, …, v_m], δV_m = [δv_1, …, δv_m], and T_m is symmetric tridiagonal with diagonal entries α_1, …, α_m and off-diagonal entries β_2, …, β_m.
Classic Lanczos rounding error result of Paige (1976): for i ∈ {1, …, m},
    ‖δv_i‖_2 ≤ ε_1 σ
    |β_{i+1} v_i^T v_{i+1}| ≤ 2ε_0 σ
    |v_{i+1}^T v_{i+1} − 1| ≤ ε_0 / 2
    |β²_{i+1} + α²_i + β²_i − ‖A v_i‖²_2| ≤ 4i(3ε_0 + ε_1)σ²
where σ ≡ ‖A‖_2, θσ ≡ ‖|A|‖_2, ε_0 ≡ 2ε(n+4), and ε_1 ≡ 2ε(Nθ+7); that is, ε_0 = O(εn) and ε_1 = O(εNθ).
→ These results form the basis for Paige's influential results in (Paige, 1980).
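These local bounds are easy to check empirically. A sketch (not from the talk) that runs classical Lanczos without reorthogonalization on a diagonal stand-in matrix with ‖A‖_2 = 7.93 (matching the experiments above) and asserts Paige-style bounds:

```python
import numpy as np

# Check the local bounds |beta_{i+1} v_i^T v_{i+1}| <= 2*eps0*sigma and
# |v_{i+1}^T v_{i+1} - 1| <= eps0/2, with eps0 = 2*eps*(n+4), sigma = ||A||_2.
# The diagonal matrix and iteration count here are illustrative assumptions.

rng = np.random.default_rng(3)
n, iters = 200, 60
A = np.diag(np.linspace(0.0, 7.93, n))     # stand-in spectrum, ||A||_2 = 7.93
eps = np.finfo(float).eps
sigma = 7.93
eps0 = 2 * eps * (n + 4)

v_prev = np.zeros(n)
beta = 0.0
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
for _ in range(iters):
    w = A @ v - beta * v_prev              # three-term Lanczos recurrence
    alpha = v @ w
    w = w - alpha * v
    beta_next = np.linalg.norm(w)
    v_next = w / beta_next
    # local rounding errors stay well within the bounds
    assert abs(beta_next * (v @ v_next)) <= 2 * eps0 * sigma
    assert abs(v_next @ v_next - 1) <= eps0 / 2
    v_prev, v, beta = v, v_next, beta_next
```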
CA-Lanczos convergence analysis
Let Γ ≡ max_{ℓ≤k} ‖Y_ℓ⁺‖_2 ⋅ ‖|Y_ℓ|‖_2 ≤ (2s+1) ⋅ max_{ℓ≤k} κ(Y_ℓ).
For CA-Lanczos, the same four bounds hold for i ∈ {1, …, m = sk+j}, but now with
    ε_0 ≡ 2ε(n + 11s + 15)Γ² = O(εnΓ²)   (vs. O(εn) for Lanczos)
    ε_1 ≡ 2ε((N + 2s + 5)θ + (4s + 9)τ + 10s + 16)Γ = O(εNθΓ)   (vs. O(εNθ) for Lanczos)
where σ ≡ ‖A‖_2, θσ ≡ ‖|A|‖_2, and τσ ≡ max_{ℓ≤k} ‖|B_ℓ|‖_2.
The amplification term
• Our definition of the amplification term before was
    Γ ≡ max_{ℓ≤k} ‖Y_ℓ⁺‖_2 ⋅ ‖|Y_ℓ|‖_2 ≤ (2s+1) ⋅ max_{ℓ≤k} κ(Y_ℓ),
where we want ‖ |Y| |y'| ‖_2 ≤ Γ‖Y y'‖_2 to hold for the computed basis Y and any coordinate vector y' in every iteration.
• A better, more descriptive estimate for Γ, updated as the iteration proceeds, is possible with tighter bounds; requires some light bookkeeping
• Example: for the bounds on |β_{i+1} v_i^T v_{i+1}| and |v_{i+1}^T v_{i+1} − 1|, we can use the definition
    Γ_{k,j} = max_{x ∈ {w'_{k,j}, u'_{k,j}, v'_{k,j}, v'_{k,j−1}}} ‖ |Y_k| |x| ‖_2 / ‖Y_k x‖_2
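A quick sketch (illustrative diagonal matrix, an assumption) of how Γ behaves for monomial bases of increasing size; the growth of Γ with s, which inflates the ε_0 and ε_1 bounds above, is the point:

```python
import numpy as np

def gamma(Y):
    """Amplification term Gamma = ||Y^+||_2 * || |Y| ||_2 for one basis."""
    return np.linalg.norm(np.linalg.pinv(Y), 2) * np.linalg.norm(np.abs(Y), 2)

def monomial_basis(A, v, s):
    cols = [v]
    for _ in range(s):
        cols.append(A @ cols[-1])
    return np.column_stack(cols)

rng = np.random.default_rng(4)
n = 200
A = np.diag(np.linspace(0.01, 2.0, n))     # stand-in SPD spectrum
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
g4 = gamma(monomial_basis(A, v, 4))
g12 = gamma(monomial_basis(A, v, 12))
assert 1.0 <= g4 < g12                     # conditioning worsens with s
```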
Local Rounding Errors, Classical Lanczos
(Plots: measured values of |β_{i+1} v_i^T v_{i+1}| and |v_{i+1}^T v_{i+1} − 1| vs. the upper bound of Paige (1976).)
Problem: 2D Poisson, n = 256, ‖A‖ = 7.93, with a random starting vector
Local Rounding Errors, CA-Lanczos, monomial basis, s = 8
(Plots: measured values of |β_{i+1} v_i^T v_{i+1}| and |v_{i+1}^T v_{i+1} − 1| vs. the upper bound, and the Γ value.)
Problem: 2D Poisson, n = 256, ‖A‖ = 7.93, with a random starting vector
Local Rounding Errors, CA-Lanczos, Newton basis, s = 8
(Plots: measured values of |β_{i+1} v_i^T v_{i+1}| and |v_{i+1}^T v_{i+1} − 1| vs. the upper bound, and the Γ value.)
Problem: 2D Poisson, n = 256, ‖A‖ = 7.93, with a random starting vector
Local Rounding Errors, CA-Lanczos, Chebyshev basis, s = 8
(Plots: measured values of |β_{i+1} v_i^T v_{i+1}| and |v_{i+1}^T v_{i+1} − 1| vs. the upper bound, and the Γ value.)
Problem: 2D Poisson, n = 256, ‖A‖ = 7.93, with a random starting vector
Paige's results for classical Lanczos
• Using the bounds on local rounding errors in Lanczos, Paige showed that:
  1. The computed Ritz values always lie between the extreme eigenvalues of A to within a small multiple of machine precision.
  2. At least one small interval containing an eigenvalue of A is found by the nth iteration.
  3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.
  4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.
Do the same statements hold for CA-Lanczos?
Results for CA-Lanczos
• The answer is YES! …but only if:
  • Y_ℓ is numerically full rank for 0 ≤ ℓ ≤ k, and
  • ε_0 ≡ 2ε(n + 11s + 15)Γ² ≤ 1/12, i.e.,
      Γ² ≤ (24ε(n + 11s + 15))⁻¹
• Otherwise, e.g., we can lose orthogonality due to computation with a rank-deficient basis
• How can we use this bound on Γ to design a better algorithm?
Ideas based on CA-Lanczos analysis
• Explicit basis orthogonalization
  • Compute a TSQR of Y_k in the outer loop, use the Q factor as the basis
  • Get Γ = 1 for only one extra reduction
• Dynamically changing basis size
  • Run an incremental condition estimate when computing the s-step basis; stop when Γ² ≤ (24ε(n + 11s' + 15))⁻¹ is reached. Use this basis of size s' ≤ s in this outer loop.
• Mixed precision
  • For inner products, use precision ε⁽⁰⁾ such that ε⁽⁰⁾ ≤ (24Γ²(n + 11s + 15))⁻¹
  • For SpMVs, use precision ε⁽¹⁾ such that ε⁽¹⁾ ≤ (4Γ((N + 2s + 5)θ + (4s + 9)τ + 10s + 16))⁻¹
Summary
• For high performance iterative methods, both the time per iteration and the number of iterations required are important.
• CA-KSMs offer asymptotic reductions in time per iteration, but can have the effect of increasing the number of iterations required (or causing instability).
• For CA-KSMs to be practical, we must better understand their finite precision behavior (theoretically and empirically) and develop ways to improve the methods.
• Ongoing efforts in: finite precision convergence analysis for CA-KSMs, selective use of extended precision, development of CA preconditioners, and other new techniques.
Thank you!
contact: [email protected]
http://www.cs.berkeley.edu/~ecc2z/