A Parallel Version of GPBiCR Method Suitable for

Transcript A Parallel Version of GPBiCR Method Suitable for

What is the most important
kernel of sparse linear
solvers for heterogeneous
supercomputers?
Shengxin Zhu
The University of Oxford
Prof. Xingping Liu and Prof. Tongxiang Gu
National Key Laboratory of Computational Physics
Institute of Applied Physics and Computational Mathematics
2015/7/21
SNSCC'12, [email protected]
1
Outlines






Brief introduction on Heterogeneous supper-computers
Computation kernels of Krylov methods
Influence of communications
Case study: GPBiCG(m,l)
Challenging problems
Conclusion
2015/7/21
SNSCC'12, [email protected]
2
Introduction to heterogeneous
supper-computers

Dawning5000A



2015/7/21
Nodes:
Bandwidth:
Memory:
Dawning 5000
Ranking history
11/2008
06/2009
11th
15th
11/2009
06/2010
11/2010
06/2011
19th
24th
35th
40th
11/2011
58th
SNSCC'12, [email protected]
2011/ Nov : top500
1st K (JP)
2st NUDT (CN)
3rd Cray (US)
4th Dawning (CN)
3
3
Computational kernels of
Krylov methods

Vector update:


Mat-vec：


Computation intensive; multi-core technology
CUDA/OpenMP
Inner product:

2015/7/21
parallel in nature
Communication intensive (CPU/MPI).
SNSCC'12, [email protected]
4
Influence of communication
first glance
Computation
cheap
Communication
expensive
S Zhu, MSc Thesis, CAEP, 2010
2015/7/21
Based on Aztec by Prof.
Tuminaro et al @ Sandia
SNSCC'12, [email protected]
5
Real reason for time-consuming
communications
vector update tvec  2Nt fl / P
mat_vec tm _ v   2nz -1 Nt fl / P  
Small workshops: focus
less preparing time
Conference: diversity
more preparing time
k dots tinn k   2kNt fl / P  2  ts  ktw  
ts
2015/7/21
tw
  log 2 P
bandwidth
: tw
Latency
: ts
SNSCC'12, [email protected]
6
Strategies for minimizing
communications



Replacing dot by others (semi-Chebyshev ) :
workshop only no conference if possible.
Inner product free , Gu, Liu, Mo(2002)
Reorganizing algorithm such that: (reduce
number of conference and each conference
accept more talks) residual replacement
strategies due to Von de Vorst (2000s). CA –
KSMs, Demmel et al (2008)
Overlapping communication over
computation
2015/7/21
SNSCC'12, [email protected]
7
A case study, Paralleling
GPBiCG(m,l) (S. Fujino, 2002)

GPBiCG(1,0)
BiCGSTAB

GPBiCG(0,1)
GPBiCG

GPBiCG(1,1)
BiCGSTAB2
Could be used to design breakdown free BiCGSTAB method.
2015/7/21
SNSCC'12, [email protected]
8
GPBiCG(m,l) (S. Fujino, 2002)
1. r0  b  Ax0 , t1  w1  0,
12.
2. for k  0,1,..., rk  tol do
13.
3.
4.
qk  Apk ;  k
*
0
k
*
0
k
5.
tk  rk   k qk ; sk  Atk
6.
yk  tk 1  tk   k wk 1
7.
if (mod (k,m + l) < m or k = 0 ) then
 sk , tk 
 sk , sk 
8.
k 
9.
uk   k qk
10.
zk   k rk   k uk
11.
rk 1  tk   k sk
2015/7/21
14.
 sk , tk  yk , tk    yk , sk  sk , tk  ,
 sk , sk  yk , yk    yk , sk  sk , yk 
 y , y  s , t    yk , tk  sk , yk 
k  k k k k
 sk , sk  yk , yk    yk , sk  sk , yk 
uk   k qk  k  tk 1  rk   k 1uk 1 
15.
zk   k rk   k zk 1   k uk
16.
rk 1  tk   k yk   k Atk
pk  rk   k 1 ( pk 1  uk 1 ),
r ,r 


r , q 
else
k 
17.
endif
18.
*
 k  r0 , rk 1 
k 
 k  r0* , rk 
19.
20.
xk  xk   k pk  zk
wk  sk   k qk
21. enddo
SNSCC'12, [email protected]
9
GPBiCG(m,l) (S. Fujino, 2002)
1. r0  b  Ax0 , t1  w1  0,
12.
2. for k  0,1,..., rk  tol do
13.
3.
4.
qk  Apk ;  k
*
0
k
*
0
k
5.
tk  rk   k qk ; sk  Atk
6.
yk  tk 1  tk   k wk 1
7.
if (mod (k,m + l) < m or k = 0 ) then
 sk , tk 
 sk , sk 
8.
k 
9.
uk   k qk
10.
zk   k rk   k uk
11.
rk 1  tk   k sk
2015/7/21
14.
 sk , tk  yk , tk    yk , sk  sk , tk  ,
 sk , sk  yk , yk    yk , sk  sk , yk 
 y , y  s , t    yk , tk  sk , yk 
k  k k k k
 sk , sk  yk , yk    yk , sk  sk , yk 
uk   k qk  k  tk 1  rk   k 1uk 1 
15.
zk   k rk   k zk 1   k uk
16.
rk 1  tk   k yk   k Atk
pk  rk   k 1 ( pk 1  uk 1 ),
r ,r 


r , q 
else
k 
17.
endif
18.
*
 k  r0 , rk 1 
k 
 k  r0* , rk 
19.
20.
xk  xk   k pk  zk
wk  sk   k qk
21. enddo
SNSCC'12, [email protected]
10
Algorithm Design of
PGPBiCG(m,l) Method
1. r0  b  Ax0 , t1  w1  0, f 0  AT r0 , p0  r0 ,
q0  Aq0 , fp 0 , rr 0 ,   rr 0 / fp 0
2. for k  0,1,..., rk  tol do
ss k yt k  sy k st k
17.
k 
, k 
18.
rr k 1  rt k   k rs k  k ry k
ss k yy k  ys k sy k
3.
tem  tk 1; tk  rk   k qk ; sk  Atk
4.
yk  tem  tk   k wk 1
19.
fu k+1   k fq k  k fh k
5.
if (mod (k,m+l) < m or k = 0 ) then
20.
fr k+1  ft k   k fs k  k fy k
st k yy k  yt k sy k
ss k yy k  ys k sy k
6.
compute st k ,ss k , rt k ,rs k , fs k , f tk , fq k fp k
22.
uk   k qk  k  tk 1  rk   k 1uk 1 
7.
 k  st k / ss k ,k  0;
23.
zk   k rk  k zk 1   k uk
8.
rr k +1  rt k   k rs k
24.
rk 1  tk  k yk   k Atk
9.
fu k+1   k fq k
25.
10.
fr k+1  ft k   k fs k
26.
11.
uk   k qk
12.
zk   k rk   k uk
13.
rk 1  tk   k sk
14.
27.
28.
29.
30.
else
15.
hk  tk 1  rk   k 1uk 1
16.
comput e st k ,ss k ,sy k , yt k , yy k ; rt k ,ry k , rs k ,
2015/7/21
fs k , fy k , f tk , fhk , fq k fp k
endif
 k rr
 k rr k
xk  xk   k pk  zk
wk  sk   k qk
pk  rk   k 1 ( pk 1  uk 1 ); qk+1 = Apk+1
fp k+1 = f rk +1 +  k fp k - fu k
k 
k 1

rr k 1
fp
SNSCC'12, [email protected] k 1
31.
 k 1 
32. enddo

xy :  x, y  direct computed
11
xy := (x, y) indirect computed
PGPBiCG(m,l) Method
(reduce # global commun. )

Algorithm reconstruct: three GobalCs to one！
Global synch.
Global synch.
2015/7/21
reconstruct
Global synch.
SNSCC'12, [email protected]
Global synch.
12
Performance
Based on Aztec by Prof. R.S. Tuminaro et al @ Sandia
2015/7/21
SNSCC'12, [email protected]
13
Convergence analysis
Our methods PGPBICG (1, 0)
rr k 1  rt k   k f sk
 fu   fq
k
k
 k

 fr k 1  ft k   k fs k

 fp k 1  fr k 1   k fp k  fu k
IBiCGSTAB, Yang , 2002


rr k 1  rr k   k rq k   k f sk
 fu k   k fq k

 fr k 1  fr k   k fq k   k f sk

 fp k 1  fr k 1   k fp k  fu k


Residual replacements strategies
Backward stable analysis
2015/7/21
SNSCC'12, [email protected]
14
Challenging problem
Accurate compute dot




Why Mindless by Kahan
Accurate compute inner product.
 Ogita and Rump –et-al, Accurate sum and dot product, SIAM Sci
Compt. 2005 cited 188 times. (but) ….
 PLASMA team
Backward stable analysis of residual replacement methods.
 Carson and Demmel, A residual replacement strategy for
improving the maximum attainable accuracy of communication
avoiding Krylov subspace Methods, April 20 2012
Reliable dot computation algorithm
2015/7/21
SNSCC'12, [email protected]
15
Conclusion:




Avoiding communication
Reliable computation
Inner product computation is very likely to be the most
challenging kernel for HHPC, while Mat_vec important for both…
Software abstraction and threads programming are helpful,
together with re-designing algorithms will do better
Math/Algorithm
CS/Performance
Aztec
POSKI
POSKI Hyper,
PETSc; Trilinos
Applications
interface
(Parallel Optimized Sparse Kernel Interface LIbrary) Poski v.1.0 May
02/2012
2015/7/21
SNSCC'12, [email protected]
16
Thanks !
2015/7/21
SNSCC'12, [email protected]
17
Initial study on communication
complexity
More than ten
thousand processors
are connected by
network
Global Communication
becomes more and
more serious
2015/7/21
SNSCC'12, [email protected]
18
Methods in literatures


Based on the former two strategies
 de Sturler and van der Vorst: Parallel GMRES(m) and CG methods
(1995)
 Bucker and Sauren: Parallel QMR method (1997)
 Yang and Brent: Improved CGS, BiCG and BiCGSTAB methods
(2002-03)
 Gu and Liu et al.: ICR, IBiCR, IBiCGSTAB(2) and
PQMRCGSTAB methods (2004-2010)
 Demmel et al CA-KSMs (2008---)
Gu, Liu and Mo: MSD-CG: multiple search direction conjugate gradient
method (2004)
 replaced the inner products computation by solving linear systems
with small size. Eliminates global inner products completely.
 The idea have been generated to MPCG by Grief and Bridson (2006)
2015/7/21
SNSCC'12, [email protected]
19
Comparison of computational
count of two Algorithms
computation kernals
compute time + communication time
vector update time：
tvec = 2Nt fl /P
mat - vec time：
tm _ v   2nz -1 Nt fl / P  
k inner products ：
tinn k   2kNt fl / P  2  ts  ktw  
No._inn
Method
Mat_vec vect_update
Syn_points
H M L T
GPBiCG(m, l )
2
18
1 2 5 2
3
PGPBiCG(m, l )
2
18
0 9 15 0
1
2015/7/21
SNSCC'12, [email protected]
20
Comparison of computational
count of two Algorithms
The time of inner product operations of GPBiCG(m, l) and PGPbiCG(m, l)
Methods
GPBiCG  m, l 
PGPBICG(m, l )
2015/7/21
position
No.
time
H
1
2t fl N / P  2 log 2 P(ts  tw )
M
2
4t fl N / P  2 log 2 P(ts  2t w )
L
5
10t fl N / P  2 log 2 P(ts  5tw )
T
2
4t fl N / P  2 log 2 P(ts  2t w )
M
9
18t fl N / P  2 log 2 P(ts  9tw )
L
15
30t fl N / P  2 log 2 P(ts  15t w )
SNSCC'12, [email protected]
21
Mathematical model of the
time consummation
N
T 
P
TG 
1

 t fl    t s , t w    

 =log 2 P 
 2 log 2 P  2  m  l  
P
1   32m  46l  2  m  l  nz  1  Nt fl
2  6  m  l  t s  10m  16l  t w
TPG 
1
  2 log 2 P  2  m  l  
P
 1   40m  60l  2  m  l  nz  1  Nt fl
 2  2  m  l  t s  18m  30l  t w
2015/7/21
TG  TPG

 66%
( ts
SNSCC'12,
[email protected]
TG
tw )
22
Scalability analysis
T  ,1
T
Scaled Speedup：S  S 
TP T  , P 
C
P
SPPG
3
SPG
Isoefficiency analysis：
N = f E  P  (E , f i xed)
Tover  N , P   PTP  N , P 
E
TS ( N ) 
Tover  N , P 
1 E
2 EP log 2 P 2  m  l   EP
G
N 

 1 (1  E )
 1 1  E 
N PG 
 2 EP log 2 P 2  m  l   EP

 1 (1  E )
 1 1  E 
N G / N PG  3
2015/7/21
SNSCC'12, [email protected]
23
The optimal number of
processors
Opt i mal number
TG 
1
TPG 
P
PIG / PG  3
 2 log 2 P  2  m  l  
1
P
  2 log 2 P  2  m  l  
P
opt
G
P
32m  46l  2  m  l  n


opt
PG
z
 1 Nt fl ln 2
6  m  l  ts  10m  16l  t w
40m  60l  2  m  l  n


z
 1 Nt fl ln 2
2  m  l  ts  18m  30l  tw
Brief proof：
  x 
1
x
  2 log 2 x  C , 1 ,  2  0 C=const
 '  x   0,  ''  x   0  x 
2015/7/21
1 ln 2
2
SNSCC'12, [email protected]
Popt
24
Convergence Analysis
Lemma. Let x,y  R N , f l  xT y  is the inner product computed by computer, and xT y is
the real value， t hen
fl  xT y  - xT y  1.01nu i=1 xi yi , where u is machine precision.
n
Conclusion. When n is very large then fl  xT y  - xT y might be much larger than u.
Our methods PGPBICG (1, 0)
rr k 1  rt k   k f sk
 fu   fq
k
k
 k

 fr k 1  ft k   k fs k

 fp k 1  fr k 1   k fp k  fu k
IBiCGSTAB, Yang , 2002


rr k 1  rr k   k rq k   k f sk
 fu k   k fq k

 fr k 1  fr k   k fq k   k f sk

 fp k 1  fr k 1   k fp k  fu k
2015/7/21


SNSCC'12, [email protected]
25
Numerical Experiments:
timing and improvements
 2u
 2u
u
u
a 2 b 2 c
d
 eu  0, x, y  (0,1)
x
y
x
y
u
u
 0,
 0, u y 0  0, u y 0  10
x x 0
x x 1
a  1512, b  c  d  1,
e0
Experiment I：
Each CPU 3600
comm/ al l ：Rc =
Tc
T
T G  T IG

TG
TcG  TcIG
c 
TcG
Experiment II
fit problem size 960  960
speed up
GPBiCG (1, 0)
GPBiCG (0,1)
GPBiCG (1,1)
GPBiCG (2,8)
2015/7/21
SNSCC'12, [email protected]
GPBiCG (8, 2)
26
Numerical Experiments:
Speedup
2015/7/21
SNSCC'12, [email protected]
27
Conclusions





PGPBiCG(m,l) method is more scalable and parallel for
solving large sparse unsymmetrical linear systems on
distributed parallel architectures
Performance, isoefficiency analysis and numerical
experiments have been done for PGPBiCG(m,l) and
GPBiCG(m,l) methods
The parallel communication performance can be improved
by a factor of larger than 3.
The PGPBiCG(m,l) method has better parallel speed up
compared with the GPBiC(m,l) method.
For further performance improvements: overlap of
computation with communication, numerical stability.
2015/7/21
SNSCC'12, [email protected]
28