Information Technology Center, The University of Tokyo: FY2008 Open-Call Project


ppOpen-HPC
Open Source Infrastructure for Development and Execution of Large-Scale Scientific Applications with Automatic Tuning (AT)
Kengo Nakajima
Information Technology Center, The University of Tokyo
Lessons Learned in the 20th Century
• Methods for scientific computing (e.g., FEM, FDM, BEM) consist of typical data structures and typical procedures.
• Optimization of each procedure is possible and effective.
• A well-defined data structure can "hide" MPI communication from the code developer.
• Code developers do not have to care about communications.
• Halo (overlap) regions for parallel FEM
[Figure: a parallel FEM mesh partitioned among four processes (PE#0-PE#3); each partition keeps a halo of external nodes copied from its neighbors so that local computation can proceed without user-level communication code.]
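The point above is that a distributed data structure with explicit import/export (halo) tables lets a library perform all MPI communication on behalf of the application. The following is a minimal sketch of such a hidden halo update; the communication tables and names are illustrative assumptions, not the actual ppOpen-HPC or GeoFEM data structures.

  ! Hypothetical sketch: halo update hidden inside a library call.
  ! The neighbor list and import/export tables are assumed to have been
  ! built by the partitioner; all names here are illustrative only.
  subroutine update_halo(val, n, npe, neib, imp_idx, imp_itm, exp_idx, exp_itm)
    use mpi
    implicit none
    integer, intent(in)    :: n, npe
    integer, intent(in)    :: neib(npe)                  ! ranks of neighboring processes
    integer, intent(in)    :: imp_idx(0:npe), imp_itm(:) ! external (halo) nodes to receive
    integer, intent(in)    :: exp_idx(0:npe), exp_itm(:) ! boundary nodes to send
    real(8), intent(inout) :: val(n)
    real(8) :: sbuf(exp_idx(npe)), rbuf(imp_idx(npe))
    integer :: req(2*npe), stat(MPI_STATUS_SIZE, 2*npe), ierr, ip, k, is, num

    do ip = 1, npe                                  ! pack values destined for each neighbor
      do k = exp_idx(ip-1)+1, exp_idx(ip)
        sbuf(k) = val(exp_itm(k))
      enddo
    enddo

    do ip = 1, npe                                  ! non-blocking exchange with all neighbors
      is = exp_idx(ip-1)+1; num = exp_idx(ip)-exp_idx(ip-1)
      call MPI_Isend(sbuf(is), num, MPI_DOUBLE_PRECISION, neib(ip), 0, MPI_COMM_WORLD, req(ip), ierr)
      is = imp_idx(ip-1)+1; num = imp_idx(ip)-imp_idx(ip-1)
      call MPI_Irecv(rbuf(is), num, MPI_DOUBLE_PRECISION, neib(ip), 0, MPI_COMM_WORLD, req(npe+ip), ierr)
    enddo
    call MPI_Waitall(2*npe, req, stat, ierr)

    do ip = 1, npe                                  ! unpack received values into halo entries
      do k = imp_idx(ip-1)+1, imp_idx(ip)
        val(imp_itm(k)) = rbuf(k)
      enddo
    enddo
  end subroutine update_halo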
ppOpen-HPC: Overview
• Open-source infrastructure for development and execution of large-scale scientific applications on post-peta-scale supercomputers with automatic tuning (AT)
• "pp": post-peta-scale
• Five-year project (FY.2011-2015, since April 2011)
• P.I.: Kengo Nakajima (ITC, The University of Tokyo)
• Part of "Development of System Software Technologies for Post-Peta Scale High Performance Computing" funded by JST/CREST (Japan Science and Technology Agency, Core Research for Evolutional Science and Technology)
• Team of 7 institutes, >30 people (5 PDs) from various fields: co-design
  • ITC/U.Tokyo, AORI/U.Tokyo, ERI/U.Tokyo, FS/U.Tokyo
  • Hokkaido U., Kyoto U., JAMSTEC
[Diagram: structure of ppOpen-HPC. A user's program is built on the framework layers: ppOpen-APPL (FEM, FDM, FVM, BEM, DEM) for application development, the ppOpen-MATH (MG, GRAPH, VIS, MP) math libraries, ppOpen-AT (STATIC, DYNAMIC) for automatic tuning, and the ppOpen-SYS (COMM, FT) system software, yielding an optimized application with optimized ppOpen-APPL and ppOpen-MATH.]
• Group Leaders
  – Masaki Satoh (AORI/U.Tokyo)
  – Takashi Furumura (ERI/U.Tokyo)
  – Hiroshi Okuda (GSFS/U.Tokyo)
  – Takeshi Iwashita (Kyoto U., ITC/Hokkaido U.)
  – Hide Sakaguchi (IFREE/JAMSTEC)
• Main Members
  – Takahiro Katagiri (ITC/U.Tokyo)
  – Masaharu Matsumoto (ITC/U.Tokyo)
  – Hideyuki Jitsumoto (ITC/U.Tokyo)
  – Satoshi Ohshima (ITC/U.Tokyo)
  – Hiroyasu Hasumi (AORI/U.Tokyo)
  – Takashi Arakawa (RIST)
  – Futoshi Mori (ERI/U.Tokyo)
  – Takeshi Kitayama (GSFS/U.Tokyo)
  – Akihiro Ida (ACCMS/Kyoto U.)
  – Miki Yamamoto (IFREE/JAMSTEC)
  – Daisuke Nishiura (IFREE/JAMSTEC)
ppOpen-HPC: ppOpen-APPL
• ppOpen-HPC consists of various optimized libraries covering the typical procedures of scientific computing.
• ppOpen-APPL/FEM, FDM, FVM, BEM, DEM
• Linear solvers, matrix assembly, AMR, visualization, etc.
• Written in Fortran 2003 (a C interface will be available soon)
• Source code developed on a PC with a single processor is linked with these libraries, and the generated parallel code is optimized for post-peta-scale systems.
• Users do not have to worry about optimization, tuning, parallelization, etc.
• Part of MPI, OpenMP, (OpenACC)
ppOpen-HPC covers …
FEM Code on ppOpen-HPC
Optimization/parallelization can be hidden from application developers:

  Program My_pFEM
    use ppOpenFEM_util
    use ppOpenFEM_solver

    call ppOpenFEM_init
    call ppOpenFEM_cntl
    call ppOpenFEM_mesh
    call ppOpenFEM_mat_init

    do
      call Users_FEM_mat_ass
      call Users_FEM_mat_bc
      call ppOpenFEM_solve
      call ppOpenFEM_vis
      Time = Time + DT
    enddo

    call ppOpenFEM_finalize
    stop
  end
ppOpen-HPC: AT & Post T2K
• Automatic Tuning (AT) enables development of
optimized codes and libraries on emerging
architectures
− Directive-based Special Language for AT
− Optimization of Memory Access
• Target system is Post T2K system
− 20-30 PFLOPS, FY.2015-2016
 JCAHPC: U. Tsukuba & U. Tokyo
− Many-core based (e.g. Intel MIC/Xeon Phi)
− ppOpen-HPC helps smooth transition of users (> 2,000) to
new system
Supercomputers in U.Tokyo
Two large systems on a 6-year cycle (FY.2005-2019 timeline):
• Hitachi SR11000/J2: 18.8 TFLOPS, 16.4 TB; fat nodes with large memory
• Hitachi SR16000/M1 (based on IBM POWER7): 54.9 TFLOPS, 11.2 TB; our last SMP, to be switched to MPP
• Hitachi HA8000 (T2K): 140 TFLOPS, 31.3 TB; (flat) MPI, good communication performance
• Fujitsu PRIMEHPC FX10 (based on SPARC64 IXfx): 1.13 PFLOPS, 150 TB; turning point to the hybrid parallel programming model
• Post T2K: 20-30 PFLOPS
[The timeline also marks the peta-scale level of the K computer (京).]
Weakly Coupled Simulation by the ppOpen-HPC Libraries
Two applications (Seism3D+, based on FDM and ppOpen-APPL/FDM, and FrontISTR++, based on FEM and ppOpen-APPL/FEM) are connected by the ppOpen-MATH/MP coupler: velocity on the Seism3D+ side and displacement on the FrontISTR++ side are exchanged through the coupler.
Principal functions of the coupler:
• Make a mapping table
• Convert physical variables
• Choose the timing of data transmission
• …
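As a rough illustration of how such a coupled time loop might look on the FDM side, here is a hypothetical sketch. The coupler_* routine names are placeholders invented for illustration only; they are not the actual ppOpen-MATH/MP API.

  ! Hypothetical sketch of the FDM-side time loop in a weakly coupled run.
  ! coupler_* names are placeholders, not the real ppOpen-MATH/MP interface.
  program fdm_side
    implicit none
    integer, parameter :: nstep = 100, ncouple = 10      ! assumed coupling interval
    integer :: istep
    real(8) :: vel(3,1000), disp(3,1000)                 ! illustrative field sizes

    vel = 0.0d0; disp = 0.0d0
    call coupler_init()                     ! would build the FDM-FEM mapping table
    do istep = 1, nstep
      call fdm_update_velocity(vel)         ! ordinary FDM time step
      if (mod(istep, ncouple) == 0) then    ! "choose the timing of data transmission"
        call coupler_put(vel)               ! send velocity; variables converted here
        call coupler_get(disp)              ! receive displacement from the FEM side
      endif
    enddo
    call coupler_finalize()

  contains
    subroutine coupler_init()               ! stub
    end subroutine coupler_init
    subroutine coupler_finalize()           ! stub
    end subroutine coupler_finalize
    subroutine coupler_put(v)               ! stub
      real(8), intent(in) :: v(:,:)
    end subroutine coupler_put
    subroutine coupler_get(d)               ! stub
      real(8), intent(inout) :: d(:,:)
    end subroutine coupler_get
    subroutine fdm_update_velocity(v)       ! stub
      real(8), intent(inout) :: v(:,:)
      v = v + 1.0d-3
    end subroutine fdm_update_velocity
  end program fdm_side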
Example of directives for ppOpen-AT: loop splitting/fusion
!oat$ install LoopFusionSplit region start
!$omp parallel do private(k,j,i,STMP1,STMP2,STMP3,STMP4,RL,RM,RM2,RMAXY,RMAXZ,RMAYZ,RLTHETA,QG)
DO K = 1, NZ
DO J = 1, NY
DO I = 1, NX
  RL = LAM (I,J,K); RM = RIG (I,J,K); RM2 = RM + RM
  RLTHETA = (DXVX(I,J,K)+DYVY(I,J,K)+DZVZ(I,J,K))*RL
!oat$ SplitPointCopyDef region start
  QG = ABSX(I)*ABSY(J)*ABSZ(K)*Q(I,J,K)
!oat$ SplitPointCopyDef region end
  SXX (I,J,K) = ( SXX (I,J,K) + (RLTHETA + RM2*DXVX(I,J,K))*DT )*QG
  SYY (I,J,K) = ( SYY (I,J,K) + (RLTHETA + RM2*DYVY(I,J,K))*DT )*QG
  SZZ (I,J,K) = ( SZZ (I,J,K) + (RLTHETA + RM2*DZVZ(I,J,K))*DT )*QG
!oat$ SplitPoint (K, J, I)
  STMP1 = 1.0/RIG(I,J,K); STMP2 = 1.0/RIG(I+1,J,K); STMP4 = 1.0/RIG(I,J,K+1)
  STMP3 = STMP1 + STMP2
  RMAXY = 4.0/(STMP3 + 1.0/RIG(I,J+1,K) + 1.0/RIG(I+1,J+1,K))
  RMAXZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I+1,J,K+1))
  RMAYZ = 4.0/(STMP3 + STMP4 + 1.0/RIG(I,J+1,K+1))
!oat$ SplitPointCopyInsert
  SXY (I,J,K) = ( SXY (I,J,K) + (RMAXY*(DXVY(I,J,K)+DYVX(I,J,K)))*DT )*QG
  SXZ (I,J,K) = ( SXZ (I,J,K) + (RMAXZ*(DXVZ(I,J,K)+DZVX(I,J,K)))*DT )*QG
  SYZ (I,J,K) = ( SYZ (I,J,K) + (RMAYZ*(DYVZ(I,J,K)+DZVY(I,J,K)))*DT )*QG
END DO; END DO; END DO
!$omp end parallel do
!oat$ install LoopFusionSplit region end
[Bar chart: speedup (%) obtained by ppOpen-AT on Seism3D (FDM) on 8 Xeon Phi nodes.]
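For readers unfamiliar with the transformation the directives request, the sketch below shows the idea of loop fusion versus loop splitting on a much simpler kernel; the array names are illustrative, not taken from Seism3D. ppOpen-AT generates and times such candidate variants and selects the fastest one for the target architecture.

  ! Illustrative only (not Seism3D code): two semantically equivalent variants
  ! that an AT framework such as ppOpen-AT could generate, time, and choose from.
  subroutine update_fused(n, a, b, c)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(inout) :: a(n), c(n)
    real(8), intent(in)    :: b(n)
    integer :: i
    do i = 1, n                        ! one fused loop: fewer passes over memory
      a(i) = a(i) + 2.0d0*b(i)
      c(i) = c(i) + 0.5d0*a(i)*b(i)
    enddo
  end subroutine update_fused

  subroutine update_split(n, a, b, c)
    implicit none
    integer, intent(in)    :: n
    real(8), intent(inout) :: a(n), c(n)
    real(8), intent(in)    :: b(n)
    integer :: i
    do i = 1, n                        ! split loops: smaller register/cache pressure
      a(i) = a(i) + 2.0d0*b(i)
    enddo
    do i = 1, n
      c(i) = c(i) + 0.5d0*a(i)*b(i)
    enddo
  end subroutine update_split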
Schedule of Public Release
(with English Documents)
http://ppopenhpc.cc.u-tokyo.ac.jp/
• Released at the SC conference each year (also downloadable from the web site above)
• Multicore/manycore cluster version (flat MPI, OpenMP/MPI hybrid) with documents in English
• We are now focusing on MIC/Xeon Phi
• Collaborations with scientists are welcome
History
• SC12, Nov 2012 (Ver.0.1.0)
• SC13, Nov 2013 (Ver.0.2.0)
• SC14, Nov 2014 (Ver.0.3.0)
New Features in Ver.0.3.0
http://ppopenhpc.cc.u-tokyo.ac.jp/
• ppOpen-APPL/AMR-FDM: AMR framework with a dynamic load-balancing method for various FDM applications
• HACApK library for H-matrix computation in ppOpen-APPL/BEM
• Utilities for preprocessing in ppOpen-APPL/DEM
• Booth #713
[Figure: assignment of small submatrices of an H-matrix to processes P0-P7 in HACApK.]
Collaborations, Outreach
• Collaborations
  – International collaborations
    • Lawrence Berkeley National Lab.
    • National Taiwan University
    • IPCC (Intel Parallel Computing Center)
• Outreach, applications
  – Large-scale simulations
    • Geologic CO2 storage
    • Astrophysics
    • Earthquake simulations, etc.
    • ppOpen-AT, ppOpen-MATH/VIS, ppOpen-MATH/MP, linear solvers
  – International workshops (2012, 2013)
  – Tutorials, classes
From Post-Peta to Exascale
• Currently, we are focusing on the Post-T2K system based on manycore architectures (Intel Xeon Phi).
• The outline of exascale systems is much clearer than it was in 2011 (when this project started).
  – Frameworks like ppOpen-HPC are really needed
    • More complex and huge systems
    • More difficult to extract application performance
  – A smooth transition from post-peta to exa will be possible through continuous development and improvement of ppOpen-HPC (we need funding for that!)
• Research topics in the exascale era
  – Power-aware algorithms/AT
  – Communication/synchronization-reducing algorithms
• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal
SIAM PP14
16th SIAM Conference on Parallel Processing for Scientific Computing, Feb. 18-21, 2014
http://www.siam.org/meetings/pp14/
• Comm./synch. avoiding/reducing algorithms
  – Direct/iterative solvers, preconditioning, s-step methods
  – Coarse grid solvers on parallel multigrid
• Preconditioning methods on manycore architectures
  – SPAI, Poly., ILU with multicoloring/RCM (no CM-RCM), GMG
  – Asynchronous ILU on GPU (E. Chow, Georgia Tech)?
• Preconditioning methods for ill-conditioned problems
  – Low-rank approximation
• Block-structured AMR
  – SFC: space-filling curve
Communications are expensive ...
Ph.D. thesis by Mark Hoemmen (SNL) at UC Berkeley on CA-KSP solvers (2010)
• Serial communications
  – Data transfer through the hierarchical memory system
• Parallel communications
  – Message passing through the network
• Efficiency of memory access -> exa-feasible applications
Future of CAE Applications
• FEM with fully unstructured meshes is infeasible for exa-scale supercomputer systems
• Block-structured AMR, voxel-type FEM (refinement levels 0, 1, 2, ...)
• Serial comm.
  – Fixed inner loops (e.g., sliced ELL)
• Parallel comm.
  – Krylov iterative solvers
  – Preconditioning, dot products, halo communications
ELL: Fixed Loop Length, Nice for Prefetching
[Figure: the same sparse matrix stored in (a) CRS format, where row lengths vary, and (b) ELL format, where every row is padded to the same fixed length.]
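To make the "fixed loop length" point concrete, here is a minimal sparse matrix-vector multiply in both formats; the array names are generic, not those used inside ppOpen-HPC. In ELL the inner loop has the same trip count for every row, which is what makes prefetching and SIMD vectorization easier.

  ! Generic CRS vs. ELL SpMV kernels (illustrative array names).
  subroutine spmv_crs(n, ptr, col, val, x, y)
    implicit none
    integer, intent(in)  :: n, ptr(n+1), col(*)
    real(8), intent(in)  :: val(*), x(n)
    real(8), intent(out) :: y(n)
    integer :: i, k
    do i = 1, n
      y(i) = 0.0d0
      do k = ptr(i), ptr(i+1)-1        ! row length varies from row to row
        y(i) = y(i) + val(k)*x(col(k))
      enddo
    enddo
  end subroutine spmv_crs

  subroutine spmv_ell(n, nnzrow, col, val, x, y)
    implicit none
    integer, intent(in)  :: n, nnzrow, col(nnzrow,n)
    real(8), intent(in)  :: val(nnzrow,n), x(n)
    real(8), intent(out) :: y(n)
    integer :: i, k
    do i = 1, n
      y(i) = 0.0d0
      do k = 1, nnzrow                 ! fixed trip count for every row;
        y(i) = y(i) + val(k,i)*x(col(k,i))   ! padded entries carry val = 0 and a valid index
      enddo
    enddo
  end subroutine spmv_ell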
Special Treatment for "Boundary" Meshes Connected to the Halo
• Distribution of lower/upper non-zero off-diagonal components (external meshes, internal meshes on the boundary, pure internal meshes)
• If we adopt RCM (or CM) reordering ...
• Pure internal meshes
  – L: ~3, U: ~3
• Boundary meshes (coupled to external/halo meshes)
  – L: ~3, U: ~6
[Figure: 3D mesh showing pure internal meshes, internal meshes on the partition boundary, and external (halo) meshes; internal lower/upper and external upper couplings are marked.]
Original ELL: Backward Substitution
Cache is not well utilized: both boundary cells (up to 6 upper off-diagonals) and pure internal cells (about 3) are stored in the same width-6 arrays IAUnew(6,N), AUnew(6,N), so the internal rows are mostly padding.
Improved ELL: Backward Substitution
Cache is well utilized: the arrays are separated into AUnew3(3,N) for pure internal cells and AUnew6(6,N) for boundary cells, following the sliced ELL idea [Monakov et al. 2010] (originally for SpMV on GPU).
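A sketch of what the separated kernels might look like is given below. The layout is an assumption made for illustration (pure internal cells numbered first, boundary cells last, inverse diagonals in DD, padded entries holding zero coefficients and a valid index); the actual data layout in ppOpen-HPC may differ.

  ! Hypothetical sketch of backward substitution with separated ELL arrays.
  ! Assumed layout: rows 1..Nin are pure internal cells (3 upper entries),
  ! rows Nin+1..N are boundary cells (6 upper entries); arrays are declared
  ! over all N here only for simplicity.
  subroutine bwd_subst_split(n, nin, iau3, au3, iau6, au6, dd, x)
    implicit none
    integer, intent(in)    :: n, nin
    integer, intent(in)    :: iau3(3,n), iau6(6,n)
    real(8), intent(in)    :: au3(3,n), au6(6,n), dd(n)
    real(8), intent(inout) :: x(n)
    integer :: i, k
    real(8) :: w

    do i = n, nin+1, -1              ! boundary cells: fixed width 6
      w = x(i)
      do k = 1, 6
        w = w - au6(k,i)*x(iau6(k,i))
      enddo
      x(i) = w*dd(i)
    enddo
    do i = nin, 1, -1                ! pure internal cells: fixed width 3
      w = x(i)
      do k = 1, 3
        w = w - au3(k,i)*x(iau3(k,i))
      enddo
      x(i) = w*dd(i)
    enddo
  end subroutine bwd_subst_split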
Analyses by the Detailed Profiler of Fujitsu FX10
Single node, flat MPI, RCM (multigrid part), 64^3 cells/core:

                 Instructions  L1D miss    L2 miss     SIMD op. ratio  GFLOPS
  CRS            1.53x10^9     2.32x10^7   1.67x10^7   30.14%          6.05
  Original ELL   4.91x10^8     1.67x10^7   1.27x10^7   93.88%          6.99
  Improved ELL   4.91x10^8     1.67x10^7   9.14x10^6   93.88%          8.56
Hierarchical CGA (hCGA): Communication-Reducing Multigrid
Reduced number of MPI processes [KN 2013]
[Figure: multigrid hierarchy from the fine level (level 1) down to the coarsest level (level m-2); at the coarse end the problem is gathered so that the coarse grid solver runs on a single MPI process (multithreaded, further multigrid).]
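The sketch below illustrates the basic communication-reducing idea in a hypothetical form: once the per-process problem becomes small, the coarse-level vectors are gathered onto rank 0, solved there with a (multithreaded) solver, and the result is scattered back. Array and routine names are illustrative, not the ppOpen-MATH/MG implementation.

  ! Hypothetical sketch of a coarse grid solve gathered onto one rank (hCGA idea).
  subroutine coarse_solve_gathered(nloc, b, x, comm)
    use mpi
    implicit none
    integer, intent(in)  :: nloc, comm
    real(8), intent(in)  :: b(nloc)
    real(8), intent(out) :: x(nloc)
    integer :: rank, nproc, ierr, i, sendn(1)
    integer, allocatable :: counts(:), displs(:)
    real(8), allocatable :: bg(:), xg(:)

    call MPI_Comm_rank(comm, rank, ierr)
    call MPI_Comm_size(comm, nproc, ierr)
    allocate (counts(nproc), displs(nproc))
    sendn(1) = nloc
    call MPI_Allgather(sendn, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, comm, ierr)
    displs(1) = 0
    do i = 2, nproc
      displs(i) = displs(i-1) + counts(i-1)
    enddo
    allocate (bg(sum(counts)), xg(sum(counts)))

    ! gather the coarse-level right-hand side onto rank 0
    call MPI_Gatherv(b, nloc, MPI_DOUBLE_PRECISION, bg, counts, displs, &
                     MPI_DOUBLE_PRECISION, 0, comm, ierr)
    if (rank == 0) then
      xg = bg          ! placeholder: solve the gathered coarse problem here
    endif              ! (multithreaded multigrid or a direct solver in practice)
    ! scatter the coarse solution back to all ranks
    call MPI_Scatterv(xg, counts, displs, MPI_DOUBLE_PRECISION, x, nloc, &
                      MPI_DOUBLE_PRECISION, 0, comm, ierr)
    deallocate (counts, displs, bg, xg)
  end subroutine coarse_solve_gathered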
Weak Scaling: ~4,096 Nodes
Up to 17,179,869,184 meshes (64^3 meshes/core); down is good.

  Case  Matrix        Coarse grid solver
  C0    CRS           single core
  C1    ELL (org.)    single core
  C2    ELL (org.)    CGA
  C3    ELL (sliced)  CGA
  C4    ELL (sliced)  hCGA

[Plots: elapsed time (sec.) versus core count (100 to ~100,000) on FX10, comparing cases C0-C4 under Flat MPI and hybrid (HB 4x4, HB 8x2, HB 16x1) configurations; the effects of sliced ELL and of hCGA appear at large core counts.]
Framework for Exa-Feasible Applications: pK-Open-HPC
• Should be based on voxel-type meshes
• Robust/scalable geometric/algebraic multigrid preconditioning
  – for various types of applications and PDEs
  – robust smoothers
  – low-rank approximations
  – overhead of coarse grid solvers (hCGA)
  – dot products
  – limited or fixed number of non-zero off-diagonals in the coefficient matrices at every level
    • Generally speaking, AMG has many non-zero off-diagonals on coarse-level meshes
• Preprocessors
AMR: Octree (8/27 Children), Self-Similar Blocks
• Each self-similar block is refined so that the cell size becomes half while the number of cells per block stays the same.
• Framework program (loops over blocks, copies each block's data into uniform-cell arrays, calls the user's uniform-cell routine, and copies the results back):

  program framework
    type(oct_Block) :: block(Nall)
    real :: A(nx,ny,nz), B(nx,ny,nz), C(nx,ny,nz)
    ! initial parameter setup
    do istep = 1, Nstep                     ! time loop
      ! treatment of cell refinement
      do index = 1, Nall                    ! block loop
        ! (1) data copy from block arrays
        A(:,:,:) = block(index)%F(1,:,:,:)
        B(:,:,:) = block(index)%F(2,:,:,:)
        C(:,:,:) = block(index)%F(3,:,:,:)
        ! (2) insert uniform-cell program
        call advance_field
        ! (3) data copy back to block arrays
        block(index)%F(1,:,:,:) = A(:,:,:)
        block(index)%F(2,:,:,:) = B(:,:,:)
        block(index)%F(3,:,:,:) = C(:,:,:)
      enddo
      ! outer boundary treatment
    enddo
    stop
  end program

• Uniform-cell program (written by the user for a uniform grid):

  subroutine advance_field
    real :: A(nx,ny,nz), B(nx,ny,nz), C(nx,ny,nz)
    do iz = 1, nz
      do iy = 1, ny
        do ix = 1, nx
          A(ix,iy,iz) = ...
          B(ix,iy,iz) = ...
          C(ix,iy,iz) = ...
        enddo
      enddo
    enddo
    return
  end subroutine advance_field
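The derived type oct_Block is not defined on the slide; a minimal definition that would make the framework sketch compile might look like the following. This is purely an assumption about the layout, not the actual ppOpen-APPL/AMR-FDM type.

  ! Assumed, illustrative definition of the block type used above; the real
  ! ppOpen-APPL/AMR-FDM data structure is more elaborate (octree links, levels, ...).
  module oct_block_mod
    implicit none
    integer, parameter :: nx = 16, ny = 16, nz = 16   ! assumed cells per block
    type :: oct_Block
      integer :: level                                ! refinement level
      integer :: parent, child(8)                     ! octree connectivity (block indices)
      real    :: F(3, nx, ny, nz)                     ! field variables A, B, C
    end type oct_Block
  end module oct_block_mod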
• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal
Extension of Depth of Overlapping
[Figure: the partitioned FEM mesh of the earlier halo example (PE#0-PE#3) with the overlapped elements extended one layer deeper; ●: internal nodes, ●: external nodes, ■: overlapped elements.]
Cost for computation and communication may increase.
HID: Hierarchical Interface Decomposition [Henon & Saad 2007]
• Multilevel domain decomposition
  – Extension of nested dissection
• Non-overlapping at each level: connectors, separators
• Suitable for parallel preconditioning methods
[Figure: a 2D grid partitioned among domains 0-3; level-1 unknowns belong to a single domain, level-2 unknowns to connectors shared by two domains (0,1 / 2,3 / 0,2 / 1,3), and level-4 unknowns to the separator shared by all four domains (0,1,2,3).]
Parallel ILU in HID for Each Connector at Each Level
• The unknowns are reordered according to their level numbers, from the lowest to the highest.
• The block structure of the reordered matrix leads to natural parallelism when ILU/IC decompositions or forward/backward substitution processes are applied.
[Figure: reordered matrix with diagonal blocks for domains 0-3 (level 1), connectors 0,1 / 0,2 / 2,3 / 1,3 (level 2), and the separator 0,1,2,3 (level 4).]
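To illustrate why the level-wise block structure yields natural parallelism, here is a small, self-contained toy sketch in which each block of one level is an independent dense lower-triangular system, and blocks of the same level are solved concurrently with OpenMP. This is a toy model of the idea, not the HID implementation itself.

  ! Toy sketch: blocks belonging to the same HID level share no unknowns,
  ! so forward substitution can proceed level by level, blocks in parallel.
  program hid_level_parallel
    implicit none
    integer, parameter :: nlev = 3, nblk = 4, nb = 8   ! assumed sizes
    real(8) :: L(nb,nb,nblk,nlev), b(nb,nblk,nlev), x(nb,nblk,nlev)
    integer :: lev, iblk, i, j

    call random_number(L)
    call random_number(b)
    do lev = 1, nlev
      !$omp parallel do private(iblk, i, j)
      do iblk = 1, nblk                   ! independent blocks within one level
        do i = 1, nb                      ! dense forward substitution per block
          x(i,iblk,lev) = b(i,iblk,lev)
          do j = 1, i-1
            x(i,iblk,lev) = x(i,iblk,lev) - L(i,j,iblk,lev)*x(j,iblk,lev)
          enddo
          x(i,iblk,lev) = x(i,iblk,lev)/(L(i,i,iblk,lev) + 1.0d0)  ! shifted diagonal
        enddo
      enddo
      !$omp end parallel do
      ! (in HID, couplings from this level would now update the RHS of higher levels)
    enddo
    write(*,*) 'checksum:', sum(x)
  end program hid_level_parallel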
Results: 64 Cores
Contact problems, 3,090,903 DOF, GPBiCG with BILU(p)-(depth of overlapping):
■ BILU(p)-(0): block Jacobi
■ BILU(p)-(1)
■ BILU(p)-(1+)
■ BILU(p)-HID
[Charts: iterations and elapsed time (sec.) for BILU(1), BILU(1+), and BILU(2) with each of the four overlapping/HID variants.]
Hetero 3D (1/2)
• Parallel FEM code (flat MPI)
  – 3D linear elasticity problems in cube geometries with heterogeneity
  – SPD matrices
  – Young's modulus: 10^-6 to 10^+6; (Emin-Emax) controls the condition number
  – Boundary conditions: uniform distributed force in the z-direction at z=Zmax; Ux=0 at x=Xmin, Uy=0 at y=Ymin, Uz=0 at z=Zmin
  – Mesh: (Nx-1)x(Ny-1)x(Nz-1) elements, Nx x Ny x Nz nodes
• Preconditioned iterative solvers
  – GPBi-CG [Zhang 1997]
  – BILUT(p,d,t)
• Domain decomposition
  – Localized block Jacobi with extended overlapping (LBJ)
  – HID / extended HID
Hetero 3D (2/2)
• Based on the parallel FEM procedure of GeoFEM
  – Benchmark developed in the FP3C project under Japan-France collaboration
• Parallel mesh generation
  – Fully parallel: each process generates its local mesh and assembles its local matrices
  – Total number of vertices in each direction: (Nx, Ny, Nz)
  – Number of partitions in each direction: (Px, Py, Pz)
  – Total number of MPI processes: PxPyPz
  – Each MPI process has (Nx/Px)(Ny/Py)(Nz/Pz) vertices
  – The spatial distribution of Young's modulus is given by an external file that describes the heterogeneity of a 128^3 cube
    • If Nx (or Ny or Nz) is larger than 128, the 128^3 distribution is repeated periodically in each direction
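The periodic repetition of the 128^3 heterogeneity field can be expressed with a simple modulo lookup; the sketch below illustrates that rule (the module, function, and array names are not from the benchmark code).

  ! Illustrative: sample a 128^3 heterogeneity field periodically for a larger grid.
  module hetero_field
    implicit none
    integer, parameter :: nfield = 128
    real(8) :: efield(nfield, nfield, nfield)   ! Young's modulus read from the external file
  contains
    function young_modulus(i, j, k) result(e)
      integer, intent(in) :: i, j, k            ! global vertex indices (1-based)
      real(8) :: e
      ! wrap indices back into 1..128 so the field repeats periodically
      e = efield(mod(i-1, nfield)+1, mod(j-1, nfield)+1, mod(k-1, nfield)+1)
    end function young_modulus
  end module hetero_field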
BILUT(p,d,t)
• Incomplete LU factorization with threshold (ILUT)
• ILUT(p,d,t) [KN 2010]
  – p: maximum fill level, specified before factorization
  – d, t: dropping-tolerance criteria applied before/after factorization
• Process (b) can be substituted by other factorization methods or by more powerful direct linear solvers such as MUMPS, SuperLU, etc.
[Diagram: (a) components of the initial matrix A with |Aij| < d are dropped (by value and location), giving the dropped matrix A'; (b) ILU(p) factorization of A' gives (ILU)'; (c) components of (ILU)' below t are dropped (by value and location), giving the final factor (ILUT)'.]
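The two dropping stages, (a) and (c), amount to filtering small off-diagonal entries of a sparse matrix against a tolerance. A minimal sketch of such a filter on a CRS matrix is shown below; names are generic, and stage (b), the ILU(p) factorization itself, is omitted.

  ! Generic sketch of the value-based dropping used in stages (a) and (c):
  ! keep diagonal entries and any off-diagonal entry with |value| >= tol.
  subroutine drop_small_entries(n, ptr, col, val, tol, nptr, ncol, nval)
    implicit none
    integer, intent(in)  :: n, ptr(n+1), col(*)
    real(8), intent(in)  :: val(*), tol
    integer, intent(out) :: nptr(n+1), ncol(*)
    real(8), intent(out) :: nval(*)
    integer :: i, k, cnt

    cnt = 0
    nptr(1) = 1
    do i = 1, n
      do k = ptr(i), ptr(i+1)-1
        if (col(k) == i .or. abs(val(k)) >= tol) then
          cnt = cnt + 1
          ncol(cnt) = col(k)
          nval(cnt) = val(k)
        endif
      enddo
      nptr(i+1) = cnt + 1
    enddo
  end subroutine drop_small_entries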
Preliminary Results
• Hardware
  – 16-240 nodes (256-3,840 cores) of Fujitsu PRIMEHPC FX10 (Oakleaf-FX), University of Tokyo
• Problem setting
  – 420x320x240 vertices (3.194x10^7 elements, 9.677x10^7 DOF)
  – Strong scaling
  – Effect of the thickness of the overlapped zones
    • BILUT(p,d,t)-LBJ-X (X=1,2,3)
  – RCM-entire renumbering for LBJ
Effect of t on Performance
BILUT(2,0,t)-GPBi-CG with 240 nodes (3,840 cores); Emin=10^-6, Emax=10^+6.
Normalized by the results of BILUT(2,0,0)-LBJ-2; ●: [NNZ], ▲: iterations, ◆: solver time.
[Plots: normalized ratio versus t (0 to 3.0x10^-2) for BILUT(2,0,t)-HID and BILUT(2,0,t)-LBJ-2.]
BILUT(p,0,0) at 3,840 Cores (No Dropping): Effect of Fill-in

  Preconditioner       NNZ of [M]   Set-up (sec.)  Solver (sec.)  Total (sec.)  Iterations
  BILUT(1,0,0)-LBJ-1   1.920x10^10   1.35           65.2           66.5          1916
  BILUT(1,0,0)-LBJ-2   2.519x10^10   2.03           61.8           63.9          1288
  BILUT(1,0,0)-LBJ-3   3.197x10^10   2.79           74.0           76.8          1367
  BILUT(2,0,0)-LBJ-1   3.351x10^10   3.09           71.8           74.9          1339
  BILUT(2,0,0)-LBJ-2   4.394x10^10   4.39           65.2           69.6           939
  BILUT(2,0,0)-LBJ-3   5.631x10^10   5.95           83.6           89.6          1006
  BILUT(3,0,0)-LBJ-1   6.468x10^10   9.34          105.2          114.6          1192
  BILUT(3,0,0)-LBJ-2   8.523x10^10  12.7            98.4          111.1           823
  BILUT(3,0,0)-LBJ-3   1.101x10^11  17.3           101.6          118.9           722
  BILUT(1,0,0)-HID     1.636x10^10   2.24           60.7           62.9          1472
  BILUT(2,0,0)-HID     2.980x10^10   5.04           66.2           71.7          1096

  [NNZ] of [A]: 7.174x10^9
BILUT(p,0,0) at 3,840 Cores (No Dropping): Effect of Overlapping
(Same data as the table above; compare the LBJ-1/2/3 rows for each fill level.)
BILUT(p,0,t) at 3,840 Cores: Optimum Value of t

  Preconditioner                NNZ of [M]   Set-up (sec.)  Solver (sec.)  Total (sec.)  Iterations
  BILUT(1,0,2.75x10^-2)-LBJ-1   7.755x10^9    1.36           45.0           46.3          1916
  BILUT(1,0,2.75x10^-2)-LBJ-2   1.019x10^10   2.05           42.0           44.1          1383
  BILUT(1,0,2.75x10^-2)-LBJ-3   1.285x10^10   2.81           54.2           57.0          1492
  BILUT(2,0,1.00x10^-2)-LBJ-1   1.118x10^10   3.11           39.1           42.2          1422
  BILUT(2,0,1.00x10^-2)-LBJ-2   1.487x10^10   4.41           37.1           41.5          1029
  BILUT(2,0,1.00x10^-2)-LBJ-3   1.893x10^10   5.99           37.1           43.1           915
  BILUT(3,0,2.50x10^-2)-LBJ-1   8.072x10^9    9.35           38.4           47.7          1526
  BILUT(3,0,2.50x10^-2)-LBJ-2   1.063x10^10  12.7            35.5           48.3          1149
  BILUT(3,0,2.50x10^-2)-LBJ-3   1.342x10^10  17.3            40.9           58.2          1180
  BILUT(1,0,2.50x10^-2)-HID     6.850x10^9    2.25           38.5           40.7          1313
  BILUT(2,0,1.00x10^-2)-HID     1.030x10^10   5.04           36.1           41.1          1064

  [NNZ] of [A]: 7.174x10^9
Strong Scaling up to 3,840 Cores
Speed-up is measured from the elapsed computation time (set-up + solver) relative to BILUT(1,0,2.50x10^-2)-HID with 256 cores. (Annotation on the chart: MUMPS, low-rank approximation.)
[Plots: speed-up versus core count (with the ideal line) and parallel performance (%, roughly 70-130) versus core count for BILUT(1,0,2.50e-2)-HID, BILUT(2,0,1.00e-2)-HID, BILUT(1,0,2.75e-2)-LBJ-2, BILUT(2,0,1.00e-2)-LBJ-2, and BILUT(3,0,2.50e-2)-LBJ-2, from 256 up to 3,840 cores.]
• pK-Open-HPC
• Ill-Conditioned Problems
• SPPEXA Proposal
Developments for SPPEXA (1/2)
• Geometric/Algebraic Multigrid Solvers
• Robust Preconditioning Method with Special
Partitioning Method
• pK-Open-HPC
– Robust and Scalable
– Communication Reducing/Avoiding Algorithms
– Extensions of Sliced ELL for Future Architectures
• Post K (Sparc), Xeon Phi, Nvidia Volta
• U.Tokyo’s Systems
– FY.2016: KNL-based System (20-30 PF)
– FY.2018-2019: Post FX10 System (Sparc ?, Nvidia ?) (50PF)
– OpenMP/MPI Hybrid
– Automatic Tuning (AT)
• Parameter Selection, Stencil Construction
Developments for SPPEXA (2/2)
• Deployment of pK-Open-HPC will be supported by JST/CREST (until March 2016) and by the RIKEN-U.Tokyo collaboration (Dec. 2014 - 2020)
  – not much funding
• FY.2016, FY.2017: supported by JST/CREST if the proposal is accepted
• FY.2018 (until March 2019): RIKEN-U.Tokyo collaboration; other funding needed
• Members
  – University of Tokyo: K. Nakajima, T. Katagiri
  – Hokkaido University: T. Iwashita (AMG)
  – Kyoto University: A. Ida (AMG, low-rank)
  – French partners?: Michel Dayde (Toulouse)
  – Users, application groups
Multigrid Solvers (1/2)
• Geometric, algebraic
  – Adaptive mesh refinement (AMR)
• Robust for various types of problems
  – Target applications: Poisson, solid mechanics, ME
  – Smoothers: ILU/IC, etc.
• Carefully designed strategy of stencil construction for feasibility on post-peta/exascale systems
  – Modified/extended sliced ELL
  – Generally speaking, AMG matrices become dense at coarse levels
    • Critical for load balancing on massively parallel systems
    • ELL is then difficult to apply directly, so some strategy for stencil construction is needed
Multigrid Solvers (2/2)
• MPI/OpenMP hybrid
– Reordering for extraction of parallelism
– hCGA-type strategy for communication reduction
• Automatic Tuning
– Stencil construction
– Various parameters
Robust Preconditioner
• ILU/BILU(p,d,t)
  – Strategy for domain decomposition
  – Flexible overlapping, local reordering
  – Extended HID
  – AT for selection of optimum parameters
• Low-rank approximation
  – H-matrix solver
  – MUMPS
(Rough) Schedule (1/2)
• 1st yr.
– Multigrid
• Development & Test
– ILU/BILU
• Development & Test
– Low-Rank-Approximation
• Development & Test
• 2nd yr.
– Multigrid
• Development & Test (cont.), Optimization, AT
– ILU/BILU
• Development & Test (cont.), Optimization, AT
– Low-Rank-Approximation
• Development & Test (cont.), Optimization
(Rough) Schedule (2/2)
• 3rd yr.
– Evaluations through real applications
– Multigrid
• Development & Test, Optimization, AT (cont.)
– ILU/BILU
• Development & Test, Optimization, AT (cont.)
– Low-Rank-Approximation
• Development & Test, Optimization, AT (cont.)
• Proposal
– Deadline: Jan.31, 2015
– KN attends a conference in Barcelona Jan.29-30, and can
visit Germany on Jan.27 (T).