MUMPS - GRAAL


http://www.ens-lyon.fr/~jylexcel/MUMPS/
http://www.enseeiht.fr/apo/MUMPS/
MUMPS
A Multifrontal Massively Parallel Solver
Main features: MUMPS solves large systems of linear equations of the
form Ax = b by factorizing A into A = LU or A = LDL^T.
• symmetric or unsymmetric matrices (partial pivoting),
• parallel factorization and solve phases (a uniprocessor version is also available),
• iterative refinement and backward error analysis,
• various matrix input formats (an illustrative sketch follows this list):
  • assembled format
  • distributed assembled format
  • sum of elemental matrices
• null-space functionalities (experimental): rank detection and null-space basis,
• partial factorization and Schur complement matrix,
• version for complex arithmetic.
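To make the assembled input format concrete, here is a small sketch of how a 3x3 unsymmetric system Ax = b can be described in centralized assembled (coordinate) form. The array names are illustrative, not prescribed by MUMPS; a full calling sequence using such arrays is sketched after the IMPLEMENTATION list below.

/* Illustrative sketch: a 3x3 unsymmetric matrix in assembled
 * (coordinate) format.  Indices are 1-based (Fortran convention).
 * Array names are ours, not MUMPS'.
 *
 *      | 2  0  1 |        | 3 |
 *  A = | 0  3  0 |,   b = | 3 |        (solution x = (1, 1, 1))
 *      | 1  0  4 |        | 5 |
 */
int    n     = 3;                           /* order of the matrix        */
int    nz    = 5;                           /* number of nonzero entries  */
int    irn[] = { 1, 1, 2, 3, 3 };           /* row indices of the entries */
int    jcn[] = { 1, 3, 2, 1, 3 };           /* column indices             */
double a[]   = { 2.0, 1.0, 3.0, 1.0, 4.0 }; /* numerical values           */
double rhs[] = { 3.0, 3.0, 5.0 };           /* right-hand side b          */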
IMPLEMENTATION
• Distributed multifrontal solver
• MPI / F90 based (C user interface also available; a minimal calling sequence is sketched after this list)
• Stability based on partial pivoting
• Dynamic distributed scheduling to accommodate both numerical fill-in and a multi-user environment
• Use of BLAS, LAPACK, ScaLAPACK
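As a rough illustration of the C user interface mentioned above, the sketch below follows the usual MUMPS calling sequence (initialize, analyse + factorize + solve, terminate) on the small system from the previous sketch. It is modelled on the classic c_example of the double-precision interface (dmumps_c / DMUMPS_STRUC_C); field names and job codes should be checked against the user's guide of the installed MUMPS version.

/* Minimal sketch of a MUMPS driver through its C interface.
 * Assumes the double-precision interface (dmumps_c.h); structure field
 * names follow the classic c_example from the MUMPS distribution and
 * may differ slightly between versions. */
#include <stdio.h>
#include <mpi.h>
#include "dmumps_c.h"

#define JOB_INIT                -1
#define JOB_END                 -2
#define JOB_ANALYSE_FACTO_SOLVE  6
#define USE_COMM_WORLD     -987654   /* tells MUMPS to use MPI_COMM_WORLD */

int main(int argc, char **argv) {
  /* Small unsymmetric system in assembled format (1-based indices). */
  int    n = 3, nz = 5;
  int    irn[] = { 1, 1, 2, 3, 3 };
  int    jcn[] = { 1, 3, 2, 1, 3 };
  double a[]   = { 2.0, 1.0, 3.0, 1.0, 4.0 };
  double rhs[] = { 3.0, 3.0, 5.0 };      /* overwritten by the solution */

  DMUMPS_STRUC_C id;
  int myid;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  /* Initialize a MUMPS instance: host working (par=1), unsymmetric (sym=0). */
  id.job = JOB_INIT;  id.par = 1;  id.sym = 0;
  id.comm_fortran = USE_COMM_WORLD;
  dmumps_c(&id);

  /* Define the problem on the host processor only (centralized input). */
  if (myid == 0) {
    id.n = n;  id.nz = nz;
    id.irn = irn;  id.jcn = jcn;  id.a = a;  id.rhs = rhs;
  }

  /* Analysis, factorization and solve in one call. */
  id.job = JOB_ANALYSE_FACTO_SOLVE;
  dmumps_c(&id);

  /* Release the instance. */
  id.job = JOB_END;
  dmumps_c(&id);

  if (myid == 0)
    printf("Solution: %g %g %g\n", rhs[0], rhs[1], rhs[2]);

  MPI_Finalize();
  return 0;
}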
A fully asynchronous distributed solver (VAMPIR trace, 8 processors).
AVAILABILITY
• MUMPS is available free of charge for non-commercial use.
• It has been used on a number of platforms (Cray T3E, Origin 2000, IBM SP, Linux clusters, …) by a few hundred current users (finite elements, chemistry, simulation, aeronautics, …).
• If you are interested in obtaining MUMPS for your own use, please refer to the MUMPS home page.
Competitive performance
The MUMPS package has good performance relative to other parallel sparse solvers; for example, the table below compares it with the SuperLU code of Demmel and Li. These results are taken from “Analysis and comparison of two general sparse solvers for distributed memory computers”, ACM TOMS, 27, 388-421.
Matrix   Ordering    Solver     Number of processors
                                   1       4       8      16      32      64     128
bbmat    AMD         MUMPS        -     44.8    23.6    15.7    12.6    10.1     9.5
bbmat    AMD         SuperLU      -     64.7    36.6    21.3    12.8     9.2     7.2
bbmat    ND(METIS)   MUMPS        -     32.1    10.8    12.3    10.4     9.1     7.8
bbmat    ND(METIS)   SuperLU      -    132.9    72.5    39.8    23.5    15.6    11.1
ecl32    AMD         MUMPS        -     53.1    31.3    20.7    14.7    13.5    12.9
ecl32    AMD         SuperLU      -    106.8    56.7    31.2    18.3    12.3     8.2
ecl32    ND(METIS)   MUMPS        -     23.9    13.4     9.7     6.6     5.6     5.4
ecl32    ND(METIS)   SuperLU      -     48.5    26.6    15.7     9.6     7.6     5.6
Factorisation time in seconds of large matrices on the CRAY T3E (1 proc = not enough memory).

Illustration: BMW car body, 148770 unknowns, 5396386 nonzeros (matrix from MSC.Software).
CURRENT RESEARCH: ACTIVE RESEARCH IS FEEDING THE MUMPS SOFTWARE PLATFORM.
Reorderings and optimization of the memory usage
MUMPS uses state-of-the-art reordering techniques (AMD, AMF, ND, SCOTCH, PORD, METIS). These techniques have a strong impact on the parallelism and on the number of operations, and we are currently studying their impact on the dynamic memory usage of MUMPS. In particular, we designed algorithms to optimize the memory occupation of the multifrontal stack. Future work includes dynamic memory load balancing and the design of an out-of-core version.
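As a concrete illustration of what optimizing the memory occupation of the multifrontal stack means, the sketch below applies the classical tree-traversal argument of Liu to one node of the elimination tree: each child subtree has a peak memory requirement while it is factorized and leaves a contribution block on the stack, so reordering the children changes the overall peak. This is a deliberately simplified illustration (the parent front is ignored and the numbers are invented), not the algorithm actually implemented in MUMPS.

/* Sketch of the classical stack-memory argument for a multifrontal
 * traversal (after Liu); illustration only, not MUMPS code.
 * For one node of the elimination tree, child subtree i needs a peak
 * of peak[i] memory while it is factorized and then leaves a
 * contribution block of cb[i] on the stack.  Processing the children
 * by decreasing (peak[i] - cb[i]) minimizes the overall stack peak. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double peak, cb; } Child;

/* Stack peak when the children are processed in the given order:
 * while child i runs, the contribution blocks of the children that
 * were already processed are still stacked. */
static double stack_peak(const Child *c, int nchild) {
  double stacked = 0.0, peak = 0.0;
  for (int i = 0; i < nchild; i++) {
    double need = stacked + c[i].peak;
    if (need > peak) peak = need;
    stacked += c[i].cb;            /* child i leaves its contribution block */
  }
  return peak;
}

/* Sort children by decreasing (peak - cb). */
static int by_decreasing_margin(const void *pa, const void *pb) {
  double ma = ((const Child *)pa)->peak - ((const Child *)pa)->cb;
  double mb = ((const Child *)pb)->peak - ((const Child *)pb)->cb;
  return (ma < mb) - (ma > mb);
}

int main(void) {
  /* Hypothetical children of one elimination-tree node (arbitrary units). */
  Child c[] = { {10, 8}, {30, 2}, {15, 5}, {25, 20} };
  int nchild = sizeof c / sizeof c[0];

  printf("stack peak, given order: %.0f\n", stack_peak(c, nchild)); /* 40 */
  qsort(c, nchild, sizeof c[0], by_decreasing_margin);
  printf("stack peak, best order : %.0f\n", stack_peak(c, nchild)); /* 37 */
  return 0;
}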
Matrix     Reordering   Percentage of memory decrease
thermal    PORD         73.5
THREAD     METIS        30.4
af23560    AMF          32.2
xenon2     SCOTCH       24.7
rma10      AMD          17.6
Best decrease obtained using our algorithm to reduce the stack, for each reordering technique. Results obtained by A. Guermouche (PhD student in the INRIA ReMaP project).
Platforms with a heterogeneous network (clusters of SMPs)
In the MUMPS scheduling, work is given to processors according to their load. Adding a penalty to the load of processors on a distant node helps keep tasks with high communication on the same node and improves performance, as shown in the table below.
                                           Standard MUMPS   Modified MUMPS
Time for factorization                     49.2 seconds     44.0 seconds
Total volume of communication              3957 MB          3600 MB
Total volume of internode communication    2017 MB          1004 MB
Effect of taking the hybrid network into account. Matrix PRE2, SCOTCH, 2 nodes of 16 processors of an IBM SP. Results obtained by S. Pralet (PhD CERFACS).
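The penalty heuristic described above can be pictured with the toy sketch below. This is purely illustrative (the names, loads and penalty value are invented), not the MUMPS scheduler itself: when a processor chooses where to send a task, the load of processors living on another SMP node is inflated, so that, at comparable load, a processor on the local node is preferred.

/* Toy illustration (not MUMPS code) of load-based selection with a
 * penalty for processors located on a remote SMP node. */
#include <stdio.h>

#define NPROCS 8

int choose_slave(const double load[NPROCS], const int node_of[NPROCS],
                 int master, double penalty) {
  int best = -1;
  double best_cost = 0.0;
  for (int p = 0; p < NPROCS; p++) {
    if (p == master) continue;
    /* Inflate the perceived load of processors on a distant node, so
     * that communication-heavy tasks tend to stay on the master's node. */
    double cost = load[p] + (node_of[p] != node_of[master] ? penalty : 0.0);
    if (best < 0 || cost < best_cost) { best = p; best_cost = cost; }
  }
  return best;
}

int main(void) {
  /* Two 4-processor SMP nodes; loads in arbitrary work units. */
  double load[NPROCS]    = { 5.0, 4.0, 6.0, 3.5, 3.0, 7.0, 2.5, 4.5 };
  int    node_of[NPROCS] = { 0, 0, 0, 0, 1, 1, 1, 1 };
  int    master = 1;   /* master task runs on processor 1 (node 0) */

  printf("no penalty  -> slave on processor %d\n",
         choose_slave(load, node_of, master, 0.0));   /* 6, remote node */
  printf("penalty 2.0 -> slave on processor %d\n",
         choose_slave(load, node_of, master, 2.0));   /* 3, local node  */
  return 0;
}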
Mixing dynamic and static scheduling strategies
MUMPS uses a completely dynamic approach with distributed scheduling and scales well up to around 100 processors. Introducing more static information helps reduce the cost of the dynamic decisions and makes MUMPS more scalable.
Effect of injecting more static information into the dynamic scheduling of MUMPS. Rectangular grids of increasing size, ND. Results obtained by C. Vömel (PhD CERFACS) on a CRAY T3E.
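One common way of injecting static information, sketched below under our own simplifying assumptions (candidate lists and loads are invented, this is not MUMPS source code), is to let the analysis phase attach a small set of candidate processors to each task, so that at run time the dynamic scheduler only chooses among those candidates according to their current load.

/* Toy illustration (not MUMPS code) of mixing static and dynamic
 * scheduling: a static phase restricts each task to a short list of
 * candidate processors, and the dynamic scheduler then picks the least
 * loaded candidate at run time. */
#include <stdio.h>

/* Pick the least loaded processor among the candidates of one task. */
int schedule_task(const double *load, const int *candidates, int ncand) {
  int best = candidates[0];
  for (int i = 1; i < ncand; i++)
    if (load[candidates[i]] < load[best]) best = candidates[i];
  return best;
}

int main(void) {
  double load[8] = { 5.0, 4.0, 6.0, 3.5, 3.0, 7.0, 2.5, 4.5 };

  /* Candidate lists chosen statically (here, arbitrarily) for two tasks. */
  int task_a[] = { 0, 1, 2 };
  int task_b[] = { 4, 5, 6, 7 };

  printf("task A mapped on processor %d\n", schedule_task(load, task_a, 3));
  printf("task B mapped on processor %d\n", schedule_task(load, task_b, 4));
  return 0;
}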
Mixing MPI and OpenMP on clusters of SMPs
We report below on a preliminary experiment with hybrid parallelism on one node (16 processors) of an IBM SP. The best results are obtained when using 8 MPI processes with 2 OpenMP threads each. Regular problem from an 11-point discretization (cubic grid 64x64x64), ND used. Results obtained by S. Pralet (PhD CERFACS).
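The hybrid setting of this experiment (several MPI processes, each running a few OpenMP threads, typically inside threaded computational kernels such as the BLAS) can be sketched as follows. This is a generic illustration of the programming model, not MUMPS source code.

/* Generic sketch of the hybrid MPI + OpenMP model used in the
 * experiment above (e.g. 8 MPI processes x 2 OpenMP threads on a
 * 16-processor SMP node); illustration only. */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
  int provided, rank, size;

  /* Threads are only used inside computational kernels, so funneled
   * thread support is sufficient. */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Each MPI process works with a small team of OpenMP threads
   * (OMP_NUM_THREADS=2 in the experiment above). */
  #pragma omp parallel
  {
    #pragma omp critical
    printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());
  }

  MPI_Finalize();
  return 0;
}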
The MUMPS package has been partially supported by the Esprit IV Project PARASOL and by CERFACS, ENSEEIHT-IRIT, INRIA Rhône-Alpes, LBNL-NERSC, PARALLAB and RAL.
The authors are Patrick Amestoy, Jean-Yves L’Excellent, Iain Duff and Jacko Koster.
Functionalities related to rank-revealing were first implemented by M. Tuma (Institute of Computer Science, Academy of Sciences of the Czech Republic), while he was at CERFACS.
We are also grateful to C. Bousquet, C. Daniel, A. Guermouche, G. Richard, S. Pralet and C. Vömel who have been working on some specific parts of this software.
This poster was prepared by Jean-Yves
L’Excellent ([email protected]).