
OPTIMIZATION OF THE POISSON OPERATOR IN CHOMBO
Razvan Carbunescu, Meriem Ben Salah, Andrew Gearhart

Motivation
CHOMBO is a framework for implementing finite difference methods for the solution of partial differential equations on block-structured, adaptively refined rectangular grids.
CHOMBO provides elliptic and time-dependent modules, as well as support for standardized self-describing file formats.
CHOMBO is architecture and operating system independent.
The main CHOMBO modules are:
BoxTools: calculations over unions of rectangles
AMRTools: communication between refinement levels
AMRElliptic: multigrid solvers for discretized elliptic (Poisson, resistivity, etc.) and parabolic equations
EBTools: embedded boundary discretization
AMRTimeDependent: subcycling of time-dependent calculations
ParticleTools: particle dynamics
For use on parallel platforms, CHOMBO provides only a distributed-memory implementation based on MPI (Message Passing Interface).

Is the use of a distributed-memory implementation always beneficial?


Theoretical Background

Poisson Operator in CHOMBO
The Poisson operator, also known as the Laplacian, is a second-order elliptic differential operator, defined in an n-dimensional Cartesian space by

    \Delta \varphi = \nabla^2 \varphi = \sum_{i=1}^{n} \frac{\partial^2 \varphi}{\partial x_i^2}

The Poisson operator appears in the definition of the Helmholtz differential equation,

    \nabla^2 \varphi + k^2 \varphi = f,

which, for k = 0, reduces to the Poisson equation

    \nabla^2 \varphi = f.

Poisson Solver
The Poisson equation is used in the modeling of various boundary-value physical problems, e.g. the electric potential in electrostatics and potential flow in fluid dynamics. For the sake of a demonstrative exposition, we use the Poisson potential-flow solve performed in the incompressible Navier-Stokes equations: a Poisson solve is conducted at the beginning of an incompressible flow simulation to obtain initial conditions for the evolution of the velocity and pressure. A numerical solution requires the discretization of the continuous Poisson equation, e.g. by the standard centered-difference approximation, as well as a discrete handling of the Dirichlet and Neumann boundary conditions; defining the appropriate boundary conditions allows for the solution of the Poisson problem.

The discrete Poisson operator, the focus of this project, is given by the standard five-point centered-difference stencil on a grid with mesh spacing h:

    (L^h \varphi)_{i,j} = \frac{\varphi_{i+1,j} + \varphi_{i-1,j} + \varphi_{i,j+1} + \varphi_{i,j-1} - 4\varphi_{i,j}}{h^2}

Fig. 1 Flow boundary condition initialization

Project Strategy and Goals
1. Focus on the stencil kernel that applies the Poisson operator to two-dimensional cell-centered data; it is embedded in the AMRElliptic tool, a multigrid-based solver library for elliptic and parabolic equations on adaptive mesh hierarchies (a short C sketch of this kernel is given after this list).
2. Start with the serial and distributed-memory implementations supplied in CHOMBO for reference.
3. Implement the operator for parallel shared-memory architectures.
4. Conduct a parameter study to analyze the benefits and drawbacks of each implementation in terms of computational time.
5. Draw global conclusions on the consequences of the limitations of the parallel implementation in CHOMBO.
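To make the kernel concrete, the following is a minimal C sketch of the five-point stencil applied to one cell-centered box with a single ghost layer, stored in Fortran (column-major) order. The function name, data layout, and ghost-cell convention are illustrative assumptions; this is not CHOMBO's FArrayBox/LevelData API.

    /* Minimal sketch: apply the 2D five-point Poisson stencil to the
     * nx-by-ny interior of a box stored with one ghost cell on each side.
     * Data layout is Fortran order (i fastest): index (i,j) -> i + j*(nx+2). */
    void poisson_stencil_2d(const double *phi, double *lphi,
                            int nx, int ny, double h)
    {
        const int    stride = nx + 2;            /* leading dimension incl. ghosts */
        const double invh2  = 1.0 / (h * h);

        for (int j = 1; j <= ny; ++j) {          /* j outer, i inner: unit stride */
            for (int i = 1; i <= nx; ++i) {
                const int c = i + j * stride;    /* centre cell */
                lphi[c] = (phi[c - 1] + phi[c + 1]            /* west, east   */
                         + phi[c - stride] + phi[c + stride]  /* south, north */
                         - 4.0 * phi[c]) * invh2;
            }
        }
    }

In CHOMBO this kind of kernel is applied box by box over the grid hierarchy, with ghost cells filled beforehand by communication between boxes and refinement levels (cf. the AMRTools module).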
Targeted Architectures
Investigations and Results
Existing Implementations



CHOMBO's implementation is currently tuned for distributed memory: the domain is split into small boxes, on the order of 32^2 cells in 2D and 32^3 cells in 3D, and each box is assigned to a processor that runs serial Fortran 77 code.

Because of the small box size there is not much computational intensity available to exploit with a threaded shared-memory implementation or to hide the cost of transferring the data to the GPU.
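A rough, illustrative estimate of that intensity: a 32 x 32 box contains 1,024 cells, and the five-point stencil costs about 6 floating-point operations per cell, i.e. roughly 6 Kflop per application of the operator to one box, while every cell still has to be loaded and stored (at least 16 bytes per cell even with perfect reuse of neighbouring values). A box is therefore only microseconds of work, on the same order as the overhead of waking a worker thread or initiating a host-to-GPU transfer.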

Fig. 2 Flow Evolution Start-Up After A Poisson Solve
The serial, distributed-memory, and shared-memory implementations of the Poisson solve have been run on Franklin, the NERSC Cray XT4 system, a massively parallel processing system with 9,532 compute nodes (quad-core processors) and 38,128 processor cores. The GPU implementation is being developed on personal Linux boxes as well as on the r56 and r57 nodes of the Millennium cluster.
The number of grid cells in the spatial partition and the number of grid refinements affect the computation of the discrete Poisson stencil, and they are the focus of these test runs. To allow a fair comparison, the number of grid cells and the number of grid refinements are kept equal across the test runs; to account for different loading of the machines, each test was repeated 5 times and the corresponding results were averaged.
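A sketch of the kind of timing harness this corresponds to, with a placeholder run_poisson_solve() standing in for the actual benchmarked solve:

    #include <stdio.h>
    #include <time.h>

    /* Stand-in for the benchmarked Poisson solve; replace with the real call. */
    static void run_poisson_solve(void) { /* ... */ }

    int main(void)
    {
        const int repeats = 5;          /* each configuration repeated 5 times */
        double total = 0.0;

        for (int r = 0; r < repeats; ++r) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            run_poisson_solve();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            total += (double)(t1.tv_sec - t0.tv_sec)
                   + 1e-9 * (double)(t1.tv_nsec - t0.tv_nsec);
        }
        printf("mean runtime over %d runs: %.6f s\n", repeats, total / repeats);
        return 0;
    }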
Fig. 4 Speedup vs. C serial implementation (speedup plotted against ncell)
Our interest



While CHOMBO's code obtains good MPI performance on a distributed-memory system, our study aims to determine whether this speedup can be improved by exploiting locality and the faster access times between on-chip cores.
Figure 3 compares the runtimes of the serial C code and the serial Fortran 77 code. Besides the growth of the runtime with problem size, it is interesting to note that, although the matrix storage is the same (column-major) in Fortran 77 and C, the Fortran 77 code is fastest when its loops are ordered n, j, i, whereas the C code is indexed i, j, n.
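The effect can be illustrated on a column-major (Fortran-ordered) block of nx*ny cells with ncomp components; the layout macro and the two toy traversals below are illustrative sketches, not CHOMBO code. With i fastest in memory, an n, j, i nest touches consecutive addresses in the inner loop, while an i, j, n nest jumps by nx*ny elements per inner iteration.

    /* Fortran-ordered (column-major) block: phi[i + nx*(j + ny*n)], i fastest. */
    #define IDX(i, j, n, nx, ny) ((i) + (nx) * ((j) + (ny) * (n)))

    /* Cache-friendly: n and j outer, i innermost -> unit-stride accesses,
     * matching the n,j,i ordering of the Fortran 77 kernel. */
    void scale_nji(double *phi, int nx, int ny, int ncomp, double a)
    {
        for (int n = 0; n < ncomp; ++n)
            for (int j = 0; j < ny; ++j)
                for (int i = 0; i < nx; ++i)
                    phi[IDX(i, j, n, nx, ny)] *= a;
    }

    /* Cache-unfriendly on the same layout: with n innermost, consecutive
     * inner iterations are nx*ny elements apart in memory. */
    void scale_ijn(double *phi, int nx, int ny, int ncomp, double a)
    {
        for (int i = 0; i < nx; ++i)
            for (int j = 0; j < ny; ++j)
                for (int n = 0; n < ncomp; ++n)
                    phi[IDX(i, j, n, nx, ny)] *= a;
    }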
Our current shared-memory implementation uses only a limited, basic strip-mining technique, since it maintains the abstraction of a data iterator that walks through all the boxes at each step; a more relaxed model, in which time steps could also be blocked, could help further.
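A minimal sketch of strip-mining the box list across POSIX threads, assuming a flat array of equally sized boxes, a fixed thread count, and the hypothetical poisson_stencil_2d kernel sketched earlier; this is not CHOMBO's DataIterator interface.

    #include <pthread.h>
    #include <stddef.h>

    #define NTHREADS 4

    typedef struct {
        double **phi;     /* per-box input data  */
        double **lphi;    /* per-box output data */
        int      nboxes;
        int      nx, ny;
        double   h;
        int      tid;
    } strip_arg_t;

    /* Hypothetical per-box kernel, e.g. the five-point stencil sketched above. */
    void poisson_stencil_2d(const double *phi, double *lphi, int nx, int ny, double h);

    /* Each thread processes one contiguous strip of the box list. */
    static void *apply_strip(void *p)
    {
        strip_arg_t *a = (strip_arg_t *)p;
        int strip = (a->nboxes + NTHREADS - 1) / NTHREADS;
        int lo    = a->tid * strip;
        int hi    = (lo + strip < a->nboxes) ? lo + strip : a->nboxes;

        for (int b = lo; b < hi; ++b)
            poisson_stencil_2d(a->phi[b], a->lphi[b], a->nx, a->ny, a->h);
        return NULL;
    }

    void apply_operator_threaded(double **phi, double **lphi,
                                 int nboxes, int nx, int ny, double h)
    {
        pthread_t   t[NTHREADS];
        strip_arg_t arg[NTHREADS];

        for (int i = 0; i < NTHREADS; ++i) {
            arg[i] = (strip_arg_t){ phi, lphi, nboxes, nx, ny, h, i };
            pthread_create(&t[i], NULL, apply_strip, &arg[i]);
        }
        for (int i = 0; i < NTHREADS; ++i)
            pthread_join(t[i], NULL);
    }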
Figure 4 depicts the speedup of the various parallel implementations with respect to the associated serial C version; C_MPI44 refers to running the MPI version with all 4 cores on one node, and C_MPI41 refers to running it with 4 cores but only 1 core per node.
Another interesting opportunity for speedup is running the
operator on the GPU but the benefits must outweigh the cost of
moving the data onto and off the GPU.
Figure 5 presents the relative speedup of all the codes with
respect to the fastest (Fortran) serial version.
Fig. 5 Speedup vs. Fortran serial implementation (speedup plotted against ncell)


The points above raise the issue of threading: for small boxes, lightweight threads are important because they allow fast context switching and can therefore increase performance. To allow the use of pthreads and CUDA, we implemented a C version of the operator. The particular reason for this choice of study is to enable heterogeneous systems that automatically switch between MPI, pthreads, or CUDA underneath to achieve the best result.
Fig. 3 Fortran vs. C serial implementation (runtime)
Conclusions and Future Work


Contact information:
Meriem Ben Salah: UC Berkeley ME Graduate Student, ParLab, [email protected]
Razvan Corneliu Carbunescu: UC Berkeley CS Graduate Student, ParLab, [email protected]
Andrew Gearhart: UC Berkeley CS Graduate Student, ParLab, [email protected]
James Demmel: UC Berkeley Math & CS Faculty, [email protected]
Phillip Colella: LBNL ANAG, [email protected]
Brian Van Straalen: LBNL ANAG, [email protected]

The shared-memory implementation shows beneficial results compared with the serial C code, but more analysis is required for a fair comparison with the Fortran MPI code.
At the time of this presentation, no GPU results have been completed. It will be interesting to find a sound methodology for comparing the GPU results from Millennium with those of the MPI and shared-memory implementations; since Franklin does not provide GPUs, the GPU test runs have to be validated independently. This work will be reported later in the project document.
The choice of the C implementation clearly led to a loss of computational performance, so a Fortran 77 shared-memory implementation could be attractive. However, a POSIX threads interface to Fortran 77 is not currently available; an improvement on the shared-memory implementation might instead come from an OpenMP version of the Fortran 77 solver code.
Currently our simulations are performed only on the Cray XT4 and the GPU; it would be interesting to conduct these studies on other machine architectures.