Efficient Parallel Implementation of
Molecular Dynamics with Embedded
Atom Method on Multi-core Platforms
Reporter: Jilin Zhang
Authors: Changjun Hu, Yali Liu, and Jianjiang Li
Information Engineering School, University of
Science and Technology Beijing, Beijing, P.R. China
Outline
 1 Motivation
 2 Related Works
 3 Spatial Decomposition Coloring (SDC) Approach
 4 Short-Range Forces Calculations of EAM using SDC method
 5 Experiments and Discussion
 6 Conclusion and Future Directions

1 Motivation
Fig. 1 The process of molecular dynamics simulations: set the initial state, calculate forces, calculate new positions of atoms.

The intensive computation appears in the short-range force
calculation procedure of MD simulations.
 - The neighbor-list method largely decreases this intensive
   computation. It makes each atom interact only with the atoms in
   its neighbor region.
 - Newton's third law can halve the force computations, but it
   brings reduction operations on irregular arrays.
for (i = 0; i < N; i++)
{
    /* neighbors of atom i occupy neighlist[neighstart .. neighend-1] */
    neighstart = neighindex[i];
    neighend = neighstart + neighlen[i];
    for (k = neighstart; k < neighend; k++)
    {
        j = neighlist[k];
        xd = Coord[j][X] - Coord[i][X];
        yd = Coord[j][Y] - Coord[i][Y];
        zd = Coord[j][Z] - Coord[i][Z];
        …
        forc = …
        force[i][X] += forc*xd;
        force[i][Y] += forc*yd;
        force[i][Z] += forc*zd;
        /* Newton's third law: opposite update on atom j (irregular reduction) */
        force[j][X] -= forc*xd;
        force[j][Y] -= forc*yd;
        force[j][Z] -= forc*zd;
    }
}
Fig. 2 Code of force calculations.
2 Related Works
--- parallel reduction operations on irregular arrays

Some types of solutions
 - enclosing the reduction operation in a critical section
 - privatizing the reduction array
 - using a redundant computations strategy

Enclosing the reduction operation in a critical section
 - creates a critical section in the inner loop
 - straightforward and easy to parallelize
 - high synchronization cost, caused by the critical region, atomic
   operation or lock in the inner loop
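
A minimal sketch of this approach, using the array names of Fig. 2. It protects each update with `#pragma omp atomic`, one of the synchronization forms named above. The function name `forces_atomic` and the pair term `toy_forc` are hypothetical placeholders (the slides elide the real force expression), and a half neighbor list is assumed.

```c
/* Hypothetical pair term standing in for the elided "forc = ..." of Fig. 2. */
static double toy_forc(double r2) { return 4.0 - r2; }

/* Half neighbor list assumed: each pair (i, j) is stored once, under atom i. */
void forces_atomic(int N, const int *neighindex, const int *neighlen,
                   const int *neighlist, double Coord[][3], double force[][3])
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        int neighend = neighindex[i] + neighlen[i];
        for (int k = neighindex[i]; k < neighend; k++) {
            int j = neighlist[k];
            double d[3], r2 = 0.0;
            for (int c = 0; c < 3; c++) {
                d[c] = Coord[j][c] - Coord[i][c];
                r2 += d[c] * d[c];
            }
            double forc = toy_forc(r2);
            for (int c = 0; c < 3; c++) {
                /* Both updates are protected: atom i can also be the
                   neighbor j of an atom handled by another thread. */
                #pragma omp atomic
                force[i][c] += forc * d[c];
                #pragma omp atomic
                force[j][c] -= forc * d[c];
            }
        }
    }
}
```

Every pair interaction pays for six synchronized updates in the inner loop, which is the cost the slide refers to.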

Privatizing the reduction array
 - each thread has to update the shared array in a critical region
   according to the values of its private array
 - it reduces the number of entries into the critical region and
   so reduces the synchronization cost
 - high memory overhead of the private arrays
 - the memory overhead limits the number of particles allowed in simulations
 - the private arrays compete for cache space and decrease program
   speed
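
A minimal sketch of array privatization under the same assumptions as Fig. 2 (half neighbor list, same array names). `forces_privatized` and `toy_forc` are hypothetical names, and the per-thread copy is allocated with `calloc` purely for illustration.

```c
#include <stdlib.h>

/* Hypothetical pair term standing in for the elided "forc = ..." of Fig. 2. */
static double toy_forc(double r2) { return 4.0 - r2; }

void forces_privatized(int N, const int *neighindex, const int *neighlen,
                       const int *neighlist, double Coord[][3], double force[][3])
{
    if (N <= 0) return;

    #pragma omp parallel
    {
        /* Private copy of the whole force array: 3*N doubles per thread. */
        double (*priv)[3] = calloc((size_t)N, sizeof *priv);
        if (priv == NULL) abort();

        #pragma omp for schedule(static)
        for (int i = 0; i < N; i++) {
            int neighend = neighindex[i] + neighlen[i];
            for (int k = neighindex[i]; k < neighend; k++) {
                int j = neighlist[k];
                double d[3], r2 = 0.0;
                for (int c = 0; c < 3; c++) {
                    d[c] = Coord[j][c] - Coord[i][c];
                    r2 += d[c] * d[c];
                }
                double forc = toy_forc(r2);
                for (int c = 0; c < 3; c++) {
                    priv[i][c] += forc * d[c];   /* no synchronization needed */
                    priv[j][c] -= forc * d[c];
                }
            }
        }

        /* One critical region per thread instead of one per pair. */
        #pragma omp critical
        for (int i = 0; i < N; i++)
            for (int c = 0; c < 3; c++)
                force[i][c] += priv[i][c];

        free(priv);
    }
}
```

The memory cost grows with the number of threads times the number of atoms, and the final merge is serialized.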

Redundant computations strategy
 - does not use Newton's third law, so each pair interaction has
   to be calculated twice
 - high parallelizability, since the data dependence between the
   loop iterations has been removed
 - the computation is doubled and the neighbor list requires more
   memory space
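
A minimal sketch of the redundant computations strategy. It assumes a full neighbor list (every atom lists all of its neighbors, so each pair appears twice); `forces_redundant` and `toy_forc` are hypothetical names.

```c
/* Hypothetical pair term standing in for the elided "forc = ..." of Fig. 2. */
static double toy_forc(double r2) { return 4.0 - r2; }

/* Full neighbor list assumed: pair (i, j) is stored under i and under j. */
void forces_redundant(int N, const int *neighindex, const int *neighlen,
                      const int *neighlist, double Coord[][3], double force[][3])
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        int neighend = neighindex[i] + neighlen[i];
        for (int k = neighindex[i]; k < neighend; k++) {
            int j = neighlist[k];
            double d[3], r2 = 0.0;
            for (int c = 0; c < 3; c++) {
                d[c] = Coord[j][c] - Coord[i][c];
                r2 += d[c] * d[c];
            }
            double forc = toy_forc(r2);
            /* Only force[i] is written, so iterations are independent. */
            for (int c = 0; c < 3; c++)
                force[i][c] += forc * d[c];
        }
    }
}
```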
3 Spatial Decomposition Coloring
(SDC) Approach

Spatial Decomposition (SD) method
 - targets distributed-memory multiprocessors involving several
   hundred processors
 - there, it requires changing all array declarations and all loop
   bounds, and explicitly coding the periodic transfer of the
   boundary data between processors
 - It is simple to implement SD in OpenMP.

The SD method places a restriction on parallelism in OpenMP.
 - Synchronization is required to ensure that multiple threads do
   not attempt to update the same atom simultaneously.
Fig. 3 SD method (numbered subdomains with the distance rc marked).

SDC method
 The SDC method consists of the following steps:
 - Step 1): Split domain
 - Step 2): Coloring subdomains
 - Step 3): Parallel computing

Step 1): Split domain
 - Split the spatial domain into subdomains.
 - The length of a subdomain must be longer than the diameter.
 - The number of subdomains in each decomposed dimension should be
   even.
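
A minimal sketch of the split rule for one dimension. The slide does not define "diameter"; the sketch takes it as a parameter `min_len` (for example twice the cutoff radius rc, which is an assumption). The function names `even_subdomain_count` and `subdomain_of` are hypothetical.

```c
/* Largest even number of subdomains whose length is strictly longer than
   min_len. Returns 0 if the box is too small to decompose in this dimension. */
int even_subdomain_count(double box_len, double min_len)
{
    if (box_len <= 0.0 || min_len <= 0.0)
        return 0;
    int n = (int)(box_len / min_len);
    while (n > 0 && box_len / n <= min_len)   /* enforce "strictly longer" */
        n--;
    return n - n % 2;                         /* round down to an even count */
}

/* Index of the subdomain containing coordinate x, for 0 <= x < box_len. */
int subdomain_of(double x, double box_len, int n)
{
    int idx = (int)(x / box_len * n);
    if (idx < 0) idx = 0;
    if (idx >= n) idx = n - 1;                /* guard against rounding at the edge */
    return idx;
}
```

For a box of length 100 and `min_len` 10, ten subdomains would have length exactly 10, which is not longer than `min_len`, so the count drops to 9 and then to the even value 8.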

Step 2): Coloring subdomains
 - The number of subdomains with each color must be equal.
 - Each subdomain is surrounded only by subdomains with different
   colors.
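
One coloring that satisfies both rules, shown as a sketch: a parity-based scheme with one parity bit per decomposed dimension, giving 2, 4 or 8 colors for one-, two- or three-dimensional decomposition. This scheme is an assumption for illustration; the slides do not spell out the coloring rule. `subdomain_color` and `color_count` are hypothetical names.

```c
/* Color of subdomain (ix, iy, iz); indices must be non-negative.
   dims is the number of decomposed dimensions (1, 2 or 3). */
int subdomain_color(int ix, int iy, int iz, int dims)
{
    int color = ix % 2;
    if (dims >= 2) color += 2 * (iy % 2);
    if (dims >= 3) color += 4 * (iz % 2);
    return color;
}

/* Number of colors used by the parity scheme. */
int color_count(int dims)
{
    return 1 << dims;
}
```

Moving to any adjacent subdomain flips at least one parity bit, so neighbors always differ in color. Because the subdomain count is even in every decomposed dimension, this also holds across periodic boundaries and every color gets the same number of subdomains.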

Step 3): Parallel computing
 - Calculations of forces on subdomains with one color can be run
   in parallel.
 - A barrier is needed to wait for all threads to complete the
   computation on this color.
 - Calculations on subdomains with different colors must run in a
   serial fashion.
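
A toy sketch of this schedule for a one-dimensional periodic decomposition with two colors. Each subdomain is modeled as two half-cells, and a subdomain writes to its own halves plus the nearest half of each neighbor, mimicking force updates that reach across a boundary by less than half a subdomain. The function `touch_by_color` and the `hits` array are hypothetical.

```c
/* hits must hold 2 * subdomains entries; subdomains must be even.
   Cell 2*s is the left half of subdomain s, cell 2*s+1 its right half. */
void touch_by_color(int subdomains, int hits[])
{
    const int colors = 2;   /* one-dimensional decomposition */

    #pragma omp parallel
    for (int cpart = 0; cpart < colors; cpart++) {
        /* Subdomains of one color run in parallel. */
        #pragma omp for schedule(static)
        for (int spart = cpart; spart < subdomains; spart += colors) {
            int left  = (spart + subdomains - 1) % subdomains;
            int right = (spart + 1) % subdomains;
            hits[2 * spart]     += 1;   /* own left half */
            hits[2 * spart + 1] += 1;   /* own right half */
            hits[2 * left + 1]  += 1;   /* right half of the left neighbor */
            hits[2 * right]     += 1;   /* left half of the right neighbor */
        }
        /* The implicit barrier at the end of "omp for" separates colors. */
    }
}
```

Two subdomains of the same color share a neighbor but touch different halves of it, so no cell is written by two threads at once. After both colors have run, every cell has been updated exactly twice.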

SDC method
 Advantages
 - The neighbor list is usually not updated in every timestep, so
   the cost of the SDC method is very low.
 - A higher-dimensional decomposition creates more subdomains,
   which makes the method scalable and suitable for multi-core and
   many-core architectures.
 Disadvantage
 - Load imbalance, as in the Spatial Decomposition method, unless
   the simulation system has a uniform density.

4 Short-Range Forces Calculations
of EAM using SDC method

EAM method
 - short-range forces
 - the intensive computation
 - three computational phases
 - the most time-consuming parts are phases 1 and 3

Phase 1:
$$\rho_i = \sum_{j \neq i}^{N} \phi(r_{ij})$$
Phase 2:
$$F'(\rho_i)$$
Phase 3:
$$\mathbf{F}_i = -\sum_{j \neq i}^{N} \left( V'(r_{ij}) + F'(\rho_i)\,\rho'_{ij} + F'(\rho_j)\,\rho'_{ji} \right) \mathbf{r}_{ij}$$
Fig. 4 Short-range forces in the EAM method.

The parallel procedure of short-range force calculations using
the SDC method
 1) Run the electron density computations using the SDC method
 2) Calculate the embedding function values and their derivatives
    in parallel
 3) Run the force calculations using the SDC method
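
A serial sketch of the three phases, to show why phases 1 and 3 are pair loops with the same reduction pattern while phase 2 is a simple per-atom loop. All functional forms here (`phi`, `dV_over_r`, `dF`, the cutoff `RC2`) are hypothetical toy choices, not the Fe potential used in the experiments. An all-pairs loop replaces the neighbor list to keep the example short.

```c
#define RC2 4.0   /* hypothetical squared cutoff radius */

/* Hypothetical toy functions, written in terms of r^2 to avoid sqrt. */
static double phi(double r2)         { double t = RC2 - r2; return t * t; } /* density term */
static double dphi_over_r(double r2) { return -4.0 * (RC2 - r2); }          /* phi'(r) / r */
static double dV_over_r(double r2)   { return -4.0 * (RC2 - r2); }          /* V'(r) / r, with V = phi */
static double dF(double rho)         { return 2.0 * rho; }                  /* F'(rho), with F = rho^2 */

/* Squared distance between atoms i and j; d = Coord[j] - Coord[i] as in Fig. 2. */
static double dist2(double Coord[][3], int i, int j, double d[3])
{
    double r2 = 0.0;
    for (int c = 0; c < 3; c++) {
        d[c] = Coord[j][c] - Coord[i][c];
        r2 += d[c] * d[c];
    }
    return r2;
}

void eam_forces(int N, double Coord[][3], double rho[], double dFrho[],
                double force[][3])
{
    double d[3];

    /* Phase 1: electron density, a pair loop with updates on both i and j. */
    for (int i = 0; i < N; i++) {
        rho[i] = 0.0;
        force[i][0] = force[i][1] = force[i][2] = 0.0;
    }
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            double r2 = dist2(Coord, i, j, d);
            if (r2 < RC2) {
                rho[i] += phi(r2);
                rho[j] += phi(r2);
            }
        }

    /* Phase 2: derivative of the embedding function, independent per atom. */
    for (int i = 0; i < N; i++)
        dFrho[i] = dF(rho[i]);

    /* Phase 3: forces, again a pair loop with updates on both i and j. */
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++) {
            double r2 = dist2(Coord, i, j, d);
            if (r2 >= RC2) continue;
            double forc = dV_over_r(r2) + (dFrho[i] + dFrho[j]) * dphi_over_r(r2);
            for (int c = 0; c < 3; c++) {
                force[i][c] += forc * d[c];
                force[j][c] -= forc * d[c];
            }
        }
}
```

Phase 3 needs the finished values from phase 2 for both atoms of a pair, so the phases cannot be fused into one loop.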

Force calculations based on the SDC method
 - L1: computations on subdomains with different colors
 - L2: computations on subdomains with the same color
 - L3: deals with all atoms that constitute a subdomain
 - L4: deals with the neighbors of an atom

/* L1: colors are processed one after another */
#pragma omp parallel private(cpart)
for (cpart = 0; cpart < colors; cpart++)
{
    ...
    /* L2: subdomains of the current color are shared among threads */
    #pragma omp for private(spart,i,j,k,…)
    for (spart = cpart; spart < subdomains; spart += colors)
        /* L3: all atoms of subdomain spart */
        for (ipart = pstart[spart]; ipart < pstart[spart+1]; ipart++)
        {
            i = partindex[ipart];
            neighstart = neighindex[i];
            neighend = neighstart + neighlen[i];
            /* L4: neighbors of atom i */
            for (k = neighstart; k < neighend; k++)
            {
                j = neighlist[k];
                …
                forc = …
                force[i][X] += forc*xd;
                force[i][Y] += forc*yd;
                force[i][Z] += forc*zd;
                force[j][X] -= forc*xd;
                force[j][Y] -= forc*yd;
                force[j][Z] -= forc*zd;
            }
        }
}
Fig. 5 Force calculations using SDC.
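
A compilable sketch of the loop nest in Fig. 5, with the elided parts filled by a hypothetical pair term `toy_forc`. It assumes that subdomains are ordered so that subdomain `spart` has color `spart % colors`, as the loop bounds of Fig. 5 suggest, and that `pstart` has `subdomains + 1` entries. The function name `forces_sdc` is illustrative.

```c
/* Hypothetical pair term standing in for the elided "forc = ..." of Fig. 5. */
static double toy_forc(double r2) { return 4.0 - r2; }

void forces_sdc(int colors, int subdomains, const int *pstart,
                const int *partindex, const int *neighindex,
                const int *neighlen, const int *neighlist,
                double Coord[][3], double force[][3])
{
    #pragma omp parallel
    for (int cpart = 0; cpart < colors; cpart++) {            /* L1 */
        /* The implicit barrier of "omp for" ends each color. */
        #pragma omp for schedule(static)
        for (int spart = cpart; spart < subdomains; spart += colors) {   /* L2 */
            for (int ipart = pstart[spart]; ipart < pstart[spart + 1]; ipart++) {  /* L3 */
                int i = partindex[ipart];
                int neighend = neighindex[i] + neighlen[i];
                for (int k = neighindex[i]; k < neighend; k++) {         /* L4 */
                    int j = neighlist[k];
                    double d[3], r2 = 0.0;
                    for (int c = 0; c < 3; c++) {
                        d[c] = Coord[j][c] - Coord[i][c];
                        r2 += d[c] * d[c];
                    }
                    double forc = toy_forc(r2);
                    for (int c = 0; c < 3; c++) {
                        /* No atomic or critical: the coloring keeps threads
                           from touching the same atom within one color. */
                        force[i][c] += forc * d[c];
                        force[j][c] -= forc * d[c];
                    }
                }
            }
        }
    }
}
```

The updates are safe without synchronization only if the decomposition satisfies the rules of Steps 1 and 2; the function itself does not check them.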
5 Experiments and Discussion

Experimental environment
 - Four Intel Xeon(R) Quad-core E7320 (L2 Cache 4MB) processors,
   16 GB memory
 - OS is Fedora release 9 with kernel 2.6.25. The compiler is gcc
   4.3.0.

Experimental cases
 - observe micro-deformation behaviors of pure Fe metal material
   (taken from the XMD program)
 - under periodic boundary conditions
 - initial state: body-centered cubic (bcc) lattice arrangement

Test cases
 - Small-scale case (1):     54,000 atoms
 - Medium-scale case (2):   265,302 atoms
 - Large-scale case (3):  1,062,882 atoms
 - Large-scale case (4):  3,456,000 atoms
Table 1. The speedups of Spatial Decomposition Coloring (SDC) methods

Small case (1) on 2~16 cores
Speedup               2      3      4      8     12     16
SDC (one-dim)      1.71   2.46   3.07   4.17
SDC (two-dim)      1.70   2.46   3.07   4.74   5.90
SDC (three-dim)    1.66   2.40   2.99   4.61   5.74   6.30

Medium case (2) on 2~16 cores
Speedup               2      3      4      8     12     16
SDC (one-dim)      1.84   2.64   3.37   6.24   6.33   6.43
SDC (two-dim)      1.84   2.65   3.39   6.20   8.89  10.90
SDC (three-dim)    1.82   2.65   3.36   6.16   8.76  10.78

Large case (3) on 2~16 cores
Speedup               2      3      4      8     12     16
SDC (one-dim)      1.86   2.76   3.67   6.82   9.76   9.59
SDC (two-dim)      1.87   2.78   3.64   6.74   9.73  12.31
SDC (three-dim)    1.86   2.75   3.64   6.64   9.65  12.29

Large case (4) on 2~16 cores
Speedup               2      3      4      8     12     16
SDC (one-dim)      1.88   2.79   3.66   6.30   9.97   9.82
SDC (two-dim)      1.87   2.80   3.65   6.77   9.84  12.42
SDC (three-dim)    1.87   2.80   3.67   6.74   9.82  12.34
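
To compare rows across core counts, the speedups can be converted to parallel efficiency. The definition below is the usual one; the symbols E, S and p are introduced here and do not appear in the slides. The worked values use the 16-core entries of large case (4).

```latex
E(p) = \frac{S(p)}{p}, \qquad
E_{\text{two-dim}}(16) = \frac{12.42}{16} \approx 0.78, \qquad
E_{\text{one-dim}}(16) = \frac{9.82}{16} \approx 0.61
```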

Scalability of our SDC method
 - The performance of the multi-dimensional SDC method improves
   with the increase in the number of cores and the increase in the
   number of atoms.

Performance of the SDC methods
 - The two-dimensional SDC method achieves the highest efficiency.
 - The two-dimensional decomposition algorithm strives to make
   subdomains with small surface area and large volume, which
   results in better cache locality compared to the one-dimensional
   decomposition strategy.
 - The three-dimensional SDC method slightly degrades the
   performance due to the greater overhead of thread fork-join and
   scheduling.
Fig. 6 The speedup of the two-dimensional Spatial Decomposition Coloring
(SDC) method, Critical Section (CS) method, Share Array Privatization
(SAP) method and Redundant Computations (RC) method on the small case (1),
medium case (2) and large cases (3) and (4). Horizontal axis: number of
cores (2 to 16). Vertical axis: speedup.

The SDC method achieves a nearly linear speedup and a higher
speedup than the other methods.
 - The reason for the nearly linear speedup is that the low
   synchronization cost of implicit barriers in our method can be
   amortized over a large amount of computation.

CS method
 - Achieves the lowest efficiency. The CS method encloses the
   reduction operations on irregular arrays in a critical section.

SAP method
 - Performance degrades as the number of executing cores increases,
   because of memory overhead plus synchronization overhead.

RC vs. SDC
 - The RC method does nearly twice the computation work of the SDC
   method for the short-range force calculations, so the efficiency
   of the RC method is lower than that of the SDC method.
6 Conclusion and Future Directions

A scalable spatial decomposition coloring (SDC) method
 - solves a class of short-range force calculation problems on
   shared-memory multi-core platforms
 - is scalable not only to large simulation systems but also to
   many-core architectures

Future directions
 - To study the SDC method on NUMA memory architectures
 - To implement the SDC method using MPI+OpenMP on multi-core
   clusters
Thank You !