
Running in Parallel: Theory and Practice
Julian Gale
Department of Chemistry
Imperial College
Why Run in Parallel?
• Increase real-time performance
• Allow larger calculations :
  - usually memory is the critical factor
  - distributed memory essential for all significant arrays
• Several possible mechanisms for parallelism :
  - MPI / PVM / OpenMP
Parallel Strategies
• Massive parallelism :
- distribute according to spatial location
- large systems (non-overlapping regions)
- large numbers of processors
• Modest parallelism :
- distribute by orbital index / K point
- spatially compact systems
- spatially inhomogeneous systems
- small numbers of processors
• Replica parallelism (transition states / phonons)
S. Itoh, P. Ordejón and R.M. Martin, CPC, 88, 173 (1995)
A. Canning, G. Galli, F. Mauri, A. de Vita and R. Car, CPC, 94, 89 (1996)
D.R. Bowler, T. Miyazaki and M. Gillan, CPC, 137, 255 (2001)
Key Steps in Calculation
• Calculating H (and S) matrices
- Hartree potential
- Exchange-correlation potential
- Kinetic / overlap / pseudopotentials
• Solving for self-consistent solution
- Diagonalisation
- Order N
One/Two-Centre Integrals
• Integrals evaluated directly in real space
• Orbitals distributed according to 1-D block cyclic scheme
[Figure: 1-D block-cyclic mapping of orbitals 1-10 onto nodes 0 and 1 with blocksize = 4]
• Each node calculates the integrals relevant to its local orbitals
• Presently the numerical tabulations are set up in duplicate on each node
16384 atoms of Si on 4 nodes :
  Kinetic energy integrals   =   45 s
  Overlap integrals          =   43 s
  Non-local pseudopotential  =  136 s
  Mesh                       = 2213 s
Sparse Matrices
• Order N memory
• Compressed 2-D storage
• Compressed 1-D storage
Parallel Mesh Operations
• Spatial decomposition of mesh
• 2-D blocked in y/z
• Map orbitals to mesh distribution
• Perform parallel FFT -> Hartree potential
• XC calculation only involves local communication
• Map mesh back to orbitals
[Figure: 2-D blocked distribution of the mesh over processors 0-11]
Distribution of Processors
• Better to divide work in the y direction than in z
• Command: ProcessorY
Example:
• 8 nodes
• ProcessorY 4
• 4(y) x 2(z) grid of nodes
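As an illustration, this example grid corresponds to a single keyword line in the fdf input (a minimal fragment based on the values above; the comment is added for clarity):

    # 8 nodes: request a 4(y) x 2(z) processor grid
    ProcessorY  4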
Diagonalisation
• H and S stored as sparse matrices
• Solve generalised eigenvalue problem
• Currently convert back to dense form
• Direct sparse solution is possible
  - sparse solvers exist for the standard eigenvalue problem
  - main issue is sparse factorisation
Dense Parallel Diagonalisation
[Figure: 1-D block-cyclic distribution of the dense matrix over nodes 0 and 1]
Two options :
- Scalapack
- Block Jacobi (Ian Bush, Daresbury)
- Scaling vs absolute performance
1-D block cyclic (blocksize ≈ 12 - 20)
Command: BlockSize
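For example, a value inside the quoted range would be set with one fdf line (16 is only an illustrative choice, not a recommendation):

    # block-cyclic distribution of orbitals in blocks of 16
    BlockSize  16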
Order N
Kim, Mauri, Galli functional :

  E_{bs} = 2 \sum_{i,j}^{occ} ( H_{ij} - \eta S_{ij} ) ( 2\delta_{ij} - S_{ij} ) + \eta N

  H_{ij} = \sum_{\mu\nu} C_{\mu i} h_{\mu\nu} C_{\nu j} = \sum_{\mu} C_{\mu i} F_{\mu j}

  S_{ij} = \sum_{\mu\nu} C_{\mu i} s_{\mu\nu} C_{\nu j} = \sum_{\mu} C_{\mu i} F^{s}_{\mu j}

  G_{\mu i} = \partial E_{bs} / \partial C_{\mu i} = 4 F_{\mu i} - 2 \sum_{j} F^{s}_{\mu j} H_{ji} - 2 \sum_{j} F_{\mu j} S_{ji}
Order N
• Direct minimisation of band structure energy
  => coefficients of orbitals in Wannier functions
• Three basic operations :
  - calculation of gradient
  - 3 point extrapolation of energy
  - density matrix build
• Sparse matrices : C, G, H, S, h, s, F, Fs
  => localisation radius
• Arrays distributed by right-hand-side index :
  - nbasis or nbands
Putting it into practice…
• Model test system = bulk Si (a = 5.43 Å)
• Conditions as previous scalar runs
• Single-zeta basis set
• Mesh cut-off = 40 Ry
• Localisation radius = 5.0 Bohr
• Kim / Mauri / Galli functional
• Energy shift = 0.02 Ry
• Order N calculations -> 1 SCF cycle / 2 iterations
• Calculations performed on SGI R12000 / 300 MHz
• “Green” at CSAR / Manchester Computing Centre
Scaling of Time with System Size
[Plot: time (s) vs number of atoms up to ~150,000, on 32 processors]
Scaling of Memory with System Size
[Plot: peak memory (MB) vs number of atoms up to ~150,000]
NB : Memory is per processor
Parallel Performance on Mesh
• 16384 atoms of Si / Mesh = 180 x 360 x 360
• Mean time per call
• Loss of performance is due to the orbital-to-mesh mapping
  (the XC calculation alone shows perfect scaling for LDA)
[Plot: mean time per call (s) vs number of processors, up to 80]
Parallel Performance in Order N
• 16384 atoms of Si / Mesh = 180 x 360 x 360
• Mean total time per call in 3 point energy calculation
• Minimum memory algorithm
• Needs spatial decomposition to limit internode communication
[Plot: mean total time per call (s) vs number of processors, up to 80]
Installing Parallel SIESTA
• What you need:
  - f90
  - MPI
  - scalapack
  - blacs
  - blas (also needed for serial runs)
  - lapack (also needed for serial runs)
• Usually already installed on parallel machines
• Source / prebuilt binaries from www.netlib.org
• If compiling, look out for f90/C cross-compatibility
• arch.make - available for several parallel machines
Running Parallel SIESTA
• To run a parallel job:
    mpirun -np 4 siesta < job.fdf > job.out
  where "-np 4" sets the number of processors
• On some sites you must use "prun" instead
• Notes:
  - generally must run in queues
  - copy files onto the local disk of the run machine
  - times reported in the output are summed over nodes
  - times can be erratic (Green/Fermat)
Useful Parallel Options
• ParallelOverK :
Distribute K points over nodes - good for metals
• ProcessorY :
Sets dimension of processor grid in Y direction
• BlockSize :
Sets size of blocks into which orbitals are divided
• DiagMemory :
  Controls memory available for diagonalisation.
  Memory required depends on the clustering of eigenvalues.
  See also DiagScale / TryMemoryIncrease
• DirectPhi :
  Phi values are calculated on the fly - saves memory
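Put together, these keywords are plain lines in the fdf input; a sketch with illustrative values only (the numbers and logicals below are placeholders, not recommendations):

    ParallelOverK  true    # distribute k-points over nodes (good for metals)
    ProcessorY     4       # y dimension of the processor grid
    BlockSize      16      # orbitals per block in the block-cyclic distribution
    DirectPhi      true    # calculate phi values on the fly (saves memory)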
Why does my job run like a dead donkey?
• Poor load balance between nodes:
Alter BlockSize / ProcessorY
• I/O is too slow:
Could set “WriteDM false”
• Job is swapping like crazy:
Set “DirectPhi true”
• Scaling with increasing number of nodes is poor:
Run a bigger job!!
• General problems with parallelism:
Latency / bandwidth
Linux clusters with a 100 Mbit Ethernet switch - forget it!
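For the I/O and memory symptoms above, the two remedies mentioned correspond to fdf lines such as these (an illustrative fragment):

    WriteDM    false    # avoid writing the density matrix file (reduces I/O)
    DirectPhi  true     # calculate phi values on the fly (reduces memory)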