Transcript Matt McKenzie, LONI presentation
Preliminary CPMD Benchmarks
On Ranger, Pople, and Abe
TG AUS Materials Science Project
Matt McKenzie, LONI
What is CPMD ?
• Car-Parrinello Molecular Dynamics
  ▫ www.cpmd.org
• Parallelized plane-wave / pseudopotential implementation of Density Functional Theory
• Common chemical systems: liquids, solids, interfaces, gas clusters, reactions
  ▫ Large systems: ~500 atoms
  ▫ Scales with the # of electrons, NOT the # of atoms
Key Points in Optimizing CPMD
• The developers have already done a lot of optimization work
• The Intel compiler is used in this study
• BLAS/LAPACK
  ▫ BLAS level 1 (vector ops) and level 3 (matrix-matrix ops)
  ▫ Some level 2 (matrix-vector ops)
• Integrated optimized FFT library
  ▫ Compiler flag: -DFFT_DEFAULT
Benchmarking CPMD is difficult because…
• Nature of the modeled chemical system
  ▫ Solids, liquids, and interfaces require different parameters, stressing memory along the way
  ▫ Volume and # of electrons
• Choice of the pseudopotential (psp)
  ▫ Norm-conserving, ‘soft’, non-linear core correction (++memory)
• Type of simulation conducted
  ▫ CPMD, BOMD, Path Integral, Simulated Annealing, etc.
  ▫ CPMD is a robust code
• Very chemical-system specific
  ▫ Any one CPMD simulation cannot be easily compared to another
  ▫ However, THERE ARE TRENDS
• FOCUS: simple wave function optimization timing
  ▫ This is a common ab initio calculation
Probing Memory Limitations
• For any ab initio calculation:
  ▫ Accuracy is proportional to the number of basis functions used
  ▫ These are stored in matrices, requiring increased RAM
• The energy cutoff determines the size of the plane-wave basis set:

    N_PW = (1/2π²) Ω E_cut^(3/2)
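As a sanity check, the formula can be evaluated against the two cutoffs used in these benchmarks (50 and 70 Ryd). This is a minimal sketch in atomic units (Ω in bohr³, E_cut in Hartree); the cell volume is back-solved here from the quoted N_PW ≈ 134,000 at 50 Ryd, since the actual benchmark volume is not given in the slides:

```python
import math

def n_pw(volume_bohr3, e_cut_hartree):
    """Plane-wave count: N_PW = (1/(2*pi^2)) * Omega * E_cut^(3/2), atomic units."""
    return volume_bohr3 * e_cut_hartree ** 1.5 / (2 * math.pi ** 2)

# Cutoffs from the benchmark, converted from Rydberg to Hartree (1 Ha = 2 Ryd).
e50, e70 = 50 / 2, 70 / 2

# Hypothetical cell volume, back-solved so N_PW(50 Ryd) matches the quoted
# ~134,000 plane waves; this is an assumption, not a value from the slides.
omega = 134_000 * 2 * math.pi ** 2 / e50 ** 1.5

# Since N_PW scales as E_cut^(3/2), the ratio is volume-independent.
print(f"N_PW(70)/N_PW(50) = {n_pw(omega, e70) / n_pw(omega, e50):.3f}")  # 1.657
print(f"N_PW(70 Ryd) ≈ {n_pw(omega, e70):,.0f}")
```

The 70/50 ratio, (70/50)^(3/2) ≈ 1.66, is consistent with the quoted counts on the next slide (222,000 / 134,000 ≈ 1.66).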
Model Accuracy & Memory Overview
[Figure: pseudopotential convergence behavior w.r.t. basis-set size (cutoff); image obtained from the CPMD user’s manual]
• NOTE: Choice of psp is important
  ▫ i.e. a ‘softer’ psp = lower cutoff = loss of transferability
  ▫ VASP specializes in soft psp’s; CPMD works with any psp
Memory Comparison
Ψ optimization, 63 Si atoms, SGS psp (a well-known CPMD benchmarking model: www.cpmd.org)

  Ecut = 50 Ryd: N_PW ≈ 134,000, Memory = 1.0 GB
  Ecut = 70 Ryd: N_PW ≈ 222,000, Memory = 1.8 GB
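From these two data points one can check how memory tracks basis-set size. The arithmetic below is my own quick comparison, not from the slides; it shows memory growing roughly linearly with the plane-wave count:

```python
# Benchmark data points from the slide: cutoff (Ryd) -> (N_PW, memory in GB).
points = {50: (134_000, 1.0), 70: (222_000, 1.8)}

npw_ratio = points[70][0] / points[50][0]   # growth in basis-set size
mem_ratio = points[70][1] / points[50][1]   # growth in memory footprint
print(f"N_PW grows by {npw_ratio:.2f}x, memory by {mem_ratio:.2f}x")

# Per-plane-wave memory, as a rough linearity check.
for ryd, (npw, gb) in points.items():
    print(f"{ryd} Ryd: {gb * 1e6 / npw:.1f} kB per plane wave")
```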
Typical results / interpretations, nothing new here. Results can be shown either by:
• Wall time = (n steps × iteration time/step) + network overhead
• Iteration time = the fundamental unit, used throughout any given CPMD calculation
  ▫ It neglects the network, yet results are comparable
• Note: CPMD runs well on a few nodes connected with gigabit ethernet
• Two important factors affect CPMD performance:
  MEMORY BANDWIDTH and FLOATING-POINT
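The wall-time decomposition above can be sketched in one line; the timing values below are hypothetical placeholders purely for illustration, not measured data:

```python
def wall_time(n_steps, t_iter, network_overhead=0.0):
    """Wall time = (n steps * iteration time/step) + network overhead."""
    return n_steps * t_iter + network_overhead

# Hypothetical example: 100 steps at 3.2 s/step plus 12 s of network overhead.
print(wall_time(100, 3.2, 12.0))  # 332.0 seconds
```

Reporting the iteration time alone drops the `network_overhead` term, which is why it is comparable across interconnects.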
Pople, Abe, Ranger CPMD Benchmarks
[Figure: benchmark timings vs. number of cores (0 to 256) for Pople, Abe, and Ranger at 50 and 70 Ryd]
Results I
• All calculations ran no longer than 2 hours
• Ranger is not the preferred machine for CPMD
• Scales well between 8 and 96 cores
  ▫ This is a common CPMD trend
• CPMD is known to scale super-linearly above ~1000 processors
  ▫ Will look into this
  ▫ The chemical system would have to change, as this smaller simulation is unlikely to scale in this manner
Results II
• Pople and Abe gave the best performance
• If a system requires more than 96 procs, Abe would be a slightly better choice
• Given the difficulties in benchmarking CPMD (psp, volume, system phase, simulation protocol), this benchmark is not a good representation of all possible uses of CPMD
  ▫ Only explored one part of the code
• How each system performs when taxed with additional memory requirements is a better indicator of CPMD’s performance
  ▫ To increase system accuracy, increase E_cut
Percent Difference
between 70 and 50 Ryd

    %Diff = [(t_70 - t_50) / t_50] × 100

[Figure: percent difference vs. number of cores (0 to 256) for Pople, Abe, and Ranger]
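The percent-difference metric is straightforward to compute. The timings below are hypothetical placeholders, since the measured values exist only in the original figure:

```python
def percent_diff(t70, t50):
    """%Diff = ((t_70 - t_50) / t_50) * 100: relative slowdown going from 50 to 70 Ryd."""
    return (t70 - t50) / t50 * 100

# Hypothetical iteration times (seconds) purely for illustration.
print(percent_diff(t70=4.5, t50=3.0))  # 50.0 -> the 70 Ryd run is 50% slower
```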
Conclusions
RANGER
• Re-ran the Ranger calculations
• Lower performance may be linked to using the Intel compiler on AMD chips
  ▫ The PGI compiler could show an improvement
  ▫ Nothing over 5% is expected: Ranger would still be the slowest
  ▫ Wanted to use the same compiler/math libraries across systems
ABE
• Possible super-linear scaling: t_Abe, 256 procs < t_others, 256 procs
• Memory-size effects hinder performance below 96 procs
POPLE
• The best system for wave function optimization
• Shows a (relatively) stable, modest speed decrease as the memory requirement is increased; it is the recommended system
Future Work
• Half-node benchmarking
• Profiling tools
• Test the MD part of CPMD
  ▫ Force calculations involving the non-local parts of the psp will increase memory
  ▫ Extensive level 3 BLAS & some level 2
  ▫ Many FFT all-to-all calls; now the network plays a role
  ▫ Memory > 2 GB
  ▫ A new variable! Monitor the fictitious electron mass
• Changing the model
  ▫ Metallic system (lots of electrons; change of psp and E_cut)
  ▫ Check super-linear scaling