Parallel Methods for Nano/Materials Science Applications (Electronic Structure Calculations)
Andrew Canning
Computational Research Division, LBNL & UC Davis
Outline
• Introduction to nano/materials science
• Electronic structure calculations (DFT)
• Code performance on high-performance parallel computers
• New methods and applications for nanoscience
Milestones in Parallel Calculations
1991  Silicon surface reconstruction (7x7), Phys. Rev.
      (Stich, Payne, King-Smith, Lin, Clarke) Meiko i860, 64-processor Computing Surface
      (Brommer, Needels, Larson, Joannopoulos) Thinking Machines CM-2, 16,384 bit processors
1998  FeMn alloys (exchange bias), Gordon Bell Prize
      (Ujfalussy, Stocks, Canning, Y. Wang, Shelton et al.) Cray T3E, 1500 procs; first > 1 Tflop simulation
2005  1000-atom molybdenum simulation with Qbox at SC05
      (F. Gygi et al.) BlueGene/L, 32,000 processors (LLNL)
Electronic Structure Calculations
• Accurate quantum mechanical treatment of the electrons
• Each electron is represented on a grid or with basis functions (e.g. Fourier components)
• Compute intensive: each electron requires ~1 million grid points/basis functions, and hundreds of electrons are needed
• First-principles electronic structure accounts for 70-80% of NERSC materials science computer time
[Figure: InP quantum dot, highest-energy electron state in the valence band]
Motivation for Electronic Structure Calculations
• Most materials properties (strength, cohesion, etc.) are only understood at a fundamental level through accurate electronic structure
• Many properties are purely electronic, e.g. optical properties (lasers)
• Complements experiments
• Computer design of materials at the nanoscale
Materials Science Methods
• Continuum methods
• Many-body quantum mechanical approach (Quantum Monte Carlo): 20-30 atoms
• Single-particle QM (Density Functional Theory), no free parameters: 100-1000 atoms
• Empirical QM models, e.g. tight binding: 1000-5000 atoms
• Empirical classical potential methods: thousands to millions of atoms
Ab initio Method: Density Functional Theory (Kohn, 1998 Nobel Prize)

Many-body Schrödinger equation (exact, but exponential scaling):

$$\Big\{ \sum_i -\tfrac{1}{2}\nabla_i^2 + \sum_{i<j} \frac{1}{|\mathbf{r}_i - \mathbf{r}_j|} - \sum_{i,I} \frac{Z_I}{|\mathbf{r}_i - \mathbf{R}_I|} \Big\}\, \Psi(\mathbf{r}_1,..,\mathbf{r}_N) = E\, \Psi(\mathbf{r}_1,..,\mathbf{r}_N)$$

Kohn-Sham equations (1965): the many-body ground-state problem can be mapped onto a single-particle problem with the same electron density and a different effective potential (cubic scaling):

$$\Big\{ -\tfrac{1}{2}\nabla^2 + \int \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}' - \sum_I \frac{Z_I}{|\mathbf{r} - \mathbf{R}_I|} + V_{XC} \Big\}\, \psi_i(\mathbf{r}) = E_i\, \psi_i(\mathbf{r})$$

$$\rho(\mathbf{r}) = \sum_i |\psi_i(\mathbf{r})|^2 \quad \text{(the same density as given by } |\Psi(\mathbf{r}_1,..,\mathbf{r}_N)|^2\text{)}$$

Use the Local Density Approximation (LDA) for $V_{XC}[\rho(\mathbf{r})]$ (good for Si, C). Self-consistent calculation.
Self-consistency

$$\Big\{ -\tfrac{1}{2}\nabla^2 + V(\mathbf{r}, \rho) \Big\}\, \psi_i(\mathbf{r}) = E_i\, \psi_i(\mathbf{r}) \;\longrightarrow\; \{\psi_i\}_{i=1,..,N}$$

$$\rho(\mathbf{r}) = \sum_i^{N} |\psi_i(\mathbf{r})|^2 \;\longrightarrow\; V(\mathbf{r}, \rho)$$

N electrons give N wave functions: only the lowest N eigenfunctions are needed, and the cycle is repeated until the density and potential stop changing.
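A minimal sketch of this self-consistency cycle in Python/NumPy, for a toy 1-D Hamiltonian. This is an illustration, not PARATEC's scheme: the grid, the stand-in effective potential, and the mixing and tolerance parameters are all assumptions for demonstration.

```python
import numpy as np

def scf_loop(v_ext, n_elec, mix=0.3, tol=1e-6, max_iter=100):
    """Toy self-consistency cycle on a 1-D grid (illustrative only)."""
    n = len(v_ext)
    dx = 1.0 / n
    # Finite-difference kinetic operator -1/2 d^2/dx^2
    lap = (np.diag(np.full(n, -2.0)) + np.diag(np.ones(n - 1), 1)
           + np.diag(np.ones(n - 1), -1)) / dx**2
    kinetic = -0.5 * lap

    rho = np.full(n, n_elec / n)                 # initial uniform density guess
    for _ in range(max_iter):
        # Stand-in effective potential V(r, rho); a real DFT code adds
        # Hartree and exchange-correlation terms here.
        h = kinetic + np.diag(v_ext + rho)
        e, psi = np.linalg.eigh(h)               # take the lowest N eigenpairs
        rho_new = np.sum(np.abs(psi[:, :n_elec])**2, axis=1) / dx
        if np.max(np.abs(rho_new - rho)) < tol:  # self-consistency reached
            break
        rho = (1.0 - mix) * rho + mix * rho_new  # simple linear mixing
    return e[:n_elec], rho

# e.g. occupied eigenvalues of a toy well: scf_loop(np.linspace(0, 5, 64)**2, 4)
```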
Choice of Basis for DFT (LDA)

Increasing basis size M: Gaussian, FLAPW, Fourier, grid.
Percentage of eigenpairs required: from ~30% (small basis) down to ~2% (large basis).
Eigensolvers: direct (ScaLAPACK) or iterative.
Plane-wave Pseudopotential Method in DFT

Solve the Kohn-Sham equations self-consistently for the electron wavefunctions within the Local Density Approximation:

$$\Big\{ -\tfrac{1}{2}\nabla^2 + \int \frac{\rho(\mathbf{r}')}{|\mathbf{r} - \mathbf{r}'|}\, d\mathbf{r}' - \sum_I \frac{Z_I}{|\mathbf{r} - \mathbf{R}_I|} + V_{XC}(\rho(\mathbf{r})) \Big\}\, \psi_j(\mathbf{r}) = E_j\, \psi_j(\mathbf{r})$$

1. Plane-wave expansion for the wavefunctions:

$$\psi_{j,\mathbf{k}}(\mathbf{r}) = \sum_{\mathbf{g}} C_{j,\mathbf{g}}(\mathbf{k})\, e^{i(\mathbf{g}+\mathbf{k})\cdot\mathbf{r}}$$

2. Replace the "frozen" core electrons by a pseudopotential.

Different parts of the Hamiltonian are calculated in different spaces (Fourier and real); a 3D FFT is used to move between them.
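The reason the 3D FFT appears: the kinetic term is diagonal in Fourier space while the local potential is diagonal in real space, so H·ψ is applied by transforming between the two representations. A minimal serial NumPy sketch (plane-wave cutoff and nonlocal pseudopotential projectors omitted; the function and variable names are illustrative):

```python
import numpy as np

def apply_hamiltonian(psi_g, v_real, g2):
    """Apply H = -1/2 grad^2 + V_local to plane-wave coefficients.

    psi_g  : 3-D array of plane-wave coefficients C_g
    v_real : local effective potential on the real-space grid
    g2     : |g+k|^2 on the same grid of Fourier indices

    The kinetic energy is diagonal in Fourier space; the local potential
    is diagonal in real space, so we FFT back and forth between them.
    """
    h_psi = 0.5 * g2 * psi_g              # kinetic term, Fourier space
    psi_r = np.fft.ifftn(psi_g)           # transform to real space (3-D FFT)
    h_psi += np.fft.fftn(v_real * psi_r)  # potential applied pointwise, back
    return h_psi
```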
PARATEC (PARAllel Total Energy Code)
• PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane-wave basis set
• Written in F90 and MPI
• Designed to run on large parallel machines (IBM SP etc.) but also runs on PCs
• PARATEC uses an all-band CG approach to obtain the electron wavefunctions
• Generally obtains a high percentage of peak performance on different platforms
• Developed with Louie and Cohen's groups (UCB, LBNL), Raczkowski
PARATEC: Code Details
• Code written in F90 and MPI (~50,000 lines)
• 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
• Global communications in the 3D FFT (transpose)
• Parallel 3D FFT handwritten to minimize communications and reduce latency (written on top of vendor-supplied 1D complex FFTs)
PARATEC: Parallel Data Distribution and 3D FFT
[Figure: panels (a)-(f) showing the wavefunction data layout and the transpose steps of the parallel 3D FFT]
– Load-balance the sphere of Fourier coefficients by giving columns to different procs
– The 3D FFT is done via 3 sets of 1D FFTs and 2 transposes
– Most communication is in the global transpose from (b) to (c); little communication from (d) to (e)
– Flops/Comms ~ log N
– Many FFTs are done at the same time to avoid latency issues
– Only non-zero elements are communicated/calculated
– Much faster than vendor-supplied 3D FFTs
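The decomposition described above, three sets of 1-D FFTs separated by transposes, can be mimicked serially; in the parallel code each transpose becomes the global all-to-all communication step. A NumPy stand-in, not the hand-written parallel FFT itself:

```python
import numpy as np

def fft3d_by_transposes(a):
    """3-D FFT as three sets of 1-D FFTs separated by transposes.

    Mirrors the parallel algorithm's structure: the 1-D FFTs along the
    locally stored axis need no communication, while each transpose is
    the global all-to-all step (e.g. MPI_Alltoallv) in the MPI code.
    """
    a = np.fft.fft(a, axis=2)   # 1-D FFTs along the local axis
    a = a.transpose(0, 2, 1)    # first transpose (global communication)
    a = np.fft.fft(a, axis=2)   # 1-D FFTs along the second axis
    a = a.transpose(2, 1, 0)    # second transpose
    a = np.fft.fft(a, axis=2)   # 1-D FFTs along the third axis
    return a                    # equals np.fft.fftn(a).transpose(1, 2, 0)
```

Note that the result comes back with permuted axes; a real code accounts for this in its data-layout convention rather than paying for a third transpose.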
PARATEC: Performance

Problem: 488-atom CdSe quantum dot. Gflops/P (% of peak) per platform:

 P    | Power3 (NERSC) | Jacquard (Opteron) | Thunder (Itanium2) | Cray X1 (ORNL) | NEC ES (SX6*) | NEC SX8
 128  | 0.93 (62%)     | 1.98 (45%)         | 2.8 (51%)          | 3.2 (25%)      | 5.1 (64%)     | 7.5 (47%)
 256  | 0.85 (57%)     | 0.95 (21%)         | 2.6 (47%)          | 3.0 (24%)      | 5.0 (62%)     | 6.8 (43%)
 512  | 0.73 (49%)     |                    | 2.4 (44%)          |                | 4.4 (55%)     |
 1024 | 0.60 (40%)     |                    | 1.8 (32%)          |                | 3.6 (46%)     |

• All architectures generally achieve high performance due to the computational intensity of the code (BLAS3, FFT)
• ES achieves the highest overall performance to date: 5.5 Tflop/s on 2048 procs
  – Main ES advantage for this code is its fast interconnect
• SX8 achieves the highest per-processor performance
• X1 shows the lowest % of peak
  – Non-vectorizable code is much more expensive on the X1
• IBM Power5: 4.8 Gflops/P (63% of peak on 64 procs)
• BG/L got 478 Mflops/P (17% of peak on 512 procs)

Developed with Louie and Cohen's groups (UCB, LBNL); also work with L. Oliker, J. Carter
Self-consistent All-band Method for Metallic Systems

Previous methods are self-consistent (SC) band by band, with temperature smearing (e.g. the VASP code). Drawback: band-by-band is slow on modern computers, since it cannot use fast BLAS3 matrix-matrix routines.

The new method instead uses the occupancies in the inner iterative loop with an all-band Grassmann method (the GMCG method).

[Figure: convergence on an Al(100) surface, 10 layers + vacuum, for GMCG, the new method with occupancy]

The self-consistent all-band cycle for metals, with potential mixing $V_{out} \rightarrow V_{in}$: minimize the band energy over the wavefunctions $\{\psi_i\}$, now weighted by the occupancies $f_i$,

$$\min_{\{\psi_i\}} \sum_i \langle \psi_i | \{-\tfrac{1}{2}\nabla^2 + V_{in}\} | \psi_i \rangle \;\;\longrightarrow\;\; \min_{\{\psi_i\}} \sum_i f_i\, \langle \psi_i | \{-\tfrac{1}{2}\nabla^2 + V_{in}\} | \psi_i \rangle,$$

then form the new density and output potential (KS-DFT):

$$\rho(\mathbf{r}) = \sum_i f_i\, \psi_i^*(\mathbf{r})\, \psi_i(\mathbf{r}) \;\longrightarrow\; V_{out}(\mathbf{r})$$
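A sketch of the occupancy step such a method needs for metals: Fermi-Dirac occupation numbers at a smearing temperature, with the Fermi level found by bisection. Illustrative Python only; the function name and parameters are assumptions, not the GMCG implementation.

```python
import numpy as np

def fermi_occupancies(eigs, n_elec, kt=0.01, spin_deg=2.0):
    """Fermi-Dirac occupation numbers f_i at smearing temperature kT.

    Finds the Fermi level mu by bisection so that sum_i f_i = n_elec.
    Energies and kt share the same units; spin_deg counts spin degeneracy.
    """
    def occ(mu):
        x = np.clip((eigs - mu) / kt, -60.0, 60.0)   # avoid exp overflow
        return spin_deg / (np.exp(x) + 1.0)

    lo, hi = eigs.min() - 10 * kt, eigs.max() + 10 * kt
    for _ in range(200):                             # bisection on mu
        mu = 0.5 * (lo + hi)
        if occ(mu).sum() < n_elec:
            lo = mu
        else:
            hi = mu
    return occ(mu)
```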
The Quantization Condition of Quantum-Well States in Cu/Co(100)
• Theoretical investigation of quantum-well states in Cu films using our codes (PARATEC, PEtot) to compare with experiments at the ALS (E. Rotenberg, Y.Z. Wu, Z.Q. Qiu)
• New computational methods for metallic systems were used in the calculations
• Led to an understanding of surface effects on the quantum-well states; improves on the simple phase accumulation model used previously
[Figure: photoemission setup (photon beam in, electrons out) on a 54 Å copper wedge of thickness d over a cobalt layer and copper substrate]
[Figure: QW states in the copper wedge, E-EF (eV) from 0.0 to -1.5 vs. Cu thickness from 0 to 20 ML]
The difference between theory and experiment is improved by taking surface effects into account.
Computational Challenges (Larger Nanostructures)

Size regimes and methods:
• Molecules, 1-100 atoms: ab initio methods (PARATEC)
• Bulk, infinite (1-10 atoms in a unit cell): ab initio methods (PARATEC), O(N^3) scaling
• Nanostructures, 1000-10^6 atoms: the challenge for computational nanoscience. Requires ab initio elements and reliability, new methodology and algorithms (ESCAN), and even larger supercomputers
Example: Quantum Dots (QD) CdSe
•Band gap increase
CdSe quantum dot (size)
•Single electron effects
on transport (Coulomb
blockade).
•Mechanical properties,
surface effects and no
dislocations
Charge Patching Method for Larger Systems (Wang)

A self-consistent LDA calculation of a single graphite sheet yields a non-self-consistent, LDA-quality potential for a nanotube: get the information from a small-system ab initio calculation, then generate the charge densities for large systems.
Motif-based Charge Patching Method (Wang)

From graphite (LDA) to the nanotube, the patched charge density is assembled from aligned motif densities:

$$\rho_{patch}(\mathbf{r}) = \sum_{\mathbf{R}} \rho_{motif}^{aligned}(\mathbf{r} - \mathbf{R})$$

Error: 1%, ~20 meV eigen-energy error.
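A minimal sketch of the patching sum above, assuming grid-aligned motifs (a real implementation aligns and rotates motifs to the local bonding environment; all names here are illustrative):

```python
import numpy as np

def patch_density(motif_rho, positions, grid_shape):
    """rho_patch(r) = sum_R rho_motif(r - R) on a periodic grid.

    motif_rho : small 3-D array holding the motif charge density
    positions : list of integer (ix, iy, iz) grid offsets, one per motif
    """
    pad = np.zeros(grid_shape)
    nx, ny, nz = motif_rho.shape
    pad[:nx, :ny, :nz] = motif_rho            # motif embedded at the origin
    rho = np.zeros(grid_shape)
    for R in positions:                       # paste a copy at each offset R
        rho += np.roll(pad, shift=tuple(R), axis=(0, 1, 2))
    return rho
```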
+ Folded Spectrum Method (ESCAN)

$$\Big\{-\tfrac{1}{2}\nabla^2 + V(\mathbf{r})\Big\}\,\psi_i(\mathbf{r}) = E_i\,\psi_i(\mathbf{r}), \qquad H\psi_i = \epsilon_i\psi_i$$

$$(H - \epsilon_{ref})^2\,\psi_i = (\epsilon_i - \epsilon_{ref})^2\,\psi_i$$
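How the folded operator is used in practice: the states of H nearest the reference energy become the lowest states of (H - e_ref)^2, so a minimizing iterative solver can reach band-edge states directly. An illustrative SciPy sketch, not ESCAN itself (apply_h is a user-supplied matrix-free H, assumed to accept vectors of shape (n,) or (n, 1)):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lobpcg

def folded_spectrum_states(apply_h, n, e_ref, n_states=4, seed=0):
    """Interior eigenstates of H nearest e_ref via the folded spectrum.

    (H - e_ref)^2 maps the states nearest e_ref to its lowest eigenstates,
    so a minimizing solver reaches band-edge states (e.g. VBM/CBM) without
    computing all lower eigenpairs.
    """
    def folded(x):
        y = apply_h(x) - e_ref * x            # (H - e_ref) x
        return apply_h(y) - e_ref * y         # (H - e_ref)^2 x

    A = LinearOperator((n, n), matvec=folded, dtype=float)
    X = np.random.default_rng(seed).standard_normal((n, n_states))
    _, vecs = lobpcg(A, X, largest=False, tol=1e-8, maxiter=500)
    # Recover the actual eigenvalues of H: e_i = <psi_i| H |psi_i>
    eigs = np.array([v @ apply_h(v) for v in vecs.T])
    return eigs, vecs
```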
Charge Patching: Free-standing Quantum Dots

In675P652: LDA-quality calculations (eigen-energy error ~20 meV), L.-W. Wang; 64 processors (IBM SP3) for ~1 hour.
[Figure: CBM and VBM states, total charge density, and motifs]
[Figure: left part of the spectrum of the Hamiltonian; eigenvalue (eV) from -20 to 10 vs. eigenvalue rank from 0 to 200]
Nanowire Single-Electron Memory
Samuelson group, Lund, Sweden. Nano Letters, Vol. 2, No. 2, 2002.
Nanowire Single-Electron Memory (LOBPCG)
• Comparison of LOBPCG with band-by-band CG (64 procs on an IBM SP)
• Matrix size = 2,265,837 (InP/InAs nanowire with 67,000 atoms)
[Figure: residual norm ||A psi - psi E|| (10^0 down to 10^-5) vs. number of matvecs (0 to 25,000) for LOBPCG and PCG]
Using the code to determine the size regimes in which single-electron behavior occurs (~60 nm length, ~20 nm diameter); also using the LCBB code for larger systems.
Work carried out with G. Bester, S. Tomov, J. Langou
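For a sense of LOBPCG usage, a minimal example via SciPy on a toy sparse stand-in (an assumption for illustration; the benchmark above used the 2,265,837-dimensional nanowire Hamiltonian):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

# Toy sparse stand-in for the nanowire Hamiltonian: a 1-D Laplacian plus
# a random diagonal potential (the real matrix dimension is 2,265,837).
n, n_states = 20000, 10
rng = np.random.default_rng(1)
A = diags([-np.ones(n - 1), 2.0 + rng.random(n), -np.ones(n - 1)],
          [-1, 0, 1]).tocsr()

X = rng.standard_normal((n, n_states))       # random starting block
vals, vecs = lobpcg(A, X, largest=False, tol=1e-6, maxiter=300)
# Residual norm ||A psi - psi E||, the quantity tracked in the benchmark
res = np.linalg.norm(A @ vecs - vecs * vals, axis=0)
print(vals[:5], res.max())
```

The block structure is what lets LOBPCG use BLAS3-style operations on many vectors at once, which is the same advantage the all-band methods above have over band-by-band CG.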
Future Directions
• O(N) based methods (exploit locality): gives a sparse matrix problem
• Excited-state calculations
• Transport calculations
Multi-Teraflops Spin Dynamics Studies of the Magnetic Structure of FeMn and FeMn/Co Interfaces

Exchange bias, which involves the use of an antiferromagnetic (AFM) layer such as FeMn to pin the orientation of the magnetic moment of a proximate ferromagnetic (FM) layer such as Co, is of fundamental importance in magnetic multilayer storage and read-head devices.

A larger simulation of 4000 atoms of FeMn ran at 4.42 Teraflops on 4000 processors. (ORNL, Univ. of Tennessee, LBNL (NERSC) and PSC)

IPDPS03: A. Canning, B. Ujfalussy, T.C. Schulthess, X.-G. Zhang, W.A. Shelton, D.M.C. Nicholson, G.M. Stocks, Y. Wang, T. Dirks

[Figure: section of an FeMn/Co (iron manganese/cobalt) interface showing the final configuration of the magnetic moments for five layers at the interface. It shows a new magnetic structure, different from the 3Q magnetic structure of pure FeMn.]

Contact: Andrew Canning ([email protected])
Conclusion

First-principles calculations
+ New algorithmic methodology
+ Large-scale supercomputers
= Accurate nanostructure simulations