Parallel Multi-Reference Configuration Interaction on JAZZ Ron Shepard (CHM)


Parallel Multi-Reference Configuration Interaction on JAZZ
Ron Shepard (CHM)
Mike Minkoff (MCS)
Mike Dvorak (MCS)
The COLUMBUS Program System
• Molecular Electronic Structure
• Collection of individual programs that communicate through external files
• 1: Atomic-Orbital Integral Generation
  2: Orbital Optimization (MCSCF, SCF)
  3: Integral Transformation
  4: MR-SDCI
  5: CI Density
  6: Properties (energy gradient, geometry optimization)
Real Symmetric Eigenvalue Problem
• Use the iterative Davidson method for the lowest (or lowest few) eigenpairs
• Direct CI: H is never constructed explicitly; the products w = Hv are formed in "operator" form
• Matrix dimensions are 10^4 to 10^9
• All floating-point calculations are 64-bit
Davidson Method
Generate an initial vector x_1
MAINLOOP: DO n = 1, NITER
   Compute and save w_n = H x_n
   Compute the nth row and column of G = X^T H X = W^T X
   Compute the subspace Ritz pair: (G - λ1) c = 0
   Compute the residual vector r = W c - λ X c
   Check for convergence using |r|, c, λ, etc.
   IF (converged) THEN
      EXIT MAINLOOP
   ELSE
      Generate a new expansion vector x_{n+1} from r, λ, v = X c, etc.
   ENDIF
ENDDO MAINLOOP
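Below is a minimal NumPy sketch of the loop above, assuming a small dense H so that H and diag(H) can be formed explicitly (the production code only applies H in operator form); the function name and the diagonal preconditioner choice are illustrative, not taken from COLUMBUS.

# Minimal dense-matrix sketch of the Davidson loop above (illustration only).
import numpy as np

def davidson(H, niter=50, tol=1e-8):
    n = H.shape[0]
    x = np.zeros(n); x[0] = 1.0          # initial vector x_1
    X, W = [], []                        # subspace vectors and H*vectors
    for it in range(niter):
        X.append(x)
        W.append(H @ x)                  # w_n = H x_n
        Xm, Wm = np.array(X).T, np.array(W).T
        G = Xm.T @ Wm                    # G = X^T H X
        lam, C = np.linalg.eigh(G)       # subspace Ritz pairs
        lam0, c = lam[0], C[:, 0]        # lowest eigenpair
        r = Wm @ c - lam0 * (Xm @ c)     # residual r = W c - lambda X c
        if np.linalg.norm(r) < tol:
            return lam0, Xm @ c
        # diagonal (Davidson) preconditioner for the new expansion vector
        x = r / (lam0 - np.diag(H) + 1e-12)
        # orthogonalize against the current subspace and normalize
        x -= Xm @ (Xm.T @ x)
        x /= np.linalg.norm(x)
    return lam0, Xm @ c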
Matrix Elements
• H_{mn} = \langle m | H_{op} | n \rangle = \int dr_1 dr_2 \cdots dr_n \, \Phi_m^* \, H_{op} \, \Phi_n
• | n \rangle = | \phi(r_1)\sigma_1 \, \phi(r_2)\sigma_2 \cdots \phi(r_n)\sigma_n |  with  \sigma_j = \alpha, \beta
• H_{op} = -\sum_{j}^{n} \frac{\hbar^2}{2 m_e} \nabla_j^2 \;-\; \sum_{j}^{n} \sum_{a}^{Nuc} \frac{Z_a e^2}{|r_j - R_a|} \;+\; \sum_{j<k}^{n} \frac{e^2}{|r_j - r_k|}
…Matrix Elements
• H_{mn} = \sum_{p,q}^{Norb} h_{pq} \langle m | E_{pq} | n \rangle \;+\; \frac{1}{2} \sum_{p,q,r,s}^{Norb} g_{pqrs} \langle m | e_{pqrs} | n \rangle
• h_pq and g_pqrs are computed and stored as arrays (with index symmetry)
• <m|E_pq|n> and <m|e_pqrs|n> are coupling coefficients; these are sparse and are recomputed as needed
Matrix-Vector Products
• w = H x
• w_m = \sum_{n}^{Ncsf} H_{mn} x_n = \sum_{n}^{Ncsf} \sum_{p,q}^{Norb} h_{pq} \langle m | E_{pq} | n \rangle x_n \;+\; \frac{1}{2} \sum_{n}^{Ncsf} \sum_{p,q,r,s}^{Norb} g_{pqrs} \langle m | e_{pqrs} | n \rangle x_n
• The challenge is to bring together the different factors in order to compute w efficiently
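The toy sketch below illustrates, with made-up array names, how the sparse coupling coefficients, the integral arrays, and the trial-vector elements are combined into w; it ignores the segmentation, index symmetry, and recomputation strategies used in the real program.

# Toy illustration (not the COLUMBUS data layout) of assembling w = H x from
# sparse coupling coefficients and integral arrays; names are illustrative.
import numpy as np

def matvec(x, h, g, E_coups, e_coups):
    """x: trial vector (Ncsf,)
       h: one-electron integrals h[p,q]
       g: two-electron integrals g[p,q,r,s]
       E_coups: list of (m, n, p, q, val) for the nonzero <m|E_pq|n>
       e_coups: list of (m, n, p, q, r, s, val) for the nonzero <m|e_pqrs|n>"""
    w = np.zeros_like(x)
    for m, n, p, q, val in E_coups:                # one-electron contributions
        w[m] += h[p, q] * val * x[n]
    for m, n, p, q, r, s, val in e_coups:          # two-electron contributions
        w[m] += 0.5 * g[p, q, r, s] * val * x[n]
    return w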
Coupling Coefficient Evaluation
• Graphical Unitary Group Approach (GUGA)
• Define a directed graph with nodes and arcs: the Shavitt graph
• Nodes correspond to spin-coupled states consisting of a subset of the total number of orbitals
• Arcs correspond to the (up to) four allowed spin couplings when an orbital is added to the graph (a simplified construction is sketched below)
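A simplified sketch of building such a graph, assuming the standard GUGA step values and Paldus-row (a, b, c) node labels (2a + b = electron count, b = 2S); the helper names and the example head row are illustrative and not taken from the COLUMBUS code, and nodes that cannot reach the head are not pruned here.

# Simplified sketch of enumerating Shavitt-graph nodes as Paldus rows (a, b, c).
from collections import defaultdict

STEPS = {0: (0, 0, 1),   # empty orbital
         1: (0, 1, 0),   # singly occupied, spin coupled "up"
         2: (1, -1, 1),  # singly occupied, spin coupled "down"
         3: (1, 0, 0)}   # doubly occupied

def build_graph(n_orb, head):
    """head = (a, b, c) Paldus row of the full N-electron, spin-S state."""
    levels = [{(0, 0, 0)}]                   # graph tail: no orbitals added yet
    arcs = defaultdict(list)
    for k in range(n_orb):
        nxt = set()
        for (a, b, c) in levels[k]:
            for d, (da, db, dc) in STEPS.items():
                node = (a + da, b + db, c + dc)
                if min(node) >= 0:           # Paldus rows must stay non-negative
                    nxt.add(node)
                    arcs[(k, (a, b, c))].append((d, node))
        levels.append(nxt)
    assert head in levels[n_orb], "head row not reachable"
    return levels, arcs

# e.g. 4 orbitals, 4 electrons, singlet: head row (a, b, c) = (2, 0, 2)
levels, arcs = build_graph(4, (2, 0, 2))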
…Coupling Coefficient Evaluation
[Shavitt graph figure: graph head, internal orbitals, the vertices w, x, y, z at the internal/external boundary, external orbitals, graph tail]
…Coupling Coefficient Evaluation
[Figure: evaluation of a coupling coefficient <m|E_ij|m> on the Shavitt graph]
• Integral types, classified by the number of external-orbital indices:
  0: g_pqrs
  1: g_pqra
  2: g_pqab, g_pa,qb
  3: g_pabc
  4: g_abcd
[Table: integral types (0–4) contributing to each pair of walk types w, x, y, z]
Original Program (1980)
• Need to optimize wave functions for Ncsf = 10^5 to 10^6
• Available memory was typically 10^5 words
• Must segment the vectors, v and w, and partition the matrix H into subblocks, then work with one subblock at a time
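A sketch of this blocking idea, with a hypothetical load_block helper standing in for reading (or regenerating) one H subblock at a time:

# Out-of-core blocking sketch: v and w are split into segments and H into
# subblocks H[I][J]; only one subblock (and the two segments it couples) is
# held in memory at a time.  load_block() is a hypothetical helper.
import numpy as np

def blocked_matvec(v_segments, load_block):
    nseg = len(v_segments)
    w_segments = [np.zeros_like(seg) for seg in v_segments]
    for I in range(nseg):
        for J in range(nseg):
            H_IJ = load_block(I, J)          # one subblock of H at a time
            w_segments[I] += H_IJ @ v_segments[J]
    return w_segments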
…First Parallel Program (1990)
• Networked workstations using TCGMSG
• Each matrix subblock corresponds to a compute task
• Different tasks require different resources (pay attention to load balancing)
• Same vector segmentation for all g_pqrs types
• g_pqrs, <m|e_pqrs|n>, w, and v were stored on external shared files (file contention bottlenecks)
Current Parallel Program
• Eliminate shared-file I/O by distributing data across the nodes with the GA Library
• Parallel efficiency depends on the vector segmentation and the corresponding H subblocking
• Apply different vector segmentation for different g_pqrs types
• Tasks are timed each Davidson iteration, then sorted into decreasing order and reassigned for the next iteration in order to optimize load balancing (see the sketch below)
• Manual tuning of the segmentation is required for optimal performance
• Capable of optimizing expansions up to Ncsf = 10^9
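A hedged sketch of the reassignment step: task times measured in one Davidson iteration are sorted in decreasing order and dealt out greedily to the least-loaded process for the next iteration. The greedy rule shown is one common choice; the slide does not specify the exact PCIUDG policy.

# Greedy "longest task to the least-loaded process" reassignment sketch.
import heapq

def reassign_tasks(task_times, nprocs):
    """task_times: {task_id: seconds measured last iteration}
       returns: {proc_rank: [task_id, ...]} for the next iteration"""
    order = sorted(task_times, key=task_times.get, reverse=True)
    heap = [(0.0, rank) for rank in range(nprocs)]   # (assigned load, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(nprocs)}
    for task in order:
        load, rank = heapq.heappop(heap)             # least-loaded process
        assignment[rank].append(task)
        heapq.heappush(heap, (load + task_times[task], rank))
    return assignment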
COLUMBUS Petaflops Application
Mike Dvorak, Mike Minkoff
MCS Division
Ron Shepard
Chemistry Division
Argonne National Lab
Notes on Software Engineering
• PCIUDG parallel code
  – Fortran 77/90
  – Compiled with Intel/Myrinet on Jazz
• 70k lines in PCIUDG
  – 14 files containing ~205 subroutines
• Versioning system
  – Currently distributed in a tar file
  – Created an LCRC CVS repository for personal code mods
Notes on Software Engineering (cont)
• Homegrown preprocessing system
  – Uses "*mdc*if parallel" statements to comment/uncomment parts of the code (see the sketch after this slide)
  – Could/should be replaced with CPP directives
• Global Arrays library
  – Provides a global address space for matrix computation
  – Used mainly for chemistry codes but applicable to other applications
  – Ran with the most current version --> no performance gain
  – Installed in Softenv on Jazz (version 3.2.6)
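A sketch of what such a block preprocessor does, assuming a "*mdc*endif" closing keyword (only "*mdc*if parallel" appears above): guarded Fortran lines are commented or uncommented depending on which keys are active. The CPP replacement mentioned on the slide would guard the same lines with #ifdef PARALLEL ... #endif.

# Illustrative sketch of a "*mdc*if <key>" block preprocessor; the closing
# "*mdc*endif" keyword and the column-1 "c" comment convention are assumptions.
def preprocess(lines, active_keys):
    out, key = [], None
    for line in lines:
        low = line.lower().lstrip()
        if low.startswith("*mdc*if"):
            key = low.split()[1]                    # e.g. "parallel"
            out.append(line)
        elif low.startswith("*mdc*endif"):
            key = None
            out.append(line)
        elif key is not None:
            # activate or deactivate the guarded Fortran line
            body = line[1:] if line[:1] in "cC" else line
            out.append(body if key in active_keys else "c" + body)
        else:
            out.append(line)
    return out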
Gprof Output
• 270 subroutines called
• loopcalc subroutine uses ~20% of simulation time
• Added user-defined MPE states to 50 loopcalc calls
  – Challenging due to the large number of subroutines in the file
  – 2 GB file size is a severe limit on the number of procs
  – Broken logging
• Show actual output
Jumpshot/MPE Instrumentation
Live demo of a 20-proc run
Using FPMPI
• Relinked code with FPMPI
• Reports the total number of MPI calls made
• Output file size is smaller (compared to other tools, e.g., Jumpshot)
• Produces a histogram of message sizes
• Not installed in Softenv on Jazz yet
  – ~riley/fpmpi-2.0
• Problem for the runs
  – Double-zeta C2H4 without optimizing the load balance
Total Number of MPI Calls
Max/Avg MPI Complete Time
Avg/Max Time MPI Barrier
COLUMBUS Performance Results
COLUMBUS Performance Data
R. Shepard, M. Dvorak, M. Minkoff
Timing of Steps (Sec.)
Basis Set   Integral Time   Orbital Opt. Time   CI Time
DZ          1               34                  3,281
TZ          26              104                 31,415
QZ          388             11,806              382,221
Walks Vs. Basis Set (Millions)
Basis Set   Z      Y    X     W     Matrix Dim.
cc-pVQZ     0.08   15   536   305   858
cc-pVTZ     0.08   7    120   69    198
cc-pVDZ     0.08   2    13    8     24
Timing of CI Iteration
Basic model of performance: Time = C1 + C2*N + C3/N
Constrained linear term: C2 > 0
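A sketch of fitting this model, assuming N is the process count and imposing the constraint C2 > 0 (implemented here as C2 >= 0); the timing values below are made-up placeholders, not data from the report.

# Constrained least-squares fit of Time = C1 + C2*N + C3/N.
import numpy as np
from scipy.optimize import lsq_linear

N = np.array([8, 16, 32, 64, 128], dtype=float)     # process counts (example)
t = np.array([410.0, 230.0, 150.0, 120.0, 115.0])   # iteration times (placeholder)

A = np.column_stack([np.ones_like(N), N, 1.0 / N])  # columns for C1, C2, C3
lb = np.array([-np.inf, 0.0, -np.inf])              # enforce C2 >= 0
ub = np.array([np.inf, np.inf, np.inf])
res = lsq_linear(A, t, bounds=(lb, ub))
C1, C2, C3 = res.x
print(f"C1={C1:.3g}  C2={C2:.3g}  C3={C3:.3g}")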