Optimizing Quantum Chemistry using Charm++
Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
Overview
CPMD: 9 phases, Charm applicability, Overlap, Decomposition
Decomposition: State Planes, 3D FFT, 3D matrix multiply
Utilizing Charm++: Portability, Prioritized nonlocal communication
Optimization: Commlib, Projections
Quantum Chemistry
LeanCP Collaboration
Glenn Martyna (IBM TJ Watson)
Mark Tuckerman (NYU)
Nick Nystrom (PSU)
PPL: Kale, Shi, Bohm, Pauli, Kumar (now at IBM), Vadali
CPMD Method
Plane-wave QM: 100s of atoms
Charm++ Parallelization
PINY MD Physics engine
CPMD on Charm++
11 Charm Arrays
Adaptive Overlap
4 Charm Modules
Prioritized computation for phased application
13 Charm Groups
3 Commlib strategies
BLAS
FFTW
PINY MD
Communication optimization
Load balancing
Group caches
Rth Threads
Practical Scaling
Single Wall Carbon Nanotube Field Effect Transistor

BG/L Performance

Nodes   Processors   CPU time (s)   Parallel Efficiency
32      32           19.8           1.00
64      64           10.9           0.90
128     128           6.9           0.71
256     256           4.3           0.58
512     512           2.7           0.45
128     256           4.9           0.51
256     512           3.1           0.40
512     1024          2.1           0.28
Computation Flow
Charm++
Uses the approach of virtualization
Divide the work into virtual processors (VPs), typically many more than the number of processors (see the sketch below)
Schedule each VP for execution
Advantages:
Computation and communication can be overlapped (between VPs)
Number of VPs can be independent of the number of processors
Other: load balancing, checkpointing, etc.
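As a rough, minimal sketch of this idea (not code from the talk): a 1-D chare array whose element count is chosen independently of the processor count. The module, class, and entry names (virt, Piece, doStep) are invented for illustration.

// virt.ci -- Charm++ interface file for the sketch
mainmodule virt {
  mainchare Main {
    entry Main(CkArgMsg *m);
  };
  array [1D] Piece {
    entry Piece();
    entry void doStep();
  };
};

// virt.C
#include "virt.decl.h"

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    // Many more virtual processors than physical processors.
    int nPieces = 16 * CkNumPes();
    CProxy_Piece pieces = CProxy_Piece::ckNew(nPieces);
    pieces.doStep();              // broadcast; the runtime schedules each VP
    delete m;
  }
};

class Piece : public CBase_Piece {
 public:
  Piece() {}
  void doStep() {
    // Local work goes here; messages sent from one VP overlap with the
    // computation of other VPs on the same processor.
    CkPrintf("piece %d running on PE %d\n", thisIndex, CkMyPe());
    contribute(CkCallback(CkCallback::ckExit));  // exit when all VPs finish
  }
};

#include "virt.def.h"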
Decomposition
A higher degree of virtualization is better for Charm++
Real Space State Planes, Gspace State Planes, Rho Real and Rho G, S-Calculators for each gspace state plane
Tens of thousands of chares for a 32-molecule problem
Careful scheduling to maximize efficiency
Most of the computation is in FFTs and matrix multiplies
3-D FFT Implementation
“Dense” 3-D FFT and “Sparse” 3-D FFT
Parallel FFT Library
Slab-based parallelization (sketched below)
We do not re-implement the sequential routine
Utilize 1-D and 2-D FFT routines provided by FFTW
Allow for:
Multiple 3-D FFTs simultaneously
Multiple data sets within the same set of slab objects
Useful, as 3-D FFTs are frequently used in CP computations
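A rough sketch of the slab scheme, not the library's code: each slab object runs a 2-D FFT over its plane with FFTW, a transpose (realized as Charm++ messages between slab objects in the actual library) redistributes the data, and batched 1-D FFTs finish the transform. The FFTW3 interface is used here for brevity (the original library may have used a different FFTW version), and the Slab class is hypothetical.

#include <fftw3.h>

// One slab of an N x N x N complex grid: a single XY plane.
// In the parallel library each slab is a chare; here it is a plain class.
class Slab {
 public:
  explicit Slab(int N) : N(N) {
    plane   = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * N * N);
    pencils = (fftw_complex *) fftw_malloc(sizeof(fftw_complex) * N * N);
    // Phase 1: 2-D FFT over the local XY plane.
    plan2d = fftw_plan_dft_2d(N, N, plane, plane, FFTW_FORWARD, FFTW_ESTIMATE);
    // Phase 3: batched 1-D FFTs along Z after the transpose; for simplicity
    // this sketch gives each object N pencils of length N, stored back to back.
    int n[1] = { N };
    plan1d = fftw_plan_many_dft(1, n, N,
                                pencils, NULL, 1, N,
                                pencils, NULL, 1, N,
                                FFTW_FORWARD, FFTW_ESTIMATE);
  }
  ~Slab() {
    fftw_destroy_plan(plan2d);
    fftw_destroy_plan(plan1d);
    fftw_free(plane);
    fftw_free(pencils);
  }

  void fftPlane()   { fftw_execute(plan2d); }  // phase 1: local 2-D FFT
  // Phase 2 (transpose): in the Charm++ library the slab objects exchange
  // pencil segments here via messages; omitted in this sketch.
  void fftPencils() { fftw_execute(plan1d); }  // phase 3: 1-D FFTs along Z

 private:
  int N;
  fftw_complex *plane, *pencils;
  fftw_plan plan2d, plan1d;
};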
Multiple Parallel 3-D FFTs
Matrix Multiply
AKA S-Calculator or Pair Calculator
Decompose state-plane values into smaller objects
Use DGEMM on smaller sub-matrices (sketched below)
Sum together via a reduction back to Gspace
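A schematic of one sub-matrix multiply, not the production Pair Calculator: a hypothetical SCalcTile chare multiplies its blocks with BLAS DGEMM and contributes the partial product to a Charm++ sum reduction, whose callback delivers the combined result back to the Gspace objects.

// scalc.ci (sketch; module and entry names are hypothetical)
module scalc {
  array [1D] SCalcTile {
    entry SCalcTile();
    entry void multiply(int m, int n, int k,
                        double A[m*k], double B[n*k], CkCallback toGSpace);
  };
};

// scalc.C
#include <vector>
#include <cblas.h>
#include "scalc.decl.h"

class SCalcTile : public CBase_SCalcTile {
 public:
  SCalcTile() {}
  // A is an m x k block of state coefficients, B an n x k block;
  // this tile computes the partial product C = A * B^T.
  void multiply(int m, int n, int k, double *A, double *B, CkCallback toGSpace) {
    std::vector<double> C(m * n);
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                m, n, k,
                1.0, A, k,
                     B, k,
                0.0, C.data(), n);
    // Sum the partial products from every tile; the reduction callback
    // (e.g. a broadcast back to the Gspace array) receives the result.
    contribute(m * n * sizeof(double), C.data(), CkReduction::sum_double, toGSpace);
  }
};

#include "scalc.def.h"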
Matrix Multiply: VP-based approach
Charm++ Tricks and Tips
Message-driven execution and a high degree of virtualization present tuning challenges
Flow of control using Rth-Threads
Prioritized messages
Commlib framework
Charm++ arrays vs groups
Problem identification with Projections
Problem isolation techniques
Flow Control in Parallel
Rth Threads
Based on Duff's device, these are user-level threads with negligible overhead
Essentially goto and return without loss of readability
Allow for an event-loop style of programming
Make flow of control explicit
Use familiar threading semantics
Rth Threads for Flow Control
// Rth "run" routine for a Rho real-space plane object: each RTH_Suspend()
// yields until an entry method resumes the thread, keeping the phase order
// explicit in one place.
RTH_Routine_code(CP_Rho_RealSpacePlane, run) {
  while (1) {
    RTH_Suspend();        // wait for the density reduction to arrive
    c->acceptDensity();
    RTH_Suspend();        // wait for each of the four FFT phases
    c->doneFFT();
    RTH_Suspend();
    c->doneFFT();
    RTH_Suspend();
    c->doneFFT();
    RTH_Suspend();
    c->doneFFT();
    if (!(c->gotAllRhoEnergy && (c->doneDoingFFT))) // PRE: doneDoingFFT==TRUE
      RTH_Suspend();
  }
}

// Here the density from all the states is added up. The data from all the
// states is received via an array section reduction.
void CP_Rho_RealSpacePlane::acceptDensity(CkReductionMsg *msg) {
  double *realValues = (double *) msg->getData();
  for (int i = 0; i < rho_rs.sizeZ * rho_rs.sizeX; i++) {
    rho_rs.doFFTonThis[i] = complex(realValues[i] * probScale, 0);
  }
  delete msg;
  RTH_Runtime_resume(run_thread);   // hand control back to the run routine
}
Prioritized Messages for Overlap
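The timeline figure is not reproduced here. As a sketch of the mechanism it illustrates: a Charm++ message can carry an integer priority so that, for instance, nonlocal work is pulled off the scheduler queue ahead of bulk FFT traffic. NonLocalMsg, CProxy_GSpaceDriver, and computeNonLocal below are invented names; CkPriorityPtr and CK_QUEUEING_IFIFO are the standard Charm++ priority hooks.

// Sending a prioritized message (fragment; assumes a plain Charm++ message
// declared in the .ci file as "message NonLocalMsg;" with fields in C++).
void sendNonLocal(CProxy_GSpaceDriver &proxy, int plane) {
  // Reserve 8*sizeof(int) priority bits when allocating the message.
  NonLocalMsg *msg = new (8 * sizeof(int)) NonLocalMsg;
  // With integer-FIFO queueing, smaller values are scheduled first.
  *(int *) CkPriorityPtr(msg) = -1;
  CkSetQueueing(msg, CK_QUEUEING_IFIFO);
  proxy[plane].computeNonLocal(msg);
}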
Communication Library
Fine-grained decomposition can result in many small messages.
Message combining via the Commlib framework in Charm++ addresses this problem (a hand-rolled illustration of combining follows this slide).
The streaming protocol optimizes many-to-many personalized communication.
Forwarding protocols like Ring or Multiring can be beneficial, but not on BG/L.
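Commlib performs the combining transparently; purely to illustrate what message combining means, here is a hand-rolled sketch (not the Commlib implementation) in which a per-processor group buffers small records by destination and flushes each buffer as one marshalled batch. The Combiner group and its entries are hypothetical.

// combiner.ci (sketch; names are hypothetical)
module combiner {
  group Combiner {
    entry Combiner();
    entry void recvBatch(int n, double records[n]);
  };
};

// combiner.C
#include <vector>
#include "combiner.decl.h"

class Combiner : public CBase_Combiner {
 public:
  Combiner() : buffers(CkNumPes()) {}

  // Called locally instead of sending one tiny message per record,
  // e.g. via combinerProxy.ckLocalBranch()->deposit(pe, x);
  void deposit(int destPe, double record) {
    buffers[destPe].push_back(record);
    if (buffers[destPe].size() >= kBatch) flush(destPe);
  }

  // Send everything buffered for destPe as a single combined message.
  void flush(int destPe) {
    std::vector<double> &buf = buffers[destPe];
    if (buf.empty()) return;
    thisProxy[destPe].recvBatch((int)buf.size(), buf.data());
    buf.clear();
  }

  // Unpack the combined message on the receiving processor.
  void recvBatch(int n, double *records) {
    for (int i = 0; i < n; i++) {
      // ... hand each record to the local consumer object ...
    }
  }

 private:
  enum { kBatch = 64 };
  std::vector<std::vector<double> > buffers;
};

#include "combiner.def.h"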
Commlib Strategy Selection
Streaming Commlib
Saves time: 610 ms vs. 480 ms
Bound Arrays
Why? Efficiency and clarity of expression.
Two arrays of the same dimensionality where like indices are co-placed.
Gspace and the non-local computation both have plane-based computations and share many data elements.
Use ckLocal() to access co-placed elements with ordinary local function calls.
They remain distinct parallel objects (see the sketch below).
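A minimal sketch of the mechanism, with hypothetical array names rather than the LeanCP classes: the second array is created with CkArrayOptions::bindTo so that element [i] of both arrays always lives on the same processor, and ckLocal() then returns a direct pointer to the co-placed element.

// Sketch only: GSpacePlane and NonLocalPlane are hypothetical chare arrays
// assumed to be declared in a .ci file; gSpaceProxy is a readonly proxy.
CProxy_GSpacePlane gSpaceProxy;   // readonly, initialized by the main chare

void createBoundArrays(int nPlanes) {
  CkArrayOptions opts(nPlanes);
  gSpaceProxy = CProxy_GSpacePlane::ckNew(opts);
  opts.bindTo(gSpaceProxy);       // element [i] of the next array is placed
                                  // (and migrated) together with gSpaceProxy[i]
  CProxy_NonLocalPlane nlProxy = CProxy_NonLocalPlane::ckNew(opts);
}

// Inside NonLocalPlane element i: its bound partner is always on this
// processor, so ckLocal() returns a usable pointer.
void NonLocalPlane::useSharedData() {
  GSpacePlane *g = gSpaceProxy[thisIndex].ckLocal();
  CkAssert(g != NULL);            // holds because the arrays are bound
  // ... read g's plane data directly, as with a local function call ...
}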
Group Caching Techniques
Group objects have one element per processor
Making excellent cache points for arrays, which may have many chares per processor
Place low-volatility data in the group
Array elements use ckLocalBranch to access it (sketch below)
In CPMD: the Structure Factor for all chares which have plane P uses the same memory
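A sketch of this pattern with invented names (SFCache, lookup): one group element per processor holds the low-volatility data, and any array element on that processor reaches it through ckLocalBranch().

// sfcache.ci (sketch; names are hypothetical)
module sfcache {
  group SFCache {
    entry SFCache();
  };
};

// sfcache.C
#include <map>
#include <vector>
#include "sfcache.decl.h"

// One SFCache element per processor, shared by every chare on that PE.
class SFCache : public CBase_SFCache {
 public:
  SFCache() {}
  // Return the cached structure-factor data for plane p, computing it once.
  const std::vector<double> &lookup(int p) {
    std::map<int, std::vector<double> >::iterator it = cache.find(p);
    if (it == cache.end()) {
      std::vector<double> sf = computeStructureFactor(p); // hypothetical
      it = cache.insert(std::make_pair(p, sf)).first;
    }
    return it->second;
  }
 private:
  std::vector<double> computeStructureFactor(int p);  // defined elsewhere
  std::map<int, std::vector<double> > cache;
};

#include "sfcache.def.h"

// In an array element that needs plane p's structure factor
// (sfCacheProxy is a readonly CProxy_SFCache set at startup):
//   const std::vector<double> &sf = sfCacheProxy.ckLocalBranch()->lookup(p);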
Charm++ Performance Debugging
Complex parallel applications are hard to debug
An event-based model with a high degree of virtualization presents new challenges
Tools: Projections and the Charm++ debugger
Bottleneck identification using the Projections Usage Profile tool
Old S->T Orthonormalization
After Parallel S->T
Problem isolation techniques
Using Rth threads, it is easy to isolate phases by adding a barrier:
Contribute to a reduction -> suspend
The reduction's broadcast client resumes the thread
In the following example we break up the Gspace IFFT into computation and communication entry methods.
We then insert a barrier between them to highlight a specific performance problem (a sketch of the barrier idiom follows).
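A sketch of that barrier idiom under the RTH conventions shown earlier (the class GSpacePlane and the entry barrierReached are hypothetical): every element contributes to a reduction whose client is a broadcast back to the array, and each element's run thread suspends until that broadcast resumes it.

// Phase barrier between two entry methods (sketch; names are hypothetical).

// 1. At the end of the "computation" entry method, each element contributes
//    to an empty reduction whose client broadcasts barrierReached() to the
//    whole array, then its RTH run thread suspends.
void GSpacePlane::finishComputePhase() {
  CkCallback cb(CkIndex_GSpacePlane::barrierReached(NULL), thisProxy);
  contribute(cb);
  // ... the run routine now executes RTH_Suspend() and waits ...
}

// 2. The broadcast arrives only after every element has contributed,
//    so resuming here acts as a barrier before the "communication" phase.
void GSpacePlane::barrierReached(CkReductionMsg *msg) {
  delete msg;
  RTH_Runtime_resume(run_thread);   // continue past the barrier
}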
Projections Timeline Analysis
Optimizations Motivated by BG/L
Finer decomposition
Structure Factor and non-local computation now operate on groups of atoms within a plane
Improved scaling
Avoid creating network bottlenecks
No DMA or communication offload on BG/L's torus network
Workarounds for the MPI progress engine
Set the eager threshold below 1000
Add network progress probes inside inner loops
Shift communication to avoid interference across computation phases
After the fixes
Future Work
Scaling to 20k processors on BG/L - density pencil FFTs
Rhospace real->complex doublepack optimization
New FFT-based algorithm for the Structure Factor
More systems
Topology-aware chare mapping
HLL Orchestration expression
What time is it in Scotland?
There is a 1024-node BG/L in Edinburgh
The time there is 6 hours ahead of CT
During this non-production time we can run on the full rack at night
Thank you EPCC!