
Achievements and challenges running
GPU-accelerated Quantum ESPRESSO
on heterogeneous clusters
Filippo Spiga (1,2) <[email protected]>
(1) HPCS, University of Cambridge
(2) Quantum ESPRESSO Foundation
«What I cannot compute, I do not understand.»
(adapted from Richard P. Feynman)
What is Quantum ESPRESSO?
• QUANTUM ESPRESSO is an integrated suite of computer codes for atomistic simulations based on DFT, pseudo-potentials, and plane waves
• "ESPRESSO" stands for opEn Source Package for Research in Electronic Structure, Simulation, and Optimization
• QUANTUM ESPRESSO is an initiative of SISSA, EPFL, and ICTP, with many partners in Europe and worldwide
• QUANTUM ESPRESSO is free software that can be freely downloaded. Everybody is free to use it and welcome to contribute to its development
What can Quantum ESPRESSO do?
• ground-state calculations
  – Kohn-Sham orbitals and energies, total energies and atomic forces
  – finite as well as infinite systems
  – any crystal structure or supercell
  – insulators and metals (different schemes of BZ integration)
  – structural optimization (many minimization schemes available)
  – transition states and minimum-energy paths (via NEB or string dynamics)
  – electronic polarization via Berry's phase
  – finite electric fields via saw-tooth potential or electric enthalpy
• norm-conserving as well as ultra-soft and PAW pseudopotentials
• many different energy functionals, including meta-GGA, DFT+U, and hybrids (van der Waals soon to be available)
• scalar-relativistic as well as fully relativistic (spin-orbit) calculations
• magnetic systems, including non-collinear magnetism
• Wannier interpolations
• ab-initio molecular dynamics
  – Car-Parrinello (many ensembles and flavors)
  – Born-Oppenheimer (many ensembles and flavors)
  – QM-MM (interface with LAMMPS)
• linear response and vibrational dynamics
  – phonon dispersions, real-space interatomic force constants
  – electron-phonon interactions and superconductivity
  – effective charges and dielectric tensors
  – third-order anharmonicities and phonon lifetimes
  – infrared and (off-resonance) Raman cross sections
  – thermal properties via the quasi-harmonic approximation
• electronic excited states
  – TDDFT for very large systems (both real-time and "turbo-Lanczos")
  – MBPT for very large systems (GW, BSE)
... plus several post-processing tools!
Quantum ESPRESSO in numbers
• 350,000+ lines of FORTRAN/C code
• 46 registered developers
• 1600+ registered users
• 5700+ downloads of the latest 5.x.x version
• 2 web sites (QUANTUM-ESPRESSO.ORG & QE-FORGE.ORG)
• 1 official user mailing list, 1 official developer mailing list
• 24 international schools and training courses (1000+ participants)
PWscf in a nutshell
program flow
[Program-flow diagram: the computational hotspots of the PWscf phases are 3D-FFT + GEMM + LAPACK, 3D-FFT, and 3D-FFT + GEMM.]
Spoiler!
• Only PWscf has been ported to GPU
• Serial performance (full socket vs full socket + GPU): 3x ~ 4x
• Parallel performance (best MPI+OpenMP vs the same + GPU): 2x ~ 3x
• Designed to run best at a low number of nodes (efficiency is not high)
• Spin magnetization and non-collinear calculations not ported yet (working on it)
• I/O kept low on purpose
• NVIDIA Kepler GPUs not yet exploited at their best (working on it)
Achievement: smart and selective BLAS
phiGEMM: CPU+GPU GEMM operations
• a drop-in library won't work as expected; control is needed
• overcomes the limit of the GPU memory
• flexible interface (C on the HOST, C on the DEVICE)
• dynamic workload adjustment (SPLIT), based on a heuristic
• call-by-call profiling capabilities
(a minimal code sketch of the split idea follows the diagram below)
[Diagram: the GEMM C = A × B + C is split so that A1 × B (giving C1) is computed on the GPU and A2 × B (giving C2) on the CPU, with H2D/D2H transfers around the GPU part and a possible load unbalance between the two sides.]
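To make the splitting idea concrete, here is a minimal sketch (not the phiGEMM source) of a GEMM whose output columns are divided between a cuBLAS call on the device and a host BLAS call. The function name split_dgemm and the caller-supplied split ratio are illustrative; phiGEMM chooses the split heuristically and can also tile matrices that exceed the GPU memory.

    /* Sketch of a phiGEMM-style split: C = alpha*A*B + beta*C with the first
     * n_gpu columns of C computed on the GPU (cuBLAS) and the remaining
     * columns on the host BLAS, concurrently. Column-major storage assumed. */
    #include <stddef.h>
    #include <cblas.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    static void split_dgemm(int m, int n, int k,
                            double alpha, const double *A, const double *B,
                            double beta, double *C, double ratio)
    {
        int n_gpu = (int)(n * ratio);      /* columns handled by the GPU  */
        int n_cpu = n - n_gpu;             /* columns handled by the host */

        cublasHandle_t handle;
        cublasCreate(&handle);

        /* device buffers for A, the GPU slice of B and the GPU slice of C */
        double *dA, *dB, *dC;
        cudaMalloc((void**)&dA, sizeof(double) * (size_t)m * k);
        cudaMalloc((void**)&dB, sizeof(double) * (size_t)k * n_gpu);
        cudaMalloc((void**)&dC, sizeof(double) * (size_t)m * n_gpu);
        cudaMemcpy(dA, A, sizeof(double) * (size_t)m * k, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, sizeof(double) * (size_t)k * n_gpu, cudaMemcpyHostToDevice);
        cudaMemcpy(dC, C, sizeof(double) * (size_t)m * n_gpu, cudaMemcpyHostToDevice);

        /* GPU part: C(:,0:n_gpu) = alpha*A*B(:,0:n_gpu) + beta*C(:,0:n_gpu);
         * the launch is asynchronous with respect to the host              */
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k,
                    &alpha, dA, m, dB, k, &beta, dC, m);

        /* CPU part runs on the remaining columns while the GPU computes */
        if (n_cpu > 0)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        m, n_cpu, k, alpha, A, m,
                        B + (size_t)k * n_gpu, k, beta,
                        C + (size_t)m * n_gpu, m);

        /* copy the GPU slice of C back (D2H); this also synchronizes */
        cudaMemcpy(C, dC, sizeof(double) * (size_t)m * n_gpu, cudaMemcpyDeviceToHost);

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cublasDestroy(handle);
    }

The "unbalance" in the diagram corresponds to choosing the ratio badly: the side that finishes first sits idle until the other completes.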
Challenge: rectangular GEMM
bad shape, poor performance
Issues:
• A and B can be larger than the GPU memory
• A and B are "badly" rectangular, with one dominant dimension (k), a common case due to the data distribution
Solutions (~ +15% performance):
• tiling approach
  – tiles not too big, not too small
  – the GEMM computation must exceed the copies (H2D, D2H), especially for small tiles
• handling the "SPECIAL-K" case (sketched below)
  – adding beta × C is done once
  – alpha × Ai × Bi is accumulated over the k-tiles
These optimizations are included in phiGEMM (version > 1.9).
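A minimal sketch of the "SPECIAL-K" handling, assuming the matrices are already resident on the device in column-major layout; the function name and the tile size kb are illustrative, not phiGEMM's actual interface.

    /* "SPECIAL-K" tiling: a GEMM with a very long k dimension is split into
     * k-blocks A_i (m x kb) and B_i (kb x n); beta*C is applied only on the
     * first partial product, every later block accumulates with beta = 1.  */
    #include <stddef.h>
    #include <cublas_v2.h>

    static void special_k_dgemm(cublasHandle_t handle, int m, int n, int k,
                                double alpha, const double *dA, const double *dB,
                                double beta, double *dC, int kb)
    {
        const double one = 1.0;
        for (int k0 = 0; k0 < k; k0 += kb) {
            int kcur = (k0 + kb <= k) ? kb : (k - k0);
            /* A_i starts at column k0 of A; B_i starts at row k0 of B */
            const double *Ai = dA + (size_t)m * k0;
            const double *Bi = dB + k0;
            cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, kcur,
                        &alpha, Ai, m, Bi, k,
                        (k0 == 0) ? &beta : &one,   /* beta*C applied once */
                        dC, m);
        }
    }

In the out-of-core variant each Ai/Bi tile would be staged to the device before its partial product, which is why the copies must be cheaper than the GEMM itself.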
Challenge: parallel 3D-FFT
• 3D-FFT burns up to 40%~45% of the total SCF run-time
• ~90% of the 3D-FFTs of PWscf are inside vloc_psi (the "wave" grid)
• each 3D-FFT is "small": < 300^3 COMPLEX DP
• the 3D-FFT grid need not be a cube
• in serial a 3D-FFT is called as such; in parallel, 3D-FFT = Σ 1D-FFT (see the sketch below)
• in serial the data layout is straightforward; in parallel it is not
• MPI communication becomes the big issue for many-node problems
• GPU FFT is mainly memory-bound → grouping & batching of 3D-FFTs
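As a rough illustration of the "Σ 1D-FFT" structure, the z-stage over the sticks owned by one MPI task can be expressed as one batched 1D transform with cuFFT. This is a sketch under the assumption that the sticks are stored back to back in device memory; it is not the actual PWscf FFT driver.

    /* The z-stage of a distributed 3D-FFT as a batch of 1D transforms:
     * "nsticks" columns of length nz are assumed to be stored contiguously,
     * one after another (data[stick*nz + iz]).                             */
    #include <cufft.h>

    static void fft_z_batch(cufftDoubleComplex *d_data, int nz, int nsticks)
    {
        cufftHandle plan;
        int n[1] = { nz };
        /* rank-1 plan, batch = nsticks, unit stride within a stick,
           consecutive sticks separated by nz elements                 */
        cufftPlanMany(&plan, 1, n,
                      NULL, 1, nz,     /* input  layout */
                      NULL, 1, nz,     /* output layout */
                      CUFFT_Z2Z, nsticks);
        cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
        cufftDestroy(plan);
    }

The y- and x-stages look the same, but between stages the data must be transposed across MPI tasks, which is where the communication cost comes from.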
Challenge: FFT data layout
it is all about sticks & planes
There are two "FFT grid" representations in reciprocal space: wave functions (Ecut) and charge density (4Ecut).
A single 3D-FFT is divided into independent 1D-FFTs.
[Diagram: the distributed 3D-FFT pipeline — transform along z (~ Nx Ny / 5 FFTs along z), parallel transpose (~ Nx Ny Nz / (5 Np) data exchanged per PE), transform along y (~ Nx Nz / 2 FFTs along y), transform along x (Ny Nz FFTs along x), with sticks and planes distributed over PE 0..3.]
Data are not contiguous and not "trivially" distributed across processors.
Zeros are not transformed. Different cutoffs preserve accuracy.
Challenge: parallel 3D-FFT
Optimization #1
• CUDA-enabled MPI for P2P (within the socket)
• overlap FFT computation with MPI communication (a minimal sketch follows below)
• MPI communication >>> FFT computation for many nodes
[Diagram: pipeline of Sync, MemCpy H2D, MPI, MemCpy D2H stages.]
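A minimal sketch of the copy/communication pipelining referenced above, assuming pinned host buffers and purely schematic chunk bookkeeping (function and variable names are illustrative, not the QE-GPU code): the D2H copy of the next chunk is queued on a CUDA stream while the current chunk is exchanged over MPI.

    /* Pipelining device->host copies with MPI exchanges during the transpose
     * step: the batch is cut into "nchunk" pieces; the copy of chunk i+1 is
     * queued asynchronously while chunk i travels through MPI_Alltoall.
     * h_buf and h_recv must be pinned (cudaHostAlloc) for overlap to happen. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void overlapped_transpose(const double *d_buf, double *h_buf, double *h_recv,
                              size_t chunk_bytes, int chunk_count, int nchunk,
                              MPI_Comm comm)
    {
        const size_t chunk_len = chunk_bytes / sizeof(double);
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        /* pre-queue the copy of the first chunk */
        cudaMemcpyAsync(h_buf, d_buf, chunk_bytes, cudaMemcpyDeviceToHost, stream);

        for (int i = 0; i < nchunk; ++i) {
            /* wait until chunk i has reached the pinned host buffer */
            cudaStreamSynchronize(stream);

            /* queue the copy of chunk i+1 while chunk i is being exchanged */
            if (i + 1 < nchunk)
                cudaMemcpyAsync(h_buf + (i + 1) * chunk_len,
                                d_buf + (i + 1) * chunk_len,
                                chunk_bytes, cudaMemcpyDeviceToHost, stream);

            /* exchange chunk i among the MPI tasks */
            MPI_Alltoall(h_buf + i * chunk_len, chunk_count, MPI_DOUBLE,
                         h_recv + i * chunk_len, chunk_count, MPI_DOUBLE, comm);
        }
        cudaStreamDestroy(stream);
    }

With a CUDA-enabled MPI the intermediate host buffers can be dropped for intra-socket P2P traffic, which is the point of Optimization #1.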
Challenge: parallel 3D-FFT
Optimization #2
Observation: overlapping the D2H copy is limited by the MPI communication.
• pinned memory is needed (!!!)
• stream the D2H copy to hide the CPU copy and the FFT computation
Optimization #3
Observation: MPI "packets" become small for many nodes.
• re-order data before communication
• batch the MPI_Alltoallv communications
Optimization #4
Idea: reduce the data transmitted (risky...).
• perform FFTs and GEMMs in DP, truncate data to SP before communication (sketched below)
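Optimization #4 can be sketched as below, treating the COMPLEX DP transpose buffer as a plain array of doubles; the function name and buffer handling are illustrative, not the QE-GPU implementation.

    /* Truncate the transpose buffer from double to single precision just
     * before the MPI exchange (halving the message size) and promote it
     * back afterwards. FFTs and GEMMs stay in DP; only the wire format
     * changes, which is the lossy, "risky" part of this optimization.    */
    #include <mpi.h>

    void alltoall_truncated(const double *send_dp, double *recv_dp,
                            float *send_sp, float *recv_sp,
                            int count_per_rank, int nranks, MPI_Comm comm)
    {
        const int n = count_per_rank * nranks;

        for (int i = 0; i < n; ++i)            /* truncate DP -> SP */
            send_sp[i] = (float)send_dp[i];

        MPI_Alltoall(send_sp, count_per_rank, MPI_FLOAT,
                     recv_sp, count_per_rank, MPI_FLOAT, comm);

        for (int i = 0; i < n; ++i)            /* promote SP -> DP */
            recv_dp[i] = (double)recv_sp[i];
    }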
Achievements: parallel 3D-FFT
miniDFT 1.6 (k-point calculations, ultra-soft pseudo-potentials)
• Optimization #1: +37% improvement in communication
• Optimization #2: results compared without and with proper stream management [figure]
• Optimization #3: +10% improvement in communication
• Optimization #4: +52% (!!!) improvement in communication (SP vs DP)
The gain is lower in PWscf!
Challenge: parallel 3D-FFT
[Diagram, image courtesy of D. Stoic: (1) all data of all computed FFTs are copied back to host memory; (2) data are reordered before the GPU-GPU communication.]
Challenge: H*psi
compute/update H * psi:
  compute kinetic and non-local term (in G space)
    → complexity: Ni × (N × Ng + Ng × N × Np)
  loop over (not converged) bands:
    FFT(psi) to R space            → complexity: Ni × Nb × FFT(Nr)
    compute V * psi                → complexity: Ni × Nb × Nr
    FFT(V * psi) back to G space   → complexity: Ni × Nb × FFT(Nr)
  compute Vexx
    → complexity: Ni × Nc × Nq × Nb × (5 × Nr + 2 × FFT(Nr))

N  = 2 × Nb (where Nb = number of valence bands)
Ng = number of G vectors
Ni = number of Davidson iterations
Np = number of PP projectors
Nr = size of the 3D FFT grid
Nq = number of q-points (may be different from Nk)
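The band loop above (the vloc_psi hotspot) has roughly this structure on the GPU. The sketch below uses cuFFT on the dense grid and an illustrative point-wise kernel; it omits the G-space scatter/gather onto the grid and the FFT normalization that the real code performs.

    /* vloc_psi-style kernel: for each (not converged) band, psi is brought
     * to real space, multiplied point-wise by the local potential, and
     * transformed back. Names are illustrative, not the PWscf routines.    */
    #include <cufft.h>
    #include <cuda_runtime.h>

    __global__ void apply_vloc(cufftDoubleComplex *psic, const double *v, int nr)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nr) {                  /* psic(r) = v(r) * psic(r) */
            psic[i].x *= v[i];
            psic[i].y *= v[i];
        }
    }

    void vloc_psi_gpu(cufftDoubleComplex *d_psic, const double *d_v,
                      int nr1, int nr2, int nr3, int nbnd)
    {
        const int nr = nr1 * nr2 * nr3;
        cufftHandle plan;
        cufftPlan3d(&plan, nr1, nr2, nr3, CUFFT_Z2Z);

        for (int ib = 0; ib < nbnd; ++ib) {
            cufftDoubleComplex *psic = d_psic + (size_t)ib * nr;
            cufftExecZ2Z(plan, psic, psic, CUFFT_INVERSE);   /* G -> R */
            apply_vloc<<<(nr + 255) / 256, 256>>>(psic, d_v, nr);
            cufftExecZ2Z(plan, psic, psic, CUFFT_FORWARD);   /* R -> G */
        }
        cudaDeviceSynchronize();
        cufftDestroy(plan);
    }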
Challenge: H*psi
non-converged electronic bands dilemma
The number of FFTs performed is not predictable across the SCF iterations: it changes as bands converge.
Challenge: parallel 3D-FFT
the orthogonal approach
FFT GR
CUFFT GR
PSI
PSIC
PSI
PSIC
PSIC
Multiple LOCAL grid to
compute
“MPI_Allgatherv”
DISTRIBUTED
products
products
“MPI_Allscatterv”
HPSI
HPSI
PSIC
FFT RG
Overlapping is
possible!!
PSIC
PSIC
CUFFT RG
Considerations:
• memory on GPU  ATLAS K40 (12 GByte)
• (still) too much communication  GPU Direct capability needed
• enough 3D-FFT  not predictable in advance
Not ready for production yet
• benefit also for CPU-only!
17
Challenge: eigen-solvers
which library?
• LAPACK → MAGMA (ICL, University of Tennessee)
  – hybrid approach (CPU + GPU), dynamic scheduling based on DLA (QUARK)
  – single and multi-GPU, not memory-distributed (yet)
  – some (inevitable) numerical "discrepancies"
• ScaLAPACK → ELPA → ELPA + GPU (RZG + NVIDIA)
  – ELPA (Eigenvalue SoLvers for Petaflop Applications) improves on ScaLAPACK
  – ELPA-GPU proof-of-concept based on CUDA FORTRAN
  – effective results below expectation
• Lanczos diagonalization with tridiagonal QR algorithm (Penn State)
  – simple (too simple?) and designed to be GPU friendly
  – takes advantage of GPU Direct
  – experimental, needs testing and validation
HPC Machines
WILKES (HPCS) [DELL]
• 128 nodes, dual-socket
• dual 6-core Intel Ivy Bridge
• dual NVIDIA K20c per node
• dual Mellanox Connect-IB FDR
• #2 Green500 Nov 2013 (~3632 MFlops/W)

TITAN (ORNL) [CRAY]
• 18688 nodes, single-socket
• single 16-core AMD Opteron
• one NVIDIA K20x per node
• Gemini interconnect
• #2 Top500 Jun 2013 (~17.59 PFlops Rmax)
Achievement: Save Power
serial multi-threaded, single GPU, NVIDIA Fermi generation
[Figure: energy and time-to-solution for three benchmarks on NVIDIA C2050 (Fermi) — Shilu-3, Water-on-Calcite, and AUSURF112 (k-point) — comparing 6 OpenMP threads, 6 MPI ranks, and 6 OpenMP + 1 GPU. The GPU runs give 3.1x to 3.67x speed-ups and cut the energy to solution by 54% to 58%. Tests run early 2012 @ ICHEC.]
Achievement: improved time-to-solution
[Figure: speed-ups of the GPU-accelerated runs range from ~2.1x to ~3.5x across the serial and parallel test cases. Parallel tests run on Wilkes; serial tests run on the SBN machine.]
Challenge: running on CRAY XK7
Key differences...
• AMD Bulldozer architecture: 2 cores share the same FPU pipeline → aprun -j 1
• NUMA locality matters a lot, for both CPU-only and CPU+GPU runs → aprun -cc numanode
• GPU Direct over RDMA is not supported (yet?) → the GPU-aware 3D-FFT path does not work
• "unfriendly" scheduling policy → the input has to be really big
Performance below expectation (<2x).
Tricks: many-pw.x, __USE_3D_FFT
Challenge: educate users
• the "performance portability" myth
• "configure, compile, run" works the same as in the CPU version
• all dependencies (MAGMA, phiGEMM) are compiled by QE-GPU
• no more than 2 MPI processes per GPU
  – Hyper-Q does not work automatically; an additional running daemon is needed
• forget about 1:1 output comparison
• QE-GPU can run on every GPU, but some GPUs are better than others...
Lessons learnt
being "heterogeneous" today and tomorrow
• GPUs do not really improve code scalability, only time-to-solution
• re-think data distribution for massively parallel architectures
• deal with uncontrolled "numerical fluctuations" (the GPU magnifies them)
• the "data movement" constraint will soon disappear
  → the new Intel Xeon Phi Knights Landing and NVIDIA Project Denver are expected by 2015
• look for true alternatives, new algorithms
  – not easy: extensive validation _plus_ module dependencies
• performance is a function of human effort
• follow the mantra: "Do what you are good at."
THANK YOU FOR YOUR ATTENTION!

Links:
• http://hpc.cam.ac.uk
• http://www.quantum-espresso.org/
• http://foundation.quantum-espresso.org/
• http://qe-forge.org/gf/project/q-e/
• http://qe-forge.org/gf/project/q-e-gpu/