SciDAC Software Infrastructure for Lattice Gauge Theory


Multi-grid Algorithms for QCD
Richard C. Brower
(Mike Clark, James Brannick, Claudio Rebbi et al)
CCS, University of Tsukuba
March 10, 2009
Future Code distribution at: http://www.usqcd.org/software.html
Outline
I. Failure of attempts in 1990's: Why?
II. Motivation and Current Progress
III. Adaptive Multigrid
– Wilson Dirac MG Algorithm
IV. Discussion of Future Directions
– Success of Adaptive MG: Why?
– Prune content of Null Space
– Actions: Domain Wall/Staggered MG
– Applications: Variance Reduction, RHMC
– Software Tools: Multi-level QMG API, ILDG
– Consequence for GPGPU (see Clark's talk)
I. Failure of attempts in 1990's: Why?
– Partial success (RG) at weak coupling
– Maintain gauge invariance
– Maintain γ₅ Hermiticity
– Learn from failure
Early QCD attempts: see Thomas Kalkreuter's review "MG Methods for Propagators in LGT", hep-lat/9409008.
Israel: Ben-Av, M. Harmatz, P.G. Lauwers & S. Solomon
Boston: Brower, Edwards, Rebbi & Vicari†
Amsterdam: A. Hulsebos, J. Smit & J.C. Vink
Hamburg: T. Kalkreuter, G. Mack & M. Speh
† R.C. Brower, R. Edwards, C. Rebbi and E. Vicari, "Projective multigrid for Wilson fermions", Nucl. Phys. B366 (1991) 689 (aka Spectral AMG, Tim Chartier, 199?)
2x2 Blocks for U(1) Dirac
[Figure: 2-d lattice at β = 1, with U(x) on links and ψ(x) on sites. Convergence: Gauss-Jacobi (diamond), CG (circle), V cycle (square), W cycle (star).]
Universal critical slowing down: the convergence rate is a universal function F(m l) of the quark mass times the confinement length l.
[Figure: Gauss-Jacobi (diamond), CG (circle), 3-level (square & star) at β = 3 (cross), 10 (plus), 100 (square).]
Instantons, confinement length l
[Animation: Derek Leinweber, http://www.physics.adelaide.edu.au/~dleinweb/VisualQCD/Nobel/index.html]
II. Motivation and Current Progress
Higher resolution QCD
• Lattice scales:
– a(lattice) << 1/M_proton << 1/m_π << L (box)
– 0.06 fermi << 0.2 fermi << 1.4 fermi << 6.0 fermi
– ratios: 3.3 x 7 x 4.25 ≈ 100 sites per dimension
• Consequences:
– Increasingly ill-conditioned Dirac operator
– Worse critical slowing down (CSD)
– O(100^4) lattice volume
– 1/4 Terabyte file for a single Dirac propagator
Improved Dirac Inverters
• Little progress in the last 20 years?
– Red-black preconditioning (DeGrand 1988)
• But recent progress (now that it's needed!):
– Eigenvector Deflation (Morgan/Wilcox, Orginos/Stathopoulos)
– Inexact Deflation + Schwarz Domain Decomposition (Lüscher)
– Adaptive Multi-grid (BU/TOPS)
BU Applied Math/Physics Collaboration! (with Mike Clark, Harvard U)
Curing ill-conditioning of Dirac operators
Slow convergence of the Dirac solver is due to small eigenvalues for vectors in the near null subspace S: D S ≈ 0.
[Figure: the multigrid V-cycle; smoothing on the fine grid, restriction to a smaller coarse grid, then prolongation (interpolation) back to the fine grid.]
The common feature of all Deflation, Schwarz and Multi-grid algorithms is to split the vector space into the near null space S and its complement S⊥.
intro to multigrid: Laplace Operator
Define the Prolongator P
Define the Restriction operator R = P†
Operator on the coarse space: A_c = P† A P
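The defining equations on this slide did not survive extraction. As an illustration only (my own sketch, not the talk's code; the two-site aggregation blocks are an assumed simple choice), here are the three definitions in numpy for the 1-d Laplace operator:

```python
import numpy as np

n = 8  # fine-grid size (assumed even)

# 1-d Laplace operator with Dirichlet boundaries
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Prolongator P: piecewise-constant interpolation from n/2 coarse sites
# to n fine sites; columns are normalized so that P^dagger P = 1
P = np.zeros((n, n // 2))
for i in range(n // 2):
    P[2 * i, i] = P[2 * i + 1, i] = 1 / np.sqrt(2)

# Restriction operator R = P^dagger (P is real, so just the transpose)
R = P.T

# Operator on the coarse space (Galerkin construction)
A_c = R @ A @ P
print(A_c.shape)  # (4, 4): half the degrees of freedom
```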
intro to multigrid: V-Cycle
• n-grid correction scheme: restrict the residual level by level until an exact solve is possible, then interpolate the correction back to the fine grid. A huge improvement.
• Essence of the multigrid V-Cycle: O(N) to O(N log N) scaling.
• Result of classical multigrid:
– MG can be used as a direct solver; more typically it is used as a Krylov preconditioner.
– In free field theory there is no critical slowing down.
– O(N): faster than an FFT at fixed precision!
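A minimal sketch of one such correction cycle, in the same numpy setting as above (again my own illustration; the weighted-Jacobi smoother and omega = 0.6 are assumptions):

```python
import numpy as np

def v_cycle(A, P, b, x, n_smooth=2, omega=0.6):
    """One two-grid correction cycle for A x = b with prolongator P."""
    D_inv = 1.0 / np.diag(A)
    for _ in range(n_smooth):               # pre-smooth (weighted Jacobi)
        x = x + omega * D_inv * (b - A @ x)
    r_c = P.T @ (b - A @ x)                 # restrict the residual
    A_c = P.T @ A @ P                       # Galerkin coarse operator
    e_c = np.linalg.solve(A_c, r_c)         # solve exactly on the coarse grid
    x = x + P @ e_c                         # interpolate back to the fine grid
    for _ in range(n_smooth):               # post-smooth
        x = x + omega * D_inv * (b - A @ x)
    return x
```

For more than two levels the exact coarse solve is replaced by a recursive call; one application of v_cycle is then what gets wrapped inside a Krylov method as a preconditioner.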
General Problem: D ψ = b
• "Split" the vector space into:
– near null (D S ≈ 0) & complement S⊥
• Schur decomposition (of course) does this!
– Coarse = near null (IR), Fine = complement (UV)
The Schur factorization implies the block solution below:
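The block equations on this slide were lost in extraction; the standard Schur form they presumably displayed is (my reconstruction, with f/c labeling the UV/IR blocks):

$$
D \;=\; \begin{pmatrix} D_{ff} & D_{fc} \\ D_{cf} & D_{cc} \end{pmatrix}
\;=\; \begin{pmatrix} 1 & 0 \\ D_{cf} D_{ff}^{-1} & 1 \end{pmatrix}
\begin{pmatrix} D_{ff} & 0 \\ 0 & S_{cc} \end{pmatrix}
\begin{pmatrix} 1 & D_{ff}^{-1} D_{fc} \\ 0 & 1 \end{pmatrix},
\qquad
S_{cc} \;=\; D_{cc} - D_{cf}\, D_{ff}^{-1} D_{fc},
$$

so that solving D ψ = b reduces to an easy (UV) solve with D_ff plus a small (IR) solve with the Schur complement S_cc.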
Block solution to D ψ = b: forward/back substitution through the factorization above.
• Questions:
– How to choose the splitting?
– How to iterate to find the solution?
3 approaches to the near null space
1. "Deflation": N_ν exact eigenvector projection
2. "Inexact deflation" plus Schwarz (Lüscher), with the little Dirac operator D_c = P† D P
3. Multi-grid preconditioning
– 2 & 3 use the same splitting into S and S⊥
Eigenvalue Deflation (Orginos & Stathopoulos): the number of eigenvalues needed scales like O(N).
2-level Multigrid Cycle (simplified)
• Smooth: x' = (1 - D) x + b  ⇒  r' = (1 - D) r
• Project: D_c = P† D P  &  r_c = P† r
• Solve: D_c e_c = r_c  ⇒  e_c = D_c^{-1} P† r
• Prolongate: e = P e_c
• Update: x' = x + e  ⇒  r' = b - D(x + e) = [1 - D P (P† D P)^{-1} P†] r
RESULT: D is preconditioned by the coarse-grid correction M = P (P† D P)^{-1} P†: one step of M D x = M b gives r' = (1 - D M) r.
Note: since P† r' = 0, this is full (exact) deflation on S.
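The exact-deflation property P† r' = 0 is easy to verify numerically. A small self-contained check (my own sketch, with a random well-conditioned matrix standing in for the Dirac operator):

```python
import numpy as np

rng = np.random.default_rng(1)
N, Nc = 12, 4                                      # fine / coarse dimensions
D = rng.standard_normal((N, N)) + N * np.eye(N)    # stand-in "Dirac" operator
P, _ = np.linalg.qr(rng.standard_normal((N, Nc)))  # prolongator with P^T P = 1
r = rng.standard_normal(N)

# Coarse-grid correction M = P (P^dagger D P)^{-1} P^dagger
M = P @ np.linalg.inv(P.T @ D @ P) @ P.T
r_new = r - D @ (M @ r)                            # r' = (1 - D M) r

print(np.linalg.norm(P.T @ r_new))                 # ~1e-15: P^dagger r' = 0
```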
Choosing the Prolongator P and Restrictor R = P†?
Relax from random vectors to find near null vectors ψ_s.
Cut the lattice up into blocks of size 4^d (picture: d = 2).
For d = 1 and s = 1, P has one block column per 4-site block:
$$
P = \begin{pmatrix}
\psi_1 & 0 & \cdots \\
\psi_2 & 0 & \cdots \\
\psi_3 & 0 & \cdots \\
\psi_4 & 0 & \cdots \\
0 & \psi_5 & \cdots \\
0 & \psi_6 & \cdots \\
0 & \psi_7 & \cdots \\
0 & \psi_8 & \cdots \\
\vdots & \vdots & \ddots
\end{pmatrix}
$$
P†: fine → coarse (a non-square matrix†)
[Figure: P maps the coarse lattice vector space into the fine one, P† maps back; ker(P†) = S⊥ (UV), span(P) = S (IR).]
But P† P = 1_cc, so Ker(P) = 0.
S = span(P) = Image(P)
rank(P) = rank(P†) = dim(S) = N_ν N_B = 2 N_ν L^4/4^4
† See the front cover of Gilbert Strang's undergraduate text!
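A minimal sketch of this block construction (my own, for d = 1 with 4-site blocks and N_ν ≤ 4 scalar null vectors; a real QCD code would also carry spin-color indices and split the two chiralities):

```python
import numpy as np

def build_prolongator(null_vecs, block=4):
    """Chop near-null vectors into blocks and orthonormalize within each
    block, giving a block-diagonal P with P^dagger P = 1 on the coarse space.
    null_vecs: (n_vec, n) array, row s holding psi_s on the fine lattice;
    assumes n_vec <= block."""
    n_vec, n = null_vecs.shape
    n_blocks = n // block
    P = np.zeros((n, n_blocks * n_vec), dtype=null_vecs.dtype)
    for b in range(n_blocks):
        chunk = null_vecs[:, b * block:(b + 1) * block].T   # (block, n_vec)
        q, _ = np.linalg.qr(chunk)       # orthonormalize the block's columns
        P[b * block:(b + 1) * block, b * n_vec:(b + 1) * n_vec] = q
    return P
```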
Oblique Projector Algebra of the splitting
P itself is not a "proper projection operator" (P² ≠ P does not even parse for non-square P). The projection operators (Π² = Π), and Lüscher's "oblique" projectors P_L = 1 - Π_L and P_R = 1 - Π_R, take the form given below.
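The projector formulas themselves were lost in extraction; the standard construction from inexact deflation (my reconstruction, writing D_c = P† D P) is:

$$
\Pi_R \;=\; P\, D_c^{-1} P^\dagger D, \qquad
\Pi_L \;=\; D\, P\, D_c^{-1} P^\dagger, \qquad
\Pi_L^2 = \Pi_L, \;\; \Pi_R^2 = \Pi_R,
$$
$$
P_L \;=\; 1 - \Pi_L, \qquad P_R \;=\; 1 - \Pi_R, \qquad P_L\, D \;=\; D\, P_R .
$$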
• The real algorithm has lots of tuning!
– MG proofs exist only for the normal equation (D†D ψ = D†b)
– Multigrid is recursive to multiple levels.
– Near null vectors ψ_s(x) are found by recursive use of MG itself.
– Preserves γ₅ ([γ₅, P] = 0) and gauge invariance.
– Pre- and post-smoothing is done by Minimal Residual (a minimal sketch follows the benchmark list below).
– The entire cycle is used as a preconditioner in CG.
• Current benchmarks for Wilson-Dirac:
– V = 16³x32, β = 6.0, m_crit = -0.8049
– Coarse lattice block = 4^4 x N_c x 2, N_ν = 20.
– 3-level V(2,2) MG cycle.
– 1 CG application per 6 Dirac applications.
– Note N_ν scales as O(1), but deflation needs N_ν = O(V).
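The Minimal Residual smoother referred to above is a one-parameter line search along the residual; a sketch (my own, not the collaboration's code):

```python
import numpy as np

def mr_smooth(D, b, x, n_iter=2):
    """Minimal Residual smoothing: x <- x + alpha r with the alpha that
    minimizes ||b - D (x + alpha r)||."""
    for _ in range(n_iter):
        r = b - D @ x
        Dr = D @ r
        x = x + (np.vdot(Dr, r) / np.vdot(Dr, Dr)) * r
    return x
```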
Reducing N_ν by pruning
Multigrid QCD TOPS project
αSA/αAMG: Adaptive Smoothed Aggregation Algebraic MultiGrid
see Oct 10-10 workshop (http://super.bu.edu/~brower/MGqcd/)
αSA/αAMG timings for QCD
Brannick, Brower, Clark, McCormick, Manteuffel, Osborn and Rebbi,
"The removal of critical slowing down", Lattice 2008 proceedings
Multigrid vs EigCG
msea = -0.4125.
163 x 64 asymmetric lattice (Mike Clark’s figure)
IV. Discussion of Future Directions
– Success of Adaptive MG: Why?
– Prune content of Null Space
– Other Lattice Actions
• Domain Wall (or Overlap?) w. Scott MacLachlan at Tufts
• Staggered w. Carleton DeTar and Mehmet Oktay at Utah
– Applications:
• Multiple RHS (all-to-all/disconnected)
• Variance Reduction (BU disco project)
• RHMC (see Lüscher)
Instantons, Topological Zero Modes (Atiyah-Singer index) and confinement length l
Physics: Disconnected Diagrams
[Figure: connected vs. disconnected quark-line diagrams for the nucleon N, with u,d lines connected and a u,d,s loop disconnected. We want the matrix element with source at t = 0, operator insertion at t = t', and sink at t = t_f.]
How strange† is the proton? Who cares?
• Violation of the Standard Model:
– Dark matter (neutralino scattering)
– NuTeV anomaly
• Nucleon physics (including u/d + s quarks):
– iso-scalar form factors, nucleon structure functions,
– spin crisis for the proton, matrix elements, etc.
† See Lattice 2008: Ohki et al., plenary talk; S. Collins, G. Bali, A. Schäfer, "Hunting for the strangeness ... nucleon"; Takumi Doi et al., "Strangeness and glue in the nucleon from lattice QCD"; Ron Babich et al., "Strange quark content of the nucleon".
Direct detection of dark matter
• In SUSY, the neutralino scatters from a nucleon via Higgs exchange.
• The strange scalar matrix element is a major uncertainty:
• Uncertainty in f_Ts gives up to a factor of 4 uncertainty in the cross-section!
• Bottino et al., hep-ph/0111229; Ellis et al., hep-ph/0502001
Multi-grid Variance Reduction
• The signal and variance of the first term (below) are down by 1 to 2 orders of magnitude, because D_c reproduces the infrared of D.
• The coarse-level trace of D_c^{-1} is as cheap to calculate as the one-level-down operator inverse.
• This can of course be done recursively, giving an O(N log N) trace calculation to fixed tolerance(?)
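The decomposition behind these bullets did not survive extraction; presumably (my reconstruction, using P† P = 1_cc so that Tr[P D_c^{-1} P†] = Tr D_c^{-1}) it is the two-term split

$$
\mathrm{Tr}\, D^{-1} \;=\; \mathrm{Tr}\big[\, D^{-1} - P\, D_c^{-1} P^\dagger \,\big] \;+\; \mathrm{Tr}\, D_c^{-1},
\qquad D_c = P^\dagger D P,
$$

with the first term estimated stochastically (small signal and variance, since P D_c^{-1} P† captures the infrared of D^{-1}) and the second term evaluated entirely on the coarse level, recursively if desired.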
Application to HMC: Lüscher's intermittent† update of the S subspace (arXiv:0710.5417)
† Combined with the "Chronological Inverter", Brower, Ivanenko, Levi, Orginos
IV. More Future Directions
– Software Tools: Multi-level SciDAC API
• QMG w. James Osborn
– ILDG
• Store the MG preconditioner with the lattice?
– Consequences for GPGPU
• Mike Clark's talk
SciDAC-2 QCD API (SciDAC-1/SciDAC-2 = Gold/Blue)
Level 4: Application Codes: MILC / CPS / Chroma; QCD Physics Toolbox (shared algorithms, building blocks, visualization, performance tools); TOPS; PERI; workflow and data-analysis tools; runtime, accounting, grid, reliability
Level 3: QOP (optimized in asm): Dirac operator, inverters, force, etc.
Level 2: QDP (QCD Data Parallel): lattice-wide operations, data shifts; QIO: Binary/XML files & ILDG
Level 1: QLA (QCD Linear Algebra), QMP (QCD Message Passing), QMT (QCD Threads: multi-core)
Need a Dirac Propagator Farm
• The Clark-Kennedy RHMC Paradox: the faster you go, the harder it is to keep up.
• Analysis is now the "Ἀχιλλεύς (Achilles) heel".
Nvidia Tesla Quad S1070 1U System, $8K
Processors:        4 x Tesla T10P
Number of cores:   960
Core clock:        1.5 GHz
Performance:       4 Teraflops
Memory:            16.0 GB
Memory bandwidth:  408 GB/sec
Memory I/O:        2048-bit, 800 MHz
Form factor:       1U (EIA 19" rack)
System I/O:        2 x PCIe x16 Gen2
Typical power:     700 W
GPGPU: 240-core CUDA† code
Nvidia's C extension: all GPGPU architectures† from Nvidia (Tesla), AMD/ATI and Intel (Larrabee) will have a common language:
OpenCL (Open Computing Language), http://www.khronos.org/registry/cl/
† Commercial Break:
(QCDNA in Boston Fall 2009?)