Beyond GEMM: How Can We Make Quantum Chemistry Fast?
or: Why Computer Scientists Don’t Like Chemists
Devin Matthews
9/25/14, 2014 BLIS Retreat

A Motivating Example
Equation-of-Motion Coupled Cluster Theory: what is the difference in energy between the ground and excited states of some molecule?
[Equation: an eigenvalue problem whose three parts are labelled as follows]
“matrix”: Describes the interactions in the system. The bar means it is “dressed” (i.e. tuned to a specific ground state).
“vector”: Describes the excited state. Should be an eigenvector of H.
scalar: The energy difference.
[Diagram: states S0 and S1, with the gap between them marked “E?”]

This is Linear Algebra, But…
R1 R2 R3 R4
Tensors!

This is Linear Algebra, But…
(+ all permutations!)

…It’s Really Multi-(non)-linear Algebra
Hundreds of tensor contractions in a single “matrix-vector multiply”…

Oh Yeah, It’s Sparse Too…
O2 ~0.002% non-zero…
~0.39% non-zero…

Oh Yeah, It’s Sparse Too…
Spin-orbital: 100.0%
+Symmetry: 0.174%
+Spin-integration: 0.047%
+Non-orthogonal spin-adaptation: 0.016%
+More symmetry

Oh Yeah, It’s Sparse Too…
[Figure: the tensor split into blocks labelled ABEF, with indices 0001, 0002, 0010, 0011, 0012, …]
• This symmetry is very unwieldy to use and maintain when using GEMM.
• Blocks may be distributed to disk or other processors.
• No symmetry makes using GEMM easier.
• This tensor may be very large and need to be split amongst several processors or be cached to disk.
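The remark that blocks without symmetry make GEMM easier can be illustrated with a minimal sketch. It assumes a dense contraction Z[a,b,i,j] = sum over e,f of W[a,b,e,f] * T[e,f,i,j]; the tensor names, index layout, and sizes are illustrative choices, not taken from the slides. Grouping (a,b), (e,f), and (i,j) into single row and column indices turns the contraction into one matrix multiply, with a plain triple loop standing in for an optimized GEMM.

```python
import itertools
import random


def matmul(A, B):
    """Plain triple-loop matrix multiply on lists of lists (stand-in for GEMM)."""
    k = len(B)
    return [[sum(row[x] * B[x][c] for x in range(k)) for c in range(len(B[0]))]
            for row in A]


def contract_as_gemm(W, T, nv, no):
    """Z[a,b,i,j] = sum_{e,f} W[a,b,e,f] * T[e,f,i,j] via one matrix multiply."""
    # Matricize: rows of W are (a,b) pairs, columns are (e,f) pairs.
    W_mat = [[W[a, b, e, f] for e in range(nv) for f in range(nv)]
             for a in range(nv) for b in range(nv)]
    # Rows of T are (e,f) pairs in the same order, columns are (i,j) pairs.
    T_mat = [[T[e, f, i, j] for i in range(no) for j in range(no)]
             for e in range(nv) for f in range(nv)]
    Z_mat = matmul(W_mat, T_mat)
    # Un-flatten the row index a*nv+b and column index i*no+j.
    return {(a, b, i, j): Z_mat[a * nv + b][i * no + j]
            for a, b, i, j in itertools.product(range(nv), range(nv),
                                                range(no), range(no))}


def contract_direct(W, T, nv, no):
    """Reference: the same contraction written as an explicit sum."""
    return {(a, b, i, j): sum(W[a, b, e, f] * T[e, f, i, j]
                              for e in range(nv) for f in range(nv))
            for a, b, i, j in itertools.product(range(nv), range(nv),
                                                range(no), range(no))}


random.seed(0)
nv, no = 3, 2  # hypothetical sizes for the two index ranges
W = {idx: random.uniform(-1, 1)
     for idx in itertools.product(range(nv), repeat=4)}
T = {idx: random.uniform(-1, 1)
     for idx in itertools.product(range(nv), range(nv), range(no), range(no))}

Z_gemm = contract_as_gemm(W, T, nv, no)
Z_ref = contract_direct(W, T, nv, no)
max_error = max(abs(Z_gemm[key] - Z_ref[key]) for key in Z_ref)
print("largest difference between the two results:", max_error)
```

When only the unique elements of a symmetric tensor are stored, this simple index flattening no longer applies directly, which is the difficulty the slide points to.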
[Block diagram label: ijkl = 0000]

Oh Yeah, It’s Sparse Too…
The final reduction from 0.016% to ~0.002% in the previous example is due to point group symmetry:
[Figure: labels a, b, ij, ab]

Adding It All Up
1 matrix-vector multiply: 100s-1000s of tensor contractions
X 1 complicated tensor: 100s-1000s of simpler tensors
X Point group symmetry: multiple GEMMs per contraction
X Column symmetry: 10s of permutations
X Solution of eigenproblem: 10s of iterations
= Potentially billions (!!) of calls to GEMM

The Big Picture
Chemistry
“Simple” eigenproblem…
In terms of tensors…
In terms of other tensors…
With structured sparsity…
With symmetry…
With slicing (or blocking etc.)…
With more sparsity…
In terms of matrices.
Linear Algebra

Status Quo (CFOUR)
“Simple” eigenproblem…
In terms of tensors…
In terms of other tensors…
With structured sparsity…
With symmetry…
With slicing (or blocking etc.)…
With more sparsity…
In terms of matrices.
[Side labels on the diagram, in transcript order: Layer 4, Me, Layer 3, Layer 2, MPI + OMP, Layer 1, OMP, Someone Else]

Dealing With Chemistry: Large Scale
[Figure: Node 1 … Node 9]
Pros:
• Each block has little to no symmetry/sparsity.
• Blocks can be distributed in many ways.
• Load balancing can be static or dynamic.
Cons:
• Blocks require padding for edge cases. Padding can be excessive for many dimensions or short edge lengths.
• To avoid padding, some blocks must keep complex structure.

Dealing With Chemistry: Large Scale
Pros:
• Load balancing is automatic.
• Communication is regular.
• Little to no padding needed.
• Can be composed with blocking.
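A rough sketch of why load balancing differs between the two layouts, under assumptions the slides do not state: the first layout is taken to be a contiguous block distribution and the second an element-cyclic one, and a matrix that stores only its i < j elements stands in for a tensor with structured sparsity. The grid size, matrix size, and function names are hypothetical.

```python
from collections import Counter


def owner_block(i, j, n, p):
    """Owner of element (i, j) on a p x p grid using contiguous blocks."""
    edge = -(-n // p)  # ceil(n / p): block edge length
    return (i // edge, j // edge)


def owner_cyclic(i, j, n, p):
    """Owner of element (i, j) on a p x p grid using an element-cyclic layout."""
    return (i % p, j % p)


def load_per_node(owner, n, p):
    """Count the stored elements (i < j only) owned by each of the p*p nodes."""
    counts = Counter({(r, c): 0 for r in range(p) for c in range(p)})
    for i in range(n):
        for j in range(i + 1, n):
            counts[owner(i, j, n, p)] += 1
    return counts


n, p = 12, 3  # 12 x 12 matrix on a 3 x 3 grid (9 nodes)
block_load = load_per_node(owner_block, n, p)
cyclic_load = load_per_node(owner_cyclic, n, p)

print("block  min/max per node:", min(block_load.values()), max(block_load.values()))
print("cyclic min/max per node:", min(cyclic_load.values()), max(cyclic_load.values()))
```

In this example the block layout leaves three nodes with no stored elements while three others hold 16 each, whereas the cyclic layout gives every node between 6 and 10.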
[Figure: Node 1 … Node 9]
Cons:
• Complex structure is retained at all levels.
• Communication and local computation need to take this structure into account.

Dealing With Chemistry: Small Scale
The Old Way
The New Way?
= Memory movement
[Figure: a page from the paper with running header “BLIS: A Framework” (p. 0:15), showing its Fig. 11, a GEMM blocking diagram with rows labelled “BLAS:” and “BLIS:”, the micro-kernel, block sizes mc, kc, mr, nr, and the matrices Ai, B, Ci and packed Ãi, B̃.]
Fig. 11. Illustration of the various levels of blocking and related packing when implementing GEMM in the style of [Goto and van de Geijn 2008a]. Here, mc and kc serve as cache block sizes used by the higher-level blocked algorithms to partition the matrix problem down to a so-called “block-panel” subproblem (depicted in the middle of the diagram), implemented in BLIS as a portable macro-kernel. Similarly, mr and nr serve as register block sizes for the micro-kernel in the m and n dimensions, respectively, which also correspond to the length and width of the individual packed panels of matrices Ãi and B̃, respectively.
5.1. Level-3 techniques
Level-3 operations make up perhaps the most-used portion of the BLAS. These operations are also typically the most difficult to implement, in part because the space of potential solutions is much larger than that of level-2 and lower operations. It may, therefore, seem counterintuitive that the portion of the BLIS framework associated with level-3 operations is actually simpler (and smaller) than that of the remaining operations. We now walk through the methods used to implement this functionality.

Dealing With Chemistry: Small Scale
[Figure: the same BLIS GEMM blocking diagram (Fig. 11), now annotated with the tensor Z (index groups abcd, kl) and the label “AXPY!”.]
[Further labels on the figure: tensors R and W, with index groups kl, mn, abcd, mn. The paper excerpt behind the figure continues with “5.1.1. Operation. Perhaps the most obvious framework dimension we wish to manage” and is cut off there.]

Flexibility Through Interfaces
[Diagram; labels in the order they appear in the transcript:]
Capabilities: Tensor<…>
Basic Operator
Similarity-transform operator
Spin-orbital operator
Index permutation symmetry
Distributed
Point group symmetry
Commutator expansion
Factorization, operator resolution
Tensor<DIST|IPS|SO|PGS>
Spin-integration or spin-adaptation
Blocking/packing
Tensor<DIST|IPS>
(Basic tensor functionality)

Summary
• Chemistry is hard.
• A fast GEMM implementation is nice, but doesn’t go far enough.
• Complex structure can be dealt with
  – By breaking the problem into simple blocks,
  – By incorporating the structure into communication and computation,
  – By relating a complex object to a simpler one (a matrix) bit by bit.
• Layered and composable interfaces are important.
  – Implementations written at a “high level” can use “low level” interfaces through intermediate ones.
  – Adapters can go from one well-defined interface to another.

Thanks!
BLIS: Field van Zee, Tyler Smith, Many others…
Tensormental: Martin Schatz, Bryan Marker
Robert van de Geijn
John Stanton
CTF/AQ: Edgar Solomonik, Jeff Hammond
Tensor packing: Woody Austin, Martin Schatz
The CFOUR developers