Beyond GEMM: How Can We Make Quantum Chemistry Fast?
or: Why Computer Scientists Don’t Like Chemists
Devin Matthews
9/25/14
2014 BLIS Retreat
A Motivating Example
Equation-of-Motion Coupled Cluster Theory: what is the difference in energy between the ground and excited states of some molecule?
“matrix”: Describes the interactions in the system. The bar means it is “dressed” (i.e. tuned to a specific ground state).
“vector”: Describes the excited state. Should be an eigenvector of H.
scalar: The energy difference.
[Figure: energy-level diagram with ground state S0, excited state S1, and the unknown energy gap E between them]
This is Linear Algebra, But…
R1, R2, R3, R4
Tensors!
This is Linear Algebra, But…
(+ all permutations!)
…It’s Really Multi-(non)-linear Algebra
Hundreds of tensor contractions in a single “matrix-vector multiply”…
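A minimal NumPy sketch of what one such tensor contraction looks like and how it reduces to a matrix multiply (GEMM). The tensor names W, R, and Z, the index pattern, and the sizes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
no, nv = 3, 4  # hypothetical numbers of occupied and virtual orbitals

# Hypothetical four-index tensors W[a,b,e,f] and R[e,f,i,j].
W = rng.standard_normal((nv, nv, nv, nv))
R = rng.standard_normal((nv, nv, no, no))

# The contraction Z[a,b,i,j] = sum over e,f of W[a,b,e,f] * R[e,f,i,j].
Z_einsum = np.einsum("abef,efij->abij", W, R)

# The same contraction as one GEMM: group (a,b), (e,f), and (i,j) into
# compound matrix indices, multiply, then unfold the result.
W_mat = W.reshape(nv * nv, nv * nv)
R_mat = R.reshape(nv * nv, no * no)
Z_gemm = (W_mat @ R_mat).reshape(nv, nv, no, no)

# If the summed indices are not adjacent in memory, the tensor has to be
# permuted (copied) before GEMM can be used. R2 holds the same data as R,
# but stored as R2[i,e,j,f].
R2 = np.ascontiguousarray(R.transpose(2, 0, 3, 1))
R2_mat = R2.transpose(1, 3, 0, 2).reshape(nv * nv, no * no)  # forces a copy
Z_permuted = (W_mat @ R2_mat).reshape(nv, nv, no, no)
```

All three results agree. The extra copy in the last variant is the kind of data movement that sits outside GEMM itself.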
Oh Yeah, It’s Sparse Too…
O2
~0.002% non-zero…
~0.39% non-zero…
Oh Yeah, It’s Sparse Too…
Spin-orbital: 100.0%
+Symmetry: 0.174%
+Spin-integration: 0.047%
+Non-orthogonal spin-adaptation: 0.016%
+More symmetry
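The slide does not say which tensor these percentages refer to. As a rough illustration of how index permutation symmetry alone shrinks storage, the sketch below assumes a tensor with two groups of k mutually antisymmetric indices, each index running over n values. The function names are hypothetical. For k = 4 the large-n limit is 1/(4!)^2, about 0.174%, which is consistent with the “+Symmetry” row, although that reading is an assumption.

```python
from math import comb, factorial

def unique_fraction(n, k, groups=2):
    """Fraction of a dense n**(k*groups) array that must be stored when each
    group of k indices is fully antisymmetric (only a < b < c < ... is kept)."""
    return (comb(n, k) / n**k) ** groups

def limit_fraction(k, groups=2):
    """Large-n limit of unique_fraction: (1/k!) per antisymmetric group."""
    return (1 / factorial(k)) ** groups

# Assumed example: two groups of four antisymmetric indices.
limit_percent = 100 * limit_fraction(4)
finite_percent = 100 * unique_fraction(1000, 4)
```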
Oh Yeah, It’s Sparse Too…
• This symmetry is very unwieldy to use and maintain when using GEMM.
• This tensor may be very large and need to be split amongst several processors or be cached to disk.
[Figure: the tensor split into blocks for ijkl = 0000, labeled ABEF 0001, ABEF 0002, ABEF 0010, ABEF 0011, ABEF 0012, …]
• Blocks may be distributed to disk or other processors.
• No symmetry makes using GEMM easier.
Oh Yeah, It’s Sparse Too…
The final reduction from 0.016% to ~0.002% in the previous
example is due to point group symmetry:
[Figure: diagram labeled with the indices a, b, ij, and ab]
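A small sketch of why point group symmetry removes whole blocks. It assumes an Abelian point group whose irreducible representations (irreps) are labeled 0 to 7 so that the direct product of two irreps is the bitwise XOR of their labels, and a totally symmetric tensor, so a block is non-zero only when the irreps of its indices multiply to label 0. The function name is hypothetical. Under these assumptions only 1 block in 8 survives, the same factor as the step from 0.016% to ~0.002%.

```python
from functools import reduce
from itertools import product
from operator import xor

def allowed_blocks(n_irreps, n_indices):
    """List the irrep combinations whose direct product is totally symmetric."""
    return [
        irreps
        for irreps in product(range(n_irreps), repeat=n_indices)
        if reduce(xor, irreps) == 0
    ]

# Assumed example: 8 irreps and a four-index tensor.
blocks = allowed_blocks(8, 4)
total_blocks = 8**4
fraction = len(blocks) / total_blocks
```

Each surviving block is a dense tensor of its own, which is why one contraction turns into multiple GEMM calls.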
Adding It All Up
1 matrix-vector multiply → 100s-1000s of tensor contractions
×
1 complicated tensor → 100s-1000s of simpler tensors
×
Point group symmetry → Multiple GEMMs per contraction
×
Column symmetry → 10s of permutations
×
Solution of eigenproblem → 10s of iterations
Potentially billions (!!) of calls to GEMM
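The multiplication above, worked through as arithmetic. The slide only gives orders of magnitude, so the specific low and high values below are assumptions chosen for illustration.

```python
from math import prod

# (low, high) guesses for each factor; only the orders of magnitude
# come from the slide, the exact numbers are assumed.
factors = {
    "tensor contractions per matrix-vector multiply": (100, 1000),
    "simpler tensors per complicated tensor": (100, 1000),
    "GEMMs per contraction (point group symmetry)": (2, 10),
    "permutations (column symmetry)": (10, 30),
    "eigensolver iterations": (10, 30),
}

low_estimate = prod(low for low, _ in factors.values())
high_estimate = prod(high for _, high in factors.values())
```

With these assumed values the count runs from millions at the low end to billions at the high end.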
The Big Picture
Chemistry
“Simple” eigenproblem…
In terms of tensors…
In terms of other tensors…
With structured sparsity…
With symmetry…
With slicing (or blocking etc.)…
With more sparsity…
In terms of matrices.
Linear Algebra
Status Quo (CFOUR)
“Simple” eigenproblem…
In terms of tensors…
In terms of other tensors…
With structured sparsity…
With symmetry…
With slicing (or blocking etc.)…
With more sparsity…
In terms of matrices.
Layer 4: Me
Layer 3
Layer 2: MPI + OMP
Layer 1: OMP, Someone Else
Dealing With Chemistry: Large Scale
[Figure: tensor blocks assigned to Node 1 through Node 9]
Pros:
• Each block has little to no symmetry/sparsity.
• Blocks can be distributed in many ways.
• Load balancing can be static or dynamic.
Cons:
• Blocks require padding for edge cases. Padding can be excessive for many dimensions or short edge lengths.
• To avoid padding, some blocks must keep complex structure.
Dealing With Chemistry: Large Scale
[Figure: tensor elements spread across Node 1 through Node 9]
Pros:
• Load balancing is automatic.
• Communication is regular.
• Little to no padding needed.
• Can be composed with blocking.
Cons:
• Complex structure is retained at all levels.
• Communication and local computation need to take this structure into account.
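A sketch comparing the two large-scale strategies on a toy problem. The slides do not name the second scheme; here it is assumed to be an element-cyclic distribution, and the terms “blocked” and “cyclic”, the function name, and the sizes are all illustrative. The toy tensor is the lower triangle of an n × n symmetric matrix spread over a p × p grid of nodes.

```python
import numpy as np

def owner_counts(n, p, cyclic):
    """Count how many stored elements (lower triangle, i >= j) land on each
    node of a p x p grid under a blocked or a cyclic distribution."""
    counts = np.zeros((p, p), dtype=int)
    block = -(-n // p)  # block edge length, rounded up
    for i in range(n):
        for j in range(i + 1):
            if cyclic:
                row, col = i % p, j % p            # deal elements out round-robin
            else:
                row, col = i // block, j // block  # contiguous blocks
            counts[row, col] += 1
    return counts

# Assumed example: a 9 x 9 symmetric matrix on a 3 x 3 grid (9 nodes).
blocked = owner_counts(9, 3, cyclic=False)
cyclic = owner_counts(9, 3, cyclic=True)
```

In the blocked layout the nodes above the diagonal own nothing, while the off-diagonal blocks that remain are dense and symmetry-free. In the cyclic layout every node owns a share, but each local piece still carries the triangular structure.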
Dealing With Chemistry: Small Scale
The Old Way
The New Way?
Memory movement
[Figure: Fig. 11 from “BLIS: A Framework”, p. 0:15, showing the BLAS and BLIS layers of blocking and packing (mc, kc, n, mr, nr) around the micro-kernel]
Fig. 11. Illustration of the various levels of blocking and related packing when implementing GEMM in the style of [Goto and van de Geijn 2008a]. Here, mc and kc serve as cache block sizes used by the higher-level blocked algorithms to partition the matrix problem down to a so-called “block-panel” subproblem (depicted in the middle of the diagram), implemented in BLIS as a portable macro-kernel. Similarly, mr and nr serve as register block sizes for the micro-kernel in the m and n dimensions, respectively, which also correspond to the length and width of the individual packed panels of matrices Ãi and B̃, respectively.
5.1. Level-3 techniques
Level-3 operations make up perhaps the most-used portion of the BLAS. These operations are also typically the most difficult to implement, in part because the space of potential solutions is much larger than that of level-2 and lower operations. It may, therefore, seem counterintuitive that the portion of the BLIS framework associated with level-3 operations is actually simpler (and smaller) than that of the remaining operations. We now walk through the methods used to implement this functionality.
Dealing With Chemistry: Small Scale
[Figure: the same BLIS blocking diagram (Fig. 11 from “BLIS: A Framework”, p. 0:15), annotated with the tensors Z, R, and W, the index labels abcd, kl, and mn, and the note “AXPY!”]
5.1.1. Operation. Perhaps the most obvious framework dimension we wish to manage…
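One possible reading of “The New Way?” is to fold the tensor-to-matrix rearrangement into the packing step of a blocked GEMM, so no separate transposed copy of the whole tensor is made. The sketch below is a NumPy illustration of that idea under stated assumptions: the helper names, the offset-vector approach, and the tiny block sizes are hypothetical, and this is not BLIS code.

```python
import numpy as np

def offsets(length_stride_pairs):
    """Flat memory offsets of a compound index built from (length, stride) pairs."""
    off = np.zeros(1, dtype=np.intp)
    for length, stride in length_stride_pairs:
        off = (off[:, None] + stride * np.arange(length)[None, :]).ravel()
    return off

def tensor_gemm(T_flat, row_off, col_off, B, mc=4, kc=4, mr=2, nr=2):
    """C = A @ B, where A[r, c] = T_flat[row_off[r] + col_off[c]] is never
    formed as a whole; only mc x kc blocks are gathered while packing."""
    m, k, n = len(row_off), len(col_off), B.shape[1]
    C = np.zeros((m, n))
    for pc in range(0, k, kc):
        B_pack = np.ascontiguousarray(B[pc:pc + kc, :])  # packed panel of B
        for ic in range(0, m, mc):
            # Packing gathers directly from the tensor's own layout.
            A_pack = T_flat[row_off[ic:ic + mc, None] + col_off[None, pc:pc + kc]]
            for jr in range(0, n, nr):
                for ir in range(0, A_pack.shape[0], mr):
                    # Stand-in for the mr x nr micro-kernel update.
                    C[ic + ir:ic + ir + mr, jr:jr + nr] += (
                        A_pack[ir:ir + mr] @ B_pack[:, jr:jr + nr]
                    )
    return C

# Assumed example: T[a,e,b,f] with rows (a,b) and columns (e,f).
na, ne, nb, nf, n = 3, 2, 2, 3, 5
rng = np.random.default_rng(0)
T = rng.standard_normal((na, ne, nb, nf))
B = rng.standard_normal((ne * nf, n))
row_off = offsets([(na, ne * nb * nf), (nb, nf)])
col_off = offsets([(ne, nb * nf), (nf, 1)])
C = tensor_gemm(T.ravel(), row_off, col_off, B)
C_ref = np.einsum("aebf,efn->abn", T, B.reshape(ne, nf, n)).reshape(na * nb, n)
```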
Flexibility Through Interfaces
Capabilities:
Index permutation symmetry
Distributed
Point group symmetry
Basic Operator
Similarity-transform operator
Spin-orbital operator
Commutator expansion
Factorization, operator resolution
Spin-integration or spin-adaptation
Blocking/packing
Tensor<…>
Tensor<DIST|IPS|SO|PGS>
Tensor<DIST|IPS>
(Basic tensor functionality)
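A minimal sketch of capability-tagged tensors in the spirit of Tensor<DIST|IPS|SO|PGS>. It assumes DIST, IPS, SO, and PGS stand for distributed, index permutation symmetry, spin-orbital, and point group symmetry. The class and function names are hypothetical and this is not the interface of any existing library.

```python
from enum import Flag, auto

class Cap(Flag):
    DIST = auto()  # distributed over processors (assumed meaning)
    IPS = auto()   # index permutation symmetry (assumed meaning)
    SO = auto()    # spin-orbital (assumed meaning)
    PGS = auto()   # point group symmetry (assumed meaning)

class Tensor:
    """A tensor tagged with the structure it still carries."""
    def __init__(self, name, caps):
        self.name = name
        self.caps = caps

def lower(tensor, remove):
    """Adapter: resolve some structure and return a simpler tensor type."""
    return Tensor(tensor.name, tensor.caps & ~remove)

# High-level object with all structure, lowered one layer at a time.
full = Tensor("H", Cap.DIST | Cap.IPS | Cap.SO | Cap.PGS)
simpler = lower(full, Cap.SO | Cap.PGS)
```

Each adapter removes one kind of structure, so code written against the richest interface can reach the basic tensor functionality through the intermediate ones.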
Summary
• Chemistry is hard.
• A fast GEMM implementation is nice, but doesn’t go far enough.
• Complex structure can be dealt with
– By breaking the problem into simple blocks,
– By incorporating the structure into communication and computation,
– By relating a complex object to a simpler one (a matrix) bit by bit.
• Layered and composable interfaces are important.
– Implementations written at a “high level” can use “low level”
interfaces through intermediate ones.
– Adapters can go from one well-defined interface to another.
Thanks!
BLIS:
Field van Zee
Tyler Smith
Many others…
Tensormental:
Martin Schatz
Bryan Marker
Robert van de Geijn
John Stanton
CTF/AQ:
Edgar Solomonik
Jeff Hammond
Tensor packing:
Woody Austin
Martin Schatz
The CFOUR developers