
Direct Self-Consistent Field Computations on GPU Clusters

Guochun Shi, Volodymyr Kindratenko
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign

Ivan Ufimtsev, Todd Martinez
Department of Chemistry
Stanford University
Presentation Outline
• GPU computing
• NCSA’s Lincoln GPU cluster
• SCF theory in Quantum Chemistry
• Implementation on a GPU cluster
• Kernels for J and K matrices
• Parallelization strategy for GPU cluster
• Performance
• Conclusions and future work
Why GPUs?
[Figure: GPU Performance Trends: GFlop/s (0 to 1200) vs. release date (9/22/02 to 3/14/08) for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) compared with Intel CPUs (Xeon quad-core 3 GHz)]
NVIDIA Tesla T10 GPU Architecture
[Figure: Tesla T10 block diagram: PCIe interface, input assembler, and thread execution manager feeding 10 TPCs; each TPC has a geometry controller, an SMC, and 3 SMs, and each SM has an instruction cache, MT issue unit, constant cache, 8 SPs, 2 SFUs, and shared memory; texture units with texture L1 caches; a 512-bit memory interconnect with L2 caches and ROPs to off-chip DRAM]
• T10 architecture
  • 240 streaming processors arranged as 30 streaming multiprocessors
  • At 1.3 GHz this provides
    • 1 TFLOP SP
    • 86.4 GFLOP DP
• 512-bit interface to off-chip GDDR3 memory
  • 102 GB/s bandwidth
Intel 64 Tesla Linux Cluster Lincoln
• Dell PowerEdge 1955 server
  • Intel 64 (Harpertown) 2.33 GHz, dual socket, quad core
  • 16 GB DDR2
  • InfiniBand SDR
• Tesla S1070 1U GPU Computing Server
  • 1.3 GHz Tesla T10 processors
  • 4×4 GB GDDR3 SDRAM
• Cluster
  • Servers: 192
  • Accelerator Units: 96
[Figure: two compute nodes: each Dell PowerEdge 1955 server connects to the cluster via SDR InfiniBand and over PCIe x8 to a PCIe interface of the Tesla S1070, which provides two T10 GPUs with dedicated DRAM to each node]
HPL Benchmark for Lincoln
[Figure: HPL benchmark for Lincoln: achieved GFLOPS (left axis) and % of peak (right axis, up to 45%) vs. system size of 1, 2, 4, 8, 16, and 32 nodes]

We used Massimiliano Fatica's (NVIDIA) GPU-enabled HPL package.
Quantum Chemistry
Why do we need to deal with…

Energy (HΨ = EΨ):
  Quantifies intra/intermolecular interactions
  Drives chemistry; little of interest happens on a flat energy surface

Geometry optimization (∇R E = 0):
  Searches for stable atomic arrangements (molecular shapes)

Molecular dynamics (∂²R/∂t² = −(1/M) ∇R E):
  The chemistry itself (at some, sometimes crude, approximation)
  Studies systems at atomistic time and length scales
Exact energy is a hard problem
Hartree-Fock approximation is one of the simplest
Ψ is an antisymmetrized product of N one-electron orbitals ψᵢ
Expand ψᵢ over a predefined basis set
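Since the Greek symbols were lost in this transcript, here is the standard Hartree-Fock form these two lines refer to, written out as a LaTeX sketch; C is the molecular orbital coefficient matrix that appears in the SCF procedure below, and the φμ are the basis functions:

    \Psi(\mathbf{r}_1,\ldots,\mathbf{r}_N)
      = \hat{A}\left[\psi_1(\mathbf{r}_1)\,\psi_2(\mathbf{r}_2)\cdots\psi_N(\mathbf{r}_N)\right],
    \qquad
    \psi_i(\mathbf{r}) = \sum_{\mu} C_{\mu i}\,\varphi_\mu(\mathbf{r})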
Hartree-Fock Self Consistent Field (SCF) procedure
Repeat until Cₖ₊₁ more or less equals Cₖ
Hartree-Fock equations
• All matrices are of N×N size (N ~ 1,000 … 10,000)
• N³ operations to solve the HF equations (need to deal with diagonalization)
• N⁴ operations to get F
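A minimal sketch of this SCF loop in Python/NumPy, assuming a build_fock callback that stands in for the two-electron (J/K) work described on the following slides; the function names and the zero-density starting guess are illustrative, not the paper's code:

    import numpy as np
    import scipy.linalg as la

    def scf_loop(H_core, S, build_fock, n_occ, tol=1e-6, max_iter=100):
        """Closed-shell SCF sketch. `build_fock(P)` is a placeholder for the
        J/K construction that the paper offloads to GPUs (Eq. 8, 9)."""
        N = H_core.shape[0]
        P = np.zeros((N, N))              # crude starting guess for the density
        for _ in range(max_iter):
            F = H_core + build_fock(P)    # O(N^4) work: assemble the Fock matrix
            e, C = la.eigh(F, S)          # O(N^3) work: diagonalize F C = S C e
            C_occ = C[:, :n_occ]
            P_new = 2.0 * C_occ @ C_occ.T # density matrix from occupied orbitals
            # The slides phrase convergence as C_{k+1} ~= C_k; testing P avoids
            # the arbitrary sign/ordering of eigenvectors.
            if np.max(np.abs(P_new - P)) < tol:
                return e, C, P_new
            P = P_new
        raise RuntimeError("SCF did not converge")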
Kernel in GPU
• 2e integral grid of bra [ij| and ket |kl] pair quantities; screening leaves only ~N² out of N⁴ integrals
• Unsorted grid: a SIMD warp computes mostly negligibly small integrals
• Sorted grid: a SIMD warp computes only significant integrals
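A rough sketch of the screening idea, assuming a Schwarz-style magnitude bound (the slides do not name the specific bound, so this is illustrative): pair quantities are sorted by the bound so that a SIMD warp sees blocks of uniformly significant, or uniformly negligible, [ij|kl] integrals.

    import numpy as np

    def sort_pairs_by_bound(bounds):
        """Return pair indices ordered by a magnitude bound, largest first.
        `bounds[p]` is an upper bound on the integrals involving bra pair p
        (e.g. a Schwarz-type estimate).  Sorting groups significant pairs
        together so a warp works on uniform [ij|kl] blocks."""
        return np.argsort(bounds)[::-1]

    def count_significant_blocks(bra_bounds, ket_bounds, threshold=1e-11):
        """Count how many [ij|kl] blocks survive screening: a block is kept
        only if bound(ij) * bound(kl) can exceed the threshold."""
        bra = np.sort(bra_bounds)[::-1]
        ket = np.sort(ket_bounds)[::-1]
        kept = (bra[:, None] * ket[None, :]) > threshold
        return int(kept.sum())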
Kernel in GPU: J-matrix implementation
Kernels in GPU: K-matrix implementation
Single node execution time breakdown

[Figure: per-node runtime breakdown in seconds for olestra (axis 0 to 18 s) and bpti (axis 0 to 250 s), split into J, K, J/K reduction, LA, pair-quantity computations (JPQ, KPQ), and uncounted time]

• The J and K matrices computation and Linear Algebra (LA) computation dominate the overall execution time
• Pair quantity computations can be significant
GPU cluster parallelization strategy
• Each GPU has a global id
  • nodeid * num_gpu_per_node + local_gpu_index
• J/K matrices work distribution
  • Computations for elements in J and K matrices are not even
  • Sort the pre-computed pair quantities and have each GPU take every Nth element of the sorted list (N = total number of GPUs), as sketched below
• LA using Intel MKL
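A minimal sketch of this assignment; ngpus_per_node, pair_costs, and the cost ordering are illustrative names and assumptions, not the paper's code:

    def global_gpu_id(node_id, ngpus_per_node, local_gpu_index):
        """Global id as described above: nodeid * num_gpu_per_node + local index."""
        return node_id * ngpus_per_node + local_gpu_index

    def assign_work(pair_costs, n_gpus, gpu_id):
        """Round-robin over the cost-sorted work list: after sorting the
        pre-computed pair quantities by estimated cost, GPU `gpu_id` takes
        every n_gpus-th item, so each GPU gets a similar mix of expensive
        and cheap J/K work."""
        order = sorted(range(len(pair_costs)), key=lambda i: pair_costs[i], reverse=True)
        return order[gpu_id::n_gpus]

    # Example: 2 nodes with 2 GPUs each -> 4 GPUs; GPU 0 on node 1 has global id 2.
    gid = global_gpu_id(node_id=1, ngpus_per_node=2, local_gpu_index=0)
    my_items = assign_work(pair_costs=[9.0, 7.5, 3.2, 1.1, 0.4, 0.1], n_gpus=4, gpu_id=gid)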
Parallelization strategy (II)
• Start as an MPI program; each node has as many MPI processes as CPU cores
• One MPI process per node is designated as "master"
• The master MPI processes create threads for controlling GPUs as well as CPU work threads
• MPI processes, GPU management threads, and CPU work threads are awakened or put to sleep as needed

SCF flow:
Start: guess the initial molecular orbital coefficients matrix C and compute the density matrix P (Eq. 10)
1. Pre-compute pair-wise quantities
2. Compute J and K (Eq. 8, 9) [master MPI processes, multiple POSIX threads, GPUs]
3. Gather the Fock matrix: form Fock sub-matrices (Eq. 7) [master MPI processes, multiple POSIX threads], then gather the complete Fock matrix F [master MPI processes, rank 0 MPI process]
4. Distribute the Fock matrix: scatter F [all MPI processes]
5. Solve the eigenvalue problem: compute matrix C (Eq. 5) [all MPI processes]
6. Final gather: gather and broadcast P [all MPI processes, rank 0 MPI process]
Converged? If no, repeat from step 1; if yes, done.

[Figure: per-iteration timeline across nodes 0 to 3 showing, on each node, the MPI processes, CPU work threads, and the CPU thread managing GPU kernels: generate the guess matrix C and compute matrix P; pair-quantity computation on the CPUs using the density matrix P; J and K computation on the GPUs; reduction of the partial J and K matrices to form the Fock matrix; distribute the Fock matrix, do the linear algebra, compute matrices C and P; gather the distributed P matrix and broadcast P]
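A compressed sketch of one iteration's communication pattern, using mpi4py and NumPy purely for illustration; the actual implementation uses C/MPI with POSIX threads and CUDA, and compute_partial_fock / solve_block are placeholders for the GPU J/K kernels and the MKL-based linear algebra:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, nprocs = comm.Get_rank(), comm.Get_size()

    def scf_iteration(P, N, compute_partial_fock, solve_block):
        """Communication skeleton of one SCF iteration (illustrative only).
        Assumes N is divisible by nprocs for brevity."""
        # GPUs produce a partial Fock contribution from P; summing on rank 0
        # mirrors "reduce the partial J and K matrices, form the Fock matrix".
        F_part = compute_partial_fock(P)
        F = np.empty((N, N)) if rank == 0 else None
        comm.Reduce(F_part, F, op=MPI.SUM, root=0)

        # Distribute the Fock matrix: rank 0 scatters row blocks to all processes.
        rows = N // nprocs
        F_block = np.empty((rows, N))
        comm.Scatter(F, F_block, root=0)

        # Each process does its share of the linear algebra (dgemm,
        # diagonalization) and returns its row block of the new density matrix.
        P_block = solve_block(F_block)

        # Final gather: assemble P on rank 0, then broadcast it to everyone.
        P_new = np.empty((N, N))
        comm.Gather(P_block, P_new if rank == 0 else None, root=0)
        comm.Bcast(P_new, root=0)
        return P_new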
Performance: load balancing

[Figure: computation time in seconds vs. node index (0 to 15): unbalanced K-matrix computation, balanced K-matrix computation, and balanced J-matrix computation]

• Sorting for the pair quantity computations and the work selection strategy make the computation on GPUs well balanced, reducing performance degradation
Performance
            Atoms   Electrons   Orbitals   S shells   P shells
Olestra       453        1366       2131       1081        350
BPTI          875        3400       4893       2202        897
CspA         1732        6290       8753       4220       1511

[Figure: runtime in seconds vs. number of nodes, one panel per molecule: Olestra (1 to 32 nodes, axis 0 to 60 s), BPTI (1 to 128 nodes, axis 0 to 250 s), CspA (2 to 128 nodes, axis 0 to 600 s)]

Using the 3-21G basis set
Scalability of J, K and LA
[Figure: time in seconds of the J, K, and LA phases vs. number of nodes (up to 128), one panel each for Olestra, CspA, and BPTI]

• J and K matrices computation can scale well to 128 nodes
• Linear Algebra scales only up to 16 nodes, even for the CspA molecule
Performance: Linear Algebra breakdown
[Figure: linear algebra time per iteration in seconds vs. number of cluster nodes (2 to 128), broken down into P matrix assembly, diagonalization, and dgemm]

• Diagonalization scales the worst; dgemm is also important
• A fast, scalable GPU-based ScaLAPACK is needed
  • MAGMA from UTK?
  • CULA?
Results: Olestra molecule
The Olestra molecule, consisting of 453 atoms (a small example model used for testing the developed software), can be computed by the state-of-the-art quantum chemistry software package GAMESS running on an Intel Pentium D 3 GHz processor in 12,408 seconds, whereas our 8-node GPU cluster implementation performs the same computation in just over 5 seconds, a 2,452× speedup.
Example: CspA molecule
For larger models, one SCF iteration for the Cold shock protein A (CspA) molecule, consisting of 1,732 atoms, can be done in 88 seconds on a 16-node GPU cluster.
Conclusions and future work
• GPU computing brings Quantum Chemistry
computing to a new level
• Parallelization enables computation of large molecules in a shorter time
• J and K matrices show good scalability
• Linear Algebra can only scale up to 16 nodes
• Linear Algebra becomes a major bottleneck
• A linear algebra package using GPUs with good scalability is
needed
• Matrix multiplication and eigenvalue solver
• Only S and P orbitals are supported at this moment
• Alexey Titov (NCSA) is working on D orbitals
Acknowledgement
• This work was supported by the National Science
Foundation grant CHE-06-26354.