
Large-scale geophysical electromagnetic imaging and modeling on graphical processing units
Michael Commer (LBNL)
Filipe R. N. C. Maia (LBNL-NERSC)
Gregory A. Newman (LBNL)
Overview
• Introduction: Geophysical modeling on GPUs
• Iterative Krylov solvers on GPU and implementation details
• Krylov solver performance tests
• Conclusions
CSEM data inversion using QMR
CSEM imaging experiment of Troll gas field (North Sea)
EMGeo-GPU has already been run successfully on 16 NVIDIA Tesla C2050 (Fermi) GPUs (3 GB memory, 448 parallel CUDA processor cores each), compared to 16 × 8 Intel quad-core Nehalem CPUs at 2.4 GHz.
ERT data inversion using CG
CO2 plume imaging study
SIP data inversion using BiCG
Rifle SIP monitoring study
Finite-difference representation of Maxwell and Poisson equations
Maxwell equation:
∇ × ∇ × E + iωμ₀σE = −iωμ₀J   →   13-point stencil
Poisson equation:
∇ · (σ∇φ) = −∇ · J   →   7-point stencil
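To make the 7-point stencil concrete, here is a minimal matrix-free CUDA sketch of applying the discrete Poisson operator on a uniform grid. The kernel name, array names, grid dimensions and the simplified node-centred, constant-spacing treatment of σ are illustrative assumptions, not the EMGeo code.

// Hypothetical sketch: apply the 7-point Poisson stencil y = A*phi on a
// uniform nx x ny x nz grid with spacing h and per-node conductivity sigma
// (proper face averaging of sigma is omitted for brevity).
__global__ void poisson7pt_apply(const double* phi, const double* sigma,
                                 double* y, int nx, int ny, int nz, double h)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1 || k <= 0 || k >= nz - 1)
        return;                          // Dirichlet boundary nodes are skipped
    int id = i + nx * (j + ny * k);      // linear index of grid node (i,j,k)
    double s = sigma[id] / (h * h);      // simplified per-node coefficient
    // 7-point stencil: center node plus its six face neighbours
    y[id] = s * (6.0 * phi[id]
                 - phi[id - 1]       - phi[id + 1]
                 - phi[id - nx]      - phi[id + nx]
                 - phi[id - nx * ny] - phi[id + nx * ny]);
}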
Iterative Krylov subspace methods
Solution of the linear system
Ax = b,   A an N × N matrix,
involves constructing the Krylov subspace
Kₘ = Kₘ(A, r₀) = span{ r₀, A r₀, …, A^(m−1) r₀ }
in order to compute the optimal approximation
xₘ ∈ x₀ + Kₘ
Numerical modeling on GPUs
Main challenge:
Manage memory access in the most efficient way
Sparse matrix types arising in electrical
and electromagnetic modeling problems
Maxwell:
Controlled-source EM,
Magnetotelluric
Poisson:
Electrical resistivity tomography,
Induced polarization
Sparse Matrix Storage Formats
Structured
Unstructured
ELLPACK Format
• Storage of N non-zeros per matrix row
• Zero-padding for rows with < N non-zeros (illustrated in the sketch below)
• Ease of implementation
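As an illustration of the zero-padding, the following hedged host-side sketch converts a CSR matrix into ELLPACK arrays of fixed width; all names and the CSR input format are assumptions made for the example.

// Hypothetical host-side sketch: convert a CSR matrix to ELLPACK with a fixed
// width of max_nnz_per_row entries per row; shorter rows are zero-padded.
// Here the ELL arrays are stored row by row ("row concatenation").
void csr_to_ell(int n_rows, int max_nnz_per_row,
                const int* csr_row_ptr, const int* csr_col, const double* csr_val,
                int* ell_col, double* ell_val)
{
    for (int row = 0; row < n_rows; ++row) {
        int nnz = csr_row_ptr[row + 1] - csr_row_ptr[row];
        for (int slot = 0; slot < max_nnz_per_row; ++slot) {
            int dst = row * max_nnz_per_row + slot;   // row-concatenated layout
            if (slot < nnz) {
                ell_col[dst] = csr_col[csr_row_ptr[row] + slot];
                ell_val[dst] = csr_val[csr_row_ptr[row] + slot];
            } else {                                  // zero padding
                ell_col[dst] = 0;                     // any valid column index
                ell_val[dst] = 0.0;
            }
        }
    }
}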
ELL SpMV GPU implementation
n – number of rows in the matrix (large)
m – max number of non-zeros per row (small)
(Figure: index matrix, value matrix, input vector x, output vector y.)
ELL SpMV GPU implementation
(Figure: memory position of matrix element (1,3), read by GPU thread number 1.)
One thread per row, row concatenation.
Memory access not coalesced!
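A hedged CUDA sketch of this first variant (one thread per row, ELL arrays concatenated row by row; names are illustrative). Thread t reads positions t·m, t·m+1, …, so neighbouring threads in a warp touch addresses m elements apart, which is why the loads are not coalesced.

// One thread per row, row-concatenated ELL arrays of length n*m.
// Neighbouring threads access memory locations m apart -> not coalesced.
__global__ void ell_spmv_row_major(int n, int m, const int* col,
                                   const double* val, const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double sum = 0.0;
    for (int j = 0; j < m; ++j) {
        int idx = row * m + j;            // stride-m access pattern
        sum += val[idx] * x[col[idx]];
    }
    y[row] = sum;
}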
ELL SpMV GPU implementation
Many threads per row, row concatenation.
Coalesced reads; in-block reduction.
Reduction and writing the rhs are slow!
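A hedged sketch of the second variant, assuming one thread block per row and a power-of-two block size; consecutive threads now read consecutive entries of the row (coalesced), but the shared-memory reduction and the single result write per block become the bottleneck.

// Many threads per row (one thread block per row), row-concatenated ELL arrays.
// Reads within a row are coalesced, but the in-block reduction and the single
// write of the row result are what make this variant slow.
__global__ void ell_spmv_block_per_row(int n, int m, const int* col,
                                       const double* val, const double* x,
                                       double* y)
{
    __shared__ double partial[128];        // assumes blockDim.x <= 128, power of 2
    int row = blockIdx.x;                  // one block per matrix row (grid = n blocks)
    double sum = 0.0;
    for (int j = threadIdx.x; j < m; j += blockDim.x)
        sum += val[row * m + j] * x[col[row * m + j]];
    partial[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // in-block reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = partial[0];       // single write per row
}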
ELL SpMV GPU implementation
One thread per row, column concatenation.
(Figure: the memory position of matrix element (1,3) is now read from another block.)
Coalesced reads and no reductions.
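A hedged sketch of the third variant (one thread per row, ELL arrays concatenated column by column, i.e. entry (row, j) stored at j·n + row); consecutive threads read consecutive addresses, each thread keeps its own accumulator, and no reduction is needed. Names are illustrative.

// One thread per row, column-concatenated ("column-major") ELL arrays.
// Entry (row, j) lives at j*n + row, so neighbouring threads read neighbouring
// addresses: coalesced loads, one accumulator per thread, no reduction.
__global__ void ell_spmv_col_major(int n, int m, const int* col,
                                   const double* val, const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double sum = 0.0;
    for (int j = 0; j < m; ++j) {
        int idx = j * n + row;            // unit stride across the warp
        sum += val[idx] * x[col[idx]];
    }
    y[row] = sum;
}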
ELL SpMV GPU implementation
(Performance chart: for 13 non-zero elements per row on a Tesla C2050.)
Minimize Memory Bandwidth
• Use fused kernels.
• Use pointer swaps instead of memory copies when possible (sketched below).
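A minimal sketch of the pointer-swap idea, assuming two device vectors whose roles alternate between iterations; the names are hypothetical.

// Instead of copying d_new into d_old at the end of every iteration ...
//   cudaMemcpy(d_old, d_new, n * sizeof(double), cudaMemcpyDeviceToDevice);
// ... simply exchange the device pointers; no data is moved at all.
static void swap_vectors(double** d_old, double** d_new)
{
    double* tmp = *d_old;
    *d_old = *d_new;
    *d_new = tmp;
}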
Multi-GPU communication (via the CPU)
• Use the same layout for vectors on the CPU and GPU.
• Simplifies MPI communication routines.
• Extra complication of the data transfer to the CPU (see the halo-exchange sketch below).
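A hedged sketch of what such a blocking exchange could look like when the CPU and GPU share the same vector layout: the boundary slab is copied device-to-host, exchanged with a blocking MPI call, and copied back without any repacking. Buffer names, offsets and the single-neighbour setup are assumptions made for the example.

#include <mpi.h>
#include <cuda_runtime.h>

// Blocking halo exchange between neighbouring subdomains (one GPU per MPI rank).
// Because the vector layout is identical on CPU and GPU, the boundary slab can
// be exchanged without any repacking.
void exchange_halo_blocking(double* d_x, double* h_send, double* h_recv,
                            int halo_len, int offset_send, int offset_recv,
                            int neighbour, MPI_Comm comm)
{
    // 1. copy the boundary slab from the GPU to the host buffer
    cudaMemcpy(h_send, d_x + offset_send, halo_len * sizeof(double),
               cudaMemcpyDeviceToHost);
    // 2. exchange it with the neighbouring rank (blocking)
    MPI_Sendrecv(h_send, halo_len, MPI_DOUBLE, neighbour, 0,
                 h_recv, halo_len, MPI_DOUBLE, neighbour, 0,
                 comm, MPI_STATUS_IGNORE);
    // 3. copy the received ghost values back to the device
    cudaMemcpy(d_x + offset_recv, h_recv, halo_len * sizeof(double),
               cudaMemcpyHostToDevice);
}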
Multi-GPU communication
(Figure: GPU communication diagram.)
(Figure: blocking communication.)
(Figure: non-blocking communication; a sketch of overlapping the halo exchange with computation follows below.)
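For the non-blocking case, a hedged sketch of overlapping the halo exchange with computation on the interior rows; launch_interior_spmv and launch_boundary_spmv stand in for kernel launches that are not shown, and pinned host buffers are assumed so the asynchronous copies can proceed.

#include <mpi.h>
#include <cuda_runtime.h>

void launch_interior_spmv(void);   // hypothetical: SpMV on rows needing no ghost values
void launch_boundary_spmv(void);   // hypothetical: SpMV on rows that use the halo

// Non-blocking halo exchange overlapped with computation on the interior.
// h_send / h_recv are assumed to be pinned host buffers (cudaHostAlloc).
void exchange_halo_nonblocking(double* d_x, double* h_send, double* h_recv,
                               int halo_len, int offset_send, int offset_recv,
                               int neighbour, MPI_Comm comm, cudaStream_t stream)
{
    MPI_Request reqs[2];
    // stage the outgoing boundary slab on its own stream
    cudaMemcpyAsync(h_send, d_x + offset_send, halo_len * sizeof(double),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);               // h_send must be ready for MPI
    MPI_Irecv(h_recv, halo_len, MPI_DOUBLE, neighbour, 0, comm, &reqs[0]);
    MPI_Isend(h_send, halo_len, MPI_DOUBLE, neighbour, 0, comm, &reqs[1]);

    launch_interior_spmv();                      // overlap: interior rows need no halo

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   // wait for the ghost values
    cudaMemcpyAsync(d_x + offset_recv, h_recv, halo_len * sizeof(double),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    launch_boundary_spmv();                      // finish the rows that use the halo
}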
Iterative Krylov solver performance tests
Solvers typically used for EM problems: CG, BiCG, QMR
Computing times for 1000 Krylov solver iterations
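To show how the SpMV kernels plug into such a solver, here is a minimal, hypothetical conjugate-gradient sketch for a real symmetric positive-definite system (e.g. the ERT/Poisson case), using cuBLAS for the vector operations; it is an illustration under those assumptions, not the EMGeo-GPU solver.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <math.h>

// Column-major ELL SpMV as sketched earlier (definition not repeated here).
__global__ void ell_spmv_col_major(int n, int m, const int* col,
                                   const double* val, const double* x, double* y);

// Minimal unpreconditioned CG sketch; all vectors are device arrays, d_x = 0 on
// entry so the initial residual is simply b.
void cg_solve(int n, int m, const int* d_col, const double* d_val,
              const double* d_b, double* d_x,
              double* d_r, double* d_p, double* d_Ap,
              int max_iter, double tol)
{
    cublasHandle_t h;
    cublasCreate(&h);
    int threads = 256, blocks = (n + threads - 1) / threads;

    cublasDcopy(h, n, d_b, 1, d_r, 1);                      // r = b - A*0 = b
    cublasDcopy(h, n, d_r, 1, d_p, 1);                      // p = r
    double rs_old;
    cublasDdot(h, n, d_r, 1, d_r, 1, &rs_old);

    for (int it = 0; it < max_iter && sqrt(rs_old) > tol; ++it) {
        ell_spmv_col_major<<<blocks, threads>>>(n, m, d_col, d_val, d_p, d_Ap);
        double pAp;
        cublasDdot(h, n, d_p, 1, d_Ap, 1, &pAp);
        double alpha = rs_old / pAp, neg_alpha = -alpha, one = 1.0;
        cublasDaxpy(h, n, &alpha, d_p, 1, d_x, 1);          // x += alpha * p
        cublasDaxpy(h, n, &neg_alpha, d_Ap, 1, d_r, 1);     // r -= alpha * A*p
        double rs_new;
        cublasDdot(h, n, d_r, 1, d_r, 1, &rs_new);
        double beta = rs_new / rs_old;
        cublasDscal(h, n, &beta, d_p, 1);                   // p = r + beta * p
        cublasDaxpy(h, n, &one, d_r, 1, d_p, 1);
        rs_old = rs_new;
    }
    cublasDestroy(h);
}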
SpMV with “Constant-Coefficient-Matrix”
Vector Helmholtz equation
∇ × ∇ × E + iωμ₀σE = −iωμ₀J,   ω = 2πf
Ax = b
SpMV with Constant-Coefficient-Matrix
(∇ × ∇ × + iωμ₀σ) x = b
Choose Dirichlet boundary conditions such that the operator Ã ∈ ℝⁿˣⁿ, i.e. the discretized curl-curl part is real:
(Ã + iωμ₀ diag(σ)) x = b
SpMV with Constant-Coefficient-Matrix
A = ( a₁₁ ⋯ a₁ₙ ; ⋮ ⋱ ⋮ ; aₙ₁ ⋯ aₙₙ )   (n × n)

C = A − diag(A) = ( 0 a₁,₂ ⋯ a₁,ₙ ; a₂,₁ 0 ⋱ ⋮ ; ⋮ ⋱ 0 aₙ₋₁,ₙ ; aₙ,₁ ⋯ aₙ,ₙ₋₁ 0 )   (n × n)
SpMV with Constant-Coefficient-Matrix
D = diag(A),   d = (d₁₁, d₂₂, …, dₙₙ) = (d₁, d₂, …, dₙ)

Ax = b   ⟺   Cx + d ∘ x = b

with A ∈ ℂⁿˣⁿ, C ∈ ℝⁿˣⁿ and d, x, b ∈ ℂⁿ (d ∘ x denotes the element-wise product of d and x).
Pseudo code for SpMV with “standard” matrix: Ax = b
Pseudo code for SpMV with Constant-Coefficient-Matrix: Cx + d ∘ x = b
Scaling of solution vector
Scaling of rhs vector
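The original pseudo-code listings are not reproduced in this transcript; the following hedged CUDA sketch shows one way an SpMV with the Constant-Coefficient-Matrix could look: the real off-diagonal values of C are taken from a small table of unique coefficients (possible because they depend only on the grid intervals), and the complex diagonal d is applied separately, y = Cx + d ∘ x. The array names, the per-entry 16-bit coefficient index, and the use of cuDoubleComplex are assumptions based on the slides.

#include <cuComplex.h>

// Hedged sketch: SpMV with a Constant-Coefficient-Matrix, y = C x + d .* x.
// The real off-diagonal coefficients of C are stored once in a small table of
// unique values (coef_table) and addressed per entry through a 16-bit index,
// replacing the large ELL value array; the complex diagonal d carries the
// i*omega*mu0*sigma term. Column-concatenated layout, one thread per row.
__global__ void ccm_spmv(int n, int m, const int* col,
                         const unsigned short* coef_idx,
                         const double* coef_table, const cuDoubleComplex* d,
                         const cuDoubleComplex* x, cuDoubleComplex* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    cuDoubleComplex sum = make_cuDoubleComplex(0.0, 0.0);
    for (int j = 0; j < m; ++j) {
        int idx = j * n + row;                        // coalesced access
        double c = coef_table[coef_idx[idx]];         // unique real coefficient
        cuDoubleComplex xc = x[col[idx]];
        sum.x += c * xc.x;                            // real coefficient times a
        sum.y += c * xc.y;                            // complex entry, componentwise
    }
    y[row] = cuCadd(sum, cuCmul(d[row], x[row]));     // add the diagonal term d*x
}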
QMR solver performance on CPU & GPU using CCM – solution times for 1000 Krylov solver iterations
Example grid size: 190 × 190 × 100
QMR solver performance on GPU using CCM – memory throughput
Grid intervals → Coefficients
Example grid size: 100 × 100 × 100
(Chart: number of unique coefficients, 0–6000, versus number of unique grid intervals, 1–20.)
Grid intervals → Solution times
(Chart: solution times, about 23–27.5 s, versus number of unique grid intervals, 1–20.)
Increase in computing time: 17 %
Grid intervals → Memory usage
(Chart: device memory usage, about 767.97–768.05 MB, versus number of unique grid intervals, 1–20.)
The only significant portion is given by the index array.
Conclusions
• Our GPU implementation of iterative Krylov methods exploits the massive parallelism of modern GPU hardware
• Efficiency increases with problem size
• Memory limitations are overcome by a multi-GPU scheme and a novel SpMV method for structured grids
Thanks to
the National Energy Research Scientific Computing Center (NERSC)
for support provided through the NERSC Petascale Program