Large-scale geophysical electromagnetic imaging and modeling on graphical processing units
Michael Commer (LBNL), Filipe R. N. C. Maia (LBNL-NERSC), Gregory A. Newman (LBNL)

Overview
• Introduction: geophysical modeling on GPUs
• Iterative Krylov solvers on GPU and implementation details
• Krylov solver performance tests
• Conclusions

CSEM data inversion using QMR
CSEM imaging experiment of the Troll gas field (North Sea). EMGeo-GPU has already been run successfully on 16 NVIDIA Tesla C2050 (Fermi) GPUs (3 GB memory, 448 parallel CUDA processor cores each), compared to 16 × 8 Intel quad-core Nehalem, 2.4 GHz.

ERT data inversion using CG
CO2 plume imaging study.

SIP data inversion using BiCG
Rifle SIP monitoring study.

Finite-difference representation of Maxwell and Poisson equations
• Maxwell equation: ∇ × ∇ × E − iωμ₀σE = iωμ₀J → 13-point stencil
• Poisson equation: ∇ · (σ∇φ) = ∇ · J → 7-point stencil

Iterative Krylov subspace methods
Solution of the linear system Ax = b with A ∈ ℝ^(N×N) involves constructing the Krylov subspace
K_m = K_m(A, r₀) = span{r₀, A r₀, …, A^(m−1) r₀}
in order to compute the optimal approximation x_m ∈ x₀ + K_m.

Numerical modeling on GPUs
Main challenge: manage memory access in the most efficient way. Sparse matrix types arising in electrical and electromagnetic modeling problems:
• Maxwell: controlled-source EM, magnetotelluric
• Poisson: electrical resistivity tomography, induced polarization

Sparse matrix storage formats
Structured and unstructured formats.

ELLPACK (ELL) format
• Storage of a fixed number N of non-zeros per matrix row
• Zero-padding for rows with fewer than N non-zeros
• Ease of implementation

ELL SpMV GPU implementation
Here n is the number of rows in the matrix (large) and m is the maximum number of non-zeros per row (small); the kernel operates on an index matrix, a value matrix, the input vector x, and the output vector y. Three thread layouts were compared (a CUDA sketch of the preferred layout is given below):
• One thread per row, row concatenation: memory access is not coalesced.
• Many threads per row, row concatenation: coalesced reads, but an in-block reduction is required, and the reduction and writing of the result are slow.
• One thread per row, column concatenation: coalesced reads and no reductions.
[Figure: SpMV performance for 13 non-zero elements per row on a Tesla C2050.]

Minimize memory bandwidth
• Use fused kernels (see the sketch below).
• Use pointer swaps instead of memory copies when possible.

CPU and multi-GPU communication
• Use the same layout for vectors on the CPU and GPU.
• Simplifies MPI communication routines.
• Extra complication of the data transfer to the CPU.
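The preferred layout above (one thread per row with column-concatenated ELL storage) can be sketched in CUDA as follows. This is a minimal illustration rather than the EMGeo-GPU code; the kernel name, the use of double precision, and the convention of marking padded entries with a negative column index are assumptions.

// Minimal sketch of an ELL SpMV kernel: one thread per row, with the index and
// value matrices stored column-concatenated (column-major), so that consecutive
// threads read consecutive memory addresses (coalesced access).
__global__ void ell_spmv(int n,                // number of matrix rows
                         int m,                // max non-zeros per row
                         const int    *col,    // column indices, size m*n, column-major
                         const double *val,    // matrix values,  size m*n, column-major
                         const double *x,      // input vector
                         double       *y)      // output vector y = A*x
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double sum = 0.0;
    for (int j = 0; j < m; ++j) {
        // Entry j of this row sits at j*n + row: threads of a warp touch
        // adjacent addresses for the same j, giving coalesced reads.
        int    c = col[j * n + row];
        double a = val[j * n + row];
        if (c >= 0)                  // padded entries marked with a negative index
            sum += a * x[c];
    }
    y[row] = sum;                    // one write per row, no reduction needed
}

A launch such as ell_spmv<<<(n + 255) / 256, 256>>>(n, m, d_col, d_val, d_x, d_y) assigns one thread per row; because entry j of a row is stored at offset j * n + row, the threads of a warp read consecutive addresses for each j, which is what makes the accesses coalesced.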
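To illustrate the fused-kernel recommendation above, the following sketch fuses the two AXPY-style vector updates of a CG-type iteration into a single kernel, so the vectors are traversed only once per iteration; the names and the specific update pair are illustrative assumptions, not the EMGeo-GPU kernels.

// Minimal sketch of a fused vector-update kernel for a CG-type iteration:
// performs x = x + alpha*p and r = r - alpha*q in one pass over memory,
// instead of launching two separate AXPY kernels.
__global__ void fused_update(int n, double alpha,
                             const double *p, const double *q,
                             double *x, double *r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] += alpha * p[i];   // solution update
        r[i] -= alpha * q[i];   // residual update, q = A*p
    }
}

Together with pointer swaps (exchanging device pointers between iterations instead of copying vectors), this reduces the number of passes over global memory per Krylov iteration.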
Multi-GPU communication
[Figure: GPU communication diagram.]
• Blocking communication
• Non-blocking communication

Iterative Krylov solver performance tests
Typically used for EM problems: CG, BiCG, QMR.
[Figure: computing times for 1000 Krylov solver iterations.]

SpMV with "constant-coefficient matrix" (CCM)
Vector Helmholtz equation: ∇ × ∇ × E − iωμ₀σE = iωμ₀J, with ω = 2πf, discretized as Ax = b.
Choose Dirichlet boundary conditions such that the discretized operator (∇ × ∇ × − iωμ₀σ) gives A ∈ ℝ^(n×n) and the linear system Ax = b.
Split A into its off-diagonal and diagonal parts, A = C + diag(A), where C contains the off-diagonal entries a_ij (i ≠ j) and has zeros on its diagonal.
With D = diag(A) and d = (d₁₁, d₂₂, …, d_nn) = (d₁, d₂, …, d_n), the system Ax = b becomes Cx + dx = b, where dx denotes the element-wise product of the diagonal vector d with x, and A, C ∈ ℝ^(n×n), d, x, b ∈ ℝⁿ.
• Pseudo code for SpMV with a "standard" matrix: Ax = b.
• Pseudo code for SpMV with the constant-coefficient matrix: Cx + dx = b, with scaling of the solution vector and scaling of the rhs vector (a CUDA sketch is given at the end of this transcript).

QMR solver performance on CPU and GPU using CCM
[Figure: solution times for 1000 Krylov solver iterations; example grid size 190 × 190 × 100.]

QMR solver performance on GPU using CCM
[Figure: memory throughput.]

Grid intervals – coefficients
[Figure: number of unique coefficients vs. number of unique grid intervals (1–20); example grid size 100 × 100 × 100.]

Grid intervals – solution times
[Figure: solution times (sec) vs. number of unique grid intervals (1–20).] Increase in computing time: 17 %.

Grid intervals – memory usage
[Figure: device memory usage (MB) vs. number of unique grid intervals (1–20).] The only significant portion of device memory is given by the index array.

Conclusions
• Our GPU implementation of iterative Krylov methods exploits the massive parallelism of modern GPU hardware.
• Efficiency increases with problem size.
• Memory limitations are overcome by a multi-GPU scheme and a novel SpMV method for structured grids.

Thanks to the National Energy Research Scientific Computing Center (NERSC) for support provided through the NERSC Petascale Program.
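For reference, here is a minimal CUDA sketch of the constant-coefficient SpMV y = Cx + dx described above. It assumes the simplest case of a single unique grid interval, so that each off-diagonal stencil slot has one constant coefficient that can be kept in constant memory; with several unique grid intervals the coefficient table grows accordingly. The index layout is the same column-concatenated one used in the ELL kernel, and all names are illustrative rather than the EMGeo-GPU implementation.

// Minimal sketch of the constant-coefficient SpMV  y = C*x + d.*x  for a
// structured-grid stencil, assuming one unique grid interval: every off-diagonal
// stencil slot j has a single constant coefficient c_coef[j]; only the diagonal d
// (which carries the conductivity) and the column-index array are stored per row.
#define MAX_STENCIL 13                     // e.g. 13-point Maxwell stencil

__constant__ double c_coef[MAX_STENCIL];   // constant off-diagonal coefficients

__global__ void ccm_spmv(int n,            // number of rows
                         int m,            // off-diagonal entries per row (m <= MAX_STENCIL)
                         const int    *col,   // column indices, size m*n, column-major
                         const double *d,     // diagonal of A (varies with sigma)
                         const double *x,     // input vector
                         double       *y)     // output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    double sum = d[row] * x[row];          // diagonal contribution d.*x
    for (int j = 0; j < m; ++j) {
        int c = col[j * n + row];          // coalesced, same layout as the ELL kernel
        if (c >= 0)                        // negative index marks boundary padding
            sum += c_coef[j] * x[c];       // coefficient taken from constant memory
    }
    y[row] = sum;
}

The host fills c_coef once with cudaMemcpyToSymbol; per-row storage then consists only of the column-index array and the diagonal d, which is consistent with the observation above that device memory usage is dominated by the index array.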