
Solving Challenging Numerical Linear
Algebra Algorithms Using GPU Accelerators
Hatem Ltaief
KAUST Supercomputing Laboratory
Stanimire Tomov
University of Tennessee, Knoxville
ISC’12 Tutorial, Hamburg
June 17, 2012
Outline
MAGMA: LAPACK for GPUs
Methodology overview
• Use both GPUs and multicore CPUs
MAGMA: from single to multiGPU support
• One-sided factorizations and linear solvers
• Two-sided factorizations and eigensolvers
Dynamic scheduling approaches to DLA
MAGMA algorithms with dynamic scheduling
Conclusions
MAGMA: LAPACK for GPUs
MAGMA
• Matrix Algebra on GPU and Multicore Architectures
• To provide LAPACK/ScaLAPACK on hybrid architectures
• http://icl.cs.utk.edu/magma/
MAGMA BLAS
• A subset of BLAS for GPUs, highly optimized for NVIDIA GPGPUs
• Fast GEMM for Fermi (CUBLAS3.2) [IJHPCA’10]
MAGMA developers & collaborators
• UTK, UC Berkeley, UC Denver, INRIA (France), KAUST (Saudi
Arabia)
• Community effort, similar to LAPACK/ScaLAPACK
A New Generation of DLA Software
MAGMA: hybrid algorithms (heterogeneity friendly), relying on
- a hybrid scheduler
- hybrid kernels
MAGMA Software Stack
MAGMA 1.1
• 50+ hybrid LAPACK algorithms have been developed (200+ routines in total)
• Every algorithm is in 4 precisions (s/c/d/z)
• There are 3 mixed-precision algorithms (zc & ds)
• These are hybrid algorithms, expressed in terms of BLAS
MAGMA BLAS
• A subset of GPU BLAS, optimized for Tesla and Fermi GPUs
MAGMA Methodology
A methodology to use all available resources:
• MAGMA uses a HYBRIDIZATION methodology based on
– Representing linear algebra algorithms as collections of TASKS and DATA DEPENDENCIES among them
– Properly SCHEDULING tasks' execution over the multicore and GPU hardware components
(hybrid CPU+GPU algorithms: small tasks for multicores, large tasks for GPUs)
• Successfully applied to fundamental linear algebra algorithms
– One- and two-sided factorizations and solvers
– Iterative linear and eigen-solvers
• Productivity
– 1) High level; 2) Leveraging prior developments; 3) Exceeding the performance of homogeneous solutions
Hybrid Algorithms
One-sided factorizations (LU, QR, Cholesky)
Hybridization
– Panels (Level 2 BLAS) are factored on the CPU using LAPACK
– Trailing matrix updates (Level 3 BLAS) are done on the GPU using "look-ahead"
A Hybrid Algorithm Example
• Left-looking hybrid Cholesky factorization in MAGMA 1.0
• The difference with LAPACK – the 3 additional lines in red
• Line 10 (done on CPU) is overlapped with work on the GPU
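Since the code figure is not reproduced here, the following is a minimal sketch of the structure it describes, assuming a column-major matrix dA already resident on the GPU, the legacy cuBLAS API, and a host buffer work for the nb x nb diagonal block. It is an illustration of the idea, not the actual MAGMA 1.0 source; the three "additional lines" correspond to the device-to-host copy, the CPU dpotrf (the overlapped line), and the host-to-device copy.

    #include <cublas.h>   /* legacy cuBLAS API (assumption: 2012-era toolkit) */
    extern void dpotrf_(const char*, const int*, double*, const int*, int*);

    /* Left-looking hybrid Cholesky (lower triangular) -- hedged sketch */
    void hybrid_dpotrf_lower(int N, int nb, double *dA, int lda, double *work)
    {
        int info;
        for (int j = 0; j < N; j += nb) {
            int jb   = (nb < N - j) ? nb : N - j;
            int rows = N - j - jb;
            /* update the diagonal block with the already factored columns (GPU) */
            cublasDsyrk('L', 'N', jb, j, -1.0, &dA[j], lda, 1.0, &dA[j + j*lda], lda);
            /* additional line 1: bring the jb x jb diagonal block to the CPU */
            cublasGetMatrix(jb, jb, sizeof(double), &dA[j + j*lda], lda, work, jb);
            /* update the block column below the diagonal (GPU); the kernel launch
               returns immediately, so the CPU factorization below overlaps with it */
            if (rows > 0)
                cublasDgemm('N', 'T', rows, jb, j, -1.0, &dA[j + jb], lda,
                            &dA[j], lda, 1.0, &dA[j + jb + j*lda], lda);
            /* additional line 2: factor the diagonal block on the CPU (LAPACK) */
            dpotrf_("L", &jb, work, &jb, &info);
            /* additional line 3: send the factored block back to the GPU */
            cublasSetMatrix(jb, jb, sizeof(double), work, jb, &dA[j + j*lda], lda);
            /* finish the block column with a triangular solve (GPU) */
            if (rows > 0)
                cublasDtrsm('R', 'L', 'T', 'N', rows, jb, 1.0,
                            &dA[j + j*lda], lda, &dA[j + jb + j*lda], lda);
        }
    }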
LU Factorization (Single GPU)
From Single to Multi-GPU Support
Data distribution
• 1-D block-cyclic distribution with block size nb
Algorithm
• The GPU holding the current panel sends it to the CPU
• All updates are done in parallel on the GPUs
• Look-ahead is done with the GPU holding the next panel
[Figure: matrix panels assigned block-cyclically to GPU 0, GPU 1, GPU 2, GPU 0, ...]
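As a rough illustration of this 1-D block-cyclic layout (helper names are hypothetical, not MAGMA API):

    /* Panel k (columns k*nb .. k*nb+nb-1) is owned by GPU k % ngpu and is the
       (k / ngpu)-th panel in that GPU's local storage. */
    static int panel_owner(int k, int ngpu) { return k % ngpu; }
    static int local_index(int k, int ngpu) { return k / ngpu; }
    /* e.g. with ngpu = 3: panels 0,1,2,3,4,... live on GPUs 0,1,2,0,1,... */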
LU Factorization (multi-GPU)
LU Factorization (multi-GPU) – Matrix out of GPU Memory
Out of GPU Memory Algorithms
Perform left-looking factorizations on sub-matrices that fit in the GPU memory (using existing algorithms)
The rest of the matrix stays on the CPU
Left-looking versions minimize writing on the CPU
With A1 the factored part of the matrix (on the CPU), A2 the sub-matrix to be factored next (on the GPU), and the rest of the matrix untouched:
1) Copy A2 to the GPU
2) Update A2 using A1 (a panel of A1 at a time)
3) Factor the updated A2 using the existing hybrid code
4) Copy the factored A2 to the CPU
Trivially extended to multi-GPU: A2 is "larger", with a 1-D block-cyclic distribution, again reusing existing algorithms
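A compact outline of these four steps might look as follows; all routines are hypothetical placeholders standing in for the existing (hybrid) building blocks, not actual MAGMA functions:

    /* Hedged sketch of the left-looking, out-of-GPU-memory scheme (placeholders only). */
    void copy_to_gpu(double *A2);                        /* host -> device            */
    void update_with_panel(const double *p, double *A2); /* Level 3 BLAS update       */
    void hybrid_factor(double *A2);                      /* existing single-GPU code  */
    void copy_to_cpu(double *A2);                        /* device -> host            */

    void factor_next_submatrix(double *A1_panels[], int npanels, double *A2)
    {
        copy_to_gpu(A2);                                 /* 1) copy A2 to the GPU     */
        for (int p = 0; p < npanels; ++p)                /* 2) update A2 using A1,    */
            update_with_panel(A1_panels[p], A2);         /*    one panel at a time    */
        hybrid_factor(A2);                               /* 3) factor the updated A2  */
        copy_to_cpu(A2);                                 /* 4) copy factored A2 back  */
    }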
Hybrid Algorithms
Two-sided factorizations (to bidiagonal, tridiagonal, and upper Hessenberg forms) for eigen- and singular-value problems
Hybridization
– Trailing matrix updates (Level 3 BLAS) are done on the GPU (similar to the one-sided factorizations)
– Panels (Level 2 BLAS) are hybrid:
– operations with memory footprint restricted to the panel are done on the CPU
– the time-consuming matrix-vector products involving the entire trailing matrix are done on the GPU
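As a rough sketch of that last point, the dominant operation of the tridiagonal-reduction panel could be offloaded as below, assuming the trailing matrix dA and the vectors dx, dy already reside on the device (legacy cuBLAS API; an illustration of the idea, not the MAGMA dsytrd panel code):

    #include <cublas.h>   /* legacy cuBLAS API */

    /* y = A*x on the entire trailing matrix: a memory-bound Level 2 BLAS call,
       executed on the GPU to exploit its memory bandwidth, while the operations
       restricted to the panel stay on the CPU. */
    void trailing_symv_on_gpu(int n, const double *dA, int lda,
                              const double *dx, double *dy)
    {
        cublasDsymv('L', n, 1.0, dA, lda, dx, 1, 0.0, dy, 1);
    }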
Multi-GPU Two-Sided Factorizations
Need high-performance multi-GPU Level 2 BLAS (e.g., 50% of the flops in the tridiagonal reduction)
[Chart: Performance of DSYMV on multiple M2090s]
T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense matrix-vector
multiplication on multiple GPUs and its application to symmetric dense and sparse eigenvalue problems, ICL
Technical report, 03/2012.
Hybrid Two-Sided Factorizations
MAGMA Tridiagonalization in DP
• 50% of the flops are in SYMV
• Memory bound, i.e., does not scale well on multicore CPUs
• Use the GPU's high memory bandwidth and optimized SYMV
• 8x speedup over 12 Intel cores (X5660 @ 2.8 GHz)
Keeneland system, using one node:
3 NVIDIA GPUs (M2070 @ 1.1 GHz, 5.4 GB)
2 x 6 Intel cores (X5660 @ 2.8 GHz, 23 GB)
T. Dong, J. Dongarra, S. Tomov, I. Yamazaki, T. Schulthess, and R. Solca, Symmetric dense
matrix-vector multiplication on multiple GPUs and its application to symmetric dense and
sparse eigenvalue problems, ICL Technical report, 03/2012.
Further GPU Kernel Optimizations
[Performance charts on a Fermi C2070]
A. Abdelfattah, J. Dongarra, D. Keyes and H. Ltaief, Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators, VECPAR, Japan, 2012.
From Static to Dynamic Scheduling…
• Static scheduling may stall in situations where work is available
• Hand-tuned optimizations
• Hardware heterogeneity
• Kernel heterogeneity
• Separation of concerns
• Dynamic runtime system
Punch Lines
Productivity!
Productivity!
Productivity!
Oh… Did I say Productivity?
Block Algorithms
Panel-Update Sequence
• Transformations are blocked/accumulated within the panel (Level 2 BLAS)
• Transformations are applied at once to the trailing submatrix (Level 3 BLAS)
• Parallelism is hidden inside the BLAS
• Fork-join model
Block QR Factorization – Fork-Join Paradigm (see the sketch below)
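A minimal sketch of this panel-update sequence with plain LAPACK building blocks (not the MAGMA/PLASMA implementation); the fork-join behaviour comes from the multithreaded BLAS called inside dlarfb:

    /* Blocked QR, one panel-update step per iteration -- hedged sketch.
       A is M x N column-major; T is an nb x nb workspace; work is scratch. */
    extern void dgeqr2_(const int*, const int*, double*, const int*, double*, double*, int*);
    extern void dlarft_(const char*, const char*, const int*, const int*,
                        double*, const int*, const double*, double*, const int*);
    extern void dlarfb_(const char*, const char*, const char*, const char*,
                        const int*, const int*, const int*, const double*, const int*,
                        const double*, const int*, double*, const int*, double*, const int*);

    void blocked_qr(int M, int N, int nb, double *A, int lda,
                    double *tau, double *T, double *work)
    {
        int info, k = (M < N) ? M : N;
        for (int j = 0; j < k; j += nb) {
            int jb = (nb < k - j) ? nb : k - j;
            int rows = M - j, cols = N - j - jb;
            /* panel factorization: Level 2 BLAS, the sequential bottleneck */
            dgeqr2_(&rows, &jb, &A[j + j*lda], &lda, &tau[j], work, &info);
            /* form the triangular block reflector T from the panel */
            dlarft_("F", "C", &rows, &jb, &A[j + j*lda], &lda, &tau[j], T, &nb);
            /* trailing-matrix update: Level 3 BLAS, parallel inside the BLAS (fork) */
            if (cols > 0)
                dlarfb_("L", "T", "F", "C", &rows, &cols, &jb, &A[j + j*lda], &lda,
                        T, &nb, &A[j + (j+jb)*lda], &lda, work, &cols);
            /* implicit join: the next panel cannot start before the update finishes */
        }
    }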
Leveraging Block Algorithms…
[Figure: column-major data layout vs. tile data layout]
Lessons Learned from PLASMA
PLASMA: Parallel Linear Algebra for Scalable Multi-core Architectures: http://icl.cs.utk.edu/plasma/
• Tile algorithms on homogeneous x86 cores
• Parallelism is brought to the fore
• May require the redesign of linear algebra algorithms
• Tile data layout translation
• Removes unnecessary synchronization points between panel-update sequences
• DAG execution, where nodes represent tasks and edges define dependencies between them
• Dynamic runtime system environment: QUARK
Example: Tile QR Factorization
[Figure: first panel factorization and corresponding updates]
[Figure: DAG for a 4x4 tile matrix]
Let's go crazy!
[Figure: DAG of a 20x20 tile QR factorization]
Dynamic Scheduling
Conceptually similar to out-of-order processor scheduling because
it has:
• Dynamic runtime DAG scheduler
• Out-of-order execution flow of fine-grained tasks
• Task scheduling as soon as dependencies are satisfied
• Producer-Consumer
Data Flow Programming Model: a five-decade-old concept
• Think "how things connect" rather than "how things happen"
• Assembly line
• Inherently parallel
Matrices Over Runtime Systems at Exascale (MORSE)
Mission statement: "Design dense and sparse linear algebra methods that achieve the fastest possible time to an accurate solution on large-scale hybrid systems"
• Runtime challenges due to the ever-growing hardware complexity
• Algorithmic challenges to exploit the hardware capabilities to the fullest
• Integrated into the MAGMA software stack
MAGMA-MORSE: x86 + Multi-GPU
• Lessons learned from PLASMA!
• CUDA-based hybrid systems
• New high-performance numerical kernels
• StarPU runtime system (Augonnet et al., INRIA Bordeaux)
• Both x86 and GPUs => hybrid computations
• Similar to LAPACK in functionality
Achieving High Level of Productivity
From Sequential Nested-Loop Code to Parallel Execution
for (k = 0; k < min(MT, NT); k++) {
    zgeqrt(A[k;k], ...);
    for (n = k+1; n < NT; n++)
        zunmqr(A[k;k], A[k;n], ...);
    for (m = k+1; m < MT; m++) {
        ztsqrt(A[k;k], A[m;k], ...);
        for (n = k+1; n < NT; n++)
            ztsmqr(A[m;k], A[k;n], A[m;n], ...);
    }
}
Achieving High Level of Productivity
From Sequential Nested-Loop Code to Parallel Execution
for (k = 0; k < min(MT, NT); k++) {
    starpu_Insert_Task(&cl_zgeqrt, k, k, ...);
    for (n = k+1; n < NT; n++)
        starpu_Insert_Task(&cl_zunmqr, k, n, ...);
    for (m = k+1; m < MT; m++) {
        starpu_Insert_Task(&cl_ztsqrt, m, k, ...);
        for (n = k+1; n < NT; n++)
            starpu_Insert_Task(&cl_ztsmqr, m, n, k, ...);
    }
}
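A hedged sketch of what stands behind one of these calls, assuming StarPU 1.x field names and leaving the kernel bodies out; the point is that a single codelet carries both a CPU and a CUDA implementation, so the runtime is free to place each task on either resource:

    #include <starpu.h>

    static void zgeqrt_cpu(void *buffers[], void *cl_arg)  { /* CPU kernel would go here  */ }
    static void zgeqrt_cuda(void *buffers[], void *cl_arg) { /* CUDA kernel would go here */ }

    static struct starpu_codelet cl_zgeqrt = {
        .where      = STARPU_CPU | STARPU_CUDA,   /* task may run on either resource */
        .cpu_funcs  = { zgeqrt_cpu, NULL },
        .cuda_funcs = { zgeqrt_cuda, NULL },
        .nbuffers   = 1,
        .modes      = { STARPU_RW },
    };

    /* The runtime tracks dependencies through the registered tile handle and
       releases the task as soon as they are satisfied. */
    void insert_zgeqrt(starpu_data_handle_t tile_kk)
    {
        starpu_insert_task(&cl_zgeqrt, STARPU_RW, tile_kk, 0);
    }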
Hybrid Architecture Targeted
⇒ PCIe x16 interconnect, 64 Gb/s: a very thin pipe!
⇒ Fermi C2050: 448 CUDA cores, 515 Gflop/s
Performance Charts
Cholesky
QR
LU
Symmetric Matrix Inversion
Cholesky Performance
8 Intel x86 cores + 3 Tesla GPUs
E. Agullo, C. Augonnet, J. Dongarra, H. Ltaief, R. Namyst, S. Thibault, S. Tomov, Software for GPUs, GPU Computing Gems, vol. 2, 2011.
QR Performance
8 AMD x86 cores + 4 Tesla GPUs
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and
Distributed Processing Symposium, 2011.
QR Performance
+~200 Gflop/s, but 12 cores = ~150 Gflop/s
8 AMD x86 cores + 4 Tesla GPUs
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, S. Tomov, IEEE International Parallel and
Distributed Processing Symposium, 2011.
Performance Breakdown

Kernel    CPU           GPU           Speedup
sgeqrt    9 Gflop/s     60 Gflop/s    ~6
stsqrt    12 Gflop/s    67 Gflop/s    ~6
sormqr    8.5 Gflop/s   227 Gflop/s   ~27
stsmqr    10 Gflop/s    285 Gflop/s   ~27

Task distribution observed on StarPU:
• sgeqrt: 20% of tasks on GPUs
• stsmqr: 92.5% of tasks on GPUs
Taking advantage of heterogeneity!
• Only do what you are good at
• Don't do what you are not good at
LU Performance
8 Intel x86 cores + 3 Tesla GPUs
E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, J. Langou, H. Ltaief and S. Tomov, ACS/IEEE
International Conference on Computer Systems and Applications (best paper award), 2011
Symmetric Matrix Inversion
A⁻¹, seriously???
YES!
Critical component of the variance-covariance matrix computation in statistics
Three steps:
• Cholesky factorization (DPOTRF)
• Inverting the Cholesky factor (DTRTRI)
• Computing the product of the inverted Cholesky factor with its transpose (DLAUUM)
Built on previous work from E. Agullo, H. Bouwmeester, J. Dongarra, J. Kurzak, J. Langou,
and L. Rosenberg
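For reference, the three steps map onto plain LAPACK as in the sketch below (the sequential baseline; the DAG-scheduled version instead interleaves the tiled equivalents of the three routines):

    /* A^{-1} for a symmetric positive definite A, lower triangle -- hedged sketch. */
    extern void dpotrf_(const char*, const int*, double*, const int*, int*);
    extern void dtrtri_(const char*, const char*, const int*, double*, const int*, int*);
    extern void dlauum_(const char*, const int*, double*, const int*, int*);

    int spd_inverse_lower(int n, double *A, int lda)
    {
        int info = 0;
        dpotrf_("L", &n, A, &lda, &info);                      /* A = L * L^T  (DPOTRF) */
        if (info == 0) dtrtri_("L", "N", &n, A, &lda, &info);  /* A := L^{-1}  (DTRTRI) */
        if (info == 0) dlauum_("L", &n, A, &lda, &info);       /* A := L^{-T} L^{-1} = A^{-1} (DLAUUM) */
        return info;   /* lower triangle of A now holds the lower triangle of A^{-1} */
    }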
Scheduling Algorithms as DAGs
A⁻¹ Performance
8 Intel x86 cores + 2 Fermi GPUs
H. Ibeid, D. Kaushik, D. Keyes and H. Ltaief, HIPC'11, India
Summary and Future Directions
Two methodologies for solving challenging DLA
Static Scheduling (performance)
Dynamic Scheduling (productivity)
LAPACK compliant API
Source codes freely available in MAGMA
What’s next?
• Extended numerical functionality
• Distributed Memory Systems
Collaborators / Support
• MAGMA [Matrix Algebra on GPU and Multicore Architectures] team
  http://icl.cs.utk.edu/magma/
• PLASMA [Parallel Linear Algebra for Scalable Multicore Architectures] team
  http://icl.cs.utk.edu/plasma
Collaborating partners
University of Tennessee, Knoxville
University of California, Berkeley
University of Colorado, Denver
INRIA, France
KAUST, Saudi Arabia
Questions?