Linear Algebra on GPUs
Vasily Volkov
GPU Architecture Features
• SIMD architecture
– Don’t be confused by the scalar ISA, which is only a programming model
• We use a vector thread programming model instead
– Similar to SSE/Cell SPE/vector units, not superscalar units
– Native vector length is 32; can simulate larger (thread blocks)
• Multithreading using register windows
– Context switch never flushes registers to memory
– If more threads than can fit, some don’t start until others finish
• Huge register files, small high-latency caches
– Fundamental difference with CPUs, similar to vector processors
– Cache is to save memory bandwidth, not reduce latency
– Use register blocking, not cache blocking
• Large startup costs (≈5 μs)
– Not good for fine-grain computations
– Use global barriers instead? (≈1.5 μs)
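As a toy illustration of the vector-thread view above (this sketch is not from the original slides): one thread block of 64 CUDA threads is treated as a single vector thread of length 64, twice the native warp width of 32, and each lane keeps its operands in registers rather than relying on a cache. A SAXPY stands in for the real computation.

// Minimal sketch, assuming blockDim.x == 64: one thread block = one simulated
// vector thread; per-lane values live in registers (register blocking).
__global__ void saxpy_vector_thread(int n, float a, const float *x, float *y)
{
    int lane = threadIdx.x;                            // lane within the vector thread
    for (int i = blockIdx.x * 64 + lane; i < n; i += gridDim.x * 64) {
        float xi = x[i];                               // held in a register, not a cache
        y[i] = a * xi + y[i];
    }
}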
Relation to Other Multicores
Processor              Gflop/s/core     Vector length    # of cores
Core2 SSE, 2.66GHz     21.3 Gflop/s     4                2-4
Cell SPEs              25.6 Gflop/s     4                8
G80/G84                19-23 Gflop/s    32               4-16

• All offer both multithreading and SIMD capabilities
• Use CUDA to program all of them?
Pointer Chase Benchmark, 8800GTX
Run k = A[k] in a loop in a scalar thread (latency-bound); larger latency at a cache miss reveals the cache sizes (see the kernel sketch below).
8-word cache line
L1: 8 x 5kB, each is 20-way associative
L2: 6 x 32kB, each is 4-way associative
512kB memory pages
TLB: 16 entries, fully associative
8800GTX/8800GTS/8600GTS/FX5600: different numbers of similar caches
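A minimal sketch of the benchmark as described on this slide (the kernel name and the host-side setup of the pointer chain are assumptions): a single scalar thread chases k = A[k], so every load depends on the previous one and the kernel is latency-bound; sweeping the array size and stride of the chain exposes the cache and TLB sizes.

// Pointer-chase microbenchmark sketch; launch with a single scalar thread:
//   pointer_chase<<<1, 1>>>(dA, 0, iterations, dOut);
__global__ void pointer_chase(const int *A, int start, int iterations, int *out)
{
    int k = start;
    for (int i = 0; i < iterations; ++i)
        k = A[k];            // serial chain of dependent loads
    *out = k;                // keep the chain live so it is not optimized away
}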
GPU Memory System (8800GTX)
Matrix-Matrix multiply, C = C + AB
• 64x16 blocks in C, rank-4 updates
– Ensures that our code is compute-bound
• Each vector thread computes one block in C
– ijk/jik form; other choices produce race condition in C
• Keep A’s and C’s block in vector registers
– Similarly done on IBM 3090 VF and Cray X1
• Keep B’s block in local memory
– Others keep it in scalar registers (n/a on GPU); or caches (slow)
• Use compiler options to enforce tight register budget
– To hide memory latencies better by multithreading
• Use prefetching to hide latencies even better
– Now, performance is not bound by latency and bandwidth in reading blocks in A and B!
– Bound by instruction issue and local memory ops (230 Gflop/s)
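To make the scheme concrete, here is a simplified sketch, not the exact CUBLAS 2.0 kernel: each thread block of 64 threads plays the role of one vector thread and owns a 64x16 block of C; every thread keeps its 1x16 strip of C in registers, the current 4x16 piece of B is staged in local (shared) memory, and A is streamed through registers in rank-4 updates. The kernel name, column-major layout, divisible dimensions, and the absence of prefetching are assumptions made for brevity.

// Launch as: sgemm_64x16<<<dim3(m/64, n/16), 64>>>(A, B, C, k, lda, ldb, ldc)
__global__ void sgemm_64x16(const float *A, const float *B, float *C,
                            int k, int lda, int ldb, int ldc)
{
    const int row = blockIdx.x * 64 + threadIdx.x;   // this thread's row of C
    const int col = blockIdx.y * 16;                 // first column of the 64x16 block

    __shared__ float Bs[4][16];                      // current 4x16 block of B
    float c[16];                                     // 1x16 strip of C held in registers
    for (int j = 0; j < 16; ++j)
        c[j] = C[row + (col + j) * ldc];

    for (int kk = 0; kk < k; kk += 4) {
        // 64 threads cooperatively stage the 4x16 block of B (64 values).
        Bs[threadIdx.x % 4][threadIdx.x / 4] =
            B[(kk + threadIdx.x % 4) + (col + threadIdx.x / 4) * ldb];
        __syncthreads();

        // Rank-4 update: 4 elements of A per thread, everything in registers.
        float a[4];
        for (int i = 0; i < 4; ++i)
            a[i] = A[row + (kk + i) * lda];
        for (int j = 0; j < 16; ++j)
            for (int i = 0; i < 4; ++i)
                c[j] += a[i] * Bs[i][j];
        __syncthreads();
    }

    for (int j = 0; j < 16; ++j)
        C[row + (col + j) * ldc] = c[j];
}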
Performance of SGEMM
CUBLAS 1.1:
• keeps square blocks in A and B in local memory
• uses long vectors (poor instruction mix)
• exposing too much data parallelism may cost you
Our SGEMM is now in CUBLAS 2.0 beta
SGEMM, 8800GTX, k = 1024
Constant work per vector thread (function of k)
Optimized version does better load balancing by computing partial sums
Panel Factorization
CPU: runtime on Core2 Duo 2.66GHz, Intel MKL 10.0 (includes CPU-GPU transfer!)
GPU: estimated for 8800GTX as Time = (number of kernel launches) × 5 μs + (bandwidth required) / (75 GB/s)
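Written out as code, the estimate reads as follows. This is a minimal sketch assuming the reconstruction of the formula above; the function name is an assumption.

// Runtime model from the slide: kernel-launch overhead plus memory traffic
// at the pin bandwidth of the 8800GTX.
static double estimate_gpu_panel_time(int kernel_launches, double bytes_required)
{
    const double launch_overhead = 5e-6;   /* ~5 us per kernel launch */
    const double pin_bandwidth   = 75e9;   /* ~75 GB/s                */
    return kernel_launches * launch_overhead + bytes_required / pin_bandwidth;
}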
Design of Matrix Factorizations
• Right-looking scheme = most parallelism = best on 16-core GPUs
• Crout scheme = least bandwidth = best on 4-core GPU and if using CUBLAS 1.1
• Left-looking = half of work in triangular solve = limited parallelism = inefficient
• 2-level blocking
– Both levels are right-looking + premature exit from finer level to keep
– Up to 6% speedup only, at large matrices (n≈10,000)
• Block size on GPU is 64 (same as in matrix multiply)
– Autotuning in QR (up to 7% speedup)
• Row-major layout on GPU in LU decomposition
– Since gathers with large stride are inefficient
– Requires transposition at every CPU-GPU transfer
– >2x speedup!
• Panel factorization on CPU overlapped with BLAS3 on GPU (use lookahead)
• Multiply by inverse (GEMM) instead of triangular solve (TRSM)
– TRSM vs. GEMM is 13 Gflop/s vs. 160 Gflop/s if matrix is 64x64
– Parallelism in TRSM is not embarrassing enough
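A hedged sketch of the multiply-by-inverse trick, using the legacy CUBLAS API of that period; invert_lower64() stands for a CPU-side inversion of the 64x64 triangular diagonal block and is a hypothetical helper, as are the function names. The 13 vs. 160 Gflop/s figures are the ones quoted above.

#include <cublas.h>

void invert_lower64(const float *L, float *Linv);   /* hypothetical CPU helper */

/* Replace the triangular solve
 *   cublasStrsm('L', 'L', 'N', 'N', 64, n, 1.0f, dL, 64, dB, ldb);
 * with a GEMM against the precomputed inverse (assumes cublasInit() was called). */
void solve_with_inverse(const float *hL,                 /* 64x64 block on the host */
                        const float *dB, int n, int ldb, /* 64 x n RHS on the GPU   */
                        float *dX)                       /* 64 x n result, GPU      */
{
    float hInv[64 * 64];
    invert_lower64(hL, hInv);                            /* done once, on the CPU   */

    float *dInv;
    cublasAlloc(64 * 64, sizeof(float), (void **)&dInv);
    cublasSetMatrix(64, 64, sizeof(float), hInv, 64, dInv, 64);

    /* X = inv(L) * B runs as a GEMM (~160 Gflop/s) rather than a TRSM (~13 Gflop/s). */
    cublasSgemm('N', 'N', 64, n, 64, 1.0f, dInv, 64, dB, ldb, 0.0f, dX, 64);

    cublasFree(dInv);
}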
Test Platform
• GPU
– Core2 Duo 2.67GHz + GeForce 8800 GTX
– Core2 Duo 2.67GHz + two GeForce 8800 GTX
• CPU
– Core2 Duo 2.67GHz
– Core2 Quad 2.4GHz
Summary of Performance
Speedups vs. CPU
Summary of Speedups
             8800GTX     Core2 Duo              Core2 Quad
             Gflop/s     Gflop/s   speedup      Gflop/s   speedup
Cholesky     183         23.0      7.4          34.9      5.5
LU           179         22.7      7.7          59.2      3.0
QR           192         22.5      8.3          44.8      4.3
SGEMM        206         25.9      8.0          69.8      3.0
Speedup using 2 GPUs
Using column-cyclic layout
Breakdown of runtime (LU)
What if we omit one optimization in LU?
Other Work Done
• Tridiagonal eigenvalue solver (bisection)
– Most work: factorize A – λ_i I = LDL^T and count the signs in D (compute-bound); a sketch of this sign count appears after this list
– Done for many λ_i in parallel (traditionally vectorized)
– If more parallelism is needed, do multisection instead of bisection
• But it increases the total flop count
– The rest is difficult to parallelize and is not worth it
– Our solution:
• Run vectorized loops on the GPU, rest (least work) on the CPU
• Autotune to decide the optimal redundancy and when to involve the CPU
• Use features of IEEE arithmetic to save another 15-30% of runtime
• Up to 156x faster than LAPACK on Pentium 4
• Tridiagonal eigenvector solver (inverse iteration)
– Most work: solve (A – λ_i I) x_{k+1} = x_k for fixed λ_i (bandwidth-bound)
– Factorize A – λ_i I = LDL^T once, keep D only; reconstruct L when needed
• Reconstruction is overlapped with memory access; still bandwidth-bound
– Don’t pivot; recompute using safe code if it fails (do it on the CPU)
– Up to 120x faster than LAPACK on Core2 Duo so far
– More complicated when eigenvalues are clustered
• Stencil computation (7-point on 3D grid)
– Blocks in registers and local memory
– Bandwidth-bound, runs at up to 66% of pin-bandwidth
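As referenced in the bisection item above, here is a hedged sketch of its core kernel (kernel name, launch layout, and the absence of the IEEE-based safeguards are assumptions): each thread takes one shift λ_i, runs the LDL^T pivot recurrence for the symmetric tridiagonal matrix (diagonal d, off-diagonal e), and counts the negative pivots, which by Sylvester's law of inertia is the number of eigenvalues below λ_i.

// One thread per shift; many shifts are processed in parallel, which is the
// vectorized loop that runs on the GPU.
__global__ void sturm_count(const float *d, const float *e, int n,
                            const float *lambda, int m, int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;

    float shift = lambda[i];
    float pivot = d[0] - shift;                  // first pivot of D
    int   neg   = (pivot < 0.0f);
    for (int j = 1; j < n; ++j) {
        // LDL^T pivot recurrence for a tridiagonal matrix (no safeguarding shown;
        // the slides rely on IEEE arithmetic to avoid extra branches here).
        pivot = d[j] - shift - e[j - 1] * e[j - 1] / pivot;
        neg  += (pivot < 0.0f);
    }
    count[i] = neg;                              // # of eigenvalues below lambda[i]
}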
Future Work
• Analysis of architecture
– Find best parallels in past architectures to reuse methods
– Catching up with newer GPUs
– More micro-benchmarking to get better performance models
• More scientific kernels
– CUFFT is ≈50 Gflop/s; can do better (e.g., by not doing sin/cos in the inner loop)
• More LAPACK
– Two-sided factorizations used in eigensolvers and SVD
• In LAPACK, 50% of the work is in BLAS1/BLAS2
• A mostly-BLAS3 algorithm is known, but it requires more flops if eigenvectors are needed
• May use divide-and-conquer instead
– MRRR (improved inverse iteration algorithm, also rich in parallelism)
– Non-symmetric eigensolvers such as QR iterations
• currently fine-grained, can do better?
– Iterative refinement for eigenvalue problem?
• ScaLAPACK (distributed memory LAPACK)
– One-sided factorizations on a GPU cluster