CS 267: Applications of Parallel Computers
Lecture 15: Parallel Sparse Matrix-Vector Multiplication
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267
Outline
• Matrix-vector multiplication (review)
• Matrices and graphs (review)
• Serial optimizations
  • Register blocking
  • Cache blocking
  • Exploiting symmetry
  • Reordering
• Parallel optimizations
  • Reordering
  • Multiple vectors or operations
  • Other operations

Parallel Dense Matrix-Vector Product (Review)
• y = y + A*x, where A is a dense matrix
• Layout: 1D by rows [Figure: block rows of A and segments of x and y assigned to processors P0-P3]
• Algorithm:
  • Foreach processor j
    • Broadcast X(j)
    • Compute A(p)*x(j)
• A(i) refers to the n by n/p block row that processor i owns
• Algorithm uses the formula Y(i) = Y(i) + A(i)*X = Y(i) + Σj A(i)*X(j)

Graphs and Sparse Matrices
• A sparse matrix is a representation of a (sparse) graph [Figure: 6x6 sparse matrix and the corresponding 6-vertex graph]
• Matrix entries are edge weights
• Diagonal contains self-edges (usually nonzero)
• Matrix-vector multiplication:
  • Vector is spread over the nodes
  • Compute the new vector by multiplying incoming edges & source node values

Data Structure for Sparse Matrix
• Matvec (matrix-vector multiplication) will
  • Access rows of the matrix serially -> use an array for each row (indices and values)*
  • Access the destination vector only once per element -> layout doesn't matter
  • Access the source vector once per row, only at the nonzeros
• The only opportunity for memory hierarchy improvements is in the source vector
* Layout by rows may not be best – more on this later

Compressed Sparse Row Format
• Compressed Sparse Row (CSR), also called Compressed Row Storage (CRS), is:
  • General – does not make assumptions about the matrix structure
  • Relatively efficient – rows are sorted, only 1 index per nonzero

Performance Challenge
Consider a 167 MHz UltraSPARC I:
• Dense matrix operation
  • Naïve matrix-vector multiplication: 38 Mflops
  • Vendor-optimized matrix-vector multiplication: 57 Mflops
  • Vendor-optimized matrix-matrix multiplication: 185 Mflops
  • Reason: low flop-to-memory ratio (2)
• Sparse matrix operation
  • Naïve matrix-vector multiplication of a dense matrix in sparse representation (Compressed Sparse Row): 25 Mflops
  • As low as 5.7 Mflops for some matrices
  • Reason: indexing overhead and irregular memory access patterns
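For reference, the naïve CSR matvec measured above is essentially the following loop. This is a generic sketch; the array names (row_ptr, col_ind, val) are illustrative, not taken from a particular package.

    /* Minimal CSR sparse matrix-vector multiply, y = y + A*x.
     * row_ptr has n+1 entries; row i's nonzeros live in
     * positions [row_ptr[i], row_ptr[i+1]) of col_ind and val. */
    void spmv_csr(int n, const int *row_ptr, const int *col_ind,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                yi += val[k] * x[col_ind[k]];   /* irregular access to x */
            y[i] = yi;
        }
    }

Each nonzero costs one value load, one index load, and an indirect access to x, which is the indexing overhead and irregularity referred to above.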
Serial Optimizations
• Several optimizations, depending on matrix structure:
  • Register blocking – Sparsity and PETSc
  • Cache blocking – Sparsity
  • Symmetry – BeBop and a project at CMU
  • Reordering – Toledo ('97) and Oliker et al ('00) do this
  • Exploit bands and other structures – the sparse compiler project by Bik ('96) does this

Register Blocking Optimization
• Identify small dense blocks of nonzeros [Figure: example of a 2x2 register-blocked matrix]
• Fill in extra zeros to complete the blocks
• Use an optimized multiplication code for the particular block size:
  • Reuses source elements
  • Unrolls the loop
  • Saves index memory
• Improves register reuse, lowers indexing overhead
• Computes extra flops for the added zeros

Register Blocking: Blocked Row Format
• Blocked Compressed Sparse Row format [Figure: example matrix stored in blocked CSR with 2x2 blocks – a block values array, one column index per block, and one pointer per block row]
• Advantages of the format:
  • Better temporal locality in registers
  • The multiplication loop can be unrolled for better performance

Register Blocking: Fill Overhead
• We use a uniform block size, adding fill overhead: fill overhead = 12/7 = 1.71
• This increases both space and the number of floating-point operations.

Performance Model for Register Blocking
• Estimate the performance of register blocking:
  • Estimated raw performance: Mflop/s of a dense matrix in sparse bxb blocked format on the machine of choice (machine-dependent)
  • Estimated overhead: to fill in bxb blocks for the particular matrix (matrix-dependent)
  • Maximize over b: Estimated raw performance / Estimated overhead
• Use sampling to further reduce time; row and column dimensions are computed separately

Register Blocking: Profile of Ultra I
• Dense matrix profile on an UltraSPARC I [Figure]

Register Blocking: Profile of MIPS R10K
• Performance of a dense matrix in sparse blocked format on the MIPS R10K [Figure]

Register Blocking: Profile of P III
• Performance of a dense matrix in sparse blocked format on the P III [Figure]

Register Blocking: Profile of P4
• Performance of a dense matrix in sparse blocked format on the P4 [Figure]

Benchmark matrices
• Matrix 1: dense matrix (1000 x 1000)
• Matrices 2-17: Finite Element Method matrices
• Matrices 18-39: matrices from structural engineering and device simulation
• Matrices 40-44: linear programming matrices
• Matrix 45: document retrieval matrix used for Latent Semantic Indexing
• Matrix 46: random matrix (10000 x 10000, 0.15% nonzero)

Register Blocking: Performance
• The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).
• A single search step (compare blocked and unblocked) raises the numbers below the line to 1.

Register Blocking: Performance
• Speedup is generally best on the MIPS R10000, which is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38).

Register Blocking on P4 for FEM/fluids matrix 1 [Figure]

Register Blocking on P4 for FEM/fluids matrix 2 [Figure]

Register Blocking: Model Validation [Figure]

Register Blocking: Overhead
• Pre-computation overhead:
  • Estimating the fill overhead (red bars)
  • Reorganizing the matrix (yellow bars)
• The ratio gives the number of repetitions for which the optimization is beneficial.
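To make the register-blocking idea concrete, here is a simplified sketch of a 2x2 blocked-CSR (BCSR) matvec. The array names and layout (four values per block stored row-major, one column index per block) are assumptions for illustration; this is not the code Sparsity generates.

    /* Sketch of y = y + A*x for a matrix stored in 2x2 BCSR.
     * Block row i's blocks occupy [b_row_ptr[i], b_row_ptr[i+1]);
     * each block has 4 values in b_val and one index in b_col_ind.
     * n is assumed even (blocks padded with explicit zeros). */
    void spmv_bcsr_2x2(int n, const int *b_row_ptr, const int *b_col_ind,
                       const double *b_val, const double *x, double *y)
    {
        for (int i = 0; i < n / 2; i++) {
            double y0 = y[2*i], y1 = y[2*i + 1];     /* kept in registers */
            for (int k = b_row_ptr[i]; k < b_row_ptr[i+1]; k++) {
                int j = b_col_ind[k];                /* one index per block */
                double x0 = x[2*j], x1 = x[2*j + 1]; /* reused across 2 rows */
                const double *a = &b_val[4*k];
                y0 += a[0]*x0 + a[1]*x1;
                y1 += a[2]*x0 + a[3]*x1;
            }
            y[2*i]     = y0;
            y[2*i + 1] = y1;
        }
    }

Note how the sketch exhibits the claimed effects: one index per block instead of one per nonzero, source elements reused across two rows, and destination values held in registers, at the cost of multiplying any padded zeros.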
Cache Blocking Optimization
• Keep part of the source vector in cache [Figure: source vector x, sparse matrix A, destination vector y]
• Improves cache reuse of the source vector.
• Challenge: choosing a block size.

Cache Blocking
• Temporal locality of access to the source vector
• Makes sense when x does not fit in cache
  • Rectangular matrices, in particular
[Figure: source vector x and destination vector y as laid out in memory]

Cache Blocking: Speedup
• MIPS speedup is generally better: larger cache, larger miss penalty (26/589 ns for MIPS, 36/268 ns for Ultra).
• The document retrieval and random matrices are exceptions; they perform very poorly in the baseline code.

Cache Blocking: Web Document Matrix
[Figure: raw vs. cache-blocked performance on the R10K, Ultra, P-III, and 21164]
• This matrix seems to be an exception
  • Cache blocking helps
  • Only on a random matrix of similar shape/size does it also help (without other optimizations)

Cache Blocking: Searching for Block Size [Figure]

Combination of Register and Cache Blocking
• The combination is rarely beneficial, often slower than either of the two optimizations alone.

Combination of Register and Cache Blocking [Figure]

Cache Blocking: Summary
• On matrices from real problems, rarely useful.
• The matrix needs to be:
  • Random in structure
  • Rectangular
• A reasonable block size can be determined through coarse-grained search.
• The combination of register and cache blocking is not useful.
  • They work on disjoint sets of matrices (as far as we can tell).

Exploiting Symmetry
• Many matrices are symmetric (undirected graph)
  • CG, for example, only works on symmetric matrices
• Symmetric storage: only store/access ½ of the matrix
• Update y twice for each stored element of A:
  • y[i] += A[i,j]*x[j]
  • y[j] += A[i,j]*x[i]
• Saves up to ½ of the memory traffic (to A)
• Works for sparse formats

Symmetry: Speedups [Figure]

Multiple Vector Optimization
• Better potential for reuse
• Loop-unrolled codes that multiply across vectors are generated by a code generator:
  • yi1 += aij * xj1
  • yi2 += aij * xj2
• Allows reuse of matrix elements.
• Choose the number of vectors for loop unrolling.

Multiple Vector Multiplication
• Better chance of optimization: BLAS2 vs. BLAS3
  • Repetition of the single-vector case vs. the multiple-vector case
• Blocked algorithms (blocked CG) may use this

Multiple Vector Effect on Synthetic Matrices
[Figure: results for a matrix with random sparsity pattern and for a dense matrix in sparse format]

Multiple Vectors with Register Blocking
• The speedup is larger than for single-vector register blocking.
• Even the performance of the matrices that did not speed up before improved (the middle group on the UltraSPARC).

Multiple Vectors with Cache Blocking
[Figure: MIPS and UltraSPARC results]
• Noticeable speedup for matrices that did not speed up with cache blocking alone (UltraSPARC)
• Block sizes are much smaller than for single-vector cache blocking.
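A minimal sketch of the multiple-vector idea for two vectors, with the cross-vector unrolling written by hand rather than emitted by the code generator; the names (x0/x1, y0/y1) are illustrative.

    /* CSR matvec with two source/destination vector pairs, unrolled
     * across vectors: y0 = y0 + A*x0 and y1 = y1 + A*x1.
     * Each nonzero val[k] is loaded once and used twice. */
    void spmv_csr_2vec(int n, const int *row_ptr, const int *col_ind,
                       const double *val,
                       const double *x0, const double *x1,
                       double *y0, double *y1)
    {
        for (int i = 0; i < n; i++) {
            double s0 = y0[i], s1 = y1[i];
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++) {
                double a = val[k];       /* matrix element reused */
                int j = col_ind[k];
                s0 += a * x0[j];
                s1 += a * x1[j];
            }
            y0[i] = s0;
            y1[i] = s1;
        }
    }

This is the BLAS2-vs-BLAS3 effect in miniature: the matrix traffic is amortized over several vectors, which is why blocked CG and similar algorithms can benefit.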
Sparsity System Organization
• Guides the choice of optimization
• Automatic selection of optimization parameters such as block size and the number of vectors
• http://www.cs.berkeley.edu/~yelick/sparsity
• The released version does not include the optimization for symmetry
[Figure: Sparsity system dataflow – a representative matrix, the machine profile produced by the Sparsity machine profiler, and the maximum # of vectors feed the Sparsity optimizer, which emits an optimized format & code]

Summary of Serial Optimizations
• Sparse matrix optimization has the same problems as dense, plus the nonzero pattern
• Optimization of format/code is effective, but expensive
  • Should be done across an application, not at each matvec call or even each iterative solve
• The optimization choice may depend on matrix characteristics rather than specifics
• Optimizing over a larger "kernel" is also important
• Heuristics plus search can be used to select optimization parameters
• Future:
  • Different code organization, instruction mixes
  • Different views of A: A = A1 + A2
  • Different matrix orderings

Tuning other sparse operations
• Symmetric matrix-vector multiply A*x
• Solve a triangular system of equations: T^(-1)*x
• A^T*A*x
  • Kernel of information retrieval via LSI
  • Same number of memory references as A*x
• A^2*x, A^k*x
  • Kernel of information retrieval used by Google
  • Changes the calling algorithm
• A^T*M*A
  • Matrix triple product
  • Used in multigrid solvers
• ...

Symmetric Sparse Matrix-Vector Multiply
[Figure: performance on a P4 using icc]

Sparse Triangular Solve
[Figure: performance on a P4 using icc]

Parallel Sparse Matrix-Vector Multiplication

Parallel Sparse Matrix-Vector Product
• y = y + A*x, where A is a sparse matrix
• Layout: 1D by rows [Figure: block rows of A and segments of x and y assigned to processors P0-P3]
• Algorithm:
  • Foreach processor j
    • Broadcast X(j)
    • Compute A(p)*x(j)
• The same algorithm works, but:
  • May be unbalanced
  • Communicates more of X than needed
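A minimal MPI sketch of this baseline 1D row-partitioned algorithm, under the simplifying assumption that n is divisible by p, so every rank owns n/p consecutive rows (with globally numbered column indices) and the matching slice of x and y. "Broadcast X(j)" from every rank becomes a single allgather, which is exactly the "communicates more of X than needed" behavior; all names are illustrative.

    #include <mpi.h>

    /* Baseline parallel y = y + A*x with a 1D row layout.
     * x_full must have room for the entire global vector. */
    void par_spmv(int n_local,
                  const int *row_ptr, const int *col_ind, const double *val,
                  const double *x_local, double *x_full, double *y_local,
                  MPI_Comm comm)
    {
        /* gather every rank's piece of x everywhere */
        MPI_Allgather(x_local, n_local, MPI_DOUBLE,
                      x_full,  n_local, MPI_DOUBLE, comm);

        /* multiply the local block of rows */
        for (int i = 0; i < n_local; i++) {
            double yi = y_local[i];
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                yi += val[k] * x_full[col_ind[k]];
            y_local[i] = yi;
        }
    }

The opportunities on the next slide start from this baseline, e.g. replacing the allgather with point-to-point messages carrying only the needed entries of x, and overlapping that exchange with the local multiply.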
Optimization Opportunities
• Send only the necessary parts of x
• Overlap communication and computation
• Load balance
• Take advantage of symmetry
  • Trickier than the serial case, because of multiple accesses to Y
• Take advantage of blocked structure
  • For serial performance, using FEM matrices or similar
• Take advantage of bands
  • Store only diagonal bands with a starting coordinate
  • Reduces index storage and access; running time?
• Reorder your matrix
• Use blocked CG
  • Better serial performance; reduces α (number of messages) but not β (volume) in communication

Following slides based on: Ordering for Efficiency of Sparse CG [X. Li, L. Oliker, G. Heber, R. Biswas]

Partitioning and Ordering Schemes
• RCM, MeTiS
• Self-Avoiding Walk (SAW) [Heber/Biswas/Gao '99]
  • A SAW over a triangular mesh is an enumeration of the triangles such that two consecutive triangles in the SAW share an edge or a vertex (there are no jumps).
  • It is a mesh-based linearization technique with application areas similar to space-filling curves.
  • Construction time is linear in the number of triangles.
  • Amenable to hierarchical coarsening and refinement, and can easily be parallelized.

Experiment
• Mesh generation
  • 2D Delaunay triangulation, generated by Triangle [Shewchuk '96]
  • The mesh is shaped like the letter "A": 661,054 vertices, 1,313,099 triangles
• Matrix generation
  • For each vertex pair (vi, vj) with distance <= 3, assign a random value in (0, 1) to A(i, j)
  • The diagonal is chosen to make the matrix diagonally dominant
  • About 39 nonzeros per row, 25,753,034 nonzeros in total
• Implementations
  • The MPI version uses Aztec
  • The Origin 2000 and Tera MTA versions use compiler directives

Sparse CG Timings [Figure]

Sparse Matvec Cache Misses [Figure]

Sparse Matvec Communication Volume [Figure]

Observations
• On the T3E and Origin, the ordering methods significantly improve CG performance.
• On the T3E, matrix-vector efficiency depends more on cache reuse than on communication reduction:
  • RCM, SAW, MeTiS: >= 93% of total time spent servicing cache misses
  • ORIG: 72-80% of total time spent servicing cache misses
• On the Origin, SMP directives can be used, but data distribution is required.
• On the Tera MTA, no ordering or data distribution is required; the program is the simplest.

Summary
• Lots of opportunities for optimization:
  • Data structures
  • Local performance
  • Communication performance
  • Numerical techniques (preconditioning)
• Take advantage of special structures
• OK to focus on a class of matrices

Graph Partitioning Strategy: MeTiS
• Most popular class of multilevel partitioners
• Objectives:
  • Balance the computational workload
  • Minimize the edge cut (interprocessor communication)
• Algorithm:
  • Collapses vertices & edges using a heavy-edge matching scheme
  • Applies a greedy partitioning algorithm to the coarsest graph (initial partitioning)
  • Uncoarsens it back using greedy graph growing + Kernighan-Lin

Strategy: Reverse Cuthill-McKee (RCM)
• Matrix bandwidth (profile) has a significant impact on the efficiency of linear system solvers & eigensolvers
• Geometry-based algorithm that generates a permutation so that nonzero entries are close to the diagonal
• Good preordering for LU or Cholesky factorization (reduces fill)
• Improves cache performance (but does not explicitly reduce edge cut)
• Can be used as a partitioner by assigning segments to processors
• Starting from a vertex of minimum degree, generates a level structure by BFS & orders vertices by decreasing distance from the original vertex

Strategy: Self-Avoiding Walks (SAW)
• Mesh-based technique similar to space-filling curves
• Two consecutive triangles in the walk share an edge or vertex (no jumps)
• Visits each triangle exactly once, entering/exiting over an edge/vertex
• Improves parallel efficiency related to locality (cache reuse) & load balancing, but does not explicitly reduce edge cuts
• Amenable to hierarchical coarsening & refinement
• Biswas, Oliker, Li, Concurrency: Practice & Experience, 12 (2000) 85-109
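The RCM slide above describes the ordering procedurally; the sketch below shows the core of it for a connected graph stored in adjacency (CSR-like) form. It is simplified: full RCM implementations also visit the neighbors of each vertex in order of increasing degree, which is omitted here.

    #include <stdlib.h>

    /* Simplified Reverse Cuthill-McKee sketch for a connected graph.
     * Neighbors of vertex v are adj[adj_ptr[v] .. adj_ptr[v+1]-1].
     * On return, perm[k] = old index of the vertex placed k-th. */
    void rcm_order(int n, const int *adj_ptr, const int *adj, int *perm)
    {
        int *visited = calloc(n, sizeof(int));
        int *queue   = malloc(n * sizeof(int));
        int head = 0, tail = 0;

        /* start from a vertex of minimum degree */
        int start = 0;
        for (int v = 1; v < n; v++)
            if (adj_ptr[v+1] - adj_ptr[v] < adj_ptr[start+1] - adj_ptr[start])
                start = v;

        /* BFS builds the level structure */
        queue[tail++] = start;
        visited[start] = 1;
        while (head < tail) {
            int v = queue[head++];
            for (int k = adj_ptr[v]; k < adj_ptr[v+1]; k++) {
                int w = adj[k];
                if (!visited[w]) {
                    visited[w] = 1;
                    queue[tail++] = w;
                }
            }
        }

        /* reverse the BFS order: farthest vertices come first */
        for (int k = 0; k < n; k++)
            perm[k] = queue[n - 1 - k];

        free(visited);
        free(queue);
    }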
Test Problem
• 2D Delaunay triangulation of the letter-"A"-shaped mesh, generated using the Triangle package (661,054 vertices & 1,313,099 triangles)
• Underlying matrix assembled by assigning a random value in (0,1) to each (i,j) entry corresponding to a vertex pair (vi,vj) with dist(vi,vj) < 4 (other entries set to 0)
• Diagonal entries set to 40 (to make the matrix positive definite)
• Final matrix has approximately 39 entries/row and 25,753,034 nonzeros
• CG converges in 13 iterations; SPMV accounts for 87% of the flops

Distributed-Memory Implementation
• Each processor has local memory that only it can directly access; message passing is required to access the memory of another processor
• The user decides how to distribute data and organize the communication structure
• Allows efficient code design at the cost of higher complexity
• Parallel CG calls the Aztec sparse linear library
  • Matrix A partitioned into blocks of rows; each block assigned to a processor
  • Associated component vectors (x, b) distributed accordingly
  • Communication needed to transfer some components of x
• Data from the T3E (450 MHz Alpha processor, 900 Mflops peak, 96 KB secondary cache, 3D torus interconnect)

Distributed-Memory Implementation

          Avg. Cache Misses (10^6)       Avg. Comm (10^6 bytes)
  P     ORIG   MeTiS    RCM    SAW     ORIG   MeTiS    RCM    SAW
  8    3.684   2.004  3.228  3.034    3.749   0.011  0.031  0.049
 16    2.007   0.971  2.364  1.330    1.905   0.011  0.032  0.036
 32    1.060   0.507  1.492  0.658    1.017   0.009  0.032  0.030
 64    0.601   0.290  0.828  0.358    0.515   0.008  0.032  0.023

• The ORIG ordering has a large edge cut (interprocessor communication) and poor locality (high number of cache misses)
• MeTiS minimizes the edge cut, while SAW minimizes cache misses

Distributed Shared-Memory System
• Origin2000 (SMP of nodes, each with dual 250 MHz R10000 processors & local memory)
• Hardware makes all memory equally accessible from the software perspective by sending memory requests through routers on the nodes
• Memory access time is non-uniform (depends on the # of hops)
• Hypercube interconnect (maximum log(p) hops)
• Each processor has a 4 MB cache where only it can fetch/store data
• If a processor accesses data not in its cache, it is delayed while a copy is fetched
• When a processor modifies a word, all other copies of that cache line are invalidated

Distributed Shared-Memory Implementation
• Use OpenMP-style directives to parallelize loops
  • Requires less effort than MPI
• Two implementation approaches taken:
  • SHMEM: assume the Origin2000 has flat shared memory (arrays not explicitly distributed, non-local data handled by cache coherence)
  • CC-NUMA: account for the underlying distributed shared memory by explicit data distribution
• Computational kernels are identical for SHMEM and CC-NUMA
• Each processor is assigned an equal # of rows of the matrix (block)
• No explicit synchronization is required since there are no concurrent writes
• A global reduction is required for the DOT operation
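For the directive-based implementations above, the parallel kernel is essentially the CSR row loop with a work-sharing directive. A minimal OpenMP-style sketch (not the actual Origin2000 code) follows; the static schedule mirrors the "equal # of rows per processor" assignment, and no synchronization is needed because each thread writes only its own y[i].

    /* OpenMP-style parallelization of the CSR matvec row loop. */
    void spmv_csr_omp(int n, const int *row_ptr, const int *col_ind,
                      const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            double yi = y[i];
            for (int k = row_ptr[i]; k < row_ptr[i+1]; k++)
                yi += val[k] * x[col_ind[k]];
            y[i] = yi;
        }
    }

The DOT product in CG would use the same kind of directive with a reduction clause, which corresponds to the global reduction noted above.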
Distributed Shared-Memory Implementation

            SHMEM                     CC-NUMA                MPI
  P    ORIG    RCM     SAW      ORIG    RCM     SAW          SAW
  1   46.91  37.18   36.79     46.91  37.18   36.79            -
  4   30.64  25.35   24.75     17.61  10.65   10.59        7.880
 16   16.35  15.52   15.55     6.205  2.845   2.872        1.926
 64   10.81  7.782   8.450     2.365  0.885   0.848        0.905

• CC-NUMA performs significantly better than SHMEM
• RCM & SAW reduce runtimes compared to ORIG (more so for CC-NUMA)
• Little difference between RCM & SAW, probably due to the large cache
• CC-NUMA (with ordering) & MPI runtimes are comparable, even though the programming methodologies are quite different

Tera MTA Multithreaded Architecture
• The 255 MHz Tera uses multithreading to hide latency (100-150 cycles per word) & keep processors saturated with work
  • No data cache
  • Randomized memory mapping (data layout impossible)
  • Near-uniform data access from any processor to any memory location
• Each processor has ~100 hardware streams (each with 32 registers & a PC)
  • Context switch on each cycle, choosing the next instruction from the ready streams
  • A stream can execute an instruction only once every 21 cycles, even if no instructions reference memory
• Synchronization between threads is accomplished using full/empty bits in memory, allowing fine-grained threads
• No explicit load balancing is required, since dynamic scheduling of work to threads can keep each processor saturated
• No difference between uni- and multiprocessor parallelism

Tera MTA Implementation
• The multithreading implementation is trivial (requires only compiler directives)
  • Special assertions are used to indicate that there are no loop-carried dependencies
  • The compiler is then able to parallelize the loop segments
• Load balancing is handled by the OS (which dynamically assigns rows of the matrix to threads)
• Other than the reduction for DOT, no special synchronization constructs are required (no possible race conditions in CG)
• No special ordering is required (or possible) to achieve good parallel performance

Tera MTA Results

         ORIG             SAW
  P    SPMV      CG        CG
  1   0.378    9.74      9.86
  2   0.189    5.01      5.02
  4   0.095    2.64      2.53
  8   0.051    1.36      1.35

• Both SPMV and CG show high scalability (over 90%) with 60 streams per processor
• Sufficient TLP to tolerate the high overhead of memory access
• The 8-processor Tera is faster than the 32-processor Origin2000 & the 16-processor T3E, with no partitioning or ordering (but will scaling continue beyond 8 processors?)
• Adaptivity: no extra work is required to maintain performance