CS 267: Applications of Parallel Computers
Lecture 15:
Parallel Sparse Matrix-Vector Multiplication
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267
Outline
• Matrix-vector multiplication (review)
• Matrices and graphs (review)
• Serial optimizations
  • Register blocking
  • Cache blocking
  • Exploiting symmetry
  • Reordering
• Parallel optimizations
  • Reordering
  • Multiple vectors or operations
• Other operations
Parallel Dense Matrix-Vector Product (Review)
• y = y + A*x, where A is a dense matrix
• Layout: 1D by rows
  (figure: A partitioned into block rows owned by P0–P3, with x and y partitioned conformally)
• Algorithm: for each processor j
  • Broadcast X(j)
  • Compute A(p)*x(j)
• A(i) refers to the n/p-by-n block row that processor i owns
• The algorithm uses the formula
      Y(i) = Y(i) + A(i)*X = Y(i) + Σj A(i)*X(j)
Graphs and Sparse Matrices
• Sparse matrix is a representation of a (sparse) graph
  (figure: a 6-node graph and the corresponding 6×6 sparse matrix, with a 1 marking each edge)
• Matrix entries are edge weights
• Diagonal contains self-edges (usually nonzero)
• Matrix-vector multiplication:
  • The vector is spread over the nodes
  • Each node's new value is computed by multiplying incoming edge weights by the corresponding source-node values and summing
Data Structure for Sparse Matrix
• Matvec (matrix-vector multiplication) will:
  • Access rows of the matrix serially
    → use an array per row of column indices and values*
  • Access the destination vector only once per element
    → its layout doesn't matter
  • Access the source vector once per row, touching only nonzeros
• The only opportunity for memory-hierarchy improvement is in the source vector
* Layout by rows may not be best – more on this later
Compressed Sparse Row Format
• Compressed Sparse Row (CSR), also called Compressed Row Storage (CRS), is:
  • General – it makes no assumptions about the matrix structure
  • Relatively efficient – rows are sorted, and there is only one index per nonzero
Performance Challenge
Consider a 167 MHz UltraSPARC I:
• Dense matrix operations
  • Naive matrix-vector multiplication: 38 Mflop/s
  • Vendor-optimized matrix-vector multiplication: 57 Mflop/s
  • Vendor-optimized matrix-matrix multiplication: 185 Mflop/s
  → Reason: matrix-vector multiplication has a low flop-to-memory ratio (2)
• Sparse matrix operations
  • Naive matrix-vector multiplication of a dense matrix stored in sparse (Compressed Sparse Row) format: 25 Mflop/s
  • As low as 5.7 Mflop/s for some matrices
  → Reason: indexing overhead and irregular memory access patterns
Serial Optimizations
• Several optimizations, depending on matrix structure
  • Register blocking: Sparsity and PETSc
  • Cache blocking: Sparsity
  • Symmetry: BeBOP and a project at CMU
  • Reordering: Toledo ('97) and Oliker et al. ('00) do this
  • Exploit bands and other structures: the sparse compiler project by Bik ('96) does this
Register Blocking Optimization
• Identify small dense blocks of nonzeros
• Fill in extra zeros to complete the blocks
• Use an optimized multiplication code for the particular block size:
  • Reuses source elements
  • Unrolls the loop
  • Saves index memory
• Improves register reuse and lowers indexing overhead
• Computes extra flops for the added zeros
  (figure: a sparse matrix and its 2×2 register-blocked version, with explicit zeros filled in to complete the blocks)
Register Blocking: Blocked Row Format
• Blocked Compressed Sparse Row Format
• Example (2×2 blocks):

      A = [ A00  A01  0    0    A04  0
            A10  A11  0    0    0    A15
            0    0    A22  0    0    A25
            0    0    A32  A33  A34  A35 ]

      block row pointers:    0 2 4
      block column indices:  0 4 2 4
      values (blocks stored row-major):
          A00 A01 A10 A11 | A04 0 0 A15 | A22 0 A32 A33 | 0 A25 A34 A35
• Advantages of the format
• Better temporal locality in registers
• The multiplication loop can be unrolled for better performance
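A sketch of the corresponding 2×2 blocked multiply, assuming an even number of rows and the illustrative array names below; the two destination values of a block row stay in registers, and each block's two source values are reused across both rows:

    /* 2x2 BCSR matvec sketch: y = y + A*x.
     * Block row I covers rows 2I and 2I+1; b_col_idx[K] is the first column of block K;
     * each block's 4 values are stored row-major in val[4K .. 4K+3]. */
    void spmv_bcsr_2x2(int n_block_rows, const int *b_row_ptr, const int *b_col_idx,
                       const double *val, const double *x, double *y)
    {
        for (int I = 0; I < n_block_rows; I++) {
            double y0 = y[2 * I], y1 = y[2 * I + 1];   /* kept in registers */
            for (int K = b_row_ptr[I]; K < b_row_ptr[I + 1]; K++) {
                int j = b_col_idx[K];
                double x0 = x[j], x1 = x[j + 1];       /* reused for both rows */
                const double *a = &val[4 * K];
                y0 += a[0] * x0 + a[1] * x1;
                y1 += a[2] * x0 + a[3] * x1;
            }
            y[2 * I]     = y0;
            y[2 * I + 1] = y1;
        }
    }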
Register Blocking : Fill Overhead
• A uniform block size is used, which adds fill overhead (stored entries after fill divided by original nonzeros).
  For example, fill overhead = 12/7 ≈ 1.71.
• This increases both storage and the number of floating-point operations.
Performance Model for Register Blocking
• Estimate the performance of register blocking from two quantities:
  • Estimated raw performance: Mflop/s of a dense matrix in sparse b×b blocked format on the machine of choice (machine-dependent)
  • Estimated fill overhead: for the particular matrix when filled into b×b blocks (matrix-dependent)
• Maximize over b:
      estimated raw performance / estimated fill overhead
• Use sampling to further reduce tuning time; the row and column block dimensions are chosen separately
Register Blocking: Profile of Ultra I
• Dense Matrix profile on an UltraSPARC I
Register Blocking: Profile of MIPS R10K
• Performance of dense matrix in sparse blocked format on MIPS R10K
Register Blocking: Profile of P III
• Performance of dense matrix in sparse blocked format on P III
Register Blocking: Profile of P4
• Performance of dense matrix in sparse blocked format on P4
Benchmark matrices
• Matrix 1: dense matrix (1000 x 1000)
• Matrices 2-17: Finite Element Method matrices
• Matrices 18-39: matrices from structural engineering and device simulation
• Matrices 40-44: linear programming matrices
• Matrix 45: document retrieval matrix used for Latent Semantic Indexing
• Matrix 46: random matrix (10000 x 10000, 0.15% nonzero)
Register Blocking : Performance
• The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).
• A single search step (comparing blocked and unblocked) raises the numbers below the line to 1.
Register Blocking : Performance
• Speedup is generally best on the MIPS R10000, which is competitive with the dense BLAS performance (DGEMV/DGEMM = 0.38).
Register Blocking on P4 for FEM/fluids matrix 1
Register Blocking on P4 for FEM/fluids matrix 2
Register Blocking: Model Validation
Register Blocking : Overhead
• Pre-computation overhead:
  • Estimating the fill overhead (red bars)
  • Reorganizing the matrix (yellow bars)
• The ratio gives the number of matrix-vector multiplications needed for the optimization to pay off.
Cache Blocking Optimization
• Keep part of the source vector in cache
  (figure: y = A*x with the sparse matrix A split into blocks, each reusing a cached portion of the source vector x)
• Improves cache reuse of the source vector
• Challenge: choosing a block size
Cache Blocking
• Exploits temporal locality of access to the source vector
• Makes sense when x does not fit in cache
• Rectangular matrices, in particular
  (figure: a wide rectangular sparse matrix with the source vector x and destination vector y as laid out in memory)
Cache Blocking: Speedup
• The MIPS speedup is generally better: larger cache, larger miss penalty (26/589 ns for MIPS vs. 36/268 ns for the Ultra).
• The document retrieval and random matrices are exceptions: they perform very poorly in the baseline code.
Cache Blocking: Web Document Matrix
  (chart: raw vs. cache-blocked performance on the R10K, Ultra, P-III, and 21164)
• This matrix seems to be an exception: cache blocking helps.
• Otherwise, cache blocking (without other optimizations) helps only on a random matrix of similar shape and size.
Cache Blocking: Searching for Block Size
Combination of Register and Cache blocking
• The combination is rarely beneficial, and often slower than either optimization alone.
Combination of Register and Cache blocking
Cache Blocking: Summary
• Rarely useful on matrices from real problems.
• The matrix needs to be:
  • Random in structure
  • Rectangular
• A reasonable block size can be determined through a coarse-grained search.
• The combination of register and cache blocking is not useful:
  • As far as we can tell, they work on disjoint sets of matrices.
Exploiting Symmetry
• Many matrices are symmetric (undirected graphs)
• CG, for example, only works on symmetric matrices
• Symmetric storage: only store and access half of the matrix
• Update y twice for each stored element of A:
  • y[i] += A[i,j]*x[j]
  • y[j] += A[i,j]*x[i]
• Saves up to half of the memory traffic (to A)
• Works for sparse formats
Symmetry: Speedups
Multiple Vector Optimization
• Better potential for reuse
• Loop-unrolled code that multiplies across the vectors is produced by a code generator
• For each matrix element a_ij:
      y_i1 += a_ij * x_j1
      y_i2 += a_ij * x_j2
• Allows reuse of matrix elements
• Must choose the number of vectors for loop unrolling
Multiple Vector Multiplication
• Better chance of optimization: analogous to BLAS2 vs. BLAS3
  (figure: repeated single-vector multiplies vs. one multiple-vector multiply)
• Blocked algorithms (e.g., blocked CG) may use this
Multiple Vector Effect on Synthetic Matrices
  (plots: multiple-vector effect for a matrix with a random sparsity pattern and for a dense matrix in sparse format)
Multiple Vectors with Register Blocking
• The speedup is larger than for single-vector register blocking.
• Even the matrices that did not speed up before improved (the middle group on the UltraSPARC).
Multiple Vectors with Cache Blocking
  (plots: MIPS and UltraSPARC)
• Noticeable speedup for matrices that did not speed up with cache blocking alone (UltraSPARC).
• The best block sizes are much smaller than for single-vector cache blocking.
Sparsity System Organization
• Guides the choice of optimization
• Automatically selects optimization parameters such as block size and number of vectors
• http://www.cs.berkeley.edu/~yelick/sparsity
• The released version does not include the symmetry optimization
  (diagram: the Sparsity machine profiler produces a machine profile; the Sparsity optimizer takes a representative matrix, the machine profile, and the maximum number of vectors, and produces an optimized format and code)
Summary of Serial Optimizations
• Sparse matrix optimization has the same problems as dense, plus the nonzero pattern
• Optimization of the format and code is effective, but expensive
  • It should be done across an application, not at each matvec call or even each iterative solve
  • The optimization choice may depend on matrix characteristics rather than specifics
• Optimizing over a larger "kernel" is also important
• Heuristics plus search can be used to select optimization parameters
• Future work:
  • Different code organizations and instruction mixes
  • A different view of A: A = A1 + A2
  • Different matrix orderings
Tuning other sparse operations
• Symmetric matrix-vector multiply A*x
• Solve a triangular system of equations: T^-1*x
• A^T*A*x
  • Kernel of information retrieval via LSI
  • Same number of memory references as A*x
• A^2*x, A^k*x
  • Kernel of the information retrieval used by Google
  • Changes the calling algorithm
• A^T*M*A
  • Matrix triple product
  • Used in multigrid solvers
• …
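For A^T*A*x, the reuse comes from handling one row of A at a time: a CSR sketch (illustrative names; A is n-by-m, y has length m and must be zeroed by the caller) in which each row is read once and used for both the dot product and the update:

    /* Sketch of y = A^T*(A*x) with A in CSR: for each row i, form
     * t = A(i,:) . x, then accumulate y += t * A(i,:)^T while the row is
     * still in cache, so memory traffic matches a single A*x. */
    void ata_x_csr(int n, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double t = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                t += val[k] * x[col_idx[k]];          /* t_i = A(i,:) . x */
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                y[col_idx[k]] += val[k] * t;          /* y += t_i * A(i,:)^T */
        }
    }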
Symmetric Sparse Matrix-Vector Multiply
P4 using icc
Sparse Triangular Solve
P4 using icc
Parallel Sparse Matrix-Vector Multiplication
Parallel Sparse Matrix-Vector Product
• y = y + A*x, where A is a sparse matrix
• Layout: 1D by rows
  (figure: A partitioned into block rows owned by P0–P3, with x and y partitioned conformally)
• Algorithm: for each processor j
  • Broadcast X(j)
  • Compute A(p)*x(j)
• The same algorithm works, but:
  • It may be load-imbalanced
  • It communicates more of X than needed
Optimization Opportunities
• Send only the necessary parts of x
• Overlap communication and computation
• Load balance
• Take advantage of symmetry
  • Trickier than the serial case, because of multiple accesses to Y
• Take advantage of blocked structure
  • For serial performance, using FEM matrices or similar
• Take advantage of bands
  • Store only the diagonal bands with their starting coordinates
  • Reduces index storage and access; running time?
• Reorder your matrix
• Use blocked CG
  • Better serial performance; reduces the latency (α) term but not the bandwidth (β) term in communication
Following slides based on:
Ordering for Efficiency of Sparse CG
[X. Li, L. Oliker, G. Heber, R. Biswas]
Partitioning and Ordering Schemes
• RCM, METIS
• Self-Avoiding Walk (SAW) [Heber/Biswas/Gao ‘99]
• A SAW over a triangular mesh is an enumeration of the
triangles such that two consecutive triangles in the SAW share
an edge, or a vertex (there are no jumps).
• It is a mesh-based linearization technique with similar
application areas as space-filling curves.
• Construction time is linear in the number of triangles.
• Amenable to hierarchical coarsening and refinement, and can
be easily parallelized.
Experiment
• Mesh generation
  • 2D Delaunay triangulation, generated by Triangle [Shewchuk '96]
  • The mesh is shaped like the letter "A": 661,054 vertices, 1,313,099 triangles
• Matrix generation
  • For each vertex pair (vi, vj) with distance <= 3, we assign a random value in (0, 1) to A(i, j)
  • The diagonal is chosen to make the matrix diagonally dominant
  • About 39 nonzeros per row, 25,753,034 nonzeros in total
• Implementations
  • The MPI version uses Aztec
  • The Origin 2000 and Tera MTA versions use compiler directives
Sparse CG Timings
Sparse Matvec Cache Misses
Sparse Matvec Communication Volume
Observations
• On the T3E and Origin, ordering methods significantly improve CG performance.
• On the T3E, matrix-vector efficiency depends more on cache reuse than on communication reduction.
  • RCM, SAW, METIS: >= 93% of total time spent servicing cache misses
  • ORIG: 72-80% of total time spent servicing cache misses
• On the Origin, SMP directives can be used, but data distribution is required.
• On the Tera MTA, no ordering or data distribution is required; the program is the simplest.
Summary
• Lots of opportunities for optimization
• Data structures
• Local performance
• Communication performance
• Numerical techniques (preconditioning)
• Take advantage of special structures
• OK to focus on class of matrices
Graph Partitioning Strategy: MeTiS
• Most popular class of multilevel partitioners
• Objectives
  • Balance the computational workload
  • Minimize edge cut (interprocessor communication)
• Algorithm
  • Collapses vertices & edges using a heavy-edge matching scheme
  • Applies a greedy partitioning algorithm to the coarsest graph (initial partitioning)
  • Uncoarsens it back using greedy graph growing + Kernighan-Lin refinement
Strategy: Reverse Cuthill-McKee (RCM)
• Matrix bandwidth (profile) has significant impact on
efficiency of linear systems & eigensolvers
• Geometry-based algorithm that generates a permutation
so that non-zero entries are close to diagonal
• Good preordering for LU or Cholesky factorization
(reduces fill)
• Improves cache performance (but does not explicitly
reduce edge cut)
• Can be used as partitioner by assigning segments to
processors
• Starting from vertex of min degree, generates level
structure by BFS & orders vertices by decreasing
distance from original vertex
Strategy: Self-Avoiding Walks (SAW)
• Mesh-based technique similar to space-filling curves
• Two consecutive triangles in walk share edge or vertex (no
jumps)
• Visits each triangle exactly once, entering/exiting over
edge/vertex
• Improves parallel efficiency related to locality (cache
reuse) & load balancing, but does not explicitly reduce
edge cuts
• Amenable to hierarchical coarsening & refinement
Biswas, Oliker, Li, Concurrency: Practice & Experience, 12 (2000) 85-109
Test Problem
• 2D Delaunay triangulation of letter “A” shaped mesh
generated using Triangle package (661,054 vertices &
1,313,099 triangles)
• Underlying matrix assembled by assigning random value
in (0,1) to each (i,j) entry corresponding to vertex pair
(vi,vj) if dist(vi,vj) < 4 (other entries set to 0)
• Diagonal entries set to 40 (to make matrix positive
definite)
• Final matrix has approx 39 entries/row and 25,753,034
nonzeros
• CG converges in 13 iterations; SPMV accounts for 87%
of flops
Distributed-Memory Implementation
• Each processor has local memory that only it can
directly access; message passing required to access
memory of another processor
• User decides how to distribute data and organize comm
structure
• Allows efficient code design at the cost of higher complexity
• Parallel CG calls Aztec sparse linear library
• Matrix A partitioned into blocks of rows; each block to a
processor
• Associated component vectors (x, b) distributed accordingly
• Comm needed to transfer some components of x
• Data from
• T3E (450 MHz Alpha processor, 900 Mflops peak, 96 KB
secondary cache, 3D torus interconnect)
Distributed-Memory Implementation
  (table: average cache misses (×10^6) and average communication volume (×10^6 bytes) on the T3E for P = 8, 16, 32, 64 under the ORIG, MeTiS, RCM, and SAW orderings)
• The ORIG ordering has a large edge cut (high interprocessor communication) and poor locality (a high number of cache misses)
• MeTiS minimizes the edge cut, while SAW minimizes cache misses
Distributed Shared-Memory System
• Origin2000 (SMP of nodes, each with dual 250 MHz
R10000 processor & local memory)
• Hardware makes all memory equally accessible from software
perspective by sending memory requests through routers on
nodes
• Memory access time non-uniform (depends on # hops)
• Hypercube interconnect (maximum log(p) hops)
• Each processor has 4 MB cache where only it can
fetch/store data
• If processor accesses data not in cache, delay while copy
fetched
• When processor modifies word, all other copies of that cache
line invalidated
Distributed Shared-Memory Implementation
• Use OpenMP-style directives to parallelize loops
• requires less effort than MPI
• Two implementation approaches taken
• SHMEM: assume that Origin2000 has flat shared-memory
(arrays not explicitly distributed, non-local data handled by
cache coherence)
• CC-NUMA: consider underlying distributed shared-memory by
explicit data distribution
• Computational kernels are identical for SHMEM and CC-NUMA
• Each processor assigned equal # rows in matrix (block)
• No explicit synchronization required since no concurrent
writes
• Global reduction required for DOT operation
Distributed Shared-Memory Implementation
Runtimes (by programming model and ordering):

   P      SHMEM                        CC-NUMA                      MPI
          ORIG     RCM      SAW        ORIG     RCM      SAW        SAW
   1      46.91    37.18    36.79      46.91    37.18    36.79      -
   4      30.64    25.35    24.75      17.61    10.65    10.59      7.880
  16      16.35    15.52    15.55      6.205    2.845    2.872      1.926
  64      10.81    7.782    8.450      2.365    0.885    0.848      0.905
• CC-NUMA performs significantly better than SHMEM
• RCM & SAW reduce runtimes compared to ORIG (more
for CC-NUMA)
• Little difference between RCM & SAW, probably due to
large cache
• CC-NUMA (with ordering) & MPI runtimes comparable,
even though programming methodologies quite different
Tera MTA Multithreaded Architecture
• 255 MHz Tera uses multithreading to hide latency (100-150 cycles
per word) & keep processors saturated with work
• No data cache
• Randomized memory mapping (data layout impossible)
• Near uniform data access from any processor to any memory location
• Each processor has ~100 hardware streams (32 registers & PC)
• Context switch on each cycle, choose next instruction from ready
streams
• A stream can execute an instruction only once every 21 cycles, even if
no instructions reference memory
• Synchronization between threads accomplished using full/empty
bits in memory, allowing fine-grained threads
• No explicit load balancing required since dynamic scheduling of
work to threads can keep processor saturated
• No difference between uni- and multiprocessor parallelism
Tera MTA Implementation
• Multithreading implementation trivial (only require
compiler directives)
• Special assertions used to indicate no loop-carried
dependencies
• Compiler then able to parallelize loop segments
• Load balancing handled by OS (dynamically assigns
rows of matrix to threads)
• Other than reduction for DOT, no special synchronization
constructs required (no possible race conditions in CG)
• No special ordering required (or possible) to achieve
good parallel performance
Tera MTA Results
   P      ORIG                  SAW
          SPMV      CG          CG
   1      0.378     9.74        9.86
   2      0.189     5.01        5.02
   4      0.095     2.64        2.53
   8      0.051     1.36        1.35
• Both SPMV and CG show high scalability (over 90%) with 60
streams per processor
• Sufficient TLP to tolerate high overhead of memory access
• 8-proc Tera faster than 32-proc Origin2000 & 16-proc T3E with no
partitioning / ordering (but will scaling continue beyond 8 procs?)
• Adaptivity = No extra work required to maintain performance