Matrix Multiplication - Parallel Programming Laboratory


Matrix Multiplication
The Myth, The Mystery, The Majesty
Matrix Multiplication
• Simple, yet important problem
• Many fast serial implementations exist
– Atlas
– Vendor BLAS
• Natural to want to parallelize
The Problem
• Take two matrices, A and B, and multiply them to
get a third matrix C
– C = A*B (multiplication is not commutative, so
B*A is generally a different matrix)
• C(i,j) = dot product of row i of A with column j
of B
• The “classic” implementation uses a triply
nested loop
– for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
      for (k = 0; k < n; k++)
        C[i][j] += A[i][k] * B[k][j];
Parallelizing Matrix Multiplication
• The problem can be parallelized in many
ways
– Simplest is to replicate A and B on every
processor and break up the iteration space by
rows
• for (i = lowerbound; i < upperbound; i++) …
– This method is easy to code and has good
speedups, but very poor memory scalability
Parallelizing MM
• One can refine the previous approach by
only storing the rows of A that each
processor needs
– Better, but still needs all of B
• 1-D Partitioning is simple to code, but has
very poor memory scalability
Parallelizing MM
• 2-D Partitioning
– Instead break up the matrix into blocks
• Each processor stores a block of C to compute and “owns” a
block of A and a block of B
– Now one only needs to know about the other blocks
of A in the same rows and the blocks of B in the same
columns
• One can buffer all necessary blocks or only one block of A
and B at a time with some smart tricks (Fox’s Algorithm /
Cannon’s Algorithm)
• Harder to code, but memory scalability is MUCH
better
– Bonus - if the blocks become small enough to fit in
cache, the block multiplies get good cache behavior
OpenMP
• 1-D partitioning is easy to code – just a parallel for
• Unclear what data lives where
• 2-D partitioning is possible – each processor
computes its own bounds and works on them in a
parallel section
– Again, not sure what data lives where, so no
guarantee this helps
P    Time     Speedup
1    127.97   1
2    171.47   0.74
4    123.63   1.04
1024x1024 Matrix
MPI
• 1-D is very easy to program with data replication
• 2-D is relatively simple to program as well
– Only requires two broadcasts
  • Send A block to my row
  • Send B block to my column
– Fancier algorithms replace the broadcasts with
circular shifts so that the right blocks arrive at the
right time
1-D
P    Time     Speedup
1    184.85   1
2    96.92    1.907243
4    49.2     3.757114
8    24.73    7.474727
16   12.34    14.97974
32   6.47     28.57032
64   3.13     59.05751

2-D
P    Time       Speedup
1    200.5083   1
4    47.60128   4.212246
16   4.269821   46.95942
64   0.360899   555.5801
Others
• HPF and Co-array Fortran both let you specify
data distribution
– One could conceivably have either 1-D or 2-D
versions with good performance
• Charm
– I suppose you could do either 1-D or 2-D versions,
but MM is a very regular problem, so none of the
load balancing, etc. is needed. Easier to just use MPI…
• STAPL
– Wrong paradigm.
Distributed vs Shared Memory?
• Neither is necessarily better
• The problem parallelizes better when the
distribution of data in memory is clearly defined