Performance Study of Domain Decomposed Parallel Matrix-Vector Multiplication Programs
Yana Kortsarts
Jeff Rufinus
Widener University
Computer Science Department
Introduction
 Matrix computation remains one of the most active areas of parallel numerical computing research.
 Many serial and parallel matrix computation algorithms are discussed in the literature, but discussions of the parallel matrix-vector multiplication (MVM) algorithm are rare, except perhaps in some parallel computing textbooks.
 The MVM problem is simple but computationally very rich.
 The concepts of domain decomposition and communication between processors are inherent in MVM.
 MVM is an excellent example for exploring and learning the concept of parallelism.
 MVM has applications throughout science and engineering.
The Problem and Sequential Algorithm
 A is a dense square matrix of size N x N.
 b is a dense vector of size N x 1.
 The problem is to compute the product A · b = c.
 MVM can be viewed as a series of inner-product operations.
 The sequential version of the MVM algorithm has complexity Θ(N²).
A 3 x 3 example:

[  2  3  1 ]   [ 1 ]   [ 11 ]
[ -3  1  4 ] x [ 2 ] = [ 11 ]
[  1 -2  3 ]   [ 3 ]   [  6 ]
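For concreteness, a minimal C sketch of the sequential algorithm (row-major storage; the function name is ours, not taken from the benchmarked programs):

/* Sequential MVM, c = A * b: one inner product per row of A.
   A is n x n in row-major order; the two nested loops give the
   Theta(N^2) complexity noted above. */
void matvec_seq(int n, const double *A, const double *b, double *c)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;               /* inner product of row i with b */
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * b[j];
        c[i] = sum;
    }
}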
Domain Decomposition
 There are three straightforward ways to decompose an N x N dense matrix:
 Row-wise decomposition method (A)
 Column-wise decomposition method (B)
 Block-wise decomposition method (C)
Implementation
 The three algorithms were implemented in C using the MPI library.
 Most of the code was adapted from M. J. Quinn, Parallel Programming in C with MPI and OpenMP.
 The three programs were benchmarked on a cluster of computers at the University of Oklahoma, using different input sizes and different numbers of processors.
 Two dense matrices of sizes 1,000 x 1,000 and 10,000 x 10,000, with their respective vectors of sizes 1,000 x 1 and 10,000 x 1, were used as inputs.
 After many runs were performed, the average total run-times were calculated.
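As an illustration of how such average run-times are commonly collected with MPI, a hedged sketch (the repetition count and all names are ours; the study does not describe its timing harness):

#include <mpi.h>

#define RUNS 10   /* repetition count: our assumption, not from the study */

/* Average the wall-clock time of one matrix-vector multiply over
   several runs.  Barriers make all processes start together, so
   MPI_Wtime measures the slowest process. */
double average_runtime(void (*matvec)(void), MPI_Comm comm)
{
    double total = 0.0;
    for (int r = 0; r < RUNS; r++) {
        MPI_Barrier(comm);
        double t0 = MPI_Wtime();
        matvec();                      /* one full multiplication */
        MPI_Barrier(comm);
        total += MPI_Wtime() - t0;
    }
    return total / RUNS;
}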
Row-wise Decomposition Algorithm
M. J. Quinn, Parallel Programming in C with MPI and OpenMP
 Row-wise decomposition of the matrix, with replicated vectors b and c.
 Primitive task i has row i of A and a copy of vector b.
 After computing the inner product of row i with b, task i has element i of vector c.
 An all-gather step communicates each task's element of c to all other tasks, and the algorithm terminates.
 Mapping strategy: agglomerate primitive tasks associated with contiguous groups of rows and assign each of these combined tasks to a single process.
Example
8 x 8 matrix and 3 processors
[Figure: rows A1-A8 of A, with the matching elements b1-b8 of b and c1-c8 of c, divided into contiguous groups among the 3 processes]
Row-wise Decomposition Algorithm
M. J. Quinn, Parallel Programming in C with MPI and OpenMP
 After each process performs its portion of the multiplication, it has produced one block of the result vector c.
 The block-distributed vector c must then be transformed into a replicated vector.
 An all-gather communication concatenates the blocks of a vector distributed among a group of processes and copies the resulting whole vector to all the processes, as in the sketch below.
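A hedged C/MPI sketch of this algorithm (adapted in spirit from Quinn's row-wise program; the function name and the counts/displs arrays, holding each process's row count and row offset, are our notation):

#include <mpi.h>

/* Row-wise MVM: each process owns a contiguous block of rows of A
   (row-major) and a full copy of b.  Local inner products produce a
   block of c; MPI_Allgatherv then replicates the whole c everywhere. */
void matvec_rowwise(int n, int local_rows, const double *A_local,
                    const double *b, double *c, /* length n, replicated */
                    int *counts, int *displs, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Inner product of each owned row with b, written into our block of c. */
    for (int i = 0; i < local_rows; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[i * n + j] * b[j];
        c[displs[rank] + i] = sum;
    }

    /* All-gather step: concatenate every block of c and give the
       resulting whole vector to all processes. */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   c, counts, displs, MPI_DOUBLE, comm);
}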
Column-wise Decomposition Algorithm
M. J. Quinn, Parallel Programming in C with MPI and OpenMP
 Column-wise decomposition of the matrix, with block-decomposed vectors.
 Primitive task i has column i of A and element i of vector b.
 Multiplying column i by element bi produces a vector of partial results.
 An all-to-all communication is required to transfer the partial results between tasks.
 Mapping strategy: agglomerate adjacent columns.
Column-wise Decomposition Algorithm
[Figure: 4 x 4 example of the column-wise decomposition; Proc i holds column i of A and element bi and computes the partial products a1i*bi, ..., a4i*bi, so the terms of each ci are spread across all four processes]
All-to-all exchange: after performing its N multiplications, each task needs to distribute the N - 1 results it does not need to the other tasks and to collect the N - 1 results it does need from them. After the all-to-all exchange, primitive task i adds the N elements now in its possession to produce ci. A sketch of this step, in its agglomerated form, follows.
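A hedged C/MPI sketch of the agglomerated version (n/p adjacent columns per process, with n assumed divisible by p; all names are ours):

#include <mpi.h>
#include <stdlib.h>

/* Column-wise MVM: each process owns blk = n/p columns of A (stored
   column-major) and the matching blk entries of b.  Scaling its
   columns by its b entries yields a full-length vector of partial
   sums; MPI_Alltoall routes each length-blk segment to its owner,
   which adds the p incoming segments to finish its block of c. */
void matvec_colwise(int n, int p, const double *A_local,
                    const double *b_local, double *c_local, MPI_Comm comm)
{
    int blk = n / p;
    double *partial = calloc(n, sizeof(double));
    double *recv = malloc(n * sizeof(double));

    for (int j = 0; j < blk; j++)          /* partial += b_j * (column j) */
        for (int i = 0; i < n; i++)
            partial[i] += A_local[j * n + i] * b_local[j];

    /* All-to-all exchange: segment k of `partial` belongs to process k. */
    MPI_Alltoall(partial, blk, MPI_DOUBLE, recv, blk, MPI_DOUBLE, comm);

    /* Sum the p received segments to produce this process's block of c. */
    for (int i = 0; i < blk; i++) {
        c_local[i] = 0.0;
        for (int k = 0; k < p; k++)
            c_local[i] += recv[k * blk + i];
    }
    free(partial);
    free(recv);
}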
Block-wise Decomposition Algorithm
M. J. Quinn, Parallel Programming in C with MPI and OpenMP
 A primitive task is associated with each element of A.
 Each primitive task multiplies its element aij by bj.
 Mapping strategy: agglomerate primitive tasks into rectangular blocks.
 Vector b is distributed by blocks among the processes.
 Each task performs a matrix-vector multiplication with its block of A and b.
 Tasks in each row perform a sum-reduction on their portion of c.
 After the sum-reduction, the result vector c is distributed by blocks among the tasks.
Block-wise Decomposition Algorithm
[Figure: the same 4 x 4 example under block-wise decomposition; b is distributed by blocks, each process multiplies its rectangular block of A by its piece of b, and tasks in the same row combine their partial sums. A communication sketch follows.]
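A hedged sketch of the sum-reduction step in C/MPI (we assume a square grid of p processes addressed in row-major rank order; all names are ours):

#include <mpi.h>

/* Block-wise MVM, communication step: processes form a
   grid_dim x grid_dim grid, each holding partial results for its
   block row of c (blk entries).  Tasks in the same grid row
   sum-reduce their partials; the first process of each row ends up
   with that row's finished block of c. */
void row_sum_reduce(int grid_dim, int blk, const double *partial,
                    double *c_block, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    int row = rank / grid_dim;        /* grid coordinates, row-major ranks */
    int col = rank % grid_dim;

    MPI_Comm row_comm;                /* one communicator per grid row */
    MPI_Comm_split(comm, row, col, &row_comm);

    /* Sum partial results across the row onto grid column 0. */
    MPI_Reduce(partial, c_block, blk, MPI_DOUBLE, MPI_SUM, 0, row_comm);
    MPI_Comm_free(&row_comm);
}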
Scalability of a Parallel System
M. J. Quinn, Parallel Programming in C with MPI and OpenMP
 Parallel system: a parallel program executing on a parallel computer.
 Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases.
 Parallel overhead increases as the number of processors increases.
 The way to maintain efficiency when increasing the number of processors is to increase the size of the problem.
 The maximum problem size is limited by the amount of primary memory that is available.
 A scalable system maintains efficiency as processors are added.
Scalability Function
 Row-wise and column-wise decomposition algorithms:
Scalability function = Θ(p)
 To maintain constant efficiency, memory utilization must grow linearly with the number of processors. These algorithms are not highly scalable.
 Block-wise decomposition algorithm:
Scalability function = Θ(log² p)
 This parallel algorithm is more scalable than the other two algorithms.
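These two functions follow from Quinn's isoefficiency analysis; a brief sketch (C is a generic constant, and the notation is ours). With $M(n) = n^2$ the memory needed for problem size $n$ and $n \ge f(p)$ the isoefficiency relation, the scalability function is $M(f(p))/p$, the per-processor memory needed to hold efficiency constant:

\[
\text{row-/column-wise: } f(p) = Cp \;\Rightarrow\; \frac{M(Cp)}{p} = \frac{C^2 p^2}{p} = \Theta(p),
\]
\[
\text{block-wise: } f(p) = C\sqrt{p}\,\log p \;\Rightarrow\; \frac{M(C\sqrt{p}\,\log p)}{p} = \frac{C^2 p \log^2 p}{p} = \Theta(\log^2 p).
\]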
Results
 First, we benchmarked the performance of the row-wise and column-wise decomposition methods using a small number of processors.
 In this case, we calculated the average run-times of the row-wise and column-wise decomposition methods for N = 1,000 and N = 10,000. The plot of the speed-up calculations vs. the number of processors (p = 1, 2, 4, 6, 8, 10, 12, 14, 16) is shown on the next slide.
Results
 From these benchmarking results we conclude:
 the speed-up and the performance increase as the size of the matrix is increased from 1,000 to 10,000;
 the two decomposition algorithms tend to have the same speed-up at N = 10,000;
 in the case of small input (N = 1,000), the row-wise decomposition method performs slightly better than the column-wise decomposition method, probably due to the greater inter-processor communication in the latter method.
Results
 We extended our performance study to include 36 and 64 processors.
 The next two slides show the speed-up versus the number of processors (p = 1, 4, 16, 36, 64) for the three domain-decomposed algorithms with N = 1,000 and with N = 10,000.
[Figure: speed-up versus the number of processors for matrix size N = 1,000]
[Figure: speed-up versus the number of processors for matrix size N = 10,000]
Results
 From these results we can draw the following conclusions:
 Compared to both the row-wise and column-wise decomposition methods, the block-wise decomposition method, as theoretically predicted, produces better speed-up at larger numbers of processors. This holds for both small and large matrix sizes; the scalability of the block-wise method is indeed better than that of the other two methods.
 In the case of N = 1,000, the performance of both the row-wise and column-wise methods decreases at a large number of processors. This means there is no benefit to using these two methods beyond 16 processors for small input sizes.
 In the case of N = 10,000, the performance of the row-wise method remains good for up to 64 processors. The performance of the column-wise method, however, decreases as the number of processors grows from 36 to 64.
References
1. G. H. Golub and C. F. van Loan, “Matrix Computations”, Third edition, Johns Hopkins University Press (1996).
2. A. Grama, G. Karypis, V. Kumar, and A. Gupta, “An Introduction to Parallel Computing: Design and Analysis of Algorithms”, Second edition, Addison-Wesley (2003).
3. W. Gropp et al., “The Sourcebook of Parallel Computing”, Morgan Kaufmann Publishers (2002).
4. G. E. Karniadakis and R. M. Kirby II, “Parallel Scientific Computing in C++ and MPI”, Cambridge University Press (2003).
5. M. J. Quinn, “Parallel Programming in C with MPI and OpenMP”, McGraw-Hill (2004).
6. B. Wilkinson and M. Allen, “Parallel Programming”, Second edition, Prentice Hall (2005).
7. F. M. Ham and I. Kostanic, “Principles of Neurocomputing for Science & Engineering”, McGraw-Hill (2001).
8. P. S. Pacheco, “Parallel Programming with MPI”, Morgan Kaufmann Publishers (1997).
9. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, “MPI: The Complete Reference, Volume 1: The MPI Core”, MIT Press (1998).