
On Grid-based Matrix Partitioning
for Networks of Heterogeneous
Processors
Alexey Lastovetsky
School of Computer Science and Informatics
University College Dublin
[email protected]
ISPDC 2007, Hagenberg, Austria, 5-8 July 2007
Heterogeneous parallel computing
• Heterogeneity of processors
– The processors run at different speeds
– An even distribution of computations does not balance the processors' load
» The performance is determined by the slowest processor
– Data must be distributed unevenly
» So that each processor performs a volume of computation proportional to its speed
Constant performance models of
heterogeneous processors
• The simplest performance model of heterogeneous processors
– p, the number of processors
– S = {s1, s2, ..., sp}, the speeds of the processors (positive constants)
• The speed
– Absolute: the number of computational units performed by the processor per time unit
– Relative: normalized so that Σi=1..p si = 1
– Some use the execution time instead: ti = 1/si
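As a quick illustration (a sketch of mine, not from the slides; the function names are hypothetical), relative speeds are obtained by normalizing absolute speeds, and a data distribution proportional to speed follows directly:

```python
def relative_speeds(absolute):
    """Normalize absolute speeds so they sum to 1."""
    total = sum(absolute)
    return [s / total for s in absolute]

def proportional_distribution(n, speeds):
    """Distribute n computational units so processor i gets ~ n * s_i units."""
    rel = relative_speeds(speeds)
    counts = [int(n * s) for s in rel]
    # hand out any rounding remainder to the fastest processors first
    remainder = n - sum(counts)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:remainder]:
        counts[i] += 1
    return counts
```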
Data distribution problems with constant
models of heterogeneous processors
• Typical design of heterogeneous parallel algorithms
– Problem of distributing computations in proportion to the speed of processors
» Problem of partitioning some mathematical objects
• Sets, matrices, graphs, geometric figures, etc.
Partitioning matrices with constant models
of heterogeneous processors
• Matrices
– The most widely used mathematical objects in scientific computing
» Studied partitioning problems mainly deal with matrices
» Matrix partitioning in one dimension over a 1D arrangement of processors
• Often reduced to partitioning sets or well-ordered sets
» The design of algorithms often results in matrix partitioning problems that do not impose the restriction of partitioning in one dimension
• E.g., in parallel linear algebra for heterogeneous platforms
• We will use matrix multiplication
– A simple but very important linear algebra kernel
Partitioning matrices with constant models
of heterogeneous processors (ctd)
• A heterogeneous matrix multiplication algorithm
– A modification of some homogeneous one
» Most often, of the 2D block cyclic ScaLAPACK algorithm
[Figure: 2D block cyclic distribution of a 24×24 block matrix over a 3×4 grid of processors P11–P34]
Partitioning matrices with constant models
of heterogeneous processors (ctd)
• 2D block cyclic ScaLAPACK MM algorithm (ctd)
[Figure: at step k, the pivot column of blocks of A and the pivot row of blocks of B update every block of C: cij = cij + aik × bkj]
Partitioning matrices with constant models
of heterogeneous processors (ctd)
• 2D block cyclic ScaLAPACK MM algorithm (ctd)
– The matrices are identically partitioned into rectangular generalized blocks of size (p×r)×(q×r)
» Each generalized block forms a 2D p×q grid of r×r blocks
» There is a one-to-one mapping between this grid of blocks and the p×q processor grid
– At each step of the algorithm
» Each processor not owning the pivot row and column receives horizontally (n/p)×r elements of matrix A and vertically (n/q)×r elements of matrix B
» => in total, (n/p + n/q)×r elements, i.e., proportional to the half-perimeter of the rectangle area allocated to the processor
Partitioning matrices with constant models
of heterogeneous processors (ctd)
• General design of heterogeneous modifications
– Matrices A, B, and C are identically partitioned into equal rectangular generalized blocks
– The generalized blocks are identically partitioned into rectangles so that
» There is a one-to-one mapping between the rectangles and the processors
» The area of each rectangle is (approximately) proportional to the speed of the processor that owns the rectangle
– Then, the algorithm follows the steps of its homogeneous prototype
Partitioning matrices with constant models
of heterogeneous processors (ctd)
[Figure: a generalized block partitioned into rectangles among nine processors P1–P9 in proportion to their speeds; the same partitioning is repeated in every generalized block]
Partitioning matrices with constant models
of heterogeneous processors (ctd)
• Why partition the GBs in proportion to the speed
– At each step, updating one r×r block of matrix C needs the same amount of computation for all the blocks
– => the load will be perfectly balanced if the number of blocks updated by each processor is proportional to its speed
– This number = ni×NGB, where NGB is the number of generalized blocks and ni is the area of the GB partition allocated to the i-th processor (measured in r×r blocks)
– => if the area of each GB partition is proportional to the speed of the owning processor, the load will be perfectly balanced
Partitioning matrices with constant models
of heterogeneous processors (ctd)
• A generalized block from the partitioning point of view
– An integer-valued rectangle
– If we need an asymptotically optimal solution, the problem can be reduced to the geometrical problem of optimally partitioning a real-valued rectangle
» the asymptotically optimal integer-valued solution can be obtained by rounding off an optimal real-valued solution of the geometrical partitioning problem
Geometrical partitioning problem
• The general geometrical partitioning problem
– Given a set of p processors P1, P2, ..., Pp, the relative speed of each of which is characterized by a positive constant si (Σi=1..p si = 1),
– Partition a unit square into p rectangles so that
» There is a one-to-one mapping between the rectangles and the processors
» The area of the rectangle allocated to processor Pi is equal to si
» The partitioning minimizes Σi=1..p (wi + hi), where wi is the width and hi is the height of the rectangle allocated to processor Pi
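For concreteness (a sketch of mine, not from the slides), the objective function and the area constraint for a candidate partitioning can be checked as follows:

```python
def sum_half_perimeters(rects):
    """Objective: sum of (w_i + h_i) over the processors' rectangles.
    rects is a list of (width, height) pairs."""
    return sum(w + h for w, h in rects)

def is_valid_partitioning(rects, speeds, tol=1e-9):
    """Constraint: each rectangle's area equals the processor's relative
    speed (the speeds sum to 1, so the areas tile the unit square)."""
    return all(abs(w * h - s) < tol for (w, h), s in zip(rects, speeds))

# Two processors of equal speed: split the unit square into two columns.
rects = [(0.5, 1.0), (0.5, 1.0)]
```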
Geometrical partitioning problem (ctd)
• Motivation behind the formulation
– Proportionality of the areas to the speeds
» Balances the load of the processors
– Minimization of the sum of half-perimeters
» Multiple partitionings can balance the load
» Minimizes the total volume of communications
• At each step of MM, each receiving processor receives data proportional to the half-perimeter of its rectangle
• => In total, the communicated data is proportional to Σi=1..p (wi + hi) − 2
Geometrical partitioning problem (ctd)
• Motivation behind the formulation (ctd)
– Option: minimizing the maximal half-perimeter
» Parallel communications
– The use of a unit square instead of a rectangle
» No loss of generality
» the optimal solution for an arbitrary rectangle is obtained by straightforward scaling of that for the unit square
• Proposition. The general geometrical partitioning problem is NP-complete.
Restricted geometrical partitioning
problems
• Restricted problems having polynomial solutions
– Column-based
– Grid-based
• Column-based partitioning
– Rectangles make up columns
– Has an optimal solution of complexity O(p^3)
Column-based partitioning problem
[Figure: a column-based partitioning of the unit square among twelve processors P1–P12 arranged in columns]
Column-based partitioning problem (ctd)
• A more restricted form of the column-based partitioning problem
– The processors are already arranged into a set of columns
• Algorithm 1: Optimal partitioning of a unit square between p heterogeneous processors arranged into c columns, each of which is made of rj processors, j=1,…,c:
– Let the relative speed of the i-th processor from the j-th column, Pij, be sij.
– Then, we first partition the unit square into c vertical rectangular slices such that the width of the j-th slice is wj = Σi=1..rj sij
Column-based partitioning problem (ctd)
• Algorithm 1 (ctd):
– Second, each vertical slice is partitioned independently into rectangles in proportion to the speeds of the processors in the corresponding processor column.
• Algorithm 1 is of linear complexity.
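Algorithm 1 can be sketched in a few lines (my own code, assuming the relative speeds are given per column and sum to 1 over all processors): each column's width is the sum of its processors' speeds, and the column is then cut into rectangles whose heights are those speeds divided by the width, so each area w×h equals the processor's speed:

```python
def column_based_partition(speed_columns):
    """speed_columns: list of columns, each a list of relative speeds
    (all speeds together sum to 1). Returns, per column, its width and
    the heights of its rectangles."""
    partition = []
    for col in speed_columns:
        w = sum(col)                    # width of the vertical slice
        heights = [s / w for s in col]  # heights sum to 1 within the slice
        partition.append((w, heights))
    return partition
```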
[Figure: a column-based partitioning among nine processors P11–P33 arranged in three columns of three]
Grid-based partitioning problem
• Grid-based partitioning problem
– The heterogeneous processors form a two-dimensional grid
[Figure: a grid-based partitioning of the unit square among twelve processors P11–P43]
– There exist p and q such that any vertical line crossing the unit square will pass through exactly p rectangles and any horizontal line crossing the square will pass through exactly q rectangles
Grid-based partitioning problem (ctd)
• Proposition. Let a grid-based partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p=r×c. Then, the sum of half-perimeters of the rectangles of the partitioning will be equal to (r+c).
– The shape r×c of the processor grid formed by any optimal grid-based partitioning will minimize (r+c).
– The sum of half-perimeters of the rectangles of the optimal grid-based partitioning does not depend on the mapping of the processors onto the nodes of the grid.
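The proposition is easy to verify numerically (a sketch of mine): for any c column widths summing to 1 and r row heights summing to 1, the sum over all r×c rectangles of (wj + hi) equals r·Σwj + c·Σhi = r + c, whatever the individual widths and heights are:

```python
def sum_half_perimeters_grid(widths, heights):
    """Sum of (w_j + h_i) over all rectangles of a grid-based partition
    with the given column widths and row heights."""
    return sum(w + h for w in widths for h in heights)

# Any widths/heights summing to 1 give r + c (here r = 3 rows, c = 2 columns):
widths, heights = [0.7, 0.3], [0.2, 0.3, 0.5]
```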
Grid-based partitioning problem (ctd)
• Algorithm 2: Optimal grid-based partitioning of a unit square between p heterogeneous processors:
– Step 1: Find the optimal shape r×c of the processor grid such that p=r×c and (r+c) is minimal.
– Step 2: Map the processors onto the nodes of the grid.
– Step 3: Apply Algorithm 1 (the optimal column-based partitioning of the unit square) to this r×c arrangement of the p heterogeneous processors.
• The correctness of Algorithm 2 is obvious.
• Algorithm 2 returns a column-based partitioning.
Grid-based partitioning problem (ctd)
[Figure: the grid-based partitioning among processors P11–P43, redrawn as columns]
• The optimal grid-based partitioning can be seen as a restricted form of column-based partitioning.
Grid-based partitioning problem (ctd)
• Algorithm 3: Finding r and c such that p=r×c and (r+c) is minimal:

r = floor(sqrt(p));
while (r > 1)
    if ((p mod r) == 0)
        goto stop;
    else
        r--;
stop: c = p / r;
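The pseudocode translates directly into runnable form (a sketch; the loop scans down from floor(sqrt(p)) to the largest divisor of p not exceeding sqrt(p), which makes r+c minimal):

```python
import math

def optimal_grid_shape(p):
    """Find r, c with p = r * c and r + c minimal (r <= c)."""
    r = math.isqrt(p)          # start at floor(sqrt(p))
    while r > 1 and p % r != 0:
        r -= 1                 # scan down to the nearest divisor
    return r, p // r
```

For a prime p the loop runs down to r = 1, so the only grid shape is the degenerate 1×p row.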
Grid-based partitioning problem (ctd)
• Proposition. Algorithm 3 is correct.
• Proposition. The complexity of Algorithm 2 can be bounded by O(p^(3/2)).
Experimental results
Specifications of the sixteen Linux computers on which the matrix multiplication is executed:

Computer   CPU / Main memory / Cache     Absolute speed
           (GHz / MBytes / KBytes)       (MFlops)
hcl01      3.60 / 256  / 2048            2171
hcl02      3.00 / 256  / 2048            2099
hcl03      3.40 / 1024 / 1024            1761
hcl04      3.40 / 1024 / 1024            1787
hcl05      3.40 / 256  / 1024            1735
hcl06      3.40 / 256  / 1024            1653
hcl07      3.40 / 256  / 1024            1879
hcl08      3.40 / 256  / 1024            1635
hcl09      1.00 / 1024 / 1024            3004
hcl10      1.00 / 1024 / 1024            2194
hcl11      3.00 / 512  / 1024            4580
hcl12      3.40 / 512  / 1024            1762
hcl13      3.40 / 1024 / 1024            4934
hcl14      2.80 / 1024 / 1024            4096
hcl15      3.60 / 1024 / 16              2697
hcl16      3.60 / 1024 / 2048            4840
Experimental results (ctd)
Matrix size (n)   Processor grid (r×c)   Total execution time (sec)   Communication time (sec)
5120              1×16                   348                          180
5120              2×8                    211                          61
5120              4×4                    179                          23
6144              1×16                   537                          258
6144              2×8                    335                          95
6144              4×4                    276                          30
7168              1×16                   770                          352
7168              2×8                    514                          150
7168              4×4                    420                          51
8192              1×16                   1264                         464
8192              2×8                    709                          181
8192              4×4                    582                          50
9216              1×16                   1444                         590
9216              2×8                    969                          233
9216              4×4                    828                          93
10240             1×16                   1916                         727
10240             2×8                    1292                         297
10240             4×4                    1100                         110
Application to Cartesian partitioning
• Cartesian partitioning:
– A column-based partitioning, the rectangles of which also make up rows.
[Figure: a Cartesian partitioning among twelve processors P11–P43 arranged in a 4×3 grid]
Application to Cartesian partitioning (ctd)
• Cartesian partitioning
– Plays an important role in the design of heterogeneous parallel algorithms (e.g., in scalable algorithms)
• The Cartesian partitioning problem
– Very difficult
» There may be no Cartesian partitioning perfectly balancing the load of the processors
Application to Cartesian partitioning (ctd)
• The Cartesian partitioning problem in general form
– Given p processors, the speed of each of which is characterized by a given positive constant,
– Find a Cartesian partitioning of a unit square such that
» There is a 1-to-1 mapping between the rectangles and the processors
» The partitioning minimizes max over i,j of (hi × wj / sij)
Application to Cartesian partitioning (ctd)
• The Cartesian partitioning problem
– Not even studied in the general form.
» If the shape r×c is given, it is proved NP-complete.
» It is unclear if there exists a polynomial algorithm when both the shape and the processors' mapping are given
» There exists an optimal Cartesian partitioning with processors arranged in a non-increasing order of speed
Application to Cartesian partitioning (ctd)
• Approximate solutions of the Cartesian partitioning problem are based on the following observation
– Let the speed matrix {sij} of the given r×c processor arrangement be of rank one
– Then there exists a Cartesian partitioning perfectly balancing the load of the processors:
hi × wj / sij = const
Application to Cartesian partitioning (ctd)
• Algorithm 4: Finding an approximate solution of the simplified Cartesian problem (when only the shape r×c is given):
– Step 1: Arrange the processors in a non-increasing order of speed
– Step 2: For this arrangement, let
hi = Σj sij / Σi Σj sij  and  wj = Σi sij / Σi Σj sij
be the parameters of the partitioning
– Step 3: Calculate the areas hi×wj of the rectangles of this partitioning
Application to Cartesian partitioning (ctd)
• Algorithm 5: Finding an approximate solution of the simplified Cartesian problem when only the shape r×c is given (ctd):
– Step 4: Re-arrange the processors so that
for all i, j, k, l:  sij ≥ skl  =>  hi×wj ≥ hk×wl
– Step 5: If Step 4 does not change the arrangement of the processors, then return the current partitioning and stop; else go to Step 2
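Steps 1–5 can be sketched in code (my own sketch, not from the slides; the function names are mine): row heights and column widths are the normalized row and column sums of the speed matrix, and processors are re-assigned so that faster processors own larger rectangles, iterating until the assignment is stable:

```python
def cartesian_parameters(S):
    """Step 2: row sums give heights, column sums give widths, both
    normalized by the total speed."""
    total = sum(sum(row) for row in S)
    h = [sum(row) / total for row in S]
    w = [sum(S[i][j] for i in range(len(S))) / total for j in range(len(S[0]))]
    return h, w

def approx_cartesian(speeds, r, c, max_iter=100):
    """Algorithm 4/5 sketch: speeds is a flat list of p = r*c speeds.
    Returns the arrangement (grid position -> processor index, row-major)
    plus the row heights h and column widths w."""
    by_speed = sorted(range(r * c), reverse=True, key=lambda k: speeds[k])
    order = by_speed[:]                               # Step 1: fastest first
    for _ in range(max_iter):
        S = [[speeds[order[i * c + j]] for j in range(c)] for i in range(r)]
        h, w = cartesian_parameters(S)                # Step 2
        areas = [h[i] * w[j] for i in range(r) for j in range(c)]  # Step 3
        # Step 4: the position with the m-th largest area gets the m-th fastest processor
        by_area = sorted(range(r * c), reverse=True, key=lambda k: areas[k])
        new_order = [0] * (r * c)
        for m, pos in enumerate(by_area):
            new_order[pos] = by_speed[m]
        if new_order == order:                        # Step 5: stable -> done
            return order, h, w
        order = new_order
    return order, h, w
```

For a rank-one speed matrix the very first arrangement is already stable and the resulting partitioning balances the load perfectly (hi×wj/sij is constant), in line with the observation on the previous slide.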
Application to Cartesian partitioning (ctd)
• Proposition. Let a Cartesian partitioning of the unit square between p heterogeneous processors form c columns, each of which consists of r processors, p=r×c. Then, the sum of half-perimeters of the rectangles of the partitioning will be (r+c).
– The proof is a trivial exercise
– Minimization of the communication cost does not depend on the speeds of the processors but only on their number
– => minimization of communication cost and minimization of computation cost are two independent problems
– Any Cartesian partitioning minimizing (r+c) will optimize communication cost
Application to Cartesian partitioning (ctd)
• Now we can extend Algorithm 5
– By adding a 0-th step finding the optimal shape r×c
– The modified algorithm returns an approximate solution of the extended Cartesian problem
» Aimed at minimizing both computation and communication cost
• The modified Algorithm 5 will return an optimal solution if the speed matrix for the arrangement is a rank-one matrix